Wednesday, November 4, 2015

Understanding Correlation

Correlation Overview

So far, we've talked about Means, Standard Deviation, z-Score, Chi-square, t-tests, and ANOVA.
Remember that, depending on the type of measurement for the independent variable (IV) and the dependent variable (DV), we use certain tests.

Specifically:

If the IV is nominal and the DV is nominal, we use chi-square.
If the IV is nominal and the DV is interval/ratio, we use a t-test.
If the IV is interval/ratio and the DV is interval/ratio, we use correlation.

Correlation is the single most common statistical test in mass media research. 


Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn't perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5'' is less than the average weight of people 5'6'', and their average weight is less than that of people 5'7'', etc. Correlation can tell you just how much of the variation in people's weights is related to their heights.


Although this correlation is fairly obvious, your data may contain unsuspected correlations. You may also suspect there are correlations, but not know which are the strongest. An intelligent correlation analysis can lead to a greater understanding of your data.
Like all statistical techniques, correlation is only appropriate for certain kinds of data. 

Correlation works for quantifiable data in which the numbers are meaningful, usually quantities of some sort. It cannot be used for purely categorical data, such as gender, brands purchased, or favorite color.


Rating Scales
Rating scales are a controversial middle case. The numbers in rating scales have meaning, but that meaning isn't very precise. They are not like quantities. With a quantity (such as dollars), the difference between 1 and 2 is exactly the same as between 2 and 3. With a rating scale, that isn't really the case. You can be sure that your respondents think a rating of 2 is between a rating of 1 and a rating of 3, but you cannot be sure they think it is exactly halfway between. This is especially true if you labeled the mid-points of your scale (you cannot assume "good" is exactly halfway between "excellent" and "fair").

Most statisticians say you cannot use correlations with rating scales, because the mathematics of the technique assume the differences between numbers are exactly equal. Nevertheless, many survey researchers do use correlations with rating scales, because the results usually reflect the real world. The position of this class is that you can use correlations with rating scales, but you should do so with care. When working with quantities, correlations provide precise measurements. When working with rating scales, correlations provide general indications.
The main result of a correlation is called the correlation coefficient (or "r"). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.
While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them easier to understand. The square of the coefficient (r squared) equals the proportion of the variation in one variable that is related to the variation in the other; multiply it by 100 to read it as a percentage. An r of .5 means 25% of the variation is related (.5 squared = .25). An r of .7 means 49% of the variation is related (.7 squared = .49).
A correlation report can also show a second result of each test - statistical significance. In this case, the significance level will tell you how likely it is that the correlations reported may be due to chance in the form of random sampling error. If you are working with small sample sizes, choose a report format that includes the significance level. This format also reports the sample size.
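As a quick illustration (a minimal sketch using made-up height and weight numbers rather than real survey data), SciPy's pearsonr reports both the coefficient and the significance level described above:

```python
# A minimal sketch, assuming hypothetical height/weight data (not real survey results).
from scipy.stats import pearsonr

heights_in = [63, 64, 65, 66, 67, 68, 69, 70, 71, 72]           # inches (hypothetical)
weights_lb = [120, 131, 128, 142, 150, 149, 164, 170, 168, 181]  # pounds (hypothetical)

r, p = pearsonr(heights_in, weights_lb)
print(f"r = {r:.2f}")             # the correlation coefficient, between -1 and +1
print(f"r squared = {r**2:.2f}")  # share of the variation that is related
print(f"p = {p:.4f}")             # how likely the result is due to sampling error
```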
A key thing to remember when working with correlations is never to assume a correlation means that a change in one variable causes a change in another. Sales of personal computers and athletic shoes have both risen strongly in the last several years and there is a high correlation between them, but you cannot assume that buying computers causes people to buy athletic shoes (or vice versa).
The second caveat is that the Pearson correlation technique works best with linear relationships: as one variable gets larger, the other gets larger (or smaller) in direct proportion. It does not work well with curvilinear relationships (in which the relationship does not follow a straight line). An example of a curvilinear relationship is age and health care. They are related, but the relationship doesn't follow a straight line. Young children and older people both tend to use much more health care than teenagers or young adults.
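To see this caveat in action, here is a small sketch with invented, U-shaped data standing in for the age/health-care pattern; Pearson r comes out near zero even though the two variables are clearly related:

```python
# A sketch with invented U-shaped data: Pearson r misses the curvilinear relationship.
import numpy as np
from scipy.stats import pearsonr

age = np.arange(0, 81, 5)             # hypothetical ages, 0 through 80
care_use = (age - 40) ** 2 / 100 + 2  # high for the very young and the very old

r, _ = pearsonr(age, care_use)
print(f"r = {r:.2f}")  # roughly 0, despite the obvious relationship
```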

If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an "inverse" correlation).
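For a concrete (entirely made-up) example of an inverse correlation: as hours of TV watched go up, quiz scores go down, so r comes out negative:

```python
# A sketch with made-up numbers: more TV hours, lower quiz scores -> negative r.
from scipy.stats import pearsonr

tv_hours = [1, 2, 3, 4, 5, 6]
quiz_scores = [9, 8, 8, 6, 5, 4]

r, _ = pearsonr(tv_hours, quiz_scores)
print(f"r = {r:.2f}")  # negative, i.e. an inverse relationship
```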

T-test Overview

T-test explained

Often in social science situations, we want to see whether there is a statistically significant difference between two groups. To determine whether the differences are significant, we use a simple inferential statistical test called the t-test.

Here's how we solve for t:

t = (X1 - X2) / Sm

where X1 = the mean (average) of the first group, X2 = the mean (average) of the second group, and Sm = the standard error of the difference between the two means (sometimes written with the subscript M1-M2).
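As a minimal sketch of that formula (not how SPSS computes it internally), the whole calculation is just the difference between the two means divided by the standard error of that difference:

```python
# A minimal sketch of the t formula; se_diff stands for Sm, the standard error
# of the difference between the two group means.
def t_value(mean1, mean2, se_diff):
    """Independent-samples t: difference between the group means over its standard error."""
    return (mean1 - mean2) / se_diff

# e.g. t_value(2, 6, 1.05) gives about -3.81 (the example worked out below)
```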

So, for example, let's say we have two groups of students in an experiment where we're trying to test whether or not kids can learn their multiplication facts better via a TV show than in school.

Let's say we bring in 20 kids and we randomly assign them to two groups. The first group of 10 learns their multiplication facts via TV show, and the second group of 10 learns via the traditional classroom approach. Notice that this is an example of an Independent Samples t-test.

So we have something like this:

# of TV kids = 10
# of class kids = 10
Overall N = 20

We examine their scores and we see that the TV kids averaged a 2/10 on a post-test quiz measuring multiplication facts and the traditional class kids averaged a 6/10 on the same test.

Normally, you'd have to solve for Sm (the standard error of the difference between the means), but we haven't covered that in class, so don't worry about it. Let's say that Sm = 1.05.
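Purely as a hedged aside, since the class skips this step: one common unpooled form of that standard error is the square root of s1²/n1 + s2²/n2. The standard deviation in the sketch below is a hypothetical value chosen just so the result lands near the 1.05 we're using:

```python
# A hedged sketch of one common (unpooled) standard-error formula; the 2.35
# standard deviation is hypothetical, not something given in the example.
from math import sqrt

def se_of_difference(s1, n1, s2, n2):
    """Standard error of the difference between two independent group means."""
    return sqrt(s1**2 / n1 + s2**2 / n2)

print(round(se_of_difference(2.35, 10, 2.35, 10), 2))  # about 1.05
```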

Ok, so here's what we do--

We solve for t like this:

t = (2 - 6) / 1.05

t = -4 / 1.05

t = -3.81, which we report as 3.81 (we always take the absolute value)

This value, in and of itself, tells us nothing.

Just like chi-square, however, we have to be concerned about degrees of freedom (df).

The df for a t-test is simple-- you take the N for each group and subtract one. So, for the first group, the df is 9 (10-1), and for the second group it's also 9 (10-1).

9 + 9 = 18, so the df = 18.

So, now armed with this info, we can check a t-value chart (either the one in the book, or a t-table we can easily find online), and we see that in order to be significant with 18 df (at the .05 level), the t-value needs to be greater than 2.10. Remember that when you run the test in SPSS, it automatically gives you the p value, so you can determine whether the mean difference is significant.
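If you'd rather check the table value in code, here is a sketch using SciPy's t distribution (two-tailed, .05 level, 18 df), along with our obtained t:

```python
# A sketch of checking the table value with SciPy: two-tailed test, .05 level, 18 df.
from scipy.stats import t

df = 18
critical = t.ppf(1 - 0.05 / 2, df)     # two-tailed critical value
print(f"critical t = {critical:.2f}")  # about 2.10, matching the table

t_obtained = abs((2 - 6) / 1.05)       # the worked example above
print(t_obtained > critical)           # True -> significant at the .05 level
```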

Since our t-value of 3.81 is higher than 2.10, we say that there is a significant difference between the groups.

We then look back at our original data and we see that the traditional kids scored, on average, much better than the TV kids, so we conclude that it's better to use the traditional method.