Wednesday, October 28, 2015

Chi-Square Explained

Chi-square overview

When we talk about inferential statistics, we're simply asking whether the results we obtained were due to chance or not. (Inferential means that, if our sample is representative, we can INFER from our sample that the results hold for the entire population.)

If the results are not due to chance, we suggest a relationship between the variables. If they are due to chance, we cannot make such a claim.

Think of inferential stats as a light switch-- it's either on or off. In the social sciences, if the significance is .05 or lower (that is, we use a 95% confidence level), then we say the switch is "on" and the results are "significant"-- meaning that results like these would show up by chance no more than 5% of the time if the variables were unrelated.

If p (the probability that results like these would show up by chance) is HIGHER than .05, then we say there is NO significance-- which means we can't argue that the variables are related. Although this method of testing significance is disputed, it is a generally accepted practice.

Chi-square is a simple statistical test to use when we are testing the relationship between two categorical variables.

Put simply, chi-square is a sum across every cell of the table. For each cell, we take the observed frequency minus the expected frequency, square it, and divide by the expected frequency.

The observed frequency is simply the number reported. The expected frequency is what you'd expect if it were completely by chance.
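In symbols, that description can be written as follows. This is just a sketch: O_i and E_i are labels introduced here for the observed and expected frequency in cell i, and the sum runs over every cell in the table.

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```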

It's best explained with an example.

Suppose we asked 97 people about their political affiliations (let's assume it's a random sample) and we got this:

Gender          Republican    Democrat    Row Total
Male                    23          17           40
Female                  20          37           57
Column Total            43          54           97


Our hypothesis is:

H1: Women are more likely to be affiliated with the Democratic party than men.

By looking at the raw data, it's difficult to say, with certainty, that this is the case, so we test the hypothesis using chi-square.

Our first order of business is to find the "expected" frequency.

The "expected" frequency is R x C / N (where R is the ROW total, C is the COLUMN total, and N is the overall number).

So the ROW total for men is 40.
The ROW total for women is 57.

The Column total for Republicans is 43.
The Column total for Democrats is 54.

The overall N is 97.

The expected frequency for male Republicans, based on chance, is 40 x 43 / 97. That works out to 1,720 / 97 = 17.73.
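Here is a minimal Python sketch of the R x C / N rule applied to all four cells, not just the first one. The dictionary layout and the function name expected_frequencies are assumptions made for illustration; the counts come straight from the table above.

```python
# Observed counts from the table above: rows are gender, columns are party.
observed = {
    "Male": {"Republican": 23, "Democrat": 17},
    "Female": {"Republican": 20, "Democrat": 37},
}


def expected_frequencies(table):
    """Return R x C / N for every cell of a two-way table of counts."""
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    n = sum(row_totals.values())
    return {
        row: {col: row_totals[row] * col_totals[col] / n for col in cells}
        for row, cells in table.items()
    }


expected = expected_frequencies(observed)
for row, cells in expected.items():
    for col, value in cells.items():
        print(f"{row:6} {col:10} {value:.2f}")
# Male   Republican 17.73
# Male   Democrat   22.27
# Female Republican 25.27
# Female Democrat   31.73
```

Notice that the expected values in each row and column add back up to the same row and column totals as the observed counts.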

Ok, so now we know that the observed frequency for men who are Republicans is 23, and the expected frequency is 17.73. This gives us a difference of 5.27. We square this value and get 27.77.

Once we have that value, we divide by the expected value and get this-- 27.77/17.73 = 1.57 (we always round to the nearest hundredth).
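The same arithmetic for the male Republican cell can be sketched in Python. One assumption to flag: the code keeps full precision instead of rounding at each step, so the squared difference comes out near 27.75 rather than 27.77. The final value for the cell still rounds to 1.57.

```python
# Male Republican cell, using the totals from the table above.
observed = 23
expected = 40 * 43 / 97            # R x C / N

difference = observed - expected
squared = difference ** 2
contribution = squared / expected  # this cell's share of chi-square

print(round(expected, 2), round(difference, 2),
      round(squared, 2), round(contribution, 2))
# 17.73 5.27 27.75 1.57
```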

Remember, though, chi-square is the SUM OF, so we have to compute it for each cell.

So, we repeat the process for each "cell" and then add up the totals.
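To see the whole process at once, here is a rough Python sketch that repeats the calculation for every cell and adds up the results. The nested-list layout and the variable names are assumptions for illustration.

```python
# Observed counts: rows are Male, Female; columns are Republican, Democrat.
observed = [
    [23, 17],
    [20, 37],
]

row_totals = [sum(row) for row in observed]        # [40, 57]
col_totals = [sum(col) for col in zip(*observed)]  # [43, 54]
n = sum(row_totals)                                # 97

chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        cell = (count - expected) ** 2 / expected
        print(f"row {i}, column {j}: {cell:.2f}")
        chi_square += cell

print(f"chi-square = {chi_square:.2f}")  # chi-square = 4.78
```

If you round each cell to the nearest hundredth before adding (1.57, 1.25, 1.10, 0.87), the sum is 4.79 instead of 4.78. That small gap is only rounding.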

Once we have the sum of the chi-squares, we check with a chi-square chart to see if it's significant at the .05 level. You can check here-- chi-square chart.

You'll note something called "degrees of freedom," or "df." The df tells us which line to look at on the chart. The easiest way to remember df is this-- it's (R-1) x (C-1), where R is the number of rows and C is the number of columns (not the row and column totals we used earlier). In this case, we have 2 rows and 2 columns, which gives us a df of 1 because (R-1) = (2-1), and (C-1) = (2-1), and 1 x 1 = 1.
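The df rule and the chart lookup can be sketched in a few lines of Python. The function name and the small dictionary are assumptions for illustration; the numbers in it are the standard .05 critical values for df 1 through 4 as they appear on a chi-square chart.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df for a two-way table: (R - 1) x (C - 1)."""
    return (n_rows - 1) * (n_cols - 1)


# Critical values at the .05 level, copied from a standard chi-square chart.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

df = degrees_of_freedom(2, 2)   # the gender-by-party table is 2 x 2
print(df, CRITICAL_05[df])      # 1 3.841
```

A chi-square sum at or above the critical value on that line of the chart is significant at the .05 level.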

Also, remember that we use the .05 level of significance.
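Putting the pieces together, here is a rough Python sketch that goes from the table to a yes/no answer at the .05 level. Instead of a chart, it computes p directly, using the fact that for df = 1 the chi-square p-value equals erfc(sqrt(chi-square / 2)). That shortcut only holds for a 2 x 2 table like this one, and the function names are made up for illustration.

```python
import math


def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (count - expected) ** 2 / expected
    return total


def p_value_df1(chi_square):
    """p-value for df = 1 only (assumption: a 2 x 2 table)."""
    return math.erfc(math.sqrt(chi_square / 2))


# Rows are Male, Female; columns are Republican, Democrat.
table = [[23, 17], [20, 37]]
statistic = chi_square_statistic(table)
p = p_value_df1(statistic)
print(f"chi-square = {statistic:.2f}, p = {p:.3f}, significant = {p <= 0.05}")
```

For this table the statistic is about 4.78, which is above the 3.84 cutoff for df = 1, so p lands below .05 and the switch is "on."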
