This lesson introduces the chi-square test.
After completing this module, you will be able to:
Reading: Schumacker, Chapter 11.
So far we have examined a sequence of progressively more complex tests:
Now we move on to examining the relationship between multiple variables. For example:
Is there a connection between income and support for corporate social responsibility?
Is there a relationship between gender and support for capital punishment?
Is there a relationship between an individual’s neighborhood and confidence in a scientific community?
…
Are these relationships causal?
It is often difficult or impossible to create an experiment to test causality directly, so we often rely on observational data.
Over the next few weeks we will move from testing for independence, to correlation, to causation.
Basic question: Are two variables independent of each other? For example, are gender and political party independent of each other, or are they somehow correlated or dependent?
Answer: Chi-square test:
Similar to the dependent-sample t test in that we have multiple measures of the same thing.
In our example we have two measures for each person: their gender, and their party affiliation.
We are now interested in whether two different aspects of the same person or object or (more generally) “observation” are connected to each other.
We are dealing with categorical data here, as opposed to numerical.
Our hypotheses are:
\(H_0\): The variables are independent of each other.
\(H_1\): The variables are not independent (i.e., they are dependent).
What do we mean by dependent? Basically we mean the same thing as we meant in the probability section: A and B are independent if \(P(A \& B) = P(A)P(B)\).
What does this mean?
Knowing that B has occurred gives you no information about A, they are independent.
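To make this concrete, here is a quick simulation sketch (the 0.6 and 0.7 probabilities are arbitrary choices for illustration): if A and B are generated independently, the observed share of joint occurrences should be close to the product of the marginal probabilities.

```r
# Simulate two independent binary variables and check the product rule
set.seed(1)
A <- rbinom(100000, 1, 0.6)    # P(A) = 0.6
B <- rbinom(100000, 1, 0.7)    # P(B) = 0.7
mean(A == 1 & B == 1)          # should be close to 0.6 * 0.7 = 0.42
```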
Research Question: Are gender and political affiliation dependent?
Our datatable:
Risk of jumping to erroneous conclusions:
If we observed an equal number of people in each cell, we might conclude there is no relationship between gender and politics.
But: maybe there are more Democrats than Republicans in our survey?
If there are more Democrats than Republicans, we would expect to see equal numbers of male and female Democrats, but more of each than male or female Republicans.
But: maybe there are more females than males in our survey?
Then we would expect to see more female Democrats than male Democrats, and the same for Republicans, but also more female Democrats than female Republicans, and the same for males.
Let’s say we have the following data summary statistics:
60% women and 40% men;
and 70% Democrats vs 30% Republicans.
If gender and political affiliation are independent:
then we would expect the proportion of female Democrats to be \(0.60 \times 0.70\), male Democrats \(0.40 \times 0.70\), female Republicans \(0.60 \times 0.30\), and male Republicans \(0.40 \times 0.30\). (And to go from proportions to total numbers, we would multiply by the total number of people in our survey.)
Knowing gender does not give you information about knowing political affiliation (remember the module about probabilities: knowing that B has occurred gives you no information about A = independence).
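As a sketch in R, using the hypothetical 60/40 and 70/30 marginals from above: under independence, the cell proportions are just the outer product of the marginal proportions.

```r
# Marginal proportions from the hypothetical example
gender <- c(female = 0.60, male = 0.40)
party  <- c(dem = 0.70, rep = 0.30)

# Under independence, each cell proportion = product of its marginals
expected_prop <- outer(gender, party)
expected_prop   # female row: 0.42, 0.18; male row: 0.28, 0.12
```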
If gender and political affiliation are not independent, then knowing one does give us information about the other.
Back to our politics datatable:
Percent female: 1511 / 2771 = 0.545
Percent male: 1260 / 2771 = 0.455
Percent Dem: 959 / 2771 = 0.346
Percent Indep: 991 / 2771 = 0.358
Percent Rep: 821 / 2771 = 0.296
If gender and party ID are independent:
If being Democrat and being female are independent of each other, then
\[p(Dem \& F) = p(Dem) \times p(F) = 0.346 \times 0.545 \approx 0.189\]
Thus the total number of female Democrats we would expect to see (the total in that cell) is \(0.189 \times 2771 \approx 522.9\) (keeping full precision in the intermediate steps).
We can do the exact same calculation for each cell, and the expected totals we get are:
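The same calculation can be sketched in R, with the observed counts entered by hand from the table:

```r
# Observed counts from the gender x party table
obs <- rbind(female = c(dem = 573, indep = 516, rep = 422),
             male   = c(dem = 386, indep = 475, rep = 399))

# Expected count in each cell = (row total)(column total) / overall total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)   # female/dem cell is 522.9, as computed above
```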
But how different is enough to show it’s not just random?
Similar to the t test or the F test:
We need a test statistic that summarizes how far off our counts in each cell are from what we would expect if the two variables are independent.
When we have our test statistic, we can determine how unlikely it would be to get a number of that size.
Our test statistic, called the chi-squared (or sometimes chi-square) statistic, is:
\[\chi^{2}= \sum \frac{(f_{o}-f_{e})^{2}}{f_{e}}\]
Where \(f_{o} =\) observed number in a cell and \(f_{e} =\) expected number in a cell, and the summation is over all the cells.
In words:
For each cell, we take the difference between the observed count and the expected count.
We square that difference,
and take it as a fraction of the expected count;
and then we just add them all up.
So in this case, our statistic is:
\(\chi^{2}= \sum \frac{(f_{o}-f_{e})^{2}}{f_{e}} = \frac{(573-522.9)^{2}}{522.9} + \frac{(516-540.4)^{2}}{540.4} + ... + \frac{(399 - 373.3)^{2} }{373.3} = 16.2\)
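The same sum can be computed directly in R, reproducing the 16.2 (observed counts entered by hand from the table):

```r
obs <- rbind(female = c(573, 516, 422),
             male   = c(386, 475, 399))
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Sum of (observed - expected)^2 / expected over all six cells
chisq_stat <- sum((obs - expected)^2 / expected)
round(chisq_stat, 1)   # 16.2
```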
Is this big enough?
The chi-square is just another distribution.
Whereas the normal and t distributions, for instance, deal with sample statistics such as means, the chi-square distribution characterizes the sum of squared standard normal variables.
Looking back at how we calculated our chi-square statistic, there are squared terms in the numerator, and of course the denominator is also positive (being a count), so the chi-square is always positive, and thus can’t be normal.
Look at what happens if we square a bunch of standard normal samples and add them up. We get a distribution that looks like the \(\chi^{2}\) (here with 3 degrees of freedom, since we are summing three squared standard normals):
z1 <- rnorm(1000)
z2 <- rnorm(1000)
z3 <- rnorm(1000)
zsq_tot <- z1^2 + z2^2 + z3^2
hist(zsq_tot, breaks=30)
As usual, the chi-square distribution also has a shape parameter that is determined by the degrees of freedom.
The degrees of freedom depend not on the number of samples (as with the t distribution) but on the number of rows and columns in the table.
Degrees of freedom:
\[df = (r-1)(c-1),\]
where \(r =\) number of rows and \(c =\) number of columns.
The shape of the \(\chi^{2}\) distribution varies with the df:
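A quick sketch of those shapes can be drawn with the built-in density function `dchisq` (the df values 1, 2, and 5 here are arbitrary choices for illustration):

```r
# Plot the chi-square density for several degrees of freedom
x <- seq(0.01, 15, length.out = 300)
plot(x, dchisq(x, df = 1), type = "l", ylim = c(0, 0.5), ylab = "density")
lines(x, dchisq(x, df = 2), lty = 2)
lines(x, dchisq(x, df = 5), lty = 3)
legend("topright", legend = paste("df =", c(1, 2, 5)), lty = 1:3)
```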
Our test statistic is a draw from the \(\chi^{2}\) distribution (with the appropriate degrees of freedom), and the farther out it is, the less likely it is.
The chi-square test is fundamentally one-tailed: we are only interested in whether the statistic is larger than we would expect if the variables were independent, and it can't be negative because the differences are squared.
If it falls into the rejection region (e.g., the region of the right tail of the distribution that accounts for less than 0.05 of the total), then we know that the statistic was unlikely to be that large by chance alone.
To return to our example, the df is \((r-1)(c-1) = (2-1)(3-1) = 2\), and our test statistic was 16.2.
Our 95% threshold value is thus
qchisq(.95, df=2)
[1] 5.991465
Our test statistic is clearly much larger (16.2 > 5.99), so we reject the null that these two variables (gender and political affiliation) are independent.
We could similarly calculate the p-value directly and likewise reject the null:
1-pchisq(16.2, df=2)
[1] 0.0003035391
Although again, with modern computation we don't really need to use tables any more, we could also determine the test threshold value using the \(\chi^{2}\) table:
As usual, we find the \(df\) on the right, and look for the \(\alpha\) level along the top. E.g., for an \(\alpha\) of 0.05, we look under \(\chi^{2}_{0.050}\), and once again we see our threshold value of 5.991.
Our conclusion: Gender and party are not independent.
Can we say anything more?
Let’s go back to our table showing the expected vs the observed frequencies, and calculate the signed score for each cell, which is just
\[\frac{(f_{o}-f_{e})^{2}}{f_{e}}\]
with the sign of the difference \(f_{o}-f_{e}\) attached, so we can see whether each cell is above or below its expected value.
Our datatable:
Which cells are over or under their expected values?
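A sketch of this in R, attaching the sign of the difference to each cell's contribution (observed counts entered by hand from the table):

```r
obs <- rbind(female = c(dem = 573, indep = 516, rep = 422),
             male   = c(dem = 386, indep = 475, rep = 399))
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Each cell's chi-square contribution, keeping the sign of (observed - expected)
signed <- sign(obs - expected) * (obs - expected)^2 / expected
round(signed, 2)   # positive: more than expected; negative: fewer than expected
```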
Show me how to calculate this for each cell.

To summarize, the steps of the chi-square test:

1. \(H_{0}\): variables are independent; \(H_{a}\): variables are not independent.
2. Calculate \(f_{e}\) for each cell. Shortcut: \(f_{e} = \frac{\textrm{(row total)(column total)}}{\textrm{overall total}}\)
3. Calculate \(\chi^{2} = \sum \frac{(f_{o}-f_{e})^{2}}{f_{e}}\)
4. Calculate \(df = (r-1)(c-1)\)
5. Calculate the threshold value and reject the null if the test statistic (step 3) is greater than it; or calculate the p-value directly and reject the null if it is less than your chosen \(\alpha\).
sexparty <- data.frame(dem=c(573,386),indep=c(516,475),rep=c(422,399),row.names=c("female","male"))
sexparty
dem indep rep
female 573 516 422
male 386 475 399
chisq.test(sexparty)
Pearson's Chi-squared test
data: sexparty
X-squared = 16.202, df = 2, p-value = 0.0003033
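Note that the object returned by `chisq.test` also stores the expected counts and the Pearson residuals, which give the same per-cell over/under picture:

```r
sexparty <- data.frame(dem = c(573, 386), indep = c(516, 475), rep = c(422, 399),
                       row.names = c("female", "male"))
res <- chisq.test(sexparty)

round(res$expected, 1)    # expected counts under independence
round(res$residuals, 2)   # Pearson residuals: (observed - expected) / sqrt(expected)
```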
Calculating in R with the GSS data:
From http://gss.norc.org/About-The-GSS: “For more than four decades, the General Social Survey (GSS) has studied the growing complexity of American society. It is the only full-probability, personal-interview survey designed to monitor changes in both social characteristics and attitudes currently being conducted in the United States.”
For this example, we want to know if the attitude towards the future is independent of the number of hours spent on the internet.
First, let’s load the GSS database
setwd("/Users/econphd/Dropbox/neu/2021/Summer/INSH6500/Lectures/GSS")
gss <- readRDS("GSS2016.Rds")
Second, for this test we are going to extract two variables from the database: (1) LOTR3, which contains the answers to the survey question “I’m always optimistic about my future”, and (2) INTWKDYH, which contains the answers to the survey question “How many minutes or hours do you spend actively using the Internet or web-enabled applications/APPS on a typical weekday?”. The answers are on a scale of 1-7; any value outside of that range is treated as NA in the analysis.
hope <- gss$LOTR3
internetUsage <- gss$INTWKDYH
Third, clean the data:
hope <- replace(hope, hope < 1 | hope > 7, NA)
internetUsage <- replace(internetUsage, internetUsage < 1 | internetUsage > 7, NA)
# Converting hope to numeric and internetUsage to factor
hope <- as.numeric(hope)
internetUsage <- as.factor(internetUsage)
Fourth, join the two vectors in a data frame and make a frequency table:
data.ex2 <- data.frame(hope, internetUsage)
tabularData <- table(data.ex2$hope, data.ex2$internetUsage)
data.ex2 <- as.data.frame.matrix(tabularData)
colnames(data.ex2) <- paste("internet_group", 1:ncol(tabularData), sep = "_")
rownames(data.ex2) <- paste("hope_group", 1:nrow(tabularData), sep = "_")
chisq.test(data.ex2)
Pearson's Chi-squared test
data: data.ex2
X-squared = 17.556, df = 24, p-value = 0.824
Then, at \(\alpha = 0.05\), we cannot reject the null hypothesis that attitude towards the future and internet usage are independent variables.