This lesson introduces the chi-square test.
After completing this module, you will be able to:
Reading: Schumacker, Chapter 11.
So far we have examined a sequence of progressively more complex tests:
Now we move on to examining the relationship between multiple variables. For example:
Is there a connection between income and support for corporate social responsibility?
Is there a relationship between gender and support for capital punishment?
Is there a relationship between an individual’s neighborhood and confidence in a scientific community?
…
Are these relationships causal?
It is often difficult or impossible to create an experiment to test causality directly, so we often rely on observational data.
Over the next few weeks we will move from testing for independence, to correlation, to causation.
Basic question: Are two variables independent of each other? For example, are gender and political party independent of each other, or are they somehow correlated or dependent?
Answer: Chi-square test:
Similar to the dependent-sample t test in that we have multiple measures of the same thing.
In our example we have two measures for each person: their gender, and their party affiliation.
We are now interested in whether two different aspects of the same person or object or (more generally) “observation” are connected to each other.
We are dealing with categorical data here, as opposed to numerical.
Our hypotheses are:
\(H_0\): The variables are independent of each other.
\(H_1\): The variables are not independent (i.e., they are dependent).
What do we mean by dependent? Basically we mean the same thing as we meant in the probability section: A and B are independent if \(P(A \& B) = P(A)P(B)\).
What does this mean?
Knowing that B has occurred gives you no information about A, they are independent.
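To make this concrete, here is a quick simulation sketch (the 0.6 and 0.7 probabilities are arbitrary choices for illustration): if A and B are generated independently, the observed share of joint occurrences should be close to the product of the marginal probabilities.

```r
# Simulate two independent binary variables and check the product rule
set.seed(1)
A <- rbinom(100000, 1, 0.6)    # P(A) = 0.6
B <- rbinom(100000, 1, 0.7)    # P(B) = 0.7
mean(A == 1 & B == 1)          # should be close to 0.6 * 0.7 = 0.42
```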
Research Question: Are gender and political affiliation dependent?
Our datatable:
Risk of jumping to erroneous conclusions:
If we observed an equal number of people in each cell, we might conclude there is no relationship between gender and politics.
But: maybe there are more Democrats than Republicans in our survey?
If there are more Democrats than Republicans, we would expect to see equal numbers of male and female Democrats, but more of each than male or female Republicans.
But: maybe there are more females than males in our survey?
Then we would expect to see more female Democrats than male Democrats, and the same for Republicans, but also more female Democrats than female Republicans, and the same for males.
Let’s say we have the following data summary statistics:
60% women and 40% men;
and 70% Democrats vs 30% Republicans.
If gender and political affiliation are independent:
then we would expect the proportion of female Democrats to be \(0.60 \times 0.70\), male Democrats \(0.40 \times 0.70\), female Republicans \(0.60 \times 0.30\), and male Republicans \(0.40 \times 0.30\). (And to go from proportions to total numbers, we would multiply by the total number of people in our survey.)
Knowing gender does not give you information about knowing political affiliation (remember the module about probabilities: knowing that B has occurred gives you no information about A = independence).
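As a sketch in R, using the hypothetical 60/40 and 70/30 marginals from above: under independence, the cell proportions are just the outer product of the marginal proportions.

```r
# Marginal proportions from the hypothetical example
gender <- c(female = 0.60, male = 0.40)
party  <- c(dem = 0.70, rep = 0.30)

# Under independence, each cell proportion = product of its marginals
expected_prop <- outer(gender, party)
expected_prop   # female row: 0.42, 0.18; male row: 0.28, 0.12
```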
If gender and political affiliation are not independent, then knowing one does give us information about the other.
Back to our politics datatable:
Percent female: 1511 / 2771 = 0.545
Percent male: 1260 / 2771 = 0.455
Percent Dem: 959 / 2771 = 0.346
Percent Indep: 991 / 2771 = 0.358
Percent Rep: 821 / 2771 = 0.296
If gender and party ID are independent:
If being Democrat and being female are independent of each other, then
\[p(Dem \& F) = p(Dem) \times p(F) = 0.346 \times 0.545 \approx 0.189\]
Thus the total number of female Democrats we would expect to see (the total in that cell) is \(0.189 \times 2771 \approx 522.9\) (keeping full precision in the intermediate steps).
We can do the exact same calculation for each cell, and the expected totals we get are:
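The same calculation can be sketched in R, with the observed counts entered by hand from the table:

```r
# Observed counts from the gender x party table
obs <- rbind(female = c(dem = 573, indep = 516, rep = 422),
             male   = c(dem = 386, indep = 475, rep = 399))

# Expected count in each cell = (row total)(column total) / overall total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)   # female/dem cell is 522.9, as computed above
```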
But how different is enough to show it’s not just random?
Similar to the t test or the F test:
We need a test statistic that summarizes how far off our counts in each cell are from what we would expect if the two variables are independent.
When we have our test statistic, we can determine how unlikely it would be to get a number of that size.
Our test statistic, called the chi-squared (or sometimes chi-square) statistic, is:
\[\chi^{2}= \sum \frac{(f_{o}-f_{e})^{2}}{f_{e}}\]
Where \(f_{o} =\) observed number in a cell and \(f_{e} =\) expected number in a cell, and the summation is over all the cells.
In words:
For each cell, we take the difference between the observed count and the expected count.
We square that difference,
and take it as a fraction of the expected count;
and then we just add them all up.
So in this case, our statistic is:
\(\chi^{2}= \sum \frac{(f_{o}-f_{e})^{2}}{f_{e}} = \frac{(573-522.9)^{2}}{522.9} + \frac{(516-540.4)^{2}}{540.4} + ... + \frac{(399 - 373.3)^{2} }{373.3} = 16.2\)
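The same sum can be computed directly in R, reproducing the 16.2 (observed counts entered by hand from the table):

```r
obs <- rbind(female = c(573, 516, 422),
             male   = c(386, 475, 399))
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Sum of (observed - expected)^2 / expected over all six cells
chisq_stat <- sum((obs - expected)^2 / expected)
round(chisq_stat, 1)   # 16.2
```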
Is this big enough?
The chi-square is just another distribution.
Whereas the normal and t distributions, for instance, deal with sample statistics such as means, the chi-square distribution characterizes the sum of squared standard normal variables.
Looking back at how we calculated our chi-square statistic, there are squared terms in the numerator, and of course the denominator is also positive (being a count), so the chi-square is always positive, and thus can’t be normal.
Look at what happens if we square a bunch of standard normal samples and add them up. We get a distribution that looks like the \(\chi^{2}\) (here with 3 degrees of freedom, since we are summing three squared standard normals):
z1 <- rnorm(1000)
z2 <- rnorm(1000)
z3 <- rnorm(1000)
zsq_tot <- z1^2 + z2^2 + z3^2
hist(zsq_tot, breaks=30)
As usual, the chi-square distribution also has a shape parameter that is determined by the degrees of freedom.
The degrees of freedom depend not on the number of samples (as with the t distribution) but on the number of rows and columns in the table.
Degrees of freedom:
\[df = (r-1)(c-1),\]
where \(r =\) number of rows and \(c =\) number of columns.
The shape of the \(\chi^{2}\) distribution varies with the df:
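A quick sketch of those shapes can be drawn with the built-in density function `dchisq` (the df values 1, 2, and 5 here are arbitrary choices for illustration):

```r
# Plot the chi-square density for several degrees of freedom
x <- seq(0.01, 15, length.out = 300)
plot(x, dchisq(x, df = 1), type = "l", ylim = c(0, 0.5), ylab = "density")
lines(x, dchisq(x, df = 2), lty = 2)
lines(x, dchisq(x, df = 5), lty = 3)
legend("topright", legend = paste("df =", c(1, 2, 5)), lty = 1:3)
```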
Our test statistic is a draw from the \(\chi^{2}\) distribution (with the appropriate degrees of freedom), and the farther out it is, the less likely it is.
The chi-square test is fundamentally one-tailed: we are only interested in whether the statistic is larger than we would expect if the variables were independent, and it can't be negative because the differences are squared.
If it falls into the rejection region (e.g., the region of the right tail of the distribution that accounts for less than 0.05 of the total), then we know that the statistic was unlikely to be that large by chance alone.
To return to our example, the df is \((r-1)(c-1) = (2-1)(3-1) = 2\), and our test statistic was 16.2.
Our 95% threshold value is thus
qchisq(.95, df=2)
[1] 5.991465
Our test statistic is clearly much larger (16.2 > 5.99), so we reject the null that these two variables (gender and political affiliation) are independent.
We could similarly calculate the p-value directly and likewise reject the null:
1-pchisq(16.2, df=2)
[1] 0.0003035391
Although again, with modern computation we don't really need to use tables any more, we could also determine the test threshold value using the \(\chi^{2}\) table:
As usual, we find the \(df\) on the right, and look for the \(\alpha\) level along the top. E.g., for an \(\alpha\) of 0.05, we look under \(\chi^{2}_{0.050}\), and once again we see our threshold value of 5.991.
Our conclusion: Gender and party are not independent.
Can we say anything more?
Let’s go back to our table showing the expected vs the observed frequencies, and calculate the signed score for each cell, which is just
\[\frac{(f_{o}-f_{e})^{2}}{f_{e}}\]
with the sign of the difference \(f_{o}-f_{e}\) attached, so we can see whether each cell is above or below its expected value.
Our datatable:
Which cells are over or under their expected values?
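A sketch of this in R, attaching the sign of the difference to each cell's contribution (observed counts entered by hand from the table):

```r
obs <- rbind(female = c(dem = 573, indep = 516, rep = 422),
             male   = c(dem = 386, indep = 475, rep = 399))
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Each cell's chi-square contribution, keeping the sign of (observed - expected)
signed <- sign(obs - expected) * (obs - expected)^2 / expected
round(signed, 2)   # positive: more than expected; negative: fewer than expected
```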
Show me how to calculate this for each cell.

To summarize, the steps of the chi-square test:

1. \(H_{0}\): variables are independent; \(H_{a}\): variables are not independent.
2. Calculate \(f_{e}\) for each cell. Shortcut: \(f_{e} = \frac{\textrm{(row total)(column total)}}{\textrm{overall total}}\)
3. Calculate \(\chi^{2} = \sum \frac{(f_{o}-f_{e})^{2}}{f_{e}}\)
4. Calculate \(df = (r-1)(c-1)\)
5. Calculate the threshold value and reject the null if the test statistic (step 3) is greater than it; or calculate the p-value directly and reject the null if it is less than your chosen \(\alpha\).
sexparty <- data.frame(dem=c(573,386),indep=c(516,475),rep=c(422,399),row.names=c("female","male"))
sexparty
dem indep rep
female 573 516 422
male 386 475 399
chisq.test(sexparty)
Pearson's Chi-squared test
data: sexparty
X-squared = 16.202, df = 2, p-value = 0.0003033
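Note that the object returned by `chisq.test` also stores the expected counts and the Pearson residuals, which give the same per-cell over/under picture:

```r
sexparty <- data.frame(dem = c(573, 386), indep = c(516, 475), rep = c(422, 399),
                       row.names = c("female", "male"))
res <- chisq.test(sexparty)

round(res$expected, 1)    # expected counts under independence
round(res$residuals, 2)   # Pearson residuals: (observed - expected) / sqrt(expected)
```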
Calculating in R with the GSS data:
From http://gss.norc.org/About-The-GSS: “For more than four decades, the General Social Survey (GSS) has studied the growing complexity of American society. It is the only full-probability, personal-interview survey designed to monitor changes in both social characteristics and attitudes currently being conducted in the United States.”
For this example, we want to know if the attitude towards the future is independent of the number of hours spent on the internet.
First, let’s load the GSS database
setwd("/Users/econphd/Dropbox/neu/2021/Summer/INSH6500/Lectures/GSS")
gss <- readRDS("GSS2016.Rds")
Second, for this test we are going to extract two variables from the database: (1) LOTR3, which contains the answers to the survey question “I’m always optimistic about my future”, and (2) INTWKDYH, which contains the answers to the survey question “How many minutes or hours do you spend actively using the Internet or web-enabled applications/APPS on a typical weekday?”. The answers are on a scale of 1-7; any value outside of that range is treated as NA in the analysis.
hope <- gss$LOTR3
internetUsage <- gss$INTWKDYH
Third, clean the data:
hope <- replace(hope, hope < 1 | hope > 7, NA)
internetUsage <- replace(internetUsage, internetUsage < 1 | internetUsage > 7, NA)
# Converting hope to numeric and internetUsage to factor
hope <- as.numeric(hope)
internetUsage <- as.factor(internetUsage)
Fourth, join the two vectors in a data frame and make a frequency table:
data.ex2 <- data.frame(hope, internetUsage)
tabularData <- table(data.ex2$hope, data.ex2$internetUsage)
data.ex2 <- as.data.frame.matrix(tabularData)
colnames(data.ex2) <- paste("internet_group", 1:ncol(tabularData), sep = "_")
rownames(data.ex2) <- paste("hope_group", 1:nrow(tabularData), sep = "_")
chisq.test(data.ex2)
Pearson's Chi-squared test
data: data.ex2
X-squared = 17.556, df = 24, p-value = 0.824
Then, at \(\alpha = 0.05\), we cannot reject the null hypothesis that attitude towards the future and internet usage are independent variables.