
Hypothesis testing

Overview

This lesson introduces joint hypothesis tests: the analysis of variance (ANOVA) and the F-test.

Objectives

After completing this module, students should be able to:

  1. Test for differences among multiple groups.
  2. Explain the derivation of the F-Test.
  3. Conduct an F-Test by hand and using R.

Testing Hypotheses with Multiple Groups

Previous Learning Module

  • T-Test: Comparing whether the means of two different groups are statistically different.

This Week

  • Comparing means for two or more groups simultaneously. For example,
    • Do individuals who identify with different political parties (e.g. Republican, Democrat, Independent) think about public matters differently?
    • Are average lifetime earnings different among a set of professions (doctor, lawyer, programmer, designer, etc.)?
    • Do individuals from different communities have different average life expectancy?

ANOVA

Why can’t we just use the t-test?

Recall that for the t-test, we take the difference between the sample means of two groups and use it as the quantity in the null hypothesis. The key is that we can rewrite the hypothesis in terms of a single new variable (the mean difference), which allows us to compute a single t-statistic for the test. With more than two groups, we can no longer rewrite the null as a function of a single variable; for example, with three groups we need to test at least two mean differences (e.g. \(\mu_1 - \mu_2\) and \(\mu_2 - \mu_3\)).

How do we test for differences in means across multiple groups?

  • Through an Analysis of Variance, or ANOVA.
  • The specific test is called the F-Test and the procedure is very similar to the t-test:
    • Formulate Hypotheses: \(H_0\): All group means are equal; \(H_1\): At least one of the means is different.
    • Determine critical rejection region threshold of the test.
    • Calculate the corresponding F statistic from the sample (this plays the same role as the t statistic in the t-test), and compare it to a threshold value based on the F distribution (more about the F distribution in the following slides).
    • Make a decision: Reject the null hypothesis in favor of the alternative hypothesis if the test statistic is in the rejection region (or equivalently, the p-value is less than the significance level).

The Hypotheses

Assume that we have observations from \(g\) different groups, and we want to test whether the population means of the groups are statistically different. Formally, we can formulate the following hypotheses:

\[\begin{eqnarray*} H_0 & : & \mu_1 = \mu_2 = ... = \mu_g \\ H_a & : & \text{At least one is different.} \end{eqnarray*}\]

where \(\mu_i\) is the mean of group \(i\), \(\forall i = 1,2,...,g\).

The F-Statistic

Imagine that we collect data on a variable \(X\) for \(g\) different groups.

  • Each group sample has its own mean and standard deviation.

  • The mean of one group (\(\mu_i\)) may or may not be statistically different from another's, depending on the spread of each distribution (\(\sigma_i\)).

  • The greater the mean differences among groups, the greater the probability that the samples are drawn from different population distributions.
  • But the more spread out each group is, the less likely they are to differ. In other words, the higher the standard deviation (the more dispersed the sample data), the harder it will be to statistically prove that the means are different (i.e., the harder it will be to reject the null hypothesis).

Therefore, we need a ratio that combines these two forces into a single measure of evidence that the group means are different:

\[\text{F-statistic} = \dfrac{\text{average variance between groups}}{\text{average variance within groups}}\]

The greater the numerator, the more different the means of the distributions are; but the greater the denominator, the less significant that difference is. This ratio, the F-statistic, increases when the sample averages of the different groups are widely different, and decreases with the dispersion of the data.

Between-Group Variance

Between-Group Variance is an overall measure of how different the means of all the groups are from each other. Formally, for \(g\) groups:

\[\text{Between Variance} = \dfrac{n_1 (\bar{y_1}- \bar{y})^2 + n_2 (\bar{y_2}- \bar{y})^2 + ... + n_g (\bar{y_g}- \bar{y})^2 }{g-1}\]

Where \(g\) is the number of groups, \(n_i\) is the sample size of group \(i\), \(\bar{y_i}\) is the sample mean of group \(i\), and \(\bar{y}\) is the overall mean, i.e. the mean that results from combining the observations of all samples.

The between variance can be calculated in five simple steps (see the R sketch after this list):

  1. Calculate the overall mean of the samples pooled together, \(\bar{y}\).
  2. Measure how much each group \(i\) differs from that overall mean, i.e. compute the difference \((\bar{y_i}- \bar{y})\) for each group \(i\). And square the result.
  3. Weight each squared difference by the sample size \(n_i\) of the group. (Because a group of 1000 individuals should have a larger effect on the overall difference between the groups than a group of just 10)
  4. Sum the weighted squared differences. This gives a single number that captures the variance of all the \(\bar{y_i}\) values around their collective mean \(\bar{y}\).
  5. Divide by \(g-1\), the degrees of freedom of the between-group variance: the number of groups \(g\) minus one (we subtract one because we use one estimated statistic, the pooled sample average, to compute the between-group variance).
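
As a companion to these steps, here is a minimal R sketch. The function name between_variance and the example group summaries are illustrative, not part of the lesson's data.

# Between-group variance from group sizes and group means
between_variance <- function(n, ybar) {
  g     <- length(n)                 # number of groups
  ypool <- sum(n * ybar) / sum(n)    # step 1: overall (pooled) mean
  sqdif <- (ybar - ypool)^2          # step 2: squared differences from the pooled mean
  sum(n * sqdif) / (g - 1)           # steps 3-5: weight by n_i, sum, divide by g - 1
}

# Illustrative call: three groups with sizes 10, 12, 8 and means 2.1, 3.4, 2.8
between_variance(n = c(10, 12, 8), ybar = c(2.1, 3.4, 2.8))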

Within-Group Variance

Within-Group Variance is an overall measure of the dispersion within the different samples. Formally, for \(g\) groups:

\[\text{Within Variance} = \dfrac{(n_1-1)s_1^2 +(n_2-1)s_2^2 + ... + (n_g-1)s_g^2 }{N-g}\]

Where \(g\) is the number of groups, \(n_i\) is the sample size of group \(i\), \(s_i\) is the standard deviation of group \(i\), and \(N\) is the overall sample size.

The within variance can be calculated in just four steps (see the R sketch after this list):

  1. Calculate the standard deviation for each group, \(s_i\). And square the result.
  2. Weight each squared standard deviation by the sample size minus 1 (\(n_i-1\)).
  3. Sum the weighted squared standard deviations.
  4. Divide by the degrees of freedom \(N-g\).
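
Again, a minimal R sketch of these steps; the function name within_variance and the example values are illustrative.

# Within-group variance from group sizes and group standard deviations
within_variance <- function(n, s) {
  g <- length(n)                   # number of groups
  N <- sum(n)                      # overall sample size
  sum((n - 1) * s^2) / (N - g)     # steps 1-4: square, weight by n_i - 1, sum, divide by N - g
}

# Illustrative call: same three groups, with standard deviations 1.1, 1.4, 0.9
within_variance(n = c(10, 12, 8), s = c(1.1, 1.4, 0.9))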

Combining Both Variances

Now that we know how to compute both the between- and within- group variance, we can compute the F-statistic using the following formula:

\[\begin{eqnarray*} \text{F-statistic} & = & \dfrac{\text{average variance between groups}}{\text{average variance within groups}} \\ & = & \dfrac{\dfrac{n_1 (\bar{y_1}- \bar{y})^2 + n_2 (\bar{y_2}- \bar{y})^2 + ... + n_g (\bar{y_g}- \bar{y})^2 }{g-1}}{\dfrac{(n_1-1)s_1^2 +(n_2-1)s_2^2 + ... + (n_g-1)s_g^2 }{N-g}} \\ & = & \left(\dfrac{N-g}{g-1} \right)\left(\dfrac{n_1 (\bar{y_1}- \bar{y})^2 + n_2 (\bar{y_2}- \bar{y})^2 + ... + n_g (\bar{y_g}- \bar{y})^2}{(n_1-1)s_1^2 +(n_2-1)s_2^2 + ... + (n_g-1)s_g^2} \right) \\ & = & \left(\dfrac{N-g}{g-1} \right)\left(\dfrac{\Sigma_i^g n_i (\bar{y_i}- \bar{y})^2}{\Sigma_i^g(n_i-1)s_i^2} \right) \end{eqnarray*}\]

Note that there are two degrees of freedom expressions in the F statistic:

  • Degrees of freedom of the numerator: \(g-1\)
  • Degrees of freedom of the denominator: \(N-g\)

When computing the F-statistic threshold in R you need to specify each degree of freedom separately.
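
Putting both pieces together, here is a sketch of a helper that returns the F statistic along with the two degrees of freedom; the function name f_statistic and the example summaries are ours, not from the lesson.

# F statistic from group sizes, means, and standard deviations
f_statistic <- function(n, ybar, s) {
  g       <- length(n)
  N       <- sum(n)
  ypool   <- sum(n * ybar) / N                     # overall (pooled) mean
  between <- sum(n * (ybar - ypool)^2) / (g - 1)   # numerator, df1 = g - 1
  within  <- sum((n - 1) * s^2) / (N - g)          # denominator, df2 = N - g
  c(F = between / within, df1 = g - 1, df2 = N - g)
}

# Illustrative call with made-up summaries for three groups
f_statistic(n = c(10, 12, 8), ybar = c(2.1, 3.4, 2.8), s = c(1.1, 1.4, 0.9))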

Example 1: Party ID and Ideology

Consider that we conduct a survey and ask individuals to provide their Party ID and to rank themselves on a 1-7 ideology scale (1 means far left, and 7 means far right).

Then we propose the following Research Question: On an ideology scale (Liberal vs Conservative), does average ideology differ by Party ID (Democrat, Independent, or Republican)?

The table below shows the results from the survey (sample size, mean ideology, and standard deviation by Party ID):

Group          n     Mean   SD
Democrat       91    3.23   1.28
Independent    111   3.90   1.43
Republican     74    4.70   1.10
Overall        276   3.89

Are the means of each party group (Democrats, Independents, and Republicans) different?

To answer this question we cannot rely on a simple t-test. This is a job for the F-statistic.

Calculating the F-statistic

Recall that F-statistic is equal to:

\[\text{F-statistic} = \dfrac{\text{average variance between groups}}{\text{average variance within groups}}\]

  1. Calculating the Between-Variance (with overall mean \(\bar{y} = 3.89\))
\[\begin{eqnarray*} \text{Between Variance} & = & \dfrac{n_1 (\bar{y_1}- \bar{y})^2 + n_2 (\bar{y_2}- \bar{y})^2 + ... + n_g (\bar{y_g}- \bar{y})^2 }{g-1} \\ & = & \dfrac{91 (3.23- 3.89)^2 + 111 (3.90- 3.89)^2 + 74 (4.70- 3.89)^2 }{3-1} \\ & = & 44.1 \end{eqnarray*}\]
  2. Calculating the Within-Variance
\[\begin{eqnarray*} \text{Within Variance} & = & \dfrac{(n_1-1)s_1^2 +(n_2-1)s_2^2 + ... + (n_g-1)s_g^2 }{N-g} \\ & = & \dfrac{(91-1)(1.28)^2 +(111-1)(1.43)^2 + (74-1)(1.10)^2 }{276-3} \\ & = & 1.69 \end{eqnarray*}\]
  3. Calculating the F-Statistic (see the R check after this list)
\[\begin{eqnarray*} \text{F-statistic} & = & \dfrac{\text{average variance between groups}}{\text{average variance within groups}} \\ & = & \dfrac{44.1}{1.69} \\ & = & 26.1 \end{eqnarray*}\]
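
As a check on the arithmetic, the same three steps can be reproduced in R from the summary statistics above:

n    <- c(91, 111, 74)       # Democrats, Independents, Republicans
ybar <- c(3.23, 3.90, 4.70)  # group means
s    <- c(1.28, 1.43, 1.10)  # group standard deviations
between <- sum(n * (ybar - 3.89)^2) / (3 - 1)  # approximately 44.1
within  <- sum((n - 1) * s^2) / (276 - 3)      # approximately 1.69
between / within                               # approximately 26.1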

How do we know if 26.1 is large enough to reject the null hypothesis?

F-Distribution

Last week, we learned that the t distribution has one parameter (besides the mean and standard error) that affects its shape: the degrees of freedom, which determine how much of the mass of the distribution is in the tails versus in the middle.

Degrees of freedom in the F distribution:

Instead of one, there are now two shape parameters, reflecting two different degrees of freedom:

Recall that the degrees of freedom of the numerator, call it \(df_1\), is equal to \(g-1\), and the degrees of freedom of the denominator, call it \(df_2\), is equal to \(N-g\). \(df_1\) and \(df_2\) together determine the shape of the F distribution, and thus whether the F test statistic is large enough to reject the null. The sketch below draws the F density for several combinations of the two degrees of freedom.
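
A quick base-R sketch (the degrees-of-freedom pairs are illustrative choices) shows how the shape of the F distribution changes with \(df_1\) and \(df_2\):

# Density of the F distribution for a few (df1, df2) pairs
curve(df(x, df1 = 2, df2 = 10),  from = 0, to = 5, ylab = "Density", xlab = "F")
curve(df(x, df1 = 5, df2 = 30),  from = 0, to = 5, add = TRUE, lty = 2)
curve(df(x, df1 = 2, df2 = 273), from = 0, to = 5, add = TRUE, lty = 3)
legend("topright", legend = c("df1 = 2, df2 = 10", "df1 = 5, df2 = 30", "df1 = 2, df2 = 273"), lty = 1:3)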

Interpreting the F-Statistic

  • As with the t distribution, we test whether the null hypothesis is true (i.e. whether observed variation between groups is simply due to chance).

  • The F distribution is the sampling distribution of the F statistic when the null hypothesis is true; it is the reference distribution for the ANOVA test.

  • We want to know if the calculated F statistic is sufficiently unlikely to have been drawn by chance. If so, we reject the null, that all the means are the same, in favor of the alternative, that at least one is different from the rest.

  • Unlike the t, the F is always non-negative – we are adding non-negative numbers (squared numbers). Thus the F test is always one-tailed.

  • To reject the null we need a large F statistic: recall that the formula of the F statistic implies that the larger the differences among the group means, the larger the F statistic.

  • A big F statistic reflects group means that are more different from one another, whereas the smallest possible value, 0, reflects means that are all identical (the null).

Then, we’ll reject the null hypothesis if:

\[F_\text{calculated} > F_\text{threshold}\]

Calculating the F Threshold

Returning to our example, the two degrees of freedom are:

  • \(df_1 = g - 1 = 3 - 1 = 2\)
  • \(df_2 = N - g = 276 - 3 = 273\)

And the calculated F statistic is \(F_\text{calculated} = 26.1\).

In R, if the significance level is \(\alpha = 0.05\), the threshold is the value of the F distribution that leaves an area of 0.05 in the right tail; to reject the null, the calculated F statistic must exceed this threshold.

# Parameters
N <- 276       # Total number of observations
g <- 3         # Three groups: DEM, REP, IND
alpha <- 0.05  # Significance level

# Degrees of freedom
df1 <- g - 1   # Degrees of freedom of the numerator
df2 <- N - g   # Degrees of freedom of the denominator

# F threshold
fVal <- qf(1 - alpha, df1, df2)
fVal
## [1] 3.028847

The F threshold, 3.028847, is less than the calculated F, 26.1. This means that we can reject the null that the means of all three groups are the same, in favor of the alternative hypothesis that at least one of them is different. Substantively, average ideology differs across Party ID groups.

Using p-values

Again, as with the t test, we can also directly calculate the p-value for the score we got (26.1), which is the probability of getting something that large or larger assuming the null is true.

1 - pf(26.1, df1, df2)
## [1] 4.242806e-11

This is clearly much lower than the significance level (\(\alpha = 0.05\)), so once again we can reject the null.

Example 2: Doing the F Test in R

Research Question: We are interested in “alertness” measurements for three different groups, each of which received a different dosage of some drug (coded as a, b, and c). Are there any differences in alertness levels across these three dosage groups?

For that we are going to use data from the personality-project website.

datafilename <- "http://personality-project.org/r/datasets/R.appendix1.data"
data.ex1 <- read.table(datafilename, header = TRUE)
head(data.ex1)
##   Dosage Alertness
## 1      a        30
## 2      a        38
## 3      a        35
## 4      a        41
## 5      a        27
## 6      a        24

ANOVA test in R

The way to conduct the F-test in R is with the aov command; aov stands for Analysis of Variance, or ANOVA for short. Using the data from the personality project dataset, we can simply run the following commands in R:

aov.ex1 <- aov(Alertness ~ Dosage, data = data.ex1)
summary(aov.ex1)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Dosage       2  426.2  213.12   8.789 0.00298 **
## Residuals   15  363.8   24.25                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test gives us:

  • Degrees of freedom: \(df_1 = 2\) (there are three groups) and \(df_2 = 15\) (there are 18 observations).

  • The between variance (213.12) and the within variance (24.25).

  • The F statistic (between variance/within variance) = 8.789.

  • The p-value, which equals 1-pf(8.789, 2, 15) = 0.00298.

Then, at \(\alpha = 0.05\) we reject the null hypothesis that average alertness is equal across the three dosage groups.
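
A quick numeric summary of the group means can be obtained with tapply:

# Mean alertness for each dosage group
tapply(data.ex1$Alertness, data.ex1$Dosage, mean)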

If you want to see the mean differences among groups visually, you can make a box plot using ggplot2:

library(ggplot2)
ggplot(data = data.ex1, aes(y = Alertness, fill = Dosage, x = Dosage)) + geom_boxplot()