
Hypothesis testing

Overview

This lesson introduces the basic concepts of a hypothesis test.

Objectives

After completing this module, students should be able to:

  1. Explain the difference between the null and the alternative (research) hypothesis.
  2. Explain the importance of “rejecting the null”.
  3. Explain one-tailed vs two-tailed tests and p-values.

Randomness and Hypotheses

Randomness is a measure of our inability to predict the outcome of an experiment. The more random an event is, the less likely we are to correctly predict its outcome. The more we know about the factors associated with a certain outcome, the less random the event becomes. For example, if we want to know whether it is going to rain in the next 15 minutes, a good estimate of the likelihood of rain can be obtained by looking up and checking whether the sky is clear or cloudy. We can formalize this exercise using the concept of conditional probability from the previous learning module:

\[\begin{eqnarray*} p_{\text{clear}} = P(y = rain | x = clear) = \text{Probability of rain given that the sky is clear} \\ \\ p_{\text{cloudy}} = P(y = rain | x = cloudy) = \text{Probability of rain given that the sky is cloudy} \end{eqnarray*}\]

From our lived experience we can guess that \(p_{\text{cloudy}}>p_{\text{clear}}\): it is more likely to observe rain if the sky is cloudy than if it is clear. This guess is what we call a hypothesis. To hypothesize is a synonym of to suppose or to assume. We inform this guess through the informal observation of past experiences.

More interesting is the fact that we also know that we are not always right; sometimes the sky is cloudy and we observe no rain, and sometimes the sky is clear and we observe rain. Therefore, hypotheses are generally expressed in terms of probabilities for events with random outcomes.
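We can make this guess concrete with a small simulation in R. All the numbers below (the share of cloudy days and the two conditional probabilities of rain) are hypothetical values chosen purely for illustration:

# Hypothetical simulation: estimating P(rain | cloudy) and P(rain | clear)
set.seed(1)
n <- 100000
cloudy <- rbinom(n, 1, 0.4) == 1      # assume 40% of days are cloudy
p_rain <- ifelse(cloudy, 0.35, 0.05)  # assumed true conditional probabilities
rain <- rbinom(n, 1, p_rain) == 1

mean(rain[cloudy])   # estimate of P(rain | cloudy), close to 0.35
mean(rain[!cloudy])  # estimate of P(rain | clear), close to 0.05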

In this module we are going to learn how to test the assumptions we make about random processes. In other words, we will learn how to use statistics and R to systematically decide whether discrepancies between our assumptions and the observed data can be attributed to random chance.

Population mean, sample average and hypotheses

Many assumptions about the world have a binary response (yes/no), for example:

  • The mean hourly earnings of recent U.S. college graduates equal $20 per hour.
  • The mean earnings are not the same for male and female college graduates.
  • The proportion of defective products decreased after a new manufacturing process was introduced.
  • A ski resort decreased its average response time to accidents.
  • A new facial recognition algorithm is faster than the competition.
  • Drug \(x\) reduces the probability of observing \(y\) side-effect in relation to drug \(z\).

All the previous hypotheses can be rephrased in terms of the population mean. Recall that the population of a distribution is the actual or real distribution of a random experiment, which is often not observed directly because it is costly or time consuming to get data on all individuals of a population. The population mean is the average of the population; we often use the terms the real or the actual mean to refer to the population mean.

Because the population of a random experiment is not observable - we never know with total certainty the actual distribution that is generating the data, unless we are using a computer simulation - we need to use an estimate. An estimator is a function of a sample of data drawn randomly from the population. The sample mean or sample average is an estimator of the population mean. Note that the sample mean is a random variable because it depends on the sample data, which is a random draw from the population. That is why we should not be surprised to observe different values of the sample average from different samples of the same population.

Example: Imagine that we want to verify the hypothesis “The mean hourly earnings of recent U.S. college graduates equal \(\$20\) per hour”. It is impossible for us to observe the population of hourly earnings of all U.S. college graduates, but we can survey a number \(n\) of individuals and use the sample average as an estimator of the population mean.

Let’s simulate this in R to have a deeper understanding of this concept. For this exercise we are going to assume that the actual population mean is indeed \(\$20\).

set.seed(1)   # To get same random data always
N <- 1000000  # population size
mu <- 20      # population mean
sigma <- 10   # population standard deviation
population <- rnorm(N, mu, sigma) # generating population data

# Population mean
mean(population)
## [1] 20.00047

This is the average using the entire population data. Now let’s take 3 different samples of 10 observations each and compare them:

set.seed(1) # To get same data always
n <- 10 #sample size
sample1 <- sample(population, size = n )
sample2 <- sample(population, size = n )
sample3 <- sample(population, size = n )

mu1 <- mean(sample1)
mu2 <- mean(sample2)
mu3 <- mean(sample3)

c(mu1,mu2,mu3)
## [1] 19.94367 19.02149 22.09363

We can verify that the different sample averages are:

  • close to the population mean but, in general, not exactly equal to it, and,
  • different from each other.

Unfortunately, in most practical applications we will have only one sample to assess the validity of a proposed hypothesis. For example, if we want to know whether the population mean is equal to \(\$20\) using the first sample, we compute the sample average, which is \(\approx \$19.94\), a difference of about \(\$0.06\) from the proposed hypothesis value of \(\$20\).

Is this difference large enough to reject the assumption that the population mean is equal to 20? There will always be some difference due to the fact that we only have a random sample; how tolerant are we as researchers to those discrepancies? The process of testing a hypothesis involves answering these questions.

Null and Alternative Hypothesis

Before we test a hypothesis, we need to learn how to write it in a way that allows us to use statistics to test it. This is very simple; you just need to follow these steps:

  1. Formalize the hypothesis statement in terms of the population mean, using non-ambiguous terminology and/or clear mathematical notation.

  2. Consider what would be true if the hypothesis statement were false, and formalize this idea just like you did in step (1).

For example, say we want to state the example from the previous slide:

  • “The mean hourly earnings of recent U.S. college graduates is equal to \(\$20\) per hour.”

First, let’s write the statement as a function of the population mean (\(\mu_{x}\)).

  1. \(\mu_{x} = 20\)

Now, we just need to consider what would be true if (1) is false. We have different candidates: it could be that \(\mu_{x}>20\), \(\mu_{x}<20\), or \(\mu_{x} \neq 20\). For instance,

  2. \(\mu_{x} \neq 20\)

Hypothesis Testing Notation

We call the starting point of a statistical hypothesis test the null hypothesis (\(H_0\)), and what should be true when the null hypothesis is false the alternative hypothesis (\(H_A\) or \(H_1\)). These terms are part of the conventions of the hypothesis testing methodology; we always make them explicit and clear so that other researchers verifying our work can understand what the starting assumptions are.

Then, let’s write the previous example using the notation of hypothesis testing.

\[\begin{eqnarray*} H_{0}:\;\; & \mu_{x} = & 20 \\ H_{A}:\;\; & \mu_{x} \neq & 20 \end{eqnarray*}\]

or, more explicitly,

\[\begin{eqnarray*} \text{Null Hypothesis}:\;\; & \mu_{x} = & 20 \\ \text{Alternative Hypothesis}:\;\; & \mu_{x} \neq & 20 \end{eqnarray*}\]

Let’s practice writing hypotheses for some of the other research questions from the first slide of this section:

  • The mean earnings are not the same for male and female college graduates.
\[\begin{eqnarray*} H_{0}:\;\; & \mu_{male} \neq & \mu_{female} \\ H_{A}:\;\; & \mu_{male} = & \mu_{female} \end{eqnarray*}\]
  • A new facial recognition algorithm is faster than the competition (here \(\mu_{x}\) denotes the speed of the new algorithm).
\[\begin{eqnarray*} \text{Null Hypothesis}:\;\; & \mu_{x} > & \mu_{competition} \\ \text{Alternative Hypothesis}:\;\; & \mu_{x} \leq & \mu_{competition} \end{eqnarray*}\]

Testing

Let’s imagine that we know the following characteristics of a sample:

  • Sample average: \(\bar{x} = 2\)
  • Sample standard deviation: \(s=5\)
  • Sample size: \(n = 100\)

We want to test the hypothesis that the population average is equal to zero (\(\mu = 0\)), under the assumption that the population distribution is normal.

First, let’s formalize this statement:

\[\begin{eqnarray*} \text{Null Hypothesis}:\;\; & \mu_{x} = & 0 \\ \text{Alternative Hypothesis}:\;\; & \mu_{x} \neq & 0 \end{eqnarray*}\]

Now, let’s think about how we can use our knowledge of the normal distribution, together with the sample, to test the hypothesis.

In any given sample, the sample average \(\bar{x}\) will rarely be exactly equal to the hypothesized value (\(\mu=0\)). Differences between the sample average and the hypothesized value can arise because:

  1. The true mean is in fact not equal to the hypothesized value (which implies that the null hypothesis is false), or,
  2. The true mean is equal to the hypothesized value, but we observe some discrepancy due to the inherent randomness of the sample.

It is impossible to distinguish between these two possibilities with total certainty. Yet, even if a sample cannot provide conclusive evidence about the null hypothesis, it is possible to do a probabilistic calculation that permits testing the null hypothesis in a way that accounts for sample uncertainty.

Standard Error and Confidence Intervals

The probabilistic calculation that allows us to test hypotheses relies on the concepts of the standard error and confidence intervals.

Remember, a standard error works in two ways:

  1. The standard error determines the distribution of the sample average values around the population mean.
  2. The standard error also allows us to make claims about where \(\mu\) is relative to \(\bar{x}\).

In the previous example, what is the standard error?

\[S.E. = \dfrac{s}{\sqrt{n}} = \dfrac{5}{\sqrt{100}} = \dfrac{5}{10} = \dfrac{1}{2} \]

The value \(S.E. = \dfrac{1}{2}\) means that if we collect different samples and compute the sample mean of each, the typical (standard) deviation of those sample means from the population mean will be 0.5, due to the inherent randomness of the sample.

Note how important this concept is: it tells us that the typical difference between the observed sample average and the real value of the population mean can be about 0.5 simply because we are working with a random sample.
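We can verify this claim with a short simulation (a minimal sketch, assuming a normal population with standard deviation 5 and samples of size \(n=100\)):

# Draw 10,000 samples of size 100 and compute each sample average
set.seed(1)
xbars <- replicate(10000, mean(rnorm(100, mean = 0, sd = 5)))
sd(xbars) # close to the standard error: 5/sqrt(100) = 0.5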

Then, we can argue that if the difference between the hypothesized value and the sample average is small relative to the standard error, we cannot reject the idea that the difference is due to the inherent randomness of the sample. This is the fundamental idea of hypothesis testing. Only when the difference is large enough relative to the standard error are we able to reject the null hypothesis.

Rejecting the Null

Imagine that the null hypothesis is true, i.e. \(\mu=0\). Then, using the normal distribution, we can build the following confidence intervals:

  • 90% confidence interval: \([\mu \pm 1.64 \times S.E.]\) or \(\{0 \pm 1.64 \times 0.5\}=\{−0.82,0.82\}\)
  • 95% confidence interval: \([\mu \pm 1.96 \times S.E.]\) or \(\{0 \pm 1.96 \times 0.5\}=\{−0.98,0.98\}\)
  • 99% confidence interval: \([\mu \pm 2.58 \times S.E.]\) or \(\{0 \pm 2.58 \times 0.5\}=\{−1.29,1.29\}\)
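In R, we can reproduce these intervals with the qnorm function (which returns the exact z-values instead of the rounded ones above):

se <- 0.5
mu0 <- 0 # hypothesized population mean
mu0 + c(-1, 1) * qnorm(0.95) * se  # 90% confidence interval
## [1] -0.8224268  0.8224268
mu0 + c(-1, 1) * qnorm(0.975) * se # 95% confidence interval
## [1] -0.979982  0.979982
mu0 + c(-1, 1) * qnorm(0.995) * se # 99% confidence interval
## [1] -1.287915  1.287915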

In other words, if the null is true there is a:

  • 90% chance that the sample average is in the confidence interval \(\{−0.82,0.82\}\)
  • 95% chance that the sample average is in the confidence interval \(\{−0.98,0.98\}\)
  • 99% chance that the sample average is in the confidence interval \(\{−1.29,1.29\}\)

Now, with a sample average of \(\bar{x}=2\) we can conclude that:

The null hypothesis - the assumption that the population mean is equal to zero - is not consistent with the observed sample average of 2. If the real value of the population mean were zero, there would be a 99% chance that the sample average falls within \(\{−1.29,1.29\}\), but the observed sample average is not within those values. Therefore, we cannot argue that the reason the sample average differs from the hypothesized value is random chance.

In summary, if the sample average is not in the confidence interval of the null hypothesis, we reject the null hypothesis. Note that rejecting the null does not imply accepting the alternative: we never prove a hypothesis; we simply cannot find enough evidence to reject it. Therefore, the previous test simply rejected the notion that the difference between the sample average and the hypothesized value is the result of random chance.
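The decision rule is easy to code. A minimal sketch using the numbers above:

xbar <- 2 # observed sample average
se <- 0.5
ci99 <- 0 + c(-1, 1) * qnorm(0.995) * se # 99% interval under the null
xbar >= ci99[1] & xbar <= ci99[2] # FALSE: xbar is outside, so we reject
## [1] FALSE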

“In favor of”

Now that we have rejected the null hypothesis, can we say that the alternative hypothesis (\(H_1\)) is true?

No.

  • We have rejected the null “in favor of” the alternative hypothesis. We have not necessarily proven the alternative hypothesis.

  • Based on the frequentist approach, we never really prove anything; we can only disprove our prior knowledge or theories.

  • We “provisionally” consider a hypothesis as probably true until we can reject it. Knowledge or scientific consensus is built when there are enough instances where a hypothesis cannot be rejected.

Type 1 and Type 2 Errors

A statistical hypothesis test can make two types of mistakes:

Type 1 Error = False Positive: Occurs when the researcher rejects the null hypothesis when in fact they shouldn’t have. Using a 95% confidence interval to test a hypothesis, if the null is true and we collect 100 different samples, on average 5 of them will produce false positives or type 1 errors. The significance level is the probability of committing a type 1 error; therefore, a hypothesis test based on a 95% confidence interval has a 5% significance level.

Type 2 Error = False Negative: Occurs when the researcher fails to reject the null hypothesis when they should have. One minus the probability of a type 2 error is called the power of the test. Generally we cannot determine the power of a test, because doing so requires prior knowledge of the actual population.

                              Null hypothesis is true         Null hypothesis is false
  We reject the null          False positive (Type 1 error)   OK
  We fail to reject the null  OK                              False negative (Type 2 error)

Type 1 Error = False Positive: On average, the surveyed Americans seem moderately conservative, but only by chance. We reject the null hypothesis when in fact we shouldn’t have.
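We can simulate how often this mistake happens (a sketch, assuming an ideology scale on which the null \(\mu=0\) is actually true):

# Simulated Type 1 error rate: the null (mu = 0) is true
set.seed(1)
type1 <- replicate(10000, {
  x <- rnorm(100, mean = 0, sd = 5)
  se <- sd(x) / sqrt(length(x))
  abs(mean(x) - 0) > 1.96 * se # TRUE when the 95% test rejects
})
mean(type1) # close to 0.05, the significance level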

Type 2 Error = False Negative: We surveyed only a few Americans and obtained a mean of 2. With few respondents, our standard error will be larger, and we fail to reject the null hypothesis when we should have. This can happen when we have weak tests (e.g. small surveys).
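The mirror-image simulation (again a sketch, now assuming the true mean is 2, so the null \(\mu=0\) is actually false, and only 5 respondents):

# Simulated Type 2 error rate: the null is false but the sample is small
set.seed(1)
type2 <- replicate(10000, {
  x <- rnorm(5, mean = 2, sd = 5)
  se <- sd(x) / sqrt(length(x))
  abs(mean(x) - 0) <= 1.96 * se # TRUE when we fail to reject
})
mean(type2) # probability of a Type 2 error; 1 - mean(type2) is the power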

The typical 95% confidence interval is designed to avoid Type 1 errors, even at the cost of greater Type 2 error. We would rather fail to reject the null than mistakenly reject it: we need strong evidence to reject prior knowledge!

P-value and rejection region

Based on a confidence interval, we can define two regions:

  • Rejection region: the interval in which we reject the null hypothesis.
  • Acceptance region: the interval in which we cannot reject the null hypothesis.

In the following graph, you can see the rejection region (in dark blue) and the acceptance region (in light blue) of the hypothesis test assuming \(\mu=0\) and \(S.E. = 1\) for a 95% confidence level.

  • 95% confidence interval \([\bar{x} \pm 1.96 \times S.E.]\) or \(\{0 \pm 1.96 \times 1\}=\{−1.96,1.96\}\)

Then, \(\{−1.96,1.96\}\) is the acceptance region, and the values outside of \(\{−1.96,1.96\}\) form the rejection region.

The probability mass that lies at least as far out in the tails as a given value is called the p-value. At the boundary of the rejection region (\(\pm 1.96\)), the p-value equals the mass of the rejection region, i.e. the significance level. We can compute it in R using the pnorm function:

# Note that we have to multiply 
# the pnorm() calculation by 2
# because we have to consider
# both tails of the distribution.

pnorm(1.96, 0, 1, lower.tail=F)*2
## [1] 0.04999579
#Or: 
(1-pnorm(1.96,0,1))*2
## [1] 0.04999579

This is approximately 5%. We can compare the p-value of the observed sample average with the significance level of the test: if the p-value is smaller than the significance level, we reject the null, which is equivalent to rejecting the null using the confidence interval.
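For instance, using the earlier example (\(\bar{x}=2\), \(S.E.=0.5\), null \(\mu=0\)):

xbar <- 2; mu0 <- 0; se <- 0.5
z <- (xbar - mu0) / se # standardized distance from the null: 4
pnorm(abs(z), lower.tail = FALSE) * 2
## [1] 6.334248e-05

The p-value is far below the 5% significance level, so we again reject the null.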

One-Tailed vs Two-Tailed Tests

  • A one-tailed test: Is a hypothesis test where the hypothesis is directional, i.e. the researcher is only interested in testing \(\mu > 0\) (or only in \(\mu < 0\)). E.g. Are computer science majors more likely to be unemployed 1 year after graduation than other graduates? Is the monthly average temperature lower in February?

  • A two-tailed test: Is a hypothesis test where the hypothesis is non-directional, i.e. the researcher is interested in testing \(\mu = 0\) against \(\mu \neq 0\). E.g. Are computer science majors equally likely to be unemployed 1 year after graduation as other graduates? Is the monthly average temperature different in February?

The type of test (one-tailed or two-tailed) that is under consideration is important because the calculation of the confidence interval and rejection regions will be different.

One-tailed rejection region

By definition, the rejection region of a one-tailed test should include only the region in which the null is rejected. If the significance level of the test is 5%, then the mass of the rejection region should be equal to 5%.

If the hypothesis is of the form \(\mu > \mu_0\), where \(\mu_0\) is the hypothesized value, then the rejection region is:

\[(-\infty,\; \mu_0 - \text{z-val}(\alpha) \times S.E.) = (-\infty,\; \mu_0 - 1.64 \times S.E.)\]

This is the case because, if the null is true, observing a sample average greater than \(\mu_0\) is evidence in support of the null, i.e. evidence that the population mean is greater than the hypothesized value. Only if the sample mean is sufficiently lower than the hypothesized value can we reject the null.

If the hypothesis is of the form \(\mu < \mu_0\), the rejection region is:

\[(\mu_0 + \text{z-val}(\alpha) \times S.E.,\; \infty) = (\mu_0 + 1.64 \times S.E.,\; \infty)\]

Now, only if the sample mean is sufficiently larger than the hypothesized value can we reject the null.

In R, it is very simple to find the z-value used to compute the rejection region of a one-tailed test, given a significance level, using the qnorm function:

significance <- 0.05 #Alpha is also known as the significance level
qnorm(1-significance ,0,1)
## [1] 1.644854

Example of one-tailed test

Imagine that we want to test that \(\mu > 1\) and we observe \(\bar{x}=1.2\), \(s=1\) and \(n=100\).

  • Step 1: Formalize the hypotheses:
\[\begin{eqnarray*} \text{Null Hypothesis:} \;\;\; \mu>1 \\ \text{Alternative Hypothesis:} \;\;\; \mu \leq 1 \end{eqnarray*}\]
  • Step 2: Set a significance level (\(\alpha = 0.05\)).

  • Step 3: Compute Standard Error.

s <- 1
n <- 100

se <- s / sqrt(n)
se
## [1] 0.1
  • Step 4: Compute the rejection region. Because this is a “greater than” hypothesis,

\[\mu_0 - \text{z-val}(\alpha) \times S.E. \]

mu <- 1
alpha <- 0.05
zval <- qnorm(1-alpha, 0, 1) # z-value for a 5% one-tailed test: 1.645
mu - zval*se
## [1] 0.8355146

Then, the rejection region is, \[(-\infty,0.836)\]

Because the sample mean is not in the rejection region, we cannot reject the null hypothesis. We would need to observe a sample mean of 0.836 or less to reject the null. This boundary value is known as the critical value of a hypothesis test.
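As a side note, qnorm can return this critical value directly if we give it the mean and standard error of the null distribution:

# Equivalent shortcut: the 5th percentile of the null distribution N(1, 0.1)
qnorm(0.05, mean = 1, sd = 0.1)
## [1] 0.8355146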

Two-tailed rejection region

By definition, the rejection region of a two-tailed test should include all the regions in which the null is rejected, that is, the area under both tails (above the mean and below the mean). If the significance level of the test is 5%, then the total mass of the rejection region should be equal to 5%, and the mass of each tail should be equal to \(5\%/2=2.5\%\).

If the hypothesis is of the form \(\mu \neq \mu_0\), where \(\mu_0\) is the hypothesized value, then the rejection region is:

\[(\mu_0 - \text{z-val}(\alpha/2) \times S.E.,\; \mu_0 + \text{z-val}(\alpha/2) \times S.E.) = (\mu_0 - 1.96 \times S.E.,\; \mu_0 + 1.96 \times S.E.)\]

If the hypothesis is of the form \(\mu = \mu_0\), then the rejection region is the complement of the previous one:

\[(-\infty,\; \mu_0 - 1.96 \times S.E.) \cup (\mu_0 + 1.96 \times S.E.,\; \infty)\]

The following figure illustrates this last case:

In R we can compute the \(\text{z-val}(\alpha/2)\) of each tail:

qnorm(.025,0,1) # Lower tail zval
## [1] -1.959964
qnorm(.975,0,1) # Upper tail zval
## [1] 1.959964
# Because the normal distribution is symmetric we can do
qnorm(.025,0,1) # Lower tail
## [1] -1.959964
-qnorm(.025,0,1) # Upper tail zval
## [1] 1.959964

Example of two-tailed test

Imagine that we want to test that \(\mu = 1\) and we observe \(\bar{x}=1.2\), \(s=1\) and \(n=100\).

  • Step 1: Formalize the hypotheses:
\[\begin{eqnarray*} \text{Null Hypothesis:} \;\;\; \mu = 1 \\ \text{Alternative Hypothesis:} \;\;\; \mu \neq 1 \end{eqnarray*}\]
  • Step 2: Set a significance level (\(\alpha = 0.05\)).

  • Step 3: Compute Standard Error.

s <- 1
n <- 100

se <- s / sqrt(n)
se
## [1] 0.1
  • Step 4: Compute the rejection region. Because this is an equality hypothesis,

\[(-\infty,\; \mu_0 - \text{z-val}(\alpha/2) \times S.E.) \cup (\mu_0 + \text{z-val}(\alpha/2) \times S.E.,\; \infty) \]

mu <- 1
alpha <- 0.05/2 # two-tailed test: half of the 5% in each tail
zval <- qnorm(1-alpha, 0, 1) # z-value: 1.96
c(mu - zval*se, mu + zval*se)
## [1] 0.8040036 1.1959964

Then, the rejection region is, \[(-\infty,0.804) \cup (1.196, \infty)\]

Because the sample mean (1.2) is in the rejection region, we can reject the null hypothesis. We would need to observe a sample mean between 0.804 and 1.196 to fail to reject the null.
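We can double-check this borderline result with the p-value, using the same numbers:

xbar <- 1.2; mu0 <- 1; se <- 0.1
z <- (xbar - mu0) / se # standardized distance from the null: 2
pnorm(abs(z), lower.tail = FALSE) * 2
## [1] 0.04550026

The p-value is just below the 5% significance level, consistent with the null being (barely) rejected.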