
Correcting for small samples

Overview

This lesson introduces the t distribution.

Objectives

After completing this module, students should be able to:

  1. Calculate percentiles and confidence intervals using the t distribution.
  2. Explain when the t distribution is more appropriate than the normal distribution.
  3. Calculate and apply finite population corrections.

Readings

Schumacker, pp 106-112.

T distribution

What if we have really small sample sizes (n < 100, or some would say n < 30)?

  • Our samples do not have the same properties as the standard normal curve (the z distribution).

  • Therefore, our sample statistics (such as \(\bar{x}\)) are not quite normally distributed around \(\mu\).

The answer is the T distribution:

  • Looks like the Z (normal), except that unlike the Z, it’s not always the same shape.

  • There’s one additional parameter (in addition to the mean and standard deviation), the degrees of freedom, which is a function of n. It essentially shifts how much of the mass is in the center versus the tails of the bell curve. When \(n\) is small, we get essentially some bonus uncertainty, on top of the usual contribution via the standard error.

  • We are also going to calculate the standard errors a bit differently because sample standard deviations are biased downward.

Two steps:

  1. Consider the alpha-level.

Take into consideration the alpha-level (\(\alpha\)): the \(\alpha\) associated with a 95% CI is 1 - 0.95 or 0.05, and the \(\alpha\) associated with a 99% CI is 1 - 0.99 or 0.01.

  2. Consider the degrees of freedom.

The degrees of freedom (df) simply indicate how different the sample distribution is from the standard normal curve: \(df = n - 1\). To get the percentile for a given score, we now need to use the t table rather than the z table. Likewise, if we want to calculate the 95% CI, we can’t just use 1.96 – we need to figure out a number specific to our \(n\).
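
A quick way to see this, using R’s built-in qt function (covered in more detail below): the 95% CI multiplier depends on the degrees of freedom, and only approaches 1.96 as \(n\) gets large. A minimal sketch:

# 95% multiplier qt(0.975, df) for several sample sizes, with df = n - 1
sapply(c(4, 8, 30, 100, 1000), function(n) qt(0.975, df = n - 1))
# roughly 3.18, 2.36, 2.05, 1.98, 1.96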

The degrees of freedom (df)

The degrees of freedom is:

\[n-1\] Just as it was with the sample standard deviation!

Why are degrees of freedom important?

  • Our sample (or other data set) may actually have less independent information in it than the \(n\) items in the sample. For instance, if you have three items, a, b, and c, but c = a + b, then you really only have two independent pieces of information, since c can be deduced just from a and b.

  • Similarly, the sample standard deviation has slightly less information in it because we do not use the population mean (which we don’t know) but the sample mean to calculate it.

  • To correct for this, the sample standard deviation is calculated by dividing by \(n-1\) rather than \(n\), which makes it a little bigger (ie, it increases our uncertainty); see the sketch after this list.

  • A lower degree of freedom generally means more uncertainty in your estimate.
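
Here is the sketch mentioned above: a minimal R illustration (with made-up data) that the sample standard deviation divides by \(n-1\) rather than \(n\).

x <- c(2, 4, 6, 8, 10)                  # any small made-up sample
n <- length(x)
sqrt(sum((x - mean(x))^2) / (n - 1))    # sd "by hand", dividing by n - 1
sd(x)                                   # R's sd() uses the same n - 1 denominator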

Visualization

Here is a picture of the T distribution. As you can see, it looks like the normal distribution, but gets “fatter” in the tails as the degree of freedom (sample size) gets smaller. That is, with a low degree of freedom, there is more of a chance of getting a value far from the mean; this also means that the 95% confidence interval will in general be wider, since you have to go farther out into the tails to encompass 95% of the population.
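
If you would like to draw a picture like this yourself, here is a minimal R sketch (the choice of df values is arbitrary) overlaying the standard normal curve and two t distributions:

# Standard normal vs. t distributions with small degrees of freedom
curve(dnorm(x), from = -4, to = 4, ylab = "density", lty = 1)
curve(dt(x, df = 2), add = TRUE, lty = 2)     # fatter tails
curve(dt(x, df = 10), add = TRUE, lty = 3)    # closer to the normal
legend("topright", c("normal", "t, df = 2", "t, df = 10"), lty = 1:3)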

T table

  • Like the z table, the t table helps us go back and forth between scores and percentiles. It has one additional parameter (the degree of freedom). If our sample size is 8, then our degree of freedom is \(n-1\) or 7. So to get the t score for a 95% CI, we look at row 7, \(t_{0.025}\), and get 2.365.

  • The percentiles are shown along the top, with the important difference that the columns show not the xth percentile, but 1 minus that – the area in the upper tail (for reasons we will get into in the next module). The degree of freedom is shown along the left, and the t score is shown in the middle.

This shows a portion of the table:

And here is a link to the full table:

www.normaltable.com/ttable.html

Calculating percentiles and scores

For example, if we have taken a small sample of the population – say, we have surveyed 8 people – our best guess about the population mean age is still the sample mean. But we are less confident in our answer, both due to the low \(n\) (which creates a large standard error), and also due to the fact that our guess of \(\bar{x}\) is now distributed not quite normally, but via a t distribution with its fatter tails. Our 95% CI is now \(\bar{x} \pm 2.365*se\).

Referring back to the t table, what would be our 90% CI with a sample mean of 3, a standard deviation of 2, and a sample size of 4?

\(3 \pm 2*1.533\)
Nope.
\(3 \pm 1*1.533\)
Nope.
\(3 \pm 2*1.638\)
Nope.
\(3 \pm 1*2.353\)
The standard error is \(2 / \sqrt{4}\), or 1. The degree of freedom is \(n-1\) or 3. And the 90% CI means that 0.05 is in each tail, so we want \(t_{0.050}\), which is 2.353.
\(3 \pm 1*2.132\)
Nope.

Make sure you know why each of the wrong answers was wrong. Think about if you had to grade this mini-exam: what error would a student be making for each of the wrong answers?

Calculating percentiles and scores in R

Of course, we can also use R instead of the table to go back and forth between percentiles and scores (and thus to confidence intervals), using the built-in functions pt and qt.

If we want the percentile for a score of 2.365 with a degree of freedom of 7, we write:

pt(2.365,7)
[1] 0.9750138

To get the score for the 97.5th percentile with a degree of freedom of 7, we write:

qt(.975,7)
[1] 2.364624
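
We can also use qt to check the 90% CI quiz answer from earlier (sample mean 3, standard deviation 2, \(n\) = 4):

xbar <- 3; s <- 2; n <- 4
se <- s / sqrt(n)                 # 2 / sqrt(4) = 1
t_crit <- qt(0.95, df = n - 1)    # 0.05 in each tail, df = 3; about 2.353
xbar + c(-1, 1) * t_crit * se     # roughly 0.65 to 5.35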

Exercise

Let’s go back to our violent crime rate example. Let’s say we have the violent crime rates of all counties in Massachusetts (14 counties). Let’s say the average number of violent crimes committed in a year is 3.9 per 1,000. And let’s say the standard deviation is 1.1. Using this information, we can construct a 95% confidence interval for the true average number of violent crimes committed in Massachusetts.

A couple of steps to take:

  • Pick the appropriate distribution (z-distribution or t-distribution)

  • Determine df (degrees of freedom): df = n - 1

  • Use these pieces of information and find the associated t-value with an alpha-level of 0.05. Find the table here: http://www.statisticshowto.com/tables/z-table/ or www.normaltable.com/ttable.html. Or use the qt function in R.

  • Now calculate the upper and lower boundaries of the 95% confidence interval (a sketch for checking your work in R follows below).
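
Once you have worked through these steps by hand, here is a minimal R sketch you can use to check your answer (it assumes the t distribution is the appropriate choice here):

xbar <- 3.9; s <- 1.1; n <- 14
se <- s / sqrt(n)                  # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # alpha = 0.05, so 0.025 in each tail, df = 13
xbar + c(-1, 1) * t_crit * se      # lower and upper bounds of the 95% CI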

Finite population correction

Just to be aware of:

There is also a correction we need to make when \(n\) is very large relative to \(N\) – that is, when our sample is close to the entire population. Estimating our standard error for \(\bar{x}\) as \(s/\sqrt{n}\) will be wrong, since as \(n\) approaches \(N\), our standard error should drop to 0 – when we’ve sampled everyone, our estimate \(\bar{x}\) will be exactly the true mean!

This is easily corrected using the finite population correction, where we multiply the standard error by:

\[\sqrt{\frac{N-n}{N-1}}\]
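
Just to illustrate (a minimal sketch with made-up numbers): as \(n\) approaches \(N\), the corrected standard error shrinks toward 0.

N <- 1000; s <- 2
for (n in c(10, 100, 500, 990)) {
  se  <- s / sqrt(n)
  fpc <- sqrt((N - n) / (N - 1))
  cat("n =", n, " corrected se =", se * fpc, "\n")
}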

No need to worry too much about this now as we will not use it in our assignments.