Statistical Analysis 4.3: Simulating sampling in R

A simulation

Overview

This lesson shows how to construct a simple script to verify the use of confidence intervals.

Objectives

After completing this module, students should be able to:

Distinguish a non-normally distributed population from a normally distributed sample statistic.
Write a script to sample from a population.
Write a script to repeat that sampling, and evaluate the accuracy of the confidence intervals.
Explain the statistical meaning of the confidence interval.

Readings

Schumacker, Ch 4-6.

Simulating sampling

We can further explore the usefulness of the Central Limit Theorem by simulating sampling:

We construct a population that we pretend (as survey scientists) not to know the truth about.
Then we draw a sample from that population, calculate its mean and standard deviation, and use that to make inferences about the (pretend unknown) population.
For instance, if we claim that based on the 95% CI derived from our sample, we are 95% sure the true mean \(\mu\) is between A and B, then if we repeat the whole survey 100 times, do we find that the true \(\mu\) is actually between A and B in about 95 of those runs? Hopefully so!

Show me some guidance on scripting here

The population

We’ll construct and examine each of the pieces of the script in turn before we run anything.

Let’s imagine we have a beetle, say, where the males and females are of differnt sizes, but we’re interested in the overall average size of the beetle.

Equal numbers of males and females
Males are 10mm (sd = 2mm).
Females are 20mm (sd = 5mm)

What does the overall population look like?

Sampling from the population

Let’s take a big sample from our population and plot the histogram.

Building the inner loop:

# 1. Set our sample size
nsamples <- 10000
# 2. Create an empty vector to store our samples
sampler <- rep(NA,nsamples)
# 3. Our sampling loop
for(i in 1:nsamples){
  # 4. At random we get either a male or female beetle
  #    If it's male, we draw from the male distribution
  if(runif(1) < 0.5){
    sampler[i] <- rnorm(1,10,2)
  }
  #   If it's female, we draw from the female distribution
  else{
    sampler[i] <- rnorm(1,20,5)
  }
}
# 5. Check our results
head(sampler)

[1] 10.305461 10.883454  9.473873 12.694067 21.792033 17.134588

Visualizing the sample

Now if we want to plot it to take a look:

library(ggplot2)

Warning: package 'ggplot2' was built under R version 3.5.2

library(ggthemes)
ggplot(data=data.frame(sampler),aes(x=sampler)) + geom_histogram() + theme_wsj()

(Note that we have to make our sampler vector into a data frame to make ggplot happy.)

Calculating the sample mean and sd

As we can see, the beetle population is very non-normal due to the differences between males and females. But we can figure out the mean size and the overall standard deviation easily enough.

If there are equal numbers of males and females, then the mean of the total population must be the mean of 10 and 20, or 15.
The standard deviation we could also calculate directly, but that’s a bit more work. Instead, we can just do it using our 10,000-beetle sample:

mean(sampler)

[1] 15.08535

sd(sampler)

[1] 6.319222

Great! Now let’s do that whole thing 100 times!

Repeat the sampling 100 times

Building the outer loop:

Specify how many times to run the whole thing, and where to store our summary statistics (\(\bar{x}\) and the 95% CI intervals) for each run.

Here’s the other loop, which we won’t run yet:

# 1. Set how many times we do the whole thing
nruns <- 100
# 2. Set how many samples to take in each run
nsamples <- 1000
# 3. Create an empty matrix to hold our summary data: the mean and the upper and lower CI bounds.
sample_summary <- matrix(NA,nruns,3)
# 4. Run the loop
for(j in 1:nruns){
  
  # 5. Insert the previous code that actually does the sampling in here.
  
  # 6. Finally, calculate the mean and 95% CI's for each sample 
  #    and save it in the correct row of our sample_summary matrix
  sample_summary[j,1] <- mean(sampler)  # mean
  standard_error <- sd(sampler)/sqrt(nsamples) # standard error
  sample_summary[j,2] <- mean(sampler) - 1.96*standard_error # lower 95% CI bound
  sample_summary[j,3] <- mean(sampler) + 1.96*standard_error # lower 95% CI bound
}

The whole simulation

Putting everything together:

# 1. Set how many times we do the whole thing
nruns <- 100
# 2. Set how many samples to take in each run (1000 rather than the previous 10,000)
nsamples <- 1000
# 3. Create an empty matrix to hold our summary data: the mean and the upper and lower CI bounds.
sample_summary <- matrix(NA,nruns,3)
# 4. Run the loop
for(j in 1:nruns){
  sampler <- rep(NA,nsamples)
  # 5. Our sampling loop
  for(i in 1:nsamples){
    # 6. At random we get either a male or female beetle
    #    If it's male, we draw from the male distribution
    if(runif(1) < 0.5){
      sampler[i] <- rnorm(1,10,2)
    }
    #   If it's female, we draw from the female distribution
    else{
      sampler[i] <- rnorm(1,20,5)
    }
  } 
  # 7. Finally, calculate the mean and 95% CI's for each sample 
  #    and save it in the correct row of our sample_summary matrix
  sample_summary[j,1] <- mean(sampler)  # mean
  standard_error <- sd(sampler)/sqrt(nsamples) # standard error
  sample_summary[j,2] <- mean(sampler) - 1.96*standard_error # lower 95% CI bound
  sample_summary[j,3] <- mean(sampler) + 1.96*standard_error # lower 95% CI bound
}

Are our 95% CI’s correct 95% of the time?

Now we want to test it. We know the true population mean is 15. What proportion of the 95% CI’s across all of our runs actually include the true mean in their range?

counter = 0
for(j in 1:nruns){
  # If 15 is above the lower CI bound and below the upper CI bound:
  if(15 > sample_summary[j,2] && 15 < sample_summary[j,3]){
    counter <- counter + 1
  }
}
counter

[1] 97

Not bad! Right?

Right?