Statistics and Tests

Overview

This lesson shows how to calculate Z-scores and probabilities with the normal distribution.

Objectives

After completing this module, students should be able to:

  1. Use Z-scores to calculate probabilities from the CDF of a normal distribution.
  2. Calculate Z-scores from percentile values.
  3. Use and apply the Z-table.

Normal populations

The normal distribution (or Gaussian distribution or Bell curve) is the most important probability distribution to know about because it fits many different phenomena.

  • Heights

  • Weights

  • Blood pressure

  • Pizza delivery times

  • Logarithm of income per capita

The list goes on and on. It is the most common distribution used to represent random phenomena that are not skewed around their mean.

Features of the normal distribution

  • The normal distribution has two key parameters: the mean (\(\mu\)) and the standard deviation (\(\sigma\)).

    • The mean is a measure of centrality: it is the expected or typical value of the distribution.

    • The standard deviation is a measure of dispersion: it indicates how far the typical observation lies from the mean.

For example, suppose the delivery time of a pizza, measured in minutes, can be described with a normal distribution (\(\mu = 30, \sigma = 5\)). This means that the expected delivery time is 30 minutes, but the typical observation is 5 minutes away from the mean. It should then be unusual to receive a pizza outside the interval (25, 35). Not impossible, but not very likely.

  • The normal distribution is symmetric, which means that observing \(\mu + \delta\) is exactly as likely as observing \(\mu - \delta\). Using the pizza delivery time example, it is equally likely for your pizza to arrive at 27 minutes as at 33 minutes (both are 3 minutes from the mean).

  • Other distributions can also exhibit symmetry. What makes the normal distribution special is the relation between the standard deviation (\(\sigma\)) and the mass of the distribution. The standard deviation helps determine the mass of the distribution within a certain distance from the mean:

    • About 68 percent of the observations from a normal distribution fall within 1 standard deviation of the mean: \((\mu - \sigma, \mu + \sigma)\).
    • About 95 percent fall within 2 standard deviations: \((\mu - 2\sigma, \mu + 2\sigma)\).
    • And about 99.7 percent fall within 3 standard deviations: \((\mu - 3\sigma, \mu + 3\sigma)\). We can verify these proportions in R, as shown below.
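The check below uses the pnorm function (introduced in detail later in this lesson), which returns the CDF of a normal distribution:

# Mass within 1, 2, and 3 standard deviations
# of a standard normal distribution:
pnorm(1) - pnorm(-1)   # ~0.6827
pnorm(2) - pnorm(-2)   # ~0.9545
pnorm(3) - pnorm(-3)   # ~0.9973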


Comparison to the Uniform Distribution

  • In the previous module we used the uniform distribution to describe the probability of getting a specific number on a 6-sided die (1/6).
  • For the uniform distribution, the probability of every outcome is the same (constant). The normal distribution instead assigns the highest density to a single value (the mean); as values move away from the mean, the density decays gradually until it is almost zero (see the numerical check after the plots below).
  • The uniform distribution is rare to find in nature. We don’t observe individuals that are 7 feet tall with the same regularity as individuals around 5’7’’. In fact, the farther values are from the typical one, the less likely we are to observe them, and this is exactly the pattern the normal distribution represents.
  • This, of course, does not mean that every phenomenon should be modeled with the normal distribution, or that we should forget about the uniform distribution.
  • The following graphs emphasize this difference even more:
library(ggplot2)
library(ggthemes)

# Density of a uniform distribution on (-1, 1): 1/2 inside the support, 0 outside
uniformfun <- function(x) { ifelse(x >= -1 & x <= 1, 0.5, 0) }
# Density of a normal distribution with mean 0 and standard deviation 0.5
normfun <- function(x) { dnorm(x, 0, 0.5) }

ggplot(data = data.frame(x = c(-2, 2)), aes(x)) + ylab("density") + xlab("outcome") +
  stat_function(fun = normfun, color = "red") +
  stat_function(fun = uniformfun, color = "blue") + theme_wsj()
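
The decay described above can also be checked numerically: the normal density is highest at the mean and approaches zero away from it, while the uniform density is constant on its support.

# Normal density (mean 0, sd 0.5) at increasing distances from the mean
dnorm(c(0, 0.5, 1, 2), mean = 0, sd = 0.5)  # ~0.80, 0.48, 0.11, 0.0003
# The uniform density is 0.5 everywhere on (-1, 1) and 0 outside
uniformfun(c(0, 0.5, 1, 2))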

PDF and CDF

We can represent a probability distribution in R, as illustrated in the previous slide, using stat_function with ggplot. When describing a continuous variable we often represent its probabilities using two related functions:

  • pdf, or probability density function. This function gives the density of the distribution at each value; the area under the pdf over an interval gives the probability mass in that interval. That is, given a lower limit \(x_{\ell}\) and an upper limit \(x_{u}\), the area under the pdf between them equals \(Pr[x_{\ell} \leq x \leq x_{u}]\). For instance, using the pizza delivery time example from before, the probability that the pizza arrives between 25 and 35 minutes can be expressed as \(Pr[25 \leq x \leq 35]\) (computed in R just below). Another way of representing the same idea is the notation \(Pr[x \in (x_\ell, x_u)]\). The total area under the pdf is equal to one.

  • cdf, or cumulative distribution function. The CDF is obtained by accumulating the area under the pdf from minus infinity up to a certain value \(x_{u}\), so it returns \(Pr[x \leq x_{u}]\). The value of the CDF converges to 1 as \(x\) increases.
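
For instance, returning to the pizza example, \(Pr[25 \leq x \leq 35]\) is the difference between two values of the CDF, which we can compute with the pnorm function (introduced in detail later in this lesson):

# Pr[25 <= x <= 35] for delivery times that follow
# a normal distribution with mu = 30 and sigma = 5:
pnorm(35, mean = 30, sd = 5) - pnorm(25, mean = 30, sd = 5)
## [1] 0.6826895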

Below you can see the pdf and cdf of a normal distribution with a mean of 75 and a standard deviation of 7:

normfun <- function(x){dnorm(x,75,7)}
ggplot(data=data.frame(x=c(50, 100)),aes(x)) + ylab("density") + xlab("outcome") +
  stat_function(fun=normfun) + theme_economist() + ggtitle("Probability Density Function")

cumlfun <- function(x){pnorm(x,75,7)}
ggplot(data=data.frame(x=c(50, 100)),aes(x)) + ylab("cumulative distribution") + xlab("outcome") + 
  stat_function(fun=cumlfun) + theme_economist() + ggtitle("Cumulative Distribution Function")

Standard normal distributions

Oftentimes we’ll need to standardize a random variable. By subtracting the mean and dividing by the standard deviation, a normally distributed variable will follow a normal distribution with mean 0 and standard deviation equal to one. This allows comparisons across normal distributions with different parameters. For example, we may want to use a standardization like this to compare students’ grades from two different classes, since the course mean and standard deviation are expected to differ each semester.

The Z-score of a variable is the value of the variable after subtracting the mean and dividing by the standard deviation:

Z-Score

\[ Z = \frac{x - \mu}{\sigma} \]

The values of \(Z\) will be negative when the variable \(x\) is below the mean (\(x - \mu < 0\)) and positive when above the mean (\(x - \mu > 0\)). After standardization, we say that a variable follows a standard normal distribution.
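
As a hypothetical illustration (all numbers made up), suppose we want to compare two students from classes with different grade distributions:

# Class A: mean 70, sd 10; a student scored 80
# Class B: mean 85, sd 5;  a student scored 88
zA <- (80 - 70) / 10   # 1.0
zB <- (88 - 85) / 5    # 0.6
c(zA, zB)
# The Class A student is further above their own class mean,
# even though the Class B student has the higher raw grade.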

Probabilities of the Z-scores

We can calculate the CDF of the standard normal distribution (the probabilities of Z-scores) by computing the area under the pdf of a normal distribution with parameters \(\mu = 0\) and \(\sigma = 1\). Because this distribution is the same for any variable that we standardize, tables like the one below allow you to look up the values of the CDF.

A z-table tabulates the CDF of the standard normal distribution. (For a complete z-table and some really helpful info, see: http://www.statisticshowto.com/tables/z-table/)

To get the CDF value, look up the z-score to one decimal place along the left margin of the table, and its second decimal place along the top.

For instance, imagine that a student got a 75 on a test. The course average was 70 and the standard deviation was 10. The Z-score for the student’s grade is:

\[\begin{eqnarray*} \text{Z-score} & = & \dfrac{x - \mu}{\sigma} \\ & = & \dfrac{75 - 70}{10} \\ & = & 0.5 \end{eqnarray*}\]

Using the Z-table we can figure out that the probability of getting a 75 or less was 0.6915 (the row for 0.5, the column for 0.00). Always keep in mind that the Z-table reports the CDF, and therefore should be interpreted as the probability of getting \(z\) or less.

Z-scores in R

Using a table to calculate the CDF was the traditional way in which statistics was taught. Yet, that method is very inefficient. We can instead use R to calculate the values of the CDF directly.

The function pnorm returns the CDF, and the standard normal distribution has a mean of zero and a standard deviation of one. We simply need to call pnorm with \(\mu = 0\) and \(\sigma = 1\) in order to obtain the CDF values.

Using the previous Z-score (0.5):

zscore <- 0.5
pnorm(zscore, 0, 1)
## [1] 0.6914625

It is even more precise than the table!

Below you can see the values of the CDF of a standard normal variable converging to 1:

zscores <- seq(-3,3,0.25)
probs   <- pnorm(zscores, 0, 1)
egData <- data.frame(z = zscores, p = probs)
ggplot(egData, aes(x = z, y = p)) + geom_point(size = 3) + geom_line() +
  theme_economist() + scale_colour_economist() +
  ylab("Probabilities") + xlab("Z-Scores")

Note how pnorm can be applied to a vector of values.

The values of the CDF are often called percentile values.

From probabilities to scores

Conversely, we can go from probabilities back to z-scores, and from there back to the original value of the variable, by inverting the z-score transformation. We can do this quickly in R using the qnorm function (the inverse of the CDF):

zscore <- 0.5
prob <- pnorm(zscore, 0, 1)
qnorm(prob,70,10)
## [1] 75
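
qnorm can also be used on the original scale directly. For instance, under the hypothetical pizza delivery distribution from before (\(\mu = 30, \sigma = 5\)), the 90th percentile of delivery times is:

# 90 percent of pizzas arrive within this many minutes:
qnorm(0.9, 30, 5)
## [1] 36.40776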

Example: Creating a Quality Index

One application of standardization is the creation of a quality index for a product. Imagine that we are tasked with assigning a number between 0 and 100 to the quality of several products. For this we are given the following data:

  • id: This is the product identifier.
  • week: This variable identifies the week.
  • retur: Number of returned units.
  • calls: Number of complaint calls related to the quality of the product.
  • defec: Number of units from production that are defective.
  • repur: Number of clients that choose to buy the product again.

As you can see, these variables measure different things; individually, any one of them would be an inadequate measure of quality, but put together they may provide some insight about the quality of the products.

We are going to proceed in the following way for each product:

  1. Standardize each variable into Z-scores.
  2. Change the sign of the variables, so that higher is always an indicator of better quality.
  3. Produce an average of the four variables. For this example I’m going to produce a simple arithmetic average, but it is probably better to use a weighted average to reflect the relative importance that each variable has on quality.
  4. Multiply the result by 100.

Simulating Data

I, of course, don’t have this data. But I will use R to simulate it. Let’s start by generating the data for 7 products, for 300 weeks.

n <- 7  # Number of Products
w <- 300 # Number of Weeks

# This will create the ID variable
# It will add the product number (1 to n)
# for each of the w weeks.
# For example, for product 1 it will set 
# to 1 the value of ID for the first w observations
# then it will set it to 2 for the next w observations...etc., etc.
id <- rep(1:n, each = w) 

# This will generate the variable for weeks
week <- rep(1:w, n)
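
As a quick sanity check, each product id should appear exactly w times:

# Counts the observations per product; each should equal w = 300
table(id)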

Now let’s generate the quality indicators. To make it interesting, I will use a combination of different distributions and add some biases. I will generate a fake quality variable for each product initially and use that as a benchmark to see if the methodology roughly reflects its values.

quality <- rnorm(n, 0, 1)
quality
## [1] -1.80510468  0.02106668 -1.72820427  0.45826932  0.50226943  2.00867156
## [7]  2.71869727
# This is the loop that generates the quality indicators.
# 
# I'm going to initialize the vectors that I need 
# first:
retur <- rep(NA, n*w)
calls <- rep(NA, n*w)
defec <- rep(NA, n*w)
repur <- rep(NA, n*w)

for(i in 1:n) {
  
  # Average of each quality indicator as a function
  # of quality
  mu_retur <- 100 - 4.5*quality[i]
  mu_calls <- 5*mu_retur - 1.96*quality[i]
  mu_defec <- 10 - 3.14*quality[i]
  mu_repur <- 21 + 9.9*quality[i]
  
  # For the standard deviation, I'll use a fifth of
  # the average when needed.
  sigma_retur <- 0.2*mu_retur

  # GENERATING RANDOM DATA
  
  # For the returns I'll use a normal distribution
  # rounded to the nearest integer
  # 
  # The index for the data is (1 + w*(i-1)):(w*i)
  # This will add the data on the corresponding 
  # vector on chunks of 300.
  retur[(1 + w*(i-1)):(w*i)] <- rnorm(w, mu_retur,sigma_retur)
  retur <- round(retur, 0)
  
  # For calls I'll use the sum of three uniform distributions
  calls[(1 + w*(i-1)):(w*i)] <- runif(w, min = mu_calls*0,
                                         max = mu_calls*0.25) +
                                runif(w, min = mu_calls*0.25,
                                         max = mu_calls*0.75) +
                                runif(w, min = mu_calls*0.75,
                                         max = mu_calls)
  calls <- round(calls, 0)
  
  # For the defective units I'll use a Poisson distribution.
  defec[(1 + w*(i-1)):(w*i)] <- rpois(w, mu_defec) 
  
  # Finally, for the repurchases I'll use a combination of Poisson and uniform distributions
  repur[(1 + w*(i-1)):(w*i)] <- 0.5*rpois(w, mu_repur) + 0.5*runif(w, min = mu_repur*0.5, max = mu_repur*1.5)
}
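
As a quick check after the loop, every chunk of the initialized vectors should now be filled, so no NA placeholders should remain:

# TRUE would indicate leftover NA placeholders
anyNA(c(retur, calls, defec, repur))
## [1] FALSE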

Creating Summary Statistics

As you can see, for some of our variables the data generation was fairly complex, so the distribution of the data should not necessarily resemble a normal distribution. One thing we can do is judge the symmetry of each distribution. We can do so by producing summary statistics: the mean, standard deviation, minimum, maximum, median, and the 25th and 75th percentiles. A fairly normal distribution should have an approximately equal mean and median, and the distance from the median to the 25th and 75th percentiles should be roughly the same.

But first, let’s gather all of our data in a single data.frame:

qualityData <- data.frame(id,week,retur, calls, defec, repur)

Now let’s write a function to produce the statistics that we are interested in.

# Let's write a function for the summary stats:

summaryStats <- function(x) 
{
  out <- c(min(x),
           quantile(x, .25),
           mean(x),
           median(x),
           quantile(x, .75),
           max(x),
           sd(x))
  
  return(out)
}

Finally, let’s apply the function to each of the variables in the data.frame:

stats <- apply(qualityData, 2, summaryStats)
rownames(stats) <- c("Min",
                     "25th Perc.",
                     "Mean",
                     "Median",
                     "75th Perc.",
                     "Max",
                     "Std. Dev.")
stats
##                  id      week     retur     calls     defec      repur
## Min        1.000000   1.00000  32.00000  455.0000  0.000000  0.9862723
## 25th Perc. 2.000000  75.75000  83.00000  663.0000  4.000000  4.7964906
## Mean       4.000000 150.50000  98.03714  737.5957  9.211429 23.8915012
## Median     4.000000 150.50000  97.00000  732.0000  9.000000 24.0507041
## 75th Perc. 6.000000 225.25000 112.00000  807.0000 13.000000 35.0050658
## Max        7.000000 300.00000 189.00000 1053.0000 30.000000 68.7649778
## Std. Dev.  2.000476  86.62269  21.28843  102.5337  5.850045 16.1410672

For the quality indicators we see that the median is almost the same as the mean. Yet, we can tell that defec and repur are probably not symmetric, because the distances from the median to the 25th and 75th percentiles are not that similar.

Additionally, we can produce histograms for each variable to get a feeling for the symmetry of each distribution and compare it with a normal distribution with the same mean and standard deviation.

# For Returns
ggplot(data = qualityData, aes(x = retur)) + geom_histogram(aes(y = ..density..), binwidth = 5) +
  stat_function(fun = dnorm, args = list(mean = mean(qualityData$retur), sd = sd(qualityData$retur)), color = "red", size = 2) +
  theme_economist() + ggtitle("Distribution of Returns")

# For Calls
ggplot(data = qualityData, aes(x = calls)) + geom_histogram(aes(y = ..density..), binwidth = 20) +
  stat_function(fun = dnorm, args = list(mean = mean(qualityData$calls), sd = sd(qualityData$calls)), color = "red", size = 2) +
  theme_economist() + ggtitle("Distribution of Calls")

# For Defective
ggplot(data = qualityData, aes(x = defec)) + geom_histogram(aes(y = ..density..), binwidth = 2) +
  stat_function(fun = dnorm, args = list(mean = mean(qualityData$defec), sd = sd(qualityData$defec)), color = "red", size = 2) +
  theme_economist() + ggtitle("Distribution of Defective")

# For Repurchases
ggplot(data = qualityData, aes(x = repur)) + geom_histogram(aes(y = ..density..), binwidth = 5) +
  stat_function(fun = dnorm, args = list(mean = mean(qualityData$repur), sd = sd(qualityData$repur)), color = "red", size = 2) +
  theme_economist() + ggtitle("Distribution of Repurchases")

Step 1: Standardize Variables

To standardize a variable we just need to use the formula:

\[ \text{Z-score} = \dfrac{x-\mu}{\sigma} \]

# A quick function to standardize a variable:

zscore_transformation <- function(x) {
  return ( (x-mean(x))/sd(x) )
}

standardData <- apply(qualityData[3:6], 2, zscore_transformation)

qualityDataZvals <- data.frame(id, week, standardData)
names(qualityDataZvals) <- c("id",
                             "weeks",
                             "retur_zval",
                             "calls_zval",
                             "defec_zval",
                             "repur_zval")

Let’s take a look at the summary stats again after standardization:

stats <- apply(qualityDataZvals, 2, summaryStats)
rownames(stats) <- c("Min",
                     "25th Perc.",
                     "Mean",
                     "Median",
                     "75th Perc.",
                     "Max",
                     "Std. Dev.")
round(stats,4)
##                id    weeks retur_zval calls_zval defec_zval repur_zval
## Min        1.0000   1.0000    -3.1020    -2.7561    -1.5746    -1.4191
## 25th Perc. 2.0000  75.7500    -0.7064    -0.7275    -0.8908    -1.1830
## Mean       4.0000 150.5000     0.0000     0.0000     0.0000     0.0000
## Median     4.0000 150.5000    -0.0487    -0.0546    -0.0361     0.0099
## 75th Perc. 6.0000 225.2500     0.6559     0.6769     0.6476     0.6885
## Max        7.0000 300.0000     4.2729     3.0761     3.5536     2.7801
## Std. Dev.  2.0005  86.6227     1.0000     1.0000     1.0000     1.0000

As you can see, all the variables now have mean zero and standard deviation equal to one. That does not mean that they are normally distributed; the asymmetry issues for defec and repur persist after the transformation:

# For Returns
ggplot(data = qualityDataZvals, aes(x = retur_zval)) + geom_histogram(aes(y = ..density..), binwidth = 0.25) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 2) +
  theme_economist() + ggtitle("Distribution of Returns (Z-Scores)")

# For Calls
ggplot(data = qualityDataZvals, aes(x = calls_zval)) + geom_histogram(aes(y = ..density..), binwidth = 0.25) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 2) +
  theme_economist() + ggtitle("Distribution of Calls (Z-Scores)")

# For Defective
ggplot(data = qualityDataZvals, aes(x = defec_zval)) + geom_histogram(aes(y = ..density..), binwidth = 0.25) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 2) +
  theme_economist() + ggtitle("Distribution of Defective (Z-Scores)")

# For Repurchases
ggplot(data = qualityDataZvals, aes(x = repur_zval)) + geom_histogram(aes(y = ..density..), binwidth = 0.25) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 2) +
  theme_economist() + ggtitle("Distribution of Repurchases (Z-Scores)")

Step 2: Change the signs of the variables so that higher values mean better quality

This part is fairly easy. We know that if a product is of good quality we should not receive more returns than usual. In other words, more returns means lower quality; thus, we must change the sign of returns from positive to negative. We can apply the same reasoning to the number of complaint calls about quality and the number of defective units. The only indicator that should keep its original sign is the number of repurchases. This is as simple as multiplying the first three quality indicators by minus one.

qualityDataZvals[,3:5] <- -qualityDataZvals[,3:5]
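
Flipping signs leaves the means at zero and the standard deviations at one, which we can confirm quickly:

# Means should still be (numerically) zero and sds equal to one
round(colMeans(qualityDataZvals[, 3:6]), 10)
apply(qualityDataZvals[, 3:6], 2, sd)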

Step 3 and 4: Producing the average of quality indicators and final index

Now we just need to calculate the average of the four quality indicators by product, convert it to a probability using the standard normal distribution (the Z-table, or pnorm in R), and then multiply it by 100.

qualityZscore <- rep(NA, n) # Creating an empty vector to store the average

for(i in 1:n) {
  
  averagePerWeek <- rowMeans(qualityDataZvals[id == i,3:6])
  qualityZscore[i] <- mean(averagePerWeek)
  
}
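
The same per-product averages can be obtained without an explicit loop, for example with tapply:

# Group the weekly averages by product id and average each group
qualityZscore2 <- tapply(rowMeans(qualityDataZvals[, 3:6]), id, mean)
all.equal(as.numeric(qualityZscore2), qualityZscore)
## [1] TRUE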

We can quickly check if our indicator is correctly ranking the goods according to quality:

ggplot(data = data.frame(quality, qualityZscore), aes(x = qualityZscore, y = quality)) +
  geom_point(size = 3) + geom_smooth(method = "lm", color = "red", se = F) +
  theme_economist() + scale_color_economist()
## `geom_smooth()` using formula 'y ~ x'

This looks great! Now we just need to transform the Z-scores into probabilities and multiply them by 100 so that the index is easier for non-experts to understand.

qualityProb <- pnorm(qualityZscore, mean = 0, sd = 1)*100
ggplot(data = data.frame(id = as.factor(1:n), index = qualityProb),
       aes(x = reorder(id, qualityProb), y = qualityProb, fill = id)) +
  geom_bar(stat = "identity") + coord_flip() + xlab("Product ID") + ylab("Quality Index") +
  geom_text(aes(label = round(qualityProb)), hjust = 1.6, color = "white", size = 5) +
  theme_economist() + scale_fill_economist() + theme(legend.position = "none")

Summary

  • In this lesson we discussed the main features of the normal distribution and the standard normal distribution.
  • The normal distribution is symmetric.
  • It is characterized by the mean and standard deviation.
  • It has ~68% of its mass within 1 standard deviation of the mean, ~95% within 2 standard deviations, and ~99.7% within 3 standard deviations.
  • We can standardize a variable by subtracting the mean and dividing by the standard deviation. The resulting variable is called a Z-score and has mean zero and standard deviation equal to one.
  • Using the Z-table (or the pnorm function in R) we can get the probabilities of the normal distribution’s CDF. These are also called percentile values. They represent the probability of observing a value equal to the Z-score or less.
  • Finally, we used the concept of standardization to create a quality index. Even using data from non-normal distributions, the quality index based on Z-scores was able to reflect the underlying quality variable of each product.
  • Later in the semester we’ll use machine learning to make a better quality index.