This lesson shows how to calculate Z-scores and probabilities with the normal distribution.
After completing this module, students should be able to:
The normal distribution (or Gaussian distribution or Bell curve) is the most important probability distribution to know about because it fits many different phenomena.
Heights
Weights
Blood pressure
Pizza Delivery Times
Logarithm of Income per Capita
…
The list goes on and on. It is the most common distribution used to represent random phenomena that are not biased around its mean.
Features of the Normal Distribution:
The normal distribution has two key parameters: the mean (\(\mu\)) and the standard deviation (\(\sigma\)).
The mean is a measure of centrality. The mean is the expected value or typical value of the distribution.
The standard deviation is a measure of dispersion. It indicates how far from the mean is the typical observation.
Then, if we say that the delivery time of a pizza can be described with a normal distribution (\(\mu = 30, \sigma = 5\)), measured in minutes. This means that the expected time for a pizza to be delivery is 30 minutes, but the typical observation is 5 minutes away from the mean. Then, it should be unusual to receive a pizza outside the interval (25, 35). Not impossible, but not very likely.
The normal distribution is symmetric, which means that it is equally likely to observe \(\mu + \delta\) than \(\mu - \delta\). Using the pizza delivery time example, it is equally likely to get your pizza after 27 minutes than to get it at 33 minutes.
Other distributions can also exhibit symmetry. What makes the normal distribution special is the relation between the standard deviation (\(\sigma\)) and the mass of the distribution. The standard deviation helps determine the mass of the distribution within a certain distance from the mean:
If you need a reminder on how to calculate the average (an estimation of the mean of a distribution) or the standard deviation click below.
(1) Average (2) Standard Deviationlibrary(ggplot2)
library(ggthemes)
uniformfun <- function(x){ifelse(x>=-1&x<=1,1,0)}
normfun <- function(x){dnorm(x,0,0.5)}
ggplot(data=data.frame(x=c(-2, 2)),aes(x)) + ylab("density") + xlab("outcome") +
stat_function(fun=normfun, color = "red") +
stat_function(fun=uniformfun, color = "blue") + theme_wsj()
We can represent a probability distribution in R, as illustrated in the previous slide, using stat_function
with ggplot
. When describing a continuous variable we often represent its probabilities using two related functions:
pdf or probability density function. This function takes an interval as input and returns the mass of the probability distribution. That is, given a lower limit \(x_{\ell}\) and an upper limit \(x_{u}\) the pdf returns \(Pr[x_{\ell} \leq x \leq x_{u}]\). For instance, using the example of the pizza delivery time from before. The probability that the pizza arrives between 25 to 35 minutes can be expressed as \(Pr[25 \leq x \leq 35]\). Another way of representing the same idea is using the notation \(Pr[x \in (x_\ell, x_u)]\). The total area of the pdf is equal to one.
cdf or cumulative density function. The CDF function is calculated by adding the values of the PDF from minus infinity up to a certain value \(x_{u}\). The value of the CDF converges to 1 as \(x\) increase.
Below you can see the pdf and cdf of a normal distribution with a mean of 75 and a standard deviation of 7:
normfun <- function(x){dnorm(x,75,7)}
ggplot(data=data.frame(x=c(50, 100)),aes(x)) + ylab("density") + xlab("outcome") +
stat_function(fun=normfun) + theme_economist() + ggtitle("Probability Density Function")
cumlfun <- function(x){pnorm(x,75,7)}
ggplot(data=data.frame(x=c(50, 100)),aes(x)) + ylab("cumulative distribution") + xlab("outcome") +
stat_function(fun=cumlfun) + theme_economist() + ggtitle("Cumulative Density Function")
Often times we’ll need to standardize a random variable. By subtracting the mean and dividing by the standard deviation, a normally distributed variable will follow a normal distribution with mean 0 and standard deviation equal to one. This allows the comparison across normal distributions with different parameters. For example, we may want use a standardization like this if we want to compare students’ grades from two different classes as it is expected for the course mean and standard deviation to be different each semester.
The Z-score of a variable, is the value of the variable after subtracting the mean and dividing by the standard deviation:
Z-Score
\[ Z = \frac{x - \mu}{\sigma} \]
The values of \(Z\) will be negative when the variable \(x\) is below the mean (\(x - \mu < 0\)) and positive when above the mean (\(x - \mu > 0\)). After standardization, we say that a variable follows a standard normal distribution.
We can calculate the CDF of the standard normal distribution (Z-scores) by calculating the area under the CDF for a normal distribution with parameters \(\mu = 0\) and \(\sigma = 1\). Because this distribution is the same for any variable that we standardize, there are tables like the one below that allows you to calculate the values of the CDF.
Here’s an example of part of a z-table:(See for a complete ztable and some really helpful info: http://www.statisticshowto.com/tables/z-table/)
To get the CDF, look up the first digit of the z-score on the left, and the second along the top.
For instance, imagine that a student got an 75 on a test. The course average was 70 and the standard deviation 10. The Z-score for the student’s grade is:
\[\begin{eqnarray*} Z-Score & = & \dfrac{x - \mu}{\sigma} \\ & = & \dfrac{75 - 70}{10} \\ & = & 0.5 \end{eqnarray*}\]Using the Z-table we can figure out that the probability of getting a 75 or less was a 0.6915 (row 6, column 1). Always keep in mind that the Z-table reports the CDF, and therefore should be interpret as the probability of getting \(z\) or less.
Using a table to calculate the CDF was the traditional way in which statistics was taught. Yet, that method is very inefficient. We can instead use R to calculate the values of the CDF directly.
Because the function pnorm
returns the CDF and the standard normal distribution has mean of zero and standard deviation of one. We simply need to use pnorm
using \(\mu = 0\) and \(\sigma = 1\) in order to obtain the CDF values.
Using the previous Z-score (0.5):
zscore <- 0.5
pnorm(zscore, 0, 1)
## [1] 0.6914625
It is even more precise than the table!
Below you can see the values of the CDF of a standard normal variable converging to 1:
zscores <- seq(-3,3,0.25)
probs <- pnorm(zscores, 0, 1)
egData <- data.frame(z = zscores, p = probs)
ggplot(egData, aes(x = zscores, y = p)) + geom_point(size = 3) + geom_line() + theme_economist() + scale_colour_economist() + ylab("Probabilities") + xlab("Z-Scores")
Note how pnorm
can be applied to a vector of values.
The values of the CDF are often called percentile values.
Conversely, we can go from probabilities to z-scores and then back to the original value of the variable by applying the inverse of the z-score. We can do this quickly in R using the qnorm
function:
zscore <- 0.5
prob <- pnorm(zscore, 0, 1)
qnorm(prob,70,10)
[1] 75
One application of standardization is the creation of a quality index for a product. Imagine that we are tasked with assigning a number between 0 and 100 to the quality of several products. For this we are given this data:
As you can see this variables measure different things, individually any of them would be an inadequate measure of quality. But put together they may provide some insight about the quality of the products.
We are going to proceed in the following way for each product:
I, of course, don’t have this data. But I will use R to simulate it. Let’s start by generating the data for 7 products, for 300 weeks.
n <- 7 # Number of Products
w <- 300 # Number of Weeks
# This will create the ID variable
# It will add the product number (1 to n)
# for each of the w weeks.
# For example, for product 1 it will set
# to 1 the value of ID for the first w observations
# then it will set it 2 for the next w observations...etc., etc.
id <- rep(1:n, each = w)
# This will generate the variable for weeks
week <- rep(1:w, n)
Now let’s generate the quality indicators. To make it interesting, I will use a combination of different distributions and add some biases. I will generate a fake quality variable for each product initially and use that as a benchmark to see if the methodology roughly reflects its values.
quality <- rnorm(n, 0, 1)
quality
## [1] -1.80510468 0.02106668 -1.72820427 0.45826932 0.50226943 2.00867156
## [7] 2.71869727
# This is the loop that generates the quality indicators.
#
# I'm going to initialize the vectors that I need
# first:
retur <- rep(NA, n*w)
calls <- rep(NA, n*w)
defec <- rep(NA, n*w)
repur <- rep(NA, n*w)
for(i in 1:n) {
# Average of each quality indicator as a function
# of quality
mu_retur <- 100 - 4.5*quality[i]
mu_calls <- 5*mu_retur - 1.96*quality[i]
mu_defec <- 10 - 3.14*quality[i]
mu_repur <- 21 + 9.9*quality[i]
# For the standard deviation. I'll use a fifth of
# the average when needed.
sigma_retur <- 0.2*mu_retur
# GENERATING RANDOM DATA
# For the returns I'll use a normal distribution
# rounded to the nearest integer
#
# The index for the data is (1 + w*(i-1)):(w*i)
# This will add the data on the corresponding
# vector on chunks of 300.
retur[(1 + w*(i-1)):(w*i)] <- rnorm(w, mu_retur,sigma_retur)
retur <- round(retur, 0)
# For calls I'll use the sum of three uniform distributions
calls[(1 + w*(i-1)):(w*i)] <- runif(w, min = mu_calls*0,
max = mu_calls*0.25) +
runif(w, min = mu_calls*0.25,
max = mu_calls*0.75) +
runif(w, min = mu_calls*0.75,
max = mu_calls)
calls <- round(calls, 0)
# For the defective units I'll use a Poisson distribution.
defec[(1 + w*(i-1)):(w*i)] <- rpois(w, mu_defec)
# Finally, for the repurchases I'll use combination of Poisson and Uniform distributions
repur[(1 + w*(i-1)):(w*i)] <- 0.5*rpois(w, mu_repur) + 0.5*runif(w, min = mu_repur*0.5, max = mu_repur*1.5)
}
As you can see, for some of our variables the data generation was fairly complex and thus the distribution of the data should not necessarily resembles a normal distribution. One thing we can do is to judge the symmetry of the distribution. One can do so by producing summary statistics using: the mean, standard deviation, minimum, maximum and 25th and 75th percentile. A fairly normal distribution should have an equal mean and median, and the distance from the median to the 25th and 75th percentile should be roughly the same.
But first, let’s add all of our data in a single data.frame first:
qualityData <- data.frame(id,week,retur, calls, defec, repur)
Now let’s write a function to produce the statistics that we are interested in.
# Let's write a function for the summary stats:
summaryStats <- function(x)
{
out <- c(min(x),
quantile(x, .25),
mean(x),
median(x),
quantile(x, .75),
max(x),
sd(x))
return(out)
}
Finally, let’s apply the function to each of the variables in the data.frame:
stats <- apply(qualityData, 2, summaryStats)
rownames(stats) <- c("Min",
"25th Perc.",
"Mean",
"Median",
"75th Perc.",
"Max",
"Std. Dev.")
stats
## id week retur calls defec repur
## Min 1.000000 1.00000 32.00000 455.0000 0.000000 0.9862723
## 25th Perc. 2.000000 75.75000 83.00000 663.0000 4.000000 4.7964906
## Mean 4.000000 150.50000 98.03714 737.5957 9.211429 23.8915012
## Median 4.000000 150.50000 97.00000 732.0000 9.000000 24.0507041
## 75th Perc. 6.000000 225.25000 112.00000 807.0000 13.000000 35.0050658
## Max 7.000000 300.00000 189.00000 1053.0000 30.000000 68.7649778
## Std. Dev. 2.000476 86.62269 21.28843 102.5337 5.850045 16.1410672
For the quality indicators we see that the median is almost the same as the mean. Yet, we can tell that defec
and repur
are not probably not symmetric due to the distance between the median and the 25th and 75th percentile not being that similar.
Additionally, we can produce histograms for each variable to get a feeling for the symmetry of each distribution and compare it with a normal distribution with the same mean and standard deviation.
# For Returns
ggplot(data = qualityData, aes(x=retur) ) + geom_histogram(aes(y =..density..), binwidth = 5) + stat_function(fun = dnorm, args = list(mean = mean(qualityData$retur), sd = sd(qualityData$retur)), color = "red", size = 2) + theme_economist() + ggtitle("Distribution of Returns")
# For Calls
ggplot(data = qualityData, aes(x=calls) ) + geom_histogram(aes(y =..density..), binwidth = 20) + stat_function(fun = dnorm, args = list(mean = mean(qualityData$calls), sd = sd(qualityData$calls)), color = "red", size = 2) + theme_economist() + ggtitle("Distribution of Calls")
# For Defective
ggplot(data = qualityData, aes(x=defec) ) + geom_histogram(aes(y =..density..), binwidth = 2) + stat_function(fun = dnorm, args = list(mean = mean(qualityData$defec), sd = sd(qualityData$defec)), color = "red", size = 2) + theme_economist() + ggtitle("Distribution of Defective")
# For Repurchases
ggplot(data = qualityData, aes(x=repur) ) + geom_histogram(aes(y =..density..), binwidth = 5) + stat_function(fun = dnorm, args = list(mean = mean(qualityData$repur), sd = sd(qualityData$repur)), color = "red", size = 2) + theme_economist() + ggtitle("Distribution of Repurchases")
To standardize a variable we just need to use the formula:
\[ Z-Score = \dfrac{x-\mu}{\sigma} \]
# I'll write a quick function to do this quickly.
zscore_transformation <- function(x) {
return ( (x-mean(x))/sd(x) )
}
standardData <- apply(qualityData[3:6], 2, zscore_transformation)
qualityDataZvals <- data.frame(id, week, standardData)
names(qualityDataZvals) <- c("id",
"weeks",
"retur_zval",
"calls_zval",
"defec_zval",
"repur_zval")
Let’s take a look again at the summary stats after standarization:
stats <- apply(qualityDataZvals, 2, summaryStats)
rownames(stats) <- c("Min",
"25th Perc.",
"Mean",
"Median",
"75th Perc.",
"Max",
"Std. Dev.")
round(stats,4)
## id weeks retur_zval calls_zval defec_zval repur_zval
## Min 1.0000 1.0000 -3.1020 -2.7561 -1.5746 -1.4191
## 25th Perc. 2.0000 75.7500 -0.7064 -0.7275 -0.8908 -1.1830
## Mean 4.0000 150.5000 0.0000 0.0000 0.0000 0.0000
## Median 4.0000 150.5000 -0.0487 -0.0546 -0.0361 0.0099
## 75th Perc. 6.0000 225.2500 0.6559 0.6769 0.6476 0.6885
## Max 7.0000 300.0000 4.2729 3.0761 3.5536 2.7801
## Std. Dev. 2.0005 86.6227 1.0000 1.0000 1.0000 1.0000
As you can see now all the variables have mean zero and standard deviation equal to one. That does not mean that they are normally distributed. The issues of asymmetry for defec
and repur
will persist after the transformation:
# For Returns
ggplot(data = qualityDataZvals, aes(x=retur_zval) ) + geom_histogram(aes(y =..density..), binwidth = 0.25) + stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 2) + theme_economist() + ggtitle("Distribution of Returns (Z-Scores)")
# For Calls
ggplot(data = qualityDataZvals, aes(x=calls_zval) ) + geom_histogram(aes(y =..density..), binwidth = 0.25) + stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 2) + theme_economist() + ggtitle("Distribution of Calls (Z-Scores)")
# For Defective
ggplot(data = qualityDataZvals, aes(x=defec_zval) ) + geom_histogram(aes(y =..density..), binwidth = 0.25) + stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 2) + theme_economist() + ggtitle("Distribution of Defective (Z-Scores)")
# For Repurchases
ggplot(data = qualityDataZvals, aes(x=repur_zval) ) + geom_histogram(aes(y =..density..), binwidth = 0.25) + stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 2) + theme_economist() + ggtitle("Distribution of Repurchases (Z-Scores)")
This part is fairly easy. We know that if a product is of good quality we should not receive a more than usual number of returns. As a result, more results means lower quality; thus, we must change the sign of returns from positive to negative. We can apply the same reasoning to number of complain calls about quality and defective unit. The only indicator that should remain with the same sign is the number of repurchased units. Then, this is a simply as multiplying the first three quality indicators by a minus sign.
qualityDataZvals[,3:5] <- -qualityDataZvals[,3:5]
Now we just need to calculate the average of the four quality indicators by product, then use the probabilities of the standard normal distribution (Z-table) and then multiply it by 100.
qualityZscore <- rep(NA, n) # Creating an empty vector to store the average
for(i in 1:n) {
averagePerWeek <- rowMeans(qualityDataZvals[id == i,3:6])
qualityZscore[i] <- mean(averagePerWeek)
}
We can quickly check if our indicator is correctly ranking the goods according to quality:
ggplot(data = data.frame(quality, qualityZscore), aes(x = qualityZscore, y = quality)) + geom_point(size = 3) + geom_smooth(method= "lm", color = "red", se = F) +theme_economist() + scale_color_economist()
## `geom_smooth()` using formula 'y ~ x'
This looks great! now we just need to transform the Z-Scores into probabilities and then multiply it by 100 so that it is easier for non-experts to understand.
qualityProb <- pnorm(qualityZscore, mean = 0, sd = 1)*100
ggplot(data = data.frame(id = as.factor(1:n), index = qualityProb), aes(x = reorder(id, qualityProb), y = qualityProb,fill=id)) + geom_bar(stat = "identity") + coord_flip() + xlab("Product ID") + ylab("Quality Index") + geom_text(aes(label= round(qualityProb)), hjust = 1.6, color="white", size=5) + theme_economist() + scale_fill_economist() + theme(legend.position = "none")
pnorm
function in R) we can get the probabilities of the normal distribution CDF. This are also called percentile values. They represent the probability of observing a value equal to the Z-Score or less.