

Probability Distributions

Overview

This lesson introduces probability distributions.

Objectives

After completing this lesson, students should be able to:

  1. Calculate and use discrete distributions.
  2. Sample from distributions using R functions.
  3. Calculate probability density functions and cumulative distribution functions.
  4. Understand the uniform distribution.

Discrete distributions

Distributions for random variables:

  • Random variables are usually denoted with a capital letter such as \(X\); for our die, we might write \(P(X=2) = 1/6\).

  • We can describe the probabilities of all possible values in a single function, the probability mass function (PMF): \(f_X(x) = P(X = x)\).


Sampling

The sample function in R takes as its input any vector or list, and randomly selects items from this list. For example, let’s take a die with six sides numbered 1-6 and sample two of these at random:

die1 <- c(1,2,3,4,5,6)
sample(die1, 2)
[1] 1 5

By default, sample assumes “without replacement” – i.e., once an item is sampled, it can’t be sampled again. To sample with replacement:

die1 <- c(1,2,3,4,5,6)
sample(die1, 10, replace=TRUE)
 [1] 2 2 2 3 3 5 5 5 2 1
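As a quick check of the earlier claim that \(P(X=2) = 1/6\) for a fair die, we can sample a large number of rolls and compute the proportion of 2s (a sketch; your exact proportion will vary from run to run):

rolls <- sample(die1, 100000, replace=TRUE)
mean(rolls == 2)   # should be close to 1/6, i.e., about 0.167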

Probability mass function

We can generate our probability mass function – i.e., the probability of getting each of our possible outcomes – by simply rolling our die many times and plotting the proportion for each outcome.

library(ggplot2)
sout <- sample(die1, 1000, replace=TRUE)
ggplot(data=data.frame(sout),aes(x=sout)) + 
  geom_histogram(aes(y=..count../sum(..count..))) + 
  xlim(0, 7) + ylab("density") + xlab("outcome")

(The aes() mapping inside geom_histogram tells ggplot we want proportions rather than counts on the y axis.)

How does what we have learned earlier apply to the PMF?

  1. Probability law for equally likely outcomes: \(P(\omega) = \frac{1}{N}\), where \(N\) is the number of outcomes in the sample space.

  2. \(f_X(x) = P(X = x)\)

  3. \(0 \le f_X(x) \le 1\)

  4. \(\sum_i f_X(x_i) = 1\) (see the quick check after this list)

  5. The PMF formalizes our earlier randomized experiments with coins: \(P(X = x)\) is the long-run proportion of trials in which the outcome \(x\) occurs.
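As promised, a quick empirical check of properties 3 and 4 using simulated die rolls (a sketch; the estimated proportions vary from run to run, but the two checks below hold for any sample):

sout <- sample(die1, 1000, replace=TRUE)
pmf.hat <- table(sout)/length(sout)   # estimated PMF: one proportion per outcome
all(pmf.hat >= 0 & pmf.hat <= 1)      # property 3
[1] TRUE
sum(pmf.hat)                          # property 4
[1] 1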

Example calculation

We can also do the same thing for our previous example, where 1 was the outcome 20% of the time, 3 was the outcome 50% of the time, and 7 the outcome 30% of the time. The easiest way to do this is to create a 10-sided die with two 1s, five 3s, and three 7s, and then sample from that:

die2 <- c(1,1,3,3,3,3,3,7,7,7)
sout <- sample(die2, 1000, replace=TRUE)
ggplot(data=data.frame(sout),aes(x=sout)) + 
  geom_histogram(aes(y=..count../sum(..count..))) + 
  xlim(0, 8) + ylab("density") + xlab("outcome")
Warning: Removed 2 rows containing missing values (geom_bar).
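Alternatively, sample() has a prob argument that lets us specify the outcome probabilities directly, rather than building a 10-sided die; a sketch of the equivalent scheme:

sout2 <- sample(c(1,3,7), 1000, replace=TRUE, prob=c(0.2, 0.5, 0.3))
table(sout2)/length(sout2)   # proportions should be close to 0.2, 0.5, 0.3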

Non-discrete distributions

What is a non-discrete distribution?

  • Intuitively we know what it means to choose a number between 0 and 1 at random, but in both theory and practice it’s a bit more complex. Any given number – such as 0.1232342524534 – has essentially a 0% chance of being selected, but at the same time, we know that we have a 50% chance of getting a number between 0 and 0.5, and a 10% chance of getting a number between 0.8 and 0.9.

  • In contrast to discrete distributions, non-discrete distributions don’t have “spikes” in the distribution but have continuous values (e.g. random number between 0 and 1, random direction, family income, etc.).

  • Formally, there exists a nonnegative function \(f_X(x)\), the probability density function, such that \(P(a \le X \le b) = \int_a^b f_X(x)\,dx\).

Probability Density Functions

We can now define the probability density function (PDF).

What is the probability of a point mass? Moving from the point mass to the PDF:

\[P(X = a) = P(a \le X \le a) = 0\]

But: don’t think of the PDF \(f_X(x)\) as \(P(X=x)\) because, remember, \(P(X=x) = 0\) for every \(x\)! The probability density function does not give a probability but a probability density. To obtain a probability, we integrate the density over an interval (a single exact value drawn from a continuous distribution has probability zero; see Lander, p. 172).
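To see this concretely, we can integrate a density over a narrow interval and watch the probability shrink with the interval’s width. Here is a sketch using the uniform density on [0,1], where f is a hand-rolled stand-in defined just for this illustration:

f <- function(x){ifelse(x>=0 & x<=1, 1, 0)}   # uniform density on [0,1]
integrate(f, 0.3, 0.31)$value   # interval of width 0.01
[1] 0.01

As the interval shrinks toward the single point 0.3, the area, and hence the probability, goes to 0.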

  • So: instead of the previous probability mass function with spikes at each outcome whose heights must add up to 1, we now have a probability density function (pdf). A pdf is defined over a continuum of values, and the total area under it is 1.

  • PDFs are often written as functions; for example, for a uniform distribution:

\[ f_X(x) = \frac{1}{b-a} \quad \text{for } x \in [a,b] \]

and

\[ f_X(x) = 0 \quad \text{otherwise.} \]

Wait, what is a uniform distribution?

Uniform distribution

As you might expect, R has a function for generating a random number between A and B, where any number between A and B is equally likely:

runif(1,0,1)
[1] 0.9502876

The first argument is the number of numbers to generate, the second is the lower bound, and the third is the upper bound. As before, we can generate a large number of these and plot the histogram to see our pdf:

randunifs <- runif(10000,0,1)
ggplot(data=data.frame(randunifs),aes(x=randunifs)) + 
  geom_histogram(aes(y=..density..)) + 
  xlim(-1, 2) + ylab("density") + xlab("outcome")
Warning: Removed 2 rows containing missing values (geom_bar).

This is what’s known as a uniform distribution: every value between A and B is equally likely, so the density is constant on that interval.

Visualization

Rather than as shaded-in histograms, pdfs are often drawn as functions, although you can always imagine them as idealized histograms.

The true pdf for the uniform distribution between 0 and 1 – sometimes abbreviated U[0,1] – is:

uniformfun <- function(x){ifelse(x>=0&x<=1,1,0)}
ggplot(data=data.frame(x=c(-1, 2)), aes(x)) + stat_function(fun=uniformfun)

(To create this graph, first we create our uniform function, which is 1 between 0 and 1, and 0 elsewhere. Then we use the stat_function option in ggplot2 to add the function to the ggplot, where the data frame is just a holder setting the x range.)

Cumulative distribution function

The probability that a random number drawn from this distribution is less than some \(x\) is known as the cumulative distribution function or cdf.

For a uniform pdf: The chance that some number drawn from it is less than 0 is 0; the chance that some number drawn from it is less than 0.25 (for instance) is 25% or 0.25, and the chance that some number drawn from it is less than 1 is 100%, or 1.

Thus, while the pdf of U[0,1] is the line \(y=1\) for all \(x\) between 0 and 1 and \(y=0\) elsewhere, the cdf gives the chance that a draw is less than \(x\), which is just \(x\) itself: \(y = x\) for \(x \in [0,1]\), \(y = 0\) for \(x \leq 0\), and \(y = 1\) for \(x \geq 1\).

uniformcdf <- function(x){ifelse(x>=0&x<=1,x,ifelse(x<0,0,1))}
ggplot(data=data.frame(x=c(-1, 2)), aes(x)) + stat_function(fun=uniformcdf)

(Note how we nested the second ifelse within the first in defining our cdf. There are plenty of other ways to do it, though.)

Properties of the cdf (checked numerically below):

  1. \(F_X(x)\) never decreases.

  2. \(F_X(x)\) is normalized: it runs from 0 (at the far left) to 1 (at the far right).

  3. \(F_X(x)\) is right-continuous (no jumps as we approach a point from the right).
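A quick numerical check of the first two properties, using the uniformcdf function we defined above (a sketch):

xs <- seq(-1, 2, by=0.01)
all(diff(uniformcdf(xs)) >= 0)   # property 1: never decreases
[1] TRUE
uniformcdf(-1)                   # property 2: 0 at the far left...
[1] 0
uniformcdf(2)                    # ...and 1 at the far right
[1] 1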

PDF vs CDF

PDF: gives you the density for any given x, denoted \(pdf(x)\) or simply \(p(x)\).

CDF: gives you the probability that you will get a draw less than x, which is equal to \(cdf(x)\), sometimes abbreviated \(F(x)\). This is equal to the area under the pdf to the left of x, as we discussed with the uniform pdf. As you may recall from calculus, this area is equal to the integral of pdf(x) up to x, or:

\[P(X < a) = F(a) = \int_{- \infty }^a p(x) dx\]

(Note how we use capital P or F for a probability and lowercase p for a probability density function; capital X for the random variable, and lowercase x and a for specific values of that random variable.)

Example: Often we are interested in the chance of getting a draw between a and b. We know from before that the chance of drawing a number from U[0,1] that’s between, say, 0.8 and 0.9, is 0.1 (since it’s 10% of the interval). To get this using the cdf, we will calculate the chance of getting something between a and b, which is the chance of getting something less than b, minus the chance of getting something less than a. That is:

\[P(X \in [a,b]) = F(b) - F(a) = \int_{- \infty }^b p(x) dx - \int_{- \infty }^a p(x) dx = \int_{a}^b p(x) dx\]
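We can verify this numerically with base R’s integrate function, applied to the uniformfun density we defined earlier:

integrate(uniformfun, 0.8, 0.9)$value   # area under the pdf between 0.8 and 0.9
[1] 0.1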

Calculating with a CDF

In simple terms, we want to know \(F(b) - F(a)\). For example, looking at our previous figure, if we want the probability that a random draw will be between 0.234 and 0.345, say, we just calculate \(F(0.345) - F(0.234)\), that is, the y value at 0.345 minus the y value at 0.234, which for this super-simple function is just

\[F(0.345) - F(0.234) = 0.345 - 0.234 = 0.111\] So: there is an 11.1% chance that our random draw will be in that range.
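Or, in R, using the uniformcdf function from earlier:

uniformcdf(0.345) - uniformcdf(0.234)
[1] 0.111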

Uniform with different bounds

In some ways, the U[0,1] distribution is a deceptively simple case, because \(F(x) = x\) (between 0 and 1). But say we have a uniform chance of drawing a number between 1 and 3. In that case, we have a pdf that looks like this:

randunifs <- runif(10000,1,3)
ggplot(data=data.frame(randunifs),aes(x=randunifs)) + 
  geom_histogram(aes(y=..density..)) + 
  xlim(0, 5) + ylab("density") + xlab("outcome")
Warning: Removed 2 rows containing missing values (geom_bar).

The pdf is now only half as high, because the total area (total probability) still has to sum to 1, and we’ve doubled the width. Once again, it is easy enough to plot the true pdf and cdf, and to see that the chance of getting a draw less than 2 is now 0.5.

uniformfun <- function(x){ifelse(x>=1&x<=3,1/2,0)}
ggplot(data=data.frame(x=c(0, 4)), aes(x)) + stat_function(fun=uniformfun)

uniformcdf <- function(x){ifelse(x>=1&x<=3,(x-1)/2,ifelse(x<1,0,1))}
ggplot(data=data.frame(x=c(0, 4)), aes(x)) + stat_function(fun=uniformcdf)
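We can confirm that chance both from the cdf directly and by simulation (the simulated proportion will vary slightly around 0.5):

uniformcdf(2)                  # (2-1)/2
[1] 0.5
mean(runif(10000, 1, 3) < 2)   # simulated proportion of draws below 2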

R functions: Draws from the PDF

We can also get the pdf, cdf, and the inverse of the cdf using built-in R functions similar to runif:

To get the density (pdf) we use dunif, which, as with runif, defaults to the range [0,1], but that can be changed with the second and third arguments:

dunif(-1)
[1] 0
dunif(.5)
[1] 1
dunif(2.5,1,3)
[1] 0.5

Again, generally the equation for the pdf of a uniform distribution \(U[a,b]\) is:

\[p(x; a,b) = \frac{1}{b-a} \; \text{for all } x \in [a,b] \text{, and } 0 \text{ otherwise}\]

R: Probability from CDF

To get the cdf, we use punif, which again, is the probability that a draw will be less than x:

punif(-1)
[1] 0
punif(.5)
[1] 0.5
punif(2)
[1] 1
punif(2.5,1,3)
[1] 0.75

(Make sure you understand all these results!)

R: CDF value from probability

What if we know P(\(X \le x\)) = p but don’t know x? We can get the inverse CDF, known as the quantile function.

To get the inverse of the cdf we use qunif, which, given an input probability \(p\), gives us the value \(a\) such that there is probability \(p\) of getting a draw less than \(a\).

qunif(0.5)
[1] 0.5
qunif(0.5,1,3)
[1] 2
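Since qunif is the inverse of punif, composing the two returns whatever value we started with:

punif(qunif(0.3))
[1] 0.3
qunif(punif(0.7))
[1] 0.7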

As we will see, there are many different probability distributions of interest, and many of them have similar built-in R functions akin to the four described for the uniform distribution.
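For example, the normal distribution, which we will meet later, follows the same naming pattern with rnorm, dnorm, pnorm, and qnorm:

rnorm(1)     # one random draw from the standard normal (output varies)
dnorm(0)     # density at 0
[1] 0.3989423
pnorm(0)     # probability of a draw less than 0
[1] 0.5
qnorm(0.5)   # value below which half of all draws fall
[1] 0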