
Fundamental Distributions

Overview

This lesson explores some common probability distributions.

Objectives

After completing this lesson, students should be able to:

  1. Employ R functions related to the binomial distribution.
  2. Employ R functions related to the Poisson distribution.
  3. Employ R functions related to the normal distribution.

Key distributions

There are dozens if not hundreds of distributions in the statistician’s toolkit, but a few of the most popular ones do by far the majority of the work – often to the point of being applied to processes they don’t quite match, but where assuming they do makes our analysis much easier:

  • Normal distribution

  • Binomial distribution

  • Poisson distribution

The key thing to bear in mind is the process that generates each distribution, and when we might want to apply it in the real world.

Normal distribution

The normal distribution is the most important and most ubiquitous distribution; it arises from the central limit theorem (discussed next). The equation for the normal distribution is:

\[p(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} e^{\frac{-(x-\mu)^2}{2 \sigma^2}}\]

There are two parameters here: the mean \(\mu\) and the standard deviation \(\sigma\) (\(\sigma^2\) is the variance).
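To make the equation concrete, we can code it up directly and check it against R’s built-in density function (the values x = 0.5, \(\mu = 0\), \(\sigma = 1\) below are arbitrary illustrative choices):

```r
# The normal pdf, written out by hand
normal_pdf <- function(x, mu, sigma) {
  (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
}

normal_pdf(0.5, 0, 1)  # height of the standard normal curve at x = 0.5
dnorm(0.5, 0, 1)       # the built-in density gives the same value
```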

Some features of the normal distribution:

  • The mean = median = mode

  • Symmetry: 50% of the values are less than the mean and 50% are greater than the mean.

  • Standard deviations: 68% of the values are within 1 standard deviation of the mean, 95% are within 2 standard deviations, and 99.7% are within 3 standard deviations.

Remember how we calculate the mean, the variance, and the standard deviation?
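As a quick refresher, here is how R computes each of these for a small sample (the vector below is an arbitrary illustrative sample; note that var and sd divide by n - 1):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # an arbitrary sample

mean(x)   # sum(x) / length(x) = 5
var(x)    # sum((x - mean(x))^2) / (length(x) - 1)
sd(x)     # sqrt(var(x))
```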

Visualization

As usual, we can generate a distribution by taking multiple draws using the R function rnorm:

library(ggplot2)
randnorms <- rnorm(10000,0,1)
ggplot(data=data.frame(randnorms), aes(x=randnorms)) + 
  geom_histogram(aes(y=..density..)) + 
  xlim(-4, 4) + ylab("density") + xlab("outcome")

Here the mean is 0 and the standard deviation (sd) is 1. As a rule of thumb, for the normal distribution about 68% of observations appear within 1 sd of the mean, and about 95% of observations appear within 2 sd’s of the mean.
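We can check this rule of thumb directly against simulated draws (the sample size and seed here are arbitrary choices):

```r
set.seed(1)                  # arbitrary seed, for reproducibility
draws <- rnorm(10000, 0, 1)  # 10000 draws from N(0, 1)

mean(abs(draws) <= 1)   # proportion within 1 sd of the mean: about 0.68
mean(abs(draws) <= 2)   # proportion within 2 sds of the mean: about 0.95
```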

Varying the parameters

We can plot the normal distribution for various combinations of \(\mu\) and \(\sigma\), but the key thing to bear in mind is that they are all basically the same shape, just shifted and stretched.

normfun <- function(x){dnorm(x,0,1)}
ggplot(data=data.frame(x=c(-4, 4)),aes(x)) + ylab("density") + xlab("outcome") + 
  stat_function(fun=normfun)

normfun <- function(x){dnorm(x,1,0.25)}
ggplot(data=data.frame(x=c(-4, 4)),aes(x)) + ylab("density") + xlab("outcome") + 
  stat_function(fun=normfun)

R functions

What if we are interested in the probability that an event will fall between some \(a\) and \(b\), or \(P(X \in [a,b])\)? This requires the cdf of the normal; as usual, it can be approximated by generating the histogram and then adding up the events below \(x\) to give us \(F(x)\) for each value of \(x\).

Using R functions:

pnorm(0,0,1)
[1] 0.5
pnorm(0,1,0.25)
[1] 3.167124e-05

This gives the proportion of events less than 0 for \(N(0,1)\) and \(N(1,0.25)\), to use the usual abbreviation, where \(\mu\) is the first number and \(\sigma\) is the second.
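To answer the interval question \(P(X \in [a,b])\), we simply subtract one cdf value from another; for \(N(0,1)\) with \(a = -1\), \(b = 1\) this recovers the 68% rule of thumb:

```r
# P(a <= X <= b) = F(b) - F(a)
pnorm(1, 0, 1) - pnorm(-1, 0, 1)   # P(-1 <= X <= 1): about 0.683
pnorm(2, 0, 1) - pnorm(-2, 0, 1)   # P(-2 <= X <= 2): about 0.954
```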

Visualizing the CDF

We can also approximately plot the cdf using the pnorm function, to get a sense of how it looks:

cumlfun <- function(x){pnorm(x,0,1)}
ggplot(data=data.frame(x=c(-4, 4)),aes(x)) + ylab("cumulative distribution") + xlab("outcome") + 
  stat_function(fun=cumlfun)

As you can see, half the events are below 0, and nearly all are below 2 (i.e., 2 standard deviations above the mean, since sd = 1 in this case).

Binomial distribution

How do we know a coin is biased? We flip it a lot, and see whether we get more heads or tails than we would expect. But of course we’re never going to get exactly 50% heads and 50% tails. So how many heads or tails is enough to make us suspicious?

The binomial distribution gives the probability of getting \(x\) events (eg, heads) out of a total of \(n\) trials, where each of the events we’re interested in has probability \(p\).

If we’re asking about, say, three heads out of four flips, we could get HHHT, HTHH, etc. Each sequence has the same probability, \(0.5^4\), but there are several different ways of getting three heads and one tail.

The binomial pdf

The binomial pdf is:

\[P(x;n,p) = {n \choose x}p^x (1-p)^{n-x}\]

The semi-colon separates the random variable, \(x\), from the parameters \(p\) and \(n\). For any given \(p\) and \(n\), we can derive the probability of getting \(x\) events for any \(x\).

\(p^x (1-p)^{n-x}\): the probability of getting any one particular sequence. For instance, with three heads and one tail, x = 3 and n = 4. If it were a biased coin, \(p\) could be something other than 0.5, such as 0.1; if it were a fair die, \(p\) would be 1/6. The chance of getting two 6’s and one non-6 in a particular order would then be: \(1/6 \cdot 1/6 \cdot 5/6\)

To get the total probability of any such combination (such as three heads and a tail), we just add up the probability of each of these sequences: P(HHHT) + P(HTHH) + … etc., each of which is individually the same probability. \({n \choose x}\) simply counts the number of ways we could get three H’s and one T; it is read as “n choose x” and equals \(n!/(x!(n-x)!)\).
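We can verify this logic for the three-heads-in-four-flips example by computing each piece by hand and comparing with the built-in density function:

```r
choose(4, 3)                  # number of orderings of three H's and one T: 4
0.5^3 * (1 - 0.5)^(4 - 3)     # probability of any one such sequence: 0.0625
choose(4, 3) * 0.5^3 * 0.5^1  # the binomial pdf by hand: 0.25
dbinom(3, 4, 0.5)             # the built-in function agrees
```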

Binomial R functions

As usual, R has four functions for dealing with binomial distributions, analogous to the ones we saw for the uniform distribution: rbinom, pbinom, dbinom and qbinom.

pbinom gives the distribution function:

pbinom(3,4,0.5)
[1] 0.9375

dbinom gives the density:

dbinom(2,3,1/6)
[1] 0.06944444

rbinom generates random draws, where each draw is the number of times the event occurs in \(n\) trials:

rbinom(1,4,.5)
[1] 2

qbinom gives the quantile function:

qbinom(1,4,.5)
[1] 4
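pbinom and qbinom are (discrete) inverses of each other: qbinom(p, n, prob) returns the smallest x whose cumulative probability is at least p. A quick round trip:

```r
pbinom(3, 4, 0.5)        # P(X <= 3) = 0.9375
qbinom(0.9375, 4, 0.5)   # recovers 3
qbinom(0.5, 4, 0.5)      # median number of heads in 4 fair flips: 2
```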

Visualization

As usual, we can derive the dbinom result from the rbinom by repeatedly flipping our coin 4 times, and for each set of four flips, writing down the total number of heads, and plotting the histogram of those totals:

library(ggplot2)
headcounts <- rbinom(100,4,.5)
ggplot(data=data.frame(headcounts), aes(x=headcounts)) + 
  geom_histogram(aes(y=..density..)) + 
  xlim(-1, 5) + ylab("density") + xlab("outcome")

Visualization with more samples

Strictly speaking, this is a probability mass function rather than a pdf (since x is not continuous), but the previous figure gives an idea of the sorts of shapes one encounters. Here’s another version with a much larger number of flips:

library(ggplot2)
headcounts <- rbinom(10000,100,.5)
ggplot(data=data.frame(headcounts), aes(x=headcounts)) + 
  geom_histogram(aes(y=..density..)) + 
  xlim(-2, 102) + ylab("density") + xlab("outcome")

Note how the distribution approaches the bell-curve shape as the number of events goes up. As we will see, this is a general property, and one reason why the normal distribution (which we will turn to shortly) is so important.
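One way to see this convergence numerically is to compare the binomial cdf with a normal cdf that has the same mean (\(np\)) and standard deviation (\(\sqrt{np(1-p)}\)); the extra 0.5 below is the standard continuity correction:

```r
pbinom(55, 100, 0.5)                      # exact binomial cdf: about 0.864
pnorm(55.5, 100 * 0.5, sqrt(100 * 0.25))  # normal approximation: about 0.864
```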

Poisson distribution

  • Another important discrete distribution.
  • Similar to the binomial distribution in that it measures the total number of events, but in this case, we assume that the events happen during some time (or space) interval (e.g. the number of emails you receive in an hour).

As before, we have a pdf, and a cdf which describes the probability of seeing x or fewer events in the time period. There is one parameter, \(\lambda\), which is the average number of events one generally sees in the period. The pdf is:

\[P(x;\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}\]
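As with the normal, we can code this pdf by hand and check it against the built-in function (\(\lambda = 5\) and x = 3 are arbitrary illustrative values):

```r
# The Poisson pdf, written out by hand
pois_pdf <- function(x, lambda) lambda^x * exp(-lambda) / factorial(x)

pois_pdf(3, 5)   # chance of exactly 3 events when we average 5 per period
dpois(3, 5)      # the built-in function gives the same value
```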

Visualization

As before, we can plot it by generating samples using rpois, where each sample is a total number of observed events:

randpois <- rpois(10000,1)
ggplot(data=data.frame(randpois), aes(x=randpois)) + 
  geom_histogram(aes(y=..density..), binwidth=1) + 
  xlim(0, 20) + ylab("density") + xlab("outcome")

randpois <- rpois(10000,5)
ggplot(data=data.frame(randpois), aes(x=randpois)) + 
  geom_histogram(aes(y=..density..), binwidth=1) + 
  xlim(0, 20) + ylab("density") + xlab("outcome")

R functions

For rare events (where the mean number is 1 or less), the distribution is not bell-curved, but for larger mean values it too resembles a bell curve. In fact, the Poisson distribution arises as the limit of the binomial as the number of trials grows large and the per-trial probability shrinks, with \(np = \lambda\) held fixed.
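A quick numerical sketch of the Poisson-as-binomial-limit: fix the expected count at \(\lambda = 2\) (an arbitrary choice) and let the number of trials grow while the per-trial probability shrinks:

```r
dbinom(3, 20, 2 / 20)       # 20 trials:        about 0.190
dbinom(3, 1000, 2 / 1000)   # 1000 trials:      about 0.181
dpois(3, 2)                 # the Poisson limit: about 0.180
```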

As before, we can calculate the cdf and all the rest using the cognate functions: dpois, ppois and qpois.

poisfun <- function(x){dpois(x,5)}
ggplot(data=data.frame(randpois), aes(x=randpois)) + 
  geom_histogram(aes(y=..density..), binwidth=1) + 
  xlim(0, 20) + ylab("density") + xlab("outcome") + 
  stat_function(fun=poisfun, color="yellow")

(Note that the function is spiky because it is only defined for integer values of x. In fact, plotting poisfun this way produces lots of warnings as ggplot tries to evaluate it for non-integer x values, because dpois is a probability mass function, not a density function.)

Review of pdf functions

Distribution | Parameters determining shape of PDF | Cumulative probability defined by input \(x\)
------------ | ----------------------------------- | ---------------------------------------------
Uniform | 2: upper and lower bounds | Probability of drawing a number \(\leq x\)
Binomial | 2: probability of the event; total number of trials | Probability of observing \(\leq x\) events in \(n\) trials
Poisson | 1: average number of events in a period of time | Probability of observing \(\leq x\) events in a period of time
Normal | 2: mean; sd | Probability of drawing a number \(\leq x\)

Review of R functions

All four R functions (r–, p–, d–, q–) take one main input, plus a number of extra parameters that define the shape of the pdf being used. Each function is basically a way of calculating either the height of, or the area under, the pdf defined by those extra parameters.

dnorm(x,mean,sd): Give me the height of the pdf at x. This is the pdf itself: an equation where you input an x and get back p(x), the height of the curve at x.

rnorm(n,mean,sd): Give me n random numbers drawn from the distribution.

pnorm(q,mean,sd): Give me the probability (P) that a random draw from the distribution will be less than q (some x value). Equivalently, give me the area under the curve up through point q. This is the CDF, or F(x).

qnorm(p,mean,sd): This is the inverse of pnorm, as you can see from the p/q reversal: Give me the q (an x value) such that the area under the curve up to point q equals p. Equivalently, give me the q such that a random draw will be less than q with probability p.
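We can confirm the inverse relationship directly (1.96 and 0.975 are the familiar values behind the 95% rule of thumb):

```r
pnorm(1.96, 0, 1)               # about 0.975
qnorm(0.975, 0, 1)              # about 1.96
qnorm(pnorm(0.5, 0, 1), 0, 1)   # the round trip recovers 0.5
```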

Here is a plot to help you visualize how these all interrelate using the normal distribution, but the same logic holds for any pdf.