
Probability

Overview

This lesson introduces the basic concepts of probability.

Objectives

After completing this lesson, students should be able to:

  1. Calculate simple probabilities.
  2. Use the concept of sample spaces to estimate probabilities.
  3. Calculate and understand conditional probabilities.
  4. Apply Bayes’ Theorem to calculate unknown conditional probabilities.

Probability and randomness

What is Probability?


Random Variables and Probability

  • Probabilities and Outcomes: The age of the next person you meet, your grade on an exam, and the number of errors in your R code all have an element of chance or randomness. In each of these examples, there is something not yet known that is eventually revealed.

  • The mutually exclusive potential results of a random process are called outcomes. For example, your R code might have no errors, it might have one error, it might have two errors, and so on. Only one of these outcomes will actually occur (the outcomes are mutually exclusive), and the outcomes need not be equally likely. The better a coder you are, the more likely it is that your code has few or no errors.

  • The probability of an outcome is the proportion of the time that the outcome occurs in the long run. If the probability of your computer not crashing while you are writing a term paper is 80%, then over the course of writing many term papers you will complete 80% without a crash.
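
As a quick illustration of this long-run idea, here is a minimal simulation sketch (the 80% figure is just the made-up number from the example above):

set.seed(1)                                     # for reproducibility
papers <- rbinom(10000, size = 1, prob = 0.8)   # 1 = finished without a crash
mean(papers)                                    # long-run proportion, close to 0.8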

Calculating probabilities

You flip a coin three times in a row. What is the chance of getting three heads?

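One way to check an answer like this is simply to enumerate the sample space in R. A sketch (any listing of the eight equally likely sequences works):

flips <- expand.grid(f1 = c("H","T"), f2 = c("H","T"), f3 = c("H","T"))  # all 8 sequences
nrow(flips)                                                              # 8
sum(flips$f1 == "H" & flips$f2 == "H" & flips$f3 == "H") / nrow(flips)   # 1/8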

How about if we roll a die? What’s the chance of getting a 1 or a 3?


Sample spaces

The easiest way to think about probability is via a sample space (also called an event space or state space). The sample space is just the set of all possible mutually exclusive outcomes of your random process. For a roll of a six-sided die, the sample space is {1,2,3,4,5,6}. The probabilities of the individual outcomes must sum to 1.

We can write this as:

\[p(die=1) + p(die=2) + p(die=3) + p(die=4) + p(die=5) + p(die=6) =\] \[\frac{1}{6}+\frac{1}{6}+\frac{1}{6}+\frac{1}{6}+\frac{1}{6}+\frac{1}{6} = 1\]

As we will see, complex probabilities can be calculated very easily just by counting up the equally likely outcomes in our sample space.

Let’s roll two dice. What is the chance of getting a total of 7?


Outcomes

Let’s create a table of all the possible outcomes of these two rolls – the sample space:

die1 = c(1,2,3,4,5,6)            # faces of the first die
die2 = c(1,2,3,4,5,6)            # faces of the second die
twodice = outer(die1,die2,"+")   # all 36 pairwise sums: rows = die1, columns = die2
twodice
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    3    4    5    6    7
[2,]    3    4    5    6    7    8
[3,]    4    5    6    7    8    9
[4,]    5    6    7    8    9   10
[5,]    6    7    8    9   10   11
[6,]    7    8    9   10   11   12
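
With the sample space in hand, counting up the outcomes that answer the question above takes one line:

sum(twodice == 7)                    # 6 of the 36 equally likely outcomes sum to 7
sum(twodice == 7) / length(twodice)  # 6/36 = 1/6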

Compound events

Once you start thinking in terms of sample spaces, it is easier to conceptualize the various laws of probability and solve probability problems. The challenge is always to define your outcomes as a set of (usually) equally probable events, and then add up the ones you are interested in as a fraction of the total.

Another one:

What is the chance of getting exactly two heads in a row when flipping a coin three times?

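Reading “two heads in a row” as two consecutive heads but not three (that is one interpretation; adjust the condition if you read the question differently), a sample-space count in R looks like this:

flips <- expand.grid(f1 = c("H","T"), f2 = c("H","T"), f3 = c("H","T"))   # the 8 sequences again
two_in_a_row <- (flips$f1 == "H" & flips$f2 == "H" & flips$f3 == "T") |
                (flips$f1 == "T" & flips$f2 == "H" & flips$f3 == "H")
sum(two_in_a_row) / nrow(flips)                                           # HHT and THH: 2/8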

Seven features of probabilities

Seven features:

  1. \(P\)(single event) = \(\frac{1}{N}\), when all \(N\) outcomes are equally likely.
  2. \(P\)(group of outcomes) = \(\frac{\text{group frequency}}{N}\), i.e. the number of single events in the group divided by \(N\).
  3. \(P\)(single event) is \(\ge 0\) and \(\le 1\): nonnegativity.
  4. \(\sum\) probabilities = 1; also \(P(\Omega) = 1\): normalization.
  5. \(P\)(event) + \(P\)(no event) = 1.
  6. \(P\)(any one of a set of mutually exclusive events occurring) is the sum of their probabilities: the addition rule.
  7. \(P\)(a combination of independent events all occurring) is the product of their separate probabilities: the multiplication rule.

Important notations

\(\Omega\): the sample space, the set of all possible outcomes

\(\omega \in \Omega\): one particular outcome (elementary event)

\(X(\omega)\): a random variable, i.e. a value assigned to each outcome

Use lower case \(x\) to denote a realized value of the random variable X.

Let’s consider a random experiment consisting of two coin flips, where each flip comes up H or C, together with a random variable \(X\) that assigns a value to each possible outcome, as shown below.

\(\omega\)   \(P(\{\omega\})\)   \(X(\omega)\)
HC   1/4   0
HH   1/4   1
CH   1/4   3
CC   1/4   1

So, for every \(x\)=0, \(x\)=1, and \(x\)=3, what is \(P(X=x)\)?
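
Taking the table above at face value, one way to tally \(P(X=x)\) in R is to sum the outcome probabilities that share the same value of \(x\) (the names are just for readability):

p_omega <- c(HC = 1/4, HH = 1/4, CH = 1/4, CC = 1/4)   # P({omega})
x_omega <- c(HC = 0,   HH = 1,   CH = 3,   CC = 1)     # X(omega)
tapply(p_omega, x_omega, sum)                          # P(X = x) for x = 0, 1, 3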

Exercise

Much of the rest of probability is just being clever about how you count up the outcomes you’re interested in (the numerator) and the total number of outcomes (the denominator).

Which of the following correctly uses R to count up the numerator (the number of ways of getting 5, 7, or 9 on a roll of two dice)?

  • length(which(twodice==5|twodice==7|twodice==9))
    Yes. The which gives you a vector of the positions of the elements in the matrix that meet the criterion; the length gives you the number of those elements, or 14.
  • length(twodice[twodice==5|twodice==7|twodice==9])
    Yes. The inner part gives you the values of the elements of twodice that equal 5, 7, or 9, and the length gives you the number of those elements, or 14.
  • sum(twodice==5|twodice==7|twodice==9)
    Yes. If you leave off the sum part you will see a matrix with TRUE in the right places, which R treats as 1 (while treating FALSE as 0) and sums to 14.
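
Putting numerator and denominator together gives the probability itself:

sum(twodice == 5 | twodice == 7 | twodice == 9) / length(twodice)   # 14/36, about 0.389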

Joint probabilities

We can think a bit more abstractly about sample spaces now. Imagine that instead of our 6x6 grid from the dice example, we have a grid with an arbitrary number of squares. Say we have a dartboard like the one in the figure, and we are the world’s worst darts player: we’re sure to hit the board, but where the dart lands is totally random. The probability of the dart landing in some portion A of the dartboard is just the area of A as a proportion of the total dartboard.

Based on this figure, the probability of landing in A is \(\frac{6 \times 10}{10 \times 16} = \frac{60}{160} = 0.375\). We usually normalize the total area to 1, so we can say more abstractly that the probability of A, denoted \(P(A)\), is equal to the area of A, assuming the total area of the sample space is 1.
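
As a quick check, here is a minimal Monte Carlo sketch; the coordinates are my own assumption about the figure’s layout (a 16-by-10 board with A a 6-wide strip spanning its full height), so adjust them if the figure differs:

set.seed(1)
x <- runif(100000, 0, 16)    # horizontal landing spot, uniform over the board
y <- runif(100000, 0, 10)    # vertical landing spot (irrelevant for a full-height strip)
mean(x < 6)                  # proportion landing in A, close to 60/160 = 0.375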

A can be any event we’re interested in, such as getting a total of 7 on rolling two dice, or getting exactly two heads in a row on flipping a coin three times.

Union / Or

This addresses the chance of A or B happening, i.e. the probability that event A or event B (or both) occurs.

The total area in A is the area in its square, and the same for B; the combination or union of these two outcomes we denote \(A \cup B\) (the union of A and B). But \(P(A \cup B) \neq P(A) + P(B)\), since they clearly overlap, and we would be double-counting the overlapped area. So in general, we have:

\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]

Where \(P(A \cap B)\) is the probability of landing in the intersection of A and B.

Referring to our previous figure:

What is \(P(A \cap B)\)?
\(3/160 = 0.01875\).
What is \(P(A \cup B)\)?
\((6 \cdot 10)/160 + (5 \cdot 5)/160 - 3/160 = 0.375 + 0.15625 - 0.01875 = 0.5125\)
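
You can see the same no-double-counting logic in the twodice grid from earlier, using (for example) A = “total of 5 or less” and B = “even total” (my choice of events, purely for illustration):

A <- twodice <= 5                  # total of 5 or less
B <- twodice %% 2 == 0             # even total
mean(A) + mean(B) - mean(A & B)    # P(A) + P(B) - P(A & B)
mean(A | B)                        # the same number: P(A union B) = 24/36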

Intersection / And

We can also think of \(P(A \cap B)\) as the probability of \(A \& B\), ie, \(P(A \& B)\).

Given that our dart lands somewhere in A, what is the probability that our dart has landed in \(A \& B = A \cap B\)?

We write this as:

\[P(A \& B | A) = \frac{P(A \& B)}{P(A)}\]

That is, the probability of getting \(A \& B\) conditional on (\(``|"\)) already getting A is the probability of \(A \& B\) divided by \(P(A)\) – that is, the fraction of A that is \(A \& B\). We could also write the first part as \(P(B|A)\), since given that we are in A, landing in B is necessarily going to be the same as landing in \(A \& B\): \(P(B|A) = P(A \& B | A)\).

So for our figure, we can calculate \(P(B|A)\): \(P(B|A) = \frac{3}{60} = 0.05\), ie, there is a five percent chance of landing in B given that we have landed in A.

Of course, we can also run it the other way, asking what the chance is of landing in A given that we know we landed somewhere in B:

\[P(A \& B | B) = P(A|B) = \frac{P(A \& B)}{P(B)} = \frac{3}{25} = 0.12\]

Bayes’ Theorem

So to push this one step further, what’s the chance of landing in \(A \& B\)? The obvious answer is 3 in 160, or 0.01875. But another way to look at it is the chance of first landing in A, and then given you’ve landed in A, landing in \(A \& B\). So:

\[P(A \& B) = P(B|A) P(A)\]

In this case,

\[P(A \& B) = P(B|A) P(A) = \frac{3}{60} \frac{60}{160} = 0.01875\]

And of course the other way:

\[P(A \& B) = P(A|B) P(B) = \frac{3}{25} \frac{25}{160} = 0.01875\]
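
Plugging the dartboard numbers into R confirms that both factorizations give the same joint probability:

pA <- 60 / 160          # P(A)
pB <- 25 / 160          # P(B)
pB_given_A <- 3 / 60    # P(B|A)
pA_given_B <- 3 / 25    # P(A|B)
pB_given_A * pA         # 0.01875
pA_given_B * pB         # 0.01875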

The theorem

Thomas Bayes put these two together (though he wasn’t the first):

\[P(A \& B) = P(A|B) P(B) = P(B|A) P(A)\]

This is often rearranged as:

\[P(A|B) = \frac{P(B|A) P(A)}{P(B)}\]

Like many a theorem, this one seems rather obvious. But it turns out to be incredibly useful. In particular, we are often interested in knowing \(P(A|B)\), but we only know \(P(B|A)\), \(P(A)\), and \(P(B)\). In fact, people often mix up \(P(A|B)\) and \(P(B|A)\) altogether, and don’t even realize they are calculating the wrong thing.

Bayes Example

Suppose you have a fever and you go to the doctor to have some tests done to rule out this new disease that people have been warning about. Unfortunately, one of these tests comes up positive. The doctor explains that the test has a sensitivity of 80%: 80% of the time, if you have the disease, the test is positive. That’s a big percentage, right?

But that 80% turns out to be the answer to the wrong question. What you are interested in is the probability that you have the disease given that your test was positive.

So: P(you have the disease | test was positive), or \(P(D|T=+)\). The 80% number is P(test was positive | you have the disease ). To calculate \(P(D|T=+)\), we use Bayes’s Theorem:

Bayes’s Theorem:

\[P(D | T=+) = \frac{P(T=+|D) P(D)}{P(T=+)}\]

The first factor in the numerator on the RHS we know: 80%. The doctor tells you \(P(D)\), which, given it’s a new disease, is quite low: 0.004. But what is \(P(T=+)\)?

Example answer

  • Either you have the disease or you do not, so a positive test can arise in one of two ways: a positive result with the disease, with probability \(P(T=+|D) P(D)\), or a positive result without the disease, with probability \(P(T=+|ND) P(ND)\). Adding the two:

\[P(T=+) = P(T=+|D) P(D) + P(T=+|ND) P(ND)\]

  • We know that \(P(ND) = 1 - P(D) = 0.996\).

  • So we just need to know the “false positive rate”, i.e. the chance of getting a positive test even when you don’t have the disease. Let’s say the doctor tells you this is 10%.

So, the chance that you have the disease is:

\[P(D|T=+) = \frac{P(T=+|D) P(D)}{P(T=+)} = \frac{P(T=+|D) P(D)}{P(T=+|D) P(D) + P(T=+|ND) P(ND)}\]

or

\[P(D|T=+) = \frac{0.80 \cdot 0.004}{0.80 \cdot 0.004 + 0.10 \cdot 0.996} = 0.0311\]

That is, even given the positive test and the 80% sensitivity, you still only have about a 3% chance of actually having the disease. Do we now want everyone to be tested for the disease?
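
The whole calculation fits in a few lines of R, which also makes it easy to see how the answer changes if, say, the disease were more common (the numbers below are just the made-up ones from the example):

sensitivity <- 0.80    # P(T=+ | D)
prevalence  <- 0.004   # P(D)
false_pos   <- 0.10    # P(T=+ | ND)

p_positive <- sensitivity * prevalence + false_pos * (1 - prevalence)   # P(T=+)
sensitivity * prevalence / p_positive                                   # P(D | T=+), about 0.031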

The chain rule

Remember:

\[P(A \& B) = P(A,B)\]

And:

\[P(A,B) = P(A|B) P(B) = P(B|A) P(A)\]

Also, if there are more than two variables, then:

\[P(A,B,C) = P(A|B,C)P(B|C)P(C) = P(C|A,B)P(B|A)P(A) = P(A|B,C)P(C|B)P(B)\]

Conclusion: it doesn’t matter what order we do things in. The probability of getting A, B, and C is just the probability of getting C (for instance), times the probability of getting B given that you’ve first gotten C, times the probability of getting A given that you’ve gotten B and C. But you can factor it in any order you prefer. This is known as the chain rule of probability.

The chain rule: final notes

One final thing to bear in mind: if A and B are independent (eg, two rolls of a die), then:

\[P(A|B) = P(A)\]

and

\[P(B|A) = P(B)\]

What’s the chance of getting a 3 on the first roll and a 3 on the second? Well, when the two rolls are independent, it’s just \(1/6 \cdot 1/6\). In other words, when A and B are independent, then:

\[P(A,B) = P(A)P(B)\]
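
As a quick sample-space check with two dice (pairs of faces this time, rather than their sums):

pairs <- expand.grid(die1 = 1:6, die2 = 1:6)            # 36 equally likely pairs
sum(pairs$die1 == 3 & pairs$die2 == 3) / nrow(pairs)    # 1/36
(1/6) * (1/6)                                           # the same, by independence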

Often, with complex joint probabilities over many variables, \(P(A,B,C...)\), we can approximate \(P(A,B,C...)\) by assuming the variables are all independent:

\[P(A,B,C...) \approx P(A)P(B)P(C)...\]

This can be a good approximation if the variables are close to independent, i.e. \(P(X,Y) \approx P(X)P(Y)\) for every pair of variables, but it will be very misleading if they are strongly dependent. Much work in probability theory involves turning complicated functions \(P(A,B,C...)\) into the product of mathematically simpler functions using the chain rule and (when it is appropriate) independence assumptions.