
Logistic regression

Overview

This lesson introduces regression for categorical dependent variables.

Objectives

After completing this module, students should be able to:

  1. Perform a logistic regression and interpret the results.

Readings

Lander, Chapter 20.1; Schumacker, Chapter 18.

Introduction

Previously: We modeled \(y\) as a linear or non-linear function of \(x\), but we did not impose any restriction on the values that \(y\) can take.

In this lesson: The range of \(y\) is restricted to a specific set of discrete values. Why would we want to restrict the range of \(y\) in this way?

Some research questions imply that \(y\) is a binary or categorical response:

  • Did you complete high school? (answer can be Yes or No)

  • Did you use drugs in the last year? (answer can be Yes or No)

  • What is your marital status? (answer can be single/married/divorced/widowed/etc.)

The common factor in all of the previous questions is that the variable is not continuous and it takes only a limited predefined set of values. In this lesson we’ll learn how to deal with \(y\) being binary (only two possible values). There are other techniques that allow researchers to deal with more than two categories, but that’s beyond the scope of this introductory course.

Being Clever

Can we use some clever transformation to make the binary variable continuous or at least unbounded?

Sometimes it is possible to do so. Consider how we can rephrase some of the previous questions in a way that is more appropriate for the regression models that we have already explored:

  • How many years of education have you completed? (answers range from zero to infinity, in theory)

  • How many grams (or some other quantitative measure) of drugs did you use last year? (answers range from zero to infinity, in theory)

The problem is that unless you collect the data yourself, and instead rely on an existing database, some of these questions will not have been asked in the format that you prefer. Moreover, some questions cannot be transformed to a convenient continuous domain (see the case of marital status). Therefore, in some instances we are stuck with binary dependent variables.

The linear regression model with categorical dependent variable

The multivariate linear regression model is:

\[y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \epsilon\]

And the interpretation of \(\beta_{1}\) is: for every unit increase in \(x_{1}\), \(y\) changes by \(\beta_{1}\) units.

Now, let’s assume that \(y\) is a binary variable, for example, consider the employment status of an individual:

  • employment status (0 = unemployed, 1 = working) , and,
  • \(x_{1}\) is years of education.

In this case the literal interpretation of \(\beta_{1}\) is a bit strange. Assuming \(\beta_{1} = 0.1\), for every additional year of education, the employment status increases by \(0.1\) units. This implies that individuals with more than 10 years of education have an employment status greater than one (which is outside the bounds of the variable), and that individuals with no education are always unemployed.

This happens because the \(y\) variable is bounded: it cannot take values greater than 1 or less than zero, which is an important restriction. But the regression model does not take this restriction into account; the fitted values of \(y\) range from minus infinity to positive infinity. See how the regression line (red line) can lie above and/or below the bounds of \(y\).
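To see this in action, here is a minimal sketch with simulated data (the variable names educ and work are hypothetical): we fit an ordinary linear regression on a binary outcome and check that the fitted values are not confined to the \([0, 1]\) interval.

library(ggplot2)

# Hypothetical simulated data: years of education and a binary employment indicator
set.seed(1)
educ <- runif(200, min = 0, max = 20)                    # years of education
work <- rbinom(200, size = 1, prob = plogis(educ - 10))  # 0 = unemployed, 1 = working

# Ordinary linear regression on the binary outcome
fit <- lm(work ~ educ)
range(fitted(fit))  # some fitted values fall below 0 and above 1

# The red regression line crosses the bounds of y
ggplot(data.frame(educ, work), aes(x = educ, y = work)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")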

Even if the interpretation is a bit strange, the parameter is still an indication of the relation between \(x\) and \(y\). For instance, when the fitted values of \(y\) are very low, we can think of that as an indication that it is more likely that \(y\) is equal to the lower bound (zero), and if the predicted values of \(y\) are very high, we can think of that as an indication that it is more likely that \(y\) is equal to the upper bound (one). Formally,

  • High values of \(\hat{y}\) are associated with a higher \(P(y=1)\)
  • Low values of \(\hat{y}\) are associated with a higher \(P(y=0)\)

Then, if the fitted value is \(\hat{y} = 10\), it does not literally mean that the variable takes the value of \(10\) (which is impossible since \(y\) is binary), but rather that it is very likely that \(y = 1\).

Linear Probability Model

  • In the linear probability model we consider the probability of \(y=1\) as the dependent variable.

  • Let the probability of \(y\) being \(1\) conditional on \(X\) be:

\[P(y = 1 | X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon\]

  • In this case, the estimated parameters (\(\beta_{i}\)) will represent how the probability of \(y=1\) changes when the values of \(x_{i}\) change.

  • In this equation, a 1-unit increase in \(x_1\) causes a \(\beta_1\) change in the probability that \(y\) will be 1.

  • This model is called the Linear Probability Model, and it can be estimated just like any other multivariate regression, but instead of using an unbounded continuous dependent variable, we use a dummy.

But there’s a problem…

The problem with the linear probability model

Let’s say:

  1. We have a model that looks at maternal employment status as a function of the number of children.

\[P(\text{employment status} = 1 | \text{number of children}) = \beta_0 + \beta_1\text{number of children} + \epsilon\]

  2. The dependent variable is binary (1 = employed full-time, 0 = unemployed).

  3. Now imagine that each child decreases the probability of full-time work by 0.3, and that \(\beta_{0} = 1\).

  4. If someone has four children, that implies a negative probability of working!
    \[P(\text{employment status} = 1 | \text{number of children} = 4) = 1 - 0.3 \times 4 = 1 - 1.2 = -0.2\]

In other words, even if we interpret the fitted values of the dependent variable as the conditional probability \(P(y = 1 | X)\), the model still makes very bad predictions for very high or very low values of \(x\) (this is very similar to the problem of non-linear relations). The linear probability model can only make decent predictions for values of \(x\) near its average.

Again, one could bound the output of the regression and take any negative value as equal to zero and any value greater than one as equal to one. But, as you’ll see, there are more elegant and theoretically consistent solutions.

Logistic Regression

In light of how inadequate the linear regression model is when dealing with binary variables, we need some special functional form \(f(\cdot)\) that allows the relation between \(x\) and \(y\) to have the following characteristics:

  1. A change in \(x\) causes a change in the probability of \(y=1\).
  2. The values of the dependent variable of the model are unbounded, even when \(y\) is bounded. In particular, we want a function that takes the bounded range of \(P(y=1)\) (between 0 and 1) and maps it onto the whole real line, from minus infinity to positive infinity, as in a typical regression model.

Assuming we know of a function with such characteristics, we can then apply this function to the dependent variable in the linear probability model like this:

\[ \begin{eqnarray} P(y = 1 | X) & = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon \;\;\;\; \text{(Linear Probability Model)}\\ f(P(y = 1 | X)) & = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon \;\;\;\; \text{(Functional Form Transformation)} \end{eqnarray} \]

There are different functions that can satisfy the previous conditions. The two most common functional forms are:

  1. The normal cumulative density function (probit model), and,
  2. The logistic function (logit model).

The logistic regression model (or logit) is the most common estimation technique when the dependent variable is binary. In this course we’ll only learn about the logit model, but the probit model implementation/interpretation is very similar. The figure below shows the general functional form of a logistic regression:

  • when \(x\) is very large, \(y\) is almost certainly 1 (assuming \(\beta\) is positive), but bounded.
  • when \(x\) is sufficiently small (or even negative), \(y\) is almost certainly 0, but bounded.

In the next section we’ll derive this functional form and introduce it to the regression model.

The odds ratio

  • The odds ratio of a binary variable \(y\) is: \[\dfrac{P(y=1)}{P(y=0)} \;\; \text{or} \;\; \dfrac{P(y=1)}{1-P(y=1)}, \;\;\; \text{since} \;\; y \;\; \text{is either} \;\; 0 \;\; \text{or} \;\; 1\]

  • An odds ratio represents how likely \(y\) is to be 1 relative to being 0.

  • For example:

If \(P(y=1)\) is 0.75, then \(P(y=0) = 1 - P(y=1) = 1 - 0.75 = 0.25\), then the odds ratio is \(P(y=1)/P(y=0) = 0.75/0.25 = 3\) or 3:1 (three to one). In other words, the variable \(y\) is three times more likely to be one than zero.

  • If \(y\) is a gamble and you have a 75% chance of winning and a 25% chance of losing, you are three times more likely to win than to lose; if the odds double, they become 6:1; if they halve, they become 1.5:1.

  • As \(P(y=1)\) gets closer to 1, the odds ratio goes to infinity, and as \(P(y=1)\) gets closer to 0, the odds ratio goes to 0.

\[ \begin{eqnarray} \lim_{P(y=1) \rightarrow 1} \dfrac{P(y=1)}{P(y=0)} = \lim_{P(y=1) \rightarrow 1} \dfrac{P(y=1)}{1-P(y=1)} = \infty \end{eqnarray} \] and,

\[ \begin{eqnarray} \lim_{P(y=1) \rightarrow 0} \dfrac{P(y=1)}{P(y=0)} = \lim_{P(y=1) \rightarrow 0} \dfrac{P(y=1)}{1-P(y=1)} = 0 \end{eqnarray} \]

Then, the range of the odds ratio as a function of \(P(y=1)\) is \((0, \infty)\), since \(P(y=1)\) is bounded by \((0, 1)\).
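As a quick check in R (a small illustrative snippet), we can compute the odds for a few probabilities, including the 3:1 example above:

# Odds for a few values of P(y = 1)
p <- c(0.10, 0.50, 0.75, 0.90, 0.99)
odds <- p / (1 - p)
round(odds, 2)  # 0.11 1.00 3.00 9.00 99.00 -- the odds grow without bound as p approaches 1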

The natural logarithm of the odds ratio

  • You may be wondering: if we use the odds ratio instead of \(y\) as the dependent variable, then we have a continuous dependent variable that maps the values of \(P(y=1)\) from the original zero-to-one range onto the zero-to-infinity range. With this transformation we get rid of the upper bound, but the dependent variable is still bounded below by zero. A way around that is to apply a natural logarithm transformation to the odds ratio.

  • We can apply the natural logarithm function to the odds ratio; we call this transformation the log odds or logit.

  • The natural log of the odds ratio \(P(y=1)/P(y=0)\) ranges from negative infinity (since the log goes to minus infinity as its argument approaches 0) to positive infinity (since the log of infinity is infinity). Formally,

\[ \begin{eqnarray} \lim_{P(y=1) \rightarrow 1} \ln(\dfrac{P(y=1)}{P(y=0)}) = \lim_{P(y=1) \rightarrow 1} \ln(\dfrac{P(y=1)}{1-P(y=1)}) = \ln(\infty) = \infty \end{eqnarray} \] , and, \[ \begin{eqnarray} \lim_{P(y=1) \rightarrow 0} \ln(\dfrac{P(y=1)}{P(y=0)}) = \lim_{P(y=1) \rightarrow 0} \ln(\dfrac{P(y=1)}{1-P(y=1)}) = \ln(0) = -\infty \end{eqnarray} \]

See the previous learning module for a brief introduction to the natural logarithm function.

Natural Logarithms in the regression model

  • We apply the log odds or logit transformation to \(P(y=1)\) and write the following regression model:

\[ln(\frac{P(y=1)}{1-P(y=1)}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon\]

This is the logistic regression model.

This functional form satisfies conditions (1) and (2):

  1. A change in \(x\) causes a change in the probability of \(y=1\).
  2. The values of the dependent variable are unbounded, even when \(y\) is bounded.

In summary, in the logistic regression model:

  • The estimation procedure consists of transforming the dependent variable into a logit (the log of the odds ratio).

  • Then, use a linear regression on the log odds transformation to estimate the effect of \(x\) on the probability of \(y=1\). Even if the relation between \(x\) and the logit is linear, it implies a non-linear relation between \(x\) and \(y\) that is bounded between 0 and 1.

Deriving the relation between \(x\) and \(y\) in the logit model

With the previous transformation, the relation between \(x\) and \(y\) is neither constant nor easy to interpret. But we can do some algebraic manipulations that make this process much easier.

Let’s start with the new logistic regression model:

\[ \begin{eqnarray} ln(\frac{P(y=1)}{1-P(y=1)}) & = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon \end{eqnarray} \] Now, let’s use the exponential function \(exp(\cdot)\) to get the ratio \(\frac{P(y=1)}{1-P(y=1)}\),

\[ \begin{eqnarray} e^{ln(\frac{P(y=1)}{1-P(y=1)})} & = & e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon} \\ \frac{P(y=1)}{1-P(y=1)} & = & e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon} \\ \end{eqnarray} \] The next step is to solve for \(P(y=1)\) with some algebraic manipulations: \[ \begin{eqnarray} \frac{P(y=1)}{1-P(y=1)} & = & e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon} \\ \frac{1}{P(y=1)^{-1}-1} & = & e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon} \\ P(y=1)^{-1}& = & 1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon)} \\ P(y=1)& = & \dfrac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon)}} \;\;\;\; \text{, or,} \\ P(y=1)& = & \dfrac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon}}{1+e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon}} \\ \end{eqnarray} \]

This function is non-linear and bounded between zero and one. It is not easy to interpret directly, which is why, as you'll see in the next part, we prefer to interpret the estimated parameters of the model in terms of odds ratios instead of changes in \(P(y=1)\).
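A quick numerical check of this derivation (the function names logodds and invlogit are ours): applying the inverse logit to the log odds recovers the original probabilities.

# The logit (log odds) and its inverse, as derived above
logodds <- function(p) log(p / (1 - p))
invlogit <- function(z) exp(z) / (1 + exp(z))

p <- c(0.05, 0.25, 0.50, 0.75, 0.95)
invlogit(logodds(p))  # returns the original probabilities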

The intercept

Let’s figure out what’s the correct interpretation of the parameters in the logit model, starting with the intercept,

\[ln(\frac{P(y=1)}{1-P(y=1)}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon\]

  • Assume that \(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon = 0\), then:

  • \(\text{logit}(P(y=1)) = 0\), which implies that:

\[ \begin{eqnarray} ln(\frac{P(y=1)}{1-P(y=1)}) & = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon \\ ln(\frac{P(y=1)}{1-P(y=1)}) & = & 0 \\ e^{ln(\frac{P(y=1)}{1-P(y=1)})} & = & e^{0} \\ \frac{P(y=1)}{1-P(y=1)} & = & 1 \\ P(y=1) & = & 1-P(y=1) \;\;\;\; \text{or} \;\;\; \left(P(y=1) = P(y=0)\right) \\ 2P(y=1) & = & 1 \\ P(y=1) & = & 0.5 \end{eqnarray} \]

  • I.e., when the right hand side sums to 0, it is equally probable that \(y\) is \(1\) or \(0\).

Interpret \(\beta_0\) when all \(x\) are zero

  • When all the \(X\) values are 0, all we have left on the right is \(\beta_0\) – thus \(\beta_0\) is the only factor determining the value of the probability of \(y\) when all the \(x\) are zero, just like the intercept in a multiple regression determines the average value of \(y\) when all the \(x\) are 0.

  • If \(\beta_0\) is 0, then \(y\) is equally likely to be 0 or 1 when all the \(X\) are 0 (see previous slide);

  • If \(\beta_0\) is negative, then \(P(y=1) < 0.5\) when all \(x\) = 0;

  • If \(\beta_0\) is positive, \(P(y=1) > 0.5\) when all \(x\) = 0.
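A small numerical illustration of these three cases (using the same inverse logit function as before):

invlogit <- function(z) exp(z) / (1 + exp(z))
invlogit(0)     # 0.5: y = 1 and y = 0 are equally likely
invlogit(-1.5)  # below 0.5: a negative intercept when all x are 0
invlogit(1.5)   # above 0.5: a positive intercept when all x are 0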

The slope

Let's move on to how \(y\) relates to \(x\) across all values of \(x\).

Previously we showed that:

\[P(y=1) = \frac{e^{\beta_0 + \beta_1 x_1}}{1+e^{\beta_0 + \beta_1 x_1}}\]

  • The function of \(x\) on the right hand side is, for obvious reasons, known as the inverse logit and is sometimes written \(P(y=1) = \textrm{invlogit}(\beta_0 + \beta_1 x_1)\).
  • As \(x\) increases, \(P(y=1)\) goes from almost certainly 0 to almost certainly 1, and equals 0.5 when the right hand side is 0, which, if \(\beta_0 = 0\), happens at \(x = 0\).
  • Put most simply, the effect of \(x\) depends on the position on the curve, that is, on the predicted probability of the event happening versus not happening.

We can write a simple function in R to visualize the inverse logit. (You can try different parameter values and see how the function changes.)

library(ggplot2)
b0 <- 1
b1 <- 0.5
invlogit <- function(x){exp(b0 + b1*x)/(1+exp(b0 + b1*x))}
ggplot(data=data.frame(x=c(-10, 10)), aes(x)) + stat_function(fun=invlogit) + ylab("P(y=1)")

# Let's try now a greater value for b1
b1 <- 1
ggplot(data=data.frame(x=c(-10, 10)), aes(x)) + stat_function(fun=invlogit) + ylab("P(y=1)")

See how the change from 0 to 1 is more drastic in the second graph. Finally, if \(\beta_{1}\) is negative,

# Negative value for b1
b1 <- -0.5
ggplot(data=data.frame(x=c(-10, 10)), aes(x)) + stat_function(fun=invlogit) + ylab("P(y=1)")

Now the relation is inverted: the probability of \(y=1\) goes to zero as \(x\) increases.

Interpreting \(\beta_{i}\)

We can conclude that,

  • A 1-unit change in \(x\) will have different effects on \(P(y=1)\) depending on the value of \(x\): the relation is non-linear by construction, so the usual linear interpretation of the coefficient does not apply.

  • Looking at the previous graph, when \(x\) is very large, a 1-unit change in \(x\) produces a very slight change in \(P(y=1)\) (because the curve is very flat there), and the same is true when \(x\) is very negative. The largest effect of \(x\) on \(P(y=1)\) occurs around the middle of the curve, near the average values of \(x\), where the slope is steepest.

  • Interpreting the impact of a unit change in \(x\) on the logged odds of \(y\) is not straightforward, which is why we use the effect on the odds ratio instead.

Interpreting \(\beta\) - Calculate I

  • We can calculate the estimated probability \(\hat{P}(y=1)\) for a couple of specific \(x\) values and take the difference among those values as an approximation of the effect of \(x\) on \(y\).

  • For example, we can consider what is the effect of a 1-unit change in \(x_1\) starting from some \(\bar{x}_1\). Generally, we use the average value of \(x\) for the comparison.

  • Calculate \(\hat{P}(y=1 | x_{1} = \bar{x}_1)\) and \(\hat{P}(y=1 | x_{1} = \bar{x}_1 + 1)\) and take the difference to get the estimated effect of the change in \(x\).

So for the logit, the change in the probability that \(y=1\) when \(x\) moves from \(b\) to \(a\) is:

\[\hat{P}(y=1|x=a) - \hat{P}(y=1|x=b)\]

\[ = \textrm{invlogit}(a) - \textrm{invlogit}(b)\]

\[= \frac{e^{\beta_0 + \beta_1 a}}{1+e^{\beta_0 + \beta_1 a}} - \frac{e^{\beta_0 + \beta_1 b}}{1+e^{\beta_0 + \beta_1 b}}\]

  • Example

For example, assuming \(\beta_0 = 0\) and \(\beta_1 = 1\), the effect of a change in x from 2 to 4 is:

b0 <- 0
b1 <- 1
invlogit <- function(x){exp(b0 + b1*x)/(1+exp(b0 + b1*x))}
invlogit(4) - invlogit(2)
[1] 0.1012167

Therefore, the probability that \(y=1\) increases by about 10 percentage points when \(x\) increases from 2 to 4. But note that this effect is not always the same. See what happens when we compute the change in \(x\) from 4 to 6:

invlogit(6) - invlogit(4)
[1] 0.01551359

The effect is now only about 1.5 percentage points, which is drastically different; you should therefore be careful when interpreting a logistic regression.

Interpreting \(\beta\) - example

Research Question: What is the effect of the race of a defendant and a victim on whether the defendant gets the death sentence (y=1)?

Independent variables:

  • \(x_1 = 1\) if the defendant is white and zero otherwise;

  • \(x_2 = 1\) if the victim is white and zero otherwise.

Logistic model:

\[\text{logit}[P(y=1)] = \beta_0+ \beta_1 x_1 + \beta_2 x_2\] The estimated parameters:

  • \(\beta_0 = -3.596\)

  • \(\beta_1 = -0.868\)

  • \(\beta_2 = 2.404\).

Interpretation:

  • White defendants have a lower chance of getting the death penalty,

  • If the victim is white, that increases the defendant's chance.

  • Specifically, if we want to know the probability of the death penalty if the defendant is non-white (\(x_1=0\)) and victim is white (\(x_2=1\)) that’s just:

\[\hat{P}(y=1 | x_{1} = 0, x_{2}=1) = \frac{e^{-3.596-0.868*0 + 2.404*1}}{1+e^{-3.596-0.868*0+2.404*1}}= 0.233\]

  • We can calculate \(\hat{P}(y=1 | x_{1} = 1, x_{2}=1)\) and take the difference to get the overall effect of a shift in the race of the defendant from non-white to white (or vice versa).
\[\begin{eqnarray*} \text{Effect of defendant race} & = & \hat{P}(y=1 | x_{1} = 0, x_{2}=1) - \hat{P}(y=1 | x_{1} = 1, x_{2}=1) \\ & = & \frac{e^{-3.596-0.868*0 + 2.404*1}}{1+e^{-3.596-0.868*0+2.404*1}} - \frac{e^{-3.596-0.868*1 + 2.404*1}}{1+e^{-3.596-0.868*1+2.404*1}}\\ & \approx & 0.12 \end{eqnarray*}\]

Then, the probability of getting the death sentence is about 12 percentage points higher when the defendant is non-white (holding the victim's race fixed at white). The ratio of these two probabilities for non-white versus white defendants is:

\[\begin{eqnarray*} \text{Probability Ratio} & = & \dfrac{\hat{P}(y=1 | x_{1} = 0, x_{2}=1)}{\hat{P}(y=1 | x_{1} = 1, x_{2}=1)} \\ & = & \dfrac{\frac{e^{-3.596-0.868*0 + 2.404*1}}{1+e^{-3.596-0.868*0+2.404*1}}}{\frac{e^{-3.596-0.868*1 + 2.404*1}}{1+e^{-3.596-0.868*1+2.404*1}}}\\ & \approx & 2 \end{eqnarray*}\]

Then, non-white defendants are two times more likely to receive the death sentence than white defendants.
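We can verify these numbers in R with a short sketch that plugs the reported coefficients into the inverse logit (the object names are ours):

# Reported coefficients from the death-penalty example
b0 <- -3.596
b1 <- -0.868  # defendant is white
b2 <- 2.404   # victim is white
invlogit <- function(z) exp(z) / (1 + exp(z))

pNonWhiteDef <- invlogit(b0 + b1*0 + b2*1)  # non-white defendant, white victim: about 0.233
pWhiteDef    <- invlogit(b0 + b1*1 + b2*1)  # white defendant, white victim: about 0.113

pNonWhiteDef - pWhiteDef  # about 0.12
pNonWhiteDef / pWhiteDef  # about 2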

  • Presumably there may also be an interaction effect, where if the defendant is non-white and the victim is white, that boosts the odds of the death penalty by even more than the sum of the two terms.

  • Also, we have to consider other control variables that account for omitted variable bias (the type of crime, state laws, culture, etc.), but we won't get into that here.

  • In summary, the sign of the estimated parameter indicates the direction of the relation between the variable and the probability of \(y=1\); the magnitude of the effect cannot be read directly from the regression table, as we need specific values of \(x\).

  • A general practice is to consider a unit increase in \(x\) at the average value of \(x\).

Interpreting Odds ratio effects I

  • We have now seen that the interpretation of the parameters is not as straightforward: the estimated parameter indicates the change in the logged odds of \(y\) per unit change in \(x\).

  • A more natural way to interpret \(\beta_1\) is to return to the odds ratios: converting the coefficients back into the odds of \(y\) taking on the value of 1, per unit increase in \(x\).

The odds ratio formula:

\[\frac{P(y=1)}{1-P(y=1)} = e^{\beta_0 + \beta_1 x_1}\]

which can be rearranged as: \[\frac{P(y=1)}{1-P(y=1)} = e^{\beta_0}e^{\beta_1 x_1}\]

  • Each 1-unit increase in \(x_1\) multiplies the right hand side by another factor of \(e^{\beta_1}\) (in the same way that going from \(a^3\) to \(a^4\) is equivalent to multiplying \(a^3\) by another factor of \(a\)). That is,
\[\begin{eqnarray*} \frac{P(y=1 | x_1 + 1 )}{1-P(y=1 | x_1 + 1)} & = & e^{\beta_0}e^{\beta_1 (x_1 + 1)} \\ & = & e^{\beta_0}e^{\beta_1 x_1 + \beta_1} \\ & = & \left(e^{\beta_0}e^{\beta_1 x_1}\right)e^{\beta_1} \\ & = & \frac{P(y=1)}{1-P(y=1)} \times e^{\beta_1} \\ & = & \text{Original Odds Ratio} \times e^{\beta_1} \end{eqnarray*}\]
  • So each 1-unit change in \(x_1\) multiplies the odds ratio by \(e^{\beta_1}\).

  • To return to our example, shifting \(x_1\) from 0 to 1 changes the odds by \(e^{\beta_1}\) or \(e^{-0.868} = 0.42\). (In R: exp(-0.868))

Interpreting Odds ratio effects II

  • Values of \(e^{\beta_1}\) greater than 1 show an increased odds of observing a 1 on the dependent variable.

  • Values of \(e^{\beta_1}\) less than 1 show a decreased odds of observing a 1 on the dependent variable.

  • Note that \(e^{\beta_1} > 1 \longleftrightarrow \beta_{1} > 0\) and \(e^{\beta_1} < 1 \longleftrightarrow \beta_{1} < 0\). Then, the sign of the estimated parameter indicates whether an increase in \(x_{1}\) causes the odds ratio to increase or decrease.

  • We can also express the change in the odds as a percent change. To do so, we use the formula:

\[\text{Percent change} = 100 \times (e^{\beta_1} - 1)\]
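For instance, with the defendant-race coefficient from the example (a small illustrative computation):

# Converting a logit coefficient into a percent change in the odds
beta1 <- -0.868
exp(beta1)              # the odds are multiplied by about 0.42
100 * (exp(beta1) - 1)  # about -58: the odds fall by roughly 58 percent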

Hypothesis tests and significance

  • Calculating the significance is not so different from linear regression!

Standard Errors

  • We can again look at the standard error of each \(\beta\) and use it to test whether that \(\beta\) is significantly different from 0.

Confidence Intervals

We can also construct a confidence interval around \(\beta\), but sometimes we are interested in the confidence interval not of \(\beta\) itself (which is hard to interpret), but of the odds-ratio effect.

Confidence interval for the coefficient:

  • With, for example, \(se\)=0.601: 95% CI (assuming a large n) is \(2.404 \pm 1.96 * 0.601 = [1.23,3.58]\).

  • Because the 95% CI for \(\beta\), \([1.23,3.58]\), doesn't include 0 (just as in regular regression), we can reject the null hypothesis that the victim's race has no effect.

Confidence interval for the odds ratio (a more interpretable quantity):

  • \([e^{1.23},e^{3.58}]\), or \([3.4,35.9]\)

  • I.e., our best guess is that the victim being white multiplies the odds of the defendant getting the death penalty by a factor of \(e^{2.404} \approx 11.1\), but we are 95% sure it is between 3.4 and 35.9.

  • Because the 95% CI for the odds effect doesn't include 1, we can reject the null hypothesis that the victim's race has no effect (multiplying the odds by 1 doesn't change them…).
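These intervals are easy to reproduce in R (a brief sketch using the numbers above):

# 95% CI for the victim-race coefficient and its odds version
beta <- 2.404
se   <- 0.601
ciBeta <- beta + c(-1.96, 1.96) * se
ciBeta       # about 1.23 to 3.58
exp(ciBeta)  # about 3.4 to 35.9
exp(beta)    # point estimate of the odds effect: about 11.1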

T-statistic

We can also just generate our test statistic of \(2.404/0.601 = 4\) and conclude that that is well above the threshold of 1.96, etc.

In summary, having a binary dependent variable causes no changes in the way we conduct hypothesis tests in the regression model.

Estimating a Logit model

  • Unfortunately, we cannot estimate a logit model using the lm command as we did for all the previous models. Instead we need to use the glm function.

  • glm stands for Generalized Linear Model, and it is a more general function for estimating regression models.

  • It has a similar structure to lm, but requires you to specify the family of models that you want to use in the estimation.

  • In the case of the logit model, we need to specify the option family = binomial.

Consider this brief simulation of the process of estimating a logit model.

n <- 200 # Number of observations

# Simulating x variable
x <- runif(n, min = -5, max = 5)

# Setting parameter values
b0 <- 0
b1 <- 1.25

# Simulating error term
epsilon <- rnorm(n)

# Simulating probabilities of y=1 using the invlogit function
py <- exp(b0 + b1*x + epsilon) / (1 + exp(b0 + b1*x + epsilon))

# Simulating realizations of y=1
y <- sapply(py, FUN = function(p) sample(c(0,1), size = 1, replace = TRUE, prob = c(1-p, p)))

# Visualizing data simulation
db <- data.frame(x = x, y = y, py = py)
ggplot() + geom_point(data = db, aes(x = x, y = y), color = "black") + geom_point(data = db, aes(x = x, y = py), color = "red")

In the previous figure, the red dots represent the simulated probabilities of the logit model, which we don’t actually observe in real data. What we’ll observe are simply the black dots, which as you can see only take the values of zero and one.

Simulation

Let’s now use the glm function to estimate the model.

# Estimating logit, very similar to lm, just add
# family = "binomial"
logit <- glm(y ~ x, data = db, family = "binomial")

# Using stargazer for output
suppressMessages(library(stargazer))
stargazer(logit, header = FALSE, type = "html")
Dependent variable: y
                     logistic
x                    1.137*** (0.156)
Constant             0.281 (0.233)
Observations         200
Log Likelihood       -60.826
Akaike Inf. Crit.    125.653
Note: *p<0.1; **p<0.05; ***p<0.01

Unfortunately, the output of glm includes neither an F-statistic nor an \(R^2\).

Interpretation

  • The parameter of \(x\) is statistically different from zero and positive, which means that \(x\) increases the probability that \(y=1\).
  • Using \(e^{\beta_{x}} = e^{1.137} \approx\) 3.12: on average, a unit increase in \(x\) multiplies the odds of \(y=1\) relative to \(y=0\) by about 3.12.
  • Using the invlogit formula we can compute the effect on \(P(y=1)\) for different values of \(x\). For instance, from \(x=-2\) to \(x=-1\) the effect is that \(P(y=1)\) increases by about 15 percentage points, and from \(x=-1\) to \(x=0\) the change is about 28 percentage points.
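The same quantities can be computed directly from the estimated coefficients; since the simulation above does not fix a random seed, your exact numbers will differ slightly from run to run. A short sketch (the helper name invlogitFit is ours):

# Effects computed from the estimated coefficients of the simulated model
b <- coef(logit)  # intercept and slope from glm()
invlogitFit <- function(z) exp(z) / (1 + exp(z))

exp(b["x"])  # multiplicative effect of a unit increase in x on the odds

# Changes in P(y = 1) for two different one-unit moves in x
invlogitFit(b[1] + b[2] * (-1)) - invlogitFit(b[1] + b[2] * (-2))
invlogitFit(b[1] + b[2] * 0)    - invlogitFit(b[1] + b[2] * (-1))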

This is a graphical representation of how the estimated model fits the data, relative to a linear model.

# Estimating logit, very similar to lm, just add
# family = "binomial"
logit <- glm(y ~ x, data = db, family = "binomial")


# Comparing fitted values
gplot <- ggplot() + geom_point(data = db, aes(x = x, y = y), color = "black")

# Adding logit model
gplot <- gplot + geom_smooth(data = db, aes(x = x, y = y), method = "glm", formula = y ~ x, method.args = list(family = "binomial"), color = "red", se = FALSE)

# Adding linear model
gplot <- gplot + geom_smooth(data = db, aes(x = x, y = y), method = "lm", formula = y ~ x, color = "blue", se = FALSE)

# display graph
gplot 

# Estimating linear probability model for comparison
linear <- lm(y ~ x, data = db)

# Using stargazer for output
stargazer(list(linear, logit), header = FALSE, type = "html")
Dependent variable: y
                       OLS (1)                logistic (2)
x                      0.139*** (0.008)       1.137*** (0.156)
Constant               0.524*** (0.023)       0.281 (0.233)
Observations           200                    200
R2                     0.578
Adjusted R2            0.576
Log Likelihood                                -60.826
Akaike Inf. Crit.                             125.653
Residual Std. Error    0.324 (df = 198)
F Statistic            271.158*** (df = 1; 198)
Note: *p<0.1; **p<0.05; ***p<0.01

Real world example: Smoking regulation (1/4)

This example is based on the empirical exercise in Stock and Watson's Introduction to Econometrics.

  • We are trying to estimate the effect of workplace smoking bans on smoking using data on a sample of 10,000 U.S. indoor workers from 1991 to 1993.

  • The data set can be loaded from the AER package using the data("SmokeBan") command. It includes the following variables,

    • smoker is the individual a current smoker? (yes/no)
    • ban is there a work area smoking ban? (yes/no)
    • age measured in years
    • education factor indicating the highest education level attained: high school drop out, high school graduate, some college, college graduate, master's degree (or higher)
    • afam is the individual African-American? (yes/no)
    • hispanic is the individual Hispanic? (yes/no)
    • gender is the individual female? (yes/no)

Real world example: Smoking regulation (2/4)

Let’s start with loading data and taking a quick look at it:

# Loading Data
suppressMessages(library(AER))
data(SmokeBan)

head(SmokeBan)
  smoker ban age    education afam hispanic gender
1    yes yes  41           hs   no       no female
2    yes yes  44 some college   no       no female
3     no  no  19 some college   no       no female
4    yes  no  29           hs   no       no female
5     no yes  28 some college   no       no female
6     no  no  40 some college   no       no   male

Let’s manually compute the probability of smoking for (i) all workers, (ii) workers affected by workplace smoking bans, and, (iii) workers not affected by workplace smoking bans.

# Creating binary version of smoker
binarySmoker <- ifelse(SmokeBan$smoker == "yes", 1, 0)

# Probability of smoking of all workers
pSmokingAll <- mean(binarySmoker)

# Probability of smoking of workers with a ban
pSmokingBan <- mean(binarySmoker[SmokeBan$ban == "yes"])

# Probability of smoking of workers with no ban
pSmokingNoBan <- mean(binarySmoker[SmokeBan$ban == "no"])

# Let's make a quick barplot to see the difference in probabilities
pSmoking <- data.frame(y = c(pSmokingNoBan, pSmokingAll, pSmokingBan), ban = c("With No Ban", "All Workers", "With Ban"))

ggplot(data = pSmoking, aes(x = ban, y = y)) + geom_bar(stat = "identity") + scale_x_discrete(limits= c("With No Ban", "All Workers", "With Ban")) + theme_classic()

There seems to be a reduction in the probability of being a smoker when workplace smoking bans are enforced. But, we know that without controlling for other variables, this difference may be the result of bias; this seems like a job for a multivariate regression model.

Real world example: Smoking regulation (3/4)

# Using education = "hs drop out" as the comparison category
SmokeBan$education <- relevel(SmokeBan$education, ref = "hs drop out")

# Adding the binary version of smoker to the dataset
SmokeBan$binarySmoker <- binarySmoker

# Estimating linear probability model
linear <- lm(binarySmoker ~ ban + age + I(age^2) + education + afam + hispanic + gender, data = SmokeBan)

# Estimating logit model
logit <- glm(binarySmoker ~ ban + age + I(age^2) + education + afam + hispanic + gender, data = SmokeBan, family = "binomial")

# Creating covariate labels
variableLabels <- c("Ban (yes = 1)",
                    "Age",
                    "Age2",
                    "Education (High School = 1)",
                    "Education (Some College = 1)",
                    "Education (College degree = 1)",
                    "Education (Master or higher = 1)",
                    "African-American (yes = 1)",
                    "Hispanic (yes = 1)",
                    "Gender (Female = 1)")
# Display output
stargazer(list(linear, logit), header = FALSE, type = "html",
          covariate.labels = variableLabels)
Dependent variable: binarySmoker
                                    OLS (1)                 logistic (2)
Ban (yes = 1)                       -0.047*** (0.009)       -0.262*** (0.049)
Age                                 0.010*** (0.002)        0.060*** (0.012)
Age2                                -0.0001*** (0.00002)    -0.001*** (0.0001)
Education (High School = 1)         -0.090*** (0.016)       -0.438*** (0.084)
Education (Some College = 1)        -0.158*** (0.017)       -0.787*** (0.087)
Education (College degree = 1)      -0.278*** (0.018)       -1.570*** (0.102)
Education (Master or higher = 1)    -0.323*** (0.020)       -2.017*** (0.132)
African-American (yes = 1)          -0.028* (0.016)         -0.156* (0.090)
Hispanic (yes = 1)                  -0.105*** (0.014)       -0.597*** (0.083)
Gender (Female = 1)                 -0.033*** (0.009)       -0.191*** (0.049)
Constant                            0.309*** (0.042)        -0.982*** (0.243)
Observations                        10,000                  10,000
R2                                  0.057
Adjusted R2                         0.056
Log Likelihood                                              -5,233.999
Akaike Inf. Crit.                                           10,490.000
Residual Std. Error                 0.416 (df = 9989)
F Statistic                         60.371*** (df = 10; 9989)
Note: *p<0.1; **p<0.05; ***p<0.01

Interpretation

  • Both models predict that the smoking ban reduces the probability of smoking, and the effect is statistically different from zero.

  • The linear probability model predicts that the smoking ban reduces the probability of being a smoker by about 5 percentage points (the coefficient is -0.047).

  • The logit model predicts that the odds of being a smoker relative to not being a smoker change by a factor of \(e^{\beta_{ban}} = e^{-0.262} \approx\) 0.7695.

  • In the linear regression model, the effect is always the same, regardless of the value of the other control variables.
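As a quick illustration, the odds-ratio effect of the ban can be pulled out of the fitted model directly (assuming R's default dummy coding, which names the coefficient banyes):

# Odds-ratio interpretation of the ban coefficient
exp(coef(logit)["banyes"])              # about 0.77
100 * (exp(coef(logit)["banyes"]) - 1)  # roughly a 23 percent reduction in the odds of smoking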

Real world example: Smoking regulation (4/4)

  • In the logistic model, the effect of the \(x\) variable depends on the values of the other control variables. For instance, consider two individuals who are 30 years old, with a college degree, not African-American, and not Hispanic, but one is male and the other is female. Do you think the effect of the ban will be the same for both genders?

To answer this question we can use the invlogit function:

# Data for male without ban
dataMaleNoBan <- c(0,    # no ban
                   30,   # age
                   30^2, # age^2
                   0,    # high school
                   0,    # some college
                   1,    # college
                   0,    # master
                   0,    # african-american
                   0,    # hispanic
                   0)    # male

# Data for male with ban
dataMaleBan <- dataMaleNoBan
dataMaleBan[1] <- 1 # Setting ban to 1

# Data for female without ban
dataFemaleNoBan <- dataMaleNoBan
dataFemaleNoBan[10] <- 1 # Setting gender to 1

# Data for female with ban
dataFemaleBan <- dataMaleBan
dataFemaleBan[10] <- 1 # Setting gender to 1

# Writing invlogit function
invlogit <- function(betas, data) {
  pY <- exp(betas[1] + betas[-1] %*% data) / (1 + exp(betas[1] + betas[-1] %*% data))
  return(pY)
}

# Note that I'm using the %*% (matrix multiplication) operator in this part
# because there are many explanatory variables, and each coefficient has to
# be multiplied by its corresponding value to build the linear predictor.
 
# Computing smoking probabilities for males and females, with and without a ban
betas <- logit$coefficients
pMaleBan <- invlogit(betas, dataMaleBan)         # male, with ban
pMaleNoBan <- invlogit(betas, dataMaleNoBan)     # male, without ban
pFemaleBan <- invlogit(betas, dataFemaleBan)     # female, with ban
pFemaleNoBan <- invlogit(betas, dataFemaleNoBan) # female, without ban

# Calculating changes in probabilities
effectForMales <- pMaleBan - pMaleNoBan
effectForFemales <- pFemaleBan - pFemaleNoBan

# Creating data.frame with data to make a plot
pData <- data.frame(male = c(pMaleBan, pMaleNoBan, effectForMales),
                    female = c(pFemaleBan, pFemaleNoBan, effectForFemales))
pData <- round(pData,4)*100
row.names(pData) <- c("Ban", "No Ban", "Difference")

# knitr and kableExtra packages for table output
library("knitr")
library("kableExtra")
kable(pData) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)  %>%
  footnote(general = "Predicted smoking probabilities for an individual that is 30 years old, \n with a college degree, that is not african-american, and not hispanic.")
             male    female
Ban          14.77   12.52
No Ban       18.38   15.69
Difference   -3.61   -3.16
Note: Predicted smoking probabilities for an individual that is 30 years old, with a college degree, that is not African-American, and not Hispanic.

Therefore, the effect of the smoking ban will be greater for males than for females.
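As an alternative to writing the inverse logit by hand, the same predicted probabilities can be obtained with predict() and type = "response". This is a sketch that assumes the factor levels shown earlier in the data (for example "college" for the education level):

# Same predictions using predict() instead of the manual invlogit function
newProfiles <- data.frame(ban       = c("no", "yes", "no", "yes"),
                          age       = 30,
                          education = "college",
                          afam      = "no",
                          hispanic  = "no",
                          gender    = c("male", "male", "female", "female"))

# Predicted smoking probabilities for the four profiles
predict(logit, newdata = newProfiles, type = "response")

Using predict() avoids having to keep the order of the coefficients straight by hand.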