This lesson introduces the use of categorical variables in regression models.

After completing this module, students should be able to represent categorical variables with dummy variables, estimate regression models that include them, and interpret the estimated coefficients.

Reading: Schumacker, Chapter 17.
Recall that in earlier datasets variables like gender were coded as 0 when the individual is male and 1 if female, hispanic with 0 meaning non-Hispanic and 1 meaning Hispanic, etc.

A categorical variable is a discrete variable that represents different categories. The value of a categorical variable is one category out of a set of possible categories. For example, the manufacturer of a car is a discrete variable that can take the values manufacturer = {"toyota", "ford", "mercedes", ...}. We cannot use the variable manufacturer in a regression directly because it is not a numeric variable.
A tempting solution is to assign a number to each category, so that 1 represents toyota, 2 represents ford, 3 represents mercedes, etc. In R we would then save the categorical variable manufacturer as a numerical variable manufacturer = {1, 2, 3, ...}. With this transformation we can now run a regression, because manufacturer is technically a numeric variable. This approach to dealing with categorical variables is incorrect.

The reason is the following: imagine that we run a regression using manufacturer as previously defined and the estimated parameter is equal to 5. What is the interpretation of this result? The literal interpretation is that when a car's manufacturer changes from toyota to ford (1 to 2), or from ford to mercedes (2 to 3), the variable \(y\) changes by 5 units. This does not make any sense: the differences between toyota and ford, or between ford and mercedes, cannot be quantified, because car manufacturers are categories, not numbers. With this type of variable you cannot make the quantitative statements you can make with numbers; statements like toyota > ford are nonsensical.
The correct way to represent different categories is to use binary variables, often called dummy variables. A binary variable represents a single category using the values 0 and 1. In the case of car manufacturers, for each of the \(k\) manufacturers we create a dummy that takes the value 1 if the car is of that specific manufacturer and 0 otherwise.
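For example, a dummy for toyota can be created with a simple logical test (a minimal sketch; the manufacturer vector is made up for illustration):

# Hypothetical data: the dummy is 1 when the logical test is true
manufacturer <- c("toyota", "ford", "mercedes", "toyota")
toyotaDummy <- ifelse(manufacturer == "toyota", 1, 0)
toyotaDummy

## [1] 1 0 0 1

To see what the coefficient on a dummy measures, consider a bivariate regression on a single dummy variable \(x\):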
\[y = \beta_{0} + \beta_{1}x + \epsilon\]
\[ \begin{eqnarray} y(x = 0) & = & \beta_{0} + \epsilon \\ y(x = 1) & = & \beta_{0} + \beta_{1} + \epsilon \\ \end{eqnarray}\]
\[\Delta y = y(x = 1) - y(x = 0) = \beta_{0} + \beta_{1} + \epsilon - \beta_{0} - \epsilon = \beta_{1}\]
Therefore, the regression coefficient \(\beta_{1}\) for the categorical variable \(x\) represents the change in the \(y\) variable caused by the change in category (recall that 0 and 1 represent two different categories). In other words, the regression coefficient indicates the difference in means between groups or categories.
Sounds familiar? Yes! The t-test for a regression coefficient tells us whether the means of \(y\) (the expected values of the dependent variable) are significantly different across the two categories, just like the original t-test for the difference of means that we learned earlier in the semester.
Then, if we want to use a regression model to test whether the expected value of \(y\) differs between the two groups of a categorical variable, the process is as simple as running a regression with the dummy variable as the independent variable.
Imagine that we want to test if there is a statistically significant difference in earnings between males and females. The initial bivariate model would look like this:
\[\text{Earnings} = \beta_{0} + \beta_{1}\text{Gender}+ \epsilon\]
with \(\text{Gender}\) being a dummy variable with \(0=male\) and \(1=female\). Then, the expected earnings of a male individual are estimated by:
\[\text{Earnings (Gender = 0)} \equiv \text{Earnings}_{male} = \beta_{0} + \epsilon\] and for females, \[\text{Earnings (Gender = 1)} \equiv \text{Earnings}_{female} = \beta_{0} + \beta_{1} + \epsilon\] And the difference among groups is equal to, \[\text{Wage Gap} \equiv \text{Earnings}_{female} - \text{Earnings}_{male}= \beta_{1}\]
If \(\beta_{1}\) is statistically different from zero, then we conclude that there is a wage gap caused by gender, and the sign of \(\beta_{1}\) indicates which group earns more (if negative, males earn more; if positive, females earn more).
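To see the equivalence in action, here is a minimal sketch with simulated data (all numbers and variable names are made up for illustration):

# Simulating earnings with a built-in gap of -3 for females
set.seed(123)
nSim <- 500
gender <- rbinom(nSim, size = 1, prob = 0.5) # 0 = male, 1 = female
earnings <- 20 - 3*gender + rnorm(nSim, sd = 5)
# The dummy coefficient equals the difference in group means
coef(lm(earnings ~ gender))["gender"]
mean(earnings[gender == 1]) - mean(earnings[gender == 0])
# Its t-test matches the classic two-sample t-test with equal variances
t.test(earnings ~ gender, var.equal = TRUE)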
Obviously, this bivariate model is very likely to be biased if we don’t control for other factors (why?).
A bit trickier is the scenario where we have more than two categories. In that case, we cannot represent the variable with a single dummy.
Imagine that we have a categorical variable with \(k\) categories, in a regression model we can represent that variable with \(k-1\) binary variables:
\[y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{k-1}x_{k-1} + \epsilon\]
\[y(x_{1} = x_{2} = ... = x_{k-1} = 0) \equiv y_{k} = \beta_{0} + \epsilon\]
In other words, if all the \(k-1\) dummies are equal to zero, the observation belongs to the remaining category \(k\).
If we include all \(k\) categories, then the sum of all the dummy variables is equal to a vector of ones, which is exactly the column we use to estimate the intercept. The column of ones is then a perfect linear combination of all the dummies, which causes perfect multicollinearity (see last week's learning module), and the parameters cannot be estimated. This issue is called the dummy variable trap. Therefore, you should always exclude at least one category from the regression.
This implies that the estimated coefficient for each category represents the effect of that category relative to the excluded category. Formally,
\[ \begin{eqnarray} y(x_{1} = x_{2} = ... = x_{k-1} = 0) \equiv y_{k} & = & \beta_{0} + \epsilon \\ y(x_{i} = 1 \;\; \text{and} \;\; x_{-i} = 0) & = & \beta_{0} + \beta_{i} + \epsilon \\ \end{eqnarray}\] where \(x_{-i}\) represents all the categories different from \(i\). Then,
\[\Delta y = y(x_{i} = 1) - y(x_{1} = x_{2} = ... = x_{k-1} = 0) = \beta_{0} + \beta_{i} + \epsilon - \beta_{0} - \epsilon = \beta_{i}\]
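You can see both the \(k-1\) representation and the dummy variable trap directly with R's model.matrix function (a quick sketch; the factor values are made up for illustration):

# model.matrix builds the design matrix R would use in a regression
f <- factor(c("toyota", "ford", "mercedes", "toyota"))
model.matrix(~ f) # intercept plus k - 1 = 2 dummies; "ford" is the base level
# With all k dummies and no intercept, the columns sum to a vector of ones,
# which is why the intercept and all k dummies cannot coexist
rowSums(model.matrix(~ f - 1))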
To create dummy variables in R we have two basic procedures:

Manually: using a categorical variable, you can write a piece of code that generates binary variables for each category. There are different methods to do so, but they all rely on some logical test. In the next slides I'll show you sample code on how to generate dummies manually.

Factors: categorical variables are known as factors in R. When using a factor in a regression, R will automatically create the necessary dummies for the analysis.
We always need \(k-1\) dummies if there are \(k\) categories.
One of the categories is always omitted and becomes the default or base category. (Recall that we cannot include all categories because of perfect multicollinearity)
Then, if we have a variable with \(k\) categories we need to generate \(k-1\) dummies. Sometimes it is more convenient to create all \(k\) dummies and then exclude one category when running the regression; the reason is that you may want to change the default category later on.
This is sample code showing how to manually create dummies in R.
# Generating a vector with random values
# for categories "a", "b", "c".
n <- 100 # Number of observations
categories <- c("a", "b", "c")
categoricalVariable <- sample(categories, size = n, replace = TRUE)
head(categoricalVariable)
## [1] "b" "c" "c" "c" "a" "c"
k <- length(categories) # Number of categories
# Creating a matrix of zeros
dummyMatrix <- matrix(0, nrow = n, ncol = k)
# Set value to 1 in column i when the row is equal to category i
for(i in 1:k)
{
dummyMatrix[,i] <- ifelse(categoricalVariable == categories[i], 1, 0)
}
# Naming the columns with each category
colnames(dummyMatrix) <- categories
head(dummyMatrix)
## a b c
## [1,] 0 1 0
## [2,] 0 0 1
## [3,] 0 0 1
## [4,] 0 0 1
## [5,] 1 0 0
## [6,] 0 0 1
# Verifying that the data is correct
head(cbind(categoricalVariable, dummyMatrix))
## categoricalVariable a b c
## [1,] "b" "0" "1" "0"
## [2,] "c" "0" "0" "1"
## [3,] "c" "0" "0" "1"
## [4,] "c" "0" "0" "1"
## [5,] "a" "1" "0" "0"
## [6,] "c" "0" "0" "1"
# Finally, you can convert the dummy matrix to a data.frame
# using the as.data.frame command
df <- as.data.frame(dummyMatrix)
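As an aside, the same dummy matrix can be produced in one line with model.matrix (an alternative sketch to the loop above; the "- 1" removes the intercept so that all \(k\) indicator columns are kept):

# One-line alternative: column names get prefixed with the variable name
dummyMatrix2 <- model.matrix(~ categoricalVariable - 1)
head(dummyMatrix2)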
The second procedure uses factors. You may have noticed the argument stringsAsFactors = FALSE when loading a dataset or creating a data.frame; the reason is that R by default assumes that any string variable in a dataset represents categories and proceeds to make the conversion to factor automatically. To convert a string variable into a factor we use as.factor:

factorVariable <- as.factor(categoricalVariable)
factorVariable
## [1] b c c c a c c a b a c c b c b b a b c c a b a b c b c c c b b a b b c
## [36] b c b a b a a c b a c b c c c c b c b a c b c c b a c b a a a c b b a
## [71] b b c c a b a a b c b a c c c a c a c b a a b c b a c a c b
## Levels: a b c
We called as.factor on a string variable with categories, and the output looks just like any string variable. But notice how at the end there is a line that states Levels: a b c. This indicates that R understands that each observation belongs to one of three different categories or levels. This means we can estimate a regression using factorVariable instead of explicitly creating the \(k-1\) dummies.

Now let's run some regressions with categorical independent variables.
First, let's estimate a regression with the manually created dummies using the lm command. I'm going to simulate some \(y\) data with the dummyMatrix from before. Recall that we need to exclude one category.

# Simulating y
y <- 10 + 2*df$b - 3*df$c + rnorm(n)
df$y <- y
# Estimating regression.
reg <- lm(y ~ b + c, data = df)
# Using stargazer for output
suppressMessages(library(stargazer))
stargazer(reg, type = "html")
|                     | Dependent variable: y        |
|---------------------|------------------------------|
| b                   | 2.095*** (0.235)             |
| c                   | -2.788*** (0.228)            |
| Constant            | 9.891*** (0.175)             |
| Observations        | 100                          |
| R2                  | 0.845                        |
| Adjusted R2         | 0.842                        |
| Residual Std. Error | 0.911 (df = 97)              |
| F Statistic         | 263.862*** (df = 2; 97)      |
| Note:               | *p<0.1; **p<0.05; ***p<0.01  |
Note that I excluded the category a from the regression to avoid perfect multicollinearity. If you try to run the regression with a included, R will automatically drop one of the dummies. See:

# Estimating regression with all categories
reg <- lm(y ~ a + b + c, data = df)
# Using stargazer for output
suppressMessages(library(stargazer))
stargazer(reg, type = "html")
|                     | Dependent variable: y        |
|---------------------|------------------------------|
| a                   | 2.788*** (0.228)             |
| b                   | 4.883*** (0.214)             |
| c                   | (dropped)                    |
| Constant            | 7.103*** (0.146)             |
| Observations        | 100                          |
| R2                  | 0.845                        |
| Adjusted R2         | 0.842                        |
| Residual Std. Error | 0.911 (df = 97)              |
| F Statistic         | 263.862*** (df = 2; 97)      |
| Note:               | *p<0.1; **p<0.05; ***p<0.01  |
The alternative is to run the regression with factorVariable. Now, instead of adding \(k-1\) dummy variables, we just need to add the factor variable.

# Making data.frame with factorVariable and y
df2 <- data.frame(y, factorVariable)
# Estimating regression.
reg2 <- lm(y ~ factorVariable, data = df2)
# Using stargazer for output
suppressMessages(library(stargazer))
stargazer(reg2, type = "html")
|                     | Dependent variable: y        |
|---------------------|------------------------------|
| factorVariableb     | 2.095*** (0.235)             |
| factorVariablec     | -2.788*** (0.228)            |
| Constant            | 9.891*** (0.175)             |
| Observations        | 100                          |
| R2                  | 0.845                        |
| Adjusted R2         | 0.842                        |
| Residual Std. Error | 0.911 (df = 97)              |
| F Statistic         | 263.862*** (df = 2; 97)      |
| Note:               | *p<0.1; **p<0.05; ***p<0.01  |
When using a factor, R automatically drops one category: the first level becomes the reference or default category. In this case R dropped the category a. We can specify which category to use as the default with the relevel command:

df2$factorVariable <- relevel(df2$factorVariable, ref = "b") # sets "b" as the default/reference category
# Estimating regression with "b" as default category
reg2 <- lm(y ~ factorVariable, data = df2)
# Using stargazer for output
suppressMessages(library(stargazer))
stargazer(reg2, type = "html")
|                     | Dependent variable: y        |
|---------------------|------------------------------|
| factorVariablea     | -2.095*** (0.235)            |
| factorVariablec     | -4.883*** (0.214)            |
| Constant            | 11.986*** (0.156)            |
| Observations        | 100                          |
| R2                  | 0.845                        |
| Adjusted R2         | 0.842                        |
| Residual Std. Error | 0.911 (df = 97)              |
| F Statistic         | 263.862*** (df = 2; 97)      |
| Note:               | *p<0.1; **p<0.05; ***p<0.01  |
Tip: you can use the covariate.labels argument in stargazer to get better labels for the dummy variables when using a factor variable (see the sketch after the interpretation note below).

The interpretation of the estimated parameter \(\beta_{i}\) of a dummy variable \(i\) is simple; you read the parameter as:

"the estimated parameter \(\beta_{i}\) indicates that, relative to the default category, category \(i\) causes a \(\beta_{i}\) change in the dependent variable"
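Here is the sketch for the covariate.labels tip mentioned above (a hedged example; the labels are arbitrary and must follow the order in which the dummies appear in the table):

# Relabeling the factor dummies in the stargazer output
stargazer(reg2, type = "html",
          covariate.labels = c("Category a", "Category c"))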
Let's revisit the example from lesson 8: the regression of earnings on education, gender, and age in a sample of workers with at least a high school diploma.
dataCPS <- read.csv("/Users/econphd/Dropbox-NEU/Dropbox/Teaching/NEU/2019/PPUA5301/PPUA5301 - Summer 2019/Lectures/cps12.csv",sep=",", stringsAsFactors=FALSE)
mr2 <- lm(ahe ~ bachelor + female + age, data = dataCPS)
stargazer(mr2, type = "html")
|                     | Dependent variable: ahe      |
|---------------------|------------------------------|
| bachelor            | 8.319*** (0.227)             |
| female              | -3.810*** (0.230)            |
| age                 | 0.510*** (0.040)             |
| Constant            | 1.866 (1.188)                |
| Observations        | 7,440                        |
| R2                  | 0.180                        |
| Adjusted R2         | 0.180                        |
| Residual Std. Error | 9.678 (df = 7436)            |
| F Statistic         | 544.495*** (df = 3; 7436)    |
| Note:               | *p<0.1; **p<0.05; ***p<0.01  |
In this regression, female is a dummy with 1 meaning the individual is female, and bachelor is a dummy with 1 meaning the individual has a bachelor's degree. The default category for gender is male and for education is having a high school diploma. ahe is the average hourly earnings and age is measured in years.
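Stargazer reports significance with stars only; to check the exact p-values you can inspect the coefficient table directly (a quick sketch using the mr2 object estimated above):

# The column Pr(>|t|) holds the p-value of each coefficient
summary(mr2)$coefficients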
The expected earnings of males and females are statistically different (for the female dummy the p-value is less than or equal to 0.05). The same holds for education (for the bachelor dummy the p-value is less than or equal to 0.05). Now you can proceed to interpret the magnitude of the parameters:
The estimated parameter for female can be interpreted as: "relative to males, females' average hourly earnings are \(\$3.81\) less".
The estimated parameter for bachelor can be interpreted as: "relative to individuals with a high school diploma, those with a bachelor's degree earn about \(\$8.32\) more per hour on average".
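You can also attach confidence intervals to these estimated gaps (a minimal sketch using the mr2 object above):

# 95% confidence intervals for the gender and education gaps
confint(mr2, c("female", "bachelor"), level = 0.95)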
Note: Obviously the results from this regression are likely to be biased due to omitted variable bias (why do you think that’s the case?).
Categorical variables cannot be represented by arbitrary numeric codes; they have to be transformed into binary (dummy) variables before we can use them in a regression model.
In a regression model with a categorical variable with \(k\) categories, we only include \(k-1\) binary variables.
The estimated parameter of a dummy variable represents the effect on the \(y\) variable when the dummy changes from 0 to 1, that is, relative to the base category.
Dummies can be created manually or using the as.factor command.