
Categorical Variables

Overview

This lesson introduces the use of categorical variables in regression models.

Objectives

After completing this module, students should be able to:

  1. Estimate and interpret regressions using categorical independent variables.

Readings

Schumacker, Chapter 17.

Introduction

  • Previously, we used different methodologies to test hypotheses, e.g. t-test, F-test, ANOVA, etc. We then introduced the regression model as an alternative method that is more appropriate when the variables under analysis are continuous.
  • A regression model not only allows us to test hypotheses, but also makes it easier to control for other factors and provides a more natural framework to deal with the issue of causality.
  • It also allows us to compute the expected values (fitted values) of the dependent variable \(y\) given a set of independent variables \(\{x_{1}, x_{2}, ..., x_{k}\}\), and to assess how well the model fits the data in a relatively straightforward manner.
  • Last week we learned how to control for other factors to correct omitted variable bias (OVB), and how to detect and avoid the problem of multicollinearity.
  • In some of the estimated models, we used dummy variables as controls, i.e. variables whose values were coded as \(0\) or \(1\) to represent the different categories of a variable. For example, gender was coded as 0 when the individual is male and 1 when female, and hispanic as 0 for non-Hispanic and 1 for Hispanic.
  • In this lesson we are going to take a closer look at the role of dummy variables as independent variables in the regression model.

Definition

A categorical variable is a discrete variable that represents different categories. The value of a categorical variable is one category out of a set of possible categories. For example, the manufacturer of a car is a discrete variable that can take the values manufacturer = {"toyota", "ford", "mercedes", ...}. We cannot use the variable manufacturer in a regression directly because it is not a numeric variable.

  • One alternative is to store the different categories as numbers, for example: let 1 represent toyota, 2 represent ford, 3 represent mercedes, etc. In R we would then save the categorical variable manufacturer as a numerical variable manufacturer = {1, 2, 3, ...}. With this transformation we can now run a regression, because manufacturer is technically a numeric variable. This approach to dealing with categorical variables is incorrect.

The reason is the following: imagine that we run a regression using manufacturer as previously defined and the estimated parameter is equal to 5. What is the interpretation of this result? The literal interpretation is that when the car manufacturer changes from toyota to ford (1 to 2), or from ford to mercedes (2 to 3), the variable \(y\) changes by 5 units. This does not make any sense: the difference between toyota and ford, or between ford and mercedes, cannot be quantified, because car manufacturers are categories, not numbers. With this type of variable you cannot make quantitative statements the way you can with numbers; statements like toyota > ford are nonsensical.

The correct way to represent different categories is to use binary variables, often called dummy variables. A binary variable represents a single category using the values 0 and 1. In the case of car manufacturers, for each of the \(k\) manufacturers we create a dummy with value 1 if the car is of that specific manufacturer and 0 if it is not.
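To make the contrast concrete, here is a minimal sketch in R; the small manufacturer vector is made up purely for illustration.

# Hypothetical vector of car manufacturers (illustration only)
manufacturer <- c("toyota", "ford", "mercedes", "ford", "toyota")

# Incorrect: recoding categories as numbers imposes an ordering and a
# "distance" between manufacturers that does not exist
wrongCoding <- as.numeric(factor(manufacturer))
wrongCoding
## [1] 3 1 2 1 3

# Correct: one 0/1 dummy per manufacturer (one category will later be
# left out as the base category, as explained below)
ford     <- ifelse(manufacturer == "ford", 1, 0)
mercedes <- ifelse(manufacturer == "mercedes", 1, 0)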

Two categories

  • Let’s start with the basic bivariate regression model

\[y = \beta_{0} + \beta_{1}x + \epsilon\]

  • Now let’s assume that the variable \(x\) is not continuous, but binary (it can only take the values 0 or 1). If that’s the case, we can easily compute the value of \(y\) for each of the two possible values of \(x\):

\[ \begin{eqnarray} y(x = 0) & = & \beta_{0} + \epsilon \\ y(x = 1) & = & \beta_{0} + \beta_{1} + \epsilon \\ \end{eqnarray}\]

  • Then, the difference (change) in \(y\) when \(x\) goes from 0 to 1 is:

\[\Delta y = y(x = 1) - y(x = 0) = \beta_{0} + \beta_{1} + \epsilon - \beta_{0} - \epsilon = \beta_{1}\]

  • Therefore, the regression coefficient \(\beta_{1}\) for the categorical variable \(x\) represents the change in the \(y\) variable caused by the change in category (recall that 0 and 1 represent two different categories). In other words, the regression coefficient indicates the difference in means between the two groups or categories.

  • Sounds familiar? Yes! The t-test for the regression coefficient tells us whether the means (expected values) of the dependent variable \(y\) are significantly different across the two groups, just like the t-test for the difference of means that we learned earlier in the semester.

  • Then, if we want to use a regression model to test whether the expected value of \(y\) differs between the two groups of a categorical variable, the process is as simple as running a regression with the dummy variable as the independent variable, as the short simulation below illustrates.
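  • Here is a minimal simulation sketch of this equivalence (the data are simulated and the names x and y are just placeholders):

set.seed(1)
n <- 200
x <- rbinom(n, size = 1, prob = 0.5)   # binary independent variable (0/1)
y <- 5 + 3*x + rnorm(n)                # true beta_1 = 3

# The estimated slope on the dummy ...
coef(lm(y ~ x))["x"]

# ... is exactly the difference in group means
mean(y[x == 1]) - mean(y[x == 0])

# and its t-test matches the classic two-sample t-test,
# t.test(y ~ x, var.equal = TRUE)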

Example: Gender wage gap

Imagine that we want to test whether there is a statistically significant difference in earnings between males and females. The initial bivariate model would look like this:

\[\text{Earnings} = \beta_{0} + \beta_{1}\text{Gender}+ \epsilon\]

with \(\text{Gender}\) being a dummy variable with \(0=male\) and \(1=female\). Then, the expected earnings of a male individual are estimated by:

\[\text{Earnings (Gender = 0)} \equiv \text{Earnings}_{male} = \beta_{0} + \epsilon\]

and for females,

\[\text{Earnings (Gender = 1)} \equiv \text{Earnings}_{female} = \beta_{0} + \beta_{1} + \epsilon\]

The difference between the two groups is then equal to

\[\text{Wage Gap} \equiv \text{Earnings}_{female} - \text{Earnings}_{male}= \beta_{1}\]

  • If \(\beta_{1}\) is statistically different from zero, then we conclude that there is a wage gap caused by gender, and the sign of \(\beta_{1}\) indicates which group earns more (if negative, males earn more; if positive, females earn more).

  • Obviously, this bivariate model is very likely to be biased if we don’t control for other factors (why?).

More than two categories

  • The scenario is a bit trickier when we have more than two categories. In that case we cannot represent the variable with a single dummy.

  • Imagine that we have a categorical variable with \(k\) categories. In a regression model we can represent that variable with \(k-1\) binary variables:

\[y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{k-1}x_{k-1} + \epsilon\]

  • Note that we excluded the \(k\)th category from the model; this is not an oversight. The reason for excluding the \(k\)th category is that the model intercept \(\beta_{0}\) implicitly represents the excluded category, i.e.

\[y(x_{1} = x_{2} = ... = x_{k-1} = 0) \equiv y_{k} = \beta_{0} + \epsilon\]

  • In other words, if all \(k-1\) dummies are equal to zero, the observation must belong to the \(k\)th category.

  • If we include all \(k\) categories, then the sum of all the dummy variables equals a vector of ones, which is exactly the column of ones used to estimate the intercept. That column is then a perfect linear combination of the dummies, which causes perfect multicollinearity (see last week’s learning module), and the parameters cannot be estimated. This issue is called the dummy variable trap. Therefore, you should always exclude one category from the regression.

  • This implies that the estimated coefficient for each category represents the effect of that category relative to the excluded category. Formally,

\[ \begin{eqnarray} y(x_{1} = x_{2} = ... = x_{k-1} = 0) \equiv y_{k} & = & \beta_{0} + \epsilon \\ y(x_{i} = 1 \;\; \text{and} \;\; x_{-i} = 0) & = & \beta_{0} + \beta_{i} + \epsilon \\ \end{eqnarray}\] where \(x_{-i}\) represents all the categories different than \(i\). Then,

\[\Delta y = y(x_{i} = 1) - y(x_{1} = x_{2} = ... = x_{k-1} = 0) = \beta_{0} + \beta_{i} + \epsilon - \beta_{0} - \epsilon = \beta_{i}\]

Example: Estimating the education earnings premium using several dummy variables

Now imagine that we want to estimate the effect of education on earnings. We have a dataset of workers with at least a high school diploma and a variable for education with the following categories: high school, bachelor, master, and Ph.D. The categorical variable for education therefore takes four different values.

To represent this variable with dummies in a regression we need to create 3 dummies:

  1. bachelor,
  2. master, and,
  3. Ph.D.

We leave high school as the default category. Thus, all the results have to be interpreted as effects relative to having a high school diploma. The regression model would then look like this:

\[\text{Earnings} = \beta_{0} + \beta_{1}\text{Bachelor} + \beta_{2}\text{Master} + \beta_{3}\text{PhD} + \epsilon\]

  • The expected earnings of an individual with only a high school diploma are:

\[\text{Earnings (Bachelor = 0, Master = 0, PhD = 0)} \equiv \text{Earnings}_\text{high school} = \beta_{0} + \epsilon\]

  • The expected earnings of an individual with a bachelor degree are:

\[\text{Earnings (Bachelor = 1, Master = 0, PhD = 0)} \equiv \text{Earnings}_\text{bachelor} = \beta_{0} + \beta_{1} + \epsilon\]

  • The expected earnings of an individual with a master degree are:

\[\text{Earnings (Bachelor = 0, Master = 1, PhD = 0)} \equiv \text{Earnings}_\text{master} = \beta_{0} + \beta_{2} + \epsilon\]

  • The expected earnings of an individual with a PhD degree are:

\[\text{Earnings (Bachelor = 0, Master = 0, PhD = 1)} \equiv \text{Earnings}_\text{PhD} = \beta_{0} + \beta_{3} + \epsilon\]

  • Then, the education earnings premium - how much your earnings increase by having more education - can be estimated using the parameters of each category. For instance, the earnings premium for an individual with a master degree is:

\[\text{Earnings}_\text{master} - \text{Earnings}_\text{high school} = \beta_{0} + \beta_{2} + \epsilon - \beta_{0} - \epsilon = \beta_{2}\]
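A minimal sketch of how this model could be estimated in R using simulated data; the data frame workers, the premium values, and all variable names below are made up for illustration, with high school as the omitted base category.

# Simulating a hypothetical sample of workers
set.seed(2)
n <- 500
educ <- sample(c("high school", "bachelor", "master", "phd"), n, replace = TRUE)

# One dummy per category except the base category (high school)
workers <- data.frame(bachelor = as.numeric(educ == "bachelor"),
                      master   = as.numeric(educ == "master"),
                      phd      = as.numeric(educ == "phd"))

# Simulated hourly earnings with made-up premiums over high school
workers$earnings <- 30 + 10*workers$bachelor + 18*workers$master +
                    25*workers$phd + rnorm(n, sd = 5)

regEdu <- lm(earnings ~ bachelor + master + phd, data = workers)
coef(regEdu)
# The intercept estimates mean earnings for high school graduates; each slope
# estimates the earnings premium of that degree relative to high school.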

Coding dummies

To create dummy variables in R we have two basic procedures:

  • Manually: using a categorical variable, you can write a piece of code that generates binary variables for each category. There are different methods to do so, but they all rely on some logical test. In the next slides I’ll show you sample code for generating dummies manually.

  • Factors: categorical variables are known as factors in R. When using a factor in a regression, R will automatically create the necessary dummies for the analysis.

Manually

  • We always need \(k-1\) dummies if there are \(k\) categories.

  • One of the categories is always omitted and becomes the default or base category. (Recall that we cannot include all categories because of perfect multicollinearity)

  • Then, if we have a variable with \(k\) categories we need to generate \(k-1\) variables. Sometimes it is more convenient to create all \(k\) dummies and then exclude one of them when running the regression, because you may want to change the default category later on.

This is sample code showing how to manually create dummies in R.

# Generating a vector with random values
# for categories "a", "b", "c".
n <- 100 # Number of observations
categories <- c("a", "b", "c")
categoricalVariable <- sample(categories, size = n, replace = TRUE)
head(categoricalVariable)
## [1] "b" "c" "c" "c" "a" "c"
k <- length(categories) # Number of categories

# Creating a matrix of zeros
dummyMatrix <- matrix(0, nrow = n, ncol = k)

# Set value to 1 in column i when the row is equal to category i
for(i in 1:k)
{
  dummyMatrix[,i] <- ifelse(categoricalVariable == categories[i], 1, 0)
}

# Naming the columns with each category
colnames(dummyMatrix) <- categories
head(dummyMatrix)
##      a b c
## [1,] 0 1 0
## [2,] 0 0 1
## [3,] 0 0 1
## [4,] 0 0 1
## [5,] 1 0 0
## [6,] 0 0 1
# Verifying that the data is correct
head(cbind(categoricalVariable, dummyMatrix))
##      categoricalVariable a   b   c
## [1,] "b"                 "0" "1" "0"
## [2,] "c"                 "0" "0" "1"
## [3,] "c"                 "0" "0" "1"
## [4,] "c"                 "0" "0" "1"
## [5,] "a"                 "1" "0" "0"
## [6,] "c"                 "0" "0" "1"
# Finally, you can convert the dummy matrix to a data.frame
# using the as.data.frame command
df <- as.data.frame(dummyMatrix)
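As a quick check related to the dummy variable trap discussed earlier, note that the three dummy columns always add up to one, i.e. together they reproduce exactly the column of ones used to estimate the intercept:

# Every row contains exactly one 1, so the k dummies sum to the intercept column
all(rowSums(dummyMatrix) == 1)
## [1] TRUE
# Including all k dummies together with the intercept therefore causes
# perfect multicollinearity (the dummy variable trap).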

Factors

  • Categorical variables are known as factors in R.
  • You have probably wondered why we add the option stringsAsFactors = FALSE when loading a dataset or creating a data.frame. The reason is that, by default, R assumes that any string variable in a dataset represents categories and automatically converts it to a factor.
  • Let’s go over the previous example but using factors instead.
factorVariable <- as.factor(categoricalVariable)
factorVariable
##   [1] b c c c a c c a b a c c b c b b a b c c a b a b c b c c c b b a b b c
##  [36] b c b a b a a c b a c b c c c c b c b a c b c c b a c b a a a c b b a
##  [71] b b c c a b a a b c b a c c c a c a c b a a b c b a c a c b
## Levels: a b c
  • This may not look like a big deal: we just used the command as.factor on a string variable with categories, and the output looks just like any string variable. But notice the line at the end that states Levels: a b c. This indicates that R understands that each observation belongs to one of three different categories or levels. This can be used to estimate a regression with factorVariable instead of explicitly creating the \(k-1\) dummies.
  • Factors are convenient because they save space in your code and data.frame.
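  • If you want to see the dummies that R builds internally from a factor, one option is the model.matrix function, which returns the design matrix that lm will use:

# Design matrix implied by the factor: an intercept plus one dummy for each
# level except the first (a), which becomes the reference category
head(model.matrix(~ factorVariable))
# Columns: (Intercept), factorVariableb, factorVariablec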

Estimation

Now let’s run some regressions with categorical independent variables:

  • If the dummy variables are generated manually, we proceed just as before: simply add the variable names to the lm command. I’m going to simulate some \(y\) data with the dummyMatrix from before to estimate a regression. Recall that we need to exclude one category:
# Simulating y
y <- 10 + 2*df$b - 3*df$c + rnorm(n)

df$y <- y

# Estimating regression.
reg <- lm(y ~ b + c, data = df)

# Using stargazer for output
suppressMessages(library(stargazer))
stargazer(reg, type = "html")
                        Dependent variable: y
------------------------------------------------------
b                         2.095***  (0.235)
c                        -2.788***  (0.228)
Constant                  9.891***  (0.175)
------------------------------------------------------
Observations              100
R2                        0.845
Adjusted R2               0.842
Residual Std. Error       0.911 (df = 97)
F Statistic               263.862*** (df = 2; 97)
------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01
  • Note that we excluded the category a from the regression to avoid perfect multicollinearity. If you try to run the regression with all three dummies, R will automatically drop one of them. See:
# Estimating regression with all categories
reg <- lm(y ~ a + b + c, data = df)

# Using stargazer for output
suppressMessages(library(stargazer))
stargazer(reg, type = "html")
                        Dependent variable: y
------------------------------------------------------
a                         2.788***  (0.228)
b                         4.883***  (0.214)
c                         (dropped)
Constant                  7.103***  (0.146)
------------------------------------------------------
Observations              100
R2                        0.845
Adjusted R2               0.842
Residual Std. Error       0.911 (df = 97)
F Statistic               263.862*** (df = 2; 97)
------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

Using factors

  • Now let’s estimate the same regression using the previously generated factorVariable. This time, instead of adding \(k-1\) dummy variables, we just need to add the factor variable.
# Making data.frame with factorVariable and y
df2 <- data.frame(y, factorVariable)

# Estimating regression.
reg2 <- lm(y ~ factorVariable, data = df2)

# Using stargazer for output
suppressMessages(library(stargazer))
stargazer(reg2, type = "html")
                        Dependent variable: y
------------------------------------------------------
factorVariableb           2.095***  (0.235)
factorVariablec          -2.788***  (0.228)
Constant                  9.891***  (0.175)
------------------------------------------------------
Observations              100
R2                        0.845
Adjusted R2               0.842
Residual Std. Error       0.911 (df = 97)
F Statistic               263.862*** (df = 2; 97)
------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01
  • When using a factor variable, R automatically drops one category: the first level becomes the reference or default category. In this case R dropped the category a. We can specify which category to use as the default with the relevel command:
df2$factorVariable <- relevel(df2$factorVariable, ref = "b") #sets "b" as the default/reference category
# Estimating regression with "b" as default category
reg2 <- lm(y ~ factorVariable, data = df2)

# Using stargazer for output
suppressMessages(library(stargazer))
stargazer(reg2, type = "html")
                        Dependent variable: y
------------------------------------------------------
factorVariablea          -2.095***  (0.235)
factorVariablec          -4.883***  (0.214)
Constant                 11.986***  (0.156)
------------------------------------------------------
Observations              100
R2                        0.845
Adjusted R2               0.842
Residual Std. Error       0.911 (df = 97)
F Statistic               263.862*** (df = 2; 97)
------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01
  • You may want to use the covariate.labels option in stargazer to get better labels for the dummy variables when using a factor variable.
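  • For example, here is a sketch of how covariate.labels could be used with the last regression; the labels themselves are arbitrary, and they must be supplied in the same order in which the coefficients appear in the table.

# Relabeling the dummy rows in the stargazer table
stargazer(reg2, type = "html",
          covariate.labels = c("Category a", "Category c"))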

Interpretation

The interpretation of the estimated parameter \(\beta_{i}\) of a dummy variable \(i\) is simple. You read the parameter as:

  • ‘’the estimated parameter \(\beta_{i}\) indicates that, relative to the default category, the category \(i\) causes a \(\beta_{i}\) change in the dependent variable’’

  • Let’s revisit the example from lesson 8: the regression of earnings on education, gender, and age in a sample of workers with at least a high school diploma.

dataCPS <- read.csv("/Users/econphd/Dropbox-NEU/Dropbox/Teaching/NEU/2019/PPUA5301/PPUA5301 - Summer 2019/Lectures/cps12.csv",sep=",", stringsAsFactors=FALSE)
mr2 <- lm(ahe ~ bachelor + female + age, data = dataCPS)
stargazer(mr2, type = "html")
                        Dependent variable: ahe
------------------------------------------------------
bachelor                  8.319***  (0.227)
female                   -3.810***  (0.230)
age                       0.510***  (0.040)
Constant                  1.866     (1.188)
------------------------------------------------------
Observations              7,440
R2                        0.180
Adjusted R2               0.180
Residual Std. Error       9.678 (df = 7436)
F Statistic               544.495*** (df = 3; 7436)
------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01
  • female is a dummy equal to 1 if the individual is female, and bachelor is a dummy equal to 1 if the individual has a bachelor degree. The default category for gender is therefore male and for education is having a high school diploma. ahe is average hourly earnings (in dollars) and age is measured in years.

  • The first thing we have to do is to check whether the estimated parameters are statistically significant.
    • In this case they are, so we can reject the null hypothesis that males’ and females’ earnings are equal (the p-value for the female dummy is less than or equal to 0.05).
    • Likewise, individuals with a high school diploma and those with a bachelor degree have statistically different levels of earnings (the p-value for the bachelor dummy is less than or equal to 0.05).

Now you can proceed to interpret the magnitude of the parameters:

  • The estimated parameter for female can be interpreted as: ‘’relative to males, females’ average hourly earnings are \(\$3.81\) lower’’.

  • The estimated parameter for bachelor can be interpreted as: ‘’relative to individuals with a high school diploma, those with a bachelor degree earn about \(\$8.32\) more per hour on average’’. (A fitted-value check based on these coefficients appears after this list.)

  • Note: Obviously the results from this regression are likely to be biased due to omitted variable bias (why do you think that’s the case?).
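  • As a quick check of these interpretations, we can compute the fitted value for a specific profile, say a 30-year-old woman with a bachelor degree (the age of 30 is just an illustrative choice):

# Expected average hourly earnings for bachelor = 1, female = 1, age = 30
predict(mr2, newdata = data.frame(bachelor = 1, female = 1, age = 30))

# Using the rounded coefficients from the table above:
# 1.866 + 8.319 - 3.810 + 0.510*30 = 21.675, i.e. about 21.68 dollars per hour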

Summary

  • Categorical variables cannot simply be recoded as numbers; they have to be transformed into binary (dummy) variables before we can use them in the regression model.

  • In a regression model with a categorical variable with \(k\) categories, we only include \(k-1\) binary variables.

  • The estimated parameter of a dummy variable represents the effect on the \(y\) variable when the dummy changes from 0 to 1, i.e. relative to the base category.

  • Dummies can be created manually or using the as.factor command.