
Interaction Terms and Functional Form Selection

Overview

This lesson introduces interaction terms in linear regression and shows how to select among functional forms.

Objectives

After completing this module, students should be able to:

  1. Estimate and interpret regressions using interactions between independent variables.
  2. Use the F-Statistic to compare different functional forms.

Readings

Schumacker, Chapter 17.

Introduction

Example (1): Effect of school distance on years of completed education

  • In the last homework we tried to estimate the effect of distance to college on years of completed education.

  • The main conclusion was that distance was negatively related to years of completed education.

  • Some may argue that the reason for this is an important transportation and moving cost, which is not only monetary. We may also think that individuals prefer to live close to their families. So the effect may be moderated by the interaction of these variables: the relationship between college distance and education may become stronger or weaker once we account for the cost of transportation and/or moving.

  • If the previous hypothesis about transportation cost is true, we should observe that the effect of distance on completed education is stronger for individuals with lower income than for those with higher income.

  • Moreover, if it is true that individuals prefer to stay close to their families, we should find that individuals with smaller or less attached families are less worried about living far away than the rest.

  • In both cases, the key is that some variables can make the relationship between the independent variable of interest (distance to college) and the dependent variable (years of education) stronger or weaker.

  • Definition: When a variable \(z\) moderates the effect of \(x\) on \(y\), we say that \(x\) is interacting with the variable \(z\).

Interaction terms in the regression model

We can introduce interaction terms by adding the product of the variables that are interacting. In this example we are assuming that the effect of \(x_{1}\) is moderated by \(x_{2}\):

Formally:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon\]

Conceptually:

  • The effect of \(x_1\) varies according to the values of \(x_2\).

  • The effect of each variable in the interaction depends on the values of the other.

  • The coefficients \(\beta_1\) and \(\beta_2\) can no longer be interpreted independently of the interaction.

For example, let’s figure out the effect on \(y\) when \(x_{1}\) increases by one unit in a model with an interaction term:

\[ \begin{eqnarray} y(x_{1}) & = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \\ y(x_{1} + 1) & = & \beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 (x_1 + 1) x_2 + \epsilon \\ \Delta y & = & y(x_{1} + 1) - y(x_{1}) \\ & = & \beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 (x_1 + 1) x_2 + \epsilon -( \beta_0 + \beta_1 x_{1} + \beta_2 x_2 + \beta_3 x_{1} x_2 + \epsilon)\\ & = & \beta_{1} + \beta_3 x_2 \\ \end{eqnarray} \]

  • Note how the effect depends not only on \(\beta_{1}\) but also on the value of \(x_{2}\) and \(\beta_{3}\).

We can also show this by rearranging the terms of the original model as:

\[y = \beta_0 + (\beta_1 + \beta_3 x_{2} ) x_1 + \beta_2 x_2 + \epsilon\]

Interpretation:

  • The effect of changing \(x_1\) by +1 unit is equal to its coefficient (\(\beta_1\)) plus the value of \(x_2\) scaled by \(\beta_3\). That is, the slope of \(x_{1}\) on \(y\) is \(\beta_1 + \beta_3 x_{2}\).

  • The effect of changing \(x_2\) by +1 unit is equal to its coefficient (\(\beta_2\)) plus the value of \(x_1\) scaled by \(\beta_3\).

  • If \(x_2\) is 1, the total effect is \(\beta_1 + \beta_3\), whereas when \(x_2\) is, say, -5, the total effect is \(\beta_1 - 5\beta_3\).

Simulation

Let’s simulate some data to illustrate the interaction between two variables:

n  <- 1000                                   # number of observations
x1 <- rnorm(n)                               # first regressor
x2 <- rnorm(n)                               # second regressor (the moderator)
y  <- 10 + 2*x1 - 2*x2 + x1*x2 + rnorm(n)    # true model includes an interaction term
df <- data.frame(y = y, x1 = x1, x2 = x2)
# In R's formula syntax x1*x2 expands to x1 + x2 + x1:x2, so listing the main
# effects explicitly is redundant but harmless
intreg <- lm(y ~ x1 + x2 + x1*x2, data = df)

suppressMessages(library(stargazer))
stargazer(intreg, type = "html", header = FALSE)
Dependent variable: y
                       Coefficient  (Std. Error)
x1                        1.954***       (0.030)
x2                       -1.936***       (0.029)
x1:x2                     1.018***       (0.029)
Constant                 10.065***       (0.030)
Observations              1,000
R2                        0.913
Adjusted R2               0.913
Residual Std. Error       0.954 (df = 996)
F Statistic               3,475.546*** (df = 3; 996)
Note: *p<0.1; **p<0.05; ***p<0.01
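
To visualize the interaction (a sketch, not part of the original output), we can plot the fitted relationship between \(x_1\) and \(y\) while holding \(x_2\) fixed at a few values; the slopes differ because of the interaction term. Names like x1_grid and newdata are just for this sketch:

# Sketch: fitted y against x1 at three fixed values of x2, using the intreg object above
x1_grid <- seq(-3, 3, length.out = 100)
newdata <- expand.grid(x1 = x1_grid, x2 = c(-1, 0, 1))
newdata$yhat <- predict(intreg, newdata)
plot(yhat ~ x1, data = subset(newdata, x2 == -1), type = "l",
     ylim = range(newdata$yhat), xlab = "x1", ylab = "Predicted y")
lines(yhat ~ x1, data = subset(newdata, x2 == 0), lty = 2)
lines(yhat ~ x1, data = subset(newdata, x2 == 1), lty = 3)
legend("topleft", legend = c("x2 = -1", "x2 = 0", "x2 = 1"), lty = 1:3)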

Interpretation:

  • The effect of \(x_1\) now depends on \(x_2\).

  • The coefficient on the interaction \(x_{1} x_{2}\) indicates the nature of the interaction between the two variables. If it is positive, the effect of \(x_{1}\) on \(y\) becomes stronger as \(x_{2}\) increases; if it is negative, the effect of \(x_{1}\) on \(y\) becomes weaker as \(x_{2}\) increases.

  • The coefficient on each \(x_{i}\) indicates the effect of \(x_{i}\) when the other variable is zero (see the sketch below).
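
As a quick check (a sketch, reusing the fitted object intreg from above), we can compute the implied slope of \(x_1\) at a few values of \(x_2\) directly from the estimated coefficients:

# Sketch: implied effect of x1 at several values of x2, i.e. beta_1 + beta_3 * x2
b <- coef(intreg)
x2_values <- c(-1, 0, 1)
slope_x1 <- b["x1"] + b["x1:x2"] * x2_values
data.frame(x2 = x2_values, effect_of_x1 = slope_x1)

With the estimates above (roughly 2 for \(x_1\) and 1 for the interaction), the implied slopes are close to 1, 2, and 3 at \(x_2 = -1, 0, 1\).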

Real World example: Gender and Education interaction (1/2)

  • Previously we found that the effect of gender on earnings is negative for women; that is, on average females earn less than males and the difference is statistically significant. We also found that education has a positive effect on earnings. But do you think the gender gap is moderated by education? That is, do educated women suffer the same reduction in income, relative to men, as women with less education?

  • We can check if that’s the case by specifying the following regression model, and interacting the variables gender and education:

\[\text{Earnings} =\beta_0 + \beta_1 \text{Gender} + \beta_2 \text{Education} + \beta_3 \left(\text{Gender} \times \text{Education} \right) + \epsilon\]

  • See how the gender wage gap is moderated by education. Formally,

\[ \begin{eqnarray} \text{Earnings (Gender = 0)} \equiv \text{Earnings}_\text{male} & = & \beta_0 + \beta_2 \text{Education} + \epsilon \\ \text{Earnings (Gender = 1)} \equiv \text{Earnings}_\text{female} & = & \beta_0 + \beta_{1} + (\beta_2 + \beta_{3}) \text{Education} + \epsilon \\ \Delta \text{Earnings} \equiv \text{Wage Gap} & = & \text{Earnings}_\text{female} - \text{Earnings}_\text{male} \\ & = & \beta_0 + \beta_{1} + (\beta_2 + \beta_{3}) \text{Education} + \epsilon - (\beta_0 + \beta_2 \text{Education} + \epsilon) \\ & = & \beta_{1} + \beta_{3}\text{Education} \\ \end{eqnarray} \]

  • Then, if we estimate that \(\beta_{3} > 0\) and it is statistically different from zero, it means that the gap in earnings between males and females is reduced by education.

To test this hypothesis we are going to use the CPSSW8 database from the AER package. This dataset includes five variables:

  • earnings : average hourly earnings in dollars.
  • gender : factor indicating gender.
  • age : age in years.
  • region : factor indicating region.
  • education : number of years of education.

We are going to estimate the potential experience of the individual by computing the difference between age and years of education. I’m using the natural log of earnings because of potentially large outliers (a few people earn much more than the average). I’m also adding a quadratic term for age – why is that a reasonable assumption? –
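
As a quick check of the skewness argument (a sketch; it loads the data that the next section uses), compare the distribution of earnings with that of log earnings:

# Sketch: earnings are right-skewed, while log(earnings) is closer to symmetric
suppressMessages(library(AER))
data("CPSSW8")
par(mfrow = c(1, 2))
hist(CPSSW8$earnings, main = "Earnings", xlab = "Dollars per hour")
hist(log(CPSSW8$earnings), main = "log(Earnings)", xlab = "Log dollars per hour")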

Real World example: Gender and Education interaction (2/2)

Let’s load the data and run the specified regression:

# Loading CPSSW8 data
suppressMessages(library(AER))
data("CPSSW8")

reg1 <- lm(I(log(earnings)) ~ gender + education + gender*education + age + I(age^2) + region, data = CPSSW8)
stargazer(reg1, type = "html",
          header = FALSE)
Dependent variable: I(log(earnings))
                          Coefficient  (Std. Error)
genderfemale                -0.507***       (0.022)
education                    0.084***       (0.001)
age                          0.062***       (0.001)
I(age^2)                    -0.001***     (0.00002)
regionMidwest               -0.056***       (0.006)
regionSouth                 -0.073***       (0.005)
regionWest                  -0.027***       (0.006)
genderfemale:education       0.020***       (0.002)
Constant                     0.375***       (0.030)
Observations                 61,395
R2                           0.268
Adjusted R2                  0.268
Residual Std. Error          0.474 (df = 61386)
F Statistic                  2,812.963*** (df = 8; 61386)
Note: *p<0.1; **p<0.05; ***p<0.01

Interpretation of the interaction term:

  • We find that the interaction term is statistically different from zero. Therefore, the wage gap narrows with years of education.
  • Because the model is log-linear, the interpretation of the interaction term is: (1) for males, each year of education increases earnings by about \(8.4\%\); (2) for females, each year of education increases earnings by about \(8.4\% + 2\% = 10.4\%\). So each year of education reduces the gender wage gap by about \(2\%\).
  • At 12 years of education, the gender wage gap is estimated to be \(\beta_{1} + \beta_{3}\text{Education} = -0.507 + 0.02 \times 12 = -0.267\), i.e., about 26.7 percent. At 16 years of education, it is \(-0.507 + 0.02 \times 16 = -0.187\), about 18.7 percent (see the sketch below).
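
A quick way to reproduce these numbers (a sketch, using the fitted object reg1 and the coefficient names as printed above; the helper gap() is only for illustration):

# Sketch: implied gender gap in log earnings at different education levels
b <- coef(reg1)
gap <- function(educ) b["genderfemale"] + b["genderfemale:education"] * educ
gap(12)   # roughly -0.267, a gap of about 26.7 percent at 12 years of education
gap(16)   # roughly -0.187, a gap of about 18.7 percent at 16 years of education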

Nested models

  • When you have an interaction, the effect of the interacted variables gets split among the various pathways – the direct effect, and the interaction effect.

  • This may lead you to fail to reject the null that some parameters are individually equal to zero, because each individual causal pathway is weaker once the overall effect is split into direct and interaction effects. A simple hypothesis test based on the t-statistic may therefore not be appropriate.

  • Instead, we need a method that simultaneously tests the significance of a variable and its transformations/interactions.

  • We are going to call a regression that includes the non-linear or interaction terms the complete model, because it completely represents the different causal relations between \(x\) and \(y\) – under the assumption that the specification is correct –.

  • Then we call a nested model any more basic version of the complete model; that is, a specification that does not separate all the different effects.

For example, the complete model:

\[\text{Earnings} =\beta_0 + \beta_1 \text{Gender} + \beta_2 \text{Education} + \beta_3 \left(\text{Gender} \times \text{Education} \right) + \epsilon\] includes the nested model,

\[\text{Earnings} =\beta_0 + \beta_1 \text{Gender} + \beta_2 \text{Education} + \epsilon\]

The F-Test (Again!)

  • Another F-test! The F-test is great for testing joint hypotheses (simultaneously testing several hypotheses).

  • Generally we use the F-Statistic to test if all the parameters of the model are simultaneously equal to zero.

  • This time, we are not testing whether all the coefficients are equal to zero at once; instead, we are testing one set of variables (the complete model) against a subset of variables (the nested model, where we have dropped one or more terms from the complete model).

\[H_0 : \text{Jointly, the parameters of the complete model that are not in the nested model are equal to zero}\]

\[H_1 : \text{At least one of the parameters of the complete model that are not in the nested model is different from zero}\]

Calculating F

\[F = \frac{(R_c^2 - R_n^2) / df_1}{(1-R_c^2)/df_2}\]

  • \(df_1 =\) number of additional variables in the complete model (the difference in parameters between the complete and nested model);

  • \(df_2 = n - k -1\) for the complete model (i.e., \(k\) is the total number of independent variables in the complete model).

  • \(R_c^2\) is the \(R^2\) for the complete model, and similarly \(R_n^2\) for the nested model.

The larger the difference in \(R^{2}\) between the two specifications (\(R_c^2 - R_n^2\)), the larger the gain in explanatory power of the complete model over the nested model, and thus the more likely it is that the extra parameters in the complete model are jointly different from zero.
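
Before turning to R’s built-in test, here is a sketch of this formula applied to the simulated interaction model from earlier (intreg), compared against a version that drops the interaction term (nested_sim is a name introduced only for this sketch):

# Sketch: computing the nested-model F by hand from the two R^2 values
nested_sim <- lm(y ~ x1 + x2, data = df)      # nested model: no interaction term
r2_c <- summary(intreg)$r.squared             # R^2 of the complete model
r2_n <- summary(nested_sim)$r.squared         # R^2 of the nested model
df1  <- 1                                     # one extra parameter (x1:x2)
df2  <- df.residual(intreg)                   # n - k - 1 for the complete model
((r2_c - r2_n) / df1) / ((1 - r2_c) / df2)    # F computed by hand
anova(nested_sim, intreg)                     # the F column should match

Because both \(R^2\) values are computed against the same total sum of squares, this \(R^2\)-based formula is algebraically identical to the residual-sum-of-squares version that anova() reports.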

Testing nested models in R

Let’s run the F-test on a couple different nested models on the regression of earnings:

complete <- lm(I(log(earnings)) ~ gender + education + gender*education + age + I(age^2) + region, data = CPSSW8)
nested1  <- lm(I(log(earnings)) ~ gender + education + age + I(age^2) + region, data = CPSSW8)          # drops the interaction
nested2  <- lm(I(log(earnings)) ~ gender + education + gender*education + age + region, data = CPSSW8)  # drops the quadratic age term
nested3  <- lm(I(log(earnings)) ~ gender + education + age + region, data = CPSSW8)                     # drops both

anova1 <- anova(nested1, complete)
anova2 <- anova(nested2, complete)
anova3 <- anova(nested3, complete)

stargazer(anova1, summary = FALSE, type = "html", header = FALSE, title = "Anova 1")
Anova 1
   Res.Df         RSS  Df  Sum of Sq        F  Pr(> F)
1  61,387  13,841.060
2  61,386  13,805.080   1     35.975  159.967        0
stargazer(anova2, summary = FALSE, type = "html", header = FALSE, title = "Anova 2")
Anova 2
   Res.Df         RSS  Df  Sum of Sq          F  Pr(> F)
1  61,387  14,155.320
2  61,386  13,805.080   1    350.236  1,557.369        0
stargazer(anova3, summary = FALSE, type = "html", header = FALSE, title = "Anova 3")
Anova 3
   Res.Df         RSS  Df  Sum of Sq        F  Pr(> F)
1  61,388  14,190.590
2  61,386  13,805.080   2    385.506  857.101        0

Interpretation:

  • For each nested specification of the complete model we reject the null that, jointly, the parameters of the complete model that are not in the nested model are equal to zero.

  • That is, both the interaction term between gender and education and the quadratic term for age should be included in the model.

When to include polynomial, natural logarithms or interaction terms

  • Don’t go fishing: don’t include a quadratic term for every variable or interact every pair of variables you have. You are likely to find something significant just by chance if you do. You should have some motivation to add the non-linear terms: either a scatter plot suggests a non-linear relation, or you want to test a specific hypothesis of non-linearity – generally based on some a priori knowledge of the relation between \(x\) and \(y\).

Be careful of false positives – but be mindful that interactions or non-linear effects can tell both important and interesting stories!

  • Use the F-test to verify whether the extra terms should be part of the model.