This lesson introduces interaction terms in linear regressions and discusses how to select a functional form.
After completing this module, students should be able to:
Schumacker, Chapter 17.
Example (1): Effect of school distance on years of completed education
In the last homework we tried to estimate the effect of distance to college on years of completed education.
The main conclusion was that distance is negatively related to years of completed education.
Some may argue that the reason for this is that there is an important transportation and moving cost, and not only a monetary one. We may also think that individuals prefer to live close to their families. So the effect may be moderated by the interaction with these variables. That is, the relationship between college distance and education may be moderated, becoming stronger or weaker, once we control for the cost of transportation and/or moving.
If the previous hypothesis about transportation cost is true, we should observe that the effect of distance on completed education is stronger for lower-income individuals than for high-income individuals.
Moreover, if it is true that individuals prefer to stay close to their families, we should find that individuals with smaller families, or weaker attachments to them, are less worried about living far away than the rest.
In both cases, the key is that some variables can make the relationship between the independent variable of interest (distance to college) and the dependent variable (years of education) stronger or weaker.
Definition: When a variable \(z\) moderates the effect of \(x\) on \(y\), we say that \(x\) is interacting with the variable \(z\).
We can introduce interaction terms by adding the product of the variables that are interacting. In this example we are assuming that the effect of \(x_{1}\) is moderated by \(x_{2}\):
Formally:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon\]
Conceptually:
The effect of \(x_1\) varies according to the values of \(x_2\).
The effect of each variable in the interaction depends on the value of the other.
The coefficients \(\beta_1\) and \(\beta_2\) can no longer be interpreted independently.
For example, let’s figure out the effect on \(y\) when \(x_{1}\) increases by one unit in a model with an interaction term:
\[ \begin{eqnarray} y(x_{1}) & = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \\ y(x_{1} + 1) & = & \beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 (x_1 + 1) x_2 + \epsilon \\ \Delta y & = & y(x_{1} + 1) - y(x_{1}) \\ & = & \beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 (x_1 + 1) x_2 + \epsilon -( \beta_0 + \beta_1 x_{1} + \beta_2 x_2 + \beta_3 x_{1} x_2 + \epsilon)\\ & = & \beta_{1} + \beta_3 x_2 \\ \end{eqnarray} \]
We can also show this by rearranging the terms of the original model as:
\[y = \beta_0 + (\beta_1 + \beta_3 x_{2} ) x_1 + \beta_2 x_2 + \epsilon\]
Interpretation:
The effect of changing \(x_1\) by one unit is equal to its coefficient (\(\beta_1\)) plus \(\beta_3\) times the value of \(x_2\). In other words, the slope of \(x_{1}\) on \(y\) is \(\beta_1 + \beta_3 x_2\).
Symmetrically, the effect of changing \(x_2\) by one unit is equal to its coefficient (\(\beta_2\)) plus \(\beta_3\) times the value of \(x_1\).
If \(x_2\) is 1, the total effect is \(\beta_1 + \beta_3\), whereas when \(x_2\) is, say, -5, the total effect is \(\beta_1 - 5\beta_3\).
Let’s simulate some data to illustrate the interaction between two variables:
n <- 1000  # Number of observations
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 10 + 2*x1 - 2*x2 + x1*x2 + rnorm(n)  # True interaction coefficient is 1
df <- data.frame(y = y, x1 = x1, x2 = x2)

intreg <- lm(y ~ x1 + x2 + x1*x2, data = df)  # x1*x2 adds the x1:x2 term
suppressMessages(library(stargazer))
stargazer(intreg, type = "html", header = FALSE)
Dependent variable: y

| Variable | Coefficient (SE) |
|---|---|
| x1 | 1.954*** (0.030) |
| x2 | -1.936*** (0.029) |
| x1:x2 | 1.018*** (0.029) |
| Constant | 10.065*** (0.030) |
| Observations | 1,000 |
| R2 | 0.913 |
| Adjusted R2 | 0.913 |
| Residual Std. Error | 0.954 (df = 996) |
| F Statistic | 3,475.546*** (df = 3; 996) |

Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation:
The effect of \(x_1\) now depends on \(x_2\).
The coefficient on \(x_1 x_2\) indicates the nature of the interaction between the two variables. If it is positive, the moderating effect amplifies the effect of \(x_{1}\) on \(y\). If it is negative, the moderating effect dampens the effect of \(x_{1}\) on \(y\).
The coefficient on each \(x_{i}\) indicates the effect of \(x_{i}\) when the other variable is zero.
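The slope formula \(\beta_1 + \beta_3 x_2\) can be checked numerically. A minimal sketch, re-simulating data as above (`slope_x1` is a hypothetical helper written for this illustration, not part of any package):

```r
# Recover the slope of x1 at different values of x2 from a fitted
# interaction model (simulated data with known true coefficients).
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 10 + 2*x1 - 2*x2 + x1*x2 + rnorm(n)

fit <- lm(y ~ x1*x2)
b   <- coef(fit)

# Slope of x1 at a given value of x2: beta_1 + beta_3 * x2
slope_x1 <- function(x2_val) unname(b["x1"] + b["x1:x2"] * x2_val)

slope_x1(0)   # close to 2: the effect of x1 when x2 = 0
slope_x1(-2)  # close to 0: here the interaction offsets the direct effect
```

Evaluating the slope at a few values of \(x_2\) makes the moderation concrete: the "effect of \(x_1\)" is not one number but a function of \(x_2\).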
Previously we found that the effect of gender on earnings is negative for women; that is, on average females earn less than males, and the difference is statistically significant. We also found that education has a positive effect on earnings. But do you think the gender gap is moderated by education? That is, do educated women suffer the same reduction in income, relative to men, as women with less education?
We can check if that’s the case by specifying the following regression model, and interacting the variables gender and education:
\[\text{Earnings} =\beta_0 + \beta_1 \text{Gender} + \beta_2 \text{Education} + \beta_3 \left(\text{Gender} \times \text{Education} \right) + \epsilon\]
\[ \begin{eqnarray} \text{Earnings (Gender = 0)} \equiv \text{Earnings}_\text{male} & = & \beta_0 + \beta_2 \text{Education} + \epsilon \\ \text{Earnings (Gender = 1)} \equiv \text{Earnings}_\text{female} & = & \beta_0 + \beta_{1} + (\beta_2 + \beta_{3}) \text{Education} + \epsilon \\ \Delta \text{Earnings} \equiv \text{Wage Gap} & = & \text{Earnings}_\text{female} - \text{Earnings}_\text{male} \\ & = & \beta_0 + \beta_{1} + (\beta_2 + \beta_{3}) \text{Education} + \epsilon - (\beta_0 + \beta_2 \text{Education} + \epsilon) \\ & = & \beta_{1} + \beta_{3}\text{Education} \\ \end{eqnarray} \]
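The derivation above says the estimated gap at a given education level is \(\beta_1 + \beta_3 \text{Education}\). A minimal sketch with simulated data and made-up coefficients (the variable names and the `gap` helper are invented for this illustration):

```r
# Simulated example: the group gap at a given education level
# equals beta_1 + beta_3 * education.
set.seed(7)
n      <- 2000
female <- rbinom(n, 1, 0.5)               # 0 = male, 1 = female
educ   <- sample(8:20, n, replace = TRUE) # years of education
earn   <- 5 - 2*female + 0.5*educ + 0.1*female*educ + rnorm(n)

fit <- lm(earn ~ female*educ)
b   <- coef(fit)

# Estimated gap (female - male) at a given education level
gap <- function(e) unname(b["female"] + b["female:educ"] * e)

gap(12)  # near -2 + 0.1*12 = -0.8
gap(20)  # near -2 + 0.1*20 =  0: in this simulation the gap closes
```

With a positive interaction coefficient, the simulated gap shrinks as education rises, which is exactly the pattern the hypothesis above predicts.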
To test this hypothesis we are going to use the CPSSW8 database from the AER package. This dataset includes five variables:

- earnings: average hourly earnings in dollars.
- gender: factor indicating gender.
- age: age in years.
- region: factor indicating region.
- education: number of years of education.

We are going to estimate the individual's potential experience by computing the difference between age and years of education. I am using the natural log of earnings because of potentially extreme outliers (usually a few people earn far more than the average). I am also adding a quadratic term for age (why is that a reasonable assumption?).
Let’s load the data and run the specified regression:
# Loading CPSSW8 data
suppressMessages(library(AER))
data("CPSSW8")
reg1 <- lm(I(log(earnings)) ~ gender + education + gender*education + age + I(age^2) + region, data = CPSSW8)
stargazer(reg1, type = "html",
header = FALSE)
Dependent variable: I(log(earnings))

| Variable | Coefficient (SE) |
|---|---|
| genderfemale | -0.507*** (0.022) |
| education | 0.084*** (0.001) |
| age | 0.062*** (0.001) |
| I(age^2) | -0.001*** (0.00002) |
| regionMidwest | -0.056*** (0.006) |
| regionSouth | -0.073*** (0.005) |
| regionWest | -0.027*** (0.006) |
| genderfemale:education | 0.020*** (0.002) |
| Constant | 0.375*** (0.030) |
| Observations | 61,395 |
| R2 | 0.268 |
| Adjusted R2 | 0.268 |
| Residual Std. Error | 0.474 (df = 61386) |
| F Statistic | 2,812.963*** (df = 8; 61386) |

Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation of the interaction term:
When you have an interaction, the effect of the interacted variables is split among several pathways: the direct effect and the interaction effect.
This may lead you to fail to reject the null that some parameters are individually equal to zero, because each individual causal pathway is weaker once the overall effect is split into direct and indirect effects. A simple hypothesis test based on the t-statistic may therefore not be appropriate.
Instead, we need a method that simultaneously tests the significance of a variable and its transformations/interactions.
We are going to call a regression that includes non-linear or interaction terms the complete model, because it completely represents the different causal relations between \(x\) and \(y\) (under the assumption that the specification is correct).
Then we are going to call nested models all the more basic versions of the complete model; that is, those specifications that do not separate out all the different effects.
For example, the complete model:
\[\text{Earnings} =\beta_0 + \beta_1 \text{Gender} + \beta_2 \text{Education} + \beta_3 \left(\text{Gender} \times \text{Education} \right) + \epsilon\] includes the nested model,
\[\text{Earnings} =\beta_0 + \beta_1 \text{Gender} + \beta_2 \text{Education} + \epsilon\]
Another F test! The F test is great for testing joint hypotheses (simultaneously testing several hypotheses).
Generally we use the F-Statistic to test if all the parameters of the model are simultaneously equal to zero.
This time, we are not testing that all the independent variables are equal to zero at once; instead, we are testing one set of variables (the complete model) against a subset (the nested model, where we have dropped one or more terms from the complete model).
\[H_0: \text{Jointly, the parameters of the complete model that are not in the nested model are equal to zero}\]
\[H_1: \text{At least one of the parameters of the complete model that are not in the nested model is different from zero}\]
\[F = \frac{(R_c^2 - R_n^2) / df_1}{(1-R_c^2)/df_2}\]
\(df_1 =\) number of additional variables in the complete model (the difference in parameters between the complete and nested model);
\(df_2 = n - k -1\) for the complete model (i.e., \(k\) is the total number of independent variables in the complete model).
\(R_c^2\) is the \(R^2\) for the complete model, and similarly \(R_n^2\) for the nested model.
The larger the difference in \(R^{2}\) between the two specifications (\(R_c^2 - R_n^2\)), the larger the gain in explanatory power of the complete model over the nested model, and thus the more likely the extra parameters in the complete model are jointly different from zero.
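The F formula can be verified numerically. A minimal sketch using simulated data with a known interaction (so the AER package is not required), comparing the manual computation from the two \(R^2\) values with the result of `anova()`:

```r
# Compute the nested-model F statistic from the two R^2 values
# and verify it matches anova() on the same pair of models.
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 10 + 2*x1 - 2*x2 + x1*x2 + rnorm(n)

complete <- lm(y ~ x1*x2)    # includes the interaction term
nested   <- lm(y ~ x1 + x2)  # drops the interaction term

R2c <- summary(complete)$r.squared
R2n <- summary(nested)$r.squared
df1 <- 1            # one extra parameter (the interaction)
df2 <- n - 3 - 1    # n - k - 1 for the complete model (k = 3)

F_manual <- ((R2c - R2n) / df1) / ((1 - R2c) / df2)
F_anova  <- anova(nested, complete)$F[2]
all.equal(F_manual, F_anova)  # TRUE: the two computations agree
```

The same statistic can be written in terms of residual sums of squares, which is the form `anova()` reports; the two are algebraically identical because \(R^2 = 1 - RSS/TSS\).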
Let’s run the F-test for a few different nested models of the earnings regression:
complete <- lm(I(log(earnings)) ~ gender + education + gender*education + age + I(age^2) + region, data = CPSSW8)
nested1 <- lm(I(log(earnings)) ~ gender + education + age + I(age^2) + region, data = CPSSW8)
nested2 <- lm(I(log(earnings)) ~ gender + education + gender*education + age + region, data = CPSSW8)
nested3 <- lm(I(log(earnings)) ~ gender + education + age + region, data = CPSSW8)
anova1 <- anova(nested1, complete)
anova2 <- anova(nested2, complete)
anova3 <- anova(nested3, complete)
stargazer(anova1, summary = FALSE, type = "html", header = FALSE, title = "Anova 1")
| | Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 1 | 61,387 | 13,841.060 | | | | |
| 2 | 61,386 | 13,805.080 | 1 | 35.975 | 159.967 | 0 |
stargazer(anova2, summary = FALSE, type = "html", header = FALSE, title = "Anova 2")
| | Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 1 | 61,387 | 14,155.320 | | | | |
| 2 | 61,386 | 13,805.080 | 1 | 350.236 | 1,557.369 | 0 |
stargazer(anova3, summary = FALSE, type = "html", header = FALSE, title = "Anova 3")
| | Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 1 | 61,388 | 14,190.590 | | | | |
| 2 | 61,386 | 13,805.080 | 2 | 385.506 | 857.101 | 0 |
Interpretation:
For each nested specification of the complete model we reject the null that, jointly, the parameters of the complete model that are not in the nested model are equal to zero.
That is, both the interaction term between gender and education and the quadratic term for age should be included in the model.
Be careful of false positives, but be mindful that interactions and non-linear effects can tell both important and interesting stories!