This lesson introduces a variety of non-linear functional form specifications.
After completing this module, students should be able to:
Schumacker, Chapter 17.
Non-linear vs linear relations
The linear regression model assumes that the relation between \(y\) and the different \(x\) variables is linear. That is, the effect of the independent variable (\(x\)) on the dependent variable (\(y\)) is constant over the entire range of \(x\).
Sometimes that’s not the case. For example, we know that years of education have a positive effect on earnings. We also know that after you have a master’s or PhD degree, staying more years in school has little or no effect on earnings (in fact, some may argue that the effect may turn negative for very high levels of education, since staying in school for a long time reduces the amount of accumulated job market experience).
The following graph was generated using the CASchools dataset from lesson 9.3. It shows the relation between the average income of a school district (measured in thousands of dollars) and the average math test scores.
Quadratic Terms
A first way to allow for a non-linear relation is to add a squared term of the independent variable to the regression model:

\[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon\]
The estimated parameters for \(\beta_0, \beta_1\) and \(\beta_2\) will then determine the shape of the quadratic relation, including whether it is convex or concave.
Note that by simply raising the variable \(x\) to the power of \(2\), we are allowing \(x\) to be related to \(y\) in a non-linear way. In this model, when \(x\) increases by one unit, the effect on \(y\) is not constant. Say that we want to know the effect on \(y\) of increasing \(x\) by one unit, starting at some particular value \(\bar{x}\),
\[ \begin{eqnarray} y(x = \bar{x}) & = & \beta_0 + \beta_1 \bar{x} + \beta_2 \bar{x}^2 + \epsilon & \;\; (\text{Value of}\; y \;\text{at} \; x=\bar{x}) \\ y(x = \bar{x} + 1) & = & \beta_0 + \beta_1 (\bar{x} + 1) + \beta_2 (\bar{x} + 1)^2 + \epsilon & \;\; (\text{Value of}\; y \;\text{at} \; x=\bar{x}+1) \end{eqnarray} \]
Now let’s compute the difference between the values of \(y\) \((\Delta y)\) to have a measure of the effect,
\[ \begin{eqnarray} \Delta(y) & = & y(x = \bar{x} + 1) - y(x = \bar{x}) & \\ & = & \left(\beta_0 + \beta_1 (\bar{x} + 1) + \beta_2 (\bar{x} + 1)^2 + \epsilon \right) - \left(\beta_0 + \beta_1 \bar{x} + \beta_2 \bar{x}^2 + \epsilon \right) & \\ & = & \left( \beta_{0} - \beta_{0} \right) + \beta_{1}\left(\bar{x} + 1 - \bar{x}\right) + \beta_{2}\left( (\bar{x}+1)^2 - \bar{x}^2 \right) + \left( \epsilon - \epsilon \right) & \text{(Rearranging terms)}\\ & = & \beta_1 + \beta_{2}\left( \bar{x}^2 + 2\bar{x} + 1 - \bar{x}^2 \right) & \text{(Expanding} \;\; (\bar{x}+1)^2 = \bar{x}^2 + 2\bar{x} + 1) \\ & = & \beta_1 + \beta_2 \left(2\bar{x} + 1\right) & \end{eqnarray} \]
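Note that, unlike in the linear model, the effect of a one-unit increase in \(x\) now depends on the starting value \(\bar{x}\). As a quick illustration, using the parameter values from the simulation below (\(\beta_1 = 2\) and \(\beta_2 = -1\)): starting at \(\bar{x} = 0\) the effect is \(2 - (2 \cdot 0 + 1) = 1\), while starting at \(\bar{x} = 2\) the effect is \(2 - (2 \cdot 2 + 1) = -3\).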
More generally, we can capture additional curvature by adding higher order polynomial terms:

\[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_{3} x^3 + ... + \beta_{k} x^{k} + \epsilon\]
Quadratic relation
Let’s use some simulated data before we turn back to the income and test scores example. In this simulation we are going to generate a non-linear relation between \(x\) and \(y\) and then try to estimate the parameters of the model. The goal is to compare how well the linear regression represents the relation between \(x\) and \(y\) vs. a non-linear model.
set.seed(1)
n <- 1000 # Number of observations
x <- rnorm(n) # Generating x
y <- 10 + 2*x - x^2 + rnorm(n) # Generating y using a quadratic function
df <- data.frame(y=y,x=x) # creating data frame
Now let’s examine the data by making a scatterplot:
graph2 <- ggplot(data=df,aes(x=x,y=y)) + geom_point()
graph2
Clearly we have a curvilinear relationship.
Let’s see what happens when we fit a linear regression vs a regression with a quadratic term:
# Adding linear regression to the graph
graph2 <- graph2 + geom_smooth(method = lm, formula = (y ~ x), color = "blue")
# Adding quadratic regression to the graph
graph2 <- graph2 + geom_smooth(method = lm, formula = (y ~ x + I(x^2)), color = "red")
graph2
See how the quadratic model (red line) fits the data better than the linear model (blue line).
Not only do we get a better fit to the data with the quadratic model, but the linear model also incorrectly estimates the effect of \(x\) on \(y\) for very low and very high values of \(x\). Only for values of \(x\) near the middle of the range is the linear regression a reasonable approximation.
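We can make this point concrete by comparing the change in the predicted value of \(y\) when \(x\) increases by one unit at different starting points. This is a minimal sketch using the simulated data frame df from above (the object names fit_lin and fit_quad are just illustrative):

fit_lin <- lm(y ~ x, data = df) # linear model
fit_quad <- lm(y ~ x + I(x^2), data = df) # quadratic model
newx <- c(-2, 0, 2) # three different starting values of x
# Implied effect of a one-unit increase in x at each starting value
predict(fit_lin, data.frame(x = newx + 1)) - predict(fit_lin, data.frame(x = newx))
predict(fit_quad, data.frame(x = newx + 1)) - predict(fit_quad, data.frame(x = newx))

The linear model returns the same effect at all three starting points, while the quadratic model returns a different effect at each one.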
Let’s see what a cubic relation looks like:
set.seed(1)
n <- 1000 # Number of observations
x <- rnorm(n) # Generating x
y <- 10 + 2*x - x^2 + 0.5*x^3 + rnorm(n) # Generating y using a cubic function
df <- data.frame(y=y,x=x) # creating data frame
Now let’s examine the data:
graph2 <- ggplot(data=df,aes(x=x,y=y)) + geom_point()
graph2
Again, we find a curvilinear relationship, but this time the data seems to have two distinct parts: first the effect of \(x\) on \(y\) is diminishing, and then it is increasing. In other words, the slope is positive with decreasing returns to \(x\) at first, and then the slope is still positive but increasing.
Let’s see what happens when we fit a linear regression vs. regressions with quadratic and cubic terms:
# Adding linear regression to the graph
graph2 <- graph2 + geom_smooth(method = lm, formula = (y ~ x), color = "blue")
# Adding quadratic regression to the graph
graph2 <- graph2 + geom_smooth(method = lm, formula = (y ~ x + I(x^2)), color = "red")
# Adding cubic regression to the graph
graph2 <- graph2 + geom_smooth(method = lm, formula = (y ~ x + I(x^2) + I(x^3)), color = "green")
graph2
Specifying higher order terms in the lm command
We can add quadratic and higher order terms into a regression model in at least three different ways:
Manually: Simply create a new variable using \(x^2\), something like xsq <- x^2. Then you can add xsq to the lm command like any other variable, e.g. lm(y ~ x + xsq). This is also true for higher order polynomials: if you want to add a cubic term (\(x^3\)), you can simply save the variable xcu <- x^3 and add it to the regression, e.g. lm(y ~ x + xsq + xcu).
I() command: Instead of creating a new variable every time you need to add a higher order term, we can use the I() command. This command allows you to apply a transformation to a variable before running a regression. You add I(x^2) to the lm command like this: lm(y ~ x + I(x^2)), which means: first transform x to x^2 and then run the regression. This is useful when you don’t want to keep non-linear transformations of the same variable in your data.frame.
poly() command: This function adds all the terms of a polynomial of degree k for a specific variable. For example, poly(x, 2, raw = TRUE) will generate the terms x and x^2 (note that poly(x, 2) without raw = TRUE uses orthogonal polynomials by default, which changes the coefficients but not the fitted values), and it can be added to lm directly, e.g. lm(y ~ poly(x, 2, raw = TRUE)). This can save you some time when you are estimating regressions with high degree terms (I consider anything above 3 or 4 high).
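To make the three options concrete, here is a minimal sketch using the simulated data frame df from the examples above (the object names are just illustrative):

# Option 1: manual transformation
df$xsq <- df$x^2
reg_manual <- lm(y ~ x + xsq, data = df)
# Option 2: I() inside the formula
reg_I <- lm(y ~ x + I(x^2), data = df)
# Option 3: poly() with raw terms
reg_poly <- lm(y ~ poly(x, 2, raw = TRUE), data = df)
# All three produce the same fitted values
all.equal(fitted(reg_manual), fitted(reg_I))
all.equal(fitted(reg_I), fitted(reg_poly))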
When reading the output of a regression with higher order terms we have to answer the following questions:
Is the relation between \(x\) and \(y\) non-linear?
For example, a quadratic polynomial (\(k=2\)) has only \(k-1 = 2 - 1 = 1\) region, so the effect will be either diminishing or increasing over the entire domain of the function.
On the other hand, a cubic polynomial (\(k=3\)) has \(k-1 = 3 - 1 = 2\) regions, so the effect of \(x\) on \(y\) can be, for example, first diminishing and then increasing (as in the simulated cubic example above).
The sign of the coefficient of the \(x^k\) term determines if the effect of \(x\) on \(y\) is diminishing or increasing for the \(k-1\) part of the function.
For example, a quadratic regression (\(k=2\)) has only \(k-1=2-1=1\) region where the effect of \(x\) on \(y\) is non-linear.
As a second example, consider a cubic regression (\(k=3\)); then we have \(k-1=3-1=2\) regions where the effect of \(x\) on \(y\) is non-linear:
You can see how the interpretation can get very complex if \(k \geq 4\).
Let’s illustrate the use of higher order terms using the CASchools dataset. Let’s take a look at the data again:
graph1
# Adding linear model in blue
graph1 <- graph1 + geom_smooth(method = lm, formula = (y ~ x), color = "blue", se = FALSE)
# Adding quadratic model in red
graph1 <- graph1 + geom_smooth(method = lm, formula = (y ~ x + I(x^2)), color = "red", se = FALSE)
# Adding cubic model in green
graph1 <- graph1 + geom_smooth(method = lm, formula = (y ~ x + I(x^2) + I(x^3)), color = "green", se = FALSE)
graph1
Let’s estimate the three different regression models:
# Estimating linear model
reg1 <- lm(math ~ income, data = CASchools)
# Estimating quadratic model
reg2 <- lm(math ~ income + I(income^2), data = CASchools)
# Estimating cubic model
reg3 <- lm(math ~ income + I(income^2) + I(income^3), data = CASchools)
#Using stargazer for output
suppressMessages(library(stargazer))
stargazer(list(reg1, reg2, reg3), type = "html")
Dependent variable: math

| | (1) | (2) | (3) |
|---|---|---|---|
| income | 1.815*** | 3.473*** | 3.976*** |
| | (0.091) | (0.310) | (0.877) |
| I(income^2) | | -0.036*** | -0.059 |
| | | (0.006) | (0.038) |
| I(income^3) | | | 0.0003 |
| | | | (0.0005) |
| Constant | 625.539*** | 610.346*** | 607.230*** |
| | (1.536) | (3.103) | (5.951) |
| Observations | 420 | 420 | 420 |
| R2 | 0.489 | 0.525 | 0.525 |
| Adjusted R2 | 0.488 | 0.522 | 0.522 |
| Residual Std. Error | 13.420 (df = 418) | 12.962 (df = 417) | 12.972 (df = 416) |
| F Statistic | 400.257*** (df = 1; 418) | 230.063*** (df = 2; 417) | 153.272*** (df = 3; 416) |

Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation:
The quadratic term in regression (2) is statistically significant, so we reject the null hypothesis that the relation between income and math is linear. Therefore, the relation is not linear and we can discard the output of regression (1).

The cubic term in regression (3) is not statistically significant, so we cannot reject the null that income has no cubic effect on math. Therefore, we can discard regression (3).

The coefficient on the quadratic term is negative, so the effect of income on math is diminishing, i.e. for higher values of income, the effect of income on math is smaller than for lower values. In other words, income exhibits diminishing returns on academic performance: the effect is greater for lower values (poorer districts) than for higher values (richer districts).

Let’s see how the three models predict an increase in income from \(10\) to \(20\) and from \(40\) to \(50\) (recall that income is measured in thousands of dollars).
# Linear model approximation
lin_y <- predict(reg1, data.frame(income = c(10, 20, 40, 50)))
lin_deltaY_lowIncome <- lin_y[2] - lin_y[1]
lin_deltaY_highIncome <- lin_y[4] - lin_y[3]
linApprox <- c(lin_deltaY_lowIncome, lin_deltaY_highIncome)
# Quadratic model approximation
qua_y <- predict(reg2, data.frame(income = c(10, 20, 40, 50)))
qua_deltaY_lowIncome <- qua_y[2] - qua_y[1]
qua_deltaY_highIncome <- qua_y[4] - qua_y[3]
quaApprox <- c(qua_deltaY_lowIncome, qua_deltaY_highIncome)
# Cubic model approximation
cub_y <- predict(reg3, data.frame(income = c(10, 20, 40, 50)))
cub_deltaY_lowIncome <- cub_y[2] - cub_y[1]
cub_deltaY_highIncome <- cub_y[4] - cub_y[3]
cubApprox <- c(cub_deltaY_lowIncome, cub_deltaY_highIncome)
approxTable <- cbind(linApprox, quaApprox, cubApprox)
approxTable <- as.data.frame(approxTable)
names(approxTable) <- c("Linear Model",
"Quadratic Model",
"Cubic Model")
rownames(approxTable) <- c("Effect from 10 to 20",
"Effect from 40 to 50")
stargazer(approxTable, summary = FALSE, header = FALSE, type = "html")
| | Linear Model | Quadratic Model | Cubic Model |
|---|---|---|---|
| Effect from 10 to 20 | 18.152 | 24.061 | 24.245 |
| Effect from 40 to 50 | 18.152 | 2.731 | 5.037 |
Let’s take a look at the same result with a graph:
graphData <- data.frame( model = rep(colnames(approxTable), each = 2),
income = rep(rownames(approxTable), 3),
effect = c(linApprox, quaApprox, cubApprox) , stringsAsFactors = TRUE)
effectsGraph <- ggplot(data = graphData, aes(x = reorder(model, rep(c(1,2,3), each = 2)), y = effect, fill = income)) + geom_bar(stat="identity", position=position_dodge()) + labs(title = "Estimated effect of income on test scores using \n different polynomial approximations",
x = "Model",
y = "Change in Test Scores \n (Math)")
effectsGraph + theme_minimal()
These are some good practices when working with polynomial regressions:
Always inspect the scatter plot of \(y\) on \(x\). If for different values of \(x\), the effect on \(y\) seems to be drastically different, a linear regression is probably not a good fit and you should consider using a non-linear function.
If the change in the slope is the same (either only increasing or only diminishing for all values of \(x\)), then a quadratic term is probably enough to capture the non-linearity of the relationship. If there is more than one region in which the slope is increasing/diminishing, then you should consider adding higher order terms. It is rare to need \(k \geq 4\).
Consecutively add each next higher polynomial term and run a regression at each stage. That is, start with a linear model, then quadratic, then cubic, etc. Once the next higher polynomial term no longer has a significant effect, drop that term and use the previous model (see the sketch after this list).
Use prior research and theory as a justification for why the relation may not be linear.
Lower order terms do not need to be significant (often that happens due to inherent multicollinearity). For example, in a \(k=4\) polynomial, the \(k=2\) and \(k=4\) terms will be correlated with each other; that inflates the standard errors due to multicollinearity, and in some cases we won’t be able to reject the null that the parameter on the \(k=2\) term is zero. This is not a sign of misspecification bias, just the curse of multicollinearity with polynomials. You should only drop a polynomial term if the highest order term is not significant; in the previous example, only if the parameter on the \(k=4\) term is insignificant.
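Here is a rough sketch of the sequential approach described above, using the CASchools data (the object names fit1, fit2, and fit3 are just illustrative):

# Start with the linear model, then add one higher order term at a time
fit1 <- lm(math ~ income, data = CASchools)
fit2 <- lm(math ~ income + I(income^2), data = CASchools)
fit3 <- lm(math ~ income + I(income^2) + I(income^3), data = CASchools)
# Check whether the highest order term added at each stage is significant
summary(fit2)$coefficients["I(income^2)", ]
summary(fit3)$coefficients["I(income^3)", ]
# Alternatively, compare the nested models with F-tests
anova(fit1, fit2, fit3)

In this example the quadratic term is significant but the cubic term is not, so we would stop at the quadratic model.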
Another way to specify a non-linear relation is to use the natural logarithm function \(\ln(\cdot)\). We can transform either \(y\) or \(x\) using the natural logarithm before estimating the regression.
Logarithms convert changes in variables into percentage changes, and many relationships are naturally expressed in terms of percentages. Additionally, a logarithmic transformation changes the scale in which a variable is measured, which is very convenient when there are a few high-value outliers in the data (e.g. in the previous example, note how the income variable includes a few unusually high observations).
Exponential function and natural logarithm
The exponential function and its inverse, the natural logarithm, are important functional forms in the analysis of nonlinear regressions. The exponential function of \(x\) is \(e^{x}\) (exp(x) in R), where \(e\) is the constant \(2.71828...\). The natural logarithm is the inverse of the exponential function; that is, the natural logarithm is the function for which \(x = \ln(e^x)\).
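A quick way to see this inverse relation in R (just an illustrative sketch):

exp(1)       # e = 2.718282
log(exp(3))  # the natural log undoes exp(), so this returns 3
exp(log(10)) # and vice versa, this returns 10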
This is what the natural logarithm function looks like:
set.seed(1)
n <- 1000 # Number of observations
x <- rnorm(n, mean = 20, sd = 5) # Generating x
y <- 1 + 100*log(x) + rnorm(n, mean = 0, sd = 1) # Generating y using a log function
df <- data.frame(y=y,x=x) # creating data frame
ggplot(df, aes(y=y,x=x)) + geom_point()
Some useful properties of the natural logarithm:

\[ \begin{eqnarray} \ln(1) & = & 0 \\ \ln(0) & = & -\infty \\ \ln(e) & = & 1 \\ \ln(a b) & = & \ln(a) + \ln(b) \\ \ln(a/b) & = & \ln(a) - \ln(b) \\ \ln(1/a) & = & \ln(1) - \ln(a) = -\ln(a)\\ \ln(a^b) & = & b\ln(a) \\ \ln(e^a) & = & a\ln(e) = a \\ \end{eqnarray} \] The function is only defined for \(x > 0\). Be careful with this: applying the natural logarithm to a variable with values equal to \(0\) returns \(-\infty\), and negative values return undefined values (NaN in R).
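In R this looks like the following (a small illustrative sketch):

log(1)      # 0
log(exp(2)) # 2
log(0)      # -Inf
log(-2)     # NaN, with a warning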
There are three potential non-linear specifications using logarithms:
\[y = \beta_{0} + \beta_{1}\ln(x) + \epsilon\]
This is often called the linear-log model.
\[ \begin{eqnarray} y(x = \bar{x}) & = & \beta_0 + \beta_1 \ln(\bar{x}) + \epsilon \\ y(x = \bar{x} + 1) & = & \beta_0 + \beta_1 \ln(\bar{x} + 1) + \epsilon \\ \Delta(y) = y(x = \bar{x} + 1) - y(x = \bar{x}) & = &\beta_0 + \beta_1 \ln(\bar{x} + 1) + \epsilon - ( \beta_0 + \beta_1 \ln(\bar{x}) + \epsilon) \\ & = & \beta_{1}(\ln(\bar{x} +1) - \ln(\bar{x})) \\ & = & \beta_{1} \ln(\dfrac{\bar{x} + 1}{\bar{x}}) \end{eqnarray} \]
\[\lim_{\bar{x} \rightarrow \infty} \Delta(y) = \lim_{\bar{x} \rightarrow \infty} \beta_{1}\ln\left(\dfrac{\bar{x} + 1}{\bar{x}}\right) = \beta_{1}\ln(1) = 0\] That is, the effect on \(y\) of a one-unit increase in \(x\) shrinks toward zero as \(x\) grows: the effect is diminishing.
\[\ln(y) = \beta_{0} + \beta_{1}x + \epsilon\]
This is often called the log-linear model.
\[ \begin{eqnarray} \ln(y(x = \bar{x})) & = & \beta_{0} + \beta_{1}\bar{x} + \epsilon \\ e^{\ln(y(x = \bar{x}))} & = & e^{\beta_{0} + \beta_{1}\bar{x} + \epsilon} \\ y(x = \bar{x}) & = & e^{\beta_{0} + \beta_{1}\bar{x} + \epsilon} \\ y(x = \bar{x} + 1) & = & e^{\beta_{0} + \beta_{1}(\bar{x} +1) + \epsilon} \end{eqnarray} \] Then, \[ \begin{eqnarray} \Delta(y) = y(x = \bar{x} + 1) - y(x = \bar{x}) & = & e^{\beta_{0} + \beta_{1}(\bar{x} +1) + \epsilon} - e^{\beta_{0} + \beta_{1}\bar{x} + \epsilon} \\ & = & e^{\beta_{0} + \beta_{1}\bar{x} + \epsilon} (e^{\beta_{1}} -1) \end{eqnarray} \]
\[\lim_{\bar{x} \rightarrow \infty} \Delta(y) = \lim_{\bar{x} \rightarrow \infty} e^{\beta_{0} + \beta_{1}\bar{x} + \epsilon} (e^{\beta_{1}} -1) = \begin{cases} \infty & \text{if } \beta_{1} > 0 \\ 0 & \text{if } \beta_{1} < 0 \end{cases}\] When \(\beta_{1} > 0\), the effect of \(x\) on \(y\) grows with \(x\): each one-unit increase in \(x\) changes \(y\) by the same percentage, \(100 \times (e^{\beta_{1}} - 1)\%\), so the change measured in levels becomes larger as \(y\) becomes larger.
\[\ln(y) = \beta_{0} + \beta_{1}\ln(x) + \epsilon\]
This is often called the log-log model.
\[ \begin{eqnarray} \ln(y(x = \bar{x})) & = & \beta_{0} + \beta_{1}\ln(\bar{x}) + \epsilon \\ e^{\ln(y(x = \bar{x}))} & = & e^{\beta_{0} + \beta_{1}\ln(\bar{x}) + \epsilon} \\ y(x = \bar{x}) & = & e^{\beta_{0} + \epsilon}\,\bar{x}^{\beta_{1}} \\ y(x = \bar{x} + 1) & = & e^{\beta_{0} + \epsilon}\,(\bar{x} + 1)^{\beta_{1}} \\ \end{eqnarray} \] using \(e^{\beta_{1}\ln(\bar{x})} = \bar{x}^{\beta_{1}}\). Then, \[ \begin{eqnarray} \Delta(y) = y(x = \bar{x} + 1) - y(x = \bar{x}) & = & e^{\beta_{0} + \epsilon}\,(\bar{x} + 1)^{\beta_{1}} - e^{\beta_{0} + \epsilon}\,\bar{x}^{\beta_{1}} \\ & = & e^{\beta_{0} + \epsilon} \left( (\bar{x} + 1)^{\beta_{1}} - \bar{x}^{\beta_{1}} \right) \end{eqnarray} \] Here \(\beta_{1}\) acts as an elasticity: a \(1\%\) change in \(x\) is associated with a \(\beta_{1}\%\) change in \(y\).
We can add logarithmic terms to a regression in two ways:
Manually: Simply create a new variable based on \(\ln(x)\) or \(\ln(y)\), something like logx <- log(x). Then you can add logx to the lm command like any other variable, e.g. lm(y ~ logx) for the linear-log model.
I() command: Instead of creating a new variable, you can use the I() command, which allows you to apply a transformation to a variable before running the regression. You add I(log(x)) to the lm command like this: lm(y ~ I(log(x))), which means: first transform x to log(x) and then run the regression.
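As a quick sketch of both options using the CASchools data (the variable name logincome is just illustrative):

# Option 1: manual transformation
CASchools$logincome <- log(CASchools$income)
reg_manual <- lm(math ~ logincome, data = CASchools)
# Option 2: transformation inside the formula
reg_I <- lm(math ~ I(log(income)), data = CASchools)
# Both produce the same fitted values
all.equal(fitted(reg_manual), fitted(reg_I))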
Because the non-linear transformations are often complex, the interpretation of \(\beta_{1}\) is not as straightforward as in other models. The following table summarizes the interpretation of \(\beta_{1}\) for the three natural log transformations:
Case | Model Name | Regression Specification | Interpretation of \(\beta_{1}\) |
---|---|---|---|
1 | Linear-Log | \(y = \beta_{0} + \beta_{1}\ln(x) + \epsilon\) | A 1% change in \(x\) is associated with a change in \(y\) of \(0.01 \times \beta_{1}\) |
2 | Log-Linear | \(\ln(y) = \beta_{0} + \beta_{1}x + \epsilon\) | A change in \(x\) by one unit is associated with a \(100 \times \beta_{1} \%\) change in \(y\) |
3 | Log-Log | \(\ln(y) = \beta_{0} + \beta_{1}\ln(x) + \epsilon\) | A 1% change in \(x\) is associated with a \(\beta_{1}\%\) change in \(y\) |
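As a quick worked example of the log-linear case (the coefficient value \(0.05\) here is purely illustrative, not from the data): if \(\beta_{1} = 0.05\), a one-unit increase in \(x\) is associated with approximately a \(100 \times 0.05 = 5\%\) increase in \(y\). The exact change is \(100 \times (e^{0.05} - 1) \approx 5.1\%\), so the \(100 \times \beta_{1}\) approximation works well when \(\beta_{1}\) is small.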
Let’s illustrate the use of natural logarithms using the CASchools dataset. Let’s take a look at the data again:
graph1 <- ggplot(data = CASchools, aes(x = income, y = math)) + geom_point()
graph1 <- graph1 + labs(title = "Scatterplot of Math Test Scores vs. District Average Income",
x = "Disctrict Income \n (Thousands of dollars)",
y = "Test Scores \n (Math)")
graph1
graph1 <- graph1 + geom_smooth(method = lm, formula = (y ~ x), color = "blue", se = FALSE)
graph1 <- graph1 + geom_smooth(method = lm, formula = (y ~ I(log(x))), color = "red", se = FALSE)
graph1
Now let’s estimate the regression models. I’ll estimate all possible specifications using logs just to illustrate the process:
reg1 <- lm(math ~ income, data = CASchools) #linear model
reg2 <- lm(math ~ I(log(income)), data = CASchools) #linear-log model
reg3 <- lm(I(log(math)) ~ income, data = CASchools) #log-linear model
reg4 <- lm(I(log(math)) ~ I(log(income)), data = CASchools) #log-log model
#Using stargazer for output
suppressMessages(library(stargazer))
stargazer(list(reg1, reg2, reg3, reg4), type = "html")
Dependent variable: math in columns (1)-(2); I(log(math)) in columns (3)-(4)

| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| income | 1.815*** | | 0.003*** | |
| | (0.091) | | (0.0001) | |
| I(log(income)) | | 34.664*** | | 0.053*** |
| | | (1.610) | | (0.002) |
| Constant | 625.539*** | 561.661*** | 6.440*** | 6.342*** |
| | (1.536) | (4.304) | (0.002) | (0.007) |
| Observations | 420 | 420 | 420 | 420 |
| R2 | 0.489 | 0.526 | 0.481 | 0.522 |
| Adjusted R2 | 0.488 | 0.525 | 0.480 | 0.521 |
| Residual Std. Error (df = 418) | 13.420 | 12.928 | 0.021 | 0.020 |
| F Statistic (df = 1; 418) | 400.257*** | 463.806*** | 387.338*** | 456.883*** |

Note: *p<0.1; **p<0.05; ***p<0.01
Interpretation of \(\beta_{1}\)
Linear Model: a one-unit (i.e. $1,000) increase in income is associated with a \(1.815\) point increase in math.

Linear-Log Model: a 1% increase in income is associated with a change in math of \(0.01 \times 34.664 \approx 0.347\) points.

Log-Linear Model: a one-unit increase in income is associated with a \((100 \times 0.003) \% = 0.3\%\) change in math.

Log-Log Model: a 1% increase in income is associated with a \(0.053\%\) change in math.

All the models reject the null that income is not related to math.
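As a rough check of the linear-log interpretation using predict() (reg2 here is the linear-log model estimated above; the specific income values are just illustrative):

# Predicted change in math when income increases by 1%, from 20 to 20.2 (thousand dollars)
predict(reg2, data.frame(income = 20.2)) - predict(reg2, data.frame(income = 20))
# This is close to 0.01 * 34.664, as the interpretation table suggests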