
Regression: Derivations and analysis

Overview

This lesson shows the derivation of the regression coefficients and presents a number of additional regression-related quantities and tests.

Objectives

After completing this module, students should be able to:

  1. Derive the multiple regression coefficients, standard errors, \(R^2\), adjusted \(R^2\), and F statistic.
  2. Discuss why the assumption of exogeneity is essential to estimate unbiased parameters.

Readings

Lander, Chapter 16; Schumacker, Chapter 17 through p. 402.

Calculating coefficients in bivariate regression

In the previous learning module we learned how to compute the parameters of the bivariate regression model,

\[y = \beta_{0} + \beta_{1}x + \epsilon\] How did we calculate the coefficients?
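
As a reminder, in the bivariate case the least-squares estimates have simple closed-form expressions (written here in the \(\beta_{\text{ols}}\) notation used below):

\[\beta_{1,\text{ols}} = \frac{\sum_{i} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i} (x_{i} - \bar{x})^{2}}, \qquad \beta_{0,\text{ols}} = \bar{y} - \beta_{1,\text{ols}}\,\bar{x}\]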

Derivation of coefficients in multiple regression

  • Unfortunately, the formulae for the coefficients of the multivariate regression model are not as simple as those for the bivariate case. Because there are additional variables, the derivation can be quite tedious, even with just one extra variable.

  • One shortcut is to use linear algebra: we can represent the regression model in matrix notation as follows:

\[\begin{eqnarray*} y & = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_k x_k + \epsilon \\ y & = & X \beta + \epsilon \end{eqnarray*}\]

Where,

  • \(X\) is a matrix of explanatory variables. Each column of \(X\) represents an individual variable, and each row an observation.
  • The first column of \(X\) is a constant: every entry is equal to 1. This column represents the intercept of the model.
  • \(y\) is still the dependent variable, now written as a vector.
  • \(\epsilon\) is still the same error term, also written as a vector.
  • The number of columns of \(X\) is \(k+1\), where \(k\) is the number of variables.
  • The number of rows of \(X\) is \(n\), where \(n\) is the number of observations.

The matrix notation represents the same model; it is simply a shorter way of writing it.

Matrix operations

Some notations to recall first:

  • \(A'\) is the transpose of a matrix \(A\), the matrix obtained by swapping its rows and columns.

  • \(A^{-1}\) is the inverse of a square matrix \(A\), the matrix analogue of the reciprocal of a single number.

  • Just as \(a^{-1} = 1/a\) satisfies \(a \times a^{-1} = 1\) for a single number, with matrices \(AA^{-1} = \textbf{1}\), the identity matrix. See http://mathworld.wolfram.com/MatrixInverse.html for more information about the inverse of a matrix.

  • If two random variables (written as vectors of observations), \(x\) and \(y\), are independent and have mean zero, then \(x'y = 0\) in expectation. A short R check of these matrix operations appears just after this list.
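
These operations map directly onto R functions that we will use in the manual examples below: t() for the transpose, solve() for the inverse, and %*% for matrix multiplication. Here is a minimal check using a small, arbitrary matrix (the numbers are made up for illustration):

A <- matrix(c(2, 1, 1, 3), nrow = 2)  # a small arbitrary 2 x 2 matrix
t(A)                                  # transpose, A'
solve(A)                              # inverse, A^-1
A %*% solve(A)                        # matrix multiplication; recovers the identity matrix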

Solving for \(\beta\)

Start with the regression model in matrix notation:

\[y = X \mathbf{\beta} + \epsilon\]

  • Now multiply both sides by \(X'\):

\[X'y = X'X \mathbf{\beta} + X'\epsilon\]

  • Next, multiply both sides by \((X'X)^{-1}\), the inverse of \(X'X\), to get:

\[(X'X)^{-1}X'y = (X'X)^{-1}X'X \mathbf{\beta} + (X'X)^{-1}X' \mathbf{\epsilon}\]

  • And since \((X'X)^{-1}X'X = \textbf{1}\) by the definition of the inverse, we have:

\[(X'X)^{-1}X'y = \mathbf{\beta} + (X'X)^{-1}X' \mathbf{\epsilon}\]

  • When estimating a regression we compute \((X'X)^{-1}X'y\) to get the values of the estimated parameters. Thus, let’s define \(\beta_{\text{ols}} = (X'X)^{-1}X'y\) as the estimated parameters. Then,

\[\beta_{\text{ols}} = \mathbf{\beta} + (X'X)^{-1}X' \mathbf{\epsilon}\]

  • If \(X'\epsilon = 0\) then \(\beta_{\text{ols}} = \beta\), which means that our estimated parameters are equal to the real parameters of the regression (the estimator is unbiased).

  • The assumption of \(X'\epsilon = 0\) is the same as saying each variable in \(X\) is independent of \(\epsilon\). This assumption is called exogeneity, and it means that the error term is not related to any of the independent variables, \(x\), of the model.

  • If the property of exogeneity is violated, then \(\beta_{\text{ols}} \neq \mathbf{\beta}\). That is, the estimated parameter \(\beta_{\text{ols}}\) is not equal to the real parameter \(\beta\).

  • That is why not controlling for relevant variables introduces bias in the estimation; the simulation sketch after this list illustrates the point.
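
To see this concretely, here is a minimal simulation sketch with made-up data (the variable names, sample size, and coefficient values are arbitrary). When a relevant variable that is correlated with an included variable is left out, it ends up in the error term, exogeneity fails, and the estimated coefficient is biased:

set.seed(1)                                  # for reproducibility
n_sim <- 10000
x1 <- rnorm(n_sim)
x2 <- 0.8 * x1 + rnorm(n_sim)                # x2 is correlated with x1
y_sim <- 1 + 2 * x1 + 3 * x2 + rnorm(n_sim)  # true coefficients: 1, 2, and 3

coef(lm(y_sim ~ x1 + x2))  # estimates close to the true values
coef(lm(y_sim ~ x1))       # x2 omitted: the slope on x1 is biased (roughly 2 + 3 * 0.8)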

Example in R

Let’s return to the results from the previous lesson:

dataCPS <- read.csv("/Users/CCD-MAC/Documents/Dropbox NEU/Dropbox/Teaching/NEU/2019/PPUA5301/PPUA5301 - Summer 2019/Lectures/cps12.csv",sep=",", stringsAsFactors=FALSE)
mr2 <- lm(ahe ~ bachelor + female + age, data = dataCPS)
summary(mr2)

Call:
lm(formula = ahe ~ bachelor + female + age, data = dataCPS)

Residuals:
    Min      1Q  Median      3Q     Max
-23.688  -6.207  -1.708   4.280  75.302

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.86620    1.18760   1.571    0.116
bachelor     8.31863    0.22739  36.584   <2e-16 ***
female      -3.81030    0.22960 -16.596   <2e-16 ***
age          0.51029    0.03952  12.912   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.678 on 7436 degrees of freedom
Multiple R-squared:  0.1801,    Adjusted R-squared:  0.1798
F-statistic: 544.5 on 3 and 7436 DF,  p-value: < 2.2e-16

Manual Example

Now let’s derive it using matrix algebra. Note that t(X) is R’s notation for \(X'\), and solve(X) is R’s notation for \(X^{-1}\). Also, recall that when multiplying matrices you need to use the matrix multiplication operator %*%.

First, build the \(X\) matrix, or data matrix,

xmat <- as.matrix(cbind(dataCPS$bachelor, dataCPS$female, dataCPS$age))
xmat <- cbind(1,xmat) # add the column of 1's

# Let's take a look at it
head(xmat)
     [,1] [,2] [,3] [,4]
[1,]    1    0    0   30
[2,]    1    0    0   29
[3,]    1    0    0   27
[4,]    1    0    1   25
[5,]    1    1    1   27
[6,]    1    1    0   30

Second, use the \(\beta_{\text{ols}} = (X'X)^{-1}X'y\) formula to estimate the parameters,

#now we solve for Beta_ols in one step: (X'X)^-1 X'y :
y <- dataCPS$ahe
betas <- solve( t(xmat) %*% xmat )   %*%   t(xmat) %*% y
betas
          [,1]
[1,]  1.866198
[2,]  8.318628
[3,] -3.810305
[4,]  0.510286

Derivation of the standard errors

To derive the standard errors of the model we need the standard deviation of the estimated coefficients, so let’s start by finding their variance. This requires a bit of linear algebra that is beyond the scope of the course. I’ll write the derivation below, but I don’t expect you to follow every single step; the important part is the result, as we’ll refer to it in the future.

\[\begin{eqnarray*} \text{Var}(\beta_{\text{ols}}) & = & E[(\beta_{\text{ols}} - E[\beta_{\text{ols}}])(\beta_{\text{ols}} - E[\beta_{\text{ols}}])'] \\ & = & E[(\beta_{\text{ols}} - \mathbf{\beta})(\beta_{\text{ols}} - \mathbf{\beta})'] \\ & = & E[(X'X)^{-1}X' \mathbf{\epsilon}((X'X)^{-1}X' \mathbf{\epsilon})'] \\ & = & E[(X'X)^{-1}X' \mathbf{\epsilon}\epsilon' X (X'X)^{-1}] \\ & = & (X'X)^{-1}X' E[\mathbf{\epsilon}\epsilon'] X (X'X)^{-1} \\ & = & (X'X)^{-1}X' \sigma_{y}^2 \textbf{1} X (X'X)^{-1} \\ & = & \sigma_{y}^2 (X'X)^{-1}X'X (X'X)^{-1} \\ & = & \sigma_{y}^2 (X'X)^{-1} \end{eqnarray*}\]

where \(\sigma_{y}^2\) is the variance of \(y\) around the regression line (the variance of the error term), and we used \(E[\mathbf{\epsilon}\epsilon'] = \sigma_{y}^2 \textbf{1}\), which holds when the errors are uncorrelated across observations and have constant variance. Then,

\[\begin{eqnarray*} \text{SD}(\beta_{\text{ols}}) & = & \sigma_{y} \sqrt{(X'X)^{-1}} \end{eqnarray*}\]

and,

\[\begin{eqnarray*} \text{SE}(\beta_{\text{ols}}) & = & \text{se}_{y} \text{diag}(\sqrt{(X'X)^{-1}}) \end{eqnarray*}\]

where,

\[\text{se}_{y} = \sqrt{ \frac{\sum (y_i-\hat{y}_i)^2 }{n-k-1}}\]

and \(\text{diag}\) means that we need to extract the main diagonal elements of the matrix.

We can use the standard errors to conduct hypothesis tests on the individual parameters of the model, just as we did for the bivariate model.

Manual Example

Similar to the estimation of the \(\beta\) coefficients, we can manually compute the standard errors of the coefficients. We need to estimate the parameters first, then compute \(\text{se}_{y}\), and finally use the formula for the standard errors:

#Calculating se_y
yhat <- xmat %*% betas # Fitted values
n <- nrow(xmat) # Number of observations, rows
kPlus1 <- ncol(xmat) # columns of xmat = k + 1
se_y <- sqrt(sum( (y - yhat)^2 ) / (n - kPlus1) )
se_y * sqrt( diag( solve( t(xmat) %*% xmat )) )
## [1] 1.18760411 0.22738503 0.22959794 0.03952061
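
With the coefficients and their standard errors in hand we can also reproduce the t values and p values that summary() reports. This is a sketch reusing the objects defined above (betas, se_y, xmat, n, kPlus1); the names se_betas, t_vals, and p_vals are just for illustration:

se_betas <- se_y * sqrt( diag( solve( t(xmat) %*% xmat ) ) )          # standard errors, as above
t_vals <- betas / se_betas                                            # t statistics for H0: beta = 0
p_vals <- 2 * pt( abs(t_vals), df = n - kPlus1, lower.tail = FALSE )  # two-sided p values
cbind(t_vals, p_vals)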

Compare these results to the ones obtained using lm:

summary(mr2)
##
## Call:
## lm(formula = ahe ~ bachelor + female + age, data = dataCPS)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -23.688  -6.207  -1.708   4.280  75.302
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.86620    1.18760   1.571    0.116
## bachelor     8.31863    0.22739  36.584   <2e-16 ***
## female      -3.81030    0.22960 -16.596   <2e-16 ***
## age          0.51029    0.03952  12.912   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.678 on 7436 degrees of freedom
## Multiple R-squared:  0.1801, Adjusted R-squared:  0.1798
## F-statistic: 544.5 on 3 and 7436 DF,  p-value: < 2.2e-16

Interpretation

The parameters \(\beta_\text{bachelor}\), \(\beta_\text{female}\), and \(\beta_\text{age}\) are statistically different from zero. The intercept parameter is not statistically different from zero.

\(R^2\)

The \(R^2\) can be as easily derived in the multiple case as in the bivariate case:

\[R^{2} = \frac{TSS - SSE}{TSS}\],

where the total sum of squares \(TSS = \sum_{i} (y_{i} - \bar{y})^{2}\) and the model sum of squared errors is \(SSE = \sum_{i} (y_{i} - \hat{y}_{i})^{2}\).

Neither of these depends on \(X\) directly; both can be easily calculated from the fitted values \(\hat{y}\).

We could of course generate \(\hat{y}\) by hand every time (see the previous example), but luckily R has a built-in function for generating predicted values from fitted models, predict:

ypred <- predict(mr2) # This is the same as yhat
# and the rest of it is done as we have done before:
tss <- sum((y - mean(y))^2)
sse <- sum((y-ypred)^2)
r2 <- (tss-sse)/tss
r2
[1] 0.1801078

Recall that \(R^2\) measures how much of the variation in \(y\) is explained by the model.

\(R^2\) and the risk of overfitting

Overfitting:

  • There is a problem with \(R^2\): it doesn’t penalize adding more and more variables, i.e., \(R^2\) is a non-decreasing function of the number of variables.

  • Theoretically, you could add dozens of variables to the model; since each one at worst does nothing to improve \(\hat{y}\) and might accidentally improve it, with enough variables you could get a very high \(R^2\) entirely by chance.

  • This is called overfitting, and it is an especially severe problem for complex models with many variables, as we will see in the last modules when we use machine learning to specify the model.

  • But even for simple linear models like multiple regression it can be a problem, leading to misleadingly large \(R^2\) values when in fact much of the “explanatory” power of the model is due to pure chance. The short sketch after this list illustrates the point.
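
A minimal sketch of this, reusing y and xmat from above and adding made-up noise variables (junk is an arbitrary name), shows that \(R^2\) can only go up when we add regressors, even regressors that are pure noise:

set.seed(2)
junk <- replicate(10, rnorm(length(y)))        # 10 columns of pure noise
summary(lm(y ~ xmat[, -1]))$r.squared          # the original three predictors
summary(lm(y ~ xmat[, -1] + junk))$r.squared   # noise added: R^2 creeps up by chance alone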

Adjusted \(R^2\)

  • To compensate for overfitting, the concept of the adjusted \(R^2\) was developed, which basically penalizes the \(R^2\) for each additional independent variable.

  • The basic idea is that, unlike the \(R^2\), the adjusted \(R^2\) only increases when a new variable increases \(R^2\) more than would be expected by chance alone.

  • The formula for the adjusted \(R^2\) is similar to that for the \(R^2\), except with an adjustment for the degrees of freedom (i.e., the sample size and number of variables):

\[\textrm{adjusted } R^2 = \frac{TSS/df_t - SSE/df_e}{TSS/df_t}\]

  • \(df_t = n - 1\) and \(df_e = n - k - 1\) (where \(k\) is the number of independent variables).

  • The adjusted \(R^2\) is smaller than \(R^2\); when the two are close, it reflects the fact that most of the variables included in the model do have explanatory power (as reflected by their individual p values) and aren’t just boosting the \(R^2\) by chance alone.

Thus to return to our example, we can directly calculate the adjusted \(R^2\):

n <- length(y)
k <- ncol(xmat)-1
dft <- n - 1
dfe <- n - k - 1
(tss/dft - sse/dfe)/ (tss/dft)
[1] 0.179777

F test

  • Recall that we previously used the F test to conduct simultaneous tests of mean equality.

  • We can use the F test in a regression model to test whether all the parameters are simultaneously equal to zero. This is a way to measure the overall significance of the model. That is, if all the parameters are jointly equal to zero, then even if we have a high \(R^2\) by chance, none of the parameters is actually significant.

  • The question is whether at least one of the \(\beta\) coefficients is significantly different from 0.

\[H_0: \text{None of them are significantly different from 0}\]

\[H_1: \text{At least one coefficient is significantly different from 0}\]

  • We can check whether any of the p values for the \(\beta\) coefficients are below 0.05, but sometimes all of them are above 0.05 and yet collectively low enough that we would not expect to see that pattern by chance.

  • For instance, if we had 10 variables each with a p value between 0.05 and 0.10, while none of them would individually be significant, the probability of getting 10 variables with p values all that low would be very small by pure chance, allowing us to reject the null.

  • Unsurprisingly, the F statistic is related to the \(R^2\), since both are measures of overall model fit. A statistically significant F-test indicates that the model predicts the dependent variable better than the mean of the dependent variable alone.

F statistic

The F statistic for the multiple regression model should look familiar:

\[F = \frac{R^2 / k}{(1-R^2)/(n-k-1)}\] As always for the F distribution, we need two different degrees of freedom: \(df_1 = k\) and \(df_2 = n - k -1\).

Therefore:

f <- (r2/k) / ((1-r2)/(n-k-1))
f
[1] 544.4949
pf(f,k,(n-k-1),lower.tail=F)
[1] 6.506845e-320

Interpretation

Because the p-value of the F-test is less than \(\alpha = 0.05\), we reject the null hypothesis that all the coefficients of the model are jointly equal to zero; at least one coefficient is significantly different from zero.

Summary

  • We can compute the parameters of the multivariate regression model in two ways:

    1. Manually: Using the formula \(\beta_{\text{ols}} = (X'X)^{-1}X'y\)
    2. In R: Using the lm() command and specifying all the independent variables, separated by the + symbol (a side-by-side check appears at the end of this summary).
  • If the assumption of exogeneity is not satisfied, the estimated parameters will be biased.

  • Not controlling for other relevant variables can lead to violation of the exogeneity assumption.

  • We can use the standard errors to conduct individual hypothesis tests of significance for each \(\beta\) coefficient.

  • The F-test is a joint significance test: it indicates whether all the parameters are simultaneously statistically equal to zero.

  • The adjusted \(R^2\) is similar to the \(R^2\) goodness-of-fit coefficient, but it penalizes the introduction of irrelevant variables, reducing the risk of overfitting the model.
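
As a quick check of the first summary point, the two ways of computing the coefficients give the same numbers (this reuses betas and mr2 from above):

# the manual (matrix algebra) estimates and the lm() estimates coincide
data.frame(manual = as.vector(betas), lm = coef(mr2))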