
Multicollinearity

Overview

This lesson introduces the problem of multicollinearity and how to weigh the pros and cons of including additional variables in a regression model.

Objectives

After completing this module, students should be able to:

  1. Understand what perfect and imperfect multicollinearity are
  2. Test for multicollinearity in a regression
  3. Identify the pros and cons of including additional variables in a regression model

Readings

NA

Perfect multicollinearity

  • A set of variables (\(X\)) is perfectly multicollinear if one of the variables, \(x_{i}\), in the set is a linear combination of the other variables. That is, the rest of the variables can be used to perfectly predict \(x_{i}\).

  • For example, imagine that we have a set of three variables \(X = \{x_{1}, x_{2}, x_{3} \}\), where \(x_{1} = 2 \times x_{2} + x_{3}\). Because we can use \(x_{2}\) and \(x_{3}\) to perfectly predict \(x_{1}\) we say that \(X\) is perfectly multicollinear.

  • Perfect multicollinearity is a problem because, when it is present, the inverse of the matrix \(X'X\) cannot be computed. Recall that we need this inverse to estimate the parameters of a multivariate regression, \(\beta_{\text{ols}}= (X'X)^{-1}X'y\). So, if some of the variables in \(X\) are a linear combination of the other variables, we won’t be able to compute the parameters of the model. The following brief simulation illustrates this issue.

R Simulations: Perfect multicollinearity

To simulate perfect multicollinearity we are going to generate a data matrix where one of the variables is a linear combination of the rest.

n <- 1000 #number of observations
x1 <- rnorm(n) #generating independent variable x1
x2 <- rnorm(n) #generating independent variable x2
x3 <- 2*x1 + x2 #generating x3 as 2*x1 + x2
y <- 10 + 1*x1 + 2*x2 + 3*x3 + rnorm(n) #generating y

# Creating X-matrix
X <- cbind(rep(1,n), x1, x2, x3)
  • The next step to compute the parameters using linear algebra is to compute \((X'X)^{-1}\), but if we run the command solve(t(X) %*% X), R will display the following error: Error in solve.default(t(X) %*% X) : system is computationally singular. This is caused by perfect multicollinearity (a minimal sketch follows).
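
The following sketch reproduces this failure with the X matrix built above; wrapping the call in try() lets us capture the error instead of halting execution.

# Attempting to invert X'X fails because X contains a perfect linear combination;
# try() captures the "system is computationally singular" error
XtX_inv <- try(solve(t(X) %*% X), silent = TRUE)
class(XtX_inv) #returns "try-error"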

  • If we use the lm command, R will automatically exclude the variable that is causing perfect multicollinearity.

suppressMessages(library(stargazer)) #loading stargazer
reg1 <- lm(y ~ x1 + x2 + x3)
stargazer(reg1, type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser","f")) #Displaying regressions with stargazer
              Dependent variable: y
Constant      9.987***
              (0.031)
x1            7.038***
              (0.031)
x2            4.918***
              (0.032)
x3
Observations  1,000
R2            0.987
Adjusted R2   0.987
Note: *p<0.1; **p<0.05; ***p<0.01


  • See how R excluded \(x_{3}\) automatically. Also, note how the model without \(x_{3}\) does not recover the causal effects of \(x_{1}\) and \(x_{2}\) on \(y\) that we used when generating the data.
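
To see why these estimates are expected, substitute the definition of \(x_{3}\) into the equation used to generate \(y\):

\[y = 10 + x_{1} + 2x_{2} + 3x_{3} + \epsilon = 10 + x_{1} + 2x_{2} + 3(2x_{1} + x_{2}) + \epsilon = 10 + 7x_{1} + 5x_{2} + \epsilon\]

So once \(x_{3}\) is dropped, the regression can only recover the combined coefficients 7 and 5 (which is roughly what the table shows), not the separate causal effects of \(x_{1}\), \(x_{2}\), and \(x_{3}\).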

Identifying and dealing with perfect multicollinearity

  • Luckily for us, perfect multicollinearity is very rare and easy to identify.

    • To identify perfect multicollinearity we just need to attempt to estimate the parameters of the model. If we can’t compute \((X'X)^{-1}\), that’s a strong signal of perfect multicollinearity.

    • Perfect multicollinearity only happens when one variable can be perfectly predicted from the rest of the variables in the dataset, and this almost exclusively happens by construction. For example, imagine that you have an indicator variable for gender. In most models this variable is saved as a dummy where \(0\) indicates that the individual is male and \(1\) indicates that the individual is female (this is simply the most common coding practice; in no way are we making a political statement by treating gender as binary). If you construct a new variable called male that is simply male = 1 - female, and run a regression with both male and female as independent variables, the model will suffer from perfect multicollinearity, because all the variation in male can be explained by the variable female (each individual is coded as either male or female). This mistake is called the dummy variable trap: when using categorical variables you should always exclude at least one category from the regression to avoid perfect multicollinearity (see the sketch after this list).
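
A minimal sketch of the dummy variable trap, using simulated (purely illustrative) data:

n      <- 500
female <- rbinom(n, size = 1, prob = 0.5) #0/1 gender dummy
male   <- 1 - female                      #redundant dummy, a linear function of female
wage   <- 20 + 5*female + rnorm(n)        #illustrative outcome variable

# lm() detects the perfect multicollinearity and drops the redundant dummy,
# reporting its coefficient as NA
coef(lm(wage ~ female + male))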

Imperfect multicollinearity

  • A set of variables \(X\) suffers from imperfect multicollinearity when the variation of one of the variables in the set (\(x_{i}\)) can be explained, for the most part, by the rest of the variables.

  • Note that this concept differs from perfect multicollinearity because we are imposing a weaker restriction on \(x_{i}\). Previously we assumed that \(x_{i}\) was perfectly explained by the other variables; now we are only saying that a large part (but not all) of the variation can be explained by the rest of the variables. So the distinction is a matter of degree.

  • Imperfect multicollinearity does not stop us from estimating the parameters of the model like perfect multicollinearity does, but it introduces a different type of problem: imprecision. When a regression suffers from imperfect multicollinearity, at least one of the parameters in the model will be imprecisely estimated.

  • When we say that an estimate is imprecise, we mean that the estimated standard errors of the parameters are inflated (larger than they would be without multicollinearity). If the standard errors are very large, then it is harder for us to reject a null hypothesis (e.g., that a parameter equals zero). Recall that the larger the standard errors, the wider the confidence intervals when testing a hypothesis, which implies that we’ll be less confident when rejecting the null.

  • The reason why this happens is that when one of the independent variables changes, the other imperfectly multicollinear variables vary along with it; and if all the variables are changing simultaneously, it is harder to determine which one is causing the variation in \(Y\). A formal proof of this requires linear algebra that is beyond the scope of this class. Instead, we’ll use more R simulations so you can see why imperfect multicollinearity is a problem.

R Simulations: Imperfect multicollinearity

To simulate imperfect multicollinearity we are going to generate a data matrix where one of the independent variables is correlated with the other independent variables. By tweaking the degree of association between the variables, we can make the problem of imperfect multicollinearity more or less severe.

n <- 1000 #number of observations
x1 <- rnorm(n) #generating independent variable x1
x2 <- rnorm(n) #generating independent variable x2
x3 <- 20*x1 + 20*x2 + rnorm(n) #generating x3 as an imperfect function of x1 and x2
y <- 10 + 1*x1 + 1*x2 + 1*x3 + rnorm(n) #generating y
reg1 <- lm(y ~ x1 + x2 + x3)
stargazer(reg1, type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser","f")) #Displaying regressions with stargazer
              Dependent variable: y
Constant      10.010***
              (0.032)
x1            2.339***
              (0.630)
x2            2.357***
              (0.629)
x3            0.930***
              (0.031)
Observations  1,000
R2            0.999
Adjusted R2   0.999
Note: *p<0.1; **p<0.05; ***p<0.01


  • Note that, unlike perfect multicollinearity, imperfect multicollinearity allows us to estimate the parameters for all the variables; the cost is the much larger standard errors on \(x_{1}\) and \(x_{2}\).

Variance inflation factor

  • The variance inflation factor (VIF) is a measure that indicates by how much the variance (and thus the standard errors) of the parameters in a regression model is inflated by the presence of imperfect multicollinearity.

  • To compute the VIF you need to follow these simple steps:

    1. Run a regression of one of the independent variables (\(x_{i}\)) on the rest of the variables of the model: \[x_{i} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \epsilon\]

    2. Compute the \(R^{2}\) of the previous regression.

    3. Compute the VIF using this formula, \[VIF = \dfrac{1}{1-R^{2}}\]

Because \(R^{2}\in (0,1)\), a value of \(R^2\) close to 1 indicates almost perfect multicollinearity (the variation of \(x_{i}\) is almost perfectly explained by the rest of the variables). In that case \(VIF \rightarrow \infty\): the standard errors tend to infinity and we cannot reject any null hypothesis because the confidence intervals become arbitrarily wide (the hypothesized value will always fall inside the confidence interval). On the other hand, if \(R^{2} \rightarrow 0\) then \(VIF \rightarrow 1\), and the variance is not inflated by imperfect multicollinearity.
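
The three steps translate directly into a short R helper; the function name computeVIF below is purely illustrative:

# Pass in the fitted auxiliary regression of x_i on the remaining regressors;
# the helper extracts its R^2 and returns 1 / (1 - R^2)
computeVIF <- function(aux_reg) {
  r2 <- summary(aux_reg)$r.squared
  1 / (1 - r2)
}

# Example usage (assuming x1, x2, x3 from the previous simulation are in the workspace):
# computeVIF(lm(x3 ~ x1 + x2))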

Almost all multivariate models will have some degree of imperfect multicollinearity because there is almost always some correlation among the independent variables in a model; whenever you add a new variable you are making the estimated parameters somewhat less precise. In practice, imperfect multicollinearity is a major problem when \(VIF \geq 10\). We can categorize multicollinearity using the \(VIF\) with the following rules:

  1. Perfect multicollinearity

    \(VIF \rightarrow \infty\). You won’t be able to estimate the parameters of the regression.

  2. Severe imperfect multicollinearity

    \(VIF \geq 10\). Any inference based on the standard errors of the parameters is unreliable.

  3. Moderate imperfect multicollinearity

    \(VIF \in (5, 10)\). Inference based on the standard errors is somewhat affected by multicollinearity. You should consider requiring very high confidence levels (e.g., 99%), or equivalently very low p-value thresholds (p-value \(\leq 0.01\)), before rejecting the null hypothesis, because barely rejecting the null at the 95% confidence level may be an artifact of the moderate inflation of the standard errors.

  4. Weak imperfect multicollinearity

    \(VIF \leq 5\). Imperfect multicollinearity is not an issue. You can use the estimated standard errors to conduct hypothesis tests at the usual confidence levels.

Using VIF to diagnose imperfect multicollinearity (case 1)

We are going to use the data from the previous simulation to see how we can use the \(VIF\) to check whether imperfect multicollinearity is a problem.

  • Case 1
n <- 1000 #number of observations
x1 <- rnorm(n) #generating independent variable x1
x2 <- rnorm(n) #generating independent variable x2
x3 <- 20*x1 + 20*x2 + rnorm(n) #generating x3 as an imperfect function of x1 and x2
y <- 10 + 1*x1 + 1*x2 + 1*x3 + rnorm(n) #generating y

#STEP 1: Regression of x_3 on x1 and x2
regX3 <- lm(x3 ~ x1 + x2)

#STEP 2: Get R2 from regression
output <- summary(regX3)
r2 <- output$r.squared

#STEP 3: Compute VIF
vif <- 1 / (1 - r2)
vif 
## [1] 815.59
#Diagnose
ifelse(vif > 10,
       "Imperfect multicollienarity is severe",
       "Imperfect multicollienarity is not severe")
## [1] "Imperfect multicollienarity is severe"

According to these results we cannot rely on the hypothesis tests derived from the model, i.e. we cannot conclude whether or not to reject the null hypothesis that a parameter equals zero. With \(VIF \approx\) 815.59 \(\geq 10\), multicollinearity is severe and we cannot trust hypothesis tests based on the standard errors of the parameters.

Using VIF to diagnose imperfect multicollinearity (case 2)

  • Case 2
n <- 1000 #number of observations
x1 <- rnorm(n) #generating independent variable x1
x2 <- rnorm(n) #generating independent variable x2
x3 <- 0.01*x1 + 0.01*x2 + rnorm(n) #generating x3 as an imperfect function of x1 and x2
y <- 10 + 1*x1 + 1*x2 + 1*x3 + rnorm(n) #generating y

#STEP 1: Regression of x_3 on x1 and x2
regX3 <- lm(x3 ~ x1 + x2)

#STEP 2: Get R2 from regression
output <- summary(regX3)
r2 <- output$r.squared

#STEP 3: Compute VIF
vif <- 1 / (1 - r2)
vif 
## [1] 1.000528
#Diagnose
ifelse(vif > 10,
       "Imperfect multicollienarity is severe",
       "Imperfect multicollienarity is not severe")
## [1] "Imperfect multicollienarity is not severe"

According to these results we can rely on the hypothesis tests derived from the model, i.e. we can decide whether or not to reject the null hypothesis that a parameter equals zero. With \(VIF \approx\) 1, the standard errors of the parameters are inflated by a factor of about \(\sqrt{VIF} \approx\) 1, i.e. essentially not inflated at all.

Adding variables

By now it should be obvious that, when deciding which variables to include in a model, the statistician faces a very important trade-off:

  • Unbiasedness: In the last lesson we discussed the issue of establishing causality in a regression model, and how excluding relevant variables that are correlated with both \(X\) and \(Y\) from the regression leads to omitted variable bias.

  • Precision: But we just learned that adding independent variables that are correlated with each other will make the estimated parameters less precise.

The researcher has to balance bias and imprecision when deciding whether or not to include an additional variable in the model. There is no straightforward method to determine the optimal number of variables to include, as that depends on the topic and data at hand.

Good practices

Keep in mind the following good practices when deciding what variables to include in a regression:

  • Avoid a \(VIF \geq 10\). If that’s not possible because you consider that reducing the bias is more important than having imprecise parameters, then keep in mind that hypothesis tests based on the standard errors of the parameters are not reliable.

  • If \(VIF \in (5,10)\) only reject the null at very high confidence levels.

  • Add omitted relevant variables (\(Z\)) in sequence, starting with the ones that are most strongly correlated with both \(X\) and \(Y\) (these are the ones that cause the most bias in the parameters). At each step you can compute the \(VIF\) and check whether imperfect multicollinearity is a problem; I personally think it is better to correct for omitted variable bias first and then deal with multicollinearity.

  • Standard errors decrease with the number of observations (\(n\)), so increasing \(n\) will reduce the problem of imprecision by making confidence intervals tighter, even in the presence of multicollinearity (see the sketch after this list). Unless there is a good reason not to, always work with all the observations available.

  • Generally, in a regression model there is one variable, or a limited number of variables, that is essential to test the research question (hypothesis). You should prioritize correcting the bias for the variables of interest. If adding an additional variable does not change the parameters of the variables of interest much, it may not be worth including in the regression, even if it corrects the bias of the controls.

  • Maybe you should simply not run a regression. If adding a variable makes the model suffer from severe multicollinearity, but excluding the variable causes a very large bias, then a regression model is not a reliable research methodology. Sometimes it is better not to know the answer to a research question than to follow the incorrect implications of a bad regression. Don’t think that failing to find the correct specification of a regression is a sign of failure on the part of the researcher; sometimes there is no correct specification.
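
As a quick illustration of the point about sample size above, the following sketch re-runs the severe-multicollinearity simulation from earlier with a small and a large \(n\) and compares the standard error of the parameter on \(x_{1}\):

# Standard error of x1 under severe imperfect multicollinearity,
# estimated with n = 100 and with n = 10,000 observations
se_x1 <- sapply(c(100, 10000), function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- 20*x1 + 20*x2 + rnorm(n)
  y  <- 10 + 1*x1 + 1*x2 + 1*x3 + rnorm(n)
  summary(lm(y ~ x1 + x2 + x3))$coefficients["x1", "Std. Error"]
})
se_x1 #the standard error shrinks considerably as n grows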

Real World Example: Class size and academic performance

Imagine that we are hired by the government of the state of California to conduct research on education. One of the things they want us to find out is whether there is any relation between class size and academic performance. Additionally, if there is a relationship, they want an estimate of its magnitude in order to decide how many teachers to hire per school.

To complete this task we are going to proceed as follows:

  • State the null hypothesis
  • Gather and explore the dataset
  • Estimate a bivariate model to test the initial hypothesis
  • Compute a correlation matrix to verify the assumption of exogeneity
  • Specify and estimate multivariate models to deal with OVB and multicollinearity
  • Discuss the weakness of the estimated model

Hypothesis

The hypothesis that we want to test is:

\[ \begin{eqnarray} H_{0}: \beta_{\text{class size}} = 0 \\ H_{1}: \beta_{\text{class size}} \neq 0 \end{eqnarray} \] where \(\beta_{\text{class size}}\) is the parameter of a regression with academic performance as dependent variable.

Data

We are going to use data from the California Department of Education (www.cde.ca.gov). These data can be loaded into R by loading (and installing if necessary) the AER package and using the command data("CASchools").

#install.packages("AER") #run once if the package is not installed
suppressMessages(library(AER)) #Quietly loading AER Package
data("CASchools") # Loading CASchools dataset

First let’s read the definition for each variable (type ?CASchools in the console for more info on this dataset).

  • district : District code.
  • school : School name.
  • county : factor indicating county.
  • grades : factor indicating grade span of district.
  • students : Total enrollment.
  • teachers : Number of teachers.
  • calworks : Percent qualifying for CalWorks (income assistance).
  • lunch : Percent qualifying for reduced-price lunch.
  • computer : Number of computers.
  • expenditure : Expenditure per student.
  • income : District average income (in USD 1,000).
  • english : Percent of English learners.
  • read : Average reading score.
  • math : Average math score.

Let’s check the data type for each variable

str(CASchools)
## 'data.frame':    420 obs. of  14 variables:
##  $ district   : chr  "75119" "61499" "61549" "61457" ...
##  $ school     : chr  "Sunol Glen Unified" "Manzanita Elementary" "Thermalito Union Elementary" "Golden Feather Union Elementary" ...
##  $ county     : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
##  $ grades     : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
##  $ students   : num  195 240 1550 243 1335 ...
##  $ teachers   : num  10.9 11.1 82.9 14 71.5 ...
##  $ calworks   : num  0.51 15.42 55.03 36.48 33.11 ...
##  $ lunch      : num  2.04 47.92 76.32 77.05 78.43 ...
##  $ computer   : num  67 101 169 85 171 25 28 66 35 0 ...
##  $ expenditure: num  6385 5099 5502 7102 5236 ...
##  $ income     : num  22.69 9.82 8.98 8.98 9.08 ...
##  $ english    : num  0 4.58 30 0 13.86 ...
##  $ read       : num  692 660 636 652 642 ...
##  $ math       : num  690 662 651 644 640 ...

We don’t have a direct observation of class size, but we can use the student-teacher ratio as an indicator of class size:

CASchools$stRatio <- CASchools$students / CASchools$teachers #Adding student-teacher ratio to dataset

As an indicator of academic performance we can use either the average math or the average reading scores. I’ll complete this example using only the math scores; you can try to reproduce it using the reading scores and compare the results (additionally, you can use the average of math and reading, as sketched below).
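
For reference, a combined score like the one mentioned above could be constructed as follows (the column name score is just illustrative):

CASchools$score <- (CASchools$math + CASchools$read) / 2 #average of math and reading scores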

Bivariate Regression

The first model we are going to estimate is a bivariate model with the math average scores as dependent variable and the student-teacher ratio as independent variable:

\[\text{Math Average Score} = \beta_{0} + \beta_{1}\text{Student-Teacher Ratio} + \epsilon\]

If the estimation of \(\beta_{1}\) is statistically different than zero then we reject the null that there is no relation between class size and academic performance.

bv1 <- lm(math ~ stRatio, data = CASchools)
stargazer(bv1, type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser","f")) #Displaying regressions with stargazer
              Dependent variable: math
Constant      691.417***
              (9.382)
stRatio       -1.939***
              (0.476)
Observations  420
R2            0.038
Adjusted R2   0.036
Note: *p<0.1; **p<0.05; ***p<0.01


Interpretation

  • The bivariate model indicates that the relation between the student-teacher ratio and the average math score is statistically different from zero.

  • If the student-teacher ratio increases by one unit, the average math score falls by about 1.94 points.

  • The student-teacher ratio is not a good predictor of the average math score, as the \(R^2\) is very close to zero.

These results are consistent with the idea that smaller class sizes improve academic performance. But we cannot rely only on the results of a bivariate regression model (why?).

Correlation Matrix

Because we know that excluding relevant variables may lead to omitted variable bias, we are going to take a look at the correlation matrix to see if there is any reason to believe that the previously estimated parameter is biased. To compute the correlation matrix we can only use numeric variables.

# Creating data.frame without non-numeric variables

# I'll also exclude the variables students and teachers
# as we are already using the ratio of the two variables
# to compute the student-teacher ratio.
#
# Also, I'll make grades a binary variable instead of
# a factor.

CASchools$grades <- as.numeric(CASchools$grades) - 1
corData <- CASchools[ ,  !(names(CASchools) %in% c("district",
                                                   "school",
                                                   "county",
                                                   "teachers",
                                                   "students"))]

corMat <- cor(corData, cbind(corData$stRatio, corData$math)) #Computing correlation with stRatio and Math only
colnames(corMat) <- c("stRatio", "math") #Adding column names
stargazer(corMat, digits = 2, summary = FALSE, type = "html")
              stRatio    math
grades          0.09    -0.17
calworks        0.02    -0.62
lunch           0.14    -0.82
computer        0.23    -0.03
expenditure    -0.62     0.15
income         -0.23     0.70
english         0.19    -0.57
read           -0.25     0.92
math           -0.20     1.00
stRatio         1.00    -0.20


Because there are some variables that are simultaneously correlated with math and stRatio, the assumption of exogeneity was probably violated in the bivariate model.

  • In particular, the variables income, expenditure, lunch, and english seem to be relevant, so we should consider controlling for this set of variables first.

  • grades is less of a priority because its correlation with stRatio is not as strong as for the other variables.

  • Excluding calworks is probably not causing any bias, as it is barely correlated with stRatio. Adding this variable would probably increase the \(R^2\) of the regression, but it would not help us correctly estimate the parameter for the student-teacher ratio (the variable of interest).

  • A priori, computer should not be part of the model, as it is barely correlated with math.

Multivariate Models

We are going to estimate a set of multivariate models, adding regressors in order of how strongly they are related to \(X\) and \(Y\). Then we’ll check whether there was in fact any bias in the bivariate estimation. After checking for \(OVB\), we’ll check whether the specification suffers from imperfect multicollinearity; if not, we are done. If there is imperfect multicollinearity, we’ll have to exclude some of the variables, starting with the ones with the least influence in correcting the \(OVB\).

# Estimating models
mv2 <- lm(math ~ stRatio + expenditure, data = CASchools)
mv3 <- lm(math ~ stRatio + expenditure + income, data = CASchools)
mv4 <- lm(math ~ stRatio + expenditure + income + lunch, data = CASchools)
mv5 <- lm(math ~ stRatio + expenditure + income + lunch + english, data = CASchools)
mv6 <- lm(math ~ stRatio + expenditure + income + lunch + english + grades, data = CASchools)

# Comparing models using stargazer
stargazer(list(bv1, mv2, mv3, mv4, mv5, mv6), type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser"),
          df = FALSE)
                          Dependent variable: math
              (1)         (2)         (3)         (4)         (5)         (6)
Constant      691.417***  676.184***  670.453***  670.560***  666.132***  671.320***
              (9.382)     (19.411)    (13.995)    (10.609)    (10.583)    (10.676)
stRatio       -1.939***   -1.602***   -1.172***   -0.435      -0.300      -0.307
              (0.476)     (0.606)     (0.438)     (0.334)     (0.333)     (0.331)
expenditure               0.002       -0.004***   0.0004      0.0002      -0.0001
                          (0.002)     (0.001)     (0.001)     (0.001)     (0.001)
income                                1.861***    0.617***    0.712***    0.709***
                                      (0.095)     (0.101)     (0.104)     (0.103)
lunch                                             -0.452***   -0.384***   -0.372***
                                                  (0.026)     (0.033)     (0.033)
english                                                       -0.118***   -0.135***
                                                              (0.037)     (0.037)
grades                                                                    -3.846***
                                                                          (1.418)
Observations  420         420         420         420         420         420
R2            0.038       0.040       0.502       0.715       0.722       0.727
Adjusted R2   0.036       0.035       0.499       0.712       0.718       0.723
F Statistic   16.620***   8.708***    140.016***  260.010***  214.690***  182.879***
Note: *p<0.1; **p<0.05; ***p<0.01


Interpretation

  • Compared to the bivariate regression (1), all multivariate estimations show a reduction in the magnitude of the parameter for stRatio. This is evidence that there was \(OVB\) and the assumption of exogeneity was not satisfied in the bivariate model.

  • As we add more regressors, the parameter for stRatio changes from statistically different from zero to statistically equal to zero.

  • The \(R^2\) greatly improves after adding income and lunch to the regression.

  • Adding grades barely changed the estimated value of \(\beta_{1}\), which means that we can probably exclude this variable from the regression. We’ll continue our analysis assuming that regression (5) correctly controls for \(OVB\).

  • All models reject the null of the F-test that all parameters are simultaneously equal to zero.

  • Starting with model (5), we have to check whether imperfect multicollinearity is an issue.

Multicollinearity test

# Computing VIF for model (5)

# Running auxiliary regressions
aux1_mv5 <- lm(stRatio ~ expenditure + income + lunch + english, data = CASchools)
aux2_mv5 <- lm(expenditure ~ stRatio + income + lunch + english, data = CASchools)
aux3_mv5 <- lm(income ~ stRatio + expenditure + lunch + english, data = CASchools)
aux4_mv5 <- lm(lunch ~ stRatio + expenditure + income + english, data = CASchools)
aux5_mv5 <- lm(english ~ stRatio + expenditure + income + lunch , data = CASchools)

# Getting r2
aux1_r2 <- summary(aux1_mv5)$r.squared
aux2_r2 <- summary(aux2_mv5)$r.squared
aux3_r2 <- summary(aux3_mv5)$r.squared
aux4_r2 <- summary(aux4_mv5)$r.squared
aux5_r2 <- summary(aux5_mv5)$r.squared

# Computing VIF
aux1_vif <- 1 / (1 - aux1_r2)
aux2_vif <- 1 / (1 - aux2_r2)
aux3_vif <- 1 / (1 - aux3_r2)
aux4_vif <- 1 / (1 - aux4_r2)
aux5_vif <- 1 / (1 - aux5_r2)

vifs <- c(aux1_vif, aux2_vif, aux3_vif, aux4_vif,aux5_vif)
vifs
## [1] 1.681200 1.829956 2.389763 3.415967 1.937555
# Testing if VIF are greater than 10
vifs > 10
## [1] FALSE FALSE FALSE FALSE FALSE
# Testing if VIF are greater than 5
vifs > 5
## [1] FALSE FALSE FALSE FALSE FALSE

Because the \(VIF\) is less than 5 for all regressors, we can be confident that imperfect multicollinearity is not an issue in regression (5). And if it is not an issue in regression (5), which includes the largest number of independent variables, then it won’t be an issue in models (1) to (4).
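
As a cross-check, the vif() function from the car package should reproduce the values computed manually above; a minimal sketch, assuming car is available (it is loaded together with AER, which depends on it):

# Cross-checking the manual computation with car::vif()
library(car)
vif(mv5)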

Discussion

  • Using regression (5) as our final specification, we conclude that the parameter for the student-teacher ratio is not statistically different from zero. Therefore, we cannot reject the null hypothesis that class size is unrelated to academic performance.

  • What seemed to be a strong negative relation got weaker as we controlled for other variables (expenditure, income, lunch, and english).

  • Determining why the relation disappears after controlling for the other variables requires us to examine the causal pathways through which the student-teacher ratio is related to academic performance. It may be that the relation is spurious, or that the student-teacher ratio is only a mediator for other variables.

    For instance, it is not unreasonable to think that students from counties with higher average income go to more exclusive schools that are only available to those from wealthier backgrounds. By itself, this will reduce the average number of students in those schools. These schools in turn can afford to hire better teachers, which improves academic performance. Additionally, students from wealthier households can afford to complement their education with tutors and other extracurricular activities, which will also improve their academic performance. This line of argument is consistent with the idea that there is a spurious relationship between class size and academic performance driven by income. There may be a relation between class size and academic performance, but we cannot observe it in the data if the wealth and academic gaps are very wide and correlated (that is, if we only observe schools with good academic performance in counties with high income and schools with poor academic performance in counties with low income). We are not going to go deeper into the causal-pathway analysis here; the point is simply to illustrate the potential interpretations of our results.

  • Regression (5) does not control for other demographic/economic factors that may cause \(OVB\); for example, we have no idea how educated the average individual in each county is. The average education of the parents and how much they value education will also play an important role in which schools they select for their kids. Of course, at this point this is mere speculation, as our dataset doesn’t include that information; but future research should address this issue.