
Causality

Overview

This lesson introduces some ideas in causal analysis.

Objectives

After completing this module, students should be able to:

  1. Distinguish various causal pathways, including spurious causes and chained causation.
  2. Use regression analysis and R to disentangle causal interactions between independent variables.

Readings

NA

Causal pathways

  • Last week we learned how to estimate multivariate regressions. The main difference between a bivariate regression and a multivariate regression is the inclusion of other explanatory variables. We can represent each model with the following diagrams:

Bivariate regression:

\[x \rightarrow y\]

Multiple regression:

\[x_{1},x_{2},x_{3},...,x_{k} \rightarrow y \]

  • In this lesson we are going to discuss the issue of causality in regression models.

  • Causation or causality is defined as the capacity of a variable \(x\) to produce a change in another variable \(y\). Note that this is essentially different from statistical association or correlation. When talking about causation, we are referring to the ability of the \(x\) variable to make \(y\) change. The fact that we observe both \(x\) and \(y\) changing at the same time is not evidence of causation.

  • We cannot prove causation with statistics alone. In matters of causality, statistics can at best be used to reject the hypothesis of causation. Therefore, the goal of a regression model in terms of causality is to test the hypothesis that \(x\) causes \(y\). If we cannot reject that hypothesis, we are providing evidence that the estimated parameter for \(x\) is a close approximation of the causal effect of \(x\) on \(y\).

  • Sometimes when we regress \(y\) on \(x\), we get a strongly significant relationship (very low p-value) even if \(x\) does not cause \(y\). There are several reasons why this may be the case; the five main ones are:

    1. Spurious relation

    2. Chain relation

    3. Direct and indirect effects

    4. Moderating effects

    5. Simultaneity (double causation)

All five of these reasons imply a violation of the exogeneity assumption - the assumption that the independent variables are independent of the error term -.

Omitted variable bias (OVB)

Recall from the last learning module that we can estimate the parameters of a multiple regression using the following formula: \[\beta_{\text{ols}} = (X'X)^{-1}X'y\] and:

\[\beta_{\text{ols}} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\epsilon\]

This implies that the estimated parameters (\(\beta_{\text{ols}}\)) are equal to the real parameters (\(\beta\)) if and only if \(X'\epsilon = 0\). This is the assumption of exogeneity. This assumption is violated whenever \(X'\epsilon \neq 0\), that is, whenever the errors (\(\epsilon\)) are related in some way with \(X\). When the exogeneity assumption is violated the regression suffers from omitted variable bias or OVB.
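
A minimal sketch with simulated data (illustrative variable names only) shows that computing \((X'X)^{-1}X'y\) by hand reproduces the coefficients reported by lm():

set.seed(123)                         # For reproducibility of the sketch
n  <- 100                             # Number of observations
x1 <- rnorm(n)                        # First regressor
x2 <- rnorm(n)                        # Second regressor
y  <- 1 + 2*x1 - 3*x2 + rnorm(n)      # Outcome with known parameters

X <- cbind(1, x1, x2)                 # Design matrix with a column of ones
solve(t(X) %*% X) %*% t(X) %*% y      # (X'X)^{-1} X'y computed by hand
coef(lm(y ~ x1 + x2))                 # Should match the line above (up to rounding)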

  • A biased estimate is an estimated parameter that will systematically differ from the actual value of the parameter. Regressions that don’t account for relevant additional regressors will generally suffer from OVB. Let’s have a closer look at this:

Consider the regression model:

\[y = X\beta + \epsilon\]

Note that the model is additive: it assumes that the variation of \(y\) comes from two distinct sources: (1) explained variation \((X\beta)\), and (2) unexplained variation \((\epsilon)\). Therefore, whenever we don’t include a variable related to \(y\) in the \(X\) matrix, that variation is left unexplained and becomes part of \(\epsilon\). By itself that’s not a problem; it is only when the excluded variable is also related to \(X\) that we cannot guarantee that \(X'\epsilon = 0\). In other words, if \(z\) is a variable that is not included in the regression and \(z\) is related to \(y\), then \(z\) will be part of the unexplained variation of \(y\), \(\epsilon\). If \(z\) is part of \(\epsilon\) and \(z\) is related to \(x\), then \(X'\epsilon \neq 0\) and the regression will suffer from OVB. The five causal pathways all rely on some mechanism in which \(z\) is simultaneously related to both \(y\) and \(x\).

  • There is no formal test for OVB as we don’t have a way of observing \(\epsilon\). The best thing we can do is to use our knowledge of the causal pathways and try to observe if the changes in the estimated parameters are consistent with our hypothesis. In the next sections we’ll describe the different ways in which exogeneity can be violated.

Spurious relation

  • For now, consider the variables \(x\), \(y\), and \(z\). We are interested in testing the hypothesis that \(x\) does not cause \(y\) using a regression model. Additionally, imagine that we can observe the actual causal relations between the variables. The actual relationships can be represented with the following diagram:
\[\begin{eqnarray*} z & \rightarrow & x \\ z & \rightarrow & y \\ \end{eqnarray*}\]

As you can see, \(z\) is causing both \(x\) and \(y\). But \(x\) is not causing \(y\). Yet, if we estimate a regression of \(y\) on \(x\), we are very likely to reject the null hypothesis that the parameter for \(x\) is equal to zero. This is a spurious relation. \(z\) is creating an artificial relation between \(x\) and \(y\), but if we keep \(z\) constant and change \(x\), then \(y\) should not change - i.e., if we include \(z\) in the regression, the parameter for \(x\) should be zero -.

Example:

Let \(x\) be the percentage of people who own a smartphone in a particular country, \(y\) the infant mortality rate, and \(z\) the average income of the country. In this example it is obvious that \(z\) will be positively related with \(x\) (richer countries will have a higher rate of smartphone ownership) and negatively related with \(y\) (richer countries generally have better healthcare systems, which reduces infant mortality). Because \(z\) is simultaneously associated with \(x\) and \(y\), it creates an artificial (spurious) relation between \(x\) and \(y\) (countries with a lot of smartphones have lower infant mortality rates). A bivariate regression will reflect this association, and a naive researcher may conclude that using smartphones reduces infant mortality. But that statistical association (correlation) should not be interpreted as causation.

Spurious relationships are one of the biggest pitfalls of empirical analysis because, without some a priori knowledge of the relation of \(z\) to \(x\) and \(y\), it is very hard, and often impossible, to discredit the merits of a regression of \(y\) on \(x\) alone. Bad statisticians, or good statisticians with bad intentions, will often read any regression output as evidence of causation - this is very common in research topics where many variables interact in very complex manners, e.g. nutrition, or in research areas that are relatively new, e.g. new technology adoption -. As a consumer and practitioner of statistics you should always be critical of the possibility of a spurious relationship in a regression analysis, and never trust the result of a bivariate regression alone.

Simulating a spurious relation

Now, let’s run a series of simulations so you can see why determining causality using a regression is not a straightforward process.

  • In a spurious relationship we have that

\[z \rightarrow x \;\; \text{and} \;\; z \rightarrow y, \;\; \text{but} \;\; x \not \rightarrow y\]

To simulate this process we are going to:

  1. Generate a random variable \(z\)
  2. Generate \(x\) and \(y\), by multiplying \(z\) by some factor and adding some random noise. The key for generating a spurious relationship is that \(y\) is generated without using \(x\), but \(x\) and \(y\) both depend on \(z\).
  3. Then we’ll run some regressions and see if we can correctly estimate the parameters of the model.
n <- 1000 #Number of observations
z <- rnorm(n) # Generating Z
x <- 3*z + rnorm(n) # Generating X
y <- 10 + 5*z + rnorm(n) # Generating Y

According to this simulation, the effect of \(x\) on \(y\) should be statistically equal to \(0\) because we didn’t use \(x\) to create \(y\). In other words, there is no causal effect of \(x\) on \(y\). Let’s see what we get when we use a bivariate regression to establish the causal relationship between \(x\) and \(y\).

Bivariate Regression

suppressMessages(library(stargazer)) # For nice output regression tables
bv1 <- lm(y ~ x) # Running bivariate regression
stargazer(bv1, type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser")) #Displaying regressions with stargazer
Dependent variable: y
Constant          10.047*** (0.061)
x                  1.520*** (0.019)
Observations       1,000
R2                 0.859
Adjusted R2        0.859
F Statistic        6,094.929*** (df = 1; 998)
Note: *p<0.1; **p<0.05; ***p<0.01
  • According to this regression, the effect of \(x\) on \(y\) is statistically different from zero (p-value \(\leq\) 0.05). Also, if we look at the \(R^2\), it seems like \(x\) is a really good predictor of \(y\); by itself, \(x\) seems to be able to predict about \(86\%\) of the variation in \(y\)!

  • But we know better: we generated the data in a way that \(x\) has no causal effect on \(y\). The apparent effect is just an illusion. Let’s see what happens if we add \(z\).

Multiple Regression

Now, regressing \(y\) as a function of both \(x\) and \(z\):


mv1 <- lm(y ~ x + z) # Running multivariate regression
stargazer(mv1, type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser", "f")) #Displaying regressions with stargazer
Dependent variable: y
Constant           9.959*** (0.032)
x                 -0.050    (0.033)
z                  5.168*** (0.102)
Observations       1,000
R2                 0.961
Adjusted R2        0.961
Note: *p<0.1; **p<0.05; ***p<0.01


  • Surprise, surprise! \(x\) is not statistically significant when we add \(z\) - when we control for the variation of \(z\) -.

To make the case that the relation between \(x\) and \(y\) is spurious even stronger, we can run regressions of \(y\) on \(z\) and of \(x\) on \(z\) alone:

mv2 <- lm(y ~ z) # Running bivariate regression
mv3 <- lm(x ~ z) # Running bivariate regression
stargazer(list(mv2,mv3), type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser","f")) #Displaying regressions with stargazer
Dependent variable:
                  (1) y                (2) x
Constant           9.962*** (0.032)    -0.052*   (0.031)
z                  5.020*** (0.032)     2.964*** (0.031)
Observations       1,000                1,000
R2                 0.961                0.900
Adjusted R2        0.961                0.900
Note: *p<0.1; **p<0.05; ***p<0.01


  • These results are consistent with the idea of a spurious relationship between \(x\) and \(y\).
  • That’s the best we can do in terms of disproving the idea that \(x\) is causing \(y\).

Thinking about exogeneity

Now that we know how a spurious relationship is generated, it is a good idea to think again about the assumption of exogeneity and the important role it plays in explaining the previous results. In the bivariate regression, is \(x\) an exogenous variable?

  • Yes, because \(x\) is unrelated to \(y\): Incorrect. \(x\) does not cause \(y\), but \(x\) is statistically related to \(y\).
  • Yes, because \(x\) is unrelated to \(\epsilon\): Incorrect.
  • No, because \(x\) is related to \(\epsilon\): Correct. \(x\) is related to \(\epsilon\) (\(X'\epsilon \neq 0\)) in the bivariate model because \(z\) is related to both \(x\) and \(y\) and is not included in the model.
  • No, because \(x\) is related to \(y\): Incorrect. The fact that \(x\) is related to \(y\) does not violate the exogeneity assumption (\(X'\epsilon = 0\)).

Chain relationship (full mediation)

  • Let’s consider the variables \(x\), \(y\), and \(z\) again. Same as before, we are interested in testing the hypothesis that \(x\) does not cause \(y\) using a regression model. The actual relationships can be represented with the following diagram:
\[\begin{eqnarray*} x & \rightarrow & z \\ z & \rightarrow & y \\ \end{eqnarray*}\]

\(x\) is causing \(y\), only via \(z\). If you take away \(z\) from the equation, the effect of \(x\) on \(y\) disappears because the effect is caused exclusively via \(z\). So, when you control for \(z\), the variation of \(x\) that is not associated with \(z\) will not be able to explain \(y\). This is known as full mediation or a chain relationship.

Example:

Let \(x\) be the price of oil, \(z\) the price of gasoline, and \(y\) the price of shipping a product between two given locations. If we run a regression of \(y\) on \(x\) we’ll probably get a significant positive effect (the higher the price of oil, the more expensive it is to ship a product), but if we include \(z\), the effect of \(x\) will vanish because all variation in \(y\) caused by \(x\) can be explained by the variation in \(z\). That is, the price of oil and the cost of shipping a product are only related via the price of gasoline: any variation in the price of oil that affects the price of shipping a product operates through variation in the price of gasoline.

Simulating a chain relationship

  • In a chain relationship we have that,

\[x \rightarrow z \;\; \text{and} \;\; z \rightarrow y\]

For that we are going to:

  1. Generate a random variable \(x\)
  2. Generate \(z\) as a linear function of \(x\).
  3. Generate \(y\) as a linear function of \(z\).
  4. Then we’ll run some regressions and see if we can correctly estimate the parameters of the model.
n <- 1000 #Number of observations
x <- rnorm(n) # Generating X
z <- 3*x + rnorm(n) # Generating Z
y <- 10 + 5*z + rnorm(n) # Generating Y

Regressions

bv1 <- lm(y ~ x) # Running bivariate regression of y on x
mv1 <- lm(y ~ x + z) # Running multivariate regression of y on x and z
stargazer(list(bv1, mv1), type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser","f")) #Displaying regressions with stargazer
Dependent variable: y
                  (1)                  (2)
Constant          10.008*** (0.155)     9.935*** (0.032)
x                 15.045*** (0.153)     0.012    (0.106)
z                                       4.986*** (0.033)
Observations       1,000                1,000
R2                 0.906                0.996
Adjusted R2        0.906                0.996
Note: *p<0.1; **p<0.05; ***p<0.01


  • Note how the bivariate regression estimates a strong positive statistically significant relation between \(x\) and \(y\) and how after adding \(z\), that relation is now statistically insignificant!
  • This result is very similar to what we got in the previous part.

Testing for full mediation

To check if there is in fact a chain relationship we can do the following:

  1. Run a regression of \(y\) on \(z\) alone.
  2. Run a regression of \(z\) on \(x\) alone.
  3. Run a regression of \(y\) on both \(x\) and \(z\).
bv1 <- lm(y ~ z) # Running bivariate regression of y on z
bv2 <- lm(z ~ x) # Running bivariate regression of z on x
mv1 <- lm(y ~ x + z) # Running multivariate regression of y on x and z
stargazer(list(bv1, bv2, mv1), type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser","f")) #Displaying regressions with stargazer
Dependent variable:
                  (1) y                (2) z                (3) y
Constant           9.935*** (0.032)     0.014    (0.030)     9.935*** (0.032)
z                  4.990*** (0.010)                          4.986*** (0.033)
x                                       3.015*** (0.030)     0.012    (0.106)
Observations       1,000                1,000                1,000
R2                 0.996                0.910                0.996
Adjusted R2        0.996                0.909                0.996
Note: *p<0.1; **p<0.05; ***p<0.01


  • These results are consistent with the idea that \(z\) is mediating the relationship between \(x\) and \(y\).

Mediation or Spurious Relation? (1/3)

You may wonder: what if we incorrectly assume that the relation between \(x\) and \(y\) is spurious and proceed as in the previous example?

bv1 <- lm(y ~ z) # Running bivariate regression of y on z
bv2 <- lm(x ~ z) # Running bivariate regression of x on z
mv1 <- lm(y ~ x + z) # Running multivariate regression
stargazer(list(bv1, bv2, mv1), type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser","f")) #Displaying regressions with stargazer
Dependent variable:
                  (1) y                (2) x                 (3) y
Constant           9.935*** (0.032)    -0.008    (0.010)      9.935*** (0.032)
x                                                             0.012    (0.106)
z                  4.990*** (0.010)     0.302*** (0.003)      4.986*** (0.033)
Observations       1,000                1,000                 1,000
R2                 0.996                0.910                 0.996
Adjusted R2        0.996                0.909                 0.996
Note: *p<0.1; **p<0.05; ***p<0.01


  • In principle, these results are also consistent with a spurious relationship (\(z\) causes \(y\) and \(x\), but \(x\) does not cause \(y\)). How can we tell whether \(z\) is mediating between \(x\) and \(y\) or creating a spurious relationship between \(x\) and \(y\)?

Mediation or Spurious Relation? (2/3)

  • Case 1: Spurious Relation: Let’s write the assumptions of a spurious relationship using the language of regression models (\(z\) is the independent variable in the regressions of both \(x\) and \(y\), and the regression of \(y\) does not include \(x\)). Then,

\[ \begin{eqnarray} x &=& z\beta_{1} + \epsilon_{1} \;\; \text{, and,} \\ y &=& z\beta_{2} + \epsilon_{2} \end{eqnarray} \] We are using the subscripts on the parameters \((\beta)\) and the error terms (\(\epsilon\)) just to differentiate the two models. Combining both equations yields, \[ \begin{eqnarray} x & = & z\beta_{1} + \epsilon_{1} \\ z & = & (x-\epsilon_{1})/\beta_{1} \\ \\ y & = & z\beta_{2} + \epsilon_{2} \\ y & = & \dfrac{x-\epsilon_{1}}{\beta_{1}}\beta_{2} + \epsilon_{2} \\ y & = & x\dfrac{\beta_{2}}{\beta_{1}} - \dfrac{\beta_{2}\epsilon_{1}}{\beta_{1}} + \epsilon_{2} \\ y & = & x\beta_{3} + \epsilon_{3} \\ \end{eqnarray} \] where \(\beta_{3} = \dfrac{\beta_{2}}{\beta_{1}}\) and \(\epsilon_{3} = - \dfrac{\beta_{2}\epsilon_{1}}{\beta_{1}} + \epsilon_{2}\)

  • Case 2: Chained Causation: Let’s write the assumptions of a chained relationship using the language of regression models (\(x\) is the independent variable in a regression of \(z\), and \(z\) is the independent variable in a regression of \(y\); the regression of \(y\) does not include \(x\)). Then, \[ \begin{eqnarray} z & = & x\beta_{4} + \epsilon_{4} \;\; \text{, and,} \\ y & = & z\beta_{2} + \epsilon_{2} \end{eqnarray} \]

The math here is a bit more direct, \[ \begin{eqnarray} z & = & x\beta_{4} + \epsilon_{4} \\ y & = & z\beta_{2} + \epsilon_{2} \\ y & = & (x\beta_{4} + \epsilon_{4})\beta_{2} + \epsilon_{2} \\ y & = & x\beta_{4}\beta_{2} + \epsilon_{4}\beta_{2} + \epsilon_{2} \\ y & = & x\beta_{3} + \epsilon_{3} \\ \end{eqnarray} \] where \(\beta_{3} = \beta_{4}\beta_{2}\) and \(\epsilon_{3} = \epsilon_{4}\beta_{2} + \epsilon_{2}\)

Therefore, the parameter of a regression of \(y\) on \(x\) can help us distinguish between the two cases.

Mediation or Spurious Relation? (3/3)

bv1 <- lm(x ~ z) # Running bivariate regression of x on z
bv2 <- lm(y ~ z) # Running bivariate regression of y on z
bv3 <- lm(y ~ x) # Running bivariate regression of y on x
bv4 <- lm(z ~ x) # Running bivariate regression of z on x
stargazer(list(bv1, bv2, bv3, bv4), type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser","f")) #Displaying regressions with stargazer
Dependent variable:
                  (1) x                (2) y                 (3) y                  (4) z
Constant          -0.008    (0.010)     9.935*** (0.032)     10.008*** (0.155)       0.014    (0.030)
z                  0.302*** (0.003)     4.990*** (0.010)
x                                                            15.045*** (0.153)       3.015*** (0.030)
Observations       1,000                1,000                 1,000                  1,000
R2                 0.910                0.996                 0.906                  0.910
Adjusted R2        0.909                0.996                 0.906                  0.909
Note: *p<0.1; **p<0.05; ***p<0.01


  • If the relation is spurious then, \(\beta_{3} = \dfrac{\beta_{2}}{\beta_{1}}\).
  • If there is full mediation then, \(\beta_{3} = \beta_{4}\beta_{2}\).

The estimated parameters are:


Parameter Approx. Values
\(\beta_{1}\) 0.3
\(\beta_{2}\) 5
\(\beta_{3}\) 15
\(\beta_{4}\) 3


  • The fact that \(\beta_{3} \neq \dfrac{\beta_{2}}{\beta_{1}} \approx 16.666\) and \(\beta_{3} = \beta_{4}\beta_{2} \approx 15\) is evidence in favor of full mediation and against a spurious relation. The sketch below makes the comparison explicit using the fitted models.
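
Assuming the models bv1 through bv4 estimated above are still in the workspace, we can extract the slopes and compare the two candidate values for \(\beta_{3}\) directly:

b1 <- coef(bv1)["z"]  # x ~ z, beta_1 (approx. 0.3)
b2 <- coef(bv2)["z"]  # y ~ z, beta_2 (approx. 5)
b3 <- coef(bv3)["x"]  # y ~ x, beta_3 (approx. 15)
b4 <- coef(bv4)["x"]  # z ~ x, beta_4 (approx. 3)

c(spurious  = unname(b2/b1),   # approx. 16.7, what a spurious relation would imply
  mediation = unname(b4*b2),   # approx. 15, what full mediation would imply
  estimated = unname(b3))      # approx. 15, consistent with full mediation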

Direct and indirect effects (partial mediation)

  • Let’s consider the variables \(x\), \(y\), and \(z\) again. Same as before, we are interested in testing the hypothesis that \(x\) does not cause \(y\) using a regression model. The actual relationships can be represented with the following diagram:
\[\begin{eqnarray*} x & \rightarrow & y \\ & \text{and,} & \\ x & \rightarrow & z \\ z & \rightarrow & y \end{eqnarray*}\]

\(x\) is actually causing \(y\) through two different mechanisms: directly, \(x \rightarrow y\), and indirectly, \(x \rightarrow z \rightarrow y\). This is an issue because the indirect effect may offset the direct effect and lead us to think that there is no causal relationship when in fact there is.

Example:

Let \(x\) be the unemployment rate in a given geographical area (a zip code, county, etc.), \(z\) the average income, and \(y\) the percentage of the population with health insurance. It is easy to see that there will be a direct relationship between \(x\) and \(y\) (in the U.S., a large part of the population gets health insurance via their employer, so not having a job increases the chance of not having health insurance), but \(x\) is also related to \(y\) indirectly via \(z\) (being unemployed reduces your income, and without income individuals cannot buy health insurance).

This type of relationship is very common in the social sciences, where it is often impossible to design experiments that eliminate the indirect effect of \(x\) on \(y\) via \(z\).

Simulating direct and indirect Effects

  • In the presence of partial mediation, we have the following model:
\[\begin{eqnarray*} x & \rightarrow & y \\ & \text{and,} & \\ x & \rightarrow & z \\ z & \rightarrow & y \end{eqnarray*}\]

To simulate this scenario we are going to:

  1. Generate a random variable \(x\)
  2. Generate \(z\) as a linear function of \(x\).
  3. Generate \(y\) as a linear function of \(z\) and \(x\).
  4. Then we’ll run some regressions and see if we can correctly estimate the parameters of the model.
n <- 1000 # number of observations
x <- rnorm(n) # generating X
z <- 5 + x + rnorm(n) # generating Z
y <- 10 - 4*x + 4*z + rnorm(n) # generating Y

Regressions

bv1 <- lm(y ~ x) # Running bivariate regression of y on x
mv1 <- lm(y ~ x + z) # Running multivariate regression of y on x and z
stargazer(list(bv1, mv1), type = "html",
          header = FALSE,
          intercept.bottom = FALSE,
          omit.stat=c("LL","ser","f")) #Displaying regressions with stargazer
Dependent variable: y
                  (1)                  (2)
Constant          29.873*** (0.128)     9.753*** (0.166)
x                  0.043    (0.128)    -4.054*** (0.046)
z                                       4.041*** (0.033)
Observations       1,000                1,000
R2                 0.0001               0.938
Adjusted R2       -0.001                0.938
Note: *p<0.1; **p<0.05; ***p<0.01


  • Again, when using the bivariate model we end up incorrectly estimating the causal effect of \(x\) on \(y\). In fact, the relation is not even statistically significant in the bivariate model and the \(R^2\) is almost zero!

  • The most observant will notice that this has to do with the fact that the parameters for \(x\) and \(z\) used to generate \(y\) have the same magnitude but opposite signs (\(x\) is negatively related to \(y\), while \(z\) is positively related). In this case we say that \(z\) is suppressing the relation between \(x\) and \(y\), because it cancels out the negative effect of \(x\) on \(y\).

Thinking about exogeneity (again)

Let’s take a look again at the formula for the estimated parameters of a multivariate regression:

\[\beta_{\text{ols}} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\epsilon\]

Note that the difference between the estimated parameters and the actual values of the parameters (the bias) is explained by:

\[\beta_{\text{ols}} - \beta \equiv \text{Bias} = (X'X)^{-1}X'\epsilon\]

  • Then, we can develop the following rules regarding the sign of the bias in a regression:

    1. If \(X'\epsilon > 0\) then the bias will be positive, i.e. \(\beta_{\text{ols}} > \beta\), and, the estimated parameter will be overestimating the relation between \(x\) and \(y\).

    2. If \(X'\epsilon < 0\) then the bias will be negative, i.e. \(\beta_{\text{ols}} < \beta\), and, the estimated parameter will be underestimating the relation between \(x\) and \(y\).

  • How can we determine the sign of \(X'\epsilon\) if we don’t observe \(\epsilon\)?

    We know that if \(z\) is not included in the regression, then it becomes part of \(\epsilon\). So, if we know how \(z\) is related to \(x\) and how \(z\) is related to \(y\), we have a good chance of determining the relation between \(\epsilon\) and \(x\).

  • In our previous example the relations between \(x\) and \(z\) and between \(z\) and \(y\) are both positive. Therefore, we expect the relation between \(\epsilon\) and \(x\) to be positive if we run a regression of \(y\) on \(x\) only, which causes an overestimation of the effect of \(x\) on \(y\). In the previous bivariate model the estimated parameter is zero, but the actual effect is negative - i.e. the bivariate model is overestimating the effect of \(x\) on \(y\) -. The sketch below makes this decomposition explicit.
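
This uses the standard omitted-variable-bias decomposition: in-sample, the bivariate slope on \(x\) equals the direct effect of \(x\) plus the effect of \(z\) on \(y\) times the slope of \(z\) on \(x\). A minimal check, reusing the \(x\), \(z\), and \(y\) generated in the partial-mediation simulation above:

mv     <- lm(y ~ x + z)          # Long regression: direct effects
beta_x <- coef(mv)["x"]          # approx. -4, the direct effect of x
beta_z <- coef(mv)["z"]          # approx.  4, the effect of z on y
delta  <- coef(lm(z ~ x))["x"]   # approx.  1, the relation between z and x

beta_z * delta                   # The bias term: positive, approx. +4
beta_x + beta_z * delta          # approx. 0, matches the bivariate slope below
coef(lm(y ~ x))["x"]             # Bivariate (short regression) slope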

Moderating effects

In this case, the slope of \(x\) on \(y\) is moderated by the variable \(z\). That is, how much change in \(y\) a change in \(x\) will produce depends on the value of \(z\).

Example:

Let \(x\) be the hours of work of an individual (or some measure of labor input), \(z\) the health status of the individual (0 means healthy, 1 means sick), and \(y\) some measure of production (output). Note that the productivity of an individual is moderated by health status: a healthy individual will have a different relation between hours of work and output than a sick individual; they will get more work done in less time.

We’ll explore moderating effects in more depth when we discuss non-linear terms in the regression model.
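
Although we leave the details for a later lesson, a minimal simulation sketch (illustrative values, not real data) shows what a moderating effect looks like when estimated with an interaction term in lm():

n <- 1000                          # Number of observations
x <- rnorm(n)                      # e.g., hours of work
z <- rbinom(n, 1, 0.5)             # e.g., health status (0 = healthy, 1 = sick)
y <- 10 + 5*x - 3*x*z + rnorm(n)   # Slope of x is 5 when z = 0, but 2 when z = 1

summary(lm(y ~ x*z))               # The x:z coefficient estimates how z changes the slope of x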

Simultaneity or double causation

This happens when \(x\) causes \(y\) and \(y\) causes \(x\) via some variable \(z\). This is also known as endogeneity. The model looks something like this:

\[\begin{eqnarray*} x & \rightarrow & y \\ y & \rightarrow & z \\ z & \rightarrow & x \\ \end{eqnarray*}\]

Example:

Let \(x\) be the population of sheep in an ecosystem and \(y\) the amount of grass. Note that the more sheep, the more grass they will eat - that’s the direct effect of \(x\) on \(y\), what we want to estimate with a regression -. Now, the less grass available, the higher the mortality rate of sheep \(z\), which causes the population of sheep to go down \((y \rightarrow z \rightarrow x)\). This means that we have two effects: (1) \(x\) is causing \(y\) (sheep are negatively related with grass, because sheep eat the grass), but simultaneously, (2) \(y\) is causing \(x\) (without enough grass, the population of sheep will eventually go down). Imagine that we have data on sheep populations and the density of grass in some predetermined geographical areas. When we regress \(y\) on \(x\), we may end up estimating that the two variables are not related because the two effects cancel each other out - that does not mean that there are no causal effects between the two variables, just that they cannot be estimated with a simple regression of \(y\) on \(x\) -.

The problem of double causation, simultaneity, or endogeneity is very common in research in biology and economics, and in general wherever there are self-balancing systems. This type of problem cannot be solved by simply controlling for other variables; it is addressed with a technique known as instrumental variables estimation or two-stage least squares. If we have time, we may see some examples of endogeneity in the following learning modules.
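
To see why a single regression struggles here, consider a minimal sketch (illustrative parameter values, not the sheep data) in which \(x\) and \(y\) are generated from a pair of simultaneous equations and solved jointly. OLS does not recover the direct effect because the feedback from \(y\) to \(x\) pulls the estimate toward zero.

n  <- 1000
e1 <- rnorm(n)                 # Shock to y
e2 <- rnorm(n)                 # Shock to x
b  <- -2                       # Direct effect of x on y (what we would like to estimate)
g  <- 0.5                      # Feedback from y back to x

x <- (g*e1 + e2) / (1 - b*g)   # Reduced-form solution for x
y <- (e1 + b*e2) / (1 - b*g)   # Reduced-form solution for y

coef(lm(y ~ x))["x"]           # Roughly -1.2 instead of -2: biased toward zero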

Warning

Be careful with jumping to conclusions:

  • Regression doesn’t see the direction of the arrows and it requires expert judgment (and often some other tests) to determine which causal pathways are likely.

  • Also, in all the examples we discussed there was only one \(z\) under consideration. What if there are many \(z\)’s, each affecting the relation between \(x\) and \(y\) in a different manner?

  • The process of running a regression and interpreting its output is very mechanical and simple - it is easy to imagine an algorithm that could run and interpret the output of a multivariate regression -. The low cost of this technique has made it very popular, but that does not mean that it is always appropriate to use or that the actual relation between variables is that simple.

  • Because in most applications it is impossible to have a dataset with all potential \(z\) variables that can influence both \(x\) and \(y\), the results of a regression generally cannot be used to prove causation; a regression simply adds a piece of evidence for or against a hypothesis.

Take away

  • Leaving out variables can cause omitted variable bias (\(OVB\)) – either seeing causation when it’s not there, thinking it’s not there when it is, or over- or under-estimating the effect of a variable –.

  • Identifying the causal pathway is not trivial. There is no straightforward method that allows us to prove causation. The best we can do is provide evidence for or against certain hypotheses by identifying patterns in the relationships among the different variables.

  • The solution to \(OVB\) is generally to add control variables. This may lead you to think that you should include as many variables as possible in a regression model, just to make sure that you reduce the possibility of omitting a relevant variable. As we’ll see in the next lesson, adding variables is not free; there are problems that arise when adding variables to a regression, even if the variables are relevant – sometimes the cure is worse than the disease –.

  • OVB is only a problem if you are interested in correctly estimating the causal effect between two variables using a regression model. If your goal is to exclusively predict \(y\), then the fact that \(x\) is related to \(y\) in a complex manner is not really an issue. As long as \(x\) has sufficient predictive power you can use a regression to make predictions on \(y\).
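
As a quick illustration of this last point, here is a sketch that regenerates the spurious-relation data from earlier in this lesson: the bivariate model predicts \(y\) quite well in-sample even though its coefficient on \(x\) has no causal interpretation.

n <- 1000
z <- rnorm(n)
x <- 3*z + rnorm(n)
y <- 10 + 5*z + rnorm(n)       # x plays no role in generating y

bv <- lm(y ~ x)                # The coefficient on x is not causal...
cor(y, predict(bv))^2          # ...yet the in-sample fit is high (approx. 0.86)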