
Statistical Association

Overview

This lesson introduces the concept of association between two variables.

Objectives

After completing this module, students should be able to:

  1. Understand the difference between statistical association, correlation, and causation.
  2. Calculate the covariance and correlation between two variables.
  3. Explain the meaning and limitations of the correlation coefficient.

Readings

Lander, Chapter 15.2. Schumacker, Chapter 15.

Background and Motivation

  • In previous learning modules we’ve learned the basic tools to test a hypothesis given a sample and a set of assumptions about the probability distribution of the population.

  • As an advanced consumer and producer of statistics, you should be aware of the limitations of the techniques presented to you.

  • Moreover, the use of statistics is not limited to hypothesis testing. Statistics is also used to estimate the quantitative relations among variables, determine causation, and make predictions. Yet, at the core of any statistical application, we’ll find the philosophy and language of hypothesis testing.

  • Besides expanding the techniques of hypothesis testing, for the rest of the semester we are going to develop a series of methods to determine the statistical association between variables.

  • In the next slide, you’ll find a set of questions that highlight some of the things that can go wrong when testing an empirical hypothesis.

Common problems

  • Is the population distribution skewed, asymmetric, or, in general, inconsistent with the assumption of normality?

For instance, it is well known that the distribution of grades is not necessarily symmetric, whereas symmetry is one of the core properties of the normal distribution. This implies that the probability calculations (confidence interval thresholds, p-values, etc.) based on the normal or the t-distribution will be inaccurate.
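
As a rough illustration (a simulation sketch, not taken from the readings), we can draw repeated samples from a heavily skewed population, such as an exponential distribution, and check how often the usual t-based 95% confidence interval actually covers the true mean:

set.seed(123)
n <- 10          # small sample size
reps <- 10000    # number of simulated samples
mu <- 1          # true mean of an exponential(rate = 1) population
covered <- replicate(reps, {
  s <- rexp(n)   # a heavily skewed sample
  ci <- mean(s) + c(-1, 1) * qt(0.975, n - 1) * sd(s) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)    # typically comes out noticeably below the nominal 0.95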

  • Are the groups under consideration the relevant groups to correctly test the proposed hypothesis?

For example, suppose we want to test the hypothesis that the probability of having a disease \(X\) is higher for older people. To conduct this test we may collect a random sample of individuals from the population and ask them whether they have the disease \(X\) and their age. We can then specify the null hypothesis as \(\mu_{\text{young}} \geq \mu_{\text{old}}\) against the alternative \(\mu_{\text{young}} < \mu_{\text{old}}\). But what is the relevant cut-off age to divide the two groups? Is 65 old enough? Maybe 45? This may depend on the type of disease. Maybe our own biases as researchers get in the way, and we specify the groups in whatever way yields the results that are most convenient.

  • Is the sample really randomly selected?

Imagine that I am interested in predicting the winner of the next presidential election. To do so, I collect two samples, one in a historically Democratic state and one in a historically Republican state. The results from the two samples will probably contradict each other and will not represent the overall distribution of the population.

  • Is the apparent lack of independence between two variables the result of not controlling for other factors?

More on this in the following learning modules.

Association

  • Informally, two variables, \(x\) and \(y\), are statistically associated when the variation of one of the two variables, on average, coincides with the variation of the other variable.

  • Example: If, on average, when there is unusually good weather (variation in the variable weather) there is also unusually more traffic (variation in the variable traffic), then we can say that good weather and traffic are associated variables.

  • In statistics, the term related is a synonym of statistical association. From the previous example, we can say that good weather and traffic are related.

  • Note that variables can be related or associated, but that does not necessarily imply that one variable is causing the other. That is, the fact that the variation of two variables coincides in a given sample does not mean that one is causing the other.

  • Example: When there is a big fire we expect unusually high damage to the burnt property and, also, we expect a high number of firemen to be sent to deal with the fire. Statistically speaking, the number of firemen and the severity of the fire are related variables (on average, when a fire is really bad, the number of firemen sent is unusually high). But that does not imply that the high number of firemen sent causes an increase in the severity of the fire. (Check this link for more examples).

Covariance

We can formalize the idea of statistical association between \(x\) and \(y\) by using the product of the mean difference of two variables.

  • Mean difference: The mean difference of a variable \(x\) is simply the difference between each observation of the variable and the mean. Formally, \[x_{i} - \bar{x}\] Note that for values of \(x\) above the mean, the mean difference is positive, and for values below the mean it is negative.

  • Product of mean differences: We can multiply the mean differences of two variables \(x\) and \(y\) to obtain \[(x_{i} - \bar{x})(y_{i} - \bar{y})\] Note that if both \(x\) and \(y\) are unusually far from their means, the magnitude of the product will be relatively large compared to when the values are close to their means.

  • Sample covariance: Recall that the concept of statistical association depends on the idea that, on average, there is a coincidence in the variation of the two variables. We can therefore compute the sample average of the products of mean differences as a measure of statistical association: \[\sigma_{x,y} = \text{cov}(x,y) = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})\] \(\text{cov}(x,y)\) or \(\sigma_{x,y}\) is the sample covariance. It is simply the average of the products of the mean differences of the two variables.

R-Example: mean difference and covariance

Let’s simulate some data in R to illustrate the relation between mean difference, covariance, and statistical association.

First, let’s generate two random variables that are not related:

set.seed(50)
n <- 10 # Sample Size
x1 <- runif(n)
y1 <- runif(n)

Second, let’s generate two random variables that are related. To make the \(y\) related with \(x\), we can use a simple equation such as: \[y_{i} = 2 \times x_{i} + \text{random value}_{i}\]

x2 <- runif(n)
y2 <- 2*x2 + runif(n)

Third, let’s compute the mean difference for each variable:

# Making data.frame 
df <- data.frame(x1, y1, x2, y2)

#Let's compute the mean difference
meanDiff <- apply(df, 2, FUN = function(x) x - mean(x))

The following table shows the mean differences for each variable.

    x1     y1     x2     y2
  0.29   0.01   0.13   0.19
  0.02  -0.11  -0.29  -0.61
 -0.22   0.26   0.23   0.46
  0.35  -0.30   0.36   0.90
  0.10  -0.10  -0.18  -0.56
 -0.37   0.30   0.17   0.53
  0.28   0.46   0.09  -0.13
  0.23  -0.01  -0.23  -0.19
 -0.37  -0.30  -0.16  -0.38
 -0.31  -0.21  -0.12  -0.21

Note how the signs of the mean differences for the second pair of variables tend to coincide more often than for the first pair. This is evidence of statistical association: when x2 is above its mean, y2 tends to be above its mean as well, and when x2 is below its mean, y2 tends to be below its mean. We can compute the covariance using the command cov() in R:

cov1 <- cov(x1,y1)
cov2 <- cov(x2,y2)
c(cov1, cov2)
## [1] 0.002610834 0.103184060

The first group has a covariance very close to zero, which indicates a low degree of statistical association (the variables are not related), while the second group has a covariance that is clearly larger, consistent with the association we built into the data.
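
As a sanity check, we can also compute the covariance for the related pair directly from the formula above and verify that it matches the output of cov():

# Covariance computed "by hand" from the definition
covManual <- sum((x2 - mean(x2)) * (y2 - mean(y2))) / (length(x2) - 1)
c(covManual, cov2)   # the two values should be identical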

R-Example: Graphical Representation

We can get an idea of whether two variables are related by inspecting a scatter plot. Let’s make a scatter plot for the previous data.frame,

# Loading packages for graphs
library(ggplot2) 

# The grid and gridExtra packages make it
# really easy for you to display graphs
# side-by-side.
library(grid)
suppressMessages(library(gridExtra))

df$diff1 <- (df$x1-mean(df$x1))*(df$y1-mean(df$y1))
df$diff2 <- (df$x2-mean(df$x2))*(df$y2-mean(df$y2))
g1 <- ggplot(data = df, aes(x = x1, y = y1, color=diff1)) + geom_point(size = 5)
g1 <- g1 + geom_hline(yintercept = mean(df$y1), linetype = "dashed")
g1 <- g1 + geom_vline(xintercept = mean(df$x1), linetype = "dashed")
g1 <- g1 + guides(color = F) + xlim(0, 1) + ylim(0,3)
g1 <- g1 + scale_color_gradient2(midpoint=mean(c(df$diff1, df$diff2)), low="red3", mid="yellow3", high="green3", space ="Lab" )
g1 <- g1 + theme_minimal()

g2 <- ggplot(data = df, aes(x = x2, y = y2, colour=diff2)) + geom_point(size = 5)
g2 <- g2 + geom_hline(yintercept = mean(df$y2), linetype = "dashed")
g2 <- g2 + geom_vline(xintercept = mean(df$x2), linetype = "dashed")
g2 <- g2 + guides(color = F) + xlim(0, 1) + ylim(0,3)
g2 <- g2 + scale_color_gradient2(midpoint=mean(c(df$diff1, df$diff2)), low="red3", mid="yellow3", high="green3", space ="Lab" )
g2 <- g2 + theme_minimal() 

grid.arrange(g1, g2, ncol = 2)

In the previous graphs, the dashed lines represent the mean of each variable. Note how the points seem to be scattered with no clear pattern when there is no statistical association (left panel), whereas they follow an upward trend when the variables are related (right panel).

Interpretation of the sample covariance

  1. When the values of \(x\) and \(y\) are above their respective means, the product \((x_{i} - \bar{x})(y_{i} - \bar{y})\) is positive.

  2. When the values of \(x\) and \(y\) are below their respective means, the product \((x_{i} - \bar{x})(y_{i} - \bar{y})\) is also positive (a negative times a negative is positive).

  3. When \(x\) is above its mean while \(y\) is below its mean, the product \((x_{i} - \bar{x})(y_{i} - \bar{y})\) will be negative, and vice versa.

Using these rules we can conclude that:

  • The variables \(x\) and \(y\) are positively associated (related) if the covariance is positive. This is implied by (1) and (2).

  • The variables \(x\) and \(y\) are negatively associated (related) if the covariance is negative. This is implied by (3).
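
We can check these rules on the simulated data from the earlier example (assuming the data frame df is still in the workspace): for the related pair (x2, y2), most of the products should be positive.

# Sign of the products of mean differences for each pair
prod1 <- (df$x1 - mean(df$x1)) * (df$y1 - mean(df$y1))
prod2 <- (df$x2 - mean(df$x2)) * (df$y2 - mean(df$y2))
table(sign(prod1))   # unrelated pair: a mix of positive and negative products
table(sign(prod2))   # related pair: mostly positive products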

Degrees and types of association

  • Informally, two variables have a strong association when the magnitude of the covariance is relatively high. You may be disappointed by this vague definition; you were probably expecting a threshold value to decide what counts as high or low, or perhaps a statistical test. Unfortunately, the covariance by itself cannot objectively determine the degree of association between two variables.

  • The reason for this is that the covariance is not a unit-free measure. In other words, it is not on a standardized scale, and it depends on the units in which \(x\) and \(y\) are measured, as the sketch below illustrates.
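
Suppose, for example, that we re-express x2 from the earlier simulation in different units by multiplying it by 100 (say, from meters to centimeters). The association between x2 and y2 is exactly the same, but the covariance is 100 times larger:

# Rescaling a variable (a change of units) rescales the covariance
c(cov(x2, y2), cov(100 * x2, y2))   # the second value is 100 times the first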

Correlation Coefficient

  • Because of the previously mentioned issues, when describing the statistical association between two variables we generally use the correlation coefficient instead. In particular, we are going to use Pearson’s correlation coefficient.

  • The correlation coefficient is a unit-free measure of the degree of linear association (more about linearity in a minute) between two variables. This is the formula for the correlation coefficient:

\[\rho_{x,y} = \text{corr}(x,y) = \dfrac{\sigma_{x,y}}{\sigma_{x}\sigma_{y}}\]

Where,

  • \(\sigma_{x,y}\) is the covariance.
  • \(\sigma_{x}\) is the standard deviation of \(x\).
  • \(\sigma_{y}\) is the standard deviation of \(y\).

  • The correlation coefficient is bounded by \([-1,1]\). This makes the interpretation of the correlation coefficient very straightforward:
    • Positive correlation coefficient \(\rightarrow\) positive association.
    • Negative correlation coefficient \(\rightarrow\) negative association.
    • Values close to zero \(\rightarrow\) weak association.
    • Values close to the bounds \(\rightarrow\) strong association.
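
We can verify this formula numerically with the simulated variables from the first example: the covariance divided by the product of the standard deviations should match the output of cor().

# Correlation "by hand" versus R's cor()
c(cov(x2, y2) / (sd(x2) * sd(y2)), cor(x2, y2))   # the two values should be identical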

R-Example: Correlation and Degree of Association (part 1)

This time we are going to generate several pairs of \((x,y)\) variables, each with a different degree of association, and then compute the correlation coefficient for each pair. In R we can compute the correlation coefficient using the command cor(). Each pair is going to be generated from the following general model:

\[y_{i} = \beta\times x_{i} + \varepsilon_{i}\]

\(\beta\) is going to represent, in part, the degree of association between \(x_{i}\) and \(y_{i}\). \(\varepsilon_{i}\) is simply a random noise term; we add it to the model so that \(y_{i}\) is not a perfect multiple of \(x_{i}\). If \(\beta > 0\) the relation will be positive, if \(\beta < 0\) the relation will be negative, and if \(\beta = 0\), then \(y_{i} = \varepsilon_{i}\) and is unrelated to \(x_{i}\).

n <- 1000 #number of observations
betas <- -10:10 # Values of beta
k <- length(betas) #number of different models

x <- rnorm(n*k) # generating data for x
dim(x) <- c(n, k) # Reshaping x in matrix form

y <- matrix(NA, n, k) # Creating empty matrix for y
rho <- vector("numeric", k) # Creating empty vector for correlation
# Generating y data using a for loop 
for(i in 1:k) {
    y[,i] <- betas[i]*x[,i] + rnorm(n)
    rho[i] <- cor(y[,i], x[,i]) # Calculating correlation
}

plot(betas, rho, type="b")

The previous graph illustrates the fact that the correlation coefficient is bounded by \([-1,1]\): the stronger the degree of association (the greater the value of \(\beta\) in absolute value), the closer the correlation coefficient gets to \(-1\) or \(1\).

R-Example: Correlation and Degree of Association (part 2)

Let’s take a look at the scatter plot of the previous data for some values of \(\beta\).

library(latex2exp) # To add latex to graph
corrPlot <- function(beta) { 
    index <- which(betas == beta)
    datum <- data.frame(x = x[, index], y = y[, index])
    g1 <- ggplot(data = datum, aes(x = x, y = y)) + geom_point(alpha = 0.5) 
    g1 <- g1 + ggtitle(TeX(paste(sprintf("$\\beta = %d$", beta),",", sprintf("$\\rho = %g$", round(rho[index],2)))))
    return( g1 )
}

grid.arrange(corrPlot(-10), corrPlot(-7), corrPlot(-3),
             corrPlot(-1), corrPlot(0), corrPlot(1),
             corrPlot(3), corrPlot(7), corrPlot(10), ncol = 3, nrow = 3)

Linearity and Correlation

  • From the last example you may think that it is sufficient to observe a correlation coefficient close to zero to argue that two variables are not statistically associated. Unfortunately, that’s not the case. The correlation coefficient is only able to detect statistical association if the relationship between the variables is linear, i.e., if the slope of the relationship is constant.

  • For instance, consider the following model:

\[y_{i} = \beta x_{i}^2 + \varepsilon_{i}\]

Now the relation between \(x\) and \(y\) is not linear but quadratic. Let’s simulate this model and see whether the correlation coefficient can detect the statistical association between the variables.

set.seed(1)
n <- 1000 # Sample size
x <- rnorm(n)
y <- 2*x^2 + rnorm(n)
cor(x,y)
## [1] -0.02558696

That’s very close to zero. Let’s take a look at the scatter plot of the data set.

datum <- data.frame(x=x, y=y)
g1 <- ggplot(data = datum, aes(x = x, y = y)) + geom_point(alpha = 0.5) 
g1 <- g1 + ggtitle(TeX(paste(sprintf("$\\beta = %d$", 2),",", sprintf("$\\rho = %g$", round(cor(x,y),2)))))
g1
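
As a quick aside (not part of the original example), note that the association is still there even though the correlation between x and y is essentially zero: correlating y with the transformed variable x^2 makes the relation linear again and recovers a strong correlation.

# Correlating y with x squared turns the quadratic relation into a linear one
cor(x^2, y)   # much closer to 1, revealing the association that cor(x, y) missed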

  • Therefore, the correlation coefficient is only a good indicator of the existence and degree of the statistical association between two variables if the relation is linear.

  • In the next modules we’ll learn to deal with non-linear relationships. For now we’ll take the assumption of linearity as given.

Empirical Examples

Let’s play with some real-world data. We are going to explore two real-world topics:

  1. Is there a statistical association between \(CO_{2}\) and average temperature?
  2. Does the amount of sugar increase the caloric content of food?

Example: Climate Change

For this example, we are going to use the climate_change.csv dataset. You can download it from Course Resources > Datasets. There are several variables in this dataset; feel free to explore the ones we do not use here. For reference about the data, take a look at this link.

Let’s load the data first:

dataFile <- "/Users/CCD-MAC/Documents/Dropbox NEU/Dropbox/Teaching/NEU/2019/PPUA5301/PPUA5301 - Summer 2019/Lectures/homeworks/homework_assignments/datasets/climate_change.csv"

climate <- read.csv(dataFile)

# Taking a quick look at the first couple of observations
# just to be sure that everything is fine.
head(climate)
##   Year Month   MEI    CO2     CH4     N2O  CFC.11  CFC.12      TSI
## 1 1983     5 2.556 345.96 1638.59 303.677 191.324 350.113 1366.102
## 2 1983     6 2.167 345.52 1633.71 303.746 192.057 351.848 1366.121
## 3 1983     7 1.741 344.15 1633.22 303.795 192.818 353.725 1366.285
## 4 1983     8 1.130 342.25 1631.35 303.839 193.602 355.633 1366.420
## 5 1983     9 0.428 340.17 1648.40 303.901 194.392 357.465 1366.234
## 6 1983    10 0.002 340.30 1663.79 303.970 195.171 359.174 1366.059
##   Aerosols  Temp
## 1   0.0863 0.109
## 2   0.0794 0.118
## 3   0.0731 0.137
## 4   0.0673 0.176
## 5   0.0619 0.149
## 6   0.0569 0.093

Now let’s compute the covariance and correlation

climateCov <- cov(climate$CO2, climate$Temp)
climateCov
## [1] 1.695341
climateCor <- cor(climate$CO2, climate$Temp)
climateCor
## [1] 0.7485046
  • A positive covariance indicates that the two variables have a positive association. Is it sufficiently different from zero? It is hard to tell, especially considering the scale in which the temperature is measured.

  • The correlation coefficient confirms the positive relation. Also, a value of about 0.75 indicates that the relationship is moderately strong.

Finally, let’s take a peek at the scatter plot to verify that the linearity assumption is valid:

g1 <- ggplot(data = climate, aes(x = CO2, y = Temp)) + geom_point(alpha = 0.5) 
g1 <- g1 + ggtitle(TeX(paste(sprintf("$\\sigma_{x,y} = %g$", round(climateCov,2)),",", sprintf("$\\rho = %g$", round(climateCor,2)))))
g1

The relationship seems to be linear in the 340-370 range of CO2 values. After that, the relationship changes from positive to negative, indicating some form of global non-linearity. Therefore, we have to be conservative with our conclusions and wait until we learn more about non-linear relations to deal with this problem.
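
One informal way to probe this non-linearity is to compute the correlation separately within sub-ranges of CO2. A quick sketch (the exact values will depend on the data, so no output is shown here):

# Correlation within the apparently linear range versus the full sample
lowRange <- climate$CO2 <= 370
cor(climate$CO2[lowRange], climate$Temp[lowRange])
cor(climate$CO2, climate$Temp)   # full-sample correlation, for comparison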

Regardless, keep in mind that with the correlation and covariance we can only determine if there is a relation between two variables, but we cannot establish causality. In other words, correlation does not imply causation.

Example: Sugar

For this example we are going to use the USDA.csv dataset. You can download this from Course Resources > Datasets. Again, feel free to explore the other variables. This dataset contains the USDA nutritional label information for a very large set of food products.

Let’s load the data first:

dataFile <- "/Users/CCD-MAC/Documents/Dropbox NEU/Dropbox/Teaching/NEU/2019/PPUA5301/PPUA5301 - Summer 2019/Lectures/homeworks/homework_assignments/datasets/USDA.csv"

sugar <- read.csv(dataFile)

# Taking a quick look at the first couple of observations
# just to be sure that everything is fine.
head(sugar)
##     ID              Description Calories Protein TotalFat Carbohydrate
## 1 1001         BUTTER,WITH SALT      717    0.85    81.11         0.06
## 2 1002 BUTTER,WHIPPED,WITH SALT      717    0.85    81.11         0.06
## 3 1003     BUTTER OIL,ANHYDROUS      876    0.28    99.48         0.00
## 4 1004              CHEESE,BLUE      353   21.40    28.74         2.34
## 5 1005             CHEESE,BRICK      371   23.24    29.68         2.79
## 6 1006              CHEESE,BRIE      334   20.75    27.68         0.45
##   Sodium SaturatedFat Cholesterol Sugar Calcium Iron Potassium VitaminC
## 1    714       51.368         215  0.06      24 0.02        24        0
## 2    827       50.489         219  0.06      24 0.16        26        0
## 3      2       61.924         256  0.00       4 0.00         5        0
## 4   1395       18.669          75  0.50     528 0.31       256        0
## 5    560       18.764          94  0.51     674 0.43       136        0
## 6    629       17.410         100  0.45     184 0.50       152        0
##   VitaminE VitaminD
## 1     2.32      1.5
## 2     2.32      1.5
## 3     2.80      1.8
## 4     0.25      0.5
## 5     0.26      0.5
## 6     0.24      0.5

Now let’s compute the covariance and correlation

sugarCov <- cov(sugar$Sugar, sugar$Calories)
sugarCov
## [1] NA
sugarCor <- cor(sugar$Sugar, sugar$Calories)
sugarCor
## [1] NA

Oops! We found an NA. This generally happens because some datasets contain missing information (maybe some food products do not report calories or the amount of sugar). To deal with NAs in this instance, we add the option use = "pairwise.complete.obs". This tells R to only use the observations for which both variables are not NA:

sugarCov <- cov(sugar$Sugar, sugar$Calories, use = "pairwise.complete.obs")
sugarCov
## [1] 833.978
sugarCor <- cor(sugar$Sugar, sugar$Calories, use = "pairwise.complete.obs")
sugarCor
## [1] 0.3099887
  • A positive covariance indicates that the two variables have a positive association. It looks very different from zero, but again, we cannot be sure, as the scales for calories and sugar are not the most intuitive.

  • The correlation coefficient confirms the positive relation. Yet, a value of about 0.3 indicates that the relationship is only moderate. This may be because the relation is non-linear. Let’s take a look at the scatter plot:

g1 <- ggplot(data = sugar, aes(x = Sugar, y = Calories)) + geom_point(alpha = 0.5) 
g1 <- g1 + ggtitle(TeX(paste(sprintf("$\\sigma_{x,y} = %g$", round(sugarCov,2)),",", sprintf("$\\rho = %g$", round(sugarCor,2)))))
g1

  • Wow! This seems very odd…and interesting. The relationship looks somewhat linear, but if you pay attention, the data forms some sort of triangle: for lower values of sugar, the data is more dispersed than for higher values of sugar. This suggests that the relation may be determined by the type of food: for instance, fruits tend to have a moderate amount of sugar but generally have a lower caloric content than processed sweets and/or animal products. This highlights the fact that, in some instances, simply looking at the correlation between two variables is not enough to clearly establish the relation between them.

Summary

  • If the relation between two variables \(x\) and \(y\) is linear, we can use the covariance and the correlation coefficient to determine whether the variables are statistically associated.

  • The covariance can be used to determine the sign of the association, but not the degree or strength of the relationship.

  • The correlation coefficient is a unit-free measure and is bounded by \([-1,1]\). The stronger the statistical association, the closer the correlation coefficient will be to the bounds.

  • In R we can compute the covariance using the command cov() and the correlation with cor().

  • Always use the option use = "pairwise.complete.obs" to avoid the problem of dealing with NA observations.

  • Unfortunately, if the statistical association between the variables is non-linear, we cannot use the correlation coefficient to determine the existence or the degree of the association. We can still inspect the scatter plot to get a sense of whether the variables are related in a non-linear manner.