Correlation and Regression
Understanding the relationship between two or more variables can be a valuable tool in life. For example, you might want to know if there is a relationship between reading this blog and increasing your salary (there probably isn’t), or if there is a relationship between smoking and lung cancer (there is). Correlation and regression analysis are the tools we use to examine these issues, and they are the subject of this post.
While we will attempt to break down the concepts into simpler principles, we are also assuming a basic understanding of inferential statistics.
A scatter plot shows the relationship between two variables. For example, we might want to plot the relationship between the amount of smoking a person does and their life expectancy. To save time, let's imagine this scatter plot from Wikipedia represents the data.
This plot shows that the points cluster together in a straight line. If the X axis is smoking, and the Y axis is life expectancy, this clustering would imply the two variables have a strong negative linear relationship. As smoking increases, life expectancy decreases.
This graphical representation serves us well, but what if we wanted to represent this information numerically?
Covariance and Correlation
Say we want to know how two different variables move in relation to each other but we also want a way of measuring this movement. Covariance is a tool for doing just that. Covariance takes the average product of how two variables move in relation to their means.
X = smoking packs per day = [1, 2, 3]
Y = Life expectancy = [90, 80, 70]
Covariance = ∑(Xi – Xavg)(Yi – Yavg) / (n – 1)
Covariance = [(1 – 2)(90 – 80) + (2 – 2)(80 – 80) + (3 – 2)(70 – 80)] / 2 = -10
Conceptually, what we are doing is seeing how both variables vary from their means and then multiplying those variations together. For example, if smoking tends to move up while life expectancy is moving down, the product of those movements would be negative. In this case, when X is one below its mean, Y is 10 above its mean. When X is one above its mean, Y is 10 below its mean. If they moved in the same direction, the covariance would be positive. But since they move in opposite directions, the covariance is negative.
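This calculation is simple enough to verify by hand, or with a short Python sketch using the same three data points:

```python
# Sample data from above.
X = [1, 2, 3]     # packs smoked per day
Y = [90, 80, 70]  # life expectancy in years

n = len(X)
x_avg = sum(X) / n  # 2.0
y_avg = sum(Y) / n  # 80.0

# Sum the products of each pair's deviations from the means,
# then divide by n - 1 (sample covariance).
cov = sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y)) / (n - 1)
print(cov)  # -10.0
```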
Covariance is a useful measure, but our final answer of -10 is in the complicated units of packs*years. This doesn’t lend itself to easy analysis. Standardizing the covariance into a unit-less measure is where correlation comes in.
Correlation = Cov(X,Y) / (Sx × Sy)
Sx = 1
Sy = 10
Correlation = -10 / 10 = -1
Correlation is a way to standardize our covariance by the product of the two standard deviations. If two variables have a very strong linear relationship, the absolute value of their correlation will be close to 1. A weak linear relationship would be close to 0. The correlation of -1 in our example implies a perfect negative correlation. As one variable moves, the other moves in the exact opposite direction.
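The same standardization is easy to sketch in Python; `statistics.stdev` uses the same n – 1 denominator as our sample covariance:

```python
import statistics

X = [1, 2, 3]
Y = [90, 80, 70]
n = len(X)
x_avg, y_avg = sum(X) / n, sum(Y) / n

cov = sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y)) / (n - 1)
sx = statistics.stdev(X)  # sample standard deviation of X: 1.0
sy = statistics.stdev(Y)  # sample standard deviation of Y: 10.0

r = cov / (sx * sy)
print(r)  # -1.0, a perfect negative correlation
```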
Testing our correlation
What if we want to know how significant our resulting correlation is? What factors go into this? Well, the largest determinants of the significance of our correlation are the sample size and the size of the correlation itself.
By using a T-test with n-2 degrees of freedom, we can test the hypothesis that our correlation is actually equal to zero.
T-critical for a 5% level of significance (two-tailed) and 2 degrees of freedom = 4.303
T-statistic = r*sqrt(n-2) / sqrt(1-r^2)
If our t-stat > t-crit we can reject the hypothesis that the true population correlation equals zero.
Note that the test statistic grows as both n and the absolute value of r increase. In smaller samples it is easy to get a strong correlation by chance, so a strong correlation alone should not be viewed as a significant result.
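A sketch of this test in Python. Our own correlation of -1 would put a zero in the denominator, so the numbers here are hypothetical: r = -0.8 from a sample of 30, with the critical value of roughly 2.048 (28 degrees of freedom, 5% two-tailed) taken from a standard t-table:

```python
import math

r = -0.8  # hypothetical sample correlation
n = 30    # hypothetical sample size

t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# Approximate two-tailed 5% critical value for n - 2 = 28 degrees of freedom.
t_crit = 2.048

print(abs(t_stat) > t_crit)  # True -> reject the hypothesis that rho = 0
```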
Limitations of correlation analysis
Outliers have an out-sized impact on the correlation between two variables. It is important that we review our data graphically to determine which outliers are useful information and which should be thrown out.
Correlation does not capture non-linear relationships. A correlation of 0 means there is no linear relationship, but a strong non-linear relationship might still exist between the variables. A correlation of 0 therefore does not allow us to say there is no relationship at all.
Correlation does not imply causation. What's more, correlations can be spurious and point to misleading associations. For example, the sales of ice cream might be highly correlated with the rate of pool drownings. To conclude that ice cream sales cause drownings would be a spurious conclusion. In reality, a third confounding variable is likely to be causing both. In this case, it would be the summer heat.
Regression
Like correlation, regression analyzes the relationship between multiple variables. Regression takes this a step further by seeking to predict one of the variables using the others in a mathematical equation. The variable we are predicting is the dependent variable, and the variables we are using to make the prediction are the independent variables.
Basic regression formula:
Y = b0 + b1X + error
- Y is the dependent variable that we are trying to explain
- X is the independent variable that we use to predict Y
- b0 is the intercept, or the value of Y given that X is 0
- b1 is the slope of the regression line, or the number of units Y changes when X increases by one unit
- error is the error term, or the difference between our predicted value of Y and the actual value of Y
Regression is basically the attempt to find the best fit line between variables, and use the line’s equation to make predictions. However, in order to do this, some assumptions need to hold.
Assumptions of normal linear regression:
- A linear relation exists between the dependent variable and the independent variable.
- This model is not useful in cases where the relationship of the variables is anything but linear. Scatter plots can help see if this assumption holds.
- The independent variable is not random.
- If the variable that we are using to make predictions is random, our predictions will hold no value.
- The expected value of the error term is 0.
- The average of all our error terms should be 0.
- The variance of the error term is the same for all observations (homoskedasticity).
- If the spread of the residuals fluctuates across the scatter plot, this assumption is violated (heteroskedasticity), and the model's standard errors become unreliable.
- The error term is uncorrelated across observations (no autocorrelation).
- Autocorrelation occurs when the residuals are dependent on each other, for example a stock price might be dependent on its price yesterday.
- The error term is normally distributed.
- The error term should be random and not skewed.
Regressions are optimized by minimizing the sum of the squared regression residuals. This means that we square and add up all of the differences between what we predicted and what we actually observed, and then find the equation that minimizes that sum. After optimizing and creating a regression model, we need to figure out whether or not the model does a good job of describing the relationship between the variables.
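For a single independent variable, this minimization has a well-known closed-form answer: the slope equals the covariance of X and Y divided by the variance of X, and the intercept forces the line through the point (Xavg, Yavg). A sketch with our smoking data:

```python
X = [1, 2, 3]     # packs per day
Y = [90, 80, 70]  # life expectancy
n = len(X)
x_avg, y_avg = sum(X) / n, sum(Y) / n

# Least-squares slope: covariance over variance of X (the n - 1 terms cancel).
b1 = (sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
      / sum((x - x_avg) ** 2 for x in X))
# Intercept: the fitted line must pass through (x_avg, y_avg).
b0 = y_avg - b1 * x_avg

print(b0, b1)  # 100.0 -10.0 -> the line Y = 100 - 10X
```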
The standard error of the estimate, or SEE, is the first concept we will explore for judging the performance of our regression model. The standard error of the estimate measures how well the regression model fits the data. SEE is the square root of the average squared residual, where the average is taken over n – 2 to account for the two estimated parameters.
SEE = sqrt(∑(Yactual – Ypredicted)^2 / (n – 2))
If all of our residuals are small, our SEE will be small, and we can say the model fits the data well.
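A sketch in Python. Our three smoking data points sit exactly on a line, which would give an SEE of 0, so this example uses a slightly noisy hypothetical sample instead:

```python
import math

# Hypothetical noisy data: a negative trend that is not a perfect line.
X = [1, 2, 3, 4]
Y = [92, 78, 72, 58]
n = len(X)
x_avg, y_avg = sum(X) / n, sum(Y) / n

# Fit the least-squares line.
b1 = (sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
      / sum((x - x_avg) ** 2 for x in X))
b0 = y_avg - b1 * x_avg

# SEE: square root of the residual sum of squares over n - 2.
residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
see = math.sqrt(residual_ss / (n - 2))
print(round(see, 3))  # ≈ 2.53: a typical prediction misses by about 2.5 years
```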
The coefficient of determination, or r squared, measures the fraction of total variation in the dependent variable that is explained by the independent variable. Total variance is simply the variance of Y. Unexplained variance is what we tried to minimize earlier: the variance of the difference between the observed Y and the predicted Y. Explained variance is the variance of the difference between the predicted Y and the average Y. This is all a bit confusing, so let's organize it a little better.
Coefficient of determination:
r^2 = 1 – (unexplained variance / total variance)
r^2 = explained variance / total variance
total variance = total sum of squares = ∑(Yactual – Yavg)^2
unexplained variance = residual sum of squares = ∑(Yactual – Ypred)^2
explained variance = regression sum of squares = ∑(Ypred – Yavg)^2
Conceptually, imagine Yavg as our base case prediction. For every actual observation of Y, the distance between that observation and Yavg is our total variation.
Now, our model is going to predict a Y that is somewhere between the two. It is attempting to make a prediction that is better than our base case.
So, of that total distance between Yavg and Yactual, our model only explains part of it. The total distance is our total variation. The distance that is explained by our model is the explained variance. Dividing these two gets us our r-squared.
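The decomposition is easy to check numerically. A sketch with a hypothetical noisy sample (perfectly linear data would give an r-squared of exactly 1):

```python
X = [1, 2, 3]
Y = [90, 80, 50]
n = len(X)
x_avg, y_avg = sum(X) / n, sum(Y) / n

# Least-squares fit.
b1 = (sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
      / sum((x - x_avg) ** 2 for x in X))
b0 = y_avg - b1 * x_avg
Y_pred = [b0 + b1 * x for x in X]

total_ss = sum((y - y_avg) ** 2 for y in Y)                   # total variation
residual_ss = sum((y - yp) ** 2 for y, yp in zip(Y, Y_pred))  # unexplained
regression_ss = sum((yp - y_avg) ** 2 for yp in Y_pred)       # explained

# Both definitions of r-squared agree, because the sums of squares add up.
r_squared = regression_ss / total_ss
print(round(r_squared, 3))  # ≈ 0.923
```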
We can also form confidence intervals for our regression coefficient. To calculate this, all we need is the standard error of the estimated coefficient and the critical t-distribution value.
Confidence interval = estimated coefficient ± Tcrit × (standard error of the coefficient)
This interval allows us to create a range that we can be confident the true regression coefficient lands in. If this range happens to include 0, we can also say that the coefficient isn’t significant.
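A sketch in Python with a hypothetical three-point sample. The slope's standard error (the SEE divided by the square root of the summed squared X deviations) is a standard result we use here without derivation; 12.706 is the 5% two-tailed critical t value for n – 2 = 1 degree of freedom:

```python
import math

# Hypothetical sample: far too small for the slope to be significant.
X = [1, 2, 3]
Y = [90, 80, 50]
n = len(X)
x_avg, y_avg = sum(X) / n, sum(Y) / n

b1 = (sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
      / sum((x - x_avg) ** 2 for x in X))
b0 = y_avg - b1 * x_avg

residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
see = math.sqrt(residual_ss / (n - 2))

# Standard error of the slope coefficient.
se_b1 = see / math.sqrt(sum((x - x_avg) ** 2 for x in X))

t_crit = 12.706  # 5% two-tailed, 1 degree of freedom
low, high = b1 - t_crit * se_b1, b1 + t_crit * se_b1

print(low < 0 < high)  # True -> the interval contains 0, so b1 isn't significant
```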
Dividing the total variability of a variable into components that are attributable to different sources is known as ANOVA (analysis of variance). ANOVA is useful when we have multiple independent variables. In these situations, we use an F-statistic to test whether all the slope coefficients in a linear regression are equal to 0. Rather than just testing single coefficients for their significance, the F-statistic tests the whole model for significance.
ANOVA requires the following:
- Total number of observations (n)
- Total number of parameters to be estimated (intercept + coefficients)
- Residual Sum of squares (SSE), the unexplained variance
- Regression sum of squares (RSS), the explained variance
Formula for the F-statistic:
F = (explained variance / k) / (unexplained variance / (n – k – 1)), where k is the number of slope parameters (so k + 1 parameters in total, including the intercept)
R-squared divides explained variance by total variance. The F-statistic divides average explained variance by average unexplained variance. The lower the F-statistic, the less the independent variables explain the variation in the dependent variable.
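With a single independent variable there is one slope parameter, so k = 1 and the denominator has n – 2 degrees of freedom. A sketch with a hypothetical three-point sample:

```python
X = [1, 2, 3]
Y = [90, 80, 50]
n = len(X)
k = 1  # one slope parameter
x_avg, y_avg = sum(X) / n, sum(Y) / n

b1 = (sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
      / sum((x - x_avg) ** 2 for x in X))
b0 = y_avg - b1 * x_avg
Y_pred = [b0 + b1 * x for x in X]

rss = sum((yp - y_avg) ** 2 for yp in Y_pred)         # explained (regression SS)
sse = sum((y - yp) ** 2 for y, yp in zip(Y, Y_pred))  # unexplained (residual SS)

f_stat = (rss / k) / (sse / (n - k - 1))
print(round(f_stat, 1))  # ≈ 12.0
```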
Now, if we are confident that our model is doing a good job, we can begin making predictions. All we need to do is plug in values of X into our model in order to output new predictions for Y.
How confident can we be in our predicted Y?
Prediction confidence interval: Ypred ± Tcrit × (standard error of the prediction)
The variance of the prediction error depends on the squared standard error of the estimate, the number of observations, the value of the independent variable, the mean of the independent variable, and the variance of the independent variable. It is a deeper concept than we have time to get into.
The following is an example of a regression output:
X = smoking packs per day = [1, 2, 3]
Y = Life expectancy = [90, 80, 50]
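The output itself isn't reproduced here, so as a stand-in, here is a sketch that computes the headline numbers a regression output would typically report (intercept, slope, r-squared, and SEE), using the closed-form least-squares estimates:

```python
import math

X = [1, 2, 3]     # smoking packs per day
Y = [90, 80, 50]  # life expectancy
n = len(X)
x_avg, y_avg = sum(X) / n, sum(Y) / n

b1 = (sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
      / sum((x - x_avg) ** 2 for x in X))
b0 = y_avg - b1 * x_avg
Y_pred = [b0 + b1 * x for x in X]

residual_ss = sum((y - yp) ** 2 for y, yp in zip(Y, Y_pred))
total_ss = sum((y - y_avg) ** 2 for y in Y)
r2 = 1 - residual_ss / total_ss
see = math.sqrt(residual_ss / (n - 2))

print(f"intercept (b0): {b0:.2f}")   # ≈ 113.33
print(f"slope (b1):     {b1:.2f}")   # ≈ -20.00
print(f"r-squared:      {r2:.3f}")   # ≈ 0.923
print(f"SEE:            {see:.3f}")  # ≈ 8.165
```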
Limitations of regression analysis
Regression relationships can change over time, just as correlation relationships can. This is referred to as parameter instability. For example, a regression run with data from a bull market will have different results than a regression run with data from a bear market.
Regression awareness can influence the usefulness of the regression. When the results of a regression are public knowledge, the future usefulness of the regression might be negated. For example, if someone discovers that every Monday stock XYZ goes up, and that information becomes public, other agents in the market will begin to bid up XYZ on Sunday. The relationship will break down.
The main limitation of regression analysis is the set of assumptions that need to hold. We listed them earlier, and if any of them fail, the model falls apart. There are tests to determine whether an assumption holds, but the verdict is not always clear cut.
As it relates to investing
Knowing how variables are related is the cornerstone of diversification. We could have a portfolio of 1000 different stocks, but if they were all perfectly correlated with each other, our diversification would be essentially the same as holding a single stock. This concept is important, and understanding the math behind it will hopefully help us make better decisions going forward.
Knowing how two things are related will help us make judgements about the world. Correlation and regression are the mental models that serve us in this endeavor.