In the last post on numeracy, we talked about correlation, regression, and how two variables can be related. Regression, specifically, was a way to use one variable to predict another. That is useful, but in real-world situations we often need to weigh a multitude of different variables when making predictions. Luckily for us, there is a regression model specifically designed to handle multiple variables. This model is called multiple linear regression (MLR), and it will be the subject of this post.
At its core, MLR is a model that tries to find a mathematical relationship between three or more variables.
Multiple linear regression structure
In MLR, the variable we seek to explain is called the dependent variable, and the variables that explain it are called the independent variables. The dependent variable is often denoted as Y and the independent variables are often denoted as X1, X2, …, Xn.
The general form of the MLR model is as follows:
Y = b0 + b1X1 + b2X2 + … + bnXn + ε
Y is the dependent variable
X1-Xn are the independent variables
b0 is the intercept
b1-bn are the slope coefficients
ε is the error term
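As a concrete sketch of the general form above, the coefficients can be estimated with ordinary least squares. Everything below is hypothetical: the data is randomly generated with known coefficients so we can check that the fit recovers them.

```python
import numpy as np

# Hypothetical data: two independent variables with known true coefficients
# b0 = 1, b1 = 2, b2 = -3, plus a random error term.
rng = np.random.default_rng(0)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 3.0 * X2 + rng.normal(scale=0.5, size=n)

# Design matrix: a leading column of ones plays the role of the intercept b0.
X = np.column_stack([np.ones(n), X1, X2])

# Ordinary least squares fit for b = (b0, b1, b2).
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b)  # estimates should land close to [1, 2, -3]
```

With only noise separating Y from the linear combination of X1 and X2, the estimated coefficients come back near their true values.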
Interpreting regression coefficients
Unlike with simple linear regression, the coefficients of MLR can't be thought of simply as how much the dependent variable will change given a one-unit increase in the independent variable. The reason is that when we add more independent variables to the model, the correlations between those variables affect each individual coefficient.
Conceptually, if two independent variables are correlated, they will move together, and they will overlap in explaining the variance of the dependent variable. If both retained their full simple-regression coefficients, they would redundantly explain parts of the same thing. For this reason, MLR uses something called partial regression coefficients, which measure the expected change in the dependent variable for a one-unit increase in an independent variable while holding all other independent variables constant.
How do we derive this intuitively? Well, if we want to have coefficients that only explain the non-overlapping variance in the dependent variable, we need to break out what is uniquely explained by each independent variable. Do we already have a tool for finding how much of a variable’s variation can’t be explained by other variables? We do, and it is regression.
For example, imagine we have two independent variables, X1 and X2. We could regress X1 on X2, and the result would show how much of X1 is explained by X2. More importantly, it would also show how much of X1 is not explained by X2: the residuals of that new regression. We can then take those residuals (the parts of X1 that are unique to X1) and use them in place of X1 in a regression of Y. Doing this, we arrive at our partial regression coefficient and have derived the key concept behind MLR.
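The residual trick just described can be checked numerically. In this hypothetical sketch, X1 and X2 are deliberately correlated; regressing Y on only the part of X1 that X2 cannot explain reproduces the X1 coefficient from the full multiple regression.

```python
import numpy as np

# Hypothetical data with correlated independent variables.
rng = np.random.default_rng(1)
n = 500
X2 = rng.normal(size=n)
X1 = 0.8 * X2 + rng.normal(scale=0.6, size=n)   # X1 moves with X2
Y = 1.0 + 2.0 * X1 - 3.0 * X2 + rng.normal(scale=0.5, size=n)

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Step 1: regress X1 on X2 and keep the residuals -- the part of X1
# that X2 cannot explain.
g = ols(np.column_stack([ones, X2]), X1)
resid_X1 = X1 - (g[0] + g[1] * X2)

# Step 2: regress Y on those residuals alone.
slope_partial = ols(np.column_stack([ones, resid_X1]), Y)[1]

# The slope matches the coefficient on X1 from the full multiple regression.
b_full = ols(np.column_stack([ones, X1, X2]), Y)
print(slope_partial, b_full[1])  # both close to 2
```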
Testing individual coefficients
We use a T-test for analyzing the significance of every individual independent variable. The equation for the T-Statistic is as follows:
T-statistic = (Observed coefficient – Hypothesized coefficient) / Standard error
Degrees of freedom = N – K – 1
N = number of observations
K = number of slope coefficients
If the absolute value of our calculated T-statistic is greater than the T-critical value at a stated significance level, we can reject the null hypothesis that the true coefficient is equal to the hypothesized value.
Regression software will often express this significance with a P-value. The P-value is the smallest significance level at which we could reject the null hypothesis. The lower the P-value, the more significant the result.
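As a sketch with made-up data, we can compute these T-statistics by hand. X2 is constructed to have no real effect on Y, so its coefficient should test as insignificant while X1's does not.

```python
import numpy as np

# Hypothetical data: X1 genuinely drives Y, X2 has a true coefficient of zero.
rng = np.random.default_rng(2)
n = 60
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 + rng.normal(size=n)  # X2 does not appear

X = np.column_stack([np.ones(n), X1, X2])
k = 2                                    # number of slope coefficients
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

resid = Y - X @ b
dof = n - k - 1                          # degrees of freedom = N - K - 1
s2 = resid @ resid / dof                 # estimated error variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

# Null hypothesis for each slope: the true coefficient is zero.
t_stats = b / se
print(t_stats)  # |t| for X1 should dwarf the ~2.0 critical value; X2's should not
```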
Testing our predictions
Let's take it for granted for now that our MLR model is significant and properly specified. When making predictions with this MLR model, it is important that we don't use input values that are outside the range of data we used to build the model. For example, if we have height and weight data for people who are between 5ft and 6ft tall and then try to make a prediction for a person who is 7ft tall, our model will probably fail. We can only extrapolate so far outside of our original data range.
Making predictions with MLR models also comes with two primary sources of uncertainty. The first source is the uncertainty of the model as a whole. This is measured by the standard error of the regression. The second source is the uncertainty of the individual coefficients. This is measured by the standard error of the coefficients.
In order to properly create a confidence interval for our prediction, we need to incorporate both of these types of uncertainty. This takes a level of matrix algebra that we won’t get into today.
Qualitative variables using dummy variables
If we want to use qualitative independent variables in our MLR model, we need to incorporate dummy variables. Dummy variables are variables that take on the value of 0 or 1. For example, "is it raining" could take on the value of 1 for yes and 0 for no. If we need to distinguish between n category states, the regression would need to include n-1 dummy variables. For example, if we wanted to use the quarter of the year as an independent variable, we would need to include three dummy variables in our model. The first dummy variable would take on the value of 1 if it is the first quarter and 0 if it is not. This would be repeated for the second and third quarters. The reason we don't use a dummy variable for the fourth quarter is that the intercept captures its average value; including all four dummies would create an exact linear relationship with the intercept.
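Here is a minimal, hypothetical sketch of the quarterly dummies. With an intercept plus three dummy columns, the fitted coefficients are exactly the Q4 average and each other quarter's difference from it.

```python
import numpy as np

# Hypothetical quarterly observations (two years of made-up data).
quarters = np.array([1, 2, 3, 4, 1, 2, 3, 4])
Y = np.array([10.0, 20.0, 30.0, 40.0, 12.0, 18.0, 32.0, 38.0])

# Three dummy variables; Q4 is the omitted base category.
d_q1 = (quarters == 1).astype(float)
d_q2 = (quarters == 2).astype(float)
d_q3 = (quarters == 3).astype(float)

# Q4 rows are all zeros in the dummy columns, so the intercept
# alone represents the fourth quarter.
X = np.column_stack([np.ones(len(Y)), d_q1, d_q2, d_q3])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

# b[0] is the Q4 mean (39); b[1:] are each quarter's difference from Q4.
print(b)  # → [39, -28, -20, -8]
```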
If we want to predict a qualitative dependent variable in our MLR model, we can't actually use MLR. Instead, we need to use other models such as probit, logit, and discriminant analysis. We won't get into these, but conceptually imagine if we wanted to predict something like whether or not a company would go bankrupt. MLR wouldn't be able to accomplish this.
Analyzing a model’s predictive power
There are four main factors that we want to look at when analyzing how well our model fits the data: the overall significance of the model, the individual significance of the coefficients, the magnitude of our error terms, and the amount of variance that the model explains.
First, we should estimate the overall significance of the model. Since the individual tests of the regression coefficients do not take into account the interactions among the variables, it is possible to have significant individual variables and an insignificant model as a whole. We use an F-test for the overall significance of the model; it tests whether all the slope coefficients are jointly equal to zero. The equation for the F-statistic is:
F-statistic = (explained variance / k) / (unexplained variance / (n – k – 1))
If our independent variables do not explain the dependent variable at all, the F-statistic will be zero.
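A hypothetical sketch of the F-statistic computed from its definition above; the data has a genuine linear signal, so the statistic comes out far above typical critical values.

```python
import numpy as np

# Hypothetical data with a real linear signal in both variables.
rng = np.random.default_rng(3)
n, k = 80, 2
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 1.0 * X2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ b

explained = np.sum((fitted - Y.mean()) ** 2)    # regression sum of squares
unexplained = np.sum((Y - fitted) ** 2)         # residual sum of squares

F = (explained / k) / (unexplained / (n - k - 1))
print(F)  # far above typical F critical values, so the model is significant
```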
Second, we should estimate the significance of the individual regression coefficients. We already showed how that is done through the use of a T-test.
Third, we should estimate the standard error of the model. This is a measure of how spread out our error terms are. The higher the standard error, the worse the fit of our model.
Lastly, we should estimate the amount of variance in the dependent variable that our model actually explains. For simple regression, this was accomplished with r-squared. However, when we add more variables to the model, even insignificant ones, r-squared can only stay the same or increase. We could keep adding useless variables and our r-squared would continue to creep upward. For this reason, we need an r-squared that adjusts for degrees of freedom. This is called adjusted r-squared, and the equation for it is as follows:
Adjusted r-squared = 1 – ((n – 1) / (n – k – 1)) × (1 – r-squared)
Conceptually, this still compares explained variance to total variance, but it penalizes each added variable, so adjusted r-squared only rises when a new variable explains more than we would expect by chance. A high adjusted r-squared means that our model does a good job of explaining the variance in the dependent variable.
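A quick hypothetical experiment with the formula: adding a pure-noise variable can only raise plain r-squared, while adjusted r-squared applies the degrees-of-freedom penalty.

```python
import numpy as np

# Hypothetical data: Y depends on X1 only; noise_var is pure noise.
rng = np.random.default_rng(4)
n = 50
X1 = rng.normal(size=n)
noise_var = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 + rng.normal(size=n)

def r2_and_adj(X, y):
    """Plain and degrees-of-freedom-adjusted r-squared for design matrix X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    k = X.shape[1] - 1                   # number of slope coefficients
    adj = 1 - (len(y) - 1) / (len(y) - k - 1) * (1 - r2)
    return r2, adj

r2_small, adj_small = r2_and_adj(np.column_stack([np.ones(n), X1]), Y)
r2_big, adj_big = r2_and_adj(np.column_stack([np.ones(n), X1, noise_var]), Y)

print(r2_big >= r2_small)        # plain r-squared never decreases
print(adj_small, adj_big)        # the adjustment penalizes the useless variable
```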
Assumptions of multiple linear regression
- A linear relation exists between the dependent variable and the independent variables.
- The independent variables are not random and no exact linear relation exists between two or more of them.
- The expected value of the error term is 0.
- The variance of the error term is the same for all observations.
- The error term is uncorrelated across observations.
- The error term is normally distributed.
Some of these are self-explanatory, but we will dive deeper into a few.
Assumption #4, that the variance of the error term is the same for all observations, is known as homoskedasticity; its violation is known as heteroskedasticity. Plotted, heteroskedasticity shows up as the variability of the dependent variable being unequal across the range of values of the independent variable.
For example, imagine that we have data on the ages and annual incomes of people that shows people who are 10 make between 0 dollars and 1 dollar, and people who are 50 make between 0 dollars and 10 million dollars. If we regress income on age, our error terms will increase as age increases. We would make great predictions for people who are 10 and horrible predictions for people who are 50.
The consequences of heteroskedasticity are that the standard errors will be underestimated and the test statistics will be overestimated. This will lead us to find significance where there is none.
To test for heteroskedasticity, we use the Breusch-Pagan test. Conceptually, since we want to know if the residuals are related to the independent variables, running a regression between the two makes sense. The Breusch-Pagan test regresses the squared residuals from the original regression on its independent variables. Once this is done, the equation for the Breusch-Pagan statistic is as follows:
Breusch-Pagan statistic = n × r-squared
Remember, this uses the r-squared of the regression between the squared residuals and the independent variables. The statistic follows a chi-squared distribution with k degrees of freedom, so the higher it is, the more likely it is that conditional heteroskedasticity is present.
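A hypothetical sketch of the Breusch-Pagan statistic, using data built to mimic the income-and-age example: the error spread grows with the independent variable, so n × r-squared from the auxiliary regression comes out well above the chi-squared critical value of about 3.84 for one degree of freedom.

```python
import numpy as np

# Hypothetical heteroskedastic data: the error spread grows with age,
# echoing the income example above.
rng = np.random.default_rng(5)
n = 200
age = rng.uniform(10, 50, size=n)
income = 1.0 + 0.5 * age + rng.normal(scale=age / 10, size=n)

X = np.column_stack([np.ones(n), age])
b, *_ = np.linalg.lstsq(X, income, rcond=None)
resid = income - X @ b

# Auxiliary regression: squared residuals on the same independent variables.
sq = resid ** 2
g, *_ = np.linalg.lstsq(X, sq, rcond=None)
aux_resid = sq - X @ g
r2_aux = 1 - aux_resid @ aux_resid / np.sum((sq - sq.mean()) ** 2)

bp_stat = n * r2_aux  # compare to the chi-squared critical value (k = 1 df)
print(bp_stat)
```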
To correct for heteroskedasticity, we need to adjust our standard errors, for example with robust (White-corrected) standard errors. We won't get into that today.
Assumption #5, that the error term is uncorrelated across observations, when violated is known as serial correlation. Serial correlation is often a characteristic of time series analysis where errors are correlated through time.
The consequences of serial correlation are the same as heteroskedasticity. The standard error will be underestimated and the test statistics will be overestimated, leading us to find significance where there is none.
To test for serial correlation, we use the Durbin-Watson test. The Durbin-Watson statistic is provided by most regression software, and a good approximation of it is 2(1 – correlation between successive residuals). If the errors of our regression are not serially correlated, the Durbin-Watson statistic will be close to 2. If they are positively correlated, the statistic will be close to 0, and if they are negatively correlated, it will be close to 4.
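A sketch of the Durbin-Watson statistic on two sets of hypothetical residuals: one built with positive serial correlation (an AR(1) process) and one independent.

```python
import numpy as np

# Hypothetical residuals: one AR(1) series with positive serial
# correlation, one independent series.
rng = np.random.default_rng(6)
n = 300
shocks = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

dw_correlated = durbin_watson(ar1)
dw_independent = durbin_watson(rng.normal(size=n))
print(dw_correlated, dw_independent)  # well below 2 vs. close to 2
```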
To correct for serial correlation, we need to use more robust standard errors. A popular method for this is Hansen's method, which we won't talk about today.
Assumption #2, that no exact linear relationship exists between two or more independent variables, when violated is referred to as multicollinearity. You can think of this as redundancy.
For example, using both height and weight as independent variables would be redundant since they are both highly correlated with each other.
The consequences of multicollinearity are that our standard errors will be overestimated and our test statistics will be underestimated. Conceptually, if everything is highly correlated, it becomes increasingly difficult to distinguish the individual impacts of the independent variables on the dependent variable.
Testing for multicollinearity is more an art than a science. Unlike with heteroskedasticity and serial correlation, multicollinearity is often a matter of degree rather than presence or absence. That being said, a high F-statistic combined with low individual T-statistics is a common sign of multicollinearity.
To correct for multicollinearity, we need to remove one or more of the correlated variables from our model.
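One common diagnostic, not covered above, is the variance inflation factor (VIF): regress each independent variable on the others and compute 1 / (1 − r-squared). The sketch below uses hypothetical data echoing the height/weight example; values well above roughly 5–10 are usual red flags.

```python
import numpy as np

# Hypothetical data: two highly correlated independent variables
# (height and weight) and one unrelated variable.
rng = np.random.default_rng(7)
n = 200
height = rng.normal(size=n)
weight = 0.9 * height + rng.normal(scale=0.3, size=n)
unrelated = rng.normal(size=n)

def vif(target, others):
    """1 / (1 - r-squared) from regressing `target` on the other variables."""
    X = np.column_stack([np.ones(len(target))] + others)
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ b
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

vif_weight = vif(weight, [height, unrelated])
vif_unrelated = vif(unrelated, [height, weight])
print(vif_weight, vif_unrelated)  # large for weight, near 1 for unrelated
```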
Model specification
MLR is a wonderful tool, but it is only as good as the inputs it is built upon. Model specification refers to the set of variables included in the regression and the regression equation's functional form. In order to better specify our models, we should follow a few principles:
- The model should be grounded in sound economic reasoning.
- The functional form chosen for the variables in the regression should be appropriate given the nature of the variables.
- The model should be simple, using only essential variables.
- The model should be examined for violations of assumptions.
- The model should be tested and found useful with data not included in the sample.
If our model turns out to be misspecified, any predictions we make with it will be invalid. Some common errors in specifying models include omitting important variables, failing to transform variables that need to be transformed, and incorrectly pooling data that shouldn't be pooled.
In relation to investing
MLR is often used to determine how specific factors such as interest rates, GDP, commodity prices, and cash flows influence the price movement of an asset. These models are never precise enough to use as the sole criterion for an investment decision, but they are still extremely useful for understanding which variables are relevant to your analysis.
The general purpose of MLR is to learn more about the relationship between several independent variables and a dependent variable. These models are never 100% accurate descriptions of the world, but understanding the assumptions they are built upon helps us determine whether they can still be valuable in a given situation.