What is collinearity and how to deal with it
What is collinearity
When you perform a regression with independent and dependent variables, the coefficient estimates from the model represent the following:
the change in the dependent variable for each one-unit change in an independent variable, when the other independent variables remain the same.
This statement holds only when the independent variables are truly independent of one another. What does this mean?
Let’s say there are two independent variables A and B and a dependent variable Z. When you perform a regression on these variables and get a coefficient estimate of 10 for A, it means that a 1 unit increase in A will increase Z by 10 while the other variables remain constant, just like a slope in a linear equation. However, when A and B are correlated, a change in A tends to be accompanied by a change in B as well. Therefore, since B also changes, Z could increase by more or less than 10.
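A small simulated sketch in R makes this concrete. The variable names and the simulation are illustrative assumptions, not data from this article: B is generated to move together with A, so "holding B constant while A changes" is not something the data actually exhibit.

```r
# Illustrative simulation: two correlated "independent" variables and one response.
set.seed(1)
A <- rnorm(100)
B <- 0.9 * A + rnorm(100, sd = 0.3)   # B moves together with A
Z <- 10 * A + 5 * B + rnorm(100)

cor(A, B)             # strong correlation between the predictors
coef(lm(Z ~ A + B))   # hard to interpret one coefficient at a time when A and B move together
```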
This latter state is called collinearity — two predictors are correlated with each other. When this relationship involves three or more variables, it is called multicollinearity.
Types of collinearity
There are two types of multicollinearity: structural and data-based multicollinearity. As the name suggests, structural multicollinearity occurs because of how you structure the variables. When you create an independent variable from other independent variables, such as a polynomial term or an interaction term, you introduce structural multicollinearity into the dataset. For instance, the relationship between X and X² is structural multicollinearity. However, isn’t this how polynomial regression works? Yes, it is. So naturally, when you perform a polynomial regression, you are likely to bring structural multicollinearity into your dataset.
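As a quick sketch, you can see this relationship directly in R; the uniform variable below is an illustrative assumption, chosen only because a strictly positive X makes the correlation with X² very strong.

```r
# Structural multicollinearity: a polynomial term is built from, and correlated with, the original variable.
set.seed(1)
X <- runif(100, min = 1, max = 10)   # strictly positive predictor
cor(X, X^2)                          # very close to 1
```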
In contrast, data-based multicollinearity is caused by poorly designed experiments or by characteristics inherent in the data. Generally, observational studies are more likely to suffer from data-based multicollinearity.
Problems with collinearity
Is it always a problem to have collinearity in a dataset? The answer is no; it depends on the goal of your project. Whatever issues multicollinearity brings to a model, it does not diminish the model’s predictive power. Therefore, when your goal is purely prediction, multicollinearity is hardly an issue.
However, when your primary goal is to understand relationships among variables or to report estimates with numerical precision, multicollinearity becomes a problem. Because multicollinearity breaks the basic assumption behind interpreting coefficient estimates, that the other variables remain constant when one independent variable changes, the model gives you unstable and unreliable coefficient estimates. Even a slight change in the data can change the estimates greatly. Multicollinearity also inflates standard errors and shrinks t-statistics, so coefficient confidence intervals become wide and it is harder to reject the null hypothesis that a coefficient equals zero. As a result, it becomes difficult to understand how the independent variables affect the dependent variable.
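A rough simulation in R, using made-up data rather than anything from this article, shows both symptoms at once: the nearly collinear predictors produce large standard errors, and dropping just a few observations can move the estimates noticeably.

```r
# Illustrative simulation of unstable estimates under near-perfect collinearity.
set.seed(42)
n <- 100
A <- rnorm(n)
B <- A + rnorm(n, sd = 0.05)          # B is nearly identical to A
Z <- 3 * A + 2 * B + rnorm(n)

summary(lm(Z ~ A + B))$coefficients   # large standard errors, small t-statistics

idx <- sample(n, n - 5)               # drop five observations and refit
coef(lm(Z[idx] ~ A[idx] + B[idx]))    # estimates can swing noticeably
```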
How to detect it
Now that you know collinearity is not desirable when you want to understand associations in a dataset, how do you detect it? There are two common ways to detect collinearity between pairs of predictors.
First, you can use pairs() in R to plot the independent variables against one another and look for clear patterns. However, since interpreting patterns is subjective, you may prefer a more concrete method.
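For example, a minimal sketch on R's built-in mtcars data (an illustrative choice, not a dataset used in this article) looks like this:

```r
# Scatterplot matrix of a few predictors: look for clear linear patterns between them.
data(mtcars)
pairs(mtcars[, c("disp", "hp", "wt", "drat")])
```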
If you are looking for a numerical approach, you can use cor() in R or the corrplot package to produce a correlation matrix. However, there is no magic cutoff, since the threshold should be determined project by project.
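Continuing with the same illustrative mtcars predictors, and assuming the corrplot package is installed:

```r
data(mtcars)
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]

cor(predictors)                               # numerical correlation matrix

library(corrplot)                             # install.packages("corrplot") if needed
corrplot(cor(predictors), method = "circle")  # visual summary of the same matrix
```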
To detect multicollinearity, you can use the VIF (variance inflation factor). Generally, a VIF value between 5 and 10 is considered high, and a value over 10 is considered extremely high. Since the VIF is widely used in statistical analysis, many packages provide a function to calculate it; to name a few, there are the car, faraway, mctest, and HH packages.
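As a sketch with the car package (one of the packages named above), vif() takes a fitted model; the mtcars model below is again an illustrative assumption:

```r
library(car)                                         # install.packages("car") if needed
data(mtcars)

fit <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
vif(fit)                                             # values between 5 and 10 are usually treated as high
```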
How to deal with it
When you decide to reduce multicollinearity, there are several approaches you can try. First, you can center the independent variables by subtracting their means, which helps reduce VIF scores, particularly for structural multicollinearity. Another option is to remove some of the highly correlated variables or to combine them into a single independent variable. Lastly, you could collect additional data in a way that reduces the correlations among predictors.
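A brief sketch of the centering idea, using an assumed polynomial model on mtcars rather than any example from this article:

```r
library(car)
data(mtcars)

raw <- lm(mpg ~ hp + I(hp^2), data = mtcars)          # hp and hp^2 are strongly related
vif(raw)

mtcars$hp_c <- mtcars$hp - mean(mtcars$hp)            # subtract the mean
centered <- lm(mpg ~ hp_c + I(hp_c^2), data = mtcars)
vif(centered)                                         # noticeably lower VIF scores
```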
If none of these works, you can use modeling methods that handle multicollinearity. A few examples are ridge regression, the lasso, principal component analysis, and partial least squares.
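As one example, here is a minimal ridge regression sketch; the glmnet package and the mtcars predictors are assumptions made for illustration, not choices prescribed by this article:

```r
library(glmnet)                     # install.packages("glmnet") if needed
data(mtcars)

x <- as.matrix(mtcars[, c("disp", "hp", "wt", "drat")])
y <- mtcars$mpg

cv  <- cv.glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge penalty; lambda chosen by cross-validation
fit <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min)
coef(fit)                           # shrunken, more stable coefficient estimates
```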