What is the Variance Inflation Factor (VIF)?
The Variance Inflation Factor (VIF) measures the severity of multicollinearity in regression analysis. It is a statistical concept that indicates the increase in the variance of a regression coefficient as a result of collinearity.
- Variance inflation factor (VIF) is used to detect the severity of multicollinearity in the ordinary least square (OLS) regression analysis.
- Multicollinearity inflates the variance and type II error. It makes the coefficient of a variable consistent but unreliable.
- VIF measures the number of inflated variances caused by multicollinearity.
Variance Inflation Factor and Multicollinearity
In ordinary least square (OLS) regression analysis, multicollinearity exists when two or more of the independent variables demonstrate a linear relationship between them. For example, to analyze the relationship of company sizes and revenues to stock prices in a regression model, market capitalizations and revenues are the independent variables.
A company’s market capitalization and its total revenue is strongly correlated. As a company earns increasing revenues, it also grows in size. It leads to a multicollinearity problem in the OLS regression analysis. If the independent variables in a regression model show a perfectly predictable linear relationship, it is known as perfect multicollinearity.
With multicollinearity, the regression coefficients are still consistent but are no longer reliable since the standard errors are inflated. It means that the model’s predictive power is not reduced, but the coefficients may not be statistically significant with a Type II error.
Therefore, if the coefficients of variables are not individually significant – cannot be rejected in the t-test, respectively – but can jointly explain the variance of the dependent variable with rejection in the F-test and a high coefficient of determination (R2), multicollinearity might exist. It is one of the methods to detect multicollinearity.
VIF is another commonly used tool to detect whether multicollinearity exists in a regression model. It measures how much the variance (or standard error) of the estimated regression coefficient is inflated due to collinearity.
Use of Variance Inflation Factor
VIF can be calculated by the formula below:
Where Ri2 represents the unadjusted coefficient of determination for regressing the ith independent variable on the remaining ones. The reciprocal of VIF is known as tolerance. Either VIF or tolerance can be used to detect multicollinearity, depending on personal preference.
If Ri2 is equal to 0, the variance of the remaining independent variables cannot be predicted from the ith independent variable. Therefore, when VIF or tolerance is equal to 1, the ith independent variable is not correlated to the remaining ones, which means multicollinearity does not exist in this regression model. In this case, the variance of the ith regression coefficient is not inflated.
Generally, a VIF above 4 or tolerance below 0.25 indicates that multicollinearity might exist, and further investigation is required. When VIF is higher than 10 or tolerance is lower than 0.1, there is significant multicollinearity that needs to be corrected.
However, there are also situations where high VFIs can be safely ignored without suffering from multicollinearity. The following are three such situations:
1. High VIFs only exist in control variables but not in variables of interest. In this case, the variables of interest are not collinear to each other or the control variables. The regression coefficients are not impacted.
2. When high VIFs are caused as a result of the inclusion of the products or powers of other variables, multicollinearity does not cause negative impacts. For example, a regression model includes both x and x2 as its independent variables.
3. When a dummy variable that represents more than two categories has a high VIF, multicollinearity does not necessarily exist. The variables will always have high VIFs if there is a small portion of cases in the category, regardless of whether the categorical variables are correlated to other variables.
Correction of Multicollinearity
Since multicollinearity inflates the variance of coefficients and causes type II errors, it is essential to detect and correct it. There are two simple and commonly used ways to correct multicollinearity, as listed below:
1. The first one is to remove one (or more) of the highly correlated variables. Since the information provided by the variables is redundant, the coefficient of determination will not be greatly impaired by the removal.
2. The second method is to use principal components analysis (PCA) or partial least square regression (PLS) instead of OLS regression. PLS regression can reduce the variables to a smaller set with no correlation among them. In PCA, new uncorrelated variables are created. It minimizes information loss and improves the predictability of a model.
CFI is the official provider of the global Business Intelligence & Data Analyst (BIDA)® certification program, designed to help anyone become a world-class analyst. To keep advancing your career, the additional resources below will be useful: