What is Ridge Regression?
Ridge regression is a method for fitting multiple regression models to data that suffer from multicollinearity. It is most useful when a data set contains more predictor variables than observations. It is also well suited to data in which the predictors are strongly correlated with one another.
Multicollinearity happens when predictor variables are correlated among themselves. Ridge regression aims to reduce the standard errors of the coefficient estimates by adding a small amount of bias to them. When the variance falls by more than the squared bias rises, the estimates become more reliable overall.
- Ridge regression is a technique used to mitigate the effects of multicollinearity in regression models.
- When observations are fewer than predictor variables, ridge regression is an appropriate technique, because ordinary least squares has no unique solution in that case.
- The ridge regression constraint region for the coefficients is circular when plotted in two dimensions, unlike the LASSO constraint region, which is diamond-shaped.
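The ridge estimate has a simple closed form, which the following minimal sketch illustrates with NumPy. The function name `ridge_fit`, the penalty value `lam=1.0`, and the simulated data are assumptions made for illustration only; the sketch assumes centered data and no intercept term.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^-1 X'y for centered data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    # Adding lam to the diagonal is the only difference from least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (hypothetical): x2 is nearly a copy of x1, so the
# two predictors are strongly collinear.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)

# Center the data so that no intercept is needed.
X = X - X.mean(axis=0)
y = y - y.mean()

beta_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 reduces to least squares
beta_ridge = ridge_fit(X, y, lam=1.0)  # lam > 0 shrinks the coefficients

print("least squares:", beta_ols)
print("ridge        :", beta_ridge)
```

With collinear predictors the least squares coefficients tend to be large and unstable, while the ridge coefficients are pulled toward smaller values. The coefficient vector from ridge is never longer than the one from least squares on the same data.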
Variable Standardization in Ridge Regression
Variable standardization is the initial procedure in ridge regression. Both the independent and dependent variables are standardized by subtracting their means and dividing the result by their standard deviations. Standardization puts the predictors on a common scale, so the penalty treats each coefficient comparably. Without a convention, every formula would need to annotate whether the variables in it are standardized or not.
To avoid that notation, all ridge regression computations are assumed to use standardized variables. The coefficients can be converted back to their original scales at the end.
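The standardize-then-revert procedure can be sketched as follows. The helper names `standardize` and `unstandardize_coefs`, the use of the sample standard deviation (`ddof=1`), and the simulated data are assumptions for illustration. The penalty is set to zero here only so that the back-transformed result can be checked against an ordinary least squares fit on the raw data.

```python
import numpy as np

def standardize(a):
    """Subtract the mean and divide by the sample standard deviation."""
    a = np.asarray(a, dtype=float)
    mean = a.mean(axis=0)
    std = a.std(axis=0, ddof=1)  # assumption: sample standard deviation
    return (a - mean) / std, mean, std

def unstandardize_coefs(beta_std, x_mean, x_std, y_mean, y_std):
    """Map coefficients fitted on standardized data back to original units."""
    slopes = beta_std * y_std / x_std
    intercept = y_mean - x_mean @ slopes
    return intercept, slopes

# Simulated predictors on very different scales (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(40, 2))
y = 3.0 + 1.5 * X[:, 0] - 0.02 * X[:, 1] + rng.normal(scale=0.1, size=40)

Xs, x_mean, x_std = standardize(X)
ys, y_mean, y_std = standardize(y)

lam = 0.0  # no penalty, so the result should match plain least squares
beta_std = np.linalg.solve(Xs.T @ Xs + lam * np.eye(2), Xs.T @ ys)
intercept, slopes = unstandardize_coefs(beta_std, x_mean, x_std, y_mean, y_std)

print("intercept:", intercept)
print("slopes   :", slopes)
```

Setting `lam` to a positive value gives ridge coefficients in the original units through the same back-transformation.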
Ridge Regression vs. Least Squares
Ridge regression is a better predictor than least squares regression when the predictor variables outnumber the observations. In that setting the least squares solution is not unique. Even with enough observations, least squares does not distinguish between more useful and less useful predictor variables and includes all of them at full weight while developing a model. This results in overfitting and redundancy, which reduce the accuracy of the model.
Ridge regression addresses these challenges. It does not require unbiased estimators; rather, it adds bias to the estimators to reduce their standard errors. It adds enough bias to make the estimates a more reliable representation of the population from which the data were drawn.
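A small numerical sketch shows why least squares breaks down when predictors outnumber observations and why the ridge penalty repairs it. The dimensions, the penalty `lam = 1.0`, and the random data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 25  # fewer observations than predictors
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Least squares needs the inverse of X'X, but that matrix has rank at
# most n, so it is singular whenever p > n.
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)

# Ridge adds lam to every eigenvalue of X'X, which makes it invertible.
lam = 1.0
ridge_gram = gram + lam * np.eye(p)
smallest_eig = np.linalg.eigvalsh(ridge_gram).min()
beta_ridge = np.linalg.solve(ridge_gram, X.T @ y)

print("rank of X'X:", rank, "out of", p)
print("smallest eigenvalue after adding the penalty:", smallest_eig)
```

Because every eigenvalue of the penalized matrix is at least `lam`, the ridge system has a unique solution even though the unpenalized one does not.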
Shrinkage and Regularization
A ridge estimator is the shrinkage tool used in ridge regression. A shrinkage estimator is an estimator that pulls a raw estimate toward a fixed value, usually zero, with the aim of giving values that are closer on average to the true population parameters. A least squares estimate can be shrunk using a ridge estimator to improve the estimate, especially when there is multicollinearity in the data.
Regularization in ridge regression means applying a penalty to the size of the coefficients, specifically to the sum of their squares. The same penalty parameter applies to every coefficient, so all coefficients are shrunk toward zero, but none is set exactly to zero. This means that no predictor is left out when building the model.
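The shrinkage behavior can be seen by refitting the same data with increasing penalties. This is a sketch under assumed values: the grid `lams`, the helper `ridge_coefs`, and the simulated data are all hypothetical. The last lines compute the shrinkage factors d^2 / (d^2 + lam) along the principal directions of the predictors, which is how the single penalty translates into different amounts of shrinkage per direction.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Closed-form ridge coefficients for standardized X and centered y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Simulated data (hypothetical), standardized as described earlier.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=60)
y = y - y.mean()

# Fit once per penalty value and record the length of the coefficient vector.
lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
path = np.array([ridge_coefs(X, y, lam) for lam in lams])
norms = np.linalg.norm(path, axis=1)

for lam, coefs in zip(lams, path):
    print(f"lam={lam:7.1f}  coefficients={np.round(coefs, 4)}")

# Shrinkage factors along the principal directions for one penalty value.
lam = 10.0
d = np.linalg.svd(X, compute_uv=False)
factors = d**2 / (d**2 + lam)
print("shrinkage factors:", factors)
```

The coefficient vector gets shorter as the penalty grows, yet every coefficient stays nonzero, which is the contrast with LASSO noted earlier.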
Multicollinearity
Multicollinearity is the existence of correlation between independent variables in modeled data. It can make the regression coefficient estimates inaccurate, inflate their standard errors, and reduce the power of any t-tests. It can also produce misleading results and p-values and add redundancy to a model, making its predictions less efficient and less reliable.
Multicollinearity can enter the data from various sources: the data collection method, population or linear model constraints, an over-defined model, model specification or choice, and outliers.
Data collection may cause multicollinearity when the data are gathered with an inappropriate sampling procedure, for example when the sample comes from a narrower subset of the population than intended. Population or model constraints cause multicollinearity through physical, legal, or political constraints that exist in the population itself, regardless of the sampling method used.
Over-defining a model also causes multicollinearity, because the model has more variables than observations. It is avoidable during the development of a model. The model’s choice or specification can cause multicollinearity when it uses independent variables that are derived from, or interact with, variables already in the initial variable set. Outliers are extreme variable values that can cause multicollinearity. This multicollinearity can be removed by eliminating the outliers before applying ridge regression.
Multicollinearity Detection and Correction
Detecting multicollinearity is key to reducing standard errors in models and so to efficient prediction. First, one can examine pairwise scatter plots of the independent variables for correlation. High pairwise correlations among independent variables can indicate the presence of multicollinearity.
Second, one can consider Variance Inflation Factors (VIFs). As a rule of thumb, a VIF of 10 or more indicates that a variable is collinear with the others. Third, one can check whether any eigenvalues of the correlation matrix are close to zero. One should use the condition numbers, rather than the raw numerical sizes of the eigenvalues. The larger the condition numbers, the more severe the multicollinearity.
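The three checks above can be sketched with NumPy alone. The function names `vif_scores` and `condition_number` and the simulated data are assumptions for illustration. The VIFs are taken from the diagonal of the inverse correlation matrix, and the condition number is defined here as the ratio of the largest to the smallest eigenvalue of the correlation matrix; other texts use the square root of that ratio.

```python
import numpy as np

def vif_scores(X):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def condition_number(X):
    """Largest eigenvalue of the correlation matrix divided by the smallest."""
    corr = np.corrcoef(X, rowvar=False)
    eig = np.linalg.eigvalsh(corr)
    return eig.max() / eig.min()

# Simulated data (hypothetical): x1 and x2 are collinear, x3 is unrelated.
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

pairwise = np.corrcoef(X, rowvar=False)  # check 1: pairwise correlations
vifs = vif_scores(X)                     # check 2: variance inflation factors
cond = condition_number(X)               # check 3: condition number

print("pairwise correlations:\n", np.round(pairwise, 3))
print("VIFs:", np.round(vifs, 2))
print("condition number:", round(float(cond), 1))
```

In this setup the two collinear predictors exceed the VIF threshold of 10 while the unrelated predictor stays near 1, and the near-zero eigenvalue shows up as a large condition number.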
Multicollinearity correction depends on the cause. When the source of collinearity is data collection, for example, the correction involves collecting additional data from the proper subpopulation. If the cause is the linear model choice, the correction involves simplifying the model using proper variable selection methods. If certain observations cause the multicollinearity, eliminate those observations. Ridge regression is also an effective remedy: it does not remove the correlation among predictors, but it reduces its effect on the coefficient estimates.