12.4 Multicollinearity and Variable Transformation
3 min read • July 23, 2024
Multicollinearity in regression can mess up your results. It happens when your predictor variables are too closely related, making it hard to figure out which ones are really important. This can lead to unstable coefficient estimates and unreliable predictions.
There are ways to spot and fix multicollinearity. You can use correlation matrices, VIF, or condition numbers to detect it. If you find it, try transforming your variables through centering, standardization, or more advanced techniques like PCA or PLS regression.
Multicollinearity in regression analysis
High correlation among independent variables in a multiple regression model
Occurs when two or more predictor variables are linearly related (income and education level)
Leads to unstable and unreliable estimates of regression coefficients
Standard errors of the coefficients may be inflated, making it difficult to assess the significance of individual predictors (price and quality ratings for products)
Reduces the model's predictive power and interpretability
Can cause the coefficients to have unexpected signs or magnitudes (negative coefficient for a positive relationship)
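The instability described above is easy to see in a small simulation. The following sketch (NumPy, with made-up data) fits a regression on two nearly collinear predictors: the individual coefficients are poorly identified, but their sum, the joint effect of the shared direction, is estimated reliably.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
y = 2 * x1 + 1 * x2 + rng.normal(size=n)   # true joint effect is 3

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# x1 and x2 are almost perfectly correlated, so beta[1] and beta[2]
# individually are unstable, while their sum is well identified.
print(np.corrcoef(x1, x2)[0, 1])   # very close to 1
print(beta[1] + beta[2])           # close to 3
```

This is why multicollinearity hurts interpretation more than overall fit: the model predicts well along the shared direction, but cannot attribute the effect to one predictor or the other.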
Diagnostic measures for multicollinearity
A correlation matrix examines pairwise correlations between independent variables
High correlations (above 0.8 or 0.9) indicate potential multicollinearity (age and years of experience)
The variance inflation factor (VIF) measures the extent to which the variance of a regression coefficient is inflated due to multicollinearity
VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared value obtained by regressing the jth predictor on the remaining predictors
VIF value greater than 5 or 10 suggests the presence of multicollinearity (VIF of 8 for a predictor variable)
The condition number is the ratio of the largest to the smallest eigenvalue of the correlation matrix of the independent variables
Condition number greater than 30 indicates severe multicollinearity (condition number of 50)
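The diagnostics above can be computed directly. Here is a minimal sketch using NumPy on simulated data (the variable names `age`, `experience`, and `income` are illustrative); the VIF function implements VIF_j = 1 / (1 - R_j^2), and the condition number follows the eigenvalue-ratio definition given above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
age = rng.normal(40, 10, n)
experience = age - 22 + rng.normal(0, 2, n)   # strongly tied to age
income = rng.normal(50, 15, n)                # unrelated predictor
X = np.column_stack([age, experience, income])

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        resid = X[:, j] - A @ np.linalg.lstsq(A, X[:, j], rcond=None)[0]
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

# Condition number: ratio of largest to smallest eigenvalue
# of the correlation matrix of the predictors
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
cond_number = eigvals.max() / eigvals.min()

print(vif(X))        # age and experience get large VIFs; income stays near 1
print(cond_number)   # well above the rule-of-thumb threshold of 30
```

Both diagnostics flag the same problem here: the age/experience pair inflates the VIFs and produces a near-zero eigenvalue in the correlation matrix.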
Variable transformation for multicollinearity
Centering subtracts the mean value of each independent variable from its respective values
Reduces multicollinearity caused by interaction terms in the model (centering age and income variables)
Standardization (Z-score normalization) subtracts the mean and divides by the standard deviation for each independent variable
Scales the variables to have a mean of 0 and a standard deviation of 1 (standardizing test scores)
Principal Component Analysis (PCA) transforms the original variables into a new set of uncorrelated variables called principal components
Principal components are linear combinations of the original variables and can be used as predictors in the regression model (PCA on a set of correlated financial ratios)
Partial Least Squares (PLS) regression combines features of PCA and multiple regression
Constructs new predictor variables (latent variables) that maximize the covariance between the predictors and the response variable (PLS regression for customer satisfaction analysis)
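To make the transformations concrete, here is a sketch in NumPy on simulated data: standardization rescales each column to mean 0 and standard deviation 1, and PCA, implemented here via an eigendecomposition of the correlation matrix, produces components that are mutually uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=300)  # induce collinearity

# Standardization (Z-score normalization): subtract the mean,
# divide by the standard deviation, column by column
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via the eigendecomposition of the correlation matrix:
# each principal component is a linear combination of the columns
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
components = Z @ eigvecs

# The components are uncorrelated, so they can replace the
# collinear predictors in a regression
print(np.round(np.corrcoef(components, rowvar=False), 6))
```

The trade-off, discussed in the interpretation section below, is that a coefficient on a principal component describes a mixture of the original variables rather than any single one.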
Interpretation after variable transformation
Assess the significance of the transformed variables by examining the p-values associated with the coefficients
P-value less than the chosen significance level (0.05) indicates that the transformed variable has a significant impact on the response variable (p-value of 0.02 for a transformed predictor)
Interpret the coefficients of the transformed variables
Coefficients represent the change in the response variable for a one-unit change in the transformed predictor variable, holding other variables constant (a one-unit increase in the standardized income leads to a 0.5 unit increase in the response)
Interpretation depends on the specific transformation applied (centering, standardization, PCA)
Evaluate the model's goodness of fit using the R-squared value
R-squared measures the proportion of variance in the response variable explained by the transformed predictor variables
Higher R-squared value indicates a better fit of the model to the data (R-squared of 0.8 suggests a good fit)
Assess the model's predictive power using techniques such as cross-validation or holdout sample
Compare the predicted values with the actual values to assess the model's predictive accuracy (mean absolute error of 0.1 indicates high predictive accuracy)
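A holdout evaluation like the one described can be sketched as follows (NumPy, simulated data; the 70/30 split is an arbitrary illustrative choice): fit on the training portion, then compare predictions with actual values on the held-out portion via mean absolute error.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=n)

# Holdout split: fit on 70% of the rows, evaluate on the other 30%
idx = rng.permutation(n)
train, test = idx[:280], idx[280:]
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Mean absolute error on the holdout sample
mae = np.abs(X[test] @ beta - y[test]).mean()
print(mae)   # small relative to the spread of y, so predictions are accurate
```

Repeating the split several times (or using k-fold cross-validation) gives a more stable estimate of out-of-sample accuracy than a single holdout.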