
12.4 Multicollinearity and Variable Transformation

3 min read • July 23, 2024

Multicollinearity in regression can mess up your results. It happens when your predictor variables are too closely related, making it hard to figure out which ones are really important. This can lead to weird coefficient estimates and unreliable predictions.

There are ways to spot and fix multicollinearity. You can use correlation matrices, VIF, or condition numbers to detect it. If you find it, try transforming your variables through centering, standardization, or more advanced techniques like PCA or PLS regression.

Multicollinearity and Variable Transformation

Multicollinearity in regression analysis

  • High correlation among independent variables in a multiple regression model
    • Occurs when two or more predictor variables are linearly related (income and education level)
  • Leads to unstable and unreliable estimates of regression coefficients
    • Standard errors of the coefficients may be inflated, making it difficult to assess the significance of individual predictors (price and quality ratings for products)
  • Reduces the model's predictive power and interpretability
  • Can cause the coefficients to have unexpected signs or magnitudes (negative coefficient for a positive relationship)
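The instability described above is easy to see in a small simulation. The sketch below (NumPy only; the data, noise scale, and true coefficients are made up for illustration) refits the same regression on many simulated samples with two nearly collinear predictors:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
coefs = []
for _ in range(200):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
    y = x1 + x2 + rng.normal(size=n)      # true coefficients are both 1
    A = np.column_stack([np.ones(n), x1, x2])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    coefs.append(beta[1])

# Across samples the estimate for x1 swings widely around its true value of 1,
# often even flipping sign -- the hallmark of multicollinearity
spread = np.std(coefs)
```

Dropping either redundant predictor would shrink this spread dramatically, which is exactly why detecting multicollinearity matters.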

Diagnostic measures for multicollinearity

  • A correlation matrix examines pairwise correlations between independent variables
    • High correlations (above 0.8 or 0.9) indicate potential multicollinearity (age and years of experience)
  • Variance Inflation Factor (VIF) measures the extent to which the variance of a regression coefficient is inflated due to multicollinearity
    • $VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the R-squared value obtained by regressing the jth predictor on the remaining predictors
    • VIF value greater than 5 or 10 suggests the presence of multicollinearity (VIF of 8 for a predictor variable)
  • The condition number is the ratio of the largest to the smallest eigenvalue of the correlation matrix of the independent variables
    • Condition number greater than 30 indicates severe multicollinearity (condition number of 50)
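All three diagnostics can be computed with NumPy alone. This is a minimal sketch on simulated data (the 0.1 noise scale and variable names are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # independent predictor
X = np.column_stack([x1, x2, x3])

# 1. Correlation matrix: look for pairwise correlations above ~0.8-0.9
corr = np.corrcoef(X, rowvar=False)

# 2. VIF: regress each predictor on the others, then VIF_j = 1 / (1 - R_j^2)
def vif(X, j):
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - np.var(y - A @ beta) / np.var(y)
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]

# 3. Condition number: ratio of the largest to the smallest eigenvalue
#    of the predictors' correlation matrix
eigvals = np.linalg.eigvalsh(corr)
cond = eigvals.max() / eigvals.min()
```

Here x1 and x2 flag on all three diagnostics (pairwise correlation near 1, VIFs well above 10, condition number above 30), while the independent x3 keeps a VIF near 1.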

Variable transformation for multicollinearity

  • Centering subtracts the mean value of each independent variable from its respective values
    • Reduces multicollinearity caused by interaction terms in the model (centering age and income variables)
  • Standardization (Z-score normalization) subtracts the mean and divides by the standard deviation for each independent variable
    • Scales the variables to have a mean of 0 and a standard deviation of 1 (standardizing test scores)
  • Principal Component Analysis (PCA) transforms the original variables into a new set of uncorrelated variables called principal components
    • Principal components are linear combinations of the original variables and can be used as predictors in the regression model (PCA on a set of correlated financial ratios)
  • Partial Least Squares (PLS) regression combines features of PCA and multiple regression
    • Constructs new predictor variables (latent variables) that maximize the covariance between the predictors and the response variable (PLS regression for customer satisfaction analysis)
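The first three transformations can be sketched with NumPy alone (PLS needs a dedicated library such as scikit-learn, so it is omitted here). All data and names below are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(loc=10.0, scale=2.0, size=n)   # predictor with a large mean
w = rng.normal(loc=5.0, scale=1.0, size=n)

# Without centering, an interaction term is correlated with its parent variable
r_raw = np.corrcoef(x, x * w)[0, 1]

# Centering the predictors before forming the interaction removes that link
xc, wc = x - x.mean(), w - w.mean()
r_centered = np.corrcoef(xc, xc * wc)[0, 1]

# Standardization (Z-score normalization): mean 0, standard deviation 1
z = (x - x.mean()) / x.std()

# PCA via SVD of the standardized data: the principal-component scores
# are uncorrelated and can replace the original predictors in the regression
X = np.column_stack([x, w, x * w])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T
```

The correlation between a predictor and its interaction term drops to roughly zero after centering, and the PCA scores are uncorrelated by construction.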

Interpretation after variable transformation

  • Assess the significance of the transformed variables by examining the p-values associated with the coefficients
    • P-value less than the chosen significance level (0.05) indicates that the transformed variable has a significant impact on the response variable (p-value of 0.02 for a transformed predictor)
  • Interpret the coefficients of the transformed variables
    • Coefficients represent the change in the response variable for a one-unit change in the transformed predictor variable, holding other variables constant; for a standardized predictor, one unit equals one standard deviation (a one-standard-deviation increase in standardized income leads to a 0.5 unit increase in the response)
    • Interpretation depends on the specific transformation applied (centering, standardization, PCA)
  • Evaluate the model's goodness of fit using the R-squared value
    • R-squared measures the proportion of variance in the response variable explained by the transformed predictor variables
    • Higher R-squared value indicates a better fit of the model to the data (R-squared of 0.8 suggests a good fit)
  • Assess the model's predictive power using techniques such as cross-validation or holdout sample
    • Compare the predicted values with the actual values to assess the model's predictive accuracy (mean absolute error of 0.1 indicates high predictive accuracy)
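A compact sketch of this fit-and-validate workflow on simulated data (the holdout split and the true coefficients are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 2))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Standardize predictors, then split into training and holdout sets
Z = (X - X.mean(axis=0)) / X.std(axis=0)
train, hold = slice(0, 200), slice(200, n)

# Ordinary least squares on the training set
A = np.column_stack([np.ones(200), Z[train]])
beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)

# R-squared: proportion of response variance explained on the training set
fitted = A @ beta
ss_res = np.sum((y[train] - fitted) ** 2)
ss_tot = np.sum((y[train] - y[train].mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Predictive accuracy on the holdout sample: mean absolute error
A_hold = np.column_stack([np.ones(n - 200), Z[hold]])
mae = np.mean(np.abs(y[hold] - A_hold @ beta))
```

Comparing the training R-squared with the holdout error guards against a model that fits the training data well but predicts new data poorly.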
© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

