Linear regression models are powerful tools, but they rely on key assumptions. Model diagnostics help us check if these assumptions hold true, ensuring our results are reliable and meaningful.
We'll explore residual analysis, influential observations, and multicollinearity. These techniques allow us to spot issues like non-linearity or outliers and make the adjustments needed to improve our models' accuracy and usefulness.
Residual Diagnostics
Understanding Residuals and Their Properties
Residuals represent differences between observed and predicted values in a regression model
Homoscedasticity assumes constant variance of residuals across all levels of predictors
Normality of residuals indicates errors follow a normal distribution
Linearity assumes a linear relationship between predictors and the response variable
Independence of errors means residuals are not correlated with each other
Residual plots help visualize patterns and detect violations of assumptions
Breusch-Pagan test assesses homoscedasticity (constant variance of residuals)
Shapiro-Wilk test evaluates normality of residuals
Durbin-Watson test checks for autocorrelation in residuals; all three tests can be run in R as sketched below
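A minimal sketch of all three tests, assuming the lmtest package is installed; the simulated data frame dat and the model formula are hypothetical stand-ins for your own data.

```r
# Hypothetical data: y depends linearly on x, with well-behaved noise
set.seed(42)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 + 3 * dat$x + rnorm(100)

model <- lm(y ~ x, data = dat)

library(lmtest)
bptest(model)                   # Breusch-Pagan: H0 = constant variance
shapiro.test(residuals(model))  # Shapiro-Wilk: H0 = residuals are normal
dwtest(model)                   # Durbin-Watson: H0 = no autocorrelation
```

Small p-values argue against the corresponding assumption; large ones simply mean no violation was detected.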
Detecting and Addressing Assumption Violations
Non-constant variance can be addressed through weighted least squares regression (see the sketch after this list)
Non-normality may require transformation of the response variable (log, square root)
Non-linearity might be resolved by adding polynomial terms or using non-linear regression
Autocorrelation in time series data can be handled with autoregressive models
Residual vs. Fitted plot reveals patterns indicating non-linearity or heteroscedasticity
Scale-Location plot helps assess homoscedasticity across the range of predicted values
Normal Q-Q plot compares residuals to a theoretical normal distribution
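A sketch of two of these remedies on hypothetical heteroscedastic, slightly non-linear data; the inverse-squared-predictor weights are one common heuristic, not the only reasonable choice.

```r
# Hypothetical data: noise standard deviation grows with x
set.seed(1)
dat <- data.frame(x = runif(100, 1, 10))
dat$y <- 2 + 3 * dat$x + 0.2 * dat$x^2 + rnorm(100, sd = dat$x)

ols <- lm(y ~ x, data = dat)

# Remedy 1: weighted least squares down-weights the high-variance points
wls <- lm(y ~ x, data = dat, weights = 1 / dat$x^2)

# Remedy 2: a polynomial term captures the curvature
poly_fit <- lm(y ~ poly(x, 2), data = dat)
```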
Influential Observations
Identifying Outliers and High Leverage Points
Outliers deviate significantly from other observations in the dataset
Leverage points have extreme values in the predictor variables
Influential observations substantially impact the regression model's coefficients
Cook's distance measures the overall influence of an observation on the model
Standardized residuals help identify potential outliers (absolute values greater than 3 are a common rule of thumb)
Hat values measure leverage, with high values indicating potential influence
DFFITS (Difference in Fits) assesses an observation's influence on its own predicted value
DFBETAS evaluates an observation's influence on specific regression coefficients; base R computes all of these measures, as sketched below
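All of these measures are available in base R for a fitted lm object; the data below are hypothetical, with one influential point planted deliberately.

```r
set.seed(7)
dat <- data.frame(x = rnorm(50))
dat$y <- 1 + 2 * dat$x + rnorm(50)
dat[50, ] <- c(4, 25)  # plant a high-leverage, high-residual point

model <- lm(y ~ x, data = dat)

cooks.distance(model)  # overall influence of each observation
rstandard(model)       # standardized residuals; |value| > 3 flags outliers
hatvalues(model)       # leverage (hat values)
dffits(model)          # influence on the observation's own fitted value
dfbetas(model)         # influence on each regression coefficient
```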
Handling Influential Observations
Investigate the cause of outliers or influential points before taking action
Remove truly erroneous data points after careful consideration
Robust regression techniques can minimize the impact of outliers (sketched after this list)
Winsorization replaces extreme values with less extreme ones
Transformation of variables may reduce the impact of influential observations
Adding interaction terms or polynomial features can capture non-linear relationships
Cross-validation helps assess model stability in the presence of influential points
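A sketch of two of these strategies, assuming the MASS package for robust regression; the winsorize() helper below is a hypothetical convenience function, not part of any package.

```r
library(MASS)

set.seed(3)
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)
dat$y[1] <- 40  # inject a gross outlier

# Robust regression: M-estimation automatically down-weights the outlier
robust_fit <- rlm(y ~ x, data = dat)

# Winsorization: clip values beyond the 5th/95th percentiles (hypothetical helper)
winsorize <- function(v, p = 0.05) {
  lims <- quantile(v, c(p, 1 - p))
  pmin(pmax(v, lims[1]), lims[2])
}
wins_fit <- lm(winsorize(y) ~ x, data = dat)
```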
Diagnostic Plots
Interpreting Q-Q Plots
Q-Q (Quantile-Quantile) plots compare sample quantiles to theoretical quantiles
Straight line in a Q-Q plot indicates normally distributed residuals
Deviations from the line suggest non-normality in the tails of the distribution
An arched (consistently concave or convex) curve in a Q-Q plot indicates skewness in the residuals
Light tails (points above the line at the left end and below it at the right) suggest under-dispersion, with fewer extreme values than a normal distribution
Heavy tails (points below the line at the left end and above it at the right) indicate over-dispersion, with more extreme values than a normal distribution
R functions qqnorm() and qqline() create a Q-Q plot and add a reference line, as in the sketch below
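A minimal sketch, refitting the hypothetical model from the earlier snippets:

```r
set.seed(42)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 + 3 * dat$x + rnorm(100)
model <- lm(y ~ x, data = dat)

res <- residuals(model)
qqnorm(res)  # sample quantiles (y-axis) vs. theoretical normal quantiles (x-axis)
qqline(res)  # reference line through the first and third quartiles
```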
Analyzing Residual Plots
Residual plots show residuals against fitted values or predictors
Random scatter around the zero line indicates a good model fit
Funnel shape suggests heteroscedasticity (non-constant variance)
U-shaped or inverted U-shaped patterns indicate non-linearity
Clusters in residual plots may suggest missing predictor variables
Residual plots help detect outliers and influential observations
R function plot() applied to a linear model object creates the standard diagnostic plots
Partial residual plots assess linearity for individual predictors (both are sketched below)
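A sketch of both, assuming the car package for the partial (component-plus-residual) plots; the two-predictor model is hypothetical.

```r
library(car)

set.seed(42)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(100)
model <- lm(y ~ x1 + x2, data = dat)

par(mfrow = c(2, 2))
plot(model)    # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage

crPlots(model) # component + residual (partial residual) plot for each predictor
```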
Multicollinearity
Detecting and Measuring Multicollinearity
Multicollinearity occurs when predictor variables are highly correlated
Variance Inflation Factor (VIF) quantifies the severity of multicollinearity
VIF values greater than 5 (or 10, under a more lenient rule of thumb) indicate problematic multicollinearity
Correlation matrix reveals pairwise correlations between predictors
Condition number of the design matrix indicates overall multicollinearity
Eigenvalues of the correlation matrix can identify sources of multicollinearity
Tolerance (1/VIF) measures the proportion of variance not explained by other predictors
R function vif() from the car package calculates VIF for regression models, as in the sketch below
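A minimal sketch with two deliberately collinear hypothetical predictors:

```r
library(car)

set.seed(9)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # nearly a copy of x1
y  <- 1 + x1 + x2 + rnorm(100)

model <- lm(y ~ x1 + x2)
vif(model)      # values well above 5-10 flag multicollinearity
1 / vif(model)  # tolerance
```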
Addressing Multicollinearity Issues
Remove one of the highly correlated predictors from the model
Combine correlated predictors into a single composite variable
Principal Component Analysis (PCA) creates uncorrelated linear combinations
Ridge regression adds a penalty term to reduce the impact of multicollinearity
Lasso regression can perform variable selection in the presence of multicollinearity (see the sketch after this list)
Partial Least Squares regression handles multicollinearity in high-dimensional data
Centering or standardizing predictors may help reduce multicollinearity
Collecting additional data or increasing sample size can sometimes mitigate the issue
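A sketch of the two penalized remedies plus the PCA alternative, assuming the glmnet package; alpha = 0 gives ridge and alpha = 1 gives the lasso.

```r
library(glmnet)

# Reuse the collinear predictors from the VIF sketch
set.seed(9)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
y  <- 1 + x1 + x2 + rnorm(100)
X  <- cbind(x1, x2)

ridge <- cv.glmnet(X, y, alpha = 0)  # ridge: shrinks the correlated pair together
lasso <- cv.glmnet(X, y, alpha = 1)  # lasso: may drop one of the pair entirely

# PCA alternative: regress on uncorrelated principal components
pcs <- prcomp(X, scale. = TRUE)$x
pc_fit <- lm(y ~ pcs[, 1])           # first component carries the shared signal
```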