Linear regression models are powerful tools, but they rely on key assumptions. Model diagnostics help us check if these assumptions hold true, ensuring our results are reliable and meaningful.

We'll explore residual analysis, influential observation detection, and multicollinearity diagnostics. These techniques allow us to spot issues like non-linearity or heteroscedasticity, and make necessary adjustments to improve our models' accuracy and usefulness.

Residual Diagnostics

Understanding Residuals and Their Properties

  • Residuals represent differences between observed and predicted values in a regression model
  • Homoscedasticity assumes constant variance of residuals across all levels of predictors
  • Normality of residuals indicates errors follow a normal distribution
  • Linearity assumes a linear relationship between predictors and the response variable
  • Independence of errors means residuals are not correlated with each other
  • Residual plots help visualize patterns and detect violations of assumptions
  • The Breusch-Pagan test assesses homoscedasticity (constant variance of residuals)
  • The Shapiro-Wilk test evaluates normality of residuals
  • The Durbin-Watson test checks for autocorrelation in residuals (see the R sketch after this list)
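
All three tests can be run in R; a minimal sketch, assuming the lmtest package for the Breusch-Pagan and Durbin-Watson tests and an illustrative model fit on a built-in dataset:

    library(lmtest)                    # provides bptest() and dwtest()

    model <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative fit

    bptest(model)                      # Breusch-Pagan: H0 = constant variance
    shapiro.test(residuals(model))     # Shapiro-Wilk: H0 = normal residuals
    dwtest(model)                      # Durbin-Watson: H0 = no autocorrelation

Small p-values are evidence against the corresponding assumption.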

Detecting and Addressing Assumption Violations

  • Non-constant variance can be addressed through weighted least squares regression
  • Non-normality may require transformation of the response variable (log, square root); both remedies are sketched after this list
  • Non-linearity might be resolved by adding polynomial terms or using non-linear regression
  • Autocorrelation in time series data can be handled with autoregressive models
  • Residual vs. Fitted plot reveals patterns indicating non-linearity or heteroscedasticity
  • Scale-Location plot helps assess homoscedasticity across the range of predicted values
  • Normal Q-Q plot compares residuals to a theoretical normal distribution
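
A rough sketch of the first two remedies; the weighting scheme below (inverse squared fitted absolute residuals) is a common heuristic, not the only choice:

    model <- lm(mpg ~ wt + hp, data = mtcars)

    # estimate weights that downweight high-variance observations
    w   <- 1 / fitted(lm(abs(residuals(model)) ~ fitted(model)))^2
    wls <- lm(mpg ~ wt + hp, data = mtcars, weights = w)   # weighted least squares

    logm <- lm(log(mpg) ~ wt + hp, data = mtcars)          # log transform for non-normality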

Influential Observations

Identifying Outliers and High Leverage Points

  • Outliers deviate significantly from other observations in the dataset
  • High leverage points have extreme values in the predictor variables
  • Influential observations substantially impact the regression model's coefficients
  • Cook's distance measures the overall influence of an observation on the model
  • Standardized residuals help identify potential outliers (values > 3 or < -3)
  • Hat values measure leverage, with high values indicating potential influence
  • DFFITS (Difference in Fits) assesses an observation's influence on its own predicted value
  • DFBETAS evaluates an observation's influence on specific regression coefficients (all of these are computed in the sketch after this list)
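
All of these measures come built into base R for a fitted lm object; a minimal sketch:

    model <- lm(mpg ~ wt + hp, data = mtcars)

    cooks.distance(model)      # Cook's distance per observation
    rstandard(model)           # standardized residuals (flag |value| > 3)
    hatvalues(model)           # leverage (diagonal of the hat matrix)
    dffits(model)              # DFFITS
    dfbetas(model)             # DFBETAS, one column per coefficient
    influence.measures(model)  # combined summary that flags influential cases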

Handling Influential Observations

  • Investigate the cause of outliers or influential points before taking action
  • Remove truly erroneous data points after careful consideration
  • Robust regression techniques can minimize the impact of outliers (see the sketch after this list)
  • Winsorization replaces extreme values with less extreme ones
  • Transformation of variables may reduce the impact of influential observations
  • Adding interaction terms or polynomial features can capture non-linear relationships
  • Sensitivity analysis (refitting the model with and without suspect points) helps assess model stability in the presence of influential points
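
One standard robust option is M-estimation via rlm() from the MASS package; a minimal sketch (Huber weighting is the rlm() default):

    library(MASS)                                # provides rlm()

    robust <- rlm(mpg ~ wt + hp, data = mtcars)  # iteratively downweights outliers
    summary(robust)
    robust$w                                     # final weights: small values mark downweighted points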

Diagnostic Plots

Interpreting Q-Q Plots

  • Q-Q (Quantile-Quantile) plots compare sample quantiles to theoretical quantiles
  • Straight line in a Q-Q plot indicates normally distributed residuals
  • Deviations from the line suggest non-normality in the tails of the distribution
  • A consistently bowed (concave or convex) curve indicates skewness in the residuals
  • Light tails (points above the line at the left end and below it at the right end) suggest under-dispersion
  • Heavy tails (points below the line at the left end and above it at the right end) indicate over-dispersion
  • R function qqnorm() creates Q-Q plots, while qqline() adds a reference line (sketch below)
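
A minimal sketch, assuming a fitted lm object called model:

    model <- lm(mpg ~ wt + hp, data = mtcars)

    r <- rstandard(model)    # standardized residuals
    qqnorm(r)                # sample quantiles vs. theoretical normal quantiles
    qqline(r)                # reference line through the quartiles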

Analyzing Residual Plots

  • Residual plots show residuals against fitted values or predictors
  • Random scatter around the zero line indicates good model fit
  • A funnel (fan) shape suggests heteroscedasticity (non-constant variance)
  • U-shaped or inverted U-shaped patterns indicate non-linearity
  • Clusters in residual plots may suggest missing predictor variables
  • Residual plots help detect outliers and influential observations
  • R function plot() applied to a linear model object creates the standard diagnostic plots (see the sketch after this list)
  • Partial residual plots assess linearity for individual predictors
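
A minimal sketch combining the built-in diagnostics with a manual residual plot and partial residual plots from the car package:

    library(car)                            # provides crPlots()

    model <- lm(mpg ~ wt + hp, data = mtcars)

    par(mfrow = c(2, 2))
    plot(model)                             # the four standard diagnostic plots
    par(mfrow = c(1, 1))

    plot(fitted(model), residuals(model),   # residuals vs. fitted, by hand
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)                  # dashed zero line

    crPlots(model)                          # component + residual (partial residual) plots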

Multicollinearity

Detecting and Measuring Multicollinearity

  • Multicollinearity occurs when predictor variables are highly correlated
  • Variance Inflation Factor (VIF) quantifies the severity of multicollinearity
  • VIF values > 5 or 10 indicate problematic multicollinearity
  • Correlation matrix reveals pairwise correlations between predictors
  • Condition number of the design matrix indicates overall multicollinearity
  • Eigenvalues of the correlation matrix can identify sources of multicollinearity
  • Tolerance (1/VIF) measures the proportion of variance not explained by other predictors
  • R function vif() from the car package calculates VIF for regression models (sketch below)
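
A minimal sketch of these checks; the mtcars predictors wt, hp, and disp are used here simply because they are strongly correlated:

    library(car)                            # provides vif()

    model <- lm(mpg ~ wt + hp + disp, data = mtcars)

    vif(model)                              # VIF per predictor (> 5 or 10 is a red flag)
    1 / vif(model)                          # tolerance
    cor(mtcars[, c("wt", "hp", "disp")])    # pairwise correlations
    kappa(model.matrix(model))              # condition number of the design matrix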

Addressing Multicollinearity Issues

  • Remove one of the highly correlated predictors from the model
  • Combine correlated predictors into a single composite variable
  • Principal component analysis (PCA) creates uncorrelated linear combinations of the predictors
  • Ridge regression adds a penalty term to reduce the impact of multicollinearity
  • Lasso regression can perform variable selection in the presence of multicollinearity (both penalized fits are sketched after this list)
  • Partial least squares regression handles multicollinearity in high-dimensional data
  • Centering or standardizing predictors may help reduce multicollinearity
  • Collecting additional data or increasing sample size can sometimes mitigate the issue
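
A sketch of the penalized approaches using the glmnet package (one of several options), where alpha = 0 gives ridge and alpha = 1 gives lasso, with cross-validation choosing the penalty strength:

    library(glmnet)                                # penalized regression

    x <- model.matrix(mpg ~ wt + hp + disp, data = mtcars)[, -1]  # drop intercept column
    y <- mtcars$mpg

    ridge <- cv.glmnet(x, y, alpha = 0)            # ridge: shrinks correlated coefficients
    lasso <- cv.glmnet(x, y, alpha = 1)            # lasso: can zero some of them out

    coef(ridge, s = "lambda.min")
    coef(lasso, s = "lambda.min")

Note that glmnet standardizes predictors internally by default, which also covers the centering/scaling suggestion above.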