📉 Intro to Business Statistics Unit 13 – Linear Regression and Correlation

Linear regression is a powerful statistical tool used to analyze relationships between variables. It helps predict outcomes based on one or more factors, making it valuable in fields like finance, marketing, and healthcare. Understanding its key concepts and applications is crucial for data-driven decision-making. This study guide covers the fundamentals of linear regression, including its mathematical foundations and types. It also explores important assumptions, real-world applications, and common pitfalls to avoid when using this technique. Mastering these concepts will enhance your ability to interpret data and make informed predictions.

What's Linear Regression?

  • Linear regression analyzes the linear relationship between a dependent variable and one or more independent variables
  • Helps predict the value of the dependent variable based on the values of the independent variables
  • Represented by the equation $y = \beta_0 + \beta_1 x + \epsilon$, where $y$ is the dependent variable, $x$ is the independent variable, $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and $\epsilon$ is the error term
  • The goal is to find the line of best fit that minimizes the sum of squared residuals (differences between observed and predicted values)
  • Can be used for both simple linear regression (one independent variable) and multiple linear regression (two or more independent variables)
  • Allows for hypothesis testing to determine the statistical significance of the relationship between variables
  • Provides a measure of the strength of the relationship through the coefficient of determination ($R^2$); a minimal fitting sketch follows this list
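
A minimal sketch of fitting a simple linear regression with NumPy. All numbers here are hypothetical illustrations (advertising spend versus sales), not data from the text:

```python
import numpy as np

# Hypothetical data: advertising spend (in $1,000s) and sales (in units)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Degree-1 least-squares fit returns the slope and intercept of the line of best fit
b1, b0 = np.polyfit(x, y, deg=1)

# Predicted values and residuals (observed minus predicted)
y_hat = b0 + b1 * x
residuals = y - y_hat

# Coefficient of determination: 1 - SSR / SST
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"y-hat = {b0:.3f} + {b1:.3f}x,  R^2 = {r_squared:.3f}")
```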

Key Concepts and Terms

  • Dependent variable (response variable) is the variable being predicted or explained by the independent variable(s)
  • Independent variable (predictor variable) is the variable used to predict or explain the dependent variable
  • Slope ($\beta_1$) represents the change in the dependent variable for a one-unit change in the independent variable
  • Y-intercept ($\beta_0$) is the value of the dependent variable when the independent variable is zero
  • Residuals are the differences between the observed values and the predicted values from the regression line
  • Coefficient of determination ($R^2$) measures the proportion of variance in the dependent variable explained by the independent variable(s)
    • Ranges from 0 to 1, with higher values indicating a stronger relationship
  • P-value is used to determine the statistical significance of the relationship between variables
    • A p-value less than the chosen significance level (e.g., 0.05) indicates a statistically significant relationship
  • Confidence interval is a range of values that is likely to contain the true value of a population parameter with a certain level of confidence (e.g., 95%); the sketch after this list shows these quantities computed from a fitted model
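
As a sketch of how these quantities appear in practice, the snippet below fits an ordinary least squares model with statsmodels on hypothetical data and prints the coefficients, residuals, $R^2$, p-values, and 95% confidence intervals:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.1, 4.8, 6.2, 6.9, 8.4, 9.1, 11.0])

X = sm.add_constant(x)              # adds the intercept column (beta_0)
model = sm.OLS(y, X).fit()          # ordinary least squares fit

print(model.params)                 # [beta_0, beta_1]
print(model.resid)                  # residuals: observed minus predicted values
print(model.rsquared)               # coefficient of determination (R^2)
print(model.pvalues)                # p-values for the intercept and slope
print(model.conf_int(alpha=0.05))   # 95% confidence intervals for the parameters
```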

The Math Behind It

  • The least squares method is used to estimate the parameters ($\beta_0$ and $\beta_1$) of the linear regression model
  • The goal is to minimize the sum of squared residuals (SSR): $SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value
  • The normal equations are used to solve for the parameters:
    • $\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i$
    • $\sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2$
  • The slope ($\beta_1$) is calculated as: $\beta_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
  • The y-intercept ($\beta_0$) is calculated as: $\beta_0 = \bar{y} - \beta_1 \bar{x}$ (both formulas are implemented in the sketch after this list)
  • Hypothesis testing is performed using the t-test for the significance of the slope and the F-test for the overall significance of the regression model
  • Confidence intervals for the parameters are calculated using the standard errors and the t-distribution with $n-2$ degrees of freedom
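
A minimal sketch that applies the slope and intercept formulas above directly and cross-checks the result against NumPy's built-in least-squares fit (the data are hypothetical):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 7.8, 9.9, 12.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of (x_i - x_bar)(y_i - y_bar) divided by sum of (x_i - x_bar)^2
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

# Sum of squared residuals that least squares minimizes
ssr = np.sum((y - (b0 + b1 * x)) ** 2)

# Cross-check against NumPy's least-squares polynomial fit (degree 1)
b1_np, b0_np = np.polyfit(x, y, deg=1)
assert np.allclose([b0, b1], [b0_np, b1_np])

print(f"beta_0 = {b0:.4f}, beta_1 = {b1:.4f}, SSR = {ssr:.4f}")
```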

Types of Linear Regression

  • Simple linear regression involves one independent variable and one dependent variable
    • Example: predicting sales revenue (dependent variable) based on advertising expenditure (independent variable)
  • Multiple linear regression involves two or more independent variables and one dependent variable
    • Example: predicting house prices (dependent variable) based on square footage, number of bedrooms, and location (independent variables)
  • Polynomial regression is a type of linear regression that includes polynomial terms of the independent variable(s)
    • Example: predicting the relationship between age (independent variable) and income (dependent variable) using a quadratic term ($\text{age}^2$); see the sketch after this list
  • Stepwise regression is a method for selecting the most relevant independent variables in a multiple linear regression model
    • Forward selection starts with no variables and adds the most significant variable at each step
    • Backward elimination starts with all variables and removes the least significant variable at each step
  • Ridge regression and lasso regression are regularization techniques used to address multicollinearity and improve the stability of the estimates
    • Ridge regression adds a penalty term to the sum of squared residuals to shrink the coefficients towards zero
    • Lasso regression adds a penalty term that can set some coefficients exactly to zero, effectively performing variable selection
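
A minimal sketch of multiple and polynomial regression in design-matrix form, solved with NumPy's least-squares routine; the predictor names and simulated numbers are hypothetical stand-ins for the examples above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Multiple linear regression: house price modeled from two hypothetical predictors
sqft = rng.uniform(800, 3000, n)                 # square footage
beds = rng.integers(1, 6, n).astype(float)       # number of bedrooms
price = 50 + 0.12 * sqft + 15 * beds + rng.normal(0, 20, n)   # price in $1,000s

# Design matrix: a column of ones (intercept) plus one column per predictor
X = np.column_stack([np.ones(n), sqft, beds])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print("intercept, sqft, beds coefficients:", np.round(coef, 3))

# Polynomial regression is still linear in the parameters: add age^2 as a column
age = rng.uniform(20, 65, n)
income = 20 + 3.0 * age - 0.03 * age ** 2 + rng.normal(0, 5, n)
X_poly = np.column_stack([np.ones(n), age, age ** 2])
coef_poly, *_ = np.linalg.lstsq(X_poly, income, rcond=None)
print("quadratic fit coefficients:", np.round(coef_poly, 3))
```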

Correlation vs. Regression

  • Correlation measures the strength and direction of the linear relationship between two variables
    • Pearson's correlation coefficient (r) ranges from -1 to 1, with -1 indicating a perfect negative correlation, 1 indicating a perfect positive correlation, and 0 indicating no correlation
    • Does not imply causation and does not provide a predictive model
  • Regression analyzes the relationship between a dependent variable and one or more independent variables
    • Provides a predictive model that can be used to estimate the value of the dependent variable based on the values of the independent variables
    • Can be used to infer causality, but only if certain assumptions are met (e.g., no confounding variables, no reverse causality)
  • Correlation is necessary but not sufficient for a useful regression model
    • In simple linear regression, a nonzero correlation is required for a nonzero slope, and $R^2$ equals the squared correlation ($r^2$)
    • However, a strong correlation does not guarantee that a regression model will be accurate or useful
  • Regression quantifies the relationship and yields predictions, while correlation only measures the strength and direction of the relationship (see the sketch after this list)
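
A minimal sketch contrasting the two on hypothetical data: Pearson's $r$ summarizes strength and direction, while the fitted slope gives a predictive equation; in simple linear regression the slope equals $r \cdot s_y / s_x$ and $R^2 = r^2$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.4, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1])

r = np.corrcoef(x, y)[0, 1]           # Pearson correlation coefficient
b1, b0 = np.polyfit(x, y, deg=1)      # regression slope and intercept

s_x, s_y = x.std(ddof=1), y.std(ddof=1)
assert np.isclose(b1, r * s_y / s_x)  # slope recovered from r and the sample spreads

print(f"r = {r:.3f}, slope = {b1:.3f}, R^2 = r^2 = {r ** 2:.3f}")
```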

Assumptions and Limitations

  • Linearity assumes that the relationship between the dependent variable and the independent variable(s) is linear
    • Violations can lead to biased and inefficient estimates
    • Can be checked using residual plots and tests for linearity (e.g., RESET test)
  • Independence assumes that the observations are independent of each other
    • Violations can occur with time series data or clustered data
    • Can be checked using the Durbin-Watson test for autocorrelation
  • Homoscedasticity assumes that the variance of the residuals is constant across all levels of the independent variable(s)
    • Violations (heteroscedasticity) can lead to inefficient estimates and invalid inference
    • Can be checked using residual plots and tests for heteroscedasticity (e.g., Breusch-Pagan test)
  • Normality assumes that the residuals are normally distributed
    • Violations can affect the validity of hypothesis tests and confidence intervals
    • Can be checked using normal probability plots and tests for normality (e.g., Shapiro-Wilk test)
  • No multicollinearity assumes that the independent variables are not highly correlated with each other
    • Violations can lead to unstable and unreliable estimates
    • Can be checked using the variance inflation factor (VIF) and correlation matrix
  • Outliers and influential observations can have a significant impact on the regression results
    • Can be identified using residual plots, leverage values, and Cook's distance
    • May need to be removed or addressed using robust regression techniques (several of the diagnostic checks above are illustrated in the sketch after this list)
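
A minimal sketch of these diagnostic checks using statsmodels and SciPy on simulated data; the thresholds mentioned in the comments are common rules of thumb, not hard cutoffs:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.9, size=n)          # mildly correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()
resid = model.resid

print("Durbin-Watson (independence):", durbin_watson(resid))   # values near 2 suggest no autocorrelation
print("Breusch-Pagan p-value (homoscedasticity):", het_breuschpagan(resid, X)[1])
print("Shapiro-Wilk p-value (normality):", stats.shapiro(resid).pvalue)
print("VIFs (multicollinearity):",
      [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])  # > ~5-10 is a warning sign
print("Max Cook's distance (influence):", model.get_influence().cooks_distance[0].max())
```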

Real-World Applications

  • Finance: predicting stock prices based on economic indicators and company performance metrics
  • Marketing: analyzing the impact of advertising expenditure on sales revenue and market share
  • Healthcare: identifying risk factors for diseases and predicting patient outcomes based on clinical variables
  • Real estate: estimating property values based on location, size, and amenities
  • Economics: studying the relationship between economic growth and factors such as investment, education, and trade
  • Social sciences: investigating the determinants of income inequality, crime rates, and political preferences
  • Environmental studies: modeling the impact of climate change on biodiversity and ecosystem services
  • Sports analytics: predicting player performance based on past statistics and physical attributes

Common Pitfalls and How to Avoid Them

  • Overfitting occurs when a model is too complex and fits the noise in the data rather than the underlying pattern
    • Can be avoided by using cross-validation, regularization techniques, and model selection criteria (e.g., adjusted $R^2$, AIC, BIC); the sketch after this list contrasts a simple and an overfit model
  • Underfitting occurs when a model is too simple and fails to capture the true relationship between variables
    • Can be avoided by including relevant variables, considering non-linear relationships, and using more flexible models (e.g., polynomial regression, splines)
  • Extrapolation involves making predictions outside the range of the observed data
    • Can lead to unreliable and inaccurate predictions
    • Should be avoided or done with caution, acknowledging the increased uncertainty
  • Confounding variables are variables that are related to both the dependent and independent variables, causing a spurious relationship
    • Can be addressed by including the confounding variables in the model or using techniques such as instrumental variables and propensity score matching
  • Misinterpretation of coefficients can occur when the units or scales of the variables are not carefully considered
    • Standardizing or centering the variables can make the coefficients more interpretable
    • The interpretation should always be in the context of the specific units and scales used
  • Ignoring practical significance focuses solely on statistical significance and overlooks the practical implications of the results
    • The magnitude and direction of the coefficients should be considered in addition to their statistical significance
    • The results should be interpreted in the context of the specific application and the decision-making process
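
A minimal sketch of overfitting on simulated data: the true relationship is linear, so a degree-9 polynomial raises in-sample $R^2$ but tends to predict worse on held-out observations, while adjusted $R^2$ penalizes the extra terms:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 40)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=x.size)   # truly linear plus noise

# Random split: 30 observations to fit, 10 held out for evaluation
idx = rng.permutation(x.size)
x_tr, y_tr = x[idx[:30]], y[idx[:30]]
x_te, y_te = x[idx[30:]], y[idx[30:]]

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

for degree in (1, 9):
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    r2_train = r2(y_tr, np.polyval(coefs, x_tr))
    r2_test = r2(y_te, np.polyval(coefs, x_te))
    n, k = len(x_tr), degree                               # k = number of predictors
    adj_r2 = 1 - (1 - r2_train) * (n - 1) / (n - k - 1)
    print(f"degree {degree}: train R^2 = {r2_train:.3f}, adjusted R^2 = {adj_r2:.3f}, "
          f"held-out R^2 = {r2_test:.3f}")
```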

