📈 Preparatory Statistics Unit 14 – Simple Linear Regression & Correlation

Simple linear regression models the relationship between two continuous variables, helping us understand how changes in one variable are associated with changes in another. This statistical tool allows us to make predictions and draw inferences about the relationship between variables. Key concepts include the independent and dependent variables, slope, y-intercept, and residuals. The method uses least squares to find the best-fitting line, produces summary statistics such as R-squared, and rests on assumptions such as linearity and independence of observations.

What's Simple Linear Regression?

  • Statistical method used to model the linear relationship between two continuous variables
  • Consists of one independent variable (predictor) and one dependent variable (response)
  • Aims to find the best-fitting straight line that minimizes the sum of squared residuals
  • Equation of the line: $y = \beta_0 + \beta_1 x + \epsilon$
    • $y$: dependent variable
    • $x$: independent variable
    • $\beta_0$: y-intercept
    • $\beta_1$: slope
    • $\epsilon$: random error term
  • Can be used for prediction and inference
  • Helps understand how changes in the independent variable affect the dependent variable
  • Requires a linear relationship between the variables
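The ideas above can be sketched in a few lines of Python. This is a minimal example with made-up data (hours studied vs. exam score are hypothetical, chosen only for illustration); `np.polyfit` with degree 1 performs the least-squares fit.

```python
import numpy as np

# Hypothetical data: hours studied (x) vs. exam score (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 68.0, 77.0])

# np.polyfit with degree 1 returns the least-squares slope and intercept
slope, intercept = np.polyfit(x, y, 1)

# Fitted values and residuals (observed minus predicted)
y_hat = intercept + slope * x
residuals = y - y_hat

print(slope, intercept)
```

For a least-squares fit, the residuals always sum to (essentially) zero, which is a quick sanity check on the computation.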

Key Concepts and Terms

  • Independent variable (predictor): The variable used to predict or explain the dependent variable
  • Dependent variable (response): The variable being predicted or explained by the independent variable
  • Slope ($\beta_1$): Change in the dependent variable for a one-unit change in the independent variable
  • Y-intercept ($\beta_0$): Value of the dependent variable when the independent variable is zero
  • Residuals: Differences between the observed values and the predicted values from the regression line
  • Coefficient of determination ($R^2$): Proportion of the variance in the dependent variable explained by the independent variable
  • Standard error of the estimate: Measure of the accuracy of predictions made with the regression line
  • Hypothesis testing: Assessing the significance of the relationship between the variables

The Math Behind It

  • Least squares method: Minimizes the sum of squared residuals to find the best-fitting line
  • Slope formula: $\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
  • Y-intercept formula: $\beta_0 = \bar{y} - \beta_1\bar{x}$
  • Coefficient of determination: $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
    • $SSR$: Sum of squares regression
    • $SST$: Total sum of squares
    • $SSE$: Sum of squares error
  • Standard error of the estimate: $s_e = \sqrt{\frac{SSE}{n-2}}$
  • Hypothesis tests for the slope and y-intercept use t-tests
  • Confidence intervals for the slope and y-intercept can be constructed using the standard errors and t-distribution
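The formulas above translate directly into code. The sketch below implements them with NumPy; the function name and data are hypothetical, not from any particular library.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares fit of y = b0 + b1*x using the textbook formulas."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()

    # Slope: sum of cross-deviations over sum of squared x-deviations
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar                  # y-intercept

    y_hat = b0 + b1 * x                      # fitted values
    sse = np.sum((y - y_hat) ** 2)           # sum of squares error
    sst = np.sum((y - y_bar) ** 2)           # total sum of squares
    r_squared = 1 - sse / sst                # coefficient of determination
    s_e = np.sqrt(sse / (n - 2))             # standard error of the estimate
    return b0, b1, r_squared, s_e

b0, b1, r2, se = simple_linear_regression([1, 2, 3, 4, 5],
                                          [52, 58, 65, 68, 77])
print(b0, b1, r2, se)
```

Note the $n - 2$ in the standard error: two parameters (slope and intercept) are estimated from the data, so two degrees of freedom are lost.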

Correlation vs. Regression

  • Correlation measures the strength and direction of the linear relationship between two variables
  • Regression models the relationship between a dependent variable and one or more independent variables
  • Correlation coefficient (r) ranges from -1 to +1, indicating the strength and direction of the linear relationship
  • Regression focuses on predicting the dependent variable based on the independent variable(s)
  • Correlation does not imply causation, while regression can suggest causal relationships with proper experimental design
  • Correlation is symmetric (interchangeable), while regression is asymmetric (dependent and independent variables)
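These contrasts are easy to verify numerically. In the sketch below (data hypothetical), the correlation is the same whichever way the variables are ordered, while the regression slope changes when the roles of $x$ and $y$ are swapped; for simple linear regression, $r^2$ equals $R^2$.

```python
import numpy as np

# Hypothetical data: the same pair of variables examined both ways around
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 68.0, 77.0])

# Correlation is symmetric: r(x, y) equals r(y, x)
r = np.corrcoef(x, y)[0, 1]

# Regression is asymmetric: regressing y on x gives a different slope
# than regressing x on y
slope_yx = np.polyfit(x, y, 1)[0]   # y on x
slope_xy = np.polyfit(y, x, 1)[0]   # x on y

print(r, slope_yx, slope_xy)
```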

How to Calculate and Interpret

  • Calculate the slope and y-intercept using the formulas or statistical software
  • Interpret the slope as the change in the dependent variable for a one-unit change in the independent variable
  • Interpret the y-intercept as the value of the dependent variable when the independent variable is zero
  • Calculate the coefficient of determination ($R^2$) to assess the proportion of variance explained by the model
  • Interpret $R^2$ as the percentage of variation in the dependent variable explained by the independent variable
  • Conduct hypothesis tests for the slope and y-intercept to determine their statistical significance
  • Construct confidence intervals for the slope and y-intercept to estimate the range of plausible values
  • Use the regression equation to make predictions for new values of the independent variable
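In practice these steps are usually handed to statistical software. One way to sketch them, assuming SciPy is available, uses `scipy.stats.linregress`, which reports the slope, intercept, correlation, the two-sided p-value for $H_0\!: \beta_1 = 0$, and the standard error of the slope (the data here are hypothetical).

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours studied vs. exam score
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 68.0, 77.0])

# Fit and test: linregress gives slope, intercept, rvalue, pvalue, stderr
res = stats.linregress(x, y)
print(f"slope = {res.slope:.3f}, p-value = {res.pvalue:.4f}")

# 95% confidence interval for the slope: b1 +/- t_crit * SE(b1), df = n - 2
t_crit = stats.t.ppf(0.975, df=len(x) - 2)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)

# Prediction for a new value of the independent variable
x_new = 3.5
y_pred = res.intercept + res.slope * x_new
```

A small p-value for the slope means the linear relationship is statistically significant; the confidence interval gives the range of plausible slope values.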

Assumptions and Limitations

  • Linearity: The relationship between the variables must be linear
  • Independence: Observations must be independent of each other
  • Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variable
  • Normality: The residuals should be normally distributed
  • No multicollinearity: Independent variables should not be highly correlated with each other (applies to multiple regression, not to a single predictor)
  • Outliers can heavily influence the regression line and should be carefully examined
  • Extrapolation beyond the range of the observed data can lead to unreliable predictions
  • Correlation does not imply causation; other factors may influence the relationship
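A few of these checks can be done directly on the residuals. The sketch below (fit and data hypothetical, matching $\hat{y} = 46 + 6x$) shows a basic residual inspection and a crude outlier flag; in practice you would also plot the residuals against $x$.

```python
import numpy as np

# Hypothetical data and its least-squares fit y_hat = 46 + 6x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 68.0, 77.0])
residuals = y - (46.0 + 6.0 * x)

# Linearity and homoscedasticity: residuals should scatter randomly
# around zero with roughly constant spread across x
print(residuals)

# Crude outlier flag: standardized residuals beyond +/- 2
std_res = residuals / residuals.std(ddof=2)  # ddof=2: two parameters estimated
outliers = np.abs(std_res) > 2
print(outliers)
```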

Real-World Applications

  • Predicting sales based on advertising expenditure (marketing)
  • Estimating the relationship between years of education and income (economics)
  • Modeling the effect of temperature on crop yields (agriculture)
  • Analyzing the relationship between a drug dosage and patient response (pharmacology)
  • Predicting housing prices based on square footage (real estate)
  • Examining the relationship between study time and exam scores (education)
  • Forecasting demand for a product based on its price (business)

Common Mistakes to Avoid

  • Ignoring the assumptions of linear regression
  • Confusing correlation with causation
  • Extrapolating beyond the range of the observed data
  • Failing to consider other relevant variables that may influence the relationship
  • Overinterpreting the results without considering the context and limitations
  • Relying solely on $R^2$ to assess the model's goodness of fit
  • Not checking for outliers or influential observations
  • Using linear regression when the relationship between the variables is clearly nonlinear


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
