📈 Preparatory Statistics Unit 14 – Simple Linear Regression & Correlation
Simple linear regression models the relationship between two continuous variables, helping us understand how changes in one variable affect another. This powerful statistical tool allows us to make predictions and draw inferences about the relationship between variables.
Key concepts include the independent and dependent variables, slope, y-intercept, and residuals. The method uses least squares to find the best-fitting line, calculates important statistics like R-squared, and requires assumptions such as linearity and independence of observations.
What's Simple Linear Regression?
Statistical method used to model the linear relationship between two continuous variables
Consists of one independent variable (predictor) and one dependent variable (response)
Aims to find the best-fitting straight line that minimizes the sum of squared residuals
Equation of the line: $y = \beta_0 + \beta_1 x + \epsilon$
$y$: dependent variable
$x$: independent variable
$\beta_0$: y-intercept
$\beta_1$: slope
$\epsilon$: random error term
Can be used for prediction and inference
Helps understand how changes in the independent variable affect the dependent variable
Requires a linear relationship between the variables
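The least-squares fit described above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine, and the data values are made up for demonstration:

```python
# Minimal sketch: fit y = b0 + b1*x by least squares (hypothetical data).

def fit_line(x, y):
    """Return (b0, b1): intercept and slope of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(b0, b1)  # for this data the fitted line is y = 2.2 + 0.6x
```

In practice you would use a library routine (for example, `statistics.linear_regression` in the Python standard library), but the hand computation above mirrors the formulas used throughout this unit.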
Key Concepts and Terms
Independent variable (predictor): The variable used to predict or explain the dependent variable
Dependent variable (response): The variable being predicted or explained by the independent variable
Slope ($\beta_1$): Change in the dependent variable for a one-unit change in the independent variable
Y-intercept ($\beta_0$): Value of the dependent variable when the independent variable is zero
Residuals: Differences between the observed values and the predicted values from the regression line
Coefficient of determination ($R^2$): Proportion of the variance in the dependent variable explained by the independent variable
Standard error of the estimate: Measure of the accuracy of predictions made with the regression line
Hypothesis testing: Assessing the significance of the relationship between the variables
The Math Behind It
Least squares method: Minimizes the sum of squared residuals to find the best-fitting line
Slope formula: $\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Y-intercept formula: $\beta_0 = \bar{y} - \beta_1\bar{x}$
Coefficient of determination: $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
$SSR$: Sum of squares regression
$SST$: Total sum of squares
$SSE$: Sum of squares error
Standard error of the estimate: $s_e = \sqrt{\frac{SSE}{n-2}}$
Hypothesis tests for the slope and y-intercept use t-tests
Confidence intervals for the slope and y-intercept can be constructed using the standard errors and t-distribution
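The sums of squares, $R^2$, and standard error of the estimate defined above can be computed directly. A small sketch with made-up data (the same kind of toy values used for any worked example):

```python
# Sketch: SST, SSE, R^2, and the standard error of the estimate
# for a least-squares fit (hypothetical data for illustration).
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]                      # predicted values
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # sum of squares error
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
r_squared = 1 - sse / sst                               # R^2 = 1 - SSE/SST
s_e = math.sqrt(sse / (n - 2))                          # standard error of the estimate
print(r_squared, s_e)
```

Note the $n-2$ divisor in $s_e$: two degrees of freedom are used up estimating $\beta_0$ and $\beta_1$.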
Correlation vs. Regression
Correlation measures the strength and direction of the linear relationship between two variables
Regression models the relationship between a dependent variable and one or more independent variables
Correlation coefficient (r) ranges from -1 to +1, indicating the strength and direction of the linear relationship
Regression focuses on predicting the dependent variable based on the independent variable(s)
Correlation does not imply causation, while regression can suggest causal relationships with proper experimental design
Correlation is symmetric (interchangeable), while regression is asymmetric (dependent and independent variables)
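One concrete link between the two: in simple linear regression, the square of the Pearson correlation coefficient equals the coefficient of determination. A sketch with illustrative data:

```python
# Sketch: Pearson r, and the identity r^2 = R^2 in simple linear
# regression (hypothetical data).
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
print(r, r ** 2)                 # r ~ 0.775, r^2 = 0.6 for this data
```

Swapping `x` and `y` leaves `r` unchanged (correlation is symmetric), but would change the fitted regression line.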
How to Calculate and Interpret
Calculate the slope and y-intercept using the formulas or statistical software
Interpret the slope as the change in the dependent variable for a one-unit change in the independent variable
Interpret the y-intercept as the value of the dependent variable when the independent variable is zero
Calculate the coefficient of determination ($R^2$) to assess the proportion of variance explained by the model
Interpret $R^2$ as the percentage of variation in the dependent variable explained by the independent variable
Conduct hypothesis tests for the slope and y-intercept to determine their statistical significance
Construct confidence intervals for the slope and y-intercept to estimate the range of plausible values
Use the regression equation to make predictions for new values of the independent variable
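The last two steps, testing the slope and making a prediction, can be sketched as follows. The data are hypothetical, and the t-statistic would be compared against a critical value from the t-distribution with $n-2$ degrees of freedom (from a table or software):

```python
# Sketch: t-statistic for H0: beta1 = 0, and a point prediction
# at a new x value (hypothetical data).
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))   # standard error of the estimate
se_b1 = s_e / math.sqrt(sxx)     # standard error of the slope
t_stat = b1 / se_b1              # compare to t critical value, df = n - 2
y_new = b0 + b1 * 6              # prediction at x = 6
print(t_stat, y_new)
```

A confidence interval for the slope follows the same pieces: $b_1 \pm t^{*} \cdot se(b_1)$.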
Assumptions and Limitations
Linearity: The relationship between the variables must be linear
Independence: Observations must be independent of each other
Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variable
Normality: The residuals should be normally distributed
No multicollinearity: Independent variables should not be highly correlated with each other (multiple regression)
Outliers can heavily influence the regression line and should be carefully examined
Extrapolation beyond the range of the observed data can lead to unreliable predictions
Correlation does not imply causation; other factors may influence the relationship
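Checking these assumptions starts with the residuals. A small sketch (made-up data): least-squares residuals always sum to essentially zero, and plotting them against the independent variable helps reveal nonlinearity, non-constant variance, and outliers:

```python
# Sketch: computing residuals for basic diagnostics (hypothetical data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residuals: observed minus predicted values
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(residuals)  # plot these against x to eyeball linearity and constant variance
```

A residual that is far larger in magnitude than the rest flags a potential outlier or influential observation worth examining.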
Real-World Applications
Predicting sales based on advertising expenditure (marketing)
Estimating the relationship between years of education and income (economics)
Modeling the effect of temperature on crop yields (agriculture)
Analyzing the relationship between a drug dosage and patient response (pharmacology)
Predicting housing prices based on square footage (real estate)
Examining the relationship between study time and exam scores (education)
Forecasting demand for a product based on its price (business)
Common Mistakes to Avoid
Ignoring the assumptions of linear regression
Confusing correlation with causation
Extrapolating beyond the range of the observed data
Failing to consider other relevant variables that may influence the relationship
Overinterpreting the results without considering the context and limitations
Relying solely on $R^2$ to assess the model's goodness of fit
Not checking for outliers or influential observations
Using linear regression when the relationship between the variables is clearly nonlinear