📊 Probability and Statistics Unit 10 – Correlation and Linear Regression

Correlation and linear regression are powerful statistical tools for understanding relationships between variables. These methods help us quantify and visualize how changes in one variable relate to changes in another, allowing us to make predictions and draw insights from data.

Linear regression takes correlation a step further by modeling the relationship between variables mathematically. This technique enables us to predict outcomes, assess the strength of relationships, and understand how multiple factors can influence a single variable. These tools are essential for data analysis across various fields.

Key Concepts

  • Correlation measures the strength and direction of the linear relationship between two variables
  • Positive correlation indicates that as one variable increases, the other variable also tends to increase
  • Negative correlation indicates that as one variable increases, the other variable tends to decrease
  • No correlation means there is no linear relationship between the variables
  • Correlation does not imply causation, meaning that a strong correlation between two variables does not necessarily mean that one variable causes the other
  • Linear regression is a statistical method used to model the linear relationship between a dependent variable and one or more independent variables
  • The goal of linear regression is to find the best-fitting line that minimizes the sum of the squared differences between the observed values and the predicted values

Types of Correlation

  • Positive correlation occurs when an increase in one variable is associated with an increase in the other variable (height and weight)
  • Negative correlation occurs when an increase in one variable is associated with a decrease in the other variable (age and physical fitness)
  • Zero correlation occurs when there is no linear relationship between the two variables (shoe size and intelligence)
  • Perfect correlation has a correlation coefficient of exactly 1 or -1, indicating that the variables have an exact linear relationship
    • Perfect positive correlation (1) means that the data points fall exactly on a straight line with a positive slope
    • Perfect negative correlation (-1) means that the data points fall exactly on a straight line with a negative slope
  • Weak correlation has a correlation coefficient close to 0, indicating a weak linear relationship between the variables
  • Spurious correlation occurs when two variables appear to be related but are actually influenced by a third variable (ice cream sales and shark attacks, both influenced by temperature)

Correlation Coefficients

  • Correlation coefficients quantify the strength and direction of the linear relationship between two variables
  • Pearson's correlation coefficient (r) is the most commonly used correlation coefficient for continuous variables
    • Ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation
    • Calculated using the formula: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
  • Spearman's rank correlation coefficient (ρ) is used for ordinal variables or when the relationship between variables is monotonic but not linear
  • Kendall's tau (τ) is another non-parametric correlation coefficient used for ordinal variables
  • The coefficient of determination (R²) represents the proportion of variance in the dependent variable that can be explained by the independent variable(s)
    • Ranges from 0 to 1, where 0 means the model explains none of the variance in the dependent variable and 1 means it explains all of it
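Pearson's formula can be computed directly from its definition in a few lines. The sketch below uses only the Python standard library, with made-up data values chosen for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of the products of deviations from the means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the square roots of the summed squared deviations
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return num / den

# Perfectly positive and perfectly negative toy examples
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # ≈ 1.0
print(pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # ≈ -1.0
```

In practice you would use a library routine (e.g. `scipy.stats.pearsonr`), but writing the formula out makes the deviations-from-the-mean structure visible.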

Scatter Plots and Visualization

  • Scatter plots are used to visualize the relationship between two continuous variables
    • Each data point is represented by a dot on the graph, with the x-coordinate representing the value of one variable and the y-coordinate representing the value of the other variable
  • The shape of the scatter plot can indicate the type and strength of the correlation between the variables
    • A clear upward trend suggests a positive correlation
    • A clear downward trend suggests a negative correlation
    • A random pattern with no clear trend suggests no correlation
  • Outliers can be identified in a scatter plot as data points that are far from the main cluster of points
  • Adding a line of best fit (regression line) to the scatter plot can help visualize the linear relationship between the variables
  • Residual plots can be used to assess the assumptions of linear regression by plotting the residuals (differences between observed and predicted values) against the independent variable or predicted values
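Residuals can be examined numerically before they are plotted. A minimal sketch, using a made-up dataset whose least-squares coefficients ($b_0 = 0.5$, $b_1 = 1.4$) were computed in advance for these exact values:

```python
# Toy dataset and its least-squares line (b0 = 0.5, b1 = 1.4 for this data)
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
b0, b1 = 0.5, 1.4

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# A least-squares fit always yields residuals that sum to (essentially) zero;
# plotting residuals against x should show no pattern if the linear model fits.
print(residuals)
```

These residuals would be the y-values of a residual plot; a visible curve or funnel shape in that plot would signal a violated assumption.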

Simple Linear Regression

  • Simple linear regression models the linear relationship between one independent variable (x) and one dependent variable (y)
  • The simple linear regression equation is: $y = \beta_0 + \beta_1 x + \epsilon$
    • $\beta_0$ is the y-intercept, representing the value of y when x is 0
    • $\beta_1$ is the slope, representing the change in y for a one-unit increase in x
    • $\epsilon$ is the error term, representing the unexplained variation in y
  • The least squares method is used to estimate the parameters (β0\beta_0 and β1\beta_1) by minimizing the sum of the squared residuals
  • The fitted regression line is the line that best describes the linear relationship between the variables based on the estimated parameters
  • Hypothesis tests (t-tests) and confidence intervals can be used to assess the significance of the slope and y-intercept estimates
  • Predictions can be made using the fitted regression equation by plugging in values for the independent variable
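The least squares estimates have closed-form solutions: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the fact that the fitted line passes through $(\bar{x}, \bar{y})$. A sketch with made-up data:

```python
def fit_simple_regression(x, y):
    """Least-squares estimates of intercept (b0) and slope (b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
         sum((xi - mean_x) ** 2 for xi in x)
    # Intercept: the least-squares line passes through (mean_x, mean_y)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Data lying exactly on y = 1 + 2x, so the fit recovers those parameters
b0, b1 = fit_simple_regression([0, 1, 2, 3], [1, 3, 5, 7])
prediction = b0 + b1 * 10   # plug a new x-value into the fitted equation
print(b0, b1, prediction)
```

Because the toy data are noise-free, the fit recovers the generating intercept and slope exactly; with real data the residuals would be nonzero and the estimates approximate.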

Multiple Linear Regression

  • Multiple linear regression extends simple linear regression to model the linear relationship between one dependent variable and two or more independent variables
  • The multiple linear regression equation is: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$
    • $\beta_0$ is the y-intercept
    • $\beta_1, \beta_2, \dots, \beta_p$ are the slopes for each independent variable
    • $x_1, x_2, \dots, x_p$ are the independent variables
    • $\epsilon$ is the error term
  • The least squares method is used to estimate the parameters by minimizing the sum of the squared residuals
  • Adjusted R² is used to assess the goodness of fit for multiple linear regression models, taking into account the number of independent variables
  • Multicollinearity occurs when independent variables are highly correlated with each other, which can lead to unstable parameter estimates
    • Variance Inflation Factor (VIF) can be used to detect multicollinearity
  • Stepwise regression methods (forward selection, backward elimination, and bidirectional elimination) can be used to select the most important independent variables for the model
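With more than one predictor, the least-squares problem is usually solved in matrix form: stack a column of ones (for the intercept) and one column per predictor into a design matrix, then minimize the sum of squared residuals. A minimal sketch, assuming NumPy is available and using synthetic noise-free data:

```python
import numpy as np

# Synthetic data generated from y = 1 + 2*x1 + 3*x2 with no noise,
# so least squares should recover beta = [1, 2, 3]
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1 + 2 * x1 + 3 * x2

# Design matrix: intercept column of ones plus one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution minimizing the sum of squared residuals
beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1. 2. 3.]
```

If `x1` and `x2` were highly correlated (multicollinearity), `X` would be close to rank-deficient and the recovered coefficients would become unstable, which is what the VIF diagnostic is designed to flag.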

Assumptions and Limitations

  • Linear regression assumes a linear relationship between the dependent and independent variables
  • Residuals should be normally distributed with a mean of 0 and constant variance (homoscedasticity)
    • Violations of these assumptions can lead to biased parameter estimates and invalid hypothesis tests
  • Independence of observations is assumed, meaning that the residuals should not be correlated with each other
    • Autocorrelation can be detected using the Durbin-Watson test
  • Outliers and influential points can have a significant impact on the regression results and should be carefully examined
  • Extrapolation beyond the range of the observed data can lead to unreliable predictions
  • Correlation does not imply causation, and causal relationships should be established through careful experimental design or other methods
  • Omitted variable bias can occur when important variables are not included in the regression model, leading to biased parameter estimates
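The Durbin-Watson statistic mentioned above is simple to compute from the residuals: $DW = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$. Values near 2 suggest no autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation. A sketch with made-up residual sequences:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: ~2 means no autocorrelation, values near 0
    suggest positive autocorrelation, values near 4 negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Residuals that drift on one side of zero: positively autocorrelated, DW near 0
print(durbin_watson([1.0, 1.2, 0.9, 1.1]))
# Residuals that alternate sign: negatively autocorrelated, DW toward 4
print(durbin_watson([1.0, -1.0] * 5))
```

Deterministic toy sequences are used here so the behavior is reproducible; with well-behaved random residuals the statistic would hover around 2.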

Real-World Applications

  • Economics: Modeling the relationship between consumer spending and income, or between supply and demand
  • Finance: Analyzing the relationship between stock prices and financial indicators (price-to-earnings ratio)
  • Social sciences: Investigating the relationship between education level and income, or between age and political ideology
  • Medicine: Examining the relationship between drug dosage and patient response, or between risk factors and disease incidence
  • Environmental studies: Modeling the relationship between air pollution levels and respiratory illness, or between temperature and crop yields
  • Marketing: Analyzing the relationship between advertising expenditure and sales, or between customer satisfaction and loyalty
  • Sports: Investigating the relationship between player statistics and team performance, or between training hours and individual performance


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.