11.3 Regression models: linear, logistic, and survival analysis
5 min read • August 14, 2024
Regression models are powerful tools for analyzing relationships between variables in epidemiological studies. Linear regression helps us understand continuous outcomes, while logistic regression tackles binary outcomes. Survival analysis digs into time-to-event data, crucial for studying disease progression and treatment effects.
These models allow researchers to control for confounding factors and estimate the impact of specific variables on health outcomes. By interpreting regression coefficients, odds ratios, and hazard ratios, we can quantify the strength of associations and make evidence-based decisions in public health.
Linear Regression for Continuous Outcomes
Linear Regression Modeling
Linear regression is a statistical method used to model the linear relationship between a continuous dependent variable (outcome) and one or more independent variables (predictors)
The general form of a simple linear regression model is Y = β₀ + β₁X + ε, where Y is the dependent variable, X is the independent variable, β₀ is the intercept, β₁ is the slope (regression coefficient), and ε is the error term
Multiple linear regression extends the simple linear regression model to include two or more independent variables: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε
Examples of continuous outcomes modeled using linear regression include body mass index (BMI), blood pressure, and income
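The simple model above can be fitted directly from data with the closed-form least-squares solution. The sketch below uses hypothetical data (age predicting systolic blood pressure); the variable names and values are illustrative, not from the text.

```python
# Minimal sketch of simple linear regression via closed-form least
# squares; the data are hypothetical (age vs. systolic blood pressure).
def fit_simple_linear(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar  # intercept passes through the means
    return b0, b1

ages = [30, 40, 50, 60, 70]
sbp  = [118, 124, 131, 139, 144]  # systolic blood pressure (mmHg)
b0, b1 = fit_simple_linear(ages, sbp)
print(b0, b1)  # slope ≈ 0.67 mmHg per additional year of age
```

Each coefficient then reads off directly: β₁ is the expected change in blood pressure per one-year increase in age.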
Assumptions and Estimation
Assumptions of linear regression include linearity, independence, homoscedasticity, and normality of residuals
Linearity assumes a linear relationship between the dependent and independent variables
Independence assumes that observations are independent of each other
Homoscedasticity assumes constant variance of the residuals across all levels of the independent variables
Normality assumes that the residuals follow a normal distribution
The method of least squares is used to estimate the regression coefficients by minimizing the sum of squared residuals
The coefficient of determination (R²) measures the proportion of variance in the dependent variable explained by the independent variable(s)
R² ranges from 0 to 1, with higher values indicating a better fit of the model to the data
Logistic Regression for Binary Outcomes
Logistic Regression Modeling
Logistic regression is a statistical method used to model the relationship between a binary dependent variable (outcome) and one or more independent variables (predictors)
The logistic regression model estimates the probability of the outcome occurring given the values of the independent variables: P(Y=1|X) = 1 / (1 + e^−(β₀ + β₁X₁ + ... + βₚXₚ))
Examples of binary outcomes modeled using logistic regression include disease status (present or absent), mortality (alive or dead), and customer churn (churned or retained)
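The probability formula above is just a sigmoid applied to the linear predictor. A minimal sketch, with hypothetical coefficients (e.g., log-odds of disease as a function of one exposure score):

```python
import math

# Sketch of the logistic model's predicted probability; the intercept,
# coefficient, and predictor value below are all hypothetical.
def predicted_probability(intercept, coefs, xs):
    log_odds = intercept + sum(b * x for b, x in zip(coefs, xs))
    return 1 / (1 + math.exp(-log_odds))  # inverse-logit (sigmoid)

p = predicted_probability(-2.0, [0.8], [1.5])  # β0 = -2.0, β1 = 0.8, X = 1.5
print(round(p, 3))  # ≈ 0.310
```

Because the sigmoid is bounded by 0 and 1, the output is always a valid probability, which is why the linear predictor models the log odds rather than the probability itself.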
Odds Ratios and Model Fit
The odds ratio (OR) is a measure of association between an exposure and an outcome, representing the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure
The logistic regression coefficients (β) are interpreted as the change in the log odds of the outcome for a one-unit increase in the predictor variable, holding other variables constant
The exponentiated coefficients (e^β) represent the odds ratios for each predictor variable
Model fit can be assessed using the likelihood ratio test, Wald test, and deviance test, as well as measures such as the Hosmer-Lemeshow test and pseudo-R² values (e.g., McFadden's R², Cox and Snell R²)
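Converting a fitted coefficient into an odds ratio is a one-line exponentiation. A sketch, using a hypothetical coefficient for smoking status:

```python
import math

# Sketch of converting a logistic regression coefficient to an odds
# ratio; β = 0.47 is a hypothetical coefficient for smoking status.
beta_smoking = 0.47
odds_ratio = math.exp(beta_smoking)
print(round(odds_ratio, 2))  # ≈ 1.60: odds of the outcome are ~60% higher
```

A coefficient of 0 corresponds to an OR of 1 (no association), which is why significance tests for logistic coefficients test β = 0.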
Survival Analysis for Time-to-Event Data
Kaplan-Meier Method and Log-Rank Test
Survival analysis is a statistical method used to analyze time-to-event data, where the outcome variable is the time until an event of interest occurs (e.g., death, disease recurrence, or mechanical failure)
The Kaplan-Meier method is a non-parametric approach to estimate the survival function, S(t), which represents the probability of surviving beyond time t
The log-rank test is used to compare the survival curves of two or more groups to determine if there is a statistically significant difference in survival between the groups
Examples of time-to-event data analyzed using survival analysis include time to death in cancer patients, time to disease recurrence after treatment, and time to mechanical failure in engineering systems
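The Kaplan-Meier estimator multiplies, at each event time, the conditional probability of surviving that time given survival so far. A minimal sketch on hypothetical follow-up data (months to event or censoring):

```python
# Sketch of the Kaplan-Meier estimator; `times` are hypothetical
# months of follow-up and `events` flags event (1) vs. censoring (0).
def kaplan_meier(times, events):
    n_at_risk = len(times)
    s = 1.0
    curve = []
    for t in sorted(set(times)):
        # Events and total removals (events + censored) at this time
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        removed = sum(1 for tt in times if tt == t)
        if deaths:
            s *= (n_at_risk - deaths) / n_at_risk  # conditional survival
            curve.append((t, round(s, 3)))
        n_at_risk -= removed  # censored subjects leave the risk set
    return curve

times  = [2, 3, 3, 5, 8, 8, 12]
events = [1, 1, 0, 1, 1, 0, 0]
curve = kaplan_meier(times, events)
print(curve)  # [(2, 0.857), (3, 0.714), (5, 0.536), (8, 0.357)]
```

Note that censored subjects still contribute to the risk set up to their censoring time, which is how the estimator uses incomplete follow-up without biasing the curve.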
Cox Proportional Hazards Model and Censoring
The Cox proportional hazards model is a semi-parametric regression model used to investigate the relationship between survival time and one or more predictor variables
The hazard ratio (HR) is a measure of the effect of a predictor variable on the hazard (instantaneous risk) of the event occurring, assuming that the proportional hazards assumption holds
Censoring occurs when the exact survival time is unknown; right-censoring arises when an individual has not experienced the event by the end of the study or is lost to follow-up, while left-censoring arises when the event occurred before observation began
Examples of predictor variables in a Cox proportional hazards model include age, gender, and treatment group
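As with odds ratios, a Cox coefficient is exponentiated to obtain the hazard ratio. A sketch with a hypothetical treatment-group coefficient:

```python
import math

# Sketch of interpreting a Cox model coefficient as a hazard ratio;
# β = -0.69 is a hypothetical coefficient for treatment vs. control.
beta_treatment = -0.69
hr = math.exp(beta_treatment)
print(round(hr, 2))  # ≈ 0.50: treatment roughly halves the instantaneous risk
```

An HR below 1 indicates a protective effect, above 1 an increased hazard, and the interpretation is only valid if the proportional hazards assumption holds.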
Regression Model Interpretation and Diagnostics
Coefficient Interpretation and Significance
Regression coefficients represent the change in the dependent variable associated with a one-unit change in the independent variable, holding other variables constant
In linear regression, the coefficients directly represent the change in the dependent variable, while in logistic regression, the coefficients represent the change in the log odds of the outcome
Statistical significance of regression coefficients can be assessed using t-tests (linear regression) or Wald tests (logistic regression), with p-values indicating the probability of observing the estimated coefficient if the null hypothesis (β=0) is true
Confidence intervals for regression coefficients provide a range of plausible values for the true population parameter
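A Wald test and its companion confidence interval follow directly from the coefficient estimate and its standard error. A sketch with hypothetical values:

```python
import math

# Sketch of a Wald test and 95% CI for a regression coefficient;
# the estimate (0.47) and standard error (0.12) are hypothetical.
def wald_test(beta, se):
    z = beta / se  # Wald z-statistic
    # Two-sided p-value from the standard normal CDF, via math.erf
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    ci = (beta - 1.96 * se, beta + 1.96 * se)  # 95% confidence interval
    return z, p, ci

z, p, ci = wald_test(0.47, 0.12)
print(round(z, 2), p, [round(c, 3) for c in ci])
```

Here the interval excludes 0 and the p-value is well below 0.05, so the null hypothesis β = 0 would be rejected; in logistic regression the same interval, exponentiated, gives the CI for the odds ratio.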
Model Fit and Diagnostic Tests
Model fit can be assessed using the coefficient of determination (R²) for linear regression and likelihood ratio tests, Wald tests, and deviance tests for logistic regression
Diagnostic tests for regression models include checking for multicollinearity (variance inflation factor), influential observations (Cook's distance, leverage), and residual plots to assess model assumptions (linearity, homoscedasticity, normality)
Cross-validation techniques, such as k-fold cross-validation or leave-one-out cross-validation, can be used to assess the model's predictive performance on unseen data and to detect overfitting
Examples of diagnostic tests include examining the variance inflation factor (VIF) to detect multicollinearity, using Cook's distance to identify influential observations, and creating residual plots to assess the linearity assumption in linear regression
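K-fold cross-validation, mentioned above, can be sketched in a few lines. The data and fold scheme below are hypothetical, and the folds are taken by simple striding for brevity (in practice the data would be shuffled first):

```python
# Sketch of k-fold cross-validation of a simple linear model on
# hypothetical data; folds are strided rather than shuffled for brevity.
def fit_simple_linear(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
         sum((a - x_bar) ** 2 for a in x)
    return y_bar - b1 * x_bar, b1

def kfold_mse(x, y, k=3):
    folds = [list(range(i, len(x), k)) for i in range(k)]
    mses = []
    for test_idx in folds:
        train = [i for i in range(len(x)) if i not in test_idx]
        # Fit on the training folds, evaluate on the held-out fold
        b0, b1 = fit_simple_linear([x[i] for i in train],
                                   [y[i] for i in train])
        mse = sum((y[i] - (b0 + b1 * x[i])) ** 2
                  for i in test_idx) / len(test_idx)
        mses.append(mse)
    return sum(mses) / k

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 14.2, 15.9, 18.1]
cv_mse = kfold_mse(x, y)
print(round(cv_mse, 3))
```

A held-out MSE much larger than the in-sample MSE would suggest overfitting; here the relationship is nearly linear, so the cross-validated error stays small.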