📊 Causal Inference Unit 1 – Probability and Statistics Fundamentals

Probability and statistics form the foundation of causal inference, providing tools to analyze data and draw meaningful conclusions. This unit covers key concepts like probability distributions, descriptive statistics, hypothesis testing, and regression analysis, essential for understanding causal relationships. These fundamentals are crucial for interpreting research findings and making informed decisions in various fields. By mastering these concepts, students can critically evaluate statistical evidence and apply appropriate methods to investigate causal effects in real-world scenarios.

Key Concepts and Definitions

  • Probability: the likelihood of an event occurring, expressed as a number between 0 and 1
  • Statistics: the collection, analysis, interpretation, and presentation of data
    • Descriptive statistics summarize and describe the main features of a data set
    • Inferential statistics use sample data to make inferences about a larger population
  • Random variable: a variable whose value is determined by the outcome of a random event
    • Discrete random variables have a countable number of possible values (number of heads in 10 coin flips)
    • Continuous random variables can take on any value within a specified range (height of students in a class)
  • Distribution: a function that describes the likelihood of different outcomes for a random variable
  • Hypothesis testing: a statistical method for determining whether there is enough evidence to support a claim about a population parameter
  • Correlation: a measure of the strength and direction of the linear relationship between two variables
  • Regression: a statistical method for modeling the relationship between a dependent variable and one or more independent variables
  • Causal inference: the process of determining whether a causal relationship exists between two variables

Probability Basics

  • Probability is a measure of the likelihood that an event will occur, ranging from 0 (impossible) to 1 (certain)
  • The probability of an event A is denoted as P(A)
  • The sum of the probabilities of all possible outcomes in a sample space is equal to 1
  • Independent events: the occurrence of one event does not affect the probability of another event occurring (rolling a die multiple times)
  • Dependent events: the occurrence of one event affects the probability of another event occurring (drawing cards from a deck without replacement)
  • Conditional probability: the probability of an event A occurring given that event B has already occurred, denoted as P(A|B)
    • Calculated using the formula: P(A|B) = \frac{P(A \cap B)}{P(B)}
  • Bayes' Theorem: a formula for calculating conditional probabilities based on prior probabilities and new evidence (worked through in the sketch after this list)
    • P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
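As a quick numerical check of the two formulas above, here is a minimal sketch in Python. The scenario and all numbers (a 1% prevalence, 95% sensitivity, 5% false-positive rate) are made up for illustration.

```python
# Bayes' theorem with hypothetical numbers: A = "has disease", B = "tests positive".
p_a = 0.01              # prior P(A): assumed 1% prevalence
p_b_given_a = 0.95      # sensitivity P(B|A): assumed
p_b_given_not_a = 0.05  # false-positive rate P(B|not A): assumed

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")  # ~0.161: even a positive test leaves A unlikely
```

The counterintuitive result (about 16%) shows why the prior P(A) matters: when the event is rare, most positive tests come from the much larger group without the condition.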

Statistical Distributions

  • Normal distribution: a symmetric, bell-shaped curve characterized by its mean and standard deviation
    • Approximately 68% of data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations
  • Standard normal distribution: a normal distribution with a mean of 0 and a standard deviation of 1
  • Z-score: a measure of how many standard deviations an observation is from the mean of its distribution
    • Calculated using the formula: z = \frac{x - \mu}{\sigma}, where x is the observation, \mu is the mean, and \sigma is the standard deviation
  • Binomial distribution: the probability distribution of the number of successes in a fixed number of independent trials, each with the same probability of success (flipping a coin 10 times and counting the number of heads)
  • Poisson distribution: the probability distribution of the number of events occurring in a fixed interval of time or space, given a known average rate (number of customers arriving at a store per hour)
  • Central Limit Theorem: states that the sampling distribution of the sample mean of independent, identically distributed random variables approaches a normal distribution as the sample size grows, regardless of the shape of the underlying distribution (illustrated by the simulation sketch below)
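The following short simulation, assuming NumPy is available, illustrates the Central Limit Theorem: sample means drawn from a heavily skewed exponential distribution still cluster around the population mean with spread close to \sigma/\sqrt{n}.

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_samples = 50, 10_000
# Exponential distribution: strongly right-skewed, with mean = 1 and std dev = 1
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# CLT prediction: sample means are approximately Normal(mu, sigma / sqrt(n))
print(f"mean of sample means: {sample_means.mean():.3f} (theory: 1.000)")
print(f"std  of sample means: {sample_means.std():.3f} (theory: {1 / np.sqrt(n):.3f})")
```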

Descriptive Statistics

  • Measures of central tendency describe the center or typical value of a dataset
    • Mean: the arithmetic average of a set of numbers
    • Median: the middle value in a dataset when the values are arranged in order
    • Mode: the most frequently occurring value in a dataset
  • Measures of dispersion describe the spread or variability of a dataset
    • Range: the difference between the largest and smallest values in a dataset
    • Variance: the average of the squared differences from the mean
      • Calculated using the formula: \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, where x_i is each individual value, \mu is the mean, and n is the number of values (this is the population variance; the sample variance divides by n - 1 instead)
    • Standard deviation: the square root of the variance, expressing dispersion in the same units as the original data
  • Skewness: a measure of the asymmetry of a distribution
    • Positive skew: the tail of the distribution extends to the right (income distribution)
    • Negative skew: the tail of the distribution extends to the left (exam scores on an easy test, where most students score high)
  • Kurtosis: a measure of the heaviness of the tails of a distribution relative to a normal distribution (all of these measures are computed in the sketch after this list)
    • Leptokurtic: a distribution with thicker tails than a normal distribution
    • Platykurtic: a distribution with thinner tails than a normal distribution
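A sketch that computes each of the measures above on a small made-up dataset, assuming NumPy and a recent SciPy (whose scipy.stats module provides skewness and excess kurtosis directly).

```python
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 5, 7, 8, 9, 12, 15, 21])  # made-up values

print("mean:           ", np.mean(data))
print("median:         ", np.median(data))
print("mode:           ", stats.mode(data, keepdims=False).mode)
print("range:          ", np.ptp(data))               # max - min
print("pop. variance:  ", np.var(data))               # divides by n
print("sample variance:", np.var(data, ddof=1))       # divides by n - 1
print("std deviation:  ", np.std(data, ddof=1))
print("skewness:       ", stats.skew(data))           # > 0 here: right tail
print("kurtosis:       ", stats.kurtosis(data))       # excess kurtosis (normal = 0)
```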

Inferential Statistics

  • Population: the entire group of individuals, objects, or events of interest
  • Sample: a subset of the population used to make inferences about the population
  • Parameter: a numerical characteristic of a population (population mean, population standard deviation)
  • Statistic: a numerical characteristic of a sample (sample mean, sample standard deviation)
  • Sampling distribution: the probability distribution of a statistic obtained from all possible samples of a given size from a population
  • Standard error: the standard deviation of a sampling distribution
    • For the sampling distribution of the mean, the standard error is calculated as: SE = \frac{\sigma}{\sqrt{n}}, where \sigma is the population standard deviation and n is the sample size
  • Confidence interval: a range of values that is likely to contain the true population parameter with a specified level of confidence
    • For a population mean, the confidence interval is calculated as: \bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}}, where \bar{x} is the sample mean, z^* is the critical value from the standard normal distribution, \sigma is the population standard deviation, and n is the sample size (see the sketch after this list)
  • Margin of error: the maximum expected difference between the true population parameter and the sample estimate
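A minimal sketch of the confidence-interval formula above, using simulated data and treating the population standard deviation as known; scipy.stats.norm.ppf supplies the critical value z*.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=100, scale=15, size=40)  # simulated data; sigma = 15 assumed known

sigma, n = 15, len(sample)
se = sigma / np.sqrt(n)                # standard error of the mean
z_star = stats.norm.ppf(0.975)         # critical value for 95% confidence (~1.96)

x_bar = sample.mean()
margin = z_star * se                   # margin of error
print(f"95% CI: {x_bar - margin:.2f} to {x_bar + margin:.2f}")
```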

Hypothesis Testing

  • Null hypothesis (H_0): the claim that there is no significant difference or relationship between variables
  • Alternative hypothesis (H_a or H_1): the claim that there is a significant difference or relationship between variables
  • Type I error: rejecting the null hypothesis when it is actually true (false positive)
    • The probability of a Type I error is denoted by \alpha and is typically set at 0.05
  • Type II error: failing to reject the null hypothesis when it is actually false (false negative)
    • The probability of a Type II error is denoted by \beta
  • Power: the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true
    • Calculated as 1 - \beta
  • p-value: the probability of obtaining a test statistic as extreme as, or more extreme than, the observed result, assuming the null hypothesis is true
    • A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis
  • Test statistic: a value calculated from the sample data used to determine whether to reject the null hypothesis (z-score, t-score, chi-square)
  • Critical value: the threshold value of the test statistic that determines the boundary between rejecting and not rejecting the null hypothesis (the sketch after this list runs a complete one-sample test)
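The sketch below ties these pieces together with a one-sample t-test on simulated data, assuming SciPy is available; the hypothesized mean of 100 is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=102, scale=10, size=30)  # simulated sample

# H0: population mean = 100; Ha: population mean != 100 (two-sided test)
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: evidence that the mean differs from 100")
else:
    print("Fail to reject H0: insufficient evidence at alpha = 0.05")
```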

Correlation and Regression

  • Correlation: a measure of the strength and direction of the linear relationship between two variables
    • Pearson correlation coefficient (r): a measure of the strength and direction of the linear relationship between two continuous variables
      • Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation
  • Scatterplot: a graph that displays the relationship between two continuous variables
  • Simple linear regression: a statistical method for modeling the linear relationship between a dependent variable and one independent variable
    • Regression equation: y = \beta_0 + \beta_1 x + \epsilon, where y is the dependent variable, x is the independent variable, \beta_0 is the y-intercept, \beta_1 is the slope, and \epsilon is the error term
  • Multiple linear regression: a statistical method for modeling the linear relationship between a dependent variable and two or more independent variables
  • Coefficient of determination (R^2): a measure of the proportion of variance in the dependent variable that is predictable from the independent variable(s)
    • Ranges from 0 to 1, with higher values indicating a better fit of the regression model to the data
  • Residual: the difference between the observed value of the dependent variable and the predicted value from the regression model (see the sketch after this list)
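A short sketch on simulated data, assuming SciPy: scipy.stats.pearsonr gives Pearson's r, and scipy.stats.linregress fits the simple linear regression, from which R^2 = r^2 in the one-predictor case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=100)  # true beta0 = 2, beta1 = 1.5, plus noise

r, p = stats.pearsonr(x, y)          # correlation coefficient and its p-value
fit = stats.linregress(x, y)         # simple linear regression fit

print(f"Pearson r = {r:.3f}")
print(f"beta0 (intercept) = {fit.intercept:.2f}, beta1 (slope) = {fit.slope:.2f}")
print(f"R^2 = {fit.rvalue ** 2:.3f}")

residuals = y - (fit.intercept + fit.slope * x)  # observed minus predicted
print(f"mean residual (should be ~0): {residuals.mean():.2e}")
```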

Applications in Causal Inference

  • Causal inference: the process of determining whether a causal relationship exists between two variables
  • Randomized controlled trial (RCT): an experimental design in which participants are randomly assigned to treatment and control groups to estimate the causal effect of an intervention
  • Observational study: a non-experimental study design in which researchers observe and analyze data without manipulating the variables of interest
  • Confounding: a situation in which the relationship between an exposure and an outcome is distorted by the presence of a third variable that is associated with both the exposure and the outcome
  • Selection bias: a systematic error that occurs when the sample is not representative of the population due to the way in which participants are selected
  • Propensity score matching: a statistical technique used to estimate the causal effect of a treatment by matching treated and untreated individuals based on their likelihood of receiving the treatment
  • Instrumental variable: a variable that is associated with the exposure but affects the outcome only through the exposure, used to estimate causal effects in the presence of unmeasured confounding
  • Difference-in-differences: a method for estimating the causal effect of a policy or intervention by comparing the change in outcomes between a treatment group and a control group, before and after the intervention (see the sketch after this list)
  • Regression discontinuity design: a quasi-experimental design that estimates the causal effect of a treatment by comparing outcomes for individuals just above and below a threshold value of a continuous variable used to assign treatment

© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.