🎳 Intro to Econometrics Unit 1 – Probability & Statistics Fundamentals

Probability and statistics form the foundation of econometrics, providing tools to analyze and interpret data. This unit covers key concepts like probability distributions, random variables, and descriptive statistics, essential for understanding economic phenomena and making informed decisions. Students will learn about hypothesis testing, confidence intervals, and regression analysis. These techniques allow economists to draw inferences from sample data, estimate relationships between variables, and test economic theories using empirical evidence.

Key Concepts and Definitions

  • Probability measures the likelihood of an event occurring, expressed as a number between 0 and 1
    • 0 indicates an impossible event, while 1 represents a certain event
  • Random variables assign numerical values to outcomes of a random experiment
    • Discrete random variables have countable outcomes (number of defective items in a batch)
    • Continuous random variables have an infinite number of possible outcomes within a range (height of students in a class)
  • Probability distributions describe the likelihood of different outcomes for a random variable
    • Probability mass functions (PMFs) define probability distributions for discrete random variables
    • Probability density functions (PDFs) define probability distributions for continuous random variables
  • Expected value represents the average outcome of a random variable over a large number of trials, calculated as the sum of each outcome multiplied by its probability
  • Variance and standard deviation measure the spread or dispersion of a probability distribution
    • Variance is the average squared deviation from the mean, denoted as $\sigma^2$
    • Standard deviation is the square root of variance, denoted as $\sigma$
  • Covariance and correlation measure the relationship between two random variables
    • Covariance indicates the direction of the linear relationship (positive, negative, or zero)
    • Correlation is a standardized measure of the linear relationship, ranging from -1 to 1 (both measures are computed in the sketch after this list)
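
A minimal sketch of these core measures, using NumPy. The outcomes, probabilities, and sample data below are hypothetical values chosen only for illustration.

```python
import numpy as np

# Discrete random variable: outcomes and their probabilities (hypothetical values)
outcomes = np.array([0, 1, 2, 3])
probs = np.array([0.1, 0.3, 0.4, 0.2])

expected_value = np.sum(outcomes * probs)                    # E[X] = sum of x * P(x)
variance = np.sum((outcomes - expected_value) ** 2 * probs)  # E[(X - E[X])^2]
std_dev = np.sqrt(variance)                                  # square root of variance

# Covariance and correlation from two hypothetical samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
cov_xy = np.cov(x, y)[0, 1]        # sample covariance (direction of the relationship)
corr_xy = np.corrcoef(x, y)[0, 1]  # Pearson correlation, always between -1 and 1

print(expected_value, variance, std_dev, cov_xy, corr_xy)
```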

Probability Basics

  • The law of large numbers states that as the number of trials increases, the average of the results will converge to the expected value
  • Conditional probability is the probability of an event A occurring, given that event B has already occurred, denoted as $P(A|B)$
  • The multiplication rule states that the probability of two events A and B occurring together is the product of the probability of A and the conditional probability of B given A, expressed as $P(A \cap B) = P(A) \times P(B|A)$
  • Independent events have no influence on each other's probability
    • For independent events A and B, $P(A|B) = P(A)$ and $P(B|A) = P(B)$
    • The probability of independent events occurring together is the product of their individual probabilities, $P(A \cap B) = P(A) \times P(B)$
  • Mutually exclusive events cannot occur at the same time
    • The probability that either A or B occurs is the sum of their individual probabilities, $P(A \cup B) = P(A) + P(B)$
  • The complement of an event A is the event that A does not occur; its probability is denoted $P(A')$ and equals $1 - P(A)$
  • Bayes' theorem describes the probability of an event based on prior knowledge and new evidence, expressed as $P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$ (a worked numerical example follows this list)
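
A short worked example of Bayes' theorem in Python. The prior, sensitivity, and false positive rate below are hypothetical numbers in the style of a diagnostic-test problem.

```python
# Hypothetical diagnostic-test numbers
p_disease = 0.01             # prior P(A): base rate of the condition
p_pos_given_disease = 0.95   # P(B|A): probability of a positive test given the condition
p_pos_given_healthy = 0.05   # P(B|A'): false positive rate

# Law of total probability: P(B) = P(B|A) * P(A) + P(B|A') * P(A')
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.161, despite the 95% sensitivity
```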

Types of Distributions

  • Bernoulli distribution models a single trial with two possible outcomes (success or failure)
    • The probability of success is denoted as $p$, and the probability of failure is $1-p$
  • Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials
    • Characterized by the number of trials $n$ and the probability of success $p$
  • Poisson distribution models the number of rare events occurring in a fixed interval of time or space
    • Characterized by the average rate of occurrence $\lambda$
  • Normal (Gaussian) distribution is a continuous probability distribution with a bell-shaped curve
    • Characterized by its mean $\mu$ and standard deviation $\sigma$
    • The standard normal distribution has a mean of 0 and a standard deviation of 1
  • Uniform distribution has equal probability for all outcomes within a given range
    • Discrete uniform distribution has a fixed number of equally likely outcomes
    • Continuous uniform distribution assigns equal probability density to every value within a range
  • Exponential distribution models the time between events in a Poisson process
    • Characterized by the rate parameter $\lambda$, which is the inverse of the mean
  • Student's t-distribution is similar to the normal distribution but with heavier tails, used when the sample size is small or the population standard deviation is unknown (several of these distributions are evaluated in the sketch after this list)
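
A brief sketch of several of these distributions using scipy.stats. The parameter values ($p$, $n$, $\lambda$, $\mu$, $\sigma$) are hypothetical and chosen only to show the API.

```python
from scipy import stats

# Hypothetical parameters for several common distributions
bernoulli = stats.bernoulli(p=0.3)    # Bernoulli: single trial, success prob 0.3
binom = stats.binom(n=10, p=0.3)      # Binomial: 10 independent Bernoulli trials
poisson = stats.poisson(mu=2.5)       # Poisson: average rate lambda = 2.5
normal = stats.norm(loc=0, scale=1)   # Standard normal: mean 0, standard deviation 1
expon = stats.expon(scale=1 / 2.0)    # Exponential: rate lambda = 2, so mean = 0.5

print(bernoulli.pmf(1))   # P(success) = 0.3
print(binom.pmf(3))       # P(exactly 3 successes in 10 trials)
print(poisson.pmf(0))     # P(no events in the interval)
print(normal.cdf(1.96))   # P(Z <= 1.96), roughly 0.975
print(expon.mean())       # mean of the exponential = 1 / lambda = 0.5
```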

Descriptive Statistics

  • Measures of central tendency describe the center or typical value of a dataset
    • Mean is the arithmetic average of all values in a dataset
    • Median is the middle value when the dataset is ordered from lowest to highest
    • Mode is the most frequently occurring value in a dataset
  • Measures of dispersion describe the spread or variability of a dataset
    • Range is the difference between the maximum and minimum values
    • Interquartile range (IQR) is the difference between the first and third quartiles
    • Variance measures the average squared distance of data points from the mean; standard deviation is its square root, expressed in the original units
  • Skewness measures the asymmetry of a distribution
    • Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail
  • Kurtosis measures the heaviness of the tails and peakedness of a distribution compared to a normal distribution
    • Leptokurtic distributions have heavier tails and a higher peak than a normal distribution
    • Platykurtic distributions have lighter tails and a lower peak than a normal distribution
  • Percentiles and quartiles divide a dataset into equal parts
    • Percentiles divide a dataset into 100 equal parts
    • Quartiles divide a dataset into four equal parts (Q1, Q2 or median, Q3)
  • Boxplots visually represent the five-number summary of a dataset (minimum, Q1, median, Q3, maximum)
    • Outliers are data points that fall outside the whiskers of a boxplot (the sketch after this list computes several of these summary measures and flags outliers)
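
A minimal sketch of these descriptive statistics with NumPy and SciPy, using a small hypothetical sample that contains one large value.

```python
import numpy as np
from scipy import stats

data = np.array([2, 4, 4, 5, 7, 9, 12, 13, 15, 40])  # hypothetical sample with one outlier

mean = data.mean()                       # arithmetic average
median = np.median(data)                 # middle value of the ordered data
variance = data.var(ddof=1)              # sample variance
std_dev = data.std(ddof=1)               # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range
data_range = data.max() - data.min()     # range

# Common rule of thumb: points beyond 1.5 * IQR from the quartiles are flagged as outliers
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(mean, median, std_dev, data_range, iqr, outliers)
print(stats.skew(data), stats.kurtosis(data))  # skewness and excess kurtosis
```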

Inferential Statistics

  • Hypothesis testing is a statistical method for making decisions based on sample data
    • The null hypothesis ($H_0$) represents the status quo or no effect
    • The alternative hypothesis ($H_a$ or $H_1$) represents the claim or effect being tested
  • Type I error (false positive) occurs when rejecting a true null hypothesis
    • The significance level $\alpha$ is the probability of making a Type I error
  • Type II error (false negative) occurs when failing to reject a false null hypothesis
    • The power of a test is the probability of correctly rejecting a false null hypothesis
  • Confidence intervals estimate the range of values that likely contain the true population parameter
    • The confidence level (e.g., 95%) is the long-run proportion of such intervals that would contain the true parameter value across repeated samples
  • p-values measure the strength of evidence against the null hypothesis
    • A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis
  • t-tests compare means between two groups or a sample mean to a known population mean
    • Independent samples t-test compares means between two independent groups
    • Paired samples t-test compares means between two related groups or measurements
    • One-sample t-test compares a sample mean to a known population mean (a t-test and confidence-interval sketch follows this list)
  • Analysis of Variance (ANOVA) tests for differences in means among three or more groups
    • One-way ANOVA compares means across one categorical variable
    • Two-way ANOVA compares means across two categorical variables and their interaction
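
A sketch of an independent samples t-test, a 95% confidence interval, and a one-way ANOVA using scipy.stats. The group means, standard deviations, and sample sizes are hypothetical, and the random seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)   # hypothetical sample 1
group_b = rng.normal(loc=11.0, scale=2.0, size=30)   # hypothetical sample 2
group_c = rng.normal(loc=10.5, scale=2.0, size=30)   # hypothetical sample 3

# Independent samples t-test: H0 is that the two group means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)   # small p-value (e.g., < 0.05) is evidence against H0

# 95% confidence interval for the mean of group_a, based on the t distribution
mean_a = group_a.mean()
se_a = stats.sem(group_a)   # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(group_a) - 1, loc=mean_a, scale=se_a)
print(ci_low, ci_high)

# One-way ANOVA: H0 is that all three group means are equal
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, anova_p)
```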

Data Visualization Techniques

  • Scatter plots display the relationship between two continuous variables
    • Each data point represents an observation with its x and y coordinates
  • Line plots connect data points in order, typically used for time series data
    • Multiple line plots can be used to compare trends across different categories
  • Bar plots compare values across different categories
    • Vertical bar plots (column charts) are used for categories with no natural ordering
    • Horizontal bar plots are useful when category labels are long or numerous
  • Histograms display the distribution of a continuous variable
    • The x-axis is divided into bins, and the y-axis shows the frequency or count of observations in each bin
  • Pie charts show the proportion or percentage of each category in a whole
    • Best used when the number of categories is small and the proportions are significantly different
  • Heatmaps display values using color intensity, useful for visualizing patterns in matrices or tables
  • Box plots summarize the distribution of a continuous variable across different categories
    • They display the five-number summary and any outliers
  • Violin plots combine a box plot and a kernel density plot to show the distribution shape
  • Faceting (small multiples) creates multiple subplots based on one or more categorical variables, allowing for comparisons across subgroups; a sketch of a few of these plot types follows
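
A small matplotlib sketch of three of these plot types side by side. The simulated data and the linear relationship between x and y are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)   # hypothetical linear relationship plus noise

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].scatter(x, y, s=10)     # scatter plot: relationship between two continuous variables
axes[0].set_title("Scatter plot")

axes[1].hist(x, bins=20)        # histogram: distribution of a continuous variable
axes[1].set_title("Histogram")

axes[2].boxplot([x, y])         # box plots: five-number summary and outliers for each variable
axes[2].set_title("Box plots")

plt.tight_layout()
plt.show()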

Applications in Econometrics

  • Regression analysis models the relationship between a dependent variable and one or more independent variables
    • Simple linear regression models the relationship between two continuous variables
    • Multiple linear regression models the relationship between a dependent variable and multiple independent variables (an OLS sketch follows this list)
  • Time series analysis studies data collected over time to identify trends, seasonality, and other patterns
    • Autoregressive (AR) models use past values of the variable to predict future values
    • Moving average (MA) models use past forecast errors to predict future values
    • Autoregressive integrated moving average (ARIMA) models combine AR and MA components and account for non-stationarity
  • Panel data analysis studies data collected over time for multiple individuals, firms, or other entities
    • Fixed effects models control for unobserved, time-invariant individual characteristics
    • Random effects models assume individual-specific effects are uncorrelated with the independent variables
  • Instrumental variables (IV) estimation addresses endogeneity issues when independent variables are correlated with the error term
    • Valid instruments are correlated with the endogenous variable but not with the error term
  • Difference-in-differences (DID) estimation compares the change in outcomes between a treatment and control group before and after an intervention
    • Parallel trends assumption: the treatment and control groups would have followed the same trend in the absence of the intervention
  • Propensity score matching (PSM) estimates the effect of a treatment by comparing treated and untreated observations with similar propensity scores
    • The propensity score is the probability of receiving the treatment based on observed characteristics
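
A minimal multiple linear regression sketch with statsmodels, fitted on simulated data. The variable names (wage, education, experience) and the coefficient values used to generate the data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
education = rng.normal(12, 2, size=n)       # hypothetical regressor: years of education
experience = rng.normal(10, 5, size=n)      # hypothetical regressor: years of experience
wage = 1.5 + 0.8 * education + 0.3 * experience + rng.normal(scale=2.0, size=n)

# Multiple linear regression: wage on education and experience, with an intercept
X = sm.add_constant(np.column_stack([education, experience]))
model = sm.OLS(wage, X).fit()

print(model.params)     # estimated intercept and slope coefficients
print(model.summary())  # coefficients, standard errors, t-statistics, p-values, R-squared
```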

Common Pitfalls and Misconceptions

  • Correlation does not imply causation: a strong correlation between two variables does not necessarily mean that one causes the other
    • Confounding variables or reverse causality may explain the observed relationship
  • Outliers can heavily influence statistical measures and model results
    • It is important to identify and appropriately handle outliers based on the research context
  • Overfitting occurs when a model is too complex and fits the noise in the data rather than the underlying pattern
    • Overfitted models may have poor performance on new, unseen data
  • Underfitting occurs when a model is too simple and fails to capture the true relationship between variables
    • Underfitted models may have high bias and low variance
  • Multicollinearity arises when independent variables in a regression model are highly correlated with each other
    • Multicollinearity can lead to unstable coefficient estimates and difficulty in interpreting individual variable effects (see the VIF check in the sketch after this list)
  • Heteroscedasticity refers to the situation where the variance of the error term is not constant across observations
    • Heteroscedasticity can lead to biased standard errors and invalid inference
  • Autocorrelation occurs when the error terms in a time series or panel data model are correlated with each other
    • Autocorrelation can lead to biased standard errors and inefficient coefficient estimates
  • Simpson's paradox occurs when a trend or relationship observed in aggregated data disappears or reverses when the data is disaggregated by a confounding variable
    • It highlights the importance of considering subgroup effects and controlling for relevant variables
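
A sketch of two common diagnostics related to the pitfalls above: variance inflation factors to detect multicollinearity, and heteroscedasticity-robust standard errors as one standard response to non-constant error variance. The simulated data and the near-collinear construction of x2 are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 is deliberately almost collinear with x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))

# Variance inflation factors: values far above 10 suggest serious multicollinearity
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print("VIFs:", vifs)

# Heteroscedasticity-robust (HC1) standard errors keep inference valid
# when the error variance is not constant across observations
robust = sm.OLS(y, X).fit(cov_type="HC1")
print(robust.bse)   # robust standard errors of the coefficient estimates
```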


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
