Biostatistics Unit 3 – Probability Distributions in Biology

Probability distributions are essential tools in biology for understanding and analyzing random events. They help quantify the likelihood of outcomes in various biological processes, from genetic mutations to species distributions. These distributions come in discrete and continuous forms, each with unique applications. Biologists use them to model phenomena like Hardy-Weinberg equilibrium, rare genetic events, and population characteristics, enabling hypothesis testing and data-driven insights in research.

Key Concepts in Probability

  • Probability quantifies the likelihood of an event occurring and ranges from 0 (impossible) to 1 (certain)
  • Sample space represents all possible outcomes of an experiment or trial
  • Events are subsets of the sample space and can be combined using set operations (union, intersection, complement)
  • Mutually exclusive events cannot occur simultaneously in a single trial (rolling a 1 and a 6 on a die)
  • Independent events do not influence each other's probability (flipping a coin multiple times)
    • For independent events, the probability that all of them occur is the product of their individual probabilities: $P(A \cap B) = P(A)P(B)$
  • Conditional probability measures the likelihood of an event given that another event has occurred
    • Calculated using the formula: $P(A|B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$
  • Bayes' theorem relates the conditional probabilities $P(A|B)$ and $P(B|A)$, allowing probabilities to be updated as new information arrives
    • Formula: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
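
As a minimal sketch of how Bayes' theorem updates a probability, the snippet below computes the chance that an individual carries a variant given a positive screening result. The carrier frequency, sensitivity, and false-positive rate are made-up illustrative numbers, and `posterior` is a hypothetical helper name, not part of any library.

```python
def posterior(prior, p_pos_given_a, p_pos_given_not_a):
    """P(A | positive) via Bayes' theorem.

    P(positive) is obtained from the law of total probability.
    """
    p_pos = p_pos_given_a * prior + p_pos_given_not_a * (1 - prior)
    return p_pos_given_a * prior / p_pos

# Illustrative values only: 1% carrier frequency, 95% sensitivity,
# 5% false-positive rate
p_carrier = posterior(prior=0.01, p_pos_given_a=0.95, p_pos_given_not_a=0.05)
print(f"P(carrier | positive test) = {p_carrier:.3f}")
```

With these assumed inputs the posterior is far below the test's sensitivity, because true carriers are rare relative to false positives.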

Types of Probability Distributions

  • Probability distributions describe the likelihood of each possible outcome of a random variable
  • Discrete distributions have a countable number of possible outcomes (number of mutations in a gene)
    • Examples: Binomial, Poisson, Geometric, Hypergeometric
  • Continuous distributions can take any value within a range, so the possible outcomes are uncountably infinite (height of plants)
    • Examples: Normal (Gaussian), Exponential, Gamma, Beta
  • Binomial distribution models the number of successes in a fixed number of independent trials with two possible outcomes (success or failure)
    • Characterized by parameters $n$ (number of trials) and $p$ (probability of success)
  • Poisson distribution models the number of rare events occurring in a fixed interval of time or space (mutations per generation)
    • Characterized by parameter $\lambda$ (average rate of events)
  • Normal distribution is symmetric and bell-shaped, and it models many biological phenomena (body weight, IQ scores)
    • Characterized by parameters $\mu$ (mean) and $\sigma$ (standard deviation)
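
The three named distributions can be evaluated with Python's standard library alone. The sketch below is illustrative: `binomial_pmf` and `poisson_pmf` are hypothetical helper names, and every parameter value is an assumption chosen for the example rather than taken from data.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Illustrative parameters (assumptions, not measurements)
p_three_of_ten = binomial_pmf(3, n=10, p=0.25)    # 3 successes in 10 trials
p_no_mutation = poisson_pmf(0, lam=2.0)           # no events when 2 are expected
height = NormalDist(mu=170, sigma=10)             # hypothetical trait in cm
p_within_1sd = height.cdf(180) - height.cdf(160)  # mass within one SD of the mean

print(f"Binomial P(X=3): {p_three_of_ten:.4f}")
print(f"Poisson  P(X=0): {p_no_mutation:.4f}")
print(f"Normal   P(160 < X < 180): {p_within_1sd:.4f}")
```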

Biological Applications of Distributions

  • Hardy-Weinberg equilibrium uses the binomial expansion $(p + q)^2 = p^2 + 2pq + q^2$ to model genotype frequencies in a population from allele frequencies $p$ and $q$
    • Assumes random mating and no selection, mutation, migration, or genetic drift
  • Poisson distribution models rare events in biology (number of mutations, species distribution patterns)
  • Exponential distribution describes waiting times between events (time between mutations, survival times)
  • Normal distribution applies to many continuous biological variables (height, weight, blood pressure)
    • Central Limit Theorem states that the means of sufficiently large samples are approximately normally distributed, even when the underlying variable is not (provided its variance is finite)
  • Gamma distribution models waiting times for a specified number of events to occur (time until $k$ mutations)
  • Beta distribution is used for proportions and probabilities (allele frequencies, inheritance patterns)
  • Hypergeometric distribution models sampling without replacement from a finite population (genotyping individuals)
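
To make the Hardy-Weinberg application concrete, here is a small sketch that turns an allele frequency into expected genotype frequencies for a single locus with two alleles. The function name `hardy_weinberg`, the allele frequency of 0.7, and the sample size of 200 are all assumptions for illustration.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele A at frequency p (a at 1 - p)."""
    q = 1 - p
    return {"AA": p**2, "Aa": 2 * p * q, "aa": q**2}

freqs = hardy_weinberg(0.7)  # illustrative allele frequency
# Expected genotype counts in a hypothetical sample of 200 individuals
expected_counts = {genotype: f * 200 for genotype, f in freqs.items()}

for genotype, f in freqs.items():
    print(f"{genotype}: frequency {f:.2f}, expected count {expected_counts[genotype]:.0f}")
```

Comparing such expected counts with observed genotype counts is one common way to check whether a population departs from equilibrium.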

Measures of Central Tendency and Spread

  • Measures of central tendency describe the typical or central value in a dataset
  • Mean is the arithmetic average calculated by summing all values and dividing by the number of observations
    • Sensitive to outliers and extreme values
  • Median is the middle value when data is ordered from lowest to highest
    • Robust to outliers and more appropriate for skewed distributions
  • Mode is the most frequently occurring value in a dataset
    • Can have multiple modes (bimodal, multimodal) or no mode
  • Measures of spread describe the variability or dispersion of data points
  • Variance quantifies the average squared deviation from the mean
    • Calculated as: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
  • Standard deviation is the square root of variance
    • Measures spread in the same units as the original data
  • Range is the difference between the maximum and minimum values
    • Sensitive to outliers and provides limited information about overall spread
  • Interquartile range (IQR) is the difference between the third and first quartiles ($Q_3 - Q_1$)
    • More robust to outliers than range
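
The measures above can be computed with Python's built-in `statistics` module. This is a sketch on made-up plant heights with one deliberate outlier, so the contrast between outlier-sensitive and robust measures is visible. Quartile values depend on the interpolation method, and the module's default ("exclusive") is assumed here.

```python
import statistics

# Hypothetical plant heights in cm; the last value is a deliberate outlier
heights = [12.1, 13.4, 12.8, 14.0, 13.1, 12.5, 13.7, 25.0]

mean = statistics.mean(heights)
median = statistics.median(heights)
pop_var = statistics.pvariance(heights)  # divides by n, as in the variance formula
pop_sd = statistics.pstdev(heights)      # square root of the population variance
data_range = max(heights) - min(heights)
q1, _, q3 = statistics.quantiles(heights, n=4)  # three cut points: Q1, median, Q3
iqr = q3 - q1

print(f"mean={mean:.3f}  median={median:.3f}")
print(f"variance={pop_var:.3f}  sd={pop_sd:.3f}")
print(f"range={data_range:.2f}  IQR={iqr:.3f}")
```

The single outlier pulls the mean above the median and inflates the range, while the IQR stays close to the spread of the bulk of the data.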

Hypothesis Testing with Distributions

  • Hypothesis testing uses probability distributions to make statistical inferences about populations
  • Null hypothesis ($H_0$) states that there is no significant effect or difference (no difference in mean height between two plant species)
  • Alternative hypothesis ($H_a$) states that there is a significant effect or difference (mean height differs between species)
  • Test statistic is a value calculated from sample data used to decide between the null and alternative hypotheses
    • Examples: z-score (normal distribution), t-score (t-distribution), chi-square (chi-square distribution)
  • P-value is the probability of obtaining the observed test statistic or a more extreme value, assuming the null hypothesis is true
    • Smaller p-values provide stronger evidence against the null hypothesis
  • Significance level ($\alpha$) is the threshold for rejecting the null hypothesis (commonly 0.05)
    • If p-value < $\alpha$, reject $H_0$ and conclude there is a significant effect
  • Type I error (false positive) occurs when rejecting a true null hypothesis
    • Probability of Type I error is equal to the significance level ($\alpha$)
  • Type II error (false negative) occurs when failing to reject a false null hypothesis
    • Probability of Type II error is denoted by $\beta$ and depends on sample size and effect size
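
The decision procedure described above can be sketched with a two-sided one-sample z-test, which assumes the population standard deviation is known. The helper `z_test_two_sided` and all input numbers are hypothetical; with an unknown standard deviation a t-test would normally be used instead.

```python
import math
from statistics import NormalDist

def z_test_two_sided(sample_mean, mu0, sigma, n):
    """Return (z, p_value) for H0: population mean == mu0, with known SD sigma."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Two-sided: probability of a statistic at least this extreme in either tail
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

alpha = 0.05  # significance level
z, p_value = z_test_two_sided(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36)
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```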

Data Visualization Techniques

  • Histograms display the frequency distribution of a continuous variable
    • Divide data into bins of equal width and plot the count or frequency of observations in each bin
  • Box plots (box-and-whisker plots) summarize the distribution of a continuous variable
    • Show median, quartiles, and potential outliers
  • Scatter plots display the relationship between two continuous variables
    • Each point represents an observation with its x and y coordinates
  • Bar plots compare frequencies or means across categories
    • Height of each bar represents the frequency or mean for that category
  • Pie charts show the proportion or percentage of each category in a whole
    • Angle and area of each slice correspond to its proportion
  • Heatmaps visualize patterns in two-dimensional data
    • Colors represent values in a matrix (gene expression levels)
  • Violin plots combine a box plot and a kernel density plot to show the distribution shape
    • Width of the "violin" represents the density of observations at each value
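
The binning step behind a histogram can be sketched without a plotting library. The function below is a hypothetical helper that splits the data range into equal-width bins and counts the observations in each; the leaf lengths are made up, and the sketch assumes the values are not all identical (otherwise the bin width would be zero).

```python
def histogram_counts(values, n_bins):
    """Count observations in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical leaf lengths in mm
leaf_lengths = [21, 23, 24, 25, 25, 26, 27, 27, 28, 28, 29, 30, 31, 33, 36]
edges, counts = histogram_counts(leaf_lengths, n_bins=5)

# Text rendering: one row per bin, one '#' per observation
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:5.1f}-{right:5.1f} | {'#' * c}")
```

A plotting library would draw the same counts as bar heights. The choice of bin count changes the apparent shape, so it is worth trying more than one.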

Real-world Examples in Biology

  • Genome-wide association studies (GWAS) use probability distributions to identify genetic variants associated with traits or diseases
    • Test for significant differences in allele frequencies between cases and controls
  • Epidemiological models use Poisson and exponential distributions to predict the spread of infectious diseases
    • Estimate parameters such as transmission rate and incubation period
  • Population ecology models use probability distributions to describe species abundance, dispersal, and interactions
    • Examples: species-area relationships, metapopulation dynamics
  • Quantitative genetics uses the normal distribution to model the inheritance of continuous traits (height, weight)
    • Estimate heritability and response to selection
  • Phylogenetic inference uses probability distributions to model the evolution of DNA sequences
    • Maximum likelihood and Bayesian methods estimate tree topologies and branch lengths
  • Ecological niche modeling uses probability distributions to predict species' geographic distributions based on environmental variables
    • Examples: Maxent, Generalized Linear Models (GLMs)

Common Pitfalls and Misconceptions

  • Confusing probability with certainty or frequency
    • Probability is a measure of uncertainty, not a guarantee of occurrence
  • Misinterpreting p-values as the probability of the null hypothesis being true
    • P-value is the probability of observing data at least as extreme as those obtained, assuming the null hypothesis is true
  • Overinterpreting non-significant results as evidence of no effect
    • Failure to reject the null hypothesis does not prove it true (absence of evidence is not evidence of absence)
  • Assuming that statistical significance implies practical or biological significance
    • Small effects can be statistically significant with large sample sizes but may not be biologically meaningful
  • Neglecting to check assumptions of statistical tests and models
    • Violations of assumptions (normality, independence, equal variances) can lead to invalid inferences
  • Overfitting models to noise in the data
    • Overly complex models may fit the sample data well but fail to generalize to new data
  • Misusing summary statistics without considering the underlying distribution
    • Different distributions can have the same mean or variance but different shapes and properties
  • Confusing correlation with causation
    • Correlation does not imply a causal relationship between variables (confounding factors, reverse causation)
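
The gap between statistical and biological significance can be illustrated numerically. The sketch below applies a two-sided z-test (known standard deviation assumed) to the same tiny mean shift at two sample sizes. `two_sided_p` is a hypothetical helper and the numbers are invented for the example.

```python
import math
from statistics import NormalDist

def two_sided_p(effect, sigma, n):
    """Two-sided z-test p-value for an observed mean shift `effect`, known SD."""
    z = effect / (sigma / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same tiny shift (0.1 units against an SD of 5), very different sample sizes
p_small_n = two_sided_p(effect=0.1, sigma=5.0, n=100)
p_large_n = two_sided_p(effect=0.1, sigma=5.0, n=1_000_000)

print(f"n = 100:       p = {p_small_n:.4f}")
print(f"n = 1,000,000: p = {p_large_n:.2e}")
```

The effect is identical in both cases and only the sample size differs, so a small p-value by itself says nothing about whether a shift of 0.1 units matters biologically.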


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.