🐛 Biostatistics Unit 3 – Probability Distributions in Biology
Probability distributions are essential tools in biology for understanding and analyzing random events. They help quantify the likelihood of outcomes in various biological processes, from genetic mutations to species distributions.
These distributions come in discrete and continuous forms, each with unique applications. Biologists use them to model phenomena like Hardy-Weinberg equilibrium, rare genetic events, and population characteristics, enabling hypothesis testing and data-driven insights in research.
Key Concepts in Probability
Probability quantifies the likelihood of an event occurring and ranges from 0 (impossible) to 1 (certain)
Sample space represents all possible outcomes of an experiment or trial
Events are subsets of the sample space and can be combined using set operations (union, intersection, complement)
Mutually exclusive events cannot occur simultaneously in a single trial (rolling a 1 and a 6 on a die)
Independent events do not influence each other's probability (flipping a coin multiple times)
The joint probability of independent events is calculated by multiplying their individual probabilities: $P(A \cap B) = P(A)\,P(B)$
Conditional probability measures the likelihood of an event given that another event has occurred
Calculated using the formula: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Bayes' theorem relates conditional probabilities and allows probabilities to be updated based on new information
Formula: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
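As a rough illustration of Bayes' theorem, the sketch below updates the probability that an individual carries a condition after a positive test. The prevalence, sensitivity, and false-positive rate are made-up numbers, and the function name `bayes_posterior` is hypothetical. $P(B)$ is expanded with the law of total probability.

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """Return P(A|B) using Bayes' theorem.

    A = individual has the condition, B = test result is positive.
    prior = P(A), sensitivity = P(B|A), false_positive_rate = P(B|not A).
    """
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_b = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_b


# Hypothetical values: 1% prevalence, 95% sensitivity, 5% false-positive rate
posterior = bayes_posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # prints 0.161
```

Even with a fairly accurate test, the posterior stays low here because the condition is rare, which is why the prior matters.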
Types of Probability Distributions
Probability distributions describe the likelihood of different outcomes in a random variable
Discrete distributions have a countable number of possible outcomes (number of mutations in a gene)
Examples: Binomial, Poisson, Geometric, Hypergeometric
Continuous distributions describe variables that can take any value within a range, so the set of possible outcomes is uncountably infinite (height of plants)
Examples: Normal (Gaussian), Exponential, Gamma, Beta
Binomial distribution models the number of successes in a fixed number of independent trials with two possible outcomes (success or failure)
Characterized by parameters $n$ (number of trials) and $p$ (probability of success)
Poisson distribution models the number of rare events occurring in a fixed interval of time or space (mutations per generation)
Characterized by parameter $\lambda$ (average rate of events)
Normal distribution is symmetric and bell-shaped and models many biological phenomena (body weight, IQ scores)
Characterized by parameters $\mu$ (mean) and $\sigma$ (standard deviation)
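The three distributions described above can be evaluated with Python's standard library alone. This is a minimal sketch: the helper names `binomial_pmf` and `poisson_pmf` and all parameter values are illustrative assumptions, not taken from the guide.

```python
from math import comb, exp, factorial
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in an interval when the average rate is lam)."""
    return lam**k * exp(-lam) / factorial(k)


# Hypothetical: 10 offspring, each with probability 0.25 of the recessive phenotype
p_three = binomial_pmf(3, 10, 0.25)

# Hypothetical: on average 2 mutations per generation; probability of none
p_zero = poisson_pmf(0, 2.0)

# Hypothetical: body weight with mean 70 and standard deviation 10;
# probability that a weight is at most 80 (one standard deviation above the mean)
p_below = NormalDist(mu=70, sigma=10).cdf(80)

print(f"{p_three:.4f} {p_zero:.4f} {p_below:.4f}")
```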
Biological Applications of Distributions
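Hardy-Weinberg equilibrium uses the binomial expansion to model genotype frequencies in a population: with allele frequencies $p$ and $q = 1 - p$, the expected genotype frequencies are $p^2$, $2pq$, and $q^2$
Assumes random mating and no selection, mutation, migration, or genetic drift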
Poisson distribution models rare events in biology (number of mutations, species distribution patterns)
Exponential distribution describes waiting times between events (time between mutations, survival times)
Normal distribution applies to many continuous biological variables (height, weight, blood pressure)
Central Limit Theorem states that the means of sufficiently large samples are approximately normally distributed, even when the underlying variable is not (provided it has finite variance)
Gamma distribution models waiting times for a specified number of events to occur (time until $k$ mutations)
Beta distribution is used for proportions and probabilities (allele frequencies, inheritance patterns)
Hypergeometric distribution models sampling without replacement from a finite population (genotyping individuals)
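To make the Hardy-Weinberg item in the list above concrete, here is a small sketch that computes expected genotype frequencies for a single locus with two alleles. The function name `hardy_weinberg`, the allele labels, and the allele frequency 0.7 are illustrative assumptions.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele frequencies p (A) and q = 1 - p (a)."""
    q = 1 - p
    # Terms of the binomial expansion (p + q)^2 = p^2 + 2pq + q^2
    return {"AA": p**2, "Aa": 2 * p * q, "aa": q**2}


# Hypothetical allele frequency of 0.7 for allele A
freqs = hardy_weinberg(0.7)
for genotype, freq in freqs.items():
    print(f"{genotype}: {freq:.2f}")
```

The three frequencies always sum to 1, which gives a quick sanity check when comparing expected and observed genotype counts.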
Measures of Central Tendency and Spread
Measures of central tendency describe the typical or central value in a dataset
Mean is the arithmetic average calculated by summing all values and dividing by the number of observations
Sensitive to outliers and extreme values
Median is the middle value when data is ordered from lowest to highest
Robust to outliers, so more appropriate for skewed distributions
Mode is the most frequently occurring value in a dataset
Can have multiple modes (bimodal, multimodal) or no mode
Measures of spread describe the variability or dispersion of data points
Variance quantifies the average squared deviation from the mean
Calculated as: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
Standard deviation is the square root of variance
Measures spread in the same units as the original data
Range is the difference between the maximum and minimum values
Sensitive to outliers and provides limited information about overall spread
Interquartile range (IQR) is the difference between the third and first quartiles ($Q_3 - Q_1$)
More robust to outliers than range
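The measures above can be computed with Python's `statistics` module. The data below are invented plant heights with one deliberate outlier. `pvariance` and `pstdev` divide by $n$ (the population form), and `statistics.quantiles` uses one of several common quartile conventions, so other software may report slightly different quartiles.

```python
import statistics

# Hypothetical plant heights in cm; 31.0 is a deliberate outlier
heights = [12.1, 13.4, 13.4, 14.0, 14.8, 15.2, 15.9, 31.0]

mean = statistics.mean(heights)          # pulled upward by the outlier
median = statistics.median(heights)      # middle of the ordered values
mode = statistics.mode(heights)          # most frequent value
variance = statistics.pvariance(heights) # population variance (divides by n)
std_dev = statistics.pstdev(heights)     # square root of the variance
data_range = max(heights) - min(heights)

# Quartiles with the module's default method, then IQR = Q3 - Q1
q1, _, q3 = statistics.quantiles(heights, n=4)
iqr = q3 - q1

print(f"mean={mean:.3f} median={median:.1f} mode={mode}")
print(f"variance={variance:.2f} sd={std_dev:.2f} range={data_range:.1f} IQR={iqr:.2f}")
```

The single outlier moves the mean well above the median and inflates the range, while the IQR is barely affected.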
Hypothesis Testing with Distributions
Hypothesis testing uses probability distributions to make statistical inferences about populations
Null hypothesis ($H_0$) states that there is no significant effect or difference (no difference in mean height between two plant species)
Alternative hypothesis ($H_a$) states that there is a significant effect or difference (mean height differs between species)
Test statistic is a value calculated from sample data used to decide between the null and alternative hypotheses
Examples: z-score (normal distribution), t-score (t-distribution), chi-square (chi-square distribution)
P-value is the probability of obtaining the observed test statistic or a more extreme value, assuming the null hypothesis is true
Smaller p-values provide stronger evidence against the null hypothesis
Significance level ($\alpha$) is the threshold for rejecting the null hypothesis (commonly 0.05)
If p-value < $\alpha$, reject $H_0$ and conclude there is a significant effect
Type I error (false positive) occurs when rejecting a true null hypothesis
Probability of Type I error is equal to the significance level ($\alpha$)
Type II error (false negative) occurs when failing to reject a false null hypothesis
Probability of Type II error is denoted by $\beta$ and depends on sample size and effect size
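The decision rule above can be sketched with a one-sample, two-sided z-test, which assumes the population standard deviation is known. The function name `two_sided_z_test` and all numbers are hypothetical. With an unknown standard deviation, a t-test would be the usual choice.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, reject_h0) for H0: population mean equals mu0."""
    # Test statistic: distance of the sample mean from mu0 in standard errors
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Two-sided p-value: probability of a statistic at least this extreme under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical: 36 plants with mean height 52 cm, tested against mu0 = 50 cm, sigma = 6 cm
z, p_value, reject = two_sided_z_test(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36)
print(f"z={z:.2f} p={p_value:.4f} reject H0: {reject}")
```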
Data Visualization Techniques
Histograms display the frequency distribution of a continuous variable
Divide data into bins of equal width and plot the count or frequency of observations in each bin
Box plots (box-and-whisker plots) summarize the distribution of a continuous variable
Show median, quartiles, and potential outliers
Scatter plots display the relationship between two continuous variables
Each point represents an observation with its x and y coordinates
Bar plots compare frequencies or means across categories
Height of each bar represents the frequency or mean for that category
Pie charts show the proportion or percentage of each category in a whole
Angle and area of each slice correspond to its proportion
Heatmaps visualize patterns in two-dimensional data
Colors represent values in a matrix (gene expression levels)
Violin plots combine a box plot and a kernel density plot to show the distribution shape
Width of the "violin" represents the density of observations at each value
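The binning step behind a histogram, described at the top of this list, can be sketched without any plotting library. The function `histogram_counts` is a hypothetical helper that assumes the values are not all identical, since the bin width would otherwise be zero. Plotting libraries handle this and other edge cases differently.

```python
def histogram_counts(values, num_bins):
    """Split the data range into equal-width bins and count observations per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The maximum value sits on the right edge, so it goes in the last bin
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical plant heights in cm
heights = [10, 12, 13, 15, 15, 16, 18, 19, 20, 20]
edges, counts = histogram_counts(heights, num_bins=5)

# Text rendering: one row per bin, labelled by its left edge
for left, count in zip(edges, counts):
    print(f"{left:5.1f} | {'#' * count}")
```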
Real-world Examples in Biology
Genome-wide association studies (GWAS) use probability distributions to identify genetic variants associated with traits or diseases
Test for significant differences in allele frequencies between cases and controls
Epidemiological models use Poisson and exponential distributions to predict the spread of infectious diseases
Estimate parameters such as transmission rate and incubation period
Population ecology models use probability distributions to describe species abundance, dispersal, and interactions
Examples: species-area relationships, metapopulation dynamics
Quantitative genetics uses normal distribution to model the inheritance of continuous traits (height, weight)
Estimate heritability and response to selection
Phylogenetic inference uses probability distributions to model the evolution of DNA sequences
Maximum likelihood and Bayesian methods estimate tree topologies and branch lengths
Ecological niche modeling uses probability distributions to predict species' geographic distributions based on environmental variables
Examples: Maxent, Generalized Linear Models (GLMs)
Common Pitfalls and Misconceptions
Confusing probability with certainty or frequency
Probability is a measure of uncertainty, not a guarantee of occurrence
Misinterpreting p-values as the probability of the null hypothesis being true
P-value is the probability of observing data at least as extreme as those obtained, given that the null hypothesis is true
Overinterpreting non-significant results as evidence of no effect
Failure to reject the null hypothesis does not prove it true (absence of evidence is not evidence of absence)
Assuming that statistical significance implies practical or biological significance
Small effects can be statistically significant with large sample sizes but may not be biologically meaningful
Neglecting to check assumptions of statistical tests and models
Violations of assumptions (normality, independence, equal variances) can lead to invalid inferences
Overfitting models to noise in the data
Overly complex models may fit the sample data well but fail to generalize to new data
Misusing summary statistics without considering the underlying distribution
Different distributions can have the same mean or variance but different shapes and properties
Confusing correlation with causation
Correlation does not imply a causal relationship between variables (confounding factors, reverse causation)
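The pitfall about statistical versus biological significance can be illustrated numerically. In this sketch the same tiny difference (0.1 units against a standard deviation of 10, a standardized effect of 0.01) is tested with a two-sided z-test at two sample sizes. The function name `z_test_p_value` and every number are made-up assumptions.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(diff, sigma, n):
    """Two-sided p-value for an observed mean difference `diff` with known sigma."""
    z = diff / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same tiny effect, two hypothetical sample sizes
p_small_n = z_test_p_value(diff=0.1, sigma=10.0, n=100)
p_large_n = z_test_p_value(diff=0.1, sigma=10.0, n=100_000)

print(f"n=100:     p={p_small_n:.3f}")
print(f"n=100000:  p={p_large_n:.4f}")
```

The effect is identical in both cases. Only the sample size changes, yet the larger sample crosses the 0.05 threshold, so a small p-value alone says nothing about whether the difference matters biologically.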