
Phylogenetic analysis is a powerful tool for understanding evolutionary relationships among organisms. It uses molecular and morphological data to reconstruct evolutionary histories, revealing how species are related and how traits have evolved over time.

This topic covers key aspects of phylogenetic analysis, including tree representation, sequence alignment, substitution models, and tree inference methods. It also explores challenges such as long-branch attraction and incomplete lineage sorting, which can complicate accurate tree reconstruction.

Phylogenetic tree representation

  • Phylogenetic trees visually represent the evolutionary relationships among various biological entities (species, genes, or other taxa)
  • The branching patterns and lengths of the branches convey information about the degree of similarity and the evolutionary distance between the entities

Rooted vs unrooted trees

  • Rooted trees have a specific node designated as the common ancestor of all other nodes in the tree
    • The root node represents the earliest point in the evolutionary history of the group
  • Unrooted trees do not specify the location of the common ancestor
    • They only display the relative relationships among the entities without indicating the direction of evolution
  • An unrooted tree can be rooted by choosing a root position (midpoint rooting, outgroup rooting); conversely, a rooted tree becomes unrooted when its root is removed

Bifurcating vs multifurcating trees

  • Bifurcating trees have internal nodes that split into exactly two branches
    • They assume that speciation events give rise to two descendant lineages
  • Multifurcating trees have at least one internal node that splits into more than two branches
    • They represent scenarios where the evolutionary relationships are unresolved or where multiple speciation events occurred simultaneously (adaptive radiation, rapid diversification)
  • Bifurcating trees are more informative but require stronger assumptions about the evolutionary process
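The number of possible bifurcating topologies grows very quickly with the number of taxa, which is why tree searches rely on heuristics. A minimal sketch of the standard counts for labeled bifurcating trees, (2n-5)!! unrooted and (2n-3)!! rooted for n taxa; the function names are illustrative and not taken from any library.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating topologies for n_taxa labeled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa: int) -> int:
    """Number of rooted bifurcating topologies for n_taxa labeled tips."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Each unrooted tree with n tips has 2n-3 branches on which a root can be placed, which is why the rooted count is the unrooted count multiplied by 2n-3.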

Cladograms vs phylograms

  • Cladograms depict the branching order of the entities without considering branch lengths
    • They only convey the relative relationships among the taxa (monophyletic, paraphyletic, polyphyletic groups)
  • Phylograms include branch lengths proportional to the amount of evolutionary change or time
    • The branch lengths can represent the number of substitutions, genetic distance, or chronological time
  • Cladograms emphasize the topology of the tree, while phylograms provide additional information about the evolutionary distances
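Trees are commonly exchanged as Newick strings, where a phylogram carries ":length" annotations after each node and a cladogram omits them. The sketch below reduces a phylogram string to its cladogram by stripping the lengths. It assumes simple Newick with no quoted labels or comments, and the taxon names and lengths are made up.

```python
import re


def strip_branch_lengths(newick: str) -> str:
    """Remove ':<number>' branch-length annotations from a simple Newick string."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


phylogram = "((human:0.1,chimp:0.2):0.05,gorilla:0.3);"
cladogram = strip_branch_lengths(phylogram)
print(cladogram)  # ((human,chimp),gorilla);
```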

Sequence alignment for phylogenetics

  • Sequence alignment is a crucial step in phylogenetic analysis that arranges homologous residues from different sequences
  • Accurate alignment is essential for inferring evolutionary relationships and estimating phylogenetic trees

Global vs local alignment

  • Global alignment attempts to align the entire length of the sequences
    • It assumes that the sequences are related over their full length (conserved domains, orthologs)
  • Local alignment identifies regions of similarity within the sequences
    • It allows for gaps and focuses on aligning the most conserved regions (motifs, paralogs)
  • The choice between global and local alignment depends on the evolutionary relatedness and the presence of insertions/deletions
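The difference between the two strategies shows up clearly in the dynamic-programming scores. The following is a minimal sketch of Needleman-Wunsch (global) and Smith-Waterman (local) scoring with a linear gap penalty; the scoring values are arbitrary assumptions, and only the score is returned, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


long_seq, motif = "AAAACGTAAAA", "CGT"
print(alignment_score(long_seq, motif))              # -13
print(alignment_score(long_seq, motif, local=True))  # 3
```

The global score is dragged down by the eight gap positions needed to span the longer sequence, while the local score reflects only the shared three-base region.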

Progressive vs iterative alignment

  • Progressive alignment builds the multiple sequence alignment incrementally by aligning the most similar sequences first and then adding more distant sequences
    • It is computationally efficient but can propagate errors made in the early stages (guide tree, pairwise alignments)
  • Iterative alignment repeatedly refines the alignment by realigning subsets of sequences and optimizing a scoring function
    • It can correct mistakes made in the initial alignment but is more computationally intensive (consistency-based, hidden Markov models)
  • Iterative alignment methods generally produce more accurate alignments, especially for distantly related sequences

Multiple sequence alignment tools

  • ClustalW: Progressive alignment method that uses a guide tree based on pairwise sequence similarities
  • MUSCLE: Iterative alignment method that combines progressive and refinement stages
  • MAFFT: Rapid alignment method that uses fast Fourier transform to identify homologous regions
  • T-Coffee: Consistency-based alignment method that incorporates information from pairwise alignments
  • PRANK: Phylogeny-aware alignment method that models insertions and deletions separately

Substitution models in phylogenetics

  • Substitution models describe the process of character substitution over evolutionary time
  • They specify the rates at which different types of substitutions occur and the equilibrium frequencies of the characters

Nucleotide substitution models

  • Jukes-Cantor (JC69): Assumes equal base frequencies and equal substitution rates
  • Kimura 2-parameter (K80): Allows for different rates of transitions and transversions
  • Hasegawa-Kishino-Yano (HKY85): Incorporates unequal base frequencies and different transition/transversion rates
  • General time-reversible (GTR): Most complex model with six substitution rates and unequal base frequencies
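For the two simplest models, the corrected pairwise distance has a closed form. A minimal sketch, assuming aligned, gap-free sequences; the function names are illustrative. JC69 uses d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites, and K80 uses d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitions and transversions.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of differing sites between two aligned, gap-free sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    if p >= 0.75:
        raise ValueError("sequences are saturated under JC69")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(transitions: float, transversions: float) -> float:
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    a = 1.0 - 2.0 * transitions - transversions
    b = 1.0 - 2.0 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("sequences are saturated under K80")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


p = p_distance("ACGTACGTAC", "ACGTACGTAT")  # 1 difference in 10 sites
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as the raw proportion p, because it accounts for multiple substitutions at the same site. When transversions are exactly twice as frequent as transitions, the K80 distance reduces to the JC69 distance.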

Amino acid substitution models

  • Dayhoff: Empirical model based on observed amino acid replacements in closely related proteins
  • Jones-Taylor-Thornton (JTT): Derived from a larger dataset of protein families
  • Whelan and Goldman (WAG): Based on a broader range of globular protein families
  • Le and Gascuel (LG): Incorporates the variability of evolutionary rates across sites

Model selection criteria

  • Akaike information criterion (AIC): Balances the goodness of fit with the number of parameters in the model
  • Bayesian information criterion (BIC): Similar to AIC but penalizes complex models more heavily
  • Likelihood ratio test (LRT): Compares the fit of nested models using a chi-square distribution
  • Decision theory (DT): Selects the model that minimizes the expected loss of phylogenetic accuracy
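The first three criteria are simple functions of the maximized log-likelihood. A minimal sketch using AIC = 2k - 2 lnL, BIC = k ln(n) - 2 lnL, and the LRT statistic 2(lnL_alt - lnL_null). The log-likelihood values are hypothetical, and for simplicity k counts only substitution-model parameters, assuming the tree and branch lengths are shared across models.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion, with the alignment length as sample size."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Likelihood ratio test statistic for a nested pair of models."""
    return 2 * (lnl_alt - lnl_null)


# Hypothetical fits for a 1000-site alignment: model -> (lnL, free parameters).
fits = {"JC69": (-5230.0, 0), "HKY85": (-5105.0, 4), "GTR": (-5101.5, 8)}
n_sites = 1000

for model, (lnl, k) in fits.items():
    print(model, aic(lnl, k), round(bic(lnl, k, n_sites), 2))

print("LRT HKY85 vs GTR:", lrt_statistic(fits["HKY85"][0], fits["GTR"][0]))
```

With these made-up numbers, both AIC and BIC prefer HKY85. The LRT statistic of 7.0 for HKY85 against GTR, with 4 extra parameters, falls below the 5% chi-square critical value of about 9.49, so the simpler model would not be rejected.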

Phylogenetic tree inference methods

  • Phylogenetic tree inference methods reconstruct the evolutionary relationships among taxa based on molecular or morphological data
  • They differ in their assumptions, computational efficiency, and the optimality criterion used to evaluate the trees

Distance-based methods

  • Neighbor-joining (NJ): Agglomerative clustering algorithm that minimizes the total branch length of the tree
    • Computationally efficient and produces a single tree (UPGMA, BIONJ)
  • Minimum evolution (ME): Selects the tree with the smallest sum of branch lengths
    • Requires a heuristic search of the tree space (nearest neighbor interchange, subtree pruning and regrafting)
  • Least squares (LS): Minimizes the squared differences between the observed and expected distances
    • Can handle incomplete distance matrices and negative branch lengths (weighted LS, generalized LS)
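To make the neighbor-joining criterion concrete, the sketch below performs only the first NJ step: it computes Q(i, j) = (n - 2) d(i, j) - r_i - r_j from the row sums r, picks the pair with the smallest Q, and estimates the two branch lengths leading to the new internal node. The distance matrix is an illustrative example, and a full implementation would repeat this step on a reduced matrix until the tree is resolved.

```python
def nj_pick_pair(dist):
    """Return (i, j, limb_i, limb_j) for the first neighbor-joining step."""
    n = len(dist)
    if n < 3:
        raise ValueError("need at least 3 taxa")
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from taxa i and j to the new internal node.
    limb_i = dist[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    limb_j = dist[i][j] - limb_i
    return i, j, limb_i, limb_j


dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(dist))  # (0, 1, 2.0, 3.0)
```

Note that the pair joined first (taxa 0 and 1, distance 5) is not the closest pair in the matrix (taxa 3 and 4, distance 3). NJ corrects for how far each taxon is from all the others, which is what distinguishes it from simple clustering by smallest distance.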

Maximum parsimony

  • Maximum parsimony (MP) selects the tree that requires the fewest character state changes to explain the observed data
    • Assumes that evolution is parsimonious and that homoplasy (convergence, reversal, parallelism) is rare
  • MP is computationally intensive and may be inconsistent when the rates of evolution vary across lineages (long-branch attraction)
  • Weighted parsimony assigns different costs to different types of character state changes (step matrices, Sankoff parsimony)
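For equally weighted changes, the parsimony score of a single character on a given tree can be computed with Fitch's algorithm. A minimal sketch, assuming a rooted binary tree written as nested 2-tuples of tip names; the tree encoding and tip states are illustrative.

```python
def fitch(tree, states):
    """Return (candidate_states, n_changes) for one character on a rooted binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed here.
        return shared, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


states = {"A": "G", "B": "T", "C": "G", "D": "T"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(fitch(tree1, states)[1])  # 2
print(fitch(tree2, states)[1])  # 1
```

For this character, the second tree needs one change instead of two, so parsimony prefers it. A full analysis sums the score over all characters and searches over topologies.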

Maximum likelihood

  • Maximum likelihood (ML) finds the tree topology, branch lengths, and substitution model parameters that maximize the probability of observing the data
    • Assumes that the substitution process follows a Markov model and that the characters evolve independently
  • ML is statistically consistent when the model is correctly specified and can accommodate complex substitution models (rate heterogeneity, partitioned analysis)
  • The likelihood surface may have multiple optima, requiring heuristic search algorithms (hill-climbing, genetic algorithms)
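The smallest possible ML problem is a "tree" of two sequences joined by one branch. The sketch below, assuming the JC69 model and site counts rather than real sequences, evaluates the log-likelihood on a grid of branch lengths and returns the best one. The grid search stands in for the numerical optimizers that real programs use.

```python
import math


def jc69_log_likelihood(d: float, n_same: int, n_diff: int) -> float:
    """Log-likelihood of two aligned sequences separated by distance d under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability a site keeps the same base
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    # The factor 0.25 is the equilibrium frequency of the base in the first sequence.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_distance(n_same: int, n_diff: int, step: float = 0.001, upper: float = 2.0) -> float:
    """Grid-search estimate of the branch length that maximizes the likelihood."""
    grid = [step * k for k in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))


estimate = ml_distance(n_same=90, n_diff=10)
analytic = -0.75 * math.log(1.0 - 4.0 * 0.1 / 3.0)
print(estimate, analytic)
```

The grid estimate agrees with the closed-form JC69 distance to within the grid spacing. With more taxa there is no closed form, and the search must also cover topologies.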

Bayesian inference

  • Bayesian inference (BI) combines the prior probability of a tree with the likelihood of the data to estimate the posterior probability distribution of trees
    • The prior distribution incorporates prior knowledge about the tree topology, branch lengths, and substitution model parameters
  • BI uses Markov chain Monte Carlo (MCMC) algorithms to sample trees from the posterior distribution (Metropolis-Hastings, Gibbs sampling)
  • The posterior probabilities of clades can be interpreted as the probability that the clade is true given the data and the model (credible sets, majority-rule consensus)
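The MCMC idea can be illustrated on a toy problem that fixes the topology (two sequences) and samples only the branch length. This is a minimal Metropolis sketch, assuming the JC69 likelihood, an exponential prior with an arbitrary rate of 10, and a symmetric uniform proposal. Real Bayesian phylogenetics programs also propose changes to the topology and model parameters.

```python
import math
import random


def log_posterior(d, n_same, n_diff, prior_rate=10.0):
    """Unnormalized log posterior of branch length d (JC69 likelihood, exponential prior)."""
    if d <= 0:
        return float("-inf")
    e = math.exp(-4.0 * d / 3.0)
    log_lik = n_same * math.log(0.25 + 0.75 * e) + n_diff * math.log(0.25 - 0.25 * e)
    return log_lik - prior_rate * d


def metropolis(n_same, n_diff, n_steps=20000, burn_in=2000, step_size=0.05, seed=1):
    """Sample branch lengths from the posterior with a symmetric random-walk proposal."""
    rng = random.Random(seed)
    d = 0.1
    current = log_posterior(d, n_same, n_diff)
    samples = []
    for step in range(n_steps):
        proposal = d + rng.uniform(-step_size, step_size)
        candidate = log_posterior(proposal, n_same, n_diff)
        # Accept with probability min(1, posterior ratio).
        if candidate >= current or rng.random() < math.exp(candidate - current):
            d, current = proposal, candidate
        if step >= burn_in:
            samples.append(d)
    return samples


samples = metropolis(n_same=90, n_diff=10)
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 3))
```

The sampled values approximate the posterior distribution of the branch length, so summaries such as the mean or a credible interval are read directly from the samples.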

Assessing phylogenetic tree reliability

  • Assessing the reliability of phylogenetic trees is crucial for determining the confidence in the inferred relationships
  • Several methods are available to quantify the support for individual clades or the overall tree topology

Bootstrap analysis

  • Bootstrap analysis estimates the sampling variance of the estimated tree by creating pseudo-replicate datasets
    • The original dataset is randomly sampled with replacement to generate multiple bootstrap datasets of the same size
  • The tree inference method is applied to each bootstrap dataset, and the proportion of trees that contain a particular clade is the bootstrap support value
  • Bootstrap values range from 0 to 100% and indicate the robustness of the clades to sampling error (70% cutoff for strong support)
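The resampling and counting steps can be sketched as follows. The alignment is stored as a dictionary from taxon name to sequence, and each inferred replicate tree is represented as a set of clades (frozensets of taxon names). Both representations are assumptions made for illustration, and the tree inference step itself is left out.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in alignment.items()}


def clade_support(replicate_clades, clade):
    """Percentage of replicate trees that contain the given clade."""
    hits = sum(clade in clades for clades in replicate_clades)
    return 100.0 * hits / len(replicate_clades)


alignment = {"t1": "ACGTAC", "t2": "ACGAAC", "t3": "TCGAAG"}
replicate = bootstrap_replicate(alignment, random.Random(0))
print(replicate)

# Hypothetical clade sets from four replicate trees.
ab = frozenset({"t1", "t2"})
trees = [{ab}, {ab}, {frozenset({"t2", "t3"})}, {ab}]
print(clade_support(trees, ab))  # 75.0
```

The same column indices are applied to every taxon, so each resampled site remains a genuine column of the original alignment.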

Jackknife resampling

  • Jackknife resampling is similar to bootstrapping but creates pseudo-replicate datasets by randomly omitting a proportion of the original data
    • The omitted data can be characters (delete-half jackknife) or taxa (delete-one jackknife)
  • The jackknife support values are calculated as the proportion of jackknife replicates that recover a particular clade
  • Jackknifing is less commonly used than bootstrapping but can be useful for detecting influential characters or taxa

Posterior probability support

  • Posterior probability (PP) support values are obtained from Bayesian inference and represent the probability of a clade given the data and the model
    • PP values range from 0 to 1 and are interpreted as the probability that the clade is true
  • PP values are generally higher than bootstrap values and may overestimate the support for short internodes (Bayesian star-tree paradox)
  • Corrections for PP values have been proposed to account for model misspecification and incomplete lineage sorting (gene tree-species tree discordance)

Applications of phylogenetic analysis

  • Phylogenetic analysis has diverse applications in evolutionary biology, systematics, and comparative genomics
  • Phylogenetic trees serve as a framework for understanding the evolution of traits, genes, and species

Species tree reconstruction

  • Species trees depict the evolutionary relationships among species and can be inferred from multiple gene trees
    • Concatenation methods combine multiple gene alignments into a supermatrix and infer a single tree (maximum likelihood, Bayesian inference)
  • Coalescent-based methods account for the discordance between gene trees and the species tree due to incomplete lineage sorting (BEST, *BEAST, ASTRAL)
  • Species trees are used to study speciation, biogeography, and character evolution at the macroevolutionary scale

Gene tree inference

  • Gene trees represent the evolutionary history of individual genes and can differ from the species tree due to gene duplication, loss, and horizontal transfer
    • Reconciliation methods map gene trees onto a species tree and infer the evolutionary events that explain the discordance (Notung, AnGST, Mowgli)
  • Gene trees are used to study the evolution of gene families, identify orthologous and paralogous genes, and detect selection at the molecular level

Molecular clock analysis

  • Molecular clock analysis estimates the timing of evolutionary events based on the assumption that the rate of molecular evolution is approximately constant over time
    • Strict clock models assume a single rate across all lineages, while relaxed clock models allow the rate to vary (lognormal, exponential, random local clocks)
  • Calibration points from the fossil record or biogeographic events are used to convert the branch lengths into absolute time (node dating, tip dating)
  • Molecular clock analysis is used to study the tempo and mode of evolution, date the origin of lineages, and reconstruct ancestral characters
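Under a strict clock, the arithmetic of calibration is short. Two lineages that split t years ago each accumulate changes at rate r, so their pairwise distance is d = 2rt. The sketch below calibrates r from one dated split and uses it to date another; all numbers are hypothetical.

```python
def calibrate_rate(distance: float, split_age: float) -> float:
    """Per-lineage rate from a pairwise distance and a known split age (d = 2 * r * t)."""
    return distance / (2.0 * split_age)


def divergence_time(distance: float, rate: float) -> float:
    """Split age implied by a pairwise distance under a strict clock."""
    return distance / (2.0 * rate)


# Hypothetical calibration: 0.10 substitutions/site between taxa that split 5 My ago.
rate = calibrate_rate(distance=0.10, split_age=5.0)
print(rate)

# Dating an uncalibrated pair separated by 0.24 substitutions/site.
print(divergence_time(distance=0.24, rate=rate))
```

Here the calibrated rate is about 0.01 substitutions per site per million years per lineage, which places the second split at roughly 12 million years ago. Relaxed clock models replace the single rate with a distribution of rates across branches.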

Ancestral state reconstruction

  • Ancestral state reconstruction infers the character states of extinct ancestors based on the character states of extant taxa and the phylogenetic tree
    • Parsimony-based methods minimize the number of character state changes along the tree (accelerated transformation, delayed transformation)
  • Likelihood-based methods estimate the probabilities of different character states at each node under a continuous-time Markov model (Mk model, threshold model)
  • Ancestral state reconstruction is used to study the evolution of morphological, ecological, and behavioral traits, as well as the origin and loss of complex characters

Challenges in phylogenetic analysis

  • Phylogenetic analysis faces several challenges that can affect the accuracy and reliability of the inferred trees
  • These challenges arise from the complexity of the evolutionary process, the limitations of the available data, and the assumptions of the methods

Long-branch attraction

  • Long-branch attraction (LBA) is a systematic error that occurs when rapidly evolving lineages are artificially grouped together in the inferred tree
    • LBA is caused by the accumulation of homoplasies (convergent, parallel, or reversed changes) in fast-evolving lineages
  • LBA can be mitigated by using more realistic substitution models, removing fast-evolving sites, or breaking up long branches with additional taxa (long-branch subdivision)
  • LBA is a common problem in phylogenetic analysis and can lead to incorrect conclusions about the relationships among taxa

Incomplete lineage sorting

  • Incomplete lineage sorting (ILS) occurs when ancestral polymorphisms are not completely sorted among descendant lineages, leading to discordance between gene trees and the species tree
    • ILS is more likely to occur when the time between speciation events is short relative to the effective population size
  • ILS can be accounted for by using coalescent-based methods that model the probability of gene tree-species tree discordance (multispecies coalescent model)
  • ILS is a major source of gene tree heterogeneity and can affect the accuracy of species tree inference and divergence time estimation
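For three species with one sampled lineage each, the multispecies coalescent gives a simple expression for how often ILS produces a discordant gene tree: P = (2/3) exp(-T), where T is the length of the internal species-tree branch in coalescent units. The sketch below assumes diploid organisms, so T is the number of generations divided by 2Ne; the parameter values are made up.

```python
import math


def discordance_probability(generations: float, effective_size: float) -> float:
    """P(gene tree topology differs from species tree) for three species under ILS."""
    t = generations / (2.0 * effective_size)  # internal branch in coalescent units
    return (2.0 / 3.0) * math.exp(-t)


for generations in (10_000, 100_000, 1_000_000):
    print(generations, round(discordance_probability(generations, effective_size=50_000), 4))
```

Short internal branches or large populations push the probability toward its maximum of 2/3, at which point all three gene tree topologies are equally likely.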

Horizontal gene transfer

  • Horizontal gene transfer (HGT) is the transfer of genetic material between organisms that are not in a parent-offspring relationship
    • HGT is common in prokaryotes and can also occur in eukaryotes (endosymbiotic gene transfer, viral integration)
  • HGT can lead to discordance between gene trees and the species tree and can affect the inference of phylogenetic relationships and evolutionary events
  • HGT can be detected by comparing the topology of gene trees to the species tree and identifying statistically supported incongruences (reconciliation, network methods)

Compositional heterogeneity

  • Compositional heterogeneity refers to the variation in nucleotide or amino acid composition across taxa or sites
    • Compositional heterogeneity can arise from differences in mutation bias, selection pressure, or GC content
  • Compositional heterogeneity can lead to the grouping of taxa with similar composition rather than true evolutionary relationships (compositional attraction)
  • Compositional heterogeneity can be accounted for by using more complex substitution models that allow for variation in equilibrium frequencies (CAT model, mixture models)
  • Failure to account for compositional heterogeneity can result in biased tree estimates and incorrect inferences about evolutionary processes
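A quick first check for compositional heterogeneity across taxa is to compare GC content per sequence. This is a minimal sketch with made-up sequences. It ignores gaps and ambiguity codes and is only a descriptive summary, not a statistical test of heterogeneity.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases of a nucleotide sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)


alignment = {
    "taxon_a": "ATGCATATTA",
    "taxon_b": "ATGCGCGCCG",
    "taxon_c": "ATG-ATNTTA",
}
for taxon, seq in alignment.items():
    print(taxon, round(gc_content(seq), 2))
```

Taxa whose values differ sharply from the rest are candidates for compositional attraction and may warrant a model that allows equilibrium frequencies to vary.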
© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.