Phylogenetic tree construction is a crucial aspect of molecular evolution studies. It involves using various methods to infer evolutionary relationships between organisms based on genetic data. These methods can be broadly categorized into distance-based and character-based approaches, each with its own strengths and limitations.
Understanding the different tree construction methods is essential for interpreting evolutionary relationships accurately. From simple algorithms like UPGMA to more complex approaches like maximum likelihood and Bayesian inference, each method offers unique insights into the evolutionary history of organisms. Assessing tree reliability and interpreting results are key skills in molecular phylogenetics.
Phylogenetic Tree Construction Methods
Distance-Based and Character-Based Methods
Phylogenetic trees are constructed using either distance-based or character-based methods, each with its own advantages and limitations
Distance-based methods calculate pairwise distances between sequences to construct a tree (neighbor-joining, UPGMA)
These methods are computationally efficient but may not always produce the most accurate tree topology
Neighbor-joining is a bottom-up clustering algorithm that minimizes the total branch length at each stage of tree construction
UPGMA assumes a constant rate of evolution and produces rooted trees
Character-based methods use discrete characters or character states to infer the most likely tree topology (maximum parsimony, maximum likelihood)
These methods are more computationally intensive but can produce more accurate trees
Maximum parsimony selects the tree that requires the fewest evolutionary changes to explain the observed data
Maximum likelihood selects the tree topology, branch lengths, and model parameters that maximize the probability of observing the data under a specified evolutionary model
Bayesian inference incorporates prior probabilities and calculates the posterior probability of a tree given the data and the model
Advantages and Limitations of Different Methods
Distance-based methods are computationally efficient and can handle large datasets, but they may lose information by reducing sequences to pairwise distances
They are sensitive to the choice of evolutionary model and may not recover the correct tree topology if the model assumptions are violated
Neighbor-joining does not assume a constant rate of evolution, making it more flexible than UPGMA
UPGMA is sensitive to unequal rates of evolution and can produce incorrect trees if this assumption is violated
Character-based methods use more information from the sequences and can produce more accurate trees, but they are computationally intensive and may be sensitive to model choice
Maximum parsimony does not explicitly model evolutionary processes and may be misled by homoplasy (convergent evolution, reversals, and parallel evolution)
Maximum likelihood accounts for different rates of evolution and provides a statistical framework for model selection and hypothesis testing
Bayesian inference incorporates prior knowledge and quantifies uncertainty in tree estimates, but it requires specifying prior distributions and can be computationally demanding
Constructing Phylogenetic Trees
Input Data and Evolutionary Models
Distance-based methods require a distance matrix as input, which is calculated from pairwise sequence alignments using a specific evolutionary model (Jukes-Cantor, Kimura 2-parameter)
The choice of evolutionary model affects the estimated distances and the resulting tree topology
Models differ in their assumptions about nucleotide frequencies, substitution rates, and rate variation among sites
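The two models named above can be illustrated with a short sketch. It assumes two pre-aligned nucleotide sequences of equal length and skips any site that is not A, C, G, or T in both sequences; the function names are illustrative and do not come from any particular package.

```python
import math

PURINES = {"A", "G"}

def _valid_pairs(seq1, seq2):
    """Aligned site pairs where both sequences have an unambiguous nucleotide."""
    return [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
            if a in "ACGT" and b in "ACGT"]

def p_distance(seq1, seq2):
    """Observed proportion of differing sites (no correction for multiple hits)."""
    pairs = _valid_pairs(seq1, seq2)
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3).

    Raises ValueError when p >= 0.75 (saturated sequences).
    """
    p = p_distance(seq1, seq2)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def kimura_2p(seq1, seq2):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    pairs = _valid_pairs(seq1, seq2)
    n = len(pairs)
    # Transition: a change within purines (A<->G) or within pyrimidines (C<->T)
    P = sum(a != b and (a in PURINES) == (b in PURINES) for a, b in pairs) / n
    # Transversion: a change between a purine and a pyrimidine
    Q = sum((a in PURINES) != (b in PURINES) for a, b in pairs) / n
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))

# One transition (C -> T) in ten sites
s1, s2 = "ACGTACGTAC", "ACGTACGTAT"
print(p_distance(s1, s2), jukes_cantor(s1, s2), kimura_2p(s1, s2))
```

Both corrected distances are larger than the raw proportion of differences because they allow for multiple substitutions at the same site.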
Character-based methods require a multiple sequence alignment as input, where each position in the alignment represents a character
The quality of the alignment affects the accuracy of the resulting tree
Evolutionary models are used to calculate the likelihood of the data given a tree topology and to estimate branch lengths
Models for character-based methods include nucleotide substitution models (GTR, HKY), amino acid substitution models (WAG, LG), and codon models (Goldman-Yang)
Tree Searching and Optimization Algorithms
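Neighbor-joining starts with a star-like tree and iteratively joins the pair of taxa that most reduces the estimated total tree length (the pair minimizing the Q criterion, which is not necessarily the least distant pair), then recomputes distances to the new internal node
The algorithm is fast and recovers the correct tree when the distances are additive (tree-like), but as a greedy method it is not guaranteed to find the tree with the smallest total branch length for an arbitrary distance matrix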
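A single neighbor-joining step can be sketched as follows. The sketch assumes a symmetric distance matrix for at least three taxa and uses the standard selection criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k); the function name and the five-taxon matrix are illustrative.

```python
def nj_step(names, dist):
    """Pick the pair to join and compute the quantities for one NJ iteration."""
    n = len(names)
    totals = [sum(row) for row in dist]  # row sums of the distance matrix
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    # Branch lengths from the joined taxa to the new internal node u
    len_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    # Distances from u to every remaining taxon
    to_new = {names[k]: 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
              for k in range(n) if k not in (i, j)}
    return names[i], names[j], q, len_i, len_j, to_new

names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_step(names, dist))
```

In this matrix the smallest raw distance is between d and e (3), yet the step joins a and b because that pair has the lowest Q value. A full run would replace a and b with the new node and repeat until three nodes remain.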
UPGMA creates a rooted tree by successively clustering the least distant pairs of taxa, assuming a molecular clock (constant rate of evolution)
The algorithm is simple and fast but may produce incorrect trees if the molecular clock assumption is violated
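The clustering procedure can be sketched in a few lines. This version assumes a symmetric distance matrix, represents the tree as nested tuples, and places each new node at half the distance between the clusters it joins; the names and the example matrix are illustrative.

```python
from itertools import combinations

def upgma(names, dist):
    """Return (nested-tuple tree, root height) built by UPGMA."""
    # Active clusters: id -> (subtree, number of leaves, node height)
    clusters = {i: (names[i], 1, 0.0) for i in range(len(names))}
    d = {frozenset(p): float(dist[p[0]][p[1]])
         for p in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # least distant pair of clusters
        i, j = sorted(pair)
        tree_i, size_i, _ = clusters.pop(i)
        tree_j, size_j, _ = clusters.pop(j)
        height = d.pop(pair) / 2.0        # molecular clock: node sits at half the distance
        for m in clusters:
            d_im = d.pop(frozenset((i, m)))
            d_jm = d.pop(frozenset((j, m)))
            # Size-weighted mean, equal to the average over all leaf pairs
            d[frozenset((next_id, m))] = (size_i * d_im + size_j * d_jm) / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j, height)
        next_id += 1
    tree, _, height = next(iter(clusters.values()))
    return tree, height

names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 17, 21, 31, 23],
    [17, 0, 30, 34, 21],
    [21, 30, 0, 28, 39],
    [31, 34, 28, 0, 43],
    [23, 21, 39, 43, 0],
]
tree, root_height = upgma(names, dist)
print(tree, root_height)
```

Every leaf ends up at the same distance from the root, which is exactly the constant-rate assumption built into the method.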
Maximum parsimony searches for the tree that minimizes the total number of character state changes (mutations) required to explain the observed data
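Exhaustive searches are computationally infeasible for large datasets, and branch-and-bound, although exact, is practical only for modest numbers of taxa, so heuristic searches based on branch swapping (nearest neighbor interchange, subtree pruning and regrafting, tree bisection and reconnection) are used to find the most parsimonious tree(s)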
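Scoring one candidate tree is the inner step of any parsimony search. A minimal sketch using Fitch's algorithm is shown here; it assumes a rooted binary tree written as nested tuples, unordered character states, and equal cost for every change. The taxon names and the toy alignment are made up for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one character."""
    if isinstance(tree, str):               # leaf: observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_score = fitch(left, states)
    right_set, right_score = fitch(right, states)
    common = left_set & right_set
    if common:                              # children agree: no extra change
        return common, left_score + right_score
    return left_set | right_set, left_score + right_score + 1

def parsimony_score(tree, alignment):
    """Total number of changes summed over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[k] for taxon, seq in alignment.items()})[1]
        for k in range(length)
    )

alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CGT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

The first topology needs fewer changes, so parsimony prefers it. A search procedure repeats this scoring over many candidate topologies.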
Maximum likelihood calculates the probability of observing the data given a tree topology and an evolutionary model, selecting the tree with the highest likelihood
For a given topology, branch lengths and model parameters are optimized using numerical methods (Newton-Raphson, expectation-maximization), while tree space is explored with heuristic searches (nearest neighbor interchange, subtree pruning and regrafting)
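The likelihood of a single alignment column on a fixed tree can be computed with Felsenstein's pruning algorithm. The sketch below assumes the Jukes-Cantor model with equal base frequencies and branch lengths in expected substitutions per site; the tree encoding (a leaf is its observed nucleotide, an internal node is a list of `(child, branch_length)` pairs) is an illustrative choice.

```python
import math

NUCS = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor probability of going from state i to state j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node):
    """Conditional likelihood of the data below `node` for each possible state at `node`."""
    if isinstance(node, str):               # leaf: observed nucleotide
        return {x: 1.0 if x == node else 0.0 for x in NUCS}
    result = {x: 1.0 for x in NUCS}
    for child, t in node:
        child_like = partial(child)
        for x in NUCS:
            result[x] *= sum(jc_prob(x, y, t) * child_like[y] for y in NUCS)
    return result

def site_likelihood(tree):
    """Sum over root states, weighted by the equal base frequencies (1/4 each)."""
    return sum(0.25 * value for value in partial(tree).values())

# ((A:0.1, A:0.2):0.05, C:0.3) for one alignment column
tree = [([("A", 0.1), ("A", 0.2)], 0.05), ("C", 0.3)]
print(site_likelihood(tree))
```

The likelihood of a whole alignment is the product of the per-site values (in practice the sum of their logarithms), and optimization adjusts the branch lengths and topology to maximize it.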
Bayesian inference combines the likelihood of the data with prior probabilities to calculate the posterior probability of a tree, often using Markov Chain Monte Carlo (MCMC) sampling to explore the tree space
MCMC algorithms (Metropolis-Hastings, Gibbs sampling) generate a sample of trees from the posterior distribution, which can be summarized to estimate tree topology, branch lengths, and support values
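Sampling over tree topologies is too involved for a short example, so the sketch below applies the Metropolis algorithm to a much simpler problem: the posterior of a single Jukes-Cantor distance between two sequences. It assumes an exponential prior on the distance and a sliding-window proposal reflected at zero; the prior rate, window width, and site counts are arbitrary illustrative values.

```python
import math
import random

def log_posterior(d, n_sites, n_diff, prior_rate):
    """Unnormalized log posterior of distance d under Jukes-Cantor and an exponential prior."""
    p_diff = 0.75 * -math.expm1(-4.0 * d / 3.0)   # probability that a site differs
    p_same = 1.0 - p_diff
    log_like = n_diff * math.log(p_diff) + (n_sites - n_diff) * math.log(p_same)
    return log_like - prior_rate * d

def metropolis(n_sites, n_diff, prior_rate=10.0, steps=20000, window=0.05, seed=1):
    rng = random.Random(seed)
    d = 0.1
    current = log_posterior(d, n_sites, n_diff, prior_rate)
    samples = []
    for _ in range(steps):
        # Symmetric proposal: uniform window around d, reflected at zero
        proposal = abs(d + rng.uniform(-window, window))
        if proposal > 0.0:
            candidate = log_posterior(proposal, n_sites, n_diff, prior_rate)
            # Accept with probability min(1, posterior ratio)
            if rng.random() < math.exp(min(0.0, candidate - current)):
                d, current = proposal, candidate
        samples.append(d)
    return samples[steps // 10:]                  # discard the first 10% as burn-in

samples = metropolis(n_sites=100, n_diff=10)
posterior_mean = sum(samples) / len(samples)
print(posterior_mean)
```

Phylogenetic MCMC software follows the same accept-or-reject logic but also proposes changes to the topology and to other model parameters, and summarizes the sampled trees rather than a single number.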
Phylogenetic Tree Reliability
Statistical Measures of Branch Support
Bootstrapping is a resampling technique used to estimate the reliability of tree branches by creating pseudoreplicates of the original dataset (alignment columns sampled with replacement) and calculating the proportion of pseudoreplicate trees in which each branch is recovered
Branches with high bootstrap support (>70%) are considered more reliable than those with low support
Bootstrapping is computationally intensive and may not always provide an accurate measure of branch support, especially for small datasets or short branches
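The resampling step can be sketched as follows. To keep the example short, the "tree" inferred from each pseudoreplicate is reduced to a single question, namely which two taxa are closest by raw proportion of differences; that stand-in is an assumption of this sketch, and a real analysis would rebuild a full tree from every pseudoreplicate. All names are illustrative.

```python
import random
from itertools import combinations

def resample_columns(alignment, rng):
    """Build one pseudoreplicate by sampling alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

def closest_pair(alignment):
    """Stand-in for tree building: the pair of taxa with the fewest differences."""
    def p_dist(a, b):
        return sum(x != y for x, y in zip(a, b)) / len(a)
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: p_dist(alignment[pair[0]], alignment[pair[1]]))

def bootstrap_support(alignment, pair, replicates=200, seed=0):
    """Proportion of pseudoreplicates in which `pair` is recovered as the closest pair."""
    rng = random.Random(seed)
    target = tuple(sorted(pair))
    hits = sum(closest_pair(resample_columns(alignment, rng)) == target
               for _ in range(replicates))
    return hits / replicates

alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAC",
    "C": "CGTACGTACG",
    "D": "GTACGTACGT",
}
print(bootstrap_support(alignment, ("A", "B")))
```

Because each pseudoreplicate has the same length as the original alignment, some columns appear several times and others not at all, which is what perturbs the signal supporting each grouping.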
Jackknife analysis is similar to bootstrapping but builds each pseudoreplicate by removing a proportion of the data (commonly 50% of the characters) without replacement
Each jackknife pseudoreplicate is smaller than the original dataset, so replicates can be faster to analyze, but support estimates may be less accurate because less data is used in each one
Decay index (Bremer support) measures the number of additional evolutionary steps required to collapse a branch in a maximum parsimony tree
Higher decay indices indicate stronger support for a branch, as more evidence is needed to contradict it
Decay indices are calculated by searching for the shortest trees that do not contain a particular branch and comparing their lengths to the most parsimonious tree
Posterior probabilities in Bayesian inference indicate the probability that a clade is correct given the data, the model, and the priors
Posterior probabilities are interpreted differently from bootstrap values and tend to be higher than bootstrap values for the same branch
Posterior probabilities can be sensitive to model choice and prior specifications, so they should be interpreted cautiously
Assessing Tree Fit and Model Selection
Consistency index (CI) and retention index (RI) are used to assess the fit of a maximum parsimony tree to the data
CI measures the amount of homoplasy in the tree, with higher values indicating less homoplasy and a better fit
RI measures the proportion of synapomorphy retained in the tree, with higher values indicating a better fit
Both indices range from 0 to 1, with 1 indicating a perfect fit and no homoplasy
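Both indices are simple ratios of step counts. The sketch below uses the usual definitions, CI = m / s and RI = (g - s) / (g - m), where m is the minimum number of changes the characters could require on any tree, s is the number observed on the tree being evaluated, and g is the maximum number they could require on any tree. The numbers in the example are invented.

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s; equals 1 when the tree requires no extra (homoplastic) changes."""
    return min_steps / observed_steps

def retention_index(min_steps, observed_steps, max_steps):
    """RI = (g - s) / (g - m); undefined when g == m (uninformative characters)."""
    return (max_steps - observed_steps) / (max_steps - min_steps)

# Hypothetical dataset: m = 10, s = 15, g = 30
print(consistency_index(10, 15), retention_index(10, 15, 30))
```

Here five of the fifteen observed changes are extra steps attributable to homoplasy, giving a CI of about 0.67 and an RI of 0.75.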
Likelihood ratio tests can be used to compare the fit of different evolutionary models to the data and select the best-fitting model for tree construction
The likelihood ratio test compares the likelihoods of two nested models, with the more complex model having additional parameters
If the likelihood improvement of the more complex model is statistically significant, it is preferred over the simpler model
Model selection criteria (Akaike information criterion, Bayesian information criterion) balance the fit of the model with its complexity, favoring models that explain the data well without overfitting
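These comparisons reduce to a few lines of arithmetic once the maximized log-likelihoods are known. The sketch below assumes two nested models that differ by exactly one free parameter, so the test statistic can be compared with a chi-square distribution with one degree of freedom using only the standard library. The log-likelihood values, parameter counts, and alignment length are invented for illustration.

```python
import math

def lrt_statistic(lnl_simple, lnl_complex):
    """Likelihood ratio statistic: 2 * (lnL_complex - lnL_simple)."""
    return 2.0 * (lnl_complex - lnl_simple)

def chi2_pvalue_df1(stat):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def aic(lnl, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * lnl

def bic(lnl, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2.0 * lnl

# Hypothetical fits to a 1000-site alignment; the complex model adds one parameter
lnl_simple, k_simple = -1005.0, 4
lnl_complex, k_complex = -1000.0, 5

stat = lrt_statistic(lnl_simple, lnl_complex)
print(stat, chi2_pvalue_df1(stat))
print(aic(lnl_simple, k_simple), aic(lnl_complex, k_complex))
print(bic(lnl_simple, k_simple, 1000), bic(lnl_complex, k_complex, 1000))
```

With these numbers the extra parameter is justified by all three measures. Nested models that differ by more than one parameter need a chi-square distribution with the corresponding degrees of freedom.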
Interpreting Phylogenetic Trees
Evolutionary Relationships and Patterns
The branching pattern (topology) of a phylogenetic tree reflects the evolutionary relationships among taxa
Taxa that share a more recent common ancestor are more closely related than those with a more distant common ancestor
Monophyletic groups (clades) consist of an ancestor and all its descendants, and are supported by shared derived characters (synapomorphies)
Paraphyletic groups include an ancestor but not all of its descendants, while polyphyletic groups bring together taxa from separate lineages and exclude their most recent common ancestor
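Whether a set of taxa forms a clade can be checked directly from the topology. The sketch below assumes a rooted tree written as nested tuples with taxon names as leaves; the function names and the example tree are illustrative.

```python
def leaf_set(tree):
    """All taxon names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def clades(tree):
    """Leaf sets of every node in the tree, including single leaves and the root."""
    found = {leaf_set(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= clades(child)
    return found

def is_monophyletic(tree, taxa):
    """True if some node has exactly `taxa` as its descendants."""
    return frozenset(taxa) in clades(tree)

tree = ((("human", "chimp"), "gorilla"), ("mouse", "rat"))
print(is_monophyletic(tree, {"human", "chimp"}))    # True
print(is_monophyletic(tree, {"human", "gorilla"}))  # False
```

The second group fails the test because the smallest clade containing human and gorilla also contains chimp, which makes the group paraphyletic on this tree.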
Branch lengths in a phylogenetic tree represent the amount of evolutionary change (typically the expected number of substitutions per site) along each branch
Longer branches indicate more evolutionary change and a greater genetic distance between taxa
Branch lengths can be used to estimate divergence times and rates of evolution, but they are affected by the choice of evolutionary model and the presence of rate variation among lineages
Rooted trees have a specific node designated as the root, representing the common ancestor of all taxa in the tree
The root determines the direction of evolutionary change and the relative ages of lineages
Unrooted trees do not specify the position of the root and only depict the relative relationships among taxa
Outgroup taxa are used to root a tree and determine the direction of character state changes
An outgroup is a taxon that is known to be less closely related to the ingroup taxa than they are to each other
The outgroup is used to polarize character states, with the state present in the outgroup considered the ancestral state
Inferring Evolutionary Events and Processes
Polytomies (multifurcations) in a tree indicate uncertainty in the branching order or rapid diversification events
Hard polytomies represent simultaneous divergence of multiple lineages, while soft polytomies result from insufficient data to resolve the branching order
Polytomies can be resolved by adding more data (characters or taxa) or using more sophisticated phylogenetic methods
Convergent evolution can be inferred when distantly related taxa share similar character states due to similar selective pressures
Convergent characters (homoplasies) can mislead phylogenetic analyses and should be identified and accounted for
Homoplasy can be detected by comparing the fit of alternative tree topologies or by examining character state distributions across the tree
Horizontal gene transfer can be detected when a gene tree topology differs significantly from the species tree topology
Horizontal gene transfer is common in prokaryotes and can result in discordance between gene trees and species trees
Phylogenetic network methods can be used to visualize and quantify horizontal gene transfer events
Reconciliation methods can be used to infer the history of gene duplications, losses, and transfers that explain the discordance between gene and species trees
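A first look at gene tree and species tree discordance can be taken by listing the clades that appear in one tree but not the other. The sketch below assumes both trees are rooted, written as nested tuples, and contain the same taxa; it is a simple clade comparison in the spirit of the Robinson-Foulds distance, not a reconciliation method, and the taxon names are placeholders.

```python
def leaf_set(tree):
    """All taxon names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def internal_clades(tree):
    """Leaf sets of all internal nodes (single leaves are ignored)."""
    if isinstance(tree, str):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= internal_clades(child)
    return found

def conflicting_clades(gene_tree, species_tree):
    """Clades present only in the gene tree and only in the species tree."""
    gene, species = internal_clades(gene_tree), internal_clades(species_tree)
    return gene - species, species - gene

species_tree = ((("A", "B"), "C"), "D")
gene_tree = ((("A", "D"), "B"), "C")

gene_only, species_only = conflicting_clades(gene_tree, species_tree)
print(sorted(sorted(c) for c in gene_only))
print(sorted(sorted(c) for c in species_only))
```

In this toy case the gene tree groups A with D, which the species tree does not. Such conflict is only a starting point: it can also arise from estimation error or incomplete lineage sorting, so a transfer event should be inferred with dedicated network or reconciliation methods.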