3.4 Database searching using BLAST and other tools
5 min read • August 14, 2024
Database searching is a crucial tool in computational biology, allowing researchers to compare sequences and find similarities. BLAST, the most popular tool, uses several related algorithms to search massive databases, helping identify potential relationships between genes or proteins.
Understanding BLAST results is key to interpreting biological significance. E-values, bit scores, and alignment quality all play a role in determining the relevance of matches. Different BLAST algorithms cater to specific search needs, from DNA to protein comparisons.
Database Searching for Sequence Analysis
Principles and Applications
Database searching identifies similarities between biological sequences (DNA, RNA, or protein) to infer functional, structural, or evolutionary relationships
Homology detection identifies sequences sharing a common evolutionary ancestor, providing insights into the function and structure of uncharacterized sequences
The basic principle involves comparing a query sequence against a database of known sequences using algorithms that assess similarity based on sequence alignment
Enables researchers to annotate newly sequenced genomes, identify potential drug targets, study evolutionary relationships, and discover novel genes or protein families
Effectiveness depends on factors such as the size and quality of the database, the choice of search algorithm and parameters, and the evolutionary distance between the query and target sequences
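The basic principle above, scoring a query against each database sequence by alignment, can be illustrated with a toy example. The sketch below uses a plain Smith-Waterman local alignment score with a linear gap penalty. It is not BLAST's own heuristic, and the function names, scoring values, and sequences are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b.

    The scoring values are illustrative defaults, not BLAST's parameters.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Return (name, score) pairs sorted from best to worst score.

    `database` is a hypothetical dict mapping sequence names to sequences.
    """
    scores = [(name, local_alignment_score(query, seq)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: -item[1])


if __name__ == "__main__":
    toy_db = {"s1": "TTGATTACATT", "s2": "CCCCCCC"}  # made-up sequences
    print(rank_database("GATTACA", toy_db))
```

Exhaustive dynamic programming like this scales with the product of the sequence lengths, which is why tools built for very large databases rely on faster heuristics.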
Factors Influencing Database Searching
Database size and quality impact the comprehensiveness and reliability of search results
Search algorithm and parameter selection affect sensitivity, specificity, and computational efficiency
Evolutionary distance between query and target sequences influences the ability to detect homology, with more distant relationships requiring more sensitive algorithms and parameters
Sequence length and complexity can affect the statistical significance and biological relevance of search results, with very short or low-complexity (repetitive, compositionally biased) sequences more likely to produce spurious matches
Database composition and taxonomic representation should be considered when interpreting search results, as biases in database content can influence the observed patterns of sequence similarity and homology
Interpreting BLAST Results
Statistical Significance Measures
E-value (Expect value) represents the number of hits with an equal or better score expected by chance given the database size, with lower E-values indicating higher significance
Bit score measures alignment quality, normalized for the scoring matrix used, and is independent of database size, with higher bit scores indicating better alignments
P-value estimates the probability of observing an alignment with a given score or better by chance, with lower P-values indicating higher significance
Alignment length and percent identity provide additional information on the extent and quality of the sequence similarity
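The relationship between these measures can be sketched numerically. The example below assumes the standard Karlin-Altschul relation, in which the E-value equals the search space size times 2 raised to the negative bit score, and a Poisson model for converting an E-value to a P-value. Using raw query and database lengths is a simplification, since BLAST applies corrected "effective" lengths, and the function names are illustrative.

```python
import math


def evalue_from_bit_score(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score`.

    Assumes E = m * n * 2^(-S'), with raw lengths standing in for
    BLAST's effective lengths (a simplification).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def pvalue_from_evalue(evalue):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-evalue)  # numerically stable 1 - exp(-E)


if __name__ == "__main__":
    small = evalue_from_bit_score(40.0, query_len=100, db_len=10**6)
    large = evalue_from_bit_score(40.0, query_len=100, db_len=10**9)
    # Same bit score, larger database: the E-value grows with the search space.
    print(small, large, pvalue_from_evalue(small))
```

The same bit score yields a larger E-value in a larger database, which is why bit scores are the better choice for comparing hits across searches, and for small values the E-value and P-value are nearly identical.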
Biological Relevance Assessment
Examine alignments to assess the extent and continuity of sequence similarity, considering factors such as gaps and mismatches
Evaluate percentage identity to gauge the level of sequence conservation, with higher identity suggesting closer evolutionary relationships or functional similarity
Consider query coverage to determine the proportion of the query sequence that aligns with the database sequence, with higher coverage indicating more extensive similarity
Inspect subject descriptions and annotations to infer potential functions, evolutionary relationships, or domain architecture of the matched sequences
Integrate information from multiple BLAST hits and alignments to build a more comprehensive understanding of the query sequence's biological context and relationships
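The checks above can be applied programmatically to a results table. The sketch below assumes BLAST+ tabular output (`-outfmt 6`) with its default twelve columns, and filters hits by E-value, percent identity, and query coverage. The thresholds, the `Hit` type, and the function names are illustrative assumptions, not recommended cutoffs, and the query length must be supplied separately because the default columns do not include it.

```python
import csv
import io
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float    # percent identity
    length: int      # alignment length
    qstart: int
    qend: int
    evalue: float
    bitscore: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse default outfmt 6 rows: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hits.append(Hit(row[0], row[1], float(row[2]), int(row[3]),
                        int(row[6]), int(row[7]), float(row[10]), float(row[11])))
    return hits


def query_coverage(hit: Hit, query_len: int) -> float:
    """Percent of the query spanned by this single alignment."""
    return 100.0 * (abs(hit.qend - hit.qstart) + 1) / query_len


def filter_hits(hits, query_len, max_evalue=1e-5, min_pident=30.0, min_coverage=50.0):
    """Keep hits passing all three (illustrative) thresholds, best bit score first."""
    kept = [h for h in hits
            if h.evalue <= max_evalue
            and h.pident >= min_pident
            and query_coverage(h, query_len) >= min_coverage]
    return sorted(kept, key=lambda h: -h.bitscore)
```

Filtering on several criteria at once reflects the point that no single number settles biological relevance: a low E-value over a short stretch of the query may indicate only a shared domain rather than full-length similarity.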
BLAST Algorithms: Uses and Comparisons
Algorithm-Specific Use Cases
BLASTN compares nucleotide queries against nucleotide databases, identifying similar DNA or RNA sequences (homologous genes, regulatory elements)
BLASTP compares protein queries against protein databases, inferring functional or structural relationships and studying protein evolution
BLASTX compares translated nucleotide queries (six reading frames) against protein databases, identifying potential protein-coding genes in unannotated DNA sequences
TBLASTN compares protein queries against translated nucleotide databases (six reading frames), identifying DNA sequences encoding proteins similar to the query, even if not annotated
TBLASTX compares translated nucleotide queries against translated nucleotide databases (six reading frames on both sides), identifying potential protein-coding genes in unannotated DNA sequences from distantly related organisms
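The translated searches above depend on reading a nucleotide sequence in all six frames: three offsets on the given strand and three on its reverse complement. A minimal sketch follows, assuming the standard genetic code and an uppercase or lowercase A/C/G/T input. The function names and the frame labels (`+1` to `-3`) are illustrative, and any codon containing other characters is translated as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

For the input `ATGGCCTAA`, frame `+1` reads `MA*`. Each translated frame is then compared as a protein sequence, which is what lets translated searches detect coding similarity that is weak at the nucleotide level.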
Comparative Analysis and Integration
The choice of BLAST algorithm depends on the nature of the query and target sequences, research question, and required sensitivity and specificity
Comparing results from different BLAST algorithms provides a more comprehensive understanding of sequence relationships and helps identify false positives or negatives
Integrating results from multiple BLAST searches and algorithms can improve the accuracy and confidence of homology detection and functional inference
Combining BLAST results with other sources of information (domain databases, protein family databases, literature) enhances the biological interpretation of sequence similarities
Iterative BLAST searches using identified homologs as queries can expand the scope of homology detection and refine the understanding of evolutionary relationships
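The dependence of program choice on the query and target sequence types can be written as a small lookup. This is a sketch only: the function name and the `translated_search` flag are hypothetical, and the returned strings are simply the program names, not a complete command line.

```python
BLAST_PROGRAMS = {
    # (query type, database type) -> program
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, db_type, translated_search=False):
    """Pick a BLAST program from the sequence types.

    `translated_search` (a hypothetical flag) requests a protein-level
    comparison of two nucleotide sequences, which is what tblastx does.
    """
    if translated_search and query_type == db_type == "nucleotide":
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type!r} vs {db_type!r}")


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
    print(choose_blast_program("nucleotide", "nucleotide", translated_search=True))
```

The lookup covers only the sequence types. The research question and the required sensitivity still have to be weighed by the researcher, for example when deciding whether a nucleotide-versus-nucleotide comparison should be run at the protein level.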
Identifying Orthologs, Paralogs, and Homologs
Definitions and Significance
Orthologs are genes in different species that evolved from a common ancestral gene by speciation, typically retaining the same function
Paralogs are genes within the same species that evolved from a common ancestral gene by duplication, potentially diverging in function
Homologs are genes sharing a common evolutionary origin, including both orthologs and paralogs
Identifying orthologs is crucial for studying gene function and evolution across species, while paralogs help understand gene family evolution and the emergence of new functions
Homology identification is essential for studying the evolutionary history and relationships of genes and organisms
Methods for Ortholog and Paralog Identification
Reciprocal BLAST searches use a gene from one species to search for homologs in a second species, then use the best hit from the second species to search the first species; if the original gene is the best hit in the reciprocal search, the two genes are likely orthologs
Phylogenetic analysis constructs evolutionary trees based on sequence similarity to infer evolutionary relationships, with orthologs typically clustering together and paralogs forming separate clades
Combining reciprocal BLAST searches and phylogenetic analysis provides a robust approach to distinguish orthologs, paralogs, and homologs by considering both sequence similarity and evolutionary relationships
Synteny analysis examines the conservation of gene order and neighborhood across genomes to identify orthologs and distinguish them from paralogs
Comparative genomics approaches integrate sequence similarity, phylogenetic analysis, and synteny information to refine ortholog and paralog assignments across multiple species
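The reciprocal search procedure described above can be sketched as a reciprocal-best-hit check over two sets of search results. The example assumes each search has already been reduced to (query, subject, bit score) tuples. The function names and the gene identifiers in the demo are made up for illustration, and ties are broken by keeping the first top-scoring hit encountered.

```python
def best_hits(hits):
    """Return {query: subject} keeping the highest bit score per query.

    `hits` is an iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)   # species A genes searched against species B
    reverse = best_hits(b_vs_a)   # species B genes searched against species A
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Made-up identifiers and scores for illustration only.
    a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0), ("a2", "b1", 120.0)]
    b_vs_a = [("b1", "a1", 245.0), ("b1", "a2", 118.0), ("b2", "a1", 88.0)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the demo, `a2` is not reported because its best hit `b1` points back to `a1`, the pattern expected when `a2` is a paralog rather than the ortholog. Reciprocal best hits give only candidate orthologs, so confirmation by phylogenetic or synteny analysis remains worthwhile.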