🧬Computational Genomics Unit 9 – RNA-seq: Transcriptome Analysis
RNA-seq is a powerful tool for studying gene expression in cells and tissues. It allows researchers to quantify RNA levels, identify new transcripts, and detect alternative splicing events. This technique has revolutionized transcriptomics by enabling genome-wide analysis of gene expression at unprecedented depth and resolution.
Key concepts in transcriptomics include the transcriptome, gene expression, alternative splicing, and non-coding RNAs. Understanding these concepts is crucial for designing RNA-seq experiments, analyzing data, and interpreting results. Proper experimental design, sample preparation, and quality control are essential for generating reliable RNA-seq data.
RNA-seq (RNA sequencing) is a high-throughput sequencing technology used to study the transcriptome, which is the complete set of RNA transcripts in a cell or tissue at a specific time point
Enables researchers to quantify gene expression levels, identify novel transcripts, and discover alternative splicing events by directly sequencing the RNA molecules present in a sample
Provides a snapshot of the active genes and their expression levels under specific conditions (developmental stage, disease state, or treatment)
Offers several advantages over traditional gene expression profiling methods (microarrays) including higher sensitivity, wider dynamic range, and the ability to detect novel transcripts without prior knowledge of the genome sequence
Generates large amounts of data that require computational analysis to extract biologically meaningful insights
Typical RNA-seq experiment produces millions of short reads (50-150 base pairs) that need to be aligned to a reference genome or assembled de novo
Expression levels are quantified by counting the number of reads mapping to each gene or transcript
Has revolutionized the field of transcriptomics by enabling genome-wide analysis of gene expression at an unprecedented resolution and depth
Widely applied in various research areas (developmental biology, cancer research, and plant sciences) to understand the molecular mechanisms underlying biological processes and diseases
Key Concepts in Transcriptomics
Transcriptome refers to the complete set of RNA transcripts present in a cell or tissue at a given time point, including messenger RNAs (mRNAs), non-coding RNAs (ncRNAs), and small RNAs
Gene expression is the process by which the information encoded in a gene is used to synthesize functional gene products, primarily proteins
Expression levels can be quantified by measuring the abundance of mRNA transcripts produced from each gene
Differential gene expression refers to the changes in expression levels between different conditions (e.g., healthy vs. diseased, treated vs. untreated)
Alternative splicing is a regulated process during gene expression that allows a single gene to produce multiple mRNA isoforms, potentially encoding different protein variants
Occurs when exons are included or excluded from the final processed mRNA, or when introns are retained
Contributes to the diversity of the proteome and plays a crucial role in cell differentiation, development, and disease
Non-coding RNAs (ncRNAs) are functional RNA molecules that are not translated into proteins, but have important regulatory roles in gene expression and cellular processes
Examples include long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and small interfering RNAs (siRNAs)
RNA editing is a post-transcriptional modification that alters the nucleotide sequence of an RNA molecule, potentially changing the amino acid sequence of the encoded protein or affecting the stability and localization of the RNA
Gene fusion occurs when two separate genes are joined together due to chromosomal rearrangements, resulting in the production of a chimeric RNA and potentially a fusion protein with altered function
Often associated with cancer development and can serve as diagnostic biomarkers or therapeutic targets
RNA-seq Experimental Design
Defining the research question and hypothesis is crucial for designing an appropriate RNA-seq experiment
Clearly state the biological question to be addressed and the specific hypotheses to be tested
Consider the type of samples to be analyzed (e.g., tissues, cell lines, or single cells) and the comparisons to be made (e.g., treatment vs. control, time course, or different developmental stages)
Determining the sequencing depth and read length depends on the research goals and the complexity of the transcriptome
Higher sequencing depth (more reads per sample) increases the sensitivity to detect low-abundance transcripts and rare isoforms
Longer read lengths (100-150 bp) improve the accuracy of transcript assembly and isoform identification
Balance between sequencing depth and number of biological replicates based on budget and experimental design
Selecting the appropriate number of biological replicates is essential for statistical power and reproducibility
Biological replicates (independent samples from different individuals or experiments) capture the biological variability and allow for robust differential expression analysis
At least three biological replicates per condition are recommended, with more replicates increasing the power to detect significant differences
Choosing the RNA extraction and library preparation methods based on the sample type and research objectives
Total RNA sequencing captures all RNA species, including mRNAs, ncRNAs, and small RNAs
mRNA sequencing (poly(A) selection) enriches for mature mRNAs by targeting the poly(A) tail
Ribosomal RNA (rRNA) depletion removes the highly abundant rRNA molecules to increase the coverage of other RNA species
Strand-specific protocols preserve the information about the originating strand of the RNA transcripts
Planning for data storage, management, and analysis infrastructure is important given the large volume of data generated by RNA-seq experiments
Raw sequencing data (FASTQ files) and processed data (aligned reads, count matrices) require significant storage capacity
Computational resources (high-performance computing clusters or cloud-based platforms) are necessary for data processing and analysis
Establish a data management plan for organizing, backing up, and sharing the data in accordance with FAIR (Findable, Accessible, Interoperable, and Reusable) principles
Sample Prep and Sequencing
RNA extraction is the first step in sample preparation, which involves isolating total RNA from the biological samples
Use commercially available kits (TRIzol, RNeasy) or phenol-chloroform extraction methods
Assess RNA quality and integrity using spectrophotometry (NanoDrop) and capillary electrophoresis (Bioanalyzer or TapeStation)
High-quality RNA with minimal degradation (RIN > 8) is essential for successful library preparation and sequencing
Library preparation converts the RNA molecules into cDNA libraries compatible with the sequencing platform
Fragmentation of RNA into smaller pieces (200-500 bp) to ensure uniform coverage across the transcriptome
Reverse transcription to synthesize cDNA from the fragmented RNA using random hexamer primers
Adapter ligation to attach platform-specific sequences to the ends of the cDNA fragments, enabling amplification and sequencing
Amplification of the cDNA library using PCR to increase the amount of material for sequencing
Size selection to enrich for fragments of the desired length and remove adapter dimers and other artifacts
Multiplexing allows for the simultaneous sequencing of multiple samples in a single run by using unique barcodes (sample-specific sequences) added during library preparation
Reduces sequencing costs and increases throughput
Requires careful design to ensure balanced representation of samples and avoid barcode crosstalk
Sequencing platforms (Illumina, PacBio, Oxford Nanopore) generate millions to billions of short reads (50-150 bp) or longer reads (1-100 kb) from the cDNA libraries
Illumina sequencing (HiSeq, NextSeq) is the most widely used platform for RNA-seq, offering high accuracy, throughput, and cost-effectiveness
PacBio and Oxford Nanopore sequencing provide longer reads that can improve the resolution of complex isoforms and splice variants, but have higher error rates and lower throughput compared to Illumina
Sequencing depth and read length should be chosen based on the research objectives and the complexity of the transcriptome
Aim for at least 20-30 million reads per sample for differential expression analysis of coding genes
Increase sequencing depth (50-100 million reads) for detecting low-abundance transcripts, non-coding RNAs, or rare isoforms
Longer read lengths (100-150 bp) improve the accuracy of transcript assembly and isoform identification
Quality Control and Preprocessing
Quality assessment of raw sequencing data (FASTQ files) is crucial for identifying and addressing any issues that may affect downstream analysis
Use tools like FastQC or MultiQC to generate quality control reports
Check for base quality scores, GC content, sequence duplication levels, and overrepresented sequences (adapters, primers)
Low-quality bases (Q < 20) and adapter sequences should be trimmed using tools like Trimmomatic or Cutadapt to improve the accuracy of alignment and quantification
Filtering out low-quality reads and contaminants helps to reduce noise and improve the signal-to-noise ratio in the data
Remove reads with a high proportion of low-quality bases (e.g., >50% bases with Q < 20)
Discard reads aligning to ribosomal RNA (rRNA) or other contaminating sequences (e.g., PhiX control) using tools like SortMeRNA or Bowtie2
Trim or filter out reads with adapter sequences, poly(A) tails, or other technical artifacts
Read trimming involves removing low-quality bases and adapter sequences from the ends of the reads
Performed using tools like Trimmomatic, Cutadapt, or BBDuk
Improves the accuracy of alignment and quantification by ensuring that only high-quality bases are used in the analysis
Trims bases below a specified quality threshold (e.g., Q < 20) from the 3' end of the reads
Removes adapter sequences by matching the read sequences against a library of known adapter sequences
Read deduplication identifies and removes PCR duplicates, which are reads originating from the same cDNA fragment during library amplification
PCR duplicates can introduce biases in quantification and lead to overestimation of expression levels
Tools like Picard MarkDuplicates or Clumpify can be used to identify and remove duplicate reads based on their alignment positions
Deduplication is more important for low-complexity libraries (e.g., small RNA-seq) or when using PCR-based library preparation methods
Quality control metrics and thresholds should be carefully evaluated and reported to ensure the reproducibility and reliability of the results
Base quality scores: Aim for a median Q-score > 30 and no more than 20% bases with Q < 20
Adapter content: Less than 10% of reads should contain adapter sequences after trimming
rRNA contamination: Less than 5% of reads should align to rRNA sequences
Alignment rate: At least 70-80% of reads should align uniquely to the reference genome or transcriptome
Read duplication rate: Depends on the library complexity and preparation method, but should be consistent across samples
Alignment and Quantification
Alignment maps the preprocessed reads to a reference genome or transcriptome to determine their originating locations
Spliced aligners (STAR, HISAT2, TopHat2) are used to handle reads spanning exon-exon junctions
Alignment parameters (e.g., mismatch rate, gap penalties) should be optimized based on the read length and quality
Alignment quality metrics (e.g., uniquely mapped reads, multi-mapped reads, unmapped reads) should be evaluated to assess the quality of the alignment
Transcript assembly reconstructs the full-length transcripts from the aligned reads, allowing for the identification of novel isoforms and splice variants
Reference-guided assembly (Cufflinks, StringTie) uses the reference genome annotation to guide the assembly process
De novo assembly (Trinity, Oases) reconstructs transcripts without relying on a reference genome, enabling the discovery of novel transcripts in non-model organisms
Hybrid assembly approaches (HISAT-StringTie, STAR-Cufflinks) combine reference-guided and de novo methods to improve the accuracy and completeness of the assembly
Quantification estimates the expression levels of genes and transcripts by counting the number of reads or fragments mapping to each feature
Count-based methods (HTSeq, featureCounts) assign reads to genes or exons based on their alignment positions
Requires a gene annotation file (GTF/GFF) to define the genomic coordinates of the features
Generates a count matrix with rows representing genes and columns representing samples
Transcript-level quantification (RSEM, Kallisto, Salmon) estimates the abundance of individual isoforms by probabilistically assigning reads to transcripts
Uses the principles of pseudoalignment or quasi-mapping to rapidly assign reads to transcripts without the need for a full alignment
Outputs TPM (Transcripts Per Million) or FPKM (Fragments Per Kilobase Million) values, which normalize for transcript length and library size
Normalization adjusts the raw read counts to account for differences in library size, sequencing depth, and other technical factors that may affect the comparison of expression levels across samples
CPM (Counts Per Million) and RPM (Reads Per Million) normalize the read counts by the total number of mapped reads in each sample
TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase Million) additionally normalize for transcript length, making them more suitable for comparing expression levels across genes
TMM (Trimmed Mean of M-values) and DESeq2's median-of-ratios method are more robust normalization methods that account for differences in library composition and reduce the impact of highly expressed genes
Batch effect correction removes systematic biases introduced by technical factors (e.g., sequencing run, library preparation batch) that can confound the biological variation of interest
Tools like ComBat, SVA, and RUVSeq can be used to identify and correct for batch effects using statistical methods
Batch effects should be assessed and corrected for before downstream analysis to avoid false positives and improve the reproducibility of the results
Differential Expression Analysis
Differential expression analysis identifies genes or transcripts that are significantly up- or down-regulated between experimental conditions
Requires a count matrix with normalized read counts for each gene or transcript across all samples
Compares the expression levels between two or more groups (e.g., treatment vs. control, time points, tissue types) to identify genes with statistically significant changes in expression
Statistical methods for differential expression analysis model the distribution of read counts and test for significant differences between conditions
Negative binomial distribution (DESeq2, edgeR) is commonly used to model the overdispersion of read counts, accounting for both biological and technical variability
Generalized linear models (GLMs) are used to test for differential expression while controlling for confounding factors (e.g., batch effects, covariates)
Likelihood ratio tests (LRT) or Wald tests are used to assess the significance of the differential expression, generating p-values for each gene or transcript
Multiple testing correction adjusts the p-values to control for the false discovery rate (FDR) when performing numerous simultaneous hypothesis tests
Bonferroni correction is a conservative method that multiplies the p-values by the number of tests performed, ensuring a family-wise error rate (FWER) of less than the specified threshold (e.g., 0.05)
Benjamini-Hochberg procedure is a more powerful method that controls the FDR, which is the expected proportion of false positives among all significant results
Adjusted p-values (q-values) < 0.05 are typically considered significant, but the threshold can be adjusted based on the desired balance between sensitivity and specificity
Fold change cutoffs are often used in combination with statistical significance to identify biologically meaningful changes in expression
Fold change (FC) is calculated as the ratio of the average normalized expression levels between two conditions (e.g., treatment/control)
Log2 fold change (LFC) is the logarithm (base 2) of the fold change, with positive values indicating up-regulation and negative values indicating down-regulation
Commonly used fold change cutoffs are |FC| > 2 or |LFC| > 1, but the choice of threshold depends on the biological context and the desired level of stringency
Visualization of differential expression results helps to interpret and communicate the findings
Volcano plots display the statistical significance (-log10 p-value) against the fold change (log2 FC) for each gene, highlighting genes that are both statistically significant and biologically meaningful
Heatmaps cluster genes and samples based on their expression patterns, revealing groups of co-regulated genes and sample relationships
MA plots (M-A plots) show the average expression level (A) against the fold change (M) for each gene, helping to identify intensity-dependent biases and assess the overall distribution of differential expression