RNA-seq Analysis Using Reference Genome of A Related Species, Stampy Vs Bowtie (6/23/2015)


Tue, Jun 23, 2015 at 8:14 AM

In an ongoing project, AccuraScience analyzes RNA-seq data for Balb/c mice. Because Balb/c's reference genome is unavailable, the genome of a closely related mouse (C57) is used in the project. A low mapping rate (~15%) is observed. With loosened criteria (set in Bowtie), a higher mapping rate of 30% is achieved. Customer is still unsatisfied, and has a local bioinformatician to redo the mapping, who achieves ~97% mapping rate. AccuraScience LB questions how the mapping was performed, and Customer's local bioinformatician answers that she uses Stampy to map the reads to mm10.

Tue, Jun 23, 2015 at 9:46 AM

AccuraScience LB: (1) Whereas higher mapping rate is always desirable, achieving highest possible mapping rate might not be considered as an ultimate goal especially for a RNA-seq project. Rather, we would be looking at (a) whether the mapping produced "enough" mappable reads, and (b) how to do the mapping to facilitate downstream analysis - including expression quantification and differential expression analysis: we want to be careful in deciding our mapping strategy to avoid introducing unnecessary bias in the downstream analysis.

(2) It is important to note that RNA-seq has a dynamic range of 5 orders of magnitude in expression level measurement. Think about the data, you would realize that most of the mapped reads (>90%) represent a fairly small number of highly expressed genes, and the remaining small proportion of reads correspond to many genes with mid-level expression, and the remaining, very small proportion of reads correspond to lowly expressed genes whose precise expression levels are hard to measure accurately. When we talk about possible improvements in mapping rate, we are talking linear scale: for instance, from 30% mapping rate to 90% mapping rate, that's only 3 times improvement. The wide dynamic range in gene expression measurement (across 5 orders of magnitude) determines that this 3 times improvement in mapping rate will not do any good to the vast majority of the gens whose expression levels can be reliably quantified. For example, a gene with mid-range expression level, e.g., 20 FPKM, of an average length, e.g., 5K bp, would have 1000 reads mapped to it before this improvement (assuming 10 million mapped reads in total). After the 3 times improvement in mapping rate, the same gene would have 3000 reads mapped to it. This would lead to almost unnoticeable change in the quantification of this gene's expression level. Difference is made only for very lowly expressed genes, e.g., a gene with expression level of 0.5 FPKM, which corresponds to 25 reads mapped before improvement, and 75 reads mappped after the 3 times improvement. How important is it to accurately measure the quantification of gene expression levels on the 0.5 FPKM level? Practically, genes of expression level <1 FPKM are considered "not expressed" at all by many researchers, and quantification of these genes' expression levels will not be regarded as accurate no matter what. This is the reason why I say achieving very high mapping rate should not be considered as the ultimate goal for a RNA-seq study.

It is much more important to examine the *number* of mappable reads, which helps us evaluate whether the dataset allows us to achieve "reasonable" gene expression quantification for moderately lowly expressed genes. A rule of thumb we have developed over the years is, if the total number of "mappable" reads is above 10 million reads, it is a good sample. The reason we went ahead with gene expression quantification for the two balb/c samples was because the number of mappable reads for the two samples were ~10 million (for more stringent mapping setting) and ~20 million (for looser mapping setting), which are really not bad, despite the low mapping rates.

(3) It is possible to achieve higher mapping rate if we switch to a different reference genome, however (a) This introduces systematic bias if you intend to perform differential expression analysis between the Balb/c samples and C57 samples, because they were mapped to different genomes. (b) This introduces complexity when you try to annotate the differentially expressed genes, because different reference genomes have different functional annotation of genes. The Balb/c genome might not even have adequate gene annotation that allows us to label those expressed genes.

(4) Although Stampy can be used to analyze RNA-seq data, it involves a different pipeline which requires the using of a "exon-exon junction" database. This pipeline is very different from Cufflinks, and it will not be proper to perform differential expression analysis between some samples whose expression levels were quantified using Cufflinks and other samples quantified using Stampy with exon-exon junction database.

(5) Another concern I have with Stampy is that it is known to be the most aggressive mapping tool available - it achieves high mapping rate, but suffers low specificity problem: that is, a higher proportion of the mapping results are expected to be mapped "incorrectly".

Back to Other Selected Recent Inquiries

Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer's privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.