"Advanced" Analysis Options for miRNA-Seq Data (12/23/2015)
Wed, Dec 23, 2015 at 1:07 PM
The customer has miRNA sequencing data acquired on an Illumina sequencing platform. Beyond the "routine" miRNA sequencing data analysis, which involves mapping the reads to known miRNA sequences documented in miRBase, he is interested in other analysis options, in particular identification of novel miRNAs and target prediction for known and novel miRNAs. The sample is of a pathogenic virus infecting a human tissue. The customer mentions that DNA microarray and mRNA sequencing data were obtained previously, and asks whether those data can be used in the miRNA sequencing data analysis project.
Wed, Dec 23, 2015 at 3:16 PM
AccuraScience LB: (1) Novel miRNAs can be predicted by taking a transcribed small RNA sequence together with its surrounding region (~80 bp) in the respective genome, and checking whether it could fold into a pre-miRNA-like hairpin structure. This is doable because the virus genome is available.
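The hairpin check in (1) can be illustrated with a minimal sketch. It is not the folding method an actual pipeline would use: a real analysis would compute a minimum-free-energy structure with a dedicated folder such as RNAfold from the ViennaRNA package, whereas this toy scans for the best ungapped stem-loop by counting complementary positions (bulges and asymmetric loops are ignored). The function names, the 40-base flank, and the minimum loop size are illustrative assumptions.

```python
# Toy hairpin screen: ungapped stem-loop scan, NOT a free-energy fold.
# All names and defaults here are hypothetical, for illustration only.

# Watson-Crick pairs plus the G-U wobble pair
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def extract_window(genome, start, end, flank=40):
    """Return the small-RNA locus [start, end) plus `flank` bases on each
    side, converted to RNA letters (coordinates are 0-based, end-exclusive)."""
    lo = max(0, start - flank)
    hi = min(len(genome), end + flank)
    return genome[lo:hi].upper().replace("T", "U")


def best_stem_loop(rna, min_loop=3):
    """Scan every fold axis and return (pairs, axis) for the ungapped
    stem-loop with the most complementary positions.

    Positions i and j face each other when i + j == axis. At least
    `min_loop` unpaired bases must lie between the innermost pair."""
    n = len(rna)
    best_pairs, best_axis = 0, None
    for axis in range(2 * n - 1):
        pairs = 0
        i = max(0, axis - (n - 1))   # outermost position on the 5' arm
        j = axis - i                 # its partner on the 3' arm
        while j - i > min_loop:
            if (rna[i], rna[j]) in PAIRS:
                pairs += 1
            i += 1
            j -= 1
        if pairs > best_pairs:
            best_pairs, best_axis = pairs, axis
    return best_pairs, best_axis


def paired_fraction(rna, min_loop=3):
    """Fraction of bases that sit in a complementary pair of the best stem-loop."""
    if not rna:
        return 0.0
    pairs, _ = best_stem_loop(rna, min_loop)
    return 2 * pairs / len(rna)


if __name__ == "__main__":
    toy = "GGGGAAAACCCC"  # 4-bp stem closing a 4-base loop
    print(best_stem_loop(toy), round(paired_fraction(toy), 2))
```

A high paired fraction only flags a candidate for closer inspection. Any cutoff would have to be calibrated against known pre-miRNA hairpins and against background sequence from the same genome.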
(2) Once novel miRNAs are identified, their target (protein-coding) genes can be predicted. Most miRNA target prediction tools (there are dozens of them) cannot predict targets for novel miRNAs, but this is something we can work out. A known issue in the miRNA target prediction field is that all tools suffer from false positives. A common practice is to run several of these tools (3+) and keep only the targets that they predict in common.
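The "keep what the tools agree on" practice in (2) amounts to a set intersection. Below is a minimal sketch, assuming each tool's output has already been parsed into a set of (miRNA, gene) pairs. The tool names and identifiers are placeholders, and the optional `min_tools` relaxation (keep pairs supported by at least k tools) is an added assumption rather than part of the advice above.

```python
from collections import Counter


def consensus_targets(predictions, min_tools=None):
    """Return the (miRNA, gene) pairs predicted by at least `min_tools` tools.

    predictions: dict mapping a tool name to an iterable of (miRNA, gene) pairs.
    min_tools:   defaults to the number of tools, i.e. a strict intersection.
    """
    if min_tools is None:
        min_tools = len(predictions)
    # set(...) guards against a tool reporting the same pair more than once
    counts = Counter(pair for pairs in predictions.values() for pair in set(pairs))
    return {pair for pair, n in counts.items() if n >= min_tools}


if __name__ == "__main__":
    # Placeholder tool names and identifiers, not real predictions
    predictions = {
        "tool_A": {("novel-miR-1", "GENE1"), ("novel-miR-1", "GENE2")},
        "tool_B": {("novel-miR-1", "GENE1"), ("novel-miR-1", "GENE3")},
        "tool_C": {("novel-miR-1", "GENE1"), ("novel-miR-1", "GENE2")},
    }
    print(consensus_targets(predictions))               # strict intersection
    print(consensus_targets(predictions, min_tools=2))  # majority vote
```

A strict intersection reduces false positives at the cost of sensitivity, so the choice of `min_tools` is a trade-off that depends on how many candidates can be followed up.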
(3) If the previously performed microarray and long RNA-seq experiments were done using matching samples (with the recent small RNA-seq), there are ways to take advantage of those (protein-coding) gene expression data when looking for targets of the novel miRNAs (if the latter are found). The general principle underlying this integrated analysis is that if a "novel" miRNA is up-regulated in one sample, the targets of this miRNA are expected to be down-regulated in the same sample, and the latter would be visible in the microarray or long RNA-seq data.
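The principle in (3) can be sketched as an anti-correlation filter over matched samples. This is one simple way to express the idea, not necessarily the method that would be used in practice. It assumes expression values are already normalized and listed in the same sample order for the miRNA and for each gene. The function names and the correlation cutoff of -0.7 are arbitrary illustrative choices.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (at least 3 values)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length vectors with at least 3 samples")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def anticorrelated_targets(mirna_expr, gene_expr, candidates, max_r=-0.7):
    """Keep candidate target genes whose expression across matched samples is
    negatively correlated with the miRNA (r <= max_r), strongest first."""
    kept = []
    for gene in candidates:
        if gene not in gene_expr:
            continue  # gene not measured on the array / in the RNA-seq data
        r = pearson(mirna_expr, gene_expr[gene])
        if r <= max_r:
            kept.append((gene, r))
    return sorted(kept, key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up numbers for four matched samples
    mirna = [1.0, 2.0, 3.0, 4.0]
    genes = {"GENE1": [8.0, 6.0, 4.0, 2.0], "GENE2": [1.0, 2.0, 3.0, 4.0]}
    print(anticorrelated_targets(mirna, genes, ["GENE1", "GENE2", "GENE3"]))
```

With only a handful of samples a correlation coefficient is weak evidence on its own, so a filter like this is better used to rank predicted targets than to confirm them.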
(4) A unique complexity of this project is that the RNA samples are a mixture of the virus and host samples. The analysis pipeline to be developed would need to take this into account.
Generally, the analysis might include the following components: (i) Sequencing data quality control: this could be tricky because the read length is likely longer than the miRNA length, implying that part of each read could be primer and/or adapter sequence that has to be trimmed off. (ii) A series of mapping steps, to map the (trimmed) reads to multiple reference sources, including (a) the human genome, (b) the virus genome, and (c) (possibly) known human miRNAs in miRBase. (iii) For small RNA sequences that come from the virus, attempt novel miRNA prediction based on RNA folding (described in (1) above). (iv) If novel miRNAs are identified, perform target prediction (described in (2) above), possibly taking advantage of the previous microarray and/or long RNA-seq data (described in (3) above).
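The trimming concern in step (i) can be sketched as follows. This is a simplified illustration using exact matching only; dedicated trimmers such as cutadapt also tolerate sequencing errors in the adapter. The adapter shown is assumed to be the Illumina TruSeq small RNA 3' adapter and must be checked against the library preparation kit actually used. The 18-30 nt length window and the function names are likewise assumptions.

```python
# Assumed 3' adapter (Illumina TruSeq small RNA); confirm against the kit used.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def trim_3prime_adapter(read, adapter=ADAPTER, min_overlap=8):
    """Cut the read at the first exact occurrence of the adapter's first
    `min_overlap` bases. Reads without that seed are returned unchanged."""
    seed = adapter[:min_overlap]
    pos = read.find(seed)
    return read if pos == -1 else read[:pos]


def clean_reads(reads, min_len=18, max_len=30):
    """Trim each read and keep inserts whose length is plausible for a miRNA.

    Untrimmed long reads (no adapter found) and very short inserts
    (adapter dimers, degradation products) fall outside the window."""
    kept = []
    for read in reads:
        insert = trim_3prime_adapter(read.upper())
        if min_len <= len(insert) <= max_len:
            kept.append(insert)
    return kept


if __name__ == "__main__":
    insert = "TGAGGTAGTAGGTTGTATAGTT"     # toy 22-nt insert
    reads = [
        insert + ADAPTER + "AAAAAAA",     # insert followed by adapter read-through
        "ACGT" * 12,                      # 48 nt, no adapter: dropped as too long
        "ACG" + ADAPTER,                  # near adapter dimer: dropped as too short
    ]
    print(clean_reads(reads))
```

The cleaned inserts are what would then go into the mapping steps in (ii), where each read is assigned to the human genome, the virus genome, or known miRNAs.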
Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.
Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer's privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.