RNA-seq analysis pipeline focusing on long ncRNAs (3/11/2016)

Fri, Mar 11, 2016 at 5:09 AM

Customer asks about an analysis pipeline for strand-specific RNA-seq data, with a focus on identifying long non-coding RNAs. He also asks how to analyze the protein-coding counterparts of the long non-coding transcripts.

Fri, Mar 11, 2016 at 10:42 AM

AccuraScience LB: Assuming that this is a model organism with a high-quality reference genome and Ensembl annotation of protein-coding genes, a "typical" analysis pipeline for RNA-seq data focused on identifying long non-coding RNAs (lncRNAs) would include the following steps, starting from the Cufflinks analysis results. (1) Annotate all transcripts against the gene models documented in Ensembl, and produce lists of (a) transcripts that can be annotated as known protein-coding genes, (b) transcripts that can be annotated as known ncRNAs, and (c) transcripts that cannot be annotated as known genes or known ncRNAs; this last category is called "novel transcripts". (2) Perform coding potential prediction on the novel transcripts, which splits them into (c1) novel coding transcripts and (c2) novel non-coding transcripts. The complete list of ncRNAs is then (b)+(c2).
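The bookkeeping behind lists (a), (b), (c1), (c2) and (b)+(c2) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline actually used: the function name, the biotype labels, the input dictionaries and the 0.5 cutoff are all hypothetical. It assumes each assembled transcript has already been matched to an Ensembl biotype (or to nothing), and that a separate coding potential tool has produced a score between 0 and 1 for each novel transcript.

```python
# Hypothetical biotype labels, loosely modeled on Ensembl biotypes.
CODING_BIOTYPES = {"protein_coding"}
NCRNA_BIOTYPES = {"lincRNA", "lncRNA", "antisense", "miRNA", "snoRNA", "snRNA"}


def classify_transcripts(transcripts, coding_prob, threshold=0.5):
    """Sort transcripts into the categories described above.

    transcripts: dict of transcript ID -> matched biotype, or None when the
                 transcript could not be matched to any known gene model.
    coding_prob: dict of transcript ID -> coding potential score (0 to 1)
                 for every novel transcript (assumed to come from an
                 external prediction tool).
    threshold:   assumed cutoff; scores at or above it count as coding.
    """
    groups = {
        "a_known_coding": [],
        "b_known_ncrna": [],
        "c1_novel_coding": [],
        "c2_novel_noncoding": [],
        "other_known": [],  # e.g. pseudogenes; outside the scheme above
    }
    for tid, biotype in transcripts.items():
        if biotype is None:
            # Novel transcript: split by predicted coding potential.
            if coding_prob[tid] >= threshold:
                groups["c1_novel_coding"].append(tid)
            else:
                groups["c2_novel_noncoding"].append(tid)
        elif biotype in CODING_BIOTYPES:
            groups["a_known_coding"].append(tid)
        elif biotype in NCRNA_BIOTYPES:
            groups["b_known_ncrna"].append(tid)
        else:
            groups["other_known"].append(tid)

    # The complete ncRNA list is (b) + (c2).
    groups["all_ncrna"] = groups["b_known_ncrna"] + groups["c2_novel_noncoding"]
    return groups


if __name__ == "__main__":
    example = {"T1": "protein_coding", "T2": "lincRNA", "T3": None, "T4": None}
    scores = {"T3": 0.9, "T4": 0.1}
    for name, ids in classify_transcripts(example, scores).items():
        print(name, ids)
```

In practice the cutoff depends on the coding potential tool and the species, so it should be taken from that tool's guidance rather than from this sketch.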

Most functionally characterized lncRNAs work in cis, i.e., they have “counterpart” protein-coding genes located close to them on the genome. We can readily produce a list of these neighboring protein-coding genes. It is important to note, however, that some lncRNAs work in trans, that is, their “counterparts” reside at remote locations. Whether a lncRNA works in cis or in trans can only be determined through functional studies.
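As a rough illustration of how candidate cis "counterparts" might be listed, the sketch below reports, for each lncRNA, the protein-coding genes on the same chromosome that lie within a fixed distance. The record layout, the function names and the 10 kb window are assumptions made for this example only; strand is ignored for simplicity, and proximity alone does not show that a lncRNA acts in cis.

```python
def interval_distance(start_a, end_a, start_b, end_b):
    """Gap in bp between two intervals; 0 if they overlap or touch."""
    return max(0, start_a - end_b, start_b - end_a)


def nearby_coding_genes(lncrnas, coding_genes, max_distance=10_000):
    """List candidate cis counterparts for each lncRNA.

    lncrnas, coding_genes: iterables of (name, chrom, start, end) tuples
                           (hypothetical layout for this sketch).
    max_distance:          assumed window size in bp.

    Returns a dict of lncRNA name -> list of (gene name, distance) pairs,
    sorted from nearest to farthest.
    """
    result = {}
    for name, chrom, start, end in lncrnas:
        hits = []
        for g_name, g_chrom, g_start, g_end in coding_genes:
            if g_chrom != chrom:
                continue  # only genes on the same chromosome qualify
            distance = interval_distance(start, end, g_start, g_end)
            if distance <= max_distance:
                hits.append((g_name, distance))
        result[name] = sorted(hits, key=lambda hit: hit[1])
    return result


if __name__ == "__main__":
    lncrnas = [("lnc1", "chr1", 1000, 2000)]
    genes = [
        ("geneA", "chr1", 2500, 4000),    # 500 bp downstream
        ("geneB", "chr1", 50000, 60000),  # too far away
        ("geneC", "chr2", 1000, 2000),    # different chromosome
        ("geneD", "chr1", 1500, 1800),    # overlapping
    ]
    print(nearby_coding_genes(lncrnas, genes))
```

The nested loop is fine for a few thousand transcripts; for genome-wide sets, a sorted or interval-indexed lookup would be the more usual choice.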

