Analysis of Human rDNA Sequences in YAC Clones (2/5/2015)


Thu, Feb 5, 2015 at 1:11 PM

Customer: I have been working on this rDNA sequencing project. We have sequenced YAC clones of rDNA regions from human. We have ~150 bp Illumina reads of these clones, which include a very large percentage (99%) of yeast DNA sequence. Our goal is to remove the yeast sequences and assemble the human sequences into rDNA contigs. The challenge is that rDNA is highly repetitive. What I have done so far is the following: (1) QC of the reads using FastQC, which indicated a significant presence of the Nextera cloning sequence. I trimmed the reads to remove those sequences. (2) Alignment against the yeast genome to remove yeast DNA reads. (3) Assembly of the remaining reads to find putative rDNA contigs.

What I found was that a large number of the contigs again match yeast DNA. A few contigs match a known rDNA sequence. However, the contigs that don't match yeast DNA don't seem to help extend the known rDNA sequence in either direction.

Thu, Feb 5, 2015 at 1:43 PM

AccuraScience LB: A couple of quick questions: (1) Are the sequencing data paired-end, mate-pair, or something else? What is the expected insert size? (2) Did you look into the expected level of homology between human rDNA and yeast rDNA, and take this level into account in your step 2 (aligning against the yeast genome)? (3) Are any human sequences known that can be used as a positive control (like spike-in sequences)? Can known human rDNA sequences be used for this purpose? (4) Those contigs that do not map to rDNA: did you BLAST them against all known sequences to see what they might be?

Thu, Feb 5, 2015 at 2:52 PM

Customer: (1) These are paired-end, not mate-pair. Perhaps a mate-pair approach would have been more suitable. The insert size averages 300 bp (min 100 bp, max 800 bp).

(2) It is difficult to get an idea of the homology of rDNA sequences since they are so poorly characterized. However, I think yeast rDNA is not very similar to human rDNA, which is probably the reason it was possible to clone them in yeast and not in E. coli or any other organism. Cloning in other organisms led to death rather than replication.

(3) There are 4 or 5 putative sequences for human rDNA that match each other somewhat. I've tried using them as a profile to align against, but the contigs I get seem to map to one concentrated location and do not seem to span a large region.

(4) I blasted the unmapped contigs and they map to yeast, and a couple of known mouse BAC sequences, for some reason. I redid the read filtering by removing those that mapped to the yeast genome or either of the two mouse BAC clones, but that still doesn't improve the contigs I get.
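
Editor's sketch: the read-filtering step described above can be illustrated with a short Python example that reads SAM records and decides which read pairs to pass on to assembly. This is a minimal sketch, not the customer's actual procedure. The function name `pairs_to_keep`, the `require_both` switch, and the demo records are hypothetical; only the SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary) come from the SAM format itself.

```python
from collections import defaultdict

UNMAPPED = 0x4          # SAM flag bit: this read is unmapped
NOT_PRIMARY = 0x900     # secondary (0x100) or supplementary (0x800) record


def pairs_to_keep(sam_lines, require_both=True):
    """Return names of read pairs to pass on to assembly.

    require_both=True  keeps a pair only if neither mate mapped to the
                       filter reference (strict removal of anything yeast-like).
    require_both=False keeps a pair if at least one mate failed to map
                       (more permissive, retains more candidate human reads).
    """
    total = defaultdict(int)    # primary records seen per read name
    mapped = defaultdict(int)   # of those, how many were mapped
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue            # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & NOT_PRIMARY:
            continue            # count each mate once
        total[name] += 1
        if not flag & UNMAPPED:
            mapped[name] += 1
    if require_both:
        return {n for n in total if mapped[n] == 0}
    return {n for n in total if mapped[n] < total[n]}


# Hypothetical demo records: p1 has both mates unmapped, p2 has both mates
# mapped, p3 has one mate mapped and one unmapped.
DEMO_SAM = [
    "@HD\tVN:1.6",
    "p1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "p1\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "p2\t99\tchrXII\t100\t42\t4M\t=\t300\t204\tACGT\tIIII",
    "p2\t147\tchrXII\t300\t42\t4M\t=\t100\t-204\tACGT\tIIII",
    "p3\t73\tchrXII\t500\t30\t4M\t=\t500\t0\tACGT\tIIII",
    "p3\t133\tchrXII\t500\t0\t*\t=\t500\t0\tACGT\tIIII",
]

if __name__ == "__main__":
    print(sorted(pairs_to_keep(DEMO_SAM)))
    print(sorted(pairs_to_keep(DEMO_SAM, require_both=False)))
```

Which policy to use is a judgment call: the strict setting removes more yeast sequence, while the permissive one keeps pairs in which only one mate resembles the filter reference.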

Thu, Feb 5, 2015 at 4:10 PM

AccuraScience LB: Based on your description, my current working hypothesis is that overly loose criteria were used in the mapping step, and a high proportion of human rDNA reads were erroneously mapped to homologous yeast rDNA genes and didn't reach the assembly step.
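
Editor's sketch: one way to probe this hypothesis is to look at how similar each "yeast" alignment really is, and to count how many reads would still be classified as yeast under progressively stricter identity cut-offs. The Python example below is a minimal sketch under stated assumptions: the aligner wrote an `NM:i` (edit distance) tag on each mapped record, identity is defined here as 1 - NM / (aligned columns), and the function names, thresholds, and demo records are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_and_nm(sam_line):
    """Return (CIGAR, NM) for a mapped SAM record, or None if unavailable."""
    fields = sam_line.rstrip("\n").split("\t")
    if int(fields[1]) & 0x4 or fields[5] == "*":
        return None                       # unmapped record
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return fields[5], int(tag[5:])
    return None                           # no NM tag present


def alignment_identity(cigar, nm):
    """Identity over aligned columns: matches/mismatches plus indels."""
    columns = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MID=X")
    return 1.0 - nm / columns if columns else 0.0


def count_at_thresholds(records, thresholds):
    """For each identity threshold, count alignments that meet or exceed it.

    records is an iterable of (cigar, nm) tuples. The count is the number of
    reads that would be treated as yeast (and removed) at that stringency.
    """
    scores = [alignment_identity(cigar, nm) for cigar, nm in records]
    return {t: sum(score >= t for score in scores) for t in thresholds}


# Hypothetical demo alignments: (CIGAR, NM)
DEMO_RECORDS = [("150M", 0), ("150M", 3), ("100M2D50M", 12), ("150M", 20)]

if __name__ == "__main__":
    for threshold, n in count_at_thresholds(DEMO_RECORDS, [0.85, 0.95, 0.99]).items():
        print(f"identity >= {threshold}: {n} reads removed as yeast")
```

If the number of removed reads drops sharply as the threshold rises, many of the discarded reads were only loosely yeast-like and may deserve a second look. Note that soft-clipped alignments can show high identity over a short aligned stretch, so the aligned length is worth checking as well.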

A few things I would try: (1) Obtain an understanding of homology levels between human rDNA and yeast rDNA, using existing rDNA sequence data. I would try to get all annotated human rDNA sequences and yeast rDNA sequences, starting with the methods described on this page http://seqanswers.com/forums/showthread.php?t=3563 (they likely will not work perfectly, but are a good starting point). Then perform pairwise alignment between human rDNA genes and yeast rDNA genes to understand how homologous they are, and possibly which regions are more homologous than others.

(2) Using knowledge obtained in (1) as guidance, I would titrate the stringency levels applied in your mapping step: Bowtie offers more flexibility than BWA, and the mechanisms for setting stringency levels differ between these two tools. Novoalign and Stampy may also be worth trying, but I would start with Bowtie. I would use multiple stringency levels (and multiple mapping tools) to do the mapping, assemble the unmapped reads, and compare the contigs resulting from these different settings.

(3) Another thing that might be helpful is to monitor the depth of coverage across regions of the yeast genome, in particular to evaluate whether the depth of coverage on the rDNA genes is higher than that of other regions, and whether some regions of the rDNA genes have higher coverage than others - the higher the depth of coverage, the higher the chance that the region is erroneously "absorbing" human rDNA reads. An iterative procedure might be developed to fine-tune the mapping configurations to get the best assembly results.
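
Editor's sketch: the coverage monitoring suggested in (3) can be illustrated with a small Python example that averages per-base depth in fixed windows and flags windows far above the genome-wide median. This is a minimal sketch, assuming three-column, tab-separated input in the style of `samtools depth` output (reference name, 1-based position, depth). The window size, the 5-fold cut-off, the function names, and the demo data are hypothetical choices for illustration.

```python
from collections import defaultdict
from statistics import median


def window_means(depth_lines, window=1000):
    """Mean depth per (reference, window index) from per-base depth lines."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue
        chrom, pos, depth = line.split("\t")[:3]
        key = (chrom, (int(pos) - 1) // window)   # positions are 1-based
        sums[key] += int(depth)
        counts[key] += 1
    # Averages are taken over the positions actually reported in the input.
    return {key: sums[key] / counts[key] for key in sums}


def flag_high_windows(means, fold=5.0):
    """Return windows whose mean depth is at least `fold` times the median."""
    if not means:
        return []
    baseline = median(means.values())
    if baseline <= 0:
        return []
    return sorted(key for key, m in means.items() if m >= fold * baseline)


# Hypothetical demo: a 10 bp stretch of chrXII at depth 80 amid depth 10-12.
DEMO_DEPTH = [f"chrXII\t{p}\t{80 if 11 <= p <= 20 else 10}" for p in range(1, 31)]
DEMO_DEPTH += [f"chrI\t{p}\t12" for p in range(1, 11)]

if __name__ == "__main__":
    means = window_means(DEMO_DEPTH, window=10)
    for chrom, index in flag_high_windows(means, fold=5.0):
        print(f"{chrom} window {index}: mean depth {means[(chrom, index)]:.1f}")
```

Because yeast rDNA is itself present in many copies, some elevation over the rDNA genes is to be expected regardless; the comparison is most informative when made across mapping stringency settings rather than against a fixed cut-off.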


Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.