RAD-seq Data Analysis, Identifying Sex-Specific Markers, and Genome Assembly (5/25/2015)


Mon, May 25, 2015 at 3:32 PM

Customer: I have some DNA sequencing data from a fish species and would like to work with you on the data analysis. What we have done: (1) RAD-seq for ~70 fish from 8 strains, including males and females; (2) a genomic library for 1 fish as a reference. What we would like you to help with: (1) genome-wide discovery of SNPs; (2) genome-wide patterns of genetic variation among the different strains; (3) unique sequences related to strain; and (4) unique sequences related to sex, i.e. sequences that are present in males and absent in females, or that show a female:male ratio of 2:1 (e.g. XX in females and only one copy of X in males) when counting the average number of reads per sample.

Tue, May 26, 2015 at 4:11 PM

AccuraScience LB: It might be difficult to decide how the genomic library is to be used. RAD-seq data are analyzed either (a) without a reference genome, or (b) with a reference genome, and the analysis pipelines for these two scenarios are very different. The genomic library data cannot be used directly; the reads would first have to be assembled into a genome. Genome assembly can be attempted; however, (a) for a higher eukaryotic species, successful assembly typically requires multiple libraries of differing insert sizes, and (b) de novo assembly of a higher eukaryotic genome is itself a substantial bioinformatics project.

Could you tell us the expected diversity level between individuals of the same strain, and the expected divergence level across strains? And what is the read length in the RAD-seq experiments? Was it single-end or paired-end?

Illumina data are known to show a far-from-uniform distribution of depth of coverage across regions (largely due to less-than-perfect random primers). Depth-of-coverage ratios could be calculated more precisely at the level of larger genomic sections (e.g., a whole chromosome or a sizable portion of it), but not at the level of individual reads. Thus calling sex-related sequences by depth-of-coverage ratio (the F:M = 2:1 rule), though theoretically a very good idea, would likely not work in practice.
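To make the read-count idea concrete, here is a minimal Python sketch of a per-tag female:male depth ratio. It is only an illustration under stated assumptions: the sample names, tag IDs, and read counts are hypothetical toy values, and dividing each sample by its median tag count is one simple normalization choice among several, not a method taken from this exchange.

```python
from statistics import mean, median

def normalize(counts):
    """Divide one sample's per-tag read counts by that sample's median count."""
    mid = median(counts.values())
    if mid == 0:
        raise ValueError("median read count is zero; sample too sparse to normalize")
    return {tag: n / mid for tag, n in counts.items()}

def female_male_ratio(samples, sex):
    """Return {tag: mean normalized female depth / mean normalized male depth}."""
    norm = {name: normalize(counts) for name, counts in samples.items()}
    females = [s for s in norm if sex[s] == "F"]
    males = [s for s in norm if sex[s] == "M"]
    tags = set().union(*(c.keys() for c in norm.values()))
    ratios = {}
    for tag in tags:
        f = mean(norm[s].get(tag, 0.0) for s in females)
        m = mean(norm[s].get(tag, 0.0) for s in males)
        # A tag with no male reads at all has an undefined ratio; report it as infinity.
        ratios[tag] = f / m if m > 0 else float("inf")
    return ratios

# Hypothetical toy data: tagX behaves like an X-linked tag (half depth in males).
samples = {
    "f1": {"tagA": 100, "tagB": 100, "tagX": 100},
    "f2": {"tagA": 200, "tagB": 200, "tagX": 200},
    "m1": {"tagA": 100, "tagB": 100, "tagX": 50},
    "m2": {"tagA": 60, "tagB": 60, "tagX": 30},
}
sex = {"f1": "F", "f2": "F", "m1": "M", "m2": "M"}

ratios = female_male_ratio(samples, sex)
for tag in sorted(ratios):
    print(tag, round(ratios[tag], 2))
```

In this noise-free toy example the X-like tag separates cleanly from the others. On real data, per-tag counts vary considerably between samples and loci, so individual ratios would scatter around their expected values, which is the practical concern raised above.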

Wed, May 27, 2015 at 4:24 PM

Customer: Yes, we need to do a genome assembly, and we need more libraries for it. This is why we are doing more WGS. I am thinking of: (1) analyzing the RAD data in the absence of a reference genome first, to identify SNPs and genomic diversity, and (2) getting more genomic libraries for genome assembly.

The diversity level between individuals of the same strain may not be high, but divergence levels across strains are high. For example, the NC strain grows much faster than the others. So if we could find something related to this through genomic studies, that would be very significant.

It was paired-end. HiSeq = 2x100 bases, MiSeq = 2x300 bases.

The genomic library was run on the MiSeq, so each read is 300 bases long; depth: MiSeq ~2x20-25 million reads per run.

Thu, May 28, 2015 at 11:54 AM

AccuraScience LB: (1) De novo assembly of a genome would require a much larger amount of data than is currently available. Current best practice for larger eukaryotic genomes is to design several sets of libraries (with differing insert sizes and differing depths of coverage) to maximize the chance of a successful assembly. There are guidelines available that would be worth looking into. (2) RAD-seq data (and other data focusing primarily on SNPs) would be useful for determining the variant structure within and between strains, but it might be hard to translate this type of information directly into functional interpretations. For functional studies (egg ribbon/growth/sex determination), RNA-seq data would likely be more effective. Even in the absence of a reference genome, RNA-seq data could still offer deep functional insights. (3) For objectives with a heavy evolutionary focus, it might be important to determine what data are available for other species with related traits, as well as for an outgroup. It could be difficult to go very far with data from a single species only (albeit from multiple strains).
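Regarding point (1), the amount of data in hand can be put into perspective with simple arithmetic: mean depth of coverage is total sequenced bases divided by genome size. The sketch below uses the MiSeq figures quoted earlier (read as 20-25 million read pairs of 2x300 bases). The genome size of this species was not stated in the exchange, so the 1 Gb value is purely a hypothetical placeholder.

```python
def expected_coverage(read_pairs, read_length, genome_size):
    """Mean depth of coverage = total sequenced bases / genome size."""
    return read_pairs * 2 * read_length / genome_size

# ASSUMPTION: hypothetical genome size of 1 Gb; the real value is not given here.
GENOME_SIZE = 1_000_000_000

low = expected_coverage(20_000_000, 300, GENOME_SIZE)
high = expected_coverage(25_000_000, 300, GENOME_SIZE)
print(f"one MiSeq run: roughly {low:.0f}x to {high:.0f}x under the assumed genome size")
```

The result scales inversely with the true genome size, and it describes a single library with a single insert size, whereas the point above is that assembly typically relies on several libraries of differing insert sizes.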

Thu, May 28, 2015 at 3:36 PM

Customer sends a paper describing the strategy he would like to apply in the sex determination study, and further discusses the option of obtaining a genome assembly as a reference.

Fri, May 29, 2015 at 10:47 AM

AccuraScience LB: In this paper, they relied on identifying sex-specific tag markers rather than looking at the read count ratio between the two sexes.
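The presence/absence idea behind sex-specific tag markers can be sketched in a few lines of Python. This is not the procedure from the paper, which is not reproduced here; it is a minimal illustration that assumes each sample has already been reduced to a set of tag IDs. The function name, thresholds, sample names, and tags are all hypothetical.

```python
def sex_specific_tags(presence, sex, min_frac=1.0, max_other=0.0):
    """Find tags present in one sex and absent from the other.

    presence: {sample: set of tag IDs observed in that sample}
    sex:      {sample: "M" or "F"}
    min_frac: minimum fraction of the target sex that must carry the tag
    max_other: maximum fraction of the other sex allowed to carry it
    Returns (male_specific, female_specific) as sorted lists.
    """
    males = [s for s in presence if sex[s] == "M"]
    females = [s for s in presence if sex[s] == "F"]
    all_tags = set().union(*presence.values())

    def frac(tag, group):
        # Fraction of samples in the group in which the tag was observed.
        return sum(tag in presence[s] for s in group) / len(group)

    male_specific = sorted(
        t for t in all_tags
        if frac(t, males) >= min_frac and frac(t, females) <= max_other
    )
    female_specific = sorted(
        t for t in all_tags
        if frac(t, females) >= min_frac and frac(t, males) <= max_other
    )
    return male_specific, female_specific

# Hypothetical toy data: tY is seen in every male and in no female.
presence = {
    "m1": {"t1", "t2", "tY"},
    "m2": {"t1", "tY"},
    "f1": {"t1", "t2"},
    "f2": {"t1", "t2", "t3"},
}
sex = {"m1": "M", "m2": "M", "f1": "F", "f2": "F"}

male_specific, female_specific = sex_specific_tags(presence, sex)
print("male-specific:", male_specific)
print("female-specific:", female_specific)
```

Because this test only asks whether a tag is observed at all in each sample, it does not depend on comparing read depths between the sexes. With real data the two thresholds would probably need to be relaxed to tolerate tags missed at low coverage.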

Genome assembly is significant work. Assembling reads into contigs is relatively straightforward (though it costs a great deal of computational time). Scaffolding is a little trickier. If annotation work is added on top, it becomes very costly. Getting a good genome assembly for publication (which would require more than minimal annotation) is a fairly significant project.


Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer's privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.