Assembling WGS data for marker identification (4/18/2016)

Mon, Apr 18, 2016 at 11:50 AM

Customer: We are interested in bioinformatic analysis of next-generation sequencing (NGS) data from a collection of weed strains. A published sequence for the genome exists, but many strains are poorly characterized. We are sequencing ~30 strains in order to design molecular markers unique to each strain.

Mon, Apr 18, 2016 at 2:40 PM

AccuraScience LB: De novo assembly of a genome of 500+ Mb is a significant project, and depending on the library design (that is, how many libraries with differing fragment sizes there are, and the expected depth of coverage for each library), there is a risk of not being able to obtain an optimal assembly. Moreover, even after the assembly of each of the 30 strains is completed, it remains a tricky task to identify regions that can be used as markers to distinguish among the 30 strains.
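
As a rough illustration of the depth-of-coverage consideration above, the expected depth for a library can be estimated as (number of reads x read length) / genome size. The sketch below assumes only that standard estimate; the function name and the example numbers are hypothetical and are not figures from this project.

```python
def expected_depth(num_reads, read_length_bp, genome_size_bp):
    """Estimate average depth of coverage as total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return num_reads * read_length_bp / genome_size_bp


# Hypothetical library: 100 million reads of 150 bp against a 500 Mb genome.
depth = expected_depth(100_000_000, 150, 500_000_000)
print(f"Expected depth: {depth:.1f}x")
```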

Based on your description, however, I suspect that it might not even be necessary to perform a complete assembly of the genomes. If the sole purpose of this study is to identify marker regions unique to each strain, so that you could do PCR-based identification of each strain, and you are not interested in obtaining the gene catalogs of the genomes, then what might work is to assemble contigs in each strain, skipping the scaffolding and finishing phases altogether. Then, we could map the raw reads from each of the other 29 strains against each contig in the strain of interest, selecting those contigs that have minimal coverage by reads from the other 29 strains. If you specify the length of unique contigs required to be considered as candidate markers (and other criteria, e.g., G/C content), we would perform this filtering accordingly.
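
The contig filtering step described above could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: it assumes the read mapping has already been done and summarized as the fraction of each contig covered by reads from each other strain. The `Contig` class, the `candidate_markers` function, and all threshold defaults are hypothetical placeholders that would be replaced by the criteria you specify.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    # Fraction of this contig's bases covered by reads from each other strain,
    # keyed by strain name (assumed to be precomputed from read mapping).
    foreign_coverage: dict


def gc_content(sequence):
    """Return the G/C fraction of a sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_foreign_coverage(contig):
    """Highest coverage of this contig by reads from any single other strain."""
    return max(contig.foreign_coverage.values(), default=0.0)


def candidate_markers(contigs, min_length=500, max_coverage=0.05,
                      gc_range=(0.35, 0.65)):
    """Keep contigs that are long enough, within the G/C range, and barely
    covered by reads from every other strain. Thresholds are placeholders."""
    selected = []
    for contig in contigs:
        if len(contig.sequence) < min_length:
            continue
        if not gc_range[0] <= gc_content(contig.sequence) <= gc_range[1]:
            continue
        if max_foreign_coverage(contig) > max_coverage:
            continue
        selected.append(contig)
    # Most strain-specific contigs (lowest foreign coverage) first.
    return sorted(selected, key=max_foreign_coverage)
```

Taking the maximum coverage across the other strains, rather than the average, reflects the goal that a marker must be absent from every one of the other 29 strains, not just from most of them.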

Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

