Whole-genome sequencing analysis options for human samples (2/6/2016)
Sat, Feb 6, 2016 at 5:22 PM
Customer: We will be acquiring whole-genome sequencing data for several cell lines over the next few months and are looking for bioinformatics assistance. The analysis we would like includes identification of genes with mutations (1) in coding regions, including annotation as missense, nonsense, etc., (2) within 100 bp of splice sites, and (3) within promoter regions, such as within 1 kb of the transcriptional start site. This gene list could then be cross-referenced to a cancer or clinical gene list, such as this: http://tests.labmed.washington.edu/UW-OncoPlex. It would also be nice if the WGS data could be put in a viewable format, such as a track that can be uploaded to the UCSC genome browser.
Sun, Feb 7, 2016 at 9:30 AM
AccuraScience LB: The WGS data analysis pipeline we frequently run includes the following steps: (1) Sequencing data quality control, which examines potential issues that occurred during the sequencing experiments, e.g., low Phred scores would indicate quality problems in the data, and unusual enrichment of subsequences representing adapter or primer sequences would suggest issues in certain steps of the library preparation before sequencing. (2) Mapping all reads to the reference genome - a lower-than-expected mapping rate would be another sign of data quality problems. (3) Calling SNVs and short Indels. (4) Predicting the functional consequences of the SNVs and short Indels using multiple tools chosen from ANNOVAR, SIFT, PolyPhen2, GERP and MutationTaster. These tools apply complementary strategies (evolutionary conservation, physicochemical properties of the changed amino acids, etc.) to make such predictions. Most of them focus on mutations in coding regions only. (5) Pathway analysis, to identify biological pathways (often represented as Gene Ontology terms) significantly enriched among mutated genes or recurrently mutated genes, e.g., if the apoptosis pathway or a DNA repair pathway is highly enriched, it would suggest directions for further mechanistic studies of the cancer. If only cancer samples but no matching healthy control samples are available, mutation profiles of healthy subjects from the 1000 Genomes Project could be pulled and used as controls. (6) Structural variant (SV) calling: in contrast to SNVs and short Indels, SVs refer to changes on large genomic scales, e.g., chromosome translocations, which are also important signatures of cancer. Where appropriate, we will present results in browser formats to ease visualization.
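As a rough illustration of the Phred-score check mentioned in step (1), the sketch below decodes a FASTQ quality string and converts scores to error probabilities. It assumes the Phred+33 encoding; the function names and the example quality string are hypothetical and are not taken from the AccuraScience pipeline.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def mean_phred(quality_string, offset=33):
    """Average Phred score of one read; a low value hints at a poor-quality read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)


# Hypothetical read: four high-quality bases followed by two very poor ones.
example_quality = "IIII##"
print(phred_scores(example_quality))
print(round(mean_phred(example_quality), 2))
```

In practice a dedicated QC tool would report these statistics per position across all reads; the sketch only shows what the scores mean.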
As you might see, there is substantial overlap between this pipeline and what you would want done. Some minor differences include: (a) We have not looked specifically at regions within 100 bp of splice sites: my recollection is that these regions are used in defining exomes, but are not frequently cited in WGS data analysis. That said, if there is motivation to look at variants close to splice sites, this is something that we can develop and implement. (b) We have not specifically focused on the UW cancer gene panel. Our pathway analysis (described in (5) above) is in a way an unbiased approach to identifying pathways important to the system. I personally think it is more effective than the cancer panel-based approach. Moreover, there are many cancer panels like the UW panel (e.g., Illumina and Ion Torrent have each developed one or more), and it is hard to say which one is better than the others. That said, if you would like us to cross-check our results with the UW panel, this can be done.
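To make the three requested region types and the panel cross-check concrete, here is a minimal sketch. It assumes a hypothetical, simplified gene model (one chromosome, 1-based inclusive coordinates, exons treated as coding, promoter taken as 1 kb on either side of the TSS); the gene names, coordinates and panel are invented for illustration. A real analysis would rely on an annotation tool such as ANNOVAR.

```python
SPLICE_WINDOW = 100     # bp around exon-intron boundaries, as in the request
PROMOTER_WINDOW = 1000  # bp around the TSS (assumed to extend both ways)


def classify_variant(pos, gene):
    """Return the set of region labels that apply to a variant position.

    `gene` is a hypothetical dict: {"name", "tss", "exons": [(start, end), ...]}.
    """
    labels = set()
    exons = sorted(gene["exons"])
    if any(start <= pos <= end for start, end in exons):
        labels.add("exon")
    # Exon-intron boundaries: every exon edge except the two transcript ends.
    boundaries = [edge for exon in exons for edge in exon][1:-1]
    if any(abs(pos - b) <= SPLICE_WINDOW for b in boundaries):
        labels.add("near_splice_site")
    if abs(pos - gene["tss"]) <= PROMOTER_WINDOW:
        labels.add("promoter")
    return labels


def flag_panel_genes(variant_positions, genes, panel):
    """Collect labels per gene, then keep only genes present in the panel."""
    hits = {}
    for pos in variant_positions:
        for gene in genes:
            labels = classify_variant(pos, gene)
            if labels:
                hits.setdefault(gene["name"], set()).update(labels)
    return {name: labels for name, labels in hits.items() if name in panel}


# Invented example data.
genes = [
    {"name": "GENE_A", "tss": 1000, "exons": [(1000, 1200), (5000, 5300)]},
    {"name": "GENE_B", "tss": 20000, "exons": [(20000, 20500)]},
]
flagged = flag_panel_genes([950, 1250, 20100, 9000], genes, panel={"GENE_A"})
print(flagged)
```

Here GENE_B also carries variants but is dropped because it is not in the example panel, which is the behaviour a panel-based screen would have.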
Sun, Feb 7, 2016 at 12:22 PM
Customer: Our samples will be human pluripotent stem cells (embryonic or induced pluripotent). The goal with the analysis is to understand if there are red flags in a given cell line that we could interpret before differentiation.
The functional prediction and pathway analysis are the most interesting components of your analysis. How are promoter regions analyzed? Are they part of the functional analysis, or is that just coding analysis? I have never put much stock in GO terms; what are the other parts of your pathway analysis? How many healthy or disease WGS data sets, such as those from the 1000 Genomes Project, are available in your typical analysis? We are not wedded to the UW cancer panel, but screening against set(s) of cancer genes (UW, Illumina, or others) would be nice, in addition to GO analysis.
The first cell lines will be wildtype, but future cell lines for WGS will be engineered with large deletions or insertions of approximately 1 kb. Would deletions be easier to analyze? How would large insertions be handled? Would the exogenous sequence be ignored? Or would you want us to provide the exogenous sequence so that it can be included as an artificial chromosome?
Do you have preferred sequencing data for WGS analysis? We were planning on what seems to be the standard 30X coverage WGS with Illumina sequencing. In your opinion, would that be good?
Sun, Feb 7, 2016 at 4:15 PM
AccuraScience LB: There are not a lot of options for deep analysis of mutations in non-coding regions (including promoters and UTRs), and there are multiple reasons for this: (1) Unlike amino acid changes, DNA changes (e.g., an 'A' changed to a 'C') do not by themselves carry much interpretable functional information. (2) Functional elements residing in promoters and UTRs - e.g., transcription factor binding sites and miRNA target sites - are imprecisely defined: they are hard to pinpoint informatically, and they are not well documented across the genome. (3) These regions are evolutionarily less conserved and expected to mutate more frequently, thus a mutation found in such a region carries less weight. We can check mutations against known "ultraconserved regions" - regions that do not mutate across related species and are thus expected to be functionally important - but after that, there isn't much else that can be done informatically: experiments have to be designed and carried out to obtain further mechanistic insights about those mutations.
Besides GO-defined pathways, there are a number of functionally defined gene sets that can be plugged when pathway analysis is performed. You could take a look at this list: http://software.broadinstitute.org/gsea/msigdb/collections.jsp.
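For orientation, one common way to score enrichment of a gene set among mutated genes is a one-sided hypergeometric test. The sketch below is an assumption about the general technique, not a description of the specific method AccuraScience uses; the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, set_size, mutated, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    total_genes: number of genes in the background
    set_size:    number of background genes belonging to the gene set
    mutated:     number of mutated genes (the "draws")
    overlap:     number of mutated genes that fall in the gene set
    """
    upper = min(set_size, mutated)
    numerator = sum(
        comb(set_size, k) * comb(total_genes - set_size, mutated - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / comb(total_genes, mutated)


# Hypothetical counts: 20,000 background genes, a 150-gene set,
# 300 mutated genes, 8 of which belong to the set.
p_value = hypergeom_enrichment_p(20000, 150, 300, 8)
print(p_value)
```

When many gene sets are tested, as with the MSigDB collections, the resulting p-values would need a multiple-testing correction before being interpreted.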
The 1000 Genomes Project has data for about 2,500 individuals (2,504 in the final phase 3 release) from 26 populations. We often use all data in the population matching the case subjects, e.g., if your case subjects are all Finnish, you might want to use all subjects labeled with the FIN population in the 1000 Genomes Project as controls. The 26 populations are explained in http://www.1000genomes.org/category/frequently-asked-questions/population.
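Selecting population-matched controls amounts to filtering a sample table by population code. The sketch below assumes a tab-separated panel with `sample` and `pop` columns; the sample IDs are invented and the in-memory text stands in for a real panel file.

```python
import csv
import io

# Invented stand-in for a sample panel file (tab-separated, with header).
PANEL_TEXT = (
    "sample\tpop\tsuper_pop\n"
    "S1\tFIN\tEUR\n"
    "S2\tGBR\tEUR\n"
    "S3\tFIN\tEUR\n"
)


def samples_in_population(panel_file, population):
    """Return sample IDs whose population code matches `population`."""
    reader = csv.DictReader(panel_file, delimiter="\t")
    return [row["sample"] for row in reader if row["pop"] == population]


fin_controls = samples_in_population(io.StringIO(PANEL_TEXT), "FIN")
print(fin_controls)
```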
Structural variants (SVs) can be identified using short read data. You are welcome to take a look at this recent inquiry page on our web site http://www.accurascience.com/lstyuo/php/recent_inquiries.php?id=%27inquiry_98_12_12_2015%27. It describes strategies and limitations of this line of work.
Exogenous sequences could be added as an artificial chromosome when we map the reads.
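A minimal sketch of what "adding an artificial chromosome" could look like at the file level: the exogenous sequence is appended to the reference FASTA as an extra record before mapping. The contig name `exo_insert` and the sequences are hypothetical, and a real reference would be processed as a file rather than a string.

```python
def fasta_record(name, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def augment_reference(reference_fasta_text, name, sequence):
    """Append an exogenous sequence as an extra contig, refusing name clashes."""
    existing = {
        line[1:].split()[0]
        for line in reference_fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }
    if name in existing:
        raise ValueError("contig name already present: " + name)
    if not reference_fasta_text.endswith("\n"):
        reference_fasta_text += "\n"
    return reference_fasta_text + fasta_record(name, sequence)


# Toy reference and a 70-bp hypothetical insert.
toy_reference = ">chr1\nACGT\n"
augmented = augment_reference(toy_reference, "exo_insert", "A" * 70)
print(augmented)
```

The aligner's index would then need to be rebuilt from the augmented reference so that reads from the insert have somewhere to map.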
Illumina 30X coverage is a good compromise between cost and gain. For stem cell research it is good coverage. For cancer cells there might be some motivation to go to much higher coverage because they are expected to be much more heterogeneous than stem cells.
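For a sense of scale, mean coverage is roughly (number of reads × read length) / genome size. The sketch below assumes an approximate human genome size of 3.1 Gb and 150 bp reads; both figures are illustrative assumptions rather than values from this exchange.

```python
GENOME_SIZE_BP = 3_100_000_000  # approximate human genome size (assumption)
READ_LENGTH_BP = 150            # assumed read length


def reads_needed(coverage, genome_size=GENOME_SIZE_BP, read_length=READ_LENGTH_BP):
    """Smallest number of reads giving at least the requested mean coverage."""
    return (coverage * genome_size + read_length - 1) // read_length


def mean_coverage(n_reads, read_length=READ_LENGTH_BP, genome_size=GENOME_SIZE_BP):
    """Mean depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


n_reads = reads_needed(30)
print(n_reads, mean_coverage(n_reads))
```

This ignores unmapped and duplicate reads, so the usable coverage from a real run would be somewhat lower than the nominal figure.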
Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.
Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.