DNase I Hypersensitive Sites (DHS) Analysis (11/23/2014)


Sat, 11/22/2014 at 2:18 PM

Customer: Here is a list of analysis tasks for our DHS datasets that we would like you to help with:
1) Peak calling of DNase I hypersensitive (DHS) sites of all libraries using F-seq.
2) Gene annotation of the DHS sites - distributions/localizations of DHS sites relative to transcription start site (TSS), gene body, intergenic region, etc. (presentation: genome browser, histograms, etc.)
3) Relationship between DHS sites and gene expression (correlation between DNase-seq and RNA-seq).
4) Correlation between DHS sites and histone modifications (H3K4m, H3K27m, H3K9m and acetylation).
5) Distribution/correlation between DHS sites and conserved noncoding regions (CNS).
6) Binding/distribution of important transcription factors (e.g., lineage-specific transcription factors or master regulators, NFAT, some STATs and SMADs) in DHS sites (ChIP-seq datasets for some of these transcription factors are available).
7) Cell type-specific DHS sites/comparison between several different cell types.

Sun, 11/23/2014 at 10:11 AM

AccuraScience LB: Below is my understanding of this project, from the viewpoint of how to carry it out.

(a) Task 1 (peak calling of DNase hypersensitive site sequencing data) is considered routine sequencing analysis (similar to ChIP-seq).

(b) A portion of Tasks 2, 5 and 6 involves "routine" genomics analysis, that is, obtaining and organizing genomic annotation information related to TSSs, genic regions, intergenic regions, conserved non-coding regions, and transcription factor binding sites (TFBSs). The more challenging portion is the development of proper statistical methods - discussed below.
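
As a rough illustration of the annotation step in (b), the following Python sketch assigns each DHS peak to "promoter", "gene_body" or "intergenic" based on the peak midpoint. The `Gene` record, the `classify_peak` helper, the 1 kb promoter window and the toy coordinates are all assumptions made for illustration; they are not part of the customer's data or of any agreed pipeline.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def tss(gene: Gene) -> int:
    """Transcription start site: 5' end of the gene on its own strand."""
    return gene.start if gene.strand == "+" else gene.end - 1


def classify_peak(chrom: str, start: int, end: int,
                  genes: List[Gene], promoter_window: int = 1000) -> str:
    """Label a peak by its midpoint (promoter takes priority over gene body).

    promoter_window is an assumed, adjustable distance from the TSS.
    """
    mid = (start + end) // 2
    label = "intergenic"
    for g in genes:
        if g.chrom != chrom:
            continue
        if abs(mid - tss(g)) <= promoter_window:
            return "promoter"
        if g.start <= mid < g.end:
            label = "gene_body"
    return label


# Hypothetical toy annotation and peaks, for illustration only.
genes = [Gene("chr1", 10_000, 20_000, "+"), Gene("chr1", 50_000, 60_000, "-")]
peaks: List[Tuple[str, int, int]] = [
    ("chr1", 9_800, 10_100),
    ("chr1", 15_000, 15_300),
    ("chr1", 30_000, 30_200),
    ("chr1", 59_500, 60_100),
]
labels = [classify_peak(c, s, e, genes) for c, s, e in peaks]
print(labels)
```

For genome-scale data a linear scan over all genes per peak would be too slow; an interval index or a dedicated interval tool would be the more realistic choice. The labels produced this way can be tallied directly into the histograms mentioned in Task 2.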

(c) The most appropriate statistical method for the "correlation" analysis in Tasks 2, 3, 4, 5, and 6 is likely to be linear mixed-effects models, because you have multiple independent variables: (i) cell type and (ii) gene expression or other genomic features, which are both fixed effects, and (iii) replicates, which are a random effect. Please take a look at this paper: http://www.pnas.org/content/109/32/E2183.abstract - in which linear mixed-effects models were applied to examine the correlation between DNA methylation and other features such as gene expression and genomic features such as transposons, 21-nt sRNAs, etc. Other than the fact that the dependent variable in that study was DNA methylation level, rather than DNase hypersensitivity, what that study did was very similar to what you would like to accomplish in yours.
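
To make the structure of such a model concrete, one possible formulation is written out below. The notation is an assumption introduced here for illustration (it is not taken from the cited paper), and the exact form would depend on how the DHS signal is quantified.

```latex
% i: cell type, j: genomic region (e.g. gene or window), k: replicate
% y_{ijk}: DHS signal; x_{ij}: gene expression or another genomic feature
y_{ijk} = \beta_0 + \alpha_i + \beta\, x_{ij} + u_k + \varepsilon_{ijk},
\qquad u_k \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

Here $\alpha_i$ (cell type) and $\beta$ (the feature of interest) are the fixed effects, and $u_k$ is the random intercept for replicates. A non-zero $\beta$ would indicate an association between the DHS signal and the feature after accounting for cell type and replicate-to-replicate variation.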

(d) Task 7 will likely require development of multivariate statistical methods, which I have not yet thought through.

(e) I assume you would be providing us with the RNA-seq data required for Task 3, and ChIP-seq data required for Task 4 (for histone modifications) and Task 6 (for transcription factors). Please let me know if this is incorrect.

(f) A few other things would require further development/testing, e.g., (i) whether we should use read counts or peak counts for the correlation analysis involving DNase hypersensitive site sequencing and ChIP-seq data; and (ii) for Task 6, because ChIP-seq data are available for some of the transcription factors you are interested in but not for others, we would have to rely on TFBS models from TRANSFAC for the latter. The question is then how to make the analysis results consistent (or comparable) between these two types of transcription factors.
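
To illustrate the read-count versus peak-count question in (f)(i), here is a minimal Python sketch that computes a Spearman rank correlation between DNase-seq and ChIP-seq signal over the same genomic windows, once on raw read counts and once on binary peak presence. All numbers are made up, and the fixed thresholds are only a stand-in for real peak calls; the helper names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def rank(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical read counts in six shared genomic windows.
dnase_reads = [120, 15, 300, 8, 45, 210]
chip_reads = [80, 10, 150, 12, 30, 160]

# Toy thresholds standing in for peak calls (1 = peak present in the window).
dnase_peak = [1 if r >= 100 else 0 for r in dnase_reads]
chip_peak = [1 if r >= 70 else 0 for r in chip_reads]

rho_reads = spearman(dnase_reads, chip_reads)
rho_peaks = spearman(dnase_peak, chip_peak)
print(f"read counts: {rho_reads:.3f}, peak presence: {rho_peaks:.3f}")
```

In this toy example the binarized peak calls agree perfectly while the read counts do not, which shows how the two choices can give different pictures of the same data: peak presence discards quantitative differences between windows, whereas read counts retain them but are more sensitive to sequencing depth and normalization.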


Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and an AccuraScience data analysis team on the specified dates and times. Edits were made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.