Statistical considerations for metabolomics data analysis (4/4/2016)
Mon, April 4, 2016 at 4:30 AM
Customer: Our data are non-targeted metabolite profiling data acquired specifically by Agilent instrumentation. Could you tell us (1) Which methods do you apply for the imputation of the data (curation of missing values)? Do you apply any additional pre-processing to the data, for example to reduce noise? (2) Our typical metabolomics dataset consists of ca. 200 samples of human plasma, where two or more groups are compared. In many cases we have two measurements of the same individual (before and after), and the effect of the treatment is compared between different products. Which statistical methods do you apply for such comparisons? (3) Do you use purely univariate techniques, or do you also apply chemometric modeling tools or machine learning methods for such large datasets?
Mon, April 4, 2016 at 4:24 PM
AccuraScience LB: How to impute the data depends largely on why the values are missing. If a missing value represents a measurement that was never made, then imputing it with the mean or median of the other samples for the same metabolite would be proper. If the missing value means the signal was "too low to be determined (reliably)", then setting it to 2-3 times the background might be proper. If too many samples have missing values for the same metabolite, then it would be a good idea to eliminate the metabolite altogether. It is also important to check for outliers, by PCA and/or clustering analysis, as part of the data pre-processing procedure.
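The imputation rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual pipeline: the function name `impute_metabolite`, the `kind` labels, the 50% missingness cutoff and the default factor of 2 (taken from the low end of the "2-3 times the background" range) are all hypothetical choices.

```python
from statistics import median


def impute_metabolite(values, kind, background=None,
                      max_missing_frac=0.5, factor=2.0):
    """Impute one metabolite's values across samples (None = missing).

    kind="not_measured":    fill with the median of the observed samples.
    kind="below_detection": fill with factor * background.
    Returns None when the metabolite should be dropped instead, i.e. when
    more than max_missing_frac of the samples are missing (the 0.5 default
    is an assumption, not a value given in the text).
    """
    observed = [v for v in values if v is not None]
    if not values or not observed:
        return None  # nothing to impute from
    missing_frac = (len(values) - len(observed)) / len(values)
    if missing_frac > max_missing_frac:
        return None  # too many missing values: eliminate the metabolite

    if kind == "not_measured":
        fill = median(observed)
    elif kind == "below_detection":
        if background is None:
            raise ValueError("background is required for below_detection")
        fill = factor * background
    else:
        raise ValueError(f"unknown kind: {kind!r}")

    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    print(impute_metabolite([1.0, None, 3.0, 5.0], "not_measured"))
    print(impute_metabolite([None, 4.0], "below_detection", background=0.5))
    print(impute_metabolite([None, None, None, 2.0], "not_measured"))
```

In practice the reason for missingness would have to be decided per metabolite (or per value) before calling such a function, and the cutoff would be tuned to the dataset.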
Your study design is not complicated. A paired t-test for each metabolite, followed by false discovery rate (FDR) control across all metabolites, would be adequate if only two groups are involved. If 3 or more groups are to be compared, ANOVA or a linear mixed effects model will do.
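As a rough sketch of the two-group case, the code below computes the paired t statistic for one metabolite and applies a Benjamini-Hochberg adjustment to a list of p-values. Benjamini-Hochberg is one common FDR procedure and is an assumption here, since the text does not say which one is used. The function names are hypothetical, and converting the t statistic to a p-value (via the t distribution with n - 1 degrees of freedom) is left to a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired before/after measurements."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(after, before)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    t, df = paired_t_statistic([1.0, 2.0, 3.0], [2.0, 4.0, 3.0])
    print(t, df)
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

Metabolites whose adjusted p-value falls below the chosen FDR level (0.05 is a common but arbitrary choice) would then be reported as significant.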
We do machine learning/pattern recognition types of work a lot - they involve methods such as SVM (support vector machine), RF (random forest), PLS-DA, and various feature selection techniques.
Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.
Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and the AccuraScience data analysis team at the specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.