Matrix completion in genetic methylation studies: LMCC, a Linear Model of Coregionalization with informative Covariates
Karim Oualkacha · Feb 16, 2024
Date: 2024-02-16
Time: 15:30-16:30 (Montreal time)
Location: In person, Burnside 1104
https://mcgill.zoom.us/j/82678428848
Meeting ID: 826 7842 8848
Passcode: None
Abstract:
DNA methylation is an important epigenetic mark that modulates gene expression through the inhibition of transcriptional proteins binding to DNA. As in many other omics experiments, missing values is an issue and appropriate imputation techniques are important to avoid an unnecessary sample size reduction as well as to optimally leverage the information collected. We consider the case where a relatively small number of samples are processed via an expensive high-density Whole Genome Bisulfite Sequencing (WGBS) strategy and a larger number of samples are processed using more affordable low-density array-based technologies. In such cases, one can impute/complete the data matrix of the low coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this work, we propose an efficient Linear Model of Coregionalization with informative Covariates (LMCC) to predict missing values based on observed values and informative covariates. Our model assumes that at each genomics position, the methylation vector of all samples is linked to the set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across positions/sites by assuming Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially in cases where missing data contain some relevant information about the explanatory variable. We also show that the proposed model is efficient when the number of columns is much greater than the number of rows in the data matrix-which is usually the case in methylation data analysis. Finally, we apply and compare the proposed method with alternative approaches to complete a matrix of DNA methylation containing 15 rows (methylation samples) and 1 million columns (sites). Joint work with Melina Ribaud and Aurelie Labbe (HEC, Montreal).