McGill Statistics Seminar - McGill Statistics Seminars
  • Matrix completion in genetic methylation studies: LMCC, a Linear Model of Coregionalization with informative Covariates

    Date: 2024-02-16

    Time: 15:30-16:30 (Montreal time)

    Location: In person, Burnside 1104

    https://mcgill.zoom.us/j/82678428848

    Meeting ID: 826 7842 8848

    Passcode: None

    Abstract:

    DNA methylation is an important epigenetic mark that modulates gene expression through the inhibition of transcriptional proteins binding to DNA. As in many other omics experiments, missing values are an issue, and appropriate imputation techniques are important to avoid an unnecessary sample size reduction as well as to optimally leverage the information collected. We consider the case where a relatively small number of samples are processed via an expensive high-density Whole Genome Bisulfite Sequencing (WGBS) strategy and a larger number of samples are processed using more affordable low-density array-based technologies. In such cases, one can impute/complete the data matrix of the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this work, we propose an efficient Linear Model of Coregionalization with informative Covariates (LMCC) to predict missing values based on observed values and informative covariates. Our model assumes that at each genomic position, the methylation vector of all samples is linked to a set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across positions/sites by assuming Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially in cases where missing data contain some relevant information about the explanatory variable. We also show that the proposed model is efficient when the number of columns is much greater than the number of rows in the data matrix, which is usually the case in methylation data analysis. Finally, we apply and compare the proposed method with alternative approaches to complete a matrix of DNA methylation containing 15 rows (methylation samples) and 1 million columns (sites).
    Joint work with Melina Ribaud and Aurelie Labbe (HEC, Montreal).

  • Mesoscale two-sample testing for networks

    Date: 2024-02-09

    Time: 15:30-16:30 (Montreal time)

    Location: In person, Burnside 1104

    https://mcgill.zoom.us/j/87465663442

    Meeting ID: 874 6566 3442

    Passcode: None

    Abstract:

    Networks arise naturally in many scientific fields as a representation of pairwise connections. Statistical network analysis has most often considered a single large network, but it is common in a number of applications, for example, neuroimaging, to observe multiple networks on a shared node set. When these networks are grouped by case-control status or another categorical covariate, the classical statistical question of two-sample comparison arises. In this work, we address the problem of testing for statistically significant differences in a prespecified subset of the connections. This general framework allows an analyst to focus on a single node, a specific region of interest, or compare whole networks. In this “mesoscale” setting, we develop statistically sound projection-based tests for two-sample comparison in both weighted and binary edge networks. Our approach can leverage all available network information, and learn informative projections which improve testing power when low-dimensional network structure is present.

  • Fast calibration of FARIMA models with dependent errors

    Date: 2024-02-02

    Time: 15:30-16:30 (Montreal time)

    Location: Online, retransmitted in Burnside 1104

    https://mcgill.zoom.us/j/89669635642

    Meeting ID: 896 6963 5642

    Passcode: None

    Abstract:

    In this work, we investigate the asymptotic properties of Le Cam’s one-step estimator for weak Fractionally AutoRegressive Integrated Moving-Average (FARIMA) models. For these models, the noise terms are uncorrelated but are not necessarily independent, nor are they necessarily martingale differences. We show under some regularity assumptions that the one-step estimator is strongly consistent and asymptotically normal with the same asymptotic variance as the least squares estimator. We show through simulations that the proposed estimator reduces computational time compared with the least squares estimator.
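
    To make the one-step idea concrete, here is a minimal sketch on a toy problem rather than on a weak FARIMA model: estimating the location of i.i.d. Cauchy data. It starts from a cheap consistent estimate (the sample median) and applies a single scoring step instead of iterating an optimizer to convergence. The model, the function names, and the sample size are all assumptions made for illustration; none of them come from the talk.

```python
import math
import random
import statistics


def cauchy_score(theta, xs):
    """Derivative in theta of the Cauchy(theta, scale=1) log-likelihood."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)


def one_step_estimate(xs):
    """Sample median followed by one Fisher-scoring step."""
    theta0 = statistics.median(xs)   # cheap initial estimate
    fisher_info = 0.5 * len(xs)      # Fisher information is 1/2 per observation
    return theta0 + cauchy_score(theta0, xs) / fisher_info


def simulate_cauchy(theta, n, seed):
    """Draw n Cauchy(theta, 1) values by inverting the CDF."""
    rng = random.Random(seed)
    return [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


xs = simulate_cauchy(theta=3.0, n=5000, seed=0)
theta_hat = one_step_estimate(xs)
print(f"median: {statistics.median(xs):.4f}, one-step: {theta_hat:.4f}")
```

    The single update costs one pass over the data, which is the source of the computational saving; in the FARIMA setting of the talk the initial estimator and the objective function would be different.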

  • Imaging and Clinical Biomarker Estimation in Alzheimer’s Disease

    Date: 2024-01-19

    Time: 15:30-16:30 (Montreal time)

    Location: Online, retransmitted in Burnside 1104

    https://mcgill.zoom.us/j/85422946487

    Meeting ID: 854 2294 6487

    Passcode: None

    Abstract:

    Estimation of biomarkers related to disease classification and modeling of its progression is essential for treatment development for Alzheimer’s Disease (AD). The task is more daunting for characterizing relatively rare AD subtypes such as early-onset AD (EOAD). In this talk, I will describe the Longitudinal Early-Onset Alzheimer’s Disease Study (LEADS), which aims to collect and publicly distribute clinical, imaging, genetic, and other types of data from people with EOAD, as well as cognitively normal (CN) controls and people with early-onset non-amyloid positive (EOnonAD) dementias. I will discuss manifold estimation methods for estimating surfaces of shapes in the brain from data clouds, as well as longitudinal manifold learning methods for modeling trajectories of shape changes in the brain over time. Finally, I will discuss our work in leveraging magnetic resonance imaging and positron emission tomography data to characterize distributions of white matter hyperintensities in people with EOAD and to obtain imaging-based biomarkers of disease trajectories of AD subtypes.

  • New Advances in High-Dimensional DNA Methylation Analysis in Cancer Epigenetics Using Trans-dimensional Hidden Markov Models

    Date: 2024-01-12

    Time: 15:30-16:30 (Montreal time)

    Location: In person, Burnside 1104

    https://mcgill.zoom.us/j/83008174313

    Meeting ID: 830 0817 4313

    Passcode: None

    Abstract:

    Epigenetic alterations are key drivers in the development and progression of cancer. Identifying differentially methylated cytosines (DMCs) in cancer samples is a crucial step toward understanding these changes. In this talk, we propose a trans-dimensional Markov chain Monte Carlo (TMCMC) approach, called DMCTHM, that uses hidden Markov models (HMMs) with binomial emission and bisulfite sequencing (BS-Seq) data to identify DMCs in cancer epigenetic studies. We introduce the Expander-Collider penalty to tackle under- and over-estimation in TMCMC-HMMs. We address all known challenges inherent in BS-Seq data by introducing novel approaches for capturing functional patterns and autocorrelation structure of the data, as well as for handling missing values, multiple covariates, multiple comparisons, and family-wise errors. We demonstrate the effectiveness of DMCTHM through comprehensive simulation studies. The results show that our proposed method outperforms other competing methods in identifying DMCs. Notably, with DMCTHM, we uncovered new DMCs and genes in colorectal cancer that were significantly enriched in the Tp53 pathway.
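
    As a rough illustration of what "HMM with binomial emission" means for read-count data, the sketch below evaluates the log-likelihood of a short sequence of (methylated reads, coverage) pairs with the scaled forward algorithm. It assumes a fixed, known number of hidden states and known parameters; it does not implement DMCTHM, the trans-dimensional sampler, or the Expander-Collider penalty, and every name and number in it is hypothetical.

```python
import math

import numpy as np


def binom_pmf(k, n, p):
    """Probability of k methylated reads out of n when the methylation rate is p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def forward_loglik(meth, cov, init, trans, probs):
    """Log-likelihood of the read counts under an HMM with binomial emissions.

    meth[t], cov[t]: methylated reads and total reads at site t
    init:  initial state distribution, shape (K,)
    trans: state transition matrix, shape (K, K), rows sum to 1
    probs: methylation rate of each hidden state, length K
    """
    init = np.asarray(init, dtype=float)
    trans = np.asarray(trans, dtype=float)
    loglik = 0.0
    alpha = None
    for t, (k, n) in enumerate(zip(meth, cov)):
        emit = np.array([binom_pmf(k, n, p) for p in probs])
        pred = init if t == 0 else alpha @ trans  # predicted state distribution
        alpha = pred * emit
        scale = alpha.sum()                       # likelihood of site t given the past
        loglik += math.log(scale)
        alpha = alpha / scale                     # rescale to avoid underflow
    return loglik


# Hypothetical two-state example: a low-methylation and a high-methylation state.
meth = [1, 0, 2, 9, 8, 10]
cov = [10, 8, 12, 10, 9, 11]
ll = forward_loglik(
    meth, cov,
    init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    probs=[0.1, 0.9],
)
print(f"log-likelihood: {ll:.3f}")
```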

  • Robust and Tuning-Free Sparse Linear Regression via Square-Root Slope

    Date: 2023-11-17

    Time: 15:30-16:30 (Montreal time)

    Location: Online, retransmitted in Burnside 1104

    https://mcgill.zoom.us/j/81865630475

    Meeting ID: 818 6563 0475

    Passcode: None

    Abstract:

    We consider the high-dimensional linear regression model and assume that a fraction of the responses are contaminated by an adversary with complete knowledge of the data and the underlying distribution. We are interested in the situation when the dense additive noise can be heavy-tailed but the predictors have sub-Gaussian distribution. We establish minimax lower bounds that depend on the fraction of the contaminated data and the tails of the additive noise. Moreover, we design a modification of the square root Slope estimator with several desirable features: (a) it is provably robust to adversarial contamination, with the performance guarantees that take the form of sub-Gaussian deviation inequalities and match the lower error bounds up to log-factors; (b) it is fully adaptive with respect to the unknown sparsity level and the variance of the noise, and (c) it is computationally tractable as a solution of a convex optimization problem. To analyze the performance of the proposed estimator, we prove several properties of matrices with sub-Gaussian rows that could be of independent interest. This is joint work with Stanislav Minsker and Lang Wang.

  • Copula-based estimation of health inequality measures

    Date: 2023-11-10

    Time: 15:30-16:30 (Montreal time)

    Location: In person, Burnside 1104

    https://mcgill.zoom.us/j/89337793218

    Meeting ID: 893 3779 3218

    Passcode: None

    Abstract:

    This paper aims to use copulas to derive estimators of the health concentration curve and Gini coefficient for health distribution. We highlight the importance of expressing health inequality measures in terms of a copula, which we in turn use to build copula-based semi- and nonparametric estimators of the above measures. Thereafter, we study the asymptotic properties of these estimators. In particular, we establish their consistency and asymptotic normality. We provide expressions for their variances, which can be used to construct confidence intervals and build tests for the health concentration curve and Gini health coefficient. A Monte-Carlo simulation exercise shows that the semiparametric estimator outperforms the smoothed nonparametric estimator, and the latter does better than the empirical estimator in terms of Mean Squared Error. We also run an extensive empirical study where we apply our estimators to show that the inequalities across U.S. states’ socioeconomic variables like income/poverty and race/ethnicity explain the observed inequalities in COVID-19 infections and deaths in the U.S.
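
    For readers unfamiliar with the quantities being estimated, the following sketch computes the plain empirical concentration curve, the concentration index, and the Gini coefficient with NumPy. This corresponds only to the simple empirical baseline, not to the copula-based estimators of the talk; the function names and the toy data are assumptions.

```python
import numpy as np


def concentration_curve(health, rank_by):
    """Cumulative share of health against cumulative population share,
    with individuals ordered by the ranking variable (e.g. income)."""
    order = np.argsort(rank_by, kind="stable")
    h = np.asarray(health, dtype=float)[order]
    cum_pop = np.arange(len(h) + 1) / len(h)
    cum_share = np.concatenate([[0.0], np.cumsum(h)]) / h.sum()
    return cum_pop, cum_share


def concentration_index(health, rank_by):
    """One minus twice the area under the concentration curve (trapezoid rule)."""
    p, share = concentration_curve(health, rank_by)
    area = np.sum((share[1:] + share[:-1]) / 2.0 * np.diff(p))
    return 1.0 - 2.0 * area


def gini(health):
    """Gini coefficient: concentration index with health ranked by itself."""
    return concentration_index(health, health)


# Toy data: health decreases as income increases.
health = [4.0, 3.0, 2.0, 1.0]
income = [1.0, 2.0, 3.0, 4.0]
print(gini(health), concentration_index(health, income))
```

    With these toy values the concentration index is negative (about -0.25), meaning the health variable is concentrated among the lower-income individuals, while the Gini coefficient of the same health values is positive.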

  • Reduced-Rank Envelope Vector Autoregressive Models

    Date: 2023-11-03

    Time: 15:30-16:30 (Montreal time)

    Location: In person, Burnside 1104

    https://mcgill.zoom.us/j/2571023554

    Meeting ID: 257 102 3554

    Passcode: None

    Abstract:

    Classical vector autoregressive (VAR) models have long been a popular choice for modeling multivariate time series data due to their flexibility and ease of use. However, the VAR model suffers from overparameterization, which is a serious issue for high-dimensional time series data as it restricts the number of variables and lags that can be incorporated into the model. Several statistical methods have been proposed to achieve dimension reduction in the parameter space of VAR models. Yet, these methods prove inefficient in extracting relevant information from complex datasets, as they fail to distinguish information that is aligned with the scientific objectives from information that is not, and they are also inefficient in addressing rank deficiency problems. Envelope methods, founded on novel parameterizations that employ reduced subspaces to establish connections between the mean function and covariance matrix, offer a solution by efficiently identifying and eliminating irrelevant information. In this presentation, we introduce a new, parsimonious VAR model that incorporates the concept of envelope models into the reduced-rank VAR framework and can achieve substantial dimension reduction and efficient parameter estimation. We will present the results of simulation studies and real data analysis comparing the performance of our proposed model with that of existing models in the literature.
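
    The reduced-rank building block mentioned above can be sketched in a few lines for a VAR(1) model: fit the transition matrix by least squares, then project it onto the leading directions of the fitted values. This is ordinary reduced-rank regression with identity weighting and contains no envelope component, so it should be read as background for the talk rather than as the proposed model. The simulated transition matrix, the sample size, and the function names are assumptions.

```python
import numpy as np


def simulate_var1(A, n_steps, seed=0):
    """Simulate y_t = A y_{t-1} + e_t with standard normal noise."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    y = np.zeros((n_steps, p))
    for t in range(1, n_steps):
        y[t] = A @ y[t - 1] + rng.standard_normal(p)
    return y


def reduced_rank_var1(y, rank):
    """Return (rank-constrained, unconstrained) estimates of the transition matrix."""
    X, Y = y[:-1], y[1:]
    # Least squares for Y ~ X @ B, where B is the transpose of the transition matrix.
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # Leading right singular vectors of the fitted values span the retained directions.
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V_r = Vt[:rank].T
    B_rr = B_ols @ V_r @ V_r.T
    return B_rr.T, B_ols.T


# Rank-one, stable transition matrix (its only nonzero eigenvalue is 0.16).
A_true = 0.8 * np.outer([1.0, 0.5, -0.5], [0.6, 0.0, 0.8])
y = simulate_var1(A_true, n_steps=500)
A_rr, A_ols = reduced_rank_var1(y, rank=1)
print(np.linalg.matrix_rank(A_rr), np.linalg.matrix_rank(A_ols))
```

    The rank-one estimate uses far fewer free parameters than the unconstrained fit, which is the kind of dimension reduction the abstract refers to.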

  • Doubly Robust Estimation under Covariate-induced Dependent Left Truncation

    Date: 2023-10-27

    Time: 15:30-16:30 (Montreal time)

    Location: Online, retransmitted in Burnside 1104

    https://mcgill.zoom.us/j/84195498572

    Meeting ID: 841 9549 8572

    Passcode: None

    Abstract:

    In prevalent cohort studies with follow-up, the time-to-event outcome is subject to left truncation leading to selection bias. For estimation of the distribution of time-to-event, conventional methods adjusting for left truncation tend to rely on the (quasi-)independence assumption that the truncation time and the event time are “independent” on the observed region. This assumption is violated when there is dependence between the truncation time and the event time possibly induced by measured covariates. Inverse probability of truncation weighting leveraging covariate information can be used in this case, but it is sensitive to misspecification of the truncation model. In this work, we apply the semiparametric theory to find the efficient influence curve of an expected (arbitrarily transformed) survival time in the presence of covariate-induced dependent left truncation. We then use it to construct estimators that are shown to enjoy double-robustness properties. Our work represents the first attempt to construct doubly robust estimators in the presence of left truncation, which does not fall under the established framework of coarsened data where doubly robust approaches are developed. We provide technical conditions for the asymptotic properties that appear to not have been carefully examined in the literature for time-to-event data, and study the estimators via extensive simulation. We apply the estimators to two data sets from practice, with different right-censoring patterns.

  • Neural network architectures for functional data analysis

    Date: 2023-10-20

    Time: 15:30-16:30 (Montreal time)

    Location: In person, Burnside 1104

    https://mcgill.zoom.us/j/89761165882

    Meeting ID: 897 6116 5882

    Passcode: None

    Abstract:

    Functional data are defined as random variables that take values over a continuous domain, such as time or space. In applications, such data are usually observed discretely at regularly or irregularly spaced points over the domain. In this talk, we discuss ways to adapt modern neural network architectures for the analysis of functional data. To do so, we design new neural network layers in order to process functional data either as input, output or both. First, we propose the functional output layer, which can be used to solve a multitude of function-on-scalar regression problems in a non-linear way. The proposed layer provides a smooth representation of the output and we demonstrate how to regularize such a layer during the network training phase. Second, we propose a concept for functional weights that project functional data to a scalar representation, leading to a novel formulation for a functional input layer. We demonstrate how to combine both of these proposed functional layers to create a functional autoencoder. This model takes as input the data in the form it is usually collected, as discrete points over the domain, and can be used for feature extraction and functional data smoothing. We demonstrate the benefits of the proposed architectures with various experiments on simulated data and real data applications. We conclude with a brief discussion of ongoing work in the design of a functional convolution layer that bridges the gap between the discrete convolution operation and its continuous counterpart.