Past Seminar Series - McGill Statistics Seminars
  • Joint integrative analysis of multiple data sources with correlated vector outcomes

    Date: 2021-02-19

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    We consider the joint estimation of regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. The primary goal of this joint integrative analysis is to estimate covariate effects on all vector outcomes through a marginal regression model in a statistically and computationally efficient way. We present a general class of distributed estimators that can be implemented in a parallelized computational scheme. Modelling, computational and theoretical challenges are overcome by first fitting a local model within each data source and then combining local results while accounting for correlation between data sources. This approach to distributed estimation and inference is formulated using Hansen’s generalized method of moments but implemented via an asymptotically equivalent and communication-efficient meta-estimator. We show both theoretically and numerically that the proposed method yields efficiency improvements and is computationally fast. We illustrate the proposed methodology with the joint integrative analysis of metabolic pathways in a large multi-cohort study.
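
    A minimal sketch of the combining step, assuming each source returns a local estimate with an estimated covariance and, for simplicity, treating sources as independent (the talk's GMM-based meta-estimator additionally accounts for between-source correlation). All names below are illustrative:

        import numpy as np

        def combine_local_estimates(thetas, covs):
            """Inverse-variance-weighted combination of local fits.

            thetas : list of (p,) local coefficient estimates
            covs   : list of (p, p) estimated covariances of those estimates
            """
            precisions = [np.linalg.inv(V) for V in covs]
            total_precision = np.sum(precisions, axis=0)
            weighted_sum = np.sum([P @ t for P, t in zip(precisions, thetas)],
                                  axis=0)
            theta = np.linalg.solve(total_precision, weighted_sum)
            return theta, np.linalg.inv(total_precision)

        # Two sources estimating the same two coefficients
        t1, t2 = np.array([1.0, 0.5]), np.array([1.2, 0.4])
        V1, V2 = np.diag([0.04, 0.09]), np.diag([0.02, 0.05])
        theta, cov = combine_local_estimates([t1, t2], [V1, V2])

    Each source is fitted once locally, so the scheme parallelizes naturally and needs only one round of communication, which is the property the abstract emphasizes.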

  • Spatio-temporal methods for estimating subsurface ocean thermal response to tropical cyclones

    Date: 2021-02-12

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    Tropical cyclones (TCs), driven by heat exchange between the air and sea, pose a substantial risk to many communities around the world. Accurate characterization of the subsurface ocean thermal response to TC passage is crucial for accurate TC intensity forecasts and for understanding the role TCs play in the global climate system, yet that characterization is complicated by the high-noise ocean environment, correlations inherent in spatio-temporal data, relative scarcity of in situ observations and the entanglement of the TC-induced signal with seasonal signals. We present a general methodological framework that addresses these difficulties, integrating existing techniques in seasonal mean field estimation, Gaussian process modeling, and nonparametric regression into a functional ANOVA model. Importantly, we improve upon past work by properly handling seasonality, providing rigorous uncertainty quantification, and treating time as a continuous variable, rather than producing estimates that are binned in time. This functional ANOVA model is estimated using in situ subsurface temperature profiles from the Argo fleet of autonomous floats through a multi-step procedure, which (1) characterizes the upper ocean seasonal shift during the TC season; (2) models the variability in the temperature observations; (3) fits a thin plate spline using the variability estimates to account for heteroskedasticity and correlation between the observations. This spline fit reveals the ocean thermal response to TC passage. Through this framework, we obtain new scientific insights into the interaction between TCs and the ocean on a global scale, including a three-dimensional characterization of the near-surface and subsurface cooling along the TC storm track and the mixing-induced subsurface warming on the track’s right side. Joint work with Addison Hu, Ann Lee, Donata Giglio and Kimberly Wood.
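
    A toy sketch of step (3): fitting a thin plate spline with per-observation smoothing so that noisier observations are penalized more, a simple stand-in for the heteroskedasticity handling described above. The coordinates, anomaly values, and variance estimates are synthetic placeholders, not Argo data:

        import numpy as np
        from scipy.interpolate import RBFInterpolator

        rng = np.random.default_rng(0)
        n = 500
        X = rng.uniform(-1, 1, size=(n, 2))      # e.g. (cross-track distance, time)
        dT = -0.5 * np.exp(-X[:, 0] ** 2) * X[:, 1] + rng.normal(0, 0.2, n)
        var_hat = np.full(n, 0.04)               # pretend output of step (2)

        # Thin plate spline with per-point smoothing weights
        tps = RBFInterpolator(X, dT, kernel='thin_plate_spline',
                              smoothing=var_hat * n)

        grid = np.stack(np.meshgrid(np.linspace(-1, 1, 50),
                                    np.linspace(-1, 1, 50)),
                        axis=-1).reshape(-1, 2)
        response = tps(grid)                     # estimated TC response surface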

  • An Adaptive Algorithm to Multi-armed Bandit Problem with High-dimensional Covariates

    Date: 2021-02-05

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    This work studies an important sequential decision-making problem known as the multi-armed bandit problem with covariates. Under a linear bandit framework with high-dimensional covariates, we propose a general arm allocation algorithm that integrates both arm elimination and randomized assignment strategies. By employing a class of high-dimensional regression methods for coefficient estimation, the proposed algorithm is shown to have near-optimal finite-time regret performance under a new study scope that requires neither a margin condition nor a reward gap condition for competitive arms. By verifying the benefit of the margin, our algorithm adapts automatically to the margin and gap conditions and attains the optimal regret rates under both study scopes, with or without the margin, up to a logarithmic factor. The proposed algorithm also generates useful coefficient estimates for the competitive arms and is shown to achieve both estimation consistency and variable selection consistency. Promising empirical performance is demonstrated through two real-data examples in drug dose assignment and news article recommendation.
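
    A simplified sketch of the two ingredients the algorithm combines, arm elimination plus randomized assignment among surviving arms, with per-arm Lasso coefficient estimates. This illustrates the general recipe, not the paper's algorithm; the elimination threshold, warm-up length, and reward_fn interface are arbitrary choices:

        import numpy as np
        from sklearn.linear_model import Lasso

        def run_bandit(contexts, reward_fn, K, warmup=20, margin=0.1, alpha=0.1):
            T, p = contexts.shape
            hist = {k: ([], []) for k in range(K)}      # per-arm (X, y) history
            betas = np.zeros((K, p))
            rng = np.random.default_rng(0)
            for t in range(T):
                x = contexts[t]
                if t < warmup * K:
                    k = t % K                           # forced exploration
                else:
                    est = betas @ x                     # estimated mean rewards
                    survivors = np.where(est >= est.max() - margin)[0]
                    k = rng.choice(survivors)           # randomize among survivors
                r = reward_fn(t, k)
                hist[k][0].append(x)
                hist[k][1].append(r)
                if len(hist[k][1]) >= 5:                # sparse coefficient update
                    betas[k] = Lasso(alpha=alpha).fit(*hist[k]).coef_
            return betas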

  • Small Area Estimation in Low- and Middle-Income Countries

    Date: 2021-01-29

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    The under-five mortality rate (U5MR) is a key barometer of the health of a nation. Unfortunately, many people living in low- and middle-income countries are not covered by civil registration systems. This makes estimation of the U5MR, particularly at the subnational level, difficult. In this talk, I will describe models that have been developed to produce the official United Nations (UN) subnational U5MR estimates in 22 countries. Estimation is based on household surveys, which use stratified, two-stage cluster sampling. I will describe a range of area- and unit-level models and explain the rationale for the modeling we carry out. Data sparsity in time and space is a key challenge, and smoothing models are vital. I will discuss the advantages and disadvantages of discrete and continuous spatial models, in the context of estimation at the scale at which health interventions are made. Other issues that will be touched upon include: design-based versus model-based inference; adjustments for HIV epidemics; the inclusion of so-called indirect (summary birth history) data; reproducibility through software availability; benchmarking; how to deal with incomplete geographical data; and working with the UN to produce estimates.
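
    A toy illustration of why smoothing helps when direct estimates are sparse: a Fay-Herriot-style shrinkage of noisy area-level estimates toward a synthetic overall level. This is a deliberately minimal stand-in for the space-time models the talk describes, with made-up numbers:

        import numpy as np

        def area_level_smooth(y, V):
            """Shrink direct estimates y (with sampling variances V)."""
            mu = np.average(y, weights=1.0 / V)          # synthetic overall level
            # Method-of-moments estimate of between-area variance
            sigma2 = max(0.0, np.mean((y - mu) ** 2 - V))
            w = sigma2 / (sigma2 + V)                    # per-area shrinkage weight
            return w * y + (1 - w) * mu

        # Noisy direct U5MR estimates (logit scale) for five areas
        y = np.array([-3.0, -2.4, -2.9, -3.5, -2.1])
        V = np.array([0.20, 0.05, 0.40, 0.10, 0.60])
        smoothed = area_level_smooth(y, V)               # noisier areas shrink more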

  • Large-scale Machine Learning Algorithms for Biomedical Data Science

    Date: 2021-01-15

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    During the last decade, hundreds of machine learning methods have been developed for disease outcome prediction based on high-throughput genomics data. However, these algorithms ignore the quality of the input genomic features and the output clinical variables. In this talk, I will introduce two studies that develop methods to learn more accurate molecular signatures and drug response values for cancer research. These studies are supported by NSF, NIH, and Moffitt Cancer Center.

  • Quasi-random sampling for multivariate distributions via generative neural networks

    Date: 2020-12-04

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    A novel approach based on generative neural networks is introduced for constructing quasi-random number generators for multivariate models with any underlying copula in order to estimate expectations with variance reduction. So far, quasi-random number generators for multivariate distributions have required careful design, exploiting specific properties (such as conditional distributions) of the implied copula or the underlying quasi-Monte Carlo point set, and have been tractable for only a small number of models. Utilizing specific generative neural networks allows one to construct quasi-random number generators for a much larger variety of multivariate distributions without such restrictions. Once trained with a pseudo-random sample, these neural networks only require a multivariate standard uniform randomized quasi-Monte Carlo point set as input and are thus fast in estimating expectations under dependence with variance reduction. Reproducible numerical examples are considered to demonstrate the approach. Emphasis is put on ideas rather than mathematical proofs.
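
    A small sketch of the mechanics, with a Gaussian-copula map standing in for the trained generative network (the network plays exactly this role: a deterministic map from uniforms to the target distribution). The integrand f is an arbitrary example:

        import numpy as np
        from scipy.stats import norm, qmc

        d, n = 2, 2 ** 12
        L = np.linalg.cholesky(np.array([[1.0, 0.7], [0.7, 1.0]]))

        def generator(u):                # stand-in for the trained network
            return norm.ppf(u) @ L.T     # uniforms -> correlated normals

        sobol = qmc.Sobol(d, scramble=True, seed=0)      # randomized QMC input
        u_rqmc = sobol.random(n)
        u_mc = np.random.default_rng(0).random((n, d))   # plain Monte Carlo

        f = lambda x: np.maximum(x.sum(axis=1), 0.0)
        print('RQMC:', f(generator(u_rqmc)).mean())
        print('  MC:', f(generator(u_mc)).mean())

    Because the quasi-random points fill the unit hypercube more evenly than pseudo-random ones, the RQMC estimate typically has lower variance across randomizations.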

  • Probabilistic Approaches to Machine Learning on Tensor Data

    Date: 2020-11-27

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    In contemporary scientific research, it is often of great interest to predict a categorical response based on a high-dimensional tensor (i.e., a multi-dimensional array). Motivated by applications in science and engineering, we propose two probabilistic methods for machine learning on tensor data in the supervised and the unsupervised context, respectively. For supervised problems, we develop a comprehensive discriminant analysis model, called the CATCH model. The CATCH model integrates the information from the tensor and additional covariates to predict the categorical outcome with high accuracy. We further consider unsupervised problems, where no categorical response is available even on the training data. A doubly-enhanced EM (DEEM) algorithm is proposed for model-based tensor clustering, in which both the E-step and the M-step are carefully tailored for tensor data. CATCH and DEEM are developed under explicit statistical models with clear interpretations. They aggressively take advantage of the tensor structure and sparsity to tackle the new computational and statistical challenges arising from the intimidating tensor dimensions. Efficient algorithms are developed to solve the related optimization problems. Under mild conditions, CATCH and DEEM are shown to be consistent even when the dimension of each mode grows exponentially with the sample size. Numerical studies also strongly support the application of CATCH and DEEM.
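
    To fix ideas, here is a toy version of the supervised problem with a crude vectorize-then-sparse-logistic baseline; CATCH instead models the tensor structure directly, which is where its statistical and computational gains come from. All dimensions and the signal below are synthetic:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(1)
        n, dims = 200, (10, 10, 10)
        X = rng.normal(size=(n, *dims))              # tensor covariates
        B = np.zeros(dims)
        B[0, 0, :3] = 1.5                            # sparse tensor effect
        y = (np.einsum('nijk,ijk->n', X, B)
             + rng.normal(size=n) > 0).astype(int)

        # Baseline: flatten the tensor and ignore its mode structure
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
        clf.fit(X.reshape(n, -1), y)
        print('selected features:', np.count_nonzero(clf.coef_))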

  • Modeling viral rebound trajectories after analytical antiretroviral treatment interruption

    Date: 2020-11-20

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Despite the success of combined antiretroviral therapy (ART) in achieving sustained control of viral replication, concerns about side effects, drug-drug interactions, drug resistance, and cost motivate the search for strategies to achieve HIV eradication or ART-free remission. Following ART withdrawal, patients’ viral load levels usually increase rapidly to a peak followed by a dip, and then stabilize at a viral load set point. Characterizing features of the viral rebound trajectories (e.g., time to viral rebound and viral set points) and identifying host, virological, and immunological factors that are predictive of these features require addressing analytical challenges such as non-linear viral rebound trajectories, coarsened data due to the assay’s limit of quantification, and intermittent measurements of viral load values. We first introduce a parametric nonlinear mixed effects (NLME) model for the viral rebound trajectory and compare its performance to a mechanistic modeling approach. We then develop a smoothed simulated pseudo maximum likelihood method for fitting NLME models that permits flexible specification of random effects distributions. Finally, we investigate the association between the time to viral suppression after ART initiation and the time to viral rebound after ART interruption through a Cox proportional hazards regression model where both the outcome and the covariate are interval-censored observations.
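
    As a toy illustration of the trajectory-fitting task, here is a single-subject nonlinear fit of a generic sigmoidal rebound curve rising to a set point (ignoring the peak-and-dip feature described above); the talk's NLME approach adds subject-level random effects and a proper treatment of values censored at the assay limit:

        import numpy as np
        from scipy.optimize import curve_fit

        def rebound(t, vmax, rate, t50):             # generic sigmoid, log10 scale
            return vmax / (1.0 + np.exp(-rate * (t - t50)))

        t = np.arange(0.0, 60.0, 3.0)                # days after ART interruption
        rng = np.random.default_rng(2)
        v = rebound(t, 4.5, 0.4, 21.0) + rng.normal(0, 0.2, t.size)
        v = np.maximum(v, 1.3)                       # crude handling of assay limit

        params, _ = curve_fit(rebound, t, v, p0=[4.0, 0.3, 20.0])
        print(dict(zip(['vmax', 'rate', 't50'], np.round(params, 2))))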

  • Approximate Cross-Validation for Large Data and High Dimensions

    Date: 2020-11-13

    Time: 15:30-16:30

    Zoom Link

    Abstract:

    The error or variability of statistical and machine learning algorithms is often assessed by repeatedly re-fitting a model with different weighted versions of the observed data. The ubiquitous tools of cross-validation (CV) and the bootstrap are examples of this technique. These methods are powerful in large part due to their model agnosticism but can be slow to run on modern, large data sets due to the need to repeatedly re-fit the model. We use a linear approximation to the dependence of the fitting procedure on the weights, producing results that can be faster than repeated re-fitting by orders of magnitude. This linear approximation is sometimes known as the “infinitesimal jackknife” (IJ) in the statistics literature, where it has mostly been used as a theoretical tool to prove asymptotic results. We provide explicit finite-sample error bounds for the infinitesimal jackknife in terms of a small number of simple, verifiable assumptions. We note, however, that without further modification the IJ deteriorates in accuracy in high dimensions and incurs a running time roughly cubic in the dimension. We then show how dimensionality reduction can be used to run the IJ successfully in high dimensions when the data are sparse or low rank. Simulated and real-data experiments support our theory.
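
    A minimal instance of the idea for L2-regularized logistic regression: approximate each leave-one-out fit with a single Newton correction computed from the full-data Hessian, instead of n refits. Everything below is a self-contained toy, not the paper's code:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n, p, lam = 300, 5, 1.0
        X = rng.normal(size=(n, p))
        y = (X @ rng.normal(size=p) + rng.normal(size=n) > 0).astype(int)

        clf = LogisticRegression(C=1.0 / lam, fit_intercept=False).fit(X, y)
        theta = clf.coef_.ravel()
        prob = 1.0 / (1.0 + np.exp(-X @ theta))
        H = X.T @ ((prob * (1 - prob))[:, None] * X) + lam * np.eye(p)

        grads = X * (prob - y)[:, None]              # per-point gradient terms
        theta_loo = theta + np.linalg.solve(H, grads.T).T  # one row per held-out i

    Each row of theta_loo approximates the fit with that observation removed, at the cost of one linear solve rather than one optimization per point.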

  • Generalized Energy-Based Models

    Date: 2020-11-06

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    I will introduce Generalized Energy Based Models (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the “generator”). In particular, while the energy function is analogous to the GAN critic function, it is not discarded after training. GEBMs are trained by alternating between learning the energy and the base, much like a GAN. Both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better-quality samples. Empirically, the GEBM samples on image-generation tasks are of better quality than those from the learned generator alone, indicating that, all else being equal, the GEBM will outperform a GAN of the same complexity. GEBMs also achieve state-of-the-art performance on density modelling tasks when using base measures with an explicit form.
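
    A toy latent-space sampler in the spirit of the MCMC step: unadjusted Langevin dynamics targeting a density over latents proportional to exp(E(g(z))) times a standard normal prior, with a linear stand-in generator and a hand-written energy (the real model learns both):

        import torch

        torch.manual_seed(0)
        G = torch.randn(2, 2)                        # stand-in linear "generator"
        g = lambda z: z @ G.T
        E = lambda x: -0.5 * ((x - 1.0) ** 2).sum(dim=1)   # toy energy

        def log_target(z):                           # energy + latent prior
            return E(g(z)) - 0.5 * (z ** 2).sum(dim=1)

        z = torch.zeros(256, 2, requires_grad=True)
        eta = 1e-2
        for _ in range(500):                         # unadjusted Langevin steps
            grad, = torch.autograd.grad(log_target(z).sum(), z)
            with torch.no_grad():
                z = z + 0.5 * eta * grad + eta ** 0.5 * torch.randn_like(z)
            z.requires_grad_(True)

        samples = g(z).detach()                      # refined samples via MCMC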