/categories/mcgill-statistics-seminar/index.xml McGill Statistics Seminar - McGill Statistics Seminars
  • Joint integrative analysis of multiple data sources with correlated vector outcomes

    Date: 2021-02-19

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    We consider the joint estimation of regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. The primary goal of this joint integrative analysis is to estimate covariate effects on all vector outcomes through a marginal regression model in a statistically and computationally efficient way. We present a general class of distributed estimators that can be implemented in a parallelized computational scheme. Modelling, computational and theoretical challenges are overcome by first fitting a local model within each data source and then combining local results while accounting for correlation between data sources. This approach to distributed estimation and inference is formulated using Hansen’s generalized method of moments but implemented via an asymptotically equivalent and communication-efficient meta-estimator. We show both theoretically and numerically that the proposed method yields efficiency improvements and is computationally fast. We illustrate the proposed methodology with the joint integrative analysis of metabolic pathways in a large multi-cohort study.

  • An Adaptive Algorithm to Multi-armed Bandit Problem with High-dimensional Covariates

    Date: 2021-02-05

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    This work studies an important sequential decision making problem known as the multi-armed bandit problem with covariates. Under a linear bandit framework with high-dimensional covariates, we propose a general arm allocation algorithm that integrates both arm elimination and randomized assignment strategies. By employing a class of high-dimensional regression methods for coefficient estimation, the proposed algorithm is shown to have near optimal finite-time regret performance under a new study scope that requires neither a margin condition nor a reward gap condition for competitive arms. Based on synergistically verified benefit of the margin, our algorithm exhibits an adaptive performance that automatically adapts to the margin and gap conditions, and attains the optimal regret rates under both study scopes, without or with the margin, up to a logarithmic factor. The proposed algorithm also simultaneously generates useful coefficient estimation output for competitive arms and is shown to achieve both estimation consistency and variable selection consistency. Promising empirical performance is demonstrated through two real data evaluation examples in drug dose assignment and news article recommendation.

  • Large-scale Machine Learning Algorithms for Biomedical Data Science

    Date: 2021-01-15

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    During the last decade, hundreds of machine learning methods have been developed for disease outcome prediction based on high-throughput genomics data. However, the quality of the input genomics features and the output clinical variables has been ignored in these algorithms. In this talk, I will introduce two studies that develop methods to learn more accurate molecular signatures and drug response values for cancer research. These studies are supported by NSF, NIH, and Moffitt Cancer Center.

  • Quasi-random sampling for multivariate distributions via generative neural networks

    Date: 2020-12-04

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    A novel approach based on generative neural networks is introduced for constructing quasi-random number generators for multivariate models with any underlying copula in order to estimate expectations with variance reduction. So far, quasi-random number generators for multivariate distributions required a careful design, exploiting specific properties (such as conditional distributions) of the implied copula or the underlying quasi-Monte Carlo point set, and were only tractable for a small number of models. Utilizing specific generative neural networks allows one to construct quasi-random number generators for a much larger variety of multivariate distributions without such restrictions. Once trained with a pseudo-random sample, these neural networks only require a multivariate standard uniform randomized quasi-Monte Carlo point set as input and are thus fast in estimating expectations under dependence with variance reduction. Reproducible numerical examples are considered to demonstrate the approach. Emphasis is put on ideas rather than mathematical proofs.

  • Probabilistic Approaches to Machine Learning on Tensor Data

    Date: 2020-11-27

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    In contemporary scientific research, it is often of great interest to predict a categorical response based on a high-dimensional tensor (i.e. multi-dimensional array). Motivated by applications in science and engineering, we propose two probabilistic methods for machine learning on tensor data in the supervised and the unsupervised context, respectively. For supervised problems, we develop a comprehensive discriminant analysis model, called the CATCH model. The CATCH model integrates the information from the tensor and additional covariates to predict the categorical outcome with high accuracy. We further consider unsupervised problems, where no categorical response is available even on the training data. A doubly-enhanced EM (DEEM) algorithm is proposed for model-based tensor clustering, in which both the E-step and the M-step are carefully tailored for tensor data. CATCH and DEEM are developed under explicit statistical models with clear interpretations. They aggressively take advantage of the tensor structure and sparsity to tackle the new computational and statistical challenges arising from the intimidating tensor dimensions. Efficient algorithms are developed to solve the related optimization problems. Under mild conditions, CATCH and DEEM are shown to be consistent even when the dimension of each mode grows at an exponential rate of the sample size. Numerical studies also strongly support the application of CATCH and DEEM.

  • Modeling viral rebound trajectories after analytical antiretroviral treatment interruption

    Date: 2020-11-20

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Despite the success of combined antiretroviral therapy (ART) in achieving sustained control of viral replication, the concerns about side-effects, drug-drug interactions, drug resistance and cost call for a need to identify strategies for achieving HIV eradication or an ART-free remission. Following ART withdrawal, patients’ viral load levels usually increase rapidly to a peak followed by a dip, and then stabilize at a viral load set point. Characterizing features of the viral rebound trajectories (e.g., time to viral rebound and viral set points) and identifying host, virological, and immunological factors that are predictive of these features requires addressing analytical challenges such as non-linear viral rebound trajectories, coarsened data due to the assay’s limit of quantification, and intermittent measurements of viral load values. We first introduce a parametric nonlinear mixed effects (NLME) model for the viral rebound trajectory and compare its performance to a mechanistic modeling approach. We then develop a smoothed simulated pseudo maximum likelihood method for fitting NLME models that permits flexible specification of random effects distributions. Finally, we investigate the association between the time to viral suppression after ART initiation and the time to viral rebound after ART interruption through a Cox proportional hazard regression model where both the outcome and the covariate are interval-censored observations.

  • Generalized Energy-Based Models

    Date: 2020-11-06

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    I will introduce Generalized Energy Based Models (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the “generator”). In particular, while the energy function is analogous to the GAN critic function, it is not discarded after training. GEBMs are trained by alternating between learning the energy and the base, much like a GAN. Both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. GEBMs also return state-of-the-art performance on density modelling tasks, and when using base measures with an explicit form.

  • Test-based integrative analysis of randomized trial and real-world data for treatment heterogeneity estimation

    Date: 2020-10-30

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Parallel randomized clinical trial (RCT) and real-world data (RWD) are becoming increasingly available for treatment evaluation. Given the complementary features of the RCT and RWD, we propose a test-based integrative analysis of the RCT and RWD for accurate and robust estimation of the heterogeneity of treatment effect (HTE), which lies at the heart of precision medicine. When the RWD are not subject to bias, e.g., due to unmeasured confounding, our approach combines the RCT and RWD for optimal estimation by exploiting semiparametric efficiency theory. Utilizing the design advantage of RTs, we construct a built-in test procedure to gauge the reliability of the RWD and decide whether or not to use RWD in an integrative analysis. We characterize the asymptotic distribution of the test-based integrative estimator under local alternatives, which provides a better approximation of the finite-sample behaviors of the test and estimator when the idealistic assumption required for the RWD is weakly violated. We provide a data-adaptive procedure to select the threshold of the test statistic that promises the smallest mean square error of the proposed estimator of the HTE. Lastly, we construct an adaptive confidence interval that has a good finite-sample coverage property. We apply the proposed method to characterize who can benefit from adjuvant chemotherapy in patients with stage IB non-small cell lung cancer.

  • Linear Regression and its Inference on Noisy Network-linked Data

    Date: 2020-10-23

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Linear regression on a set of observations linked by a network has been an essential tool in modeling the relationship between response and covariates with additional network data. Despite its wide range of applications in many areas, such as social sciences and health-related research, the problem has not been well-studied in statistics so far. Previous methods either lack of inference tools or rely on restrictive assumptions on social effects, and usually treat the network structure as precisely observed, which is too good to be true in many problems. We propose a linear regression model with nonparametric social effects. Our model does not assume the relational data or network structure to be accurately observed; thus, our method can be provably robust to a certain level of perturbation of the network structure. We establish a full set of computationally efficient asymptotic inference tools under a general requirement of the perturbation and then study the robustness of our method in the specific setting when the perturbation is from random network models. We discover a phase-transition phenomenon of inference validity concerning the network density when no prior knowledge about the network model is available, while also show the significant improvement achieved by knowing the network model. A by-product of our analysis is a rate-optimal concentration bound about subspace projection that may be of independent interest. We conduct extensive simulation studies to verify our theoretical observations and demonstrate the advantage of our method compared to a few benchmarks under different data-generating models. The method is then applied to adolescent network data to study the gender and racial differences in social activities.

  • Adaptive MCMC For Everyone

    Date: 2020-10-16

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Markov chain Monte Carlo (MCMC) algorithms, such as the Metropolis Algorithm and the Gibbs Sampler, are an extremely useful and popular method of approximately sampling from complicated probability distributions. Adaptive MCMC attempts to automatically modify the algorithm while it runs, to improve its performance on the fly. However, such adaptation often destroys the ergodicity properties necessary for the algorithm to be valid. In this talk, we first illustrate MCMC algorithms using simple graphical examples. We then discuss adaptive MCMC, and present examples and theorems concerning its ergodicity and efficiency.