McGill Statistics Seminar - McGill Statistics Seminars
  • Learning Causal Structures via Continuous Optimization

    Date: 2021-03-26

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    There has been a recent surge of interest in the machine learning community in developing causal models that handle the effect of interventions in a system. In this talk, I will consider the problem of learning (estimating) a causal graphical model from data. The search over possible directed acyclic graphs modeling the causal structure is inherently combinatorial, but I’ll describe our recent work, which uses gradient-based continuous optimization to learn both the parameters of the distribution and the causal graph jointly, and which can be combined naturally with flexible parametric families that use neural networks.
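
    The abstract does not say which continuous characterization of acyclicity the speaker uses. As a hedged illustration of the general idea only, the sketch below uses one well-known choice from the literature: the polynomial penalty tr((I + (W∘W)/d)^d) − d, which is zero exactly when the weighted adjacency matrix W describes a directed acyclic graph. The function name `acyclicity_penalty` and the example matrices are assumptions made for this sketch.

```python
import numpy as np

def acyclicity_penalty(W):
    """Smooth penalty that is zero iff the weighted graph W has no directed cycle.

    Illustrative choice, not necessarily the speaker's:
    h(W) = tr((I + (W * W) / d)^d) - d, with * the elementwise product.
    """
    d = W.shape[0]
    A = W * W                      # elementwise square: nonnegative edge strengths
    M = np.eye(d) + A / d
    return float(np.trace(np.linalg.matrix_power(M, d)) - d)

# Hypothetical 3-node examples: a chain 0 -> 1 -> 2, then the same chain with
# an extra edge 2 -> 0 that closes a cycle.
dag = np.array([[0.0, 1.5, 0.0],
                [0.0, 0.0, -2.0],
                [0.0, 0.0, 0.0]])
cyclic = dag.copy()
cyclic[2, 0] = 0.7

print(acyclicity_penalty(dag), acyclicity_penalty(cyclic))
```

    Because the penalty is differentiable in W, it can be added to a data-fit loss and driven to zero with a gradient-based optimizer, which replaces the combinatorial search over graphs with a continuous one.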

  • Measuring timeliness of annual reports filing by jump additive models

    Date: 2021-03-19

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    Foreign public issuers (FPIs) are required by the Securities and Exchange Commission (SEC) to file Form 20-F as comprehensive annual reports. In an effort to increase the usefulness of 20-Fs, the SEC recently enacted a regulation to accelerate the deadline of 20-F filing from six months to four months after the fiscal year-end. The rationale is that the shortened reporting lag would improve the informational relevance of 20-Fs. In this work we propose a jump additive model to evaluate the SEC’s rationale by investigating the relationship between the timeliness of 20-F filing and its decision usefulness using market data. The proposed model extends the conventional additive models to allow possible discontinuities in the regression functions. We suggest a two-step jump-preserving estimation procedure and show that it is statistically consistent. By applying the procedure to the 20-F study, we find a moderate positive association between the magnitude of the market reaction and the filing timeliness when the acceleration is less than 17 days. We also find that the market considers the filings significantly more informative when the acceleration is more than 18 days, and that this reaction tapers off when the acceleration exceeds 40 days.

  • CoinPress: Practical Private Point Estimation and Confidence Intervals

    Date: 2021-02-26

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    We consider point estimation and generation of confidence intervals under the constraint of differential privacy. We provide a simple and practical framework for these tasks in relatively general settings. Our investigation addresses a novel challenge that arises in the differentially private setting, which involves the cost of weak a priori bounds on the parameters of interest. This framework is applied to the problems of Gaussian mean and covariance estimation. Despite the simplicity of our method, we are able to achieve minimax near-optimal rates for these problems. Empirical evaluations, on the problems of mean estimation, covariance estimation, and principal component analysis, demonstrate significant improvements in comparison to previous work.
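
    To make the "cost of weak a priori bounds" concrete, here is a simplified one-dimensional sketch of iteratively tightening a loose bound on a mean under zero-concentrated differential privacy (zCDP). It is not the authors' algorithm or constants: the function `private_mean_1d`, the 4-standard-deviation clipping margin, the radius update rule, and the assumption that the data standard deviation `sigma_data` is known are all choices made for illustration.

```python
import numpy as np

def private_mean_1d(x, center, radius, rho_per_step, steps, rng, sigma_data=1.0):
    """Iteratively refine a private mean estimate (illustrative sketch).

    Assumes the true mean lies in [center - radius, center + radius] and that
    sigma_data is known. Each step spends rho_per_step of zCDP budget, so the
    total is steps * rho_per_step by composition.
    """
    n = len(x)
    for _ in range(steps):
        # Clip to the current interval, widened so typical points are untouched.
        clip_r = radius + 4.0 * sigma_data
        clipped = np.clip(x, center - clip_r, center + clip_r)
        # Changing one record moves the clipped mean by at most 2 * clip_r / n.
        sensitivity = 2.0 * clip_r / n
        # Gaussian mechanism: sd = sensitivity / sqrt(2 * rho) gives rho-zCDP.
        noise_sd = sensitivity / np.sqrt(2.0 * rho_per_step)
        center = clipped.mean() + rng.normal(0.0, noise_sd)
        # New radius depends only on n and noise_sd, not on the data itself.
        radius = 4.0 * (sigma_data / np.sqrt(n) + noise_sd)
    return center, radius

rng = np.random.default_rng(0)
data = rng.normal(50.0, 1.0, size=2000)   # synthetic data, true mean 50
estimate, final_radius = private_mean_1d(
    data, center=0.0, radius=100.0, rho_per_step=0.1, steps=4, rng=rng
)
print(estimate, final_radius)
```

    With a single step, the noise scale is proportional to the initial radius of 100; spreading the budget over several steps lets the early, noisy estimates shrink the interval so that later steps add far less noise.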

  • Joint integrative analysis of multiple data sources with correlated vector outcomes

    Date: 2021-02-19

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    We consider the joint estimation of regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. The primary goal of this joint integrative analysis is to estimate covariate effects on all vector outcomes through a marginal regression model in a statistically and computationally efficient way. We present a general class of distributed estimators that can be implemented in a parallelized computational scheme. Modelling, computational and theoretical challenges are overcome by first fitting a local model within each data source and then combining local results while accounting for correlation between data sources. This approach to distributed estimation and inference is formulated using Hansen’s generalized method of moments but implemented via an asymptotically equivalent and communication-efficient meta-estimator. We show both theoretically and numerically that the proposed method yields efficiency improvements and is computationally fast. We illustrate the proposed methodology with the joint integrative analysis of metabolic pathways in a large multi-cohort study.

  • An Adaptive Algorithm to Multi-armed Bandit Problem with High-dimensional Covariates

    Date: 2021-02-05

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    This work studies an important sequential decision-making problem known as the multi-armed bandit problem with covariates. Under a linear bandit framework with high-dimensional covariates, we propose a general arm allocation algorithm that integrates both arm elimination and randomized assignment strategies. By employing a class of high-dimensional regression methods for coefficient estimation, the proposed algorithm is shown to have near-optimal finite-time regret performance under a new study scope that requires neither a margin condition nor a reward gap condition for competitive arms. Based on synergistically verified benefit of the margin, our algorithm exhibits an adaptive performance that automatically adapts to the margin and gap conditions, and attains the optimal regret rates under both study scopes, without or with the margin, up to a logarithmic factor. The proposed algorithm also simultaneously generates useful coefficient estimation output for competitive arms and is shown to achieve both estimation consistency and variable selection consistency. Promising empirical performance is demonstrated through two real data evaluation examples in drug dose assignment and news article recommendation.

  • Large-scale Machine Learning Algorithms for Biomedical Data Science

    Date: 2021-01-15

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    During the last decade, hundreds of machine learning methods have been developed for disease outcome prediction based on high-throughput genomics data. However, the quality of the input genomics features and the output clinical variables has been ignored in these algorithms. In this talk, I will introduce two studies that develop methods to learn more accurate molecular signatures and drug response values for cancer research. These studies are supported by NSF, NIH, and Moffitt Cancer Center.

  • Quasi-random sampling for multivariate distributions via generative neural networks

    Date: 2020-12-04

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    A novel approach based on generative neural networks is introduced for constructing quasi-random number generators for multivariate models with any underlying copula in order to estimate expectations with variance reduction. So far, quasi-random number generators for multivariate distributions required a careful design, exploiting specific properties (such as conditional distributions) of the implied copula or the underlying quasi-Monte Carlo point set, and were only tractable for a small number of models. Utilizing specific generative neural networks allows one to construct quasi-random number generators for a much larger variety of multivariate distributions without such restrictions. Once trained with a pseudo-random sample, these neural networks only require a multivariate standard uniform randomized quasi-Monte Carlo point set as input and are thus fast in estimating expectations under dependence with variance reduction. Reproducible numerical examples are considered to demonstrate the approach. Emphasis is put on ideas rather than mathematical proofs.
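
    The variance-reduction claim can be illustrated without a neural network. In the hedged sketch below, a fixed map `generator` stands in for the trained network (an assumption for illustration only), and the same map is fed either pseudo-random uniforms or a randomly shifted Halton point set. All function names are hypothetical, and the Halton sequence with a random shift is just one simple randomized quasi-Monte Carlo construction.

```python
import numpy as np

def van_der_corput(n, base):
    """First n points of the van der Corput sequence in the given base."""
    seq = np.zeros(n)
    for i in range(n):
        k, f, x = i + 1, 1.0, 0.0
        while k > 0:
            f /= base
            x += f * (k % base)
            k //= base
        seq[i] = x
    return seq

def halton_2d(n):
    """Two-dimensional Halton point set (bases 2 and 3)."""
    return np.column_stack([van_der_corput(n, 2), van_der_corput(n, 3)])

def generator(u):
    """Stand-in for a trained generative network: uniforms -> dependent pair."""
    x1 = u[:, 0]
    x2 = 0.5 * u[:, 0] + 0.5 * u[:, 1]
    return np.column_stack([x1, x2])

def estimate(u):
    """Sample-average estimate of E[X1 * X2] from input uniforms u."""
    x = generator(u)
    return float(np.mean(x[:, 0] * x[:, 1]))

def replicate(n, reps, rng, use_qmc):
    """Repeat the estimate with fresh randomness; return all estimates."""
    base_points = halton_2d(n)
    out = np.empty(reps)
    for r in range(reps):
        if use_qmc:
            # Random shift modulo 1 keeps the estimator unbiased.
            u = (base_points + rng.random(2)) % 1.0
        else:
            u = rng.random((n, 2))
        out[r] = estimate(u)
    return out

rng = np.random.default_rng(0)
mc = replicate(512, 50, rng, use_qmc=False)
rqmc = replicate(512, 50, rng, use_qmc=True)
print(mc.var(), rqmc.var())
```

    For this toy map the exact value is E[X1 X2] = 7/24, and both estimators target it; the point of the comparison is the spread of the replicated estimates, which is typically much smaller for the shifted Halton input.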

  • Probabilistic Approaches to Machine Learning on Tensor Data

    Date: 2020-11-27

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    In contemporary scientific research, it is often of great interest to predict a categorical response based on a high-dimensional tensor (i.e. multi-dimensional array). Motivated by applications in science and engineering, we propose two probabilistic methods for machine learning on tensor data in the supervised and the unsupervised context, respectively. For supervised problems, we develop a comprehensive discriminant analysis model, called the CATCH model. The CATCH model integrates the information from the tensor and additional covariates to predict the categorical outcome with high accuracy. We further consider unsupervised problems, where no categorical response is available even on the training data. A doubly-enhanced EM (DEEM) algorithm is proposed for model-based tensor clustering, in which both the E-step and the M-step are carefully tailored for tensor data. CATCH and DEEM are developed under explicit statistical models with clear interpretations. They aggressively take advantage of the tensor structure and sparsity to tackle the new computational and statistical challenges arising from the intimidating tensor dimensions. Efficient algorithms are developed to solve the related optimization problems. Under mild conditions, CATCH and DEEM are shown to be consistent even when the dimension of each mode grows at an exponential rate of the sample size. Numerical studies also strongly support the application of CATCH and DEEM.

  • Modeling viral rebound trajectories after analytical antiretroviral treatment interruption

    Date: 2020-11-20

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Despite the success of combined antiretroviral therapy (ART) in achieving sustained control of viral replication, concerns about side effects, drug-drug interactions, drug resistance and cost call for strategies to achieve HIV eradication or an ART-free remission. Following ART withdrawal, patients’ viral load levels usually increase rapidly to a peak followed by a dip, and then stabilize at a viral load set point. Characterizing features of the viral rebound trajectories (e.g., time to viral rebound and viral set points) and identifying host, virological, and immunological factors that are predictive of these features requires addressing analytical challenges such as non-linear viral rebound trajectories, coarsened data due to the assay’s limit of quantification, and intermittent measurements of viral load values. We first introduce a parametric nonlinear mixed effects (NLME) model for the viral rebound trajectory and compare its performance to a mechanistic modeling approach. We then develop a smoothed simulated pseudo maximum likelihood method for fitting NLME models that permits flexible specification of random effects distributions. Finally, we investigate the association between the time to viral suppression after ART initiation and the time to viral rebound after ART interruption through a Cox proportional hazards regression model where both the outcome and the covariate are interval-censored observations.

  • Generalized Energy-Based Models

    Date: 2020-11-06

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    I will introduce Generalized Energy Based Models (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the “generator”). In particular, while the energy function is analogous to the GAN critic function, it is not discarded after training. GEBMs are trained by alternating between learning the energy and the base, much like a GAN. Both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. GEBMs also return state-of-the-art performance on density modelling tasks, and when using base measures with an explicit form.
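
    As a toy illustration of how an energy function can refine the mass of a base distribution, the sketch below reweights base samples by exp(−E(x)) and resamples them. This swaps in self-normalized importance resampling for the latent-space MCMC mentioned in the abstract, and it assumes the combined model has density proportional to exp(−E(x)) times the base density. The standard normal base, the quadratic `energy`, and the function `sample_gebm` are all hypothetical stand-ins for trained components.

```python
import numpy as np

def energy(x):
    """Hypothetical energy; low energy marks regions the model should favour."""
    return 0.5 * (x - 1.0) ** 2

def sample_gebm(n_base, n_out, rng):
    """Draw approximate samples from exp(-E(x)) * base(x) by resampling.

    The base is a standard normal here, standing in for a trained generator.
    """
    base = rng.standard_normal(n_base)
    log_w = -energy(base)
    w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    w /= w.sum()
    idx = rng.choice(n_base, size=n_out, p=w)
    return base[idx]

rng = np.random.default_rng(0)
samples = sample_gebm(200_000, 50_000, rng)
print(samples.mean(), samples.var())
```

    With these choices the target is proportional to exp(−x²/2 − (x − 1)²/2), a normal distribution with mean 0.5 and variance 0.5, so the printed moments give a quick check that the energy has shifted and sharpened the base.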