McGill Statistics Seminar - McGill Statistics Seminars
  • Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control

    Date: 2022-04-01

    Time: 15:35-16:35 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    We introduce Learn then Test, a framework for calibrating machine learning models so that their predictions satisfy explicit, finite-sample statistical guarantees regardless of the underlying model and (unknown) data-generating distribution. The framework addresses, among other examples, false discovery rate control in multi-label classification, intersection-over-union control in instance segmentation, and the simultaneous control of the type-1 error of outlier detection and confidence set coverage in classification or regression. To accomplish this, we solve a key technical challenge: the control of arbitrary risks that are not necessarily monotonic. Our main insight is to reframe the risk-control problem as multiple hypothesis testing, enabling techniques and mathematical arguments different from those in the previous literature. We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
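
As a rough illustration of the multiple-testing reframing described above (a minimal sketch, not the speakers' implementation): for each candidate parameter, test the null hypothesis that its true risk exceeds the target level `alpha`, then keep only the parameters whose nulls are rejected after a multiplicity correction. The sketch assumes losses bounded in [0, 1], a Hoeffding-based p-value, and a Bonferroni correction; the function names and the numbers are hypothetical.

```python
import math


def hoeffding_p_value(empirical_risk, n, alpha):
    """P-value for the null 'true risk > alpha', assuming losses in [0, 1]."""
    gap = max(alpha - empirical_risk, 0.0)
    return math.exp(-2.0 * n * gap ** 2)


def learn_then_test(empirical_risks, n, alpha, delta):
    """Indices of candidate parameters whose nulls are rejected.

    Bonferroni keeps the chance of returning any parameter whose true
    risk exceeds alpha at most delta (assumption: i.i.d. calibration data).
    """
    threshold = delta / len(empirical_risks)
    return [
        j
        for j, risk in enumerate(empirical_risks)
        if hoeffding_p_value(risk, n, alpha) <= threshold
    ]


# Hypothetical empirical risks on n = 1000 calibration points.
# The risk is not monotone in the parameter index.
risks = [0.05, 0.30, 0.02, 0.12]
selected = learn_then_test(risks, n=1000, alpha=0.1, delta=0.1)
print(selected)
```

Because each parameter is tested separately, nothing in this procedure requires the risk to be monotone in the parameter.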

  • Distribution-free inference for regression: discrete, continuous, and in between

    Date: 2022-03-25

    Time: 15:35-16:35 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    In data analysis problems where we are not able to rely on distributional assumptions, what types of inference guarantees can still be obtained? Many popular methods, such as holdout methods, cross-validation methods, and conformal prediction, are able to provide distribution-free guarantees for predictive inference, but the problem of providing inference for the underlying regression function (for example, inference on the conditional mean E[Y|X]) is more challenging. If X takes only a small number of possible values, then inference on E[Y|X] is trivial to achieve. At the other extreme, if the features X are continuously distributed, we show that any confidence interval for E[Y|X] must have non-vanishing width, even as the sample size tends to infinity; this is true regardless of smoothness properties or other desirable features of the underlying distribution. In between these two extremes, we find several distinct regimes. In particular, it is possible for distribution-free confidence intervals to have vanishing width if and only if the effective support size of the distribution of X is smaller than the square of the sample size.
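
For context, the distribution-free predictive guarantee mentioned above can be sketched with split conformal prediction. This is a minimal sketch of the standard split conformal recipe with absolute-residual scores, not material from the talk; it covers a new response Y, not the conditional mean E[Y|X], and the function names are hypothetical.

```python
import math


def conformal_quantile(calibration_residuals, alpha):
    """Finite-sample corrected quantile of absolute calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual; returns
    infinity when the calibration set is too small for the requested level.
    """
    n = len(calibration_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_residuals)[k - 1]


def prediction_interval(point_prediction, q):
    """Symmetric interval around a point prediction from any fitted model."""
    return (point_prediction - q, point_prediction + q)


# Hypothetical absolute residuals |y - f(x)| on a held-out calibration set.
residuals = [0.5, 1.5, 1.0, 3.0, 2.0, 2.5, 3.5]
q = conformal_quantile(residuals, alpha=0.25)
print(prediction_interval(10.0, q))
```

Under exchangeability of the calibration and test points, the resulting interval contains the new response with probability at least 1 - alpha, whatever model produced the point prediction.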

  • New Approaches for Inference on Optimal Treatment Regimes

    Date: 2022-03-11

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Finding the optimal treatment regime (or a series of sequential treatment regimes) based on individual characteristics has important applications in precision medicine. We propose two new approaches to quantify uncertainty in optimal treatment regime estimation. First, we consider inference in the model-free setting, which does not require specifying an outcome regression model. Existing model-free estimators for optimal treatment regimes are usually not suitable for the purpose of inference, because they either have nonstandard asymptotic distributions or do not necessarily guarantee consistent estimation of the parameter indexing the Bayes rule due to the use of surrogate loss. We study a smoothed robust estimator that directly targets the parameter corresponding to the Bayes decision rule for optimal treatment regimes estimation. We verify that a resampling procedure provides asymptotically accurate inference for both the parameter indexing the optimal treatment regime and the optimal value function. Next, we consider the high-dimensional setting and propose a semiparametric model-assisted approach for simultaneous inference. Simulation results and real data examples are used for illustration.

  • Integration of multi-omics data for the discovery of novel regulators that modulate biological processes

    Date: 2022-02-11

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    The cellular states in various biological processes such as cell differentiation, disease progression, and treatment response are often enormously complex and thus hard to profile with unimodal profiling (e.g., transcriptome). Although unimodal measurements have brought success in a wide variety of studies, incomplete (and often misleading) unimodal cellular profiling could lead to biased and inaccurate conclusions. With the development of biotechnologies, the availability of multi-omics data (bulk or single-cell) is ever-increasing. The rapidly accumulating multi-omics data offer unprecedented opportunities to accurately decode the cellular states in biological processes and could thus yield a deep understanding of how cellular states change, which is crucial for finding biomarkers and therapeutic intervention strategies. In this talk, we will discuss a few multimodal methods that we developed to integrate multi-omics data for the discovery of novel regulators for multiple biological processes. Many of the novel predictions from the multimodal methods were experimentally validated and have brought new understanding of the underlying mechanisms of several diseases. I will also discuss how a potential novel COVID-19 drug was discovered from such a multi-omics data integration analysis.

  • Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process

    Date: 2022-02-04

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    In this talk, we consider constructing a confidence interval for a target policy’s value offline based on pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. We show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy’s value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provides rigorous uncertainty quantification.

  • Change-point analysis for complex data structures

    Date: 2022-01-21

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Change-point analysis is more than sixty years old. Over this long period, it has been an important subject of interest in many scientific disciplines such as finance and econometrics, bioinformatics and genomics, climatology, engineering, and technology.

    In this talk, I will provide a general overview of the topic alongside some historical notes. I will then review the most recent and transformative advancements on the subject. Finally, I will discuss the change-point methodologies that my research team has developed over the past several years, covering various complex data structures.

  • Prediction of Bundled Insurance Risks with Dependence-aware Prediction using Pair Copula Construction

    Date: 2021-11-19

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    We propose a dependence-aware predictive modeling framework for multivariate risks stemming from an insurance contract with bundling features – an important type of policy increasingly offered by major insurance companies. The bundling feature naturally leads to longitudinal measurements of multiple insurance risks. We build a novel predictive model that actively exploits the dependence among the evolution of multivariate repeated risk measurements. Specifically, the longitudinal measurement of each individual risk is first modeled using pair copula construction with a D-vine structure, and the multiple D-vines are then integrated by a flexible copula. While our analysis mainly focuses on the claim count as the measurement of insurance risk, the proposed model provides a unified modeling framework that can accommodate different scales of measurements, including continuous, discrete, and mixed observations. A computationally efficient sequential method is proposed for model estimation and inference, and its performance is investigated both theoretically and via simulation studies. In the application, we examine multivariate bundled risks in multi-peril property insurance using proprietary data obtained from a commercial property insurance provider. The proposed predictive model is found to provide improved decision making for several key insurance operations, including risk segmentation and risk management. In the underwriting operation, we show that the experience rate priced by the proposed model leads to a 9% lift in the insurer’s profit. In the reinsurance operation, we show that the insurer underestimates the risk of the retained insurance portfolio by 10% when ignoring the dependence among bundled insurance risks.

  • Variational Bayes for high-dimensional linear regression with sparse priors

    Date: 2021-11-12

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    A core problem in Bayesian statistics is approximating difficult-to-compute posterior distributions. In variational Bayes (VB), a method from machine learning, one approximates the posterior through optimization, which is typically faster than Markov chain Monte Carlo. We study a mean-field (i.e., factorizable) VB approximation to Bayesian model selection priors, including the popular spike-and-slab prior, in sparse high-dimensional linear regression. We establish convergence rates for this VB approach, studying conditions under which it provides good estimation. We also discuss some computational issues and study the empirical performance of the algorithm.
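
To make the mean-field idea concrete, one common way to write such an approximation for a spike-and-slab prior is sketched below. The notation is an assumption introduced here for illustration (regression coefficients $\theta \in \mathbb{R}^p$, data $Y$, posterior $\Pi(\cdot \mid Y)$, variational parameters $\mu_i$, $\sigma_i$, $\gamma_i$) and may differ from the notation used in the talk.

```latex
% Mean-field family: each coordinate is, independently, either
% Gaussian (included) or exactly zero (excluded).
\mathcal{Q} = \Big\{ Q_{\mu,\sigma,\gamma}
  = \bigotimes_{i=1}^{p} \big[ \gamma_i \, \mathcal{N}(\mu_i, \sigma_i^2)
  + (1 - \gamma_i) \, \delta_0 \big] :
  \mu_i \in \mathbb{R},\ \sigma_i > 0,\ \gamma_i \in [0, 1] \Big\}

% VB approximation: the member of the family closest to the posterior
% in Kullback-Leibler divergence.
\tilde{Q} = \operatorname*{arg\,min}_{Q \in \mathcal{Q}}
  \mathrm{KL}\big( Q \,\|\, \Pi(\cdot \mid Y) \big)
```

Here $\gamma_i$ plays the role of an approximate inclusion probability for the $i$-th coefficient, and the optimization over $(\mu, \sigma, \gamma)$ replaces sampling from the full posterior.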

  • Model-assisted analyses of cluster-randomized experiments

    Date: 2021-10-22

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Cluster-randomized experiments are widely used due to their logistical convenience and policy relevance. To analyze them properly, we must address the fact that the treatment is assigned at the cluster level instead of the individual level. Standard analytic strategies are regressions based on individual data, cluster averages, and cluster totals, which differ when the cluster sizes vary. These methods are often motivated by models with strong and unverifiable assumptions, and the choice among them can be subjective. Without any outcome modeling assumption, we evaluate these regression estimators and the associated robust standard errors from a design-based perspective where only the treatment assignment itself is random and controlled by the experimenter. We demonstrate that regression based on cluster averages targets a weighted average treatment effect, regression based on individual data is suboptimal in terms of efficiency, and regression based on cluster totals is consistent and more efficient with a large number of clusters. We highlight the critical role of covariates in improving estimation efficiency, and illustrate the efficiency gain via both simulation studies and data analysis. Moreover, we show that the robust standard errors are convenient approximations to the true asymptotic standard errors under the design-based perspective. Our theory holds even when the outcome models are misspecified, so it is model-assisted rather than model-based. We also extend the theory to a wider class of weighted average treatment effects.
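
The three regression strategies named above can be compared on a toy example. The sketch below is an illustration under stated assumptions, not the speakers' code: it uses no covariates, so each regression on the treatment indicator reduces to a difference in means, and it rescales cluster totals by (number of clusters) / (number of individuals), which is an assumed form of the totals-based estimator. All names and data are hypothetical.

```python
def difference_in_means(values, treated):
    """Mean of treated units minus mean of control units."""
    t = [v for v, z in zip(values, treated) if z]
    c = [v for v, z in zip(values, treated) if not z]
    return sum(t) / len(t) - sum(c) / len(c)


def cluster_estimators(clusters, treated):
    """Compare three estimators for a cluster-randomized experiment.

    clusters: list of lists of individual outcomes, one list per cluster.
    treated:  list of booleans, one treatment indicator per cluster.
    """
    # Individual data: every person inherits the cluster's treatment.
    individual_values = [y for cluster in clusters for y in cluster]
    individual_treated = [z for cluster, z in zip(clusters, treated) for _ in cluster]

    # Cluster averages: one mean outcome per cluster.
    averages = [sum(cluster) / len(cluster) for cluster in clusters]

    # Cluster totals, rescaled so they are on the per-individual scale.
    scale = len(clusters) / len(individual_values)
    scaled_totals = [scale * sum(cluster) for cluster in clusters]

    return {
        "individual": difference_in_means(individual_values, individual_treated),
        "averages": difference_in_means(averages, treated),
        "totals": difference_in_means(scaled_totals, treated),
    }


# Hypothetical experiment: four clusters of unequal size, two treated.
clusters = [[4], [2, 2], [1, 1], [3, 3, 3]]
treated = [True, True, False, False]
print(cluster_estimators(clusters, treated))
```

With equal cluster sizes the three numbers coincide; with the unequal sizes used here they differ, which is the situation the abstract addresses.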

  • Imbalanced learning using actuarial modified loss function in tree-based models

    Date: 2021-10-08

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Tree-based models have gained momentum in insurance claim loss modeling; however, the point mass at zero and the heavy tail of the insurance loss distribution make it challenging to apply conventional methods directly to claim loss modeling. With a simple illustrative dataset, we first demonstrate how the traditional tree-based algorithm’s splitting function fails to cope with a large proportion of data with zero responses. To address the imbalance issue presented in such loss modeling, this paper aims to modify the traditional splitting function of Classification and Regression Tree (CART). In particular, we propose two novel actuarial modified loss functions, namely, the weighted sum of squared error and the sum of squared Canberra error. These modified loss functions impose a significant penalty on grouping observations of non-zero response with those of zero response in the splitting procedure, and thus significantly enhance their separation. Finally, we examine and compare the predictive performance of such actuarial modified tree-based models to the traditional model on synthetic datasets that imitate insurance loss. The results show that such modification leads to substantially different tree structures and improved prediction performance.