/categories/mcgill-statistics-seminar/index.xml McGill Statistics Seminar - McGill Statistics Seminars
  • Excursions in Statistical History: Highlights

    Date: 2023-03-17

    Time: 15:30-16:30 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1104

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Over the last 20 years, the speaker has delved into the origins of ‘regression’; the development of the ’t’ and ‘Poisson’ distributions; forerunners of the ‘hazard’ function; and the statistical design and conduct of US Selective Service lotteries from 1917 onwards. This talk will recount the stories, data and simulations behind some of these, and provide some modern-day re-enactments.

  • Heteroskedastic Sparse PCA in High Dimensions

    Date: 2023-03-10

    Time: 15:30-16:30 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1104

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Principal component analysis (PCA) is one of the most commonly used techniques for dimension reduction and feature extraction. Though it has been well-studied for high-dimensional sparse PCA, little is known when the noise is heteroskedastic, which turns out to be ubiquitous in many scenarios, like biological sequencing data and information network data. We propose an iterative algorithm for sparse PCA in the presence of heteroskedastic noise, which alternatively updates the estimates of the sparse eigenvectors using the power method with adaptive thresholding in one step, and imputes the diagonal values of the sample covariance matrix to reduce the estimation bias due to heteroskedasticity in the other step. Our procedure is computationally fast and provably optimal under the generalized spiked covariance model, assuming the leading eigenvectors are sparse. A comprehensive simulation study demonstrates its robustness and effectiveness in various settings.

  • High Dimensional Logistic Regression Under Network Dependence

    Date: 2023-03-10

    Time: 14:15-15:15 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1104

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    The classical formulation of logistic regression relies on the independent sampling assumption, which is often violated when the outcomes interact through an underlying network structure, such as over a temporal/spatial domain or on a social network. This necessitates the development of models that can simultaneously handle both the network peer-effect (arising from neighborhood interactions) and the effect of (possibly) high-dimensional covariates. In this talk, I will describe a framework for incorporating such dependencies in a high-dimensional logistic regression model by introducing a quadratic interaction term, as in the Ising model, designed to capture the pairwise interactions from the underlying network. The resulting model can also be viewed as an Ising model, where the node-dependent external fields linearly encode the high-dimensional covariates. We use a penalized maximum pseudo-likelihood method for estimating the network peer-effect and the effect of the covariates (the regression coefficients), which, in addition to handling the high-dimensionality of the parameters, conveniently avoids the computational intractability of the maximum likelihood approach. Our results imply that even under network dependence it is possible to consistently estimate the model parameters at the same rate as in classical (independent) logistic regression, when the true parameter is sparse and the underlying network is not too dense. Towards the end, I will talk about the rates of consistency of our proposed estimator for various natural graph ensembles, such as bounded degree graphs, sparse Erdos-Renyi random graphs, and stochastic block models, which follow as a consequence of our general results. This is a joint work with Ziang Niu, Sagnik Halder, Bhaswar Bhattacharya and George Michailidis.

  • Epidemic Forecasting using Delayed Time Embedding

    Date: 2023-02-17

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Forecasting the future trajectory of an outbreak plays a crucial role in the mission of managing emerging infectious disease epidemics. Compartmental models, such as the Susceptible-Exposed-Infectious-Recovered (SEIR), are the most popular tools for this task. They have been used extensively to combat many infectious disease outbreaks including the current COVID-19 pandemic. One downside of these models is that they assume that the dynamics of an epidemic follow a pre-defined dynamical system which may not capture the true trajectories of an outbreak. Consequently, the users need to make several modifications throughout an epidemic to ensure their models fit well with the data. However, there is no guarantee that these modifications can also help increase the precision of forecasting. In this talk, I will introduce a new method for predicting epidemics that does not make any assumption on the underlying dynamical system. Our method combines sparse random feature expansion and delay embedding to learn the trajectory of an epidemic.

  • Efficient Label Shift Adaptation through the Lens of Semiparametric Models

    Date: 2023-02-10

    Time: 15:00-16:00 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1205

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    We study the domain adaptation problem with label shift in this work. Under the label shift context, the marginal distribution of the label varies across the training and testing datasets, while the conditional distribution of features given the label is the same. Traditional label shift adaptation methods either suffer from large estimation errors or require cumbersome post-prediction calibrations. To address these issues, we first propose a moment-matching framework for adapting the label shift based on the geometry of the influence function. Under such a framework, we propose a novel method named efficient label shift adaptation (ELSA), in which the adaptation weights can be estimated by solving linear systems. Theoretically, the ELSA estimator is root-n consistent (n is the sample size of the source data) and asymptotically normal. Empirically, we show that ELSA can achieve state-of-the-art estimation performances without post-prediction calibrations, thus, gaining computational efficiency.

  • Learning from a Biased Sample

    Date: 2023-02-03

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    The empirical risk minimization approach to data-driven decision making assumes that we can learn a decision rule from training data drawn under the same conditions as the ones we want to deploy it under. However, in a number of settings, we may be concerned that our training sample is biased, and that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. Building on concepts from distributionally robust optimization and sensitivity analysis, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions whose conditional distributions of outcomes given covariates differ from the conditional training distribution by at most a constant factor, and whose covariate distributions are absolutely continuous with respect to the covariate distribution of the training data. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a robust model using the method of sieves and propose a deep learning algorithm whose loss function captures our robustness target. We empirically validate our proposed method in simulations and a case study with the MIMIC-III dataset.

  • What is TWAS and how do we use it in integrating gene expression data

    Date: 2023-01-20

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    The transcriptome-wide association studies (TWAS) is a pioneering approach utilizing gene expression data to identify genetic basis of complex diseases. Its core component is called “genetically regulated expression (GReX)”. GReX links gene expression information with phenotype by serving as both the outcome of genotype-based expression models and the predictor for downstream association testing. Although it is popular and has been used in many high-profile projects, its mathematical nature and interpretation haven’t been rigorously verified. As such, we have first conducted power analysis using NCP-based closed forms (Cao et al, PLoS Genet 2021), based on which we realized that the common interpretation of TWAS that looks biologically sensible is actually mathematically questionable. Following this insight, by real data analysis and simulations, we demonstrated that current linear models of GReX inadvertently combine two separable steps of machine learning - feature selection and aggregation - which can be independently replaced to improve overall power (Cao et al, Genetics 2021). Based on this new interpretation, we have developed novel protocols disentangling feature selections and aggregations, leading to improved power and novel biological discoveries (Cao et al, BiB 2021; Genetics 2021). To promote this new understanding, we moved forward to develop two statistical tools utilizing gene expressions in identifying genetic basis of gene-gene interactions (Kossinna et al, in preparation) and low-effect genetic variants (Li et al, in review), respectively. Looking forward, our mathematical characterization of TWAS opens a door for a new way to integrate gene expressions in genetic studies towards the realization of precision medicine.

  • To split or not to split that is the question: From cross validation to debiased machine learning

    Date: 2023-01-13

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Data splitting is an ubiquitous method in statistics with examples ranging from cross validation to cross-fitting. However, despite its prevalence, theoretical guidance regarding its use is still lacking. In this talk we will explore two examples and establish an asymptotic theory for it. In the first part of this talk, we study the cross-validation method, a ubiquitous method for risk estimation, and establish its asymptotic properties for a large class of models and with an arbitrary number of folds. Under stability conditions, we establish a central limit theorem and Berry-Esseen bounds for the cross-validated risk, which enable us to compute asymptotically accurate confidence intervals. Using our results, we study the statistical speed-up offered by cross validation compared to a train-test split procedure. We reveal some surprising behavior of the cross-validated risk and establish the statistically optimal choice for the number of folds. In the second part of this talk, we study the role of cross fitting in the generalized method of moments with moments that also depend on some auxiliary functions. Recent lines of work show how one can use generic machine learning estimators for these auxiliary problems, while maintaining asymptotic normality and root-n consistency of the target parameter of interest. The literature typically requires that these auxiliary problems are fitted on a separate sample or in a cross-fitting manner. We show that when these auxiliary estimation algorithms satisfy natural leave-one-out stability properties, then sample splitting is not required. This allows for sample re-use, which can be beneficial in moderately sized sample regimes.

  • Optimal One-pass Nonparametric Estimation Under Memory Constraint

    Date: 2022-11-18

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    For nonparametric regression in the streaming setting, where data constantly flow in and require real-time analysis, a main challenge is that data are cleared from the computer system once processed due to limited computer memory and storage. We tackle the challenge by proposing a novel one-pass estimator based on penalized orthogonal basis expansions and developing a general framework to study the interplay between statistical efficiency and memory consumption of estimators. We show that, the proposed estimator is statistically optimal under memory constraint, and has asymptotically minimal memory footprints among all one-pass estimators of the same estimation quality. Numerical studies demonstrate that the proposed one-pass estimator is nearly as efficient as its non-streaming counterpart that has access to all historical data.

  • Automated Inference on Sharp Bounds

    Date: 2022-11-11

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Many causal parameters involving the joint distribution of potential outcomes in treated and control states cannot be point-identified, but only be bounded from above and below. The bounds can be further tightened by conditioning on pre-treatment covariates, and the sharp version of the bounds corresponds to using a full covariate vector. This paper gives a method for estimation and inference on sharp bounds determined by a linear system of under-identified equalities (e.g., as in Heckman et al (ReSTUD, 1997)). In the sharp bounds’ case, the RHS of this system involves a nuisance function of (many) covariates (e.g., the conditional probability of employment in treated or control state). Combining Neyman-orthogonality and sample splitting, I provide an asymptotically Gaussian estimator of sharp bound that does not require solving the linear system in closed form. I demonstrate the method in an empirical application to Connecticut’s Jobs First welfare reform experiment.