  • Three Myths About Causal Mediation

    Date: 2023-09-15

    Time: 15:30-16:30 (Montreal time)

    Location: Burnside Hall 1104


    Meeting ID: 864 0479 8712

    Passcode: None


    Causal mediation techniques are a means for identifying the degree to which a cause influences its effect along particular causal paths. For example, in a model where a cause influences its effect both indirectly via a mediator and directly via factors not included in the model, mediation techniques enable one to measure both direct and indirect effects. Although mediation techniques are widely employed, they are often misunderstood. This is in part due to the long-term influence of Baron and Kenny’s (1986) treatment of mediation, which applies only to linear models without interaction, and which leads one to develop intuitions about direct and indirect effects that do not generalize to non-parametric causal models. In my talk, I identify and reject three persistent myths about mediation. I argue that such methods: 1. Should not be understood as decomposing the total effect into additive components corresponding to the contributions of the paths; 2. Are not a means for eliminating latent heterogeneity; and 3. Do not require one to appeal to causal concepts other than the counterfactual causal ones built into structural causal models. These points are crucial for understanding mediation effects in any context in which they are studied, and have particular applications for studies of fairness and discrimination, in which such effects play an increasingly central role (Plečko and Bareinboim, 2022).
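
    The first point can be illustrated with a toy model that has a pure interaction between treatment and mediator. The sketch below is not taken from the talk: the structural equations M = X and Y = X * M are hypothetical, and the effects are computed with the standard natural direct and natural indirect effect definitions for a change of X from 0 to 1.

```python
# Toy structural causal model (hypothetical, for illustration only):
#   M = X        the mediator copies the binary treatment
#   Y = X * M    the outcome is a pure treatment-mediator interaction

def mediator(x: int) -> int:
    return x

def outcome(x: int, m: int) -> int:
    return x * m

def natural_effects() -> dict:
    """Total, natural direct, and natural indirect effects of X: 0 -> 1."""
    m0, m1 = mediator(0), mediator(1)
    baseline = outcome(0, m0)
    total = outcome(1, m1) - baseline   # change X and let M respond
    nde = outcome(1, m0) - baseline     # change X, hold M at its X=0 value
    nie = outcome(0, m1) - baseline     # hold X at 0, move M to its X=1 value
    return {"TE": total, "NDE": nde, "NIE": nie}

effects = natural_effects()
print(effects)  # {'TE': 1, 'NDE': 0, 'NIE': 0}
```

    In this model the total effect is 1 while both path-specific effects are 0, so the two do not add up to the total.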

  • Empirical Bayes Control of the False Discovery Exceedance

    Date: 2023-08-17

    Time: 15:30-16:30 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1104


    Meeting ID: 896 2334 4755

    Passcode: 287381


    In sparse large-scale testing problems where the false discovery proportion (FDP) is highly variable, the false discovery exceedance (FDX) provides a valuable alternative to the widely used false discovery rate (FDR). We develop an empirical Bayes approach to controlling the FDX. We show that for independent hypotheses from a two-group model and dependent hypotheses from a Gaussian model fulfilling the exchangeability condition, an oracle decision rule based on ranking and thresholding the local false discovery rate (lfdr) is optimal in the sense that the power is maximized subject to an FDX constraint. We propose a data-driven FDX procedure that emulates the oracle via carefully designed computational shortcuts. We investigate the empirical performance of the proposed method using simulations and illustrate the merits of FDX control through an application to identifying abnormal stock trading strategies.
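
    For reference, the error measures named above are usually defined as follows. The notation is introduced here and is not taken from the abstract: V is the number of false rejections, R the total number of rejections, \gamma a tolerance on the FDP, \alpha the target level, and \pi_0, f_0, f_1 the null proportion, null density, and alternative density of a two-group model.

```latex
\mathrm{FDP} = \frac{V}{\max(R, 1)}, \qquad
\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}], \qquad
\mathrm{FDX}_{\gamma} = \mathbb{P}(\mathrm{FDP} > \gamma) \le \alpha,
\qquad
\mathrm{lfdr}(z) = \frac{\pi_0 f_0(z)}{\pi_0 f_0(z) + (1 - \pi_0) f_1(z)}.
```

    The FDR constrains the FDP only on average, whereas the FDX constrains the probability that the FDP exceeds \gamma, which is why it is the more informative target when the FDP is highly variable.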

  • Residual-based estimation of parametric copulas under regression

    Date: 2023-08-14

    Time: 15:30-16:30 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1104


    Meeting ID: 834 3668 6293

    Passcode: 12345


    We study a multivariate response regression model where each coordinate is described by a location-scale regression, and where the dependence structure of the “noise” terms in the regression is described by a parametric copula. Our goal is to estimate the associated Euclidean copula parameter given a sample of the response and the covariate. In the absence of the copula sample, the oracle ranks in the usual pseudo-likelihood estimation procedure are no longer computable. Instead, we base our estimation on the residual ranks calculated from some preliminary estimators of the regression functions. We show that the residual-based estimators are asymptotically equivalent to their oracle counterparts, even when the dimension of the covariate in the regression is moderately diverging. Partially to serve this objective, we also study the weighted convergence of the residual empirical processes.

  • Confidence sets for Causal Discovery

    Date: 2023-03-24

    Time: 15:30-16:30 (Montreal time)

    On Zoom only


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Causal discovery procedures are popular methods for discovering causal structure across the physical, biological, and social sciences. However, most procedures for causal discovery only output a single estimated causal model or single equivalence class of models. We propose a procedure for quantifying uncertainty in causal discovery. Specifically, we consider linear structural equation models with non-Gaussian errors and propose a procedure which returns a confidence set of causal orderings which are not ruled out by the data. We show that asymptotically, the true causal ordering will be contained in the returned set with some user-specified probability.

  • Excursions in Statistical History: Highlights

    Date: 2023-03-17

    Time: 15:30-16:30 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1104


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Over the last 20 years, the speaker has delved into the origins of ‘regression’; the development of the ‘t’ and ‘Poisson’ distributions; forerunners of the ‘hazard’ function; and the statistical design and conduct of US Selective Service lotteries from 1917 onwards. This talk will recount the stories, data and simulations behind some of these, and provide some modern-day re-enactments.

  • Heteroskedastic Sparse PCA in High Dimensions

    Date: 2023-03-10

    Time: 15:30-16:30 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1104


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Principal component analysis (PCA) is one of the most commonly used techniques for dimension reduction and feature extraction. Although sparse PCA has been well studied in high dimensions, little is known about the case where the noise is heteroskedastic, which turns out to be ubiquitous in scenarios such as biological sequencing data and information network data. We propose an iterative algorithm for sparse PCA in the presence of heteroskedastic noise, which alternately updates the estimates of the sparse eigenvectors using the power method with adaptive thresholding in one step, and imputes the diagonal values of the sample covariance matrix to reduce the estimation bias due to heteroskedasticity in the other step. Our procedure is computationally fast and provably optimal under the generalized spiked covariance model, assuming the leading eigenvectors are sparse. A comprehensive simulation study demonstrates its robustness and effectiveness in various settings.

  • High Dimensional Logistic Regression Under Network Dependence

    Date: 2023-03-10

    Time: 14:15-15:15 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1104


    Meeting ID: 834 3668 6293

    Passcode: 12345


    The classical formulation of logistic regression relies on the independent sampling assumption, which is often violated when the outcomes interact through an underlying network structure, such as over a temporal/spatial domain or on a social network. This necessitates the development of models that can simultaneously handle both the network peer-effect (arising from neighborhood interactions) and the effect of (possibly) high-dimensional covariates. In this talk, I will describe a framework for incorporating such dependencies in a high-dimensional logistic regression model by introducing a quadratic interaction term, as in the Ising model, designed to capture the pairwise interactions from the underlying network. The resulting model can also be viewed as an Ising model, where the node-dependent external fields linearly encode the high-dimensional covariates. We use a penalized maximum pseudo-likelihood method for estimating the network peer-effect and the effect of the covariates (the regression coefficients), which, in addition to handling the high dimensionality of the parameters, conveniently avoids the computational intractability of the maximum likelihood approach. Our results imply that even under network dependence it is possible to consistently estimate the model parameters at the same rate as in classical (independent) logistic regression, when the true parameter is sparse and the underlying network is not too dense. Towards the end, I will talk about the rates of consistency of our proposed estimator for various natural graph ensembles, such as bounded degree graphs, sparse Erdős-Rényi random graphs, and stochastic block models, which follow as a consequence of our general results. This is joint work with Ziang Niu, Sagnik Halder, Bhaswar Bhattacharya and George Michailidis.

  • Epidemic Forecasting using Delayed Time Embedding

    Date: 2023-02-17

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Forecasting the future trajectory of an outbreak plays a crucial role in the mission of managing emerging infectious disease epidemics. Compartmental models, such as the Susceptible-Exposed-Infectious-Recovered (SEIR), are the most popular tools for this task. They have been used extensively to combat many infectious disease outbreaks including the current COVID-19 pandemic. One downside of these models is that they assume that the dynamics of an epidemic follow a pre-defined dynamical system which may not capture the true trajectories of an outbreak. Consequently, users need to make several modifications throughout an epidemic to ensure their models fit well with the data. However, there is no guarantee that these modifications also help increase the precision of forecasting. In this talk, I will introduce a new method for predicting epidemics that does not make any assumptions about the underlying dynamical system. Our method combines sparse random feature expansion and delay embedding to learn the trajectory of an epidemic.
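
    The combination of delay embedding and random features can be sketched roughly as below. This is not the speaker's method: the function names, the cosine feature map, and all parameter values are assumptions made for illustration, and a ridge penalty stands in for the sparsity-promoting fit the abstract mentions.

```python
import numpy as np

def delay_embed(series, dim, lag=1):
    """Rows are delay vectors [x_t, x_{t+lag}, ..., x_{t+(dim-1)*lag}]."""
    x = np.asarray(series, dtype=float)
    n_rows = len(x) - (dim - 1) * lag
    if n_rows <= 0:
        raise ValueError("series too short for the requested embedding")
    return np.column_stack([x[i * lag : i * lag + n_rows] for i in range(dim)])

def fit_forecaster(series, dim, n_features=50, ridge=1e-3, seed=0):
    """One-step-ahead forecaster: random cosine features of delay vectors."""
    x = np.asarray(series, dtype=float)
    windows = delay_embed(x, dim)            # row t holds x[t : t + dim]
    inputs, targets = windows[:-1], x[dim:]  # row t predicts x[t + dim]
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(dim, n_features))   # random projection directions
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def features(z):
        return np.cos(z @ W + b)

    Phi = features(inputs)
    # Ridge regression (assumption: used here instead of a sparse fit)
    coef = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(n_features),
                           Phi.T @ targets)

    def predict(window):
        w = np.asarray(window, dtype=float)[None, :]
        return float((features(w) @ coef)[0])

    return predict

embedded = delay_embed([0, 1, 2, 3, 4], dim=3)
toy_series = np.sin(0.2 * np.arange(200))    # stand-in for case counts
predict = fit_forecaster(toy_series, dim=5)
next_value = predict(toy_series[-5:])
```

    No dynamical system is specified anywhere in the sketch: the forecaster sees only lagged windows of the observed series.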

  • Efficient Label Shift Adaptation through the Lens of Semiparametric Models

    Date: 2023-02-10

    Time: 15:00-16:00 (Montreal time)

    Hybrid: In person / Zoom

    Location: Burnside Hall 1205


    Meeting ID: 834 3668 6293

    Passcode: 12345


    We study the domain adaptation problem with label shift in this work. Under the label shift context, the marginal distribution of the label varies across the training and testing datasets, while the conditional distribution of features given the label is the same. Traditional label shift adaptation methods either suffer from large estimation errors or require cumbersome post-prediction calibrations. To address these issues, we first propose a moment-matching framework for adapting the label shift based on the geometry of the influence function. Under such a framework, we propose a novel method named efficient label shift adaptation (ELSA), in which the adaptation weights can be estimated by solving linear systems. Theoretically, the ELSA estimator is root-n consistent (n is the sample size of the source data) and asymptotically normal. Empirically, we show that ELSA can achieve state-of-the-art estimation performance without post-prediction calibrations, thus gaining computational efficiency.
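
    To give a sense of how adaptation weights can come out of a linear system, the sketch below uses the older confusion-matrix moment-matching estimator (black box shift estimation), not ELSA itself. The function name and the toy data are hypothetical. Under label shift, the distribution of a fixed classifier's predictions on the target data equals the source joint distribution of predictions and labels multiplied by the weight vector w[y] = p_target(y) / p_source(y).

```python
import numpy as np

def confusion_matrix_weights(source_labels, source_preds, target_preds, n_classes):
    """Estimate w[y] = p_target(y) / p_source(y) from hard predictions."""
    source_labels = np.asarray(source_labels)
    source_preds = np.asarray(source_preds)
    target_preds = np.asarray(target_preds)
    # C[i, j] estimates P_source(prediction = i, label = j)
    C = np.zeros((n_classes, n_classes))
    np.add.at(C, (source_preds, source_labels), 1.0)
    C /= len(source_labels)
    # mu[i] estimates P_target(prediction = i)
    mu = np.bincount(target_preds, minlength=n_classes) / len(target_preds)
    # Label shift implies mu = C @ w, so w solves a linear system
    return np.linalg.solve(C, mu)

# Toy data with a perfect classifier: the source is balanced,
# the target is 25% class 0 and 75% class 1.
weights = confusion_matrix_weights(
    source_labels=[0, 0, 1, 1],
    source_preds=[0, 0, 1, 1],
    target_preds=[0, 1, 1, 1],
    n_classes=2,
)
```

    With these toy inputs the recovered weights are 0.5 for class 0 and 1.5 for class 1, the ratios of target to source class proportions. The system is solvable only when the confusion matrix is invertible.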

  • Learning from a Biased Sample

    Date: 2023-02-03

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    The empirical risk minimization approach to data-driven decision making assumes that we can learn a decision rule from training data drawn under the same conditions as the ones we want to deploy it under. However, in a number of settings, we may be concerned that our training sample is biased, and that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. Building on concepts from distributionally robust optimization and sensitivity analysis, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions whose conditional distributions of outcomes given covariates differ from the conditional training distribution by at most a constant factor, and whose covariate distributions are absolutely continuous with respect to the covariate distribution of the training data. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a robust model using the method of sieves and propose a deep learning algorithm whose loss function captures our robustness target. We empirically validate our proposed method in simulations and a case study with the MIMIC-III dataset.