McGill Statistics Seminar - McGill Statistics Seminars
  • What is TWAS and how do we use it in integrating gene expression data

    Date: 2023-01-20

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Transcriptome-wide association studies (TWAS) are a pioneering approach that uses gene expression data to identify the genetic basis of complex diseases. Their core component is called “genetically regulated expression (GReX)”. GReX links gene expression information with phenotype by serving as both the outcome of genotype-based expression models and the predictor for downstream association testing. Although TWAS is popular and has been used in many high-profile projects, its mathematical nature and interpretation have not been rigorously verified. We therefore first conducted a power analysis using NCP-based closed forms (Cao et al, PLoS Genet 2021), from which we realized that the common interpretation of TWAS, which looks biologically sensible, is actually mathematically questionable. Following this insight, through real data analysis and simulations, we demonstrated that current linear models of GReX inadvertently combine two separable steps of machine learning, feature selection and aggregation, which can be independently replaced to improve overall power (Cao et al, Genetics 2021). Based on this new interpretation, we have developed novel protocols that disentangle feature selection and aggregation, leading to improved power and novel biological discoveries (Cao et al, BiB 2021; Genetics 2021). To promote this new understanding, we went on to develop two statistical tools that use gene expression to identify the genetic basis of gene-gene interactions (Kossinna et al, in preparation) and low-effect genetic variants (Li et al, in review), respectively. Looking forward, our mathematical characterization of TWAS opens the door to a new way of integrating gene expression in genetic studies towards the realization of precision medicine.
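
The two roles of GReX described above (outcome of a genotype-based expression model, then predictor in an association test) can be sketched on simulated data. This is only an illustration under stated assumptions: the data, the effect sizes, the ridge penalty and the helper `fit_ridge` are hypothetical, and the sketch does not reproduce the protocols from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: a reference panel with genotype and expression,
# and a larger association cohort with genotype and phenotype only.
n_ref, n_gwas, n_snps = 500, 2000, 20
G_ref = rng.binomial(2, 0.3, size=(n_ref, n_snps)).astype(float)
G_gwas = rng.binomial(2, 0.3, size=(n_gwas, n_snps)).astype(float)

w_true = np.zeros(n_snps)
w_true[:3] = [0.5, -0.4, 0.3]  # assumed: three SNPs regulate expression
expr_ref = G_ref @ w_true + rng.normal(size=n_ref)
expr_gwas = G_gwas @ w_true + rng.normal(size=n_gwas)  # not observed in practice
pheno = 0.5 * expr_gwas + rng.normal(size=n_gwas)


def fit_ridge(X, y, lam):
    """Ridge regression on centered data; returns one weight per column of X."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)


# Step 1: expression is the OUTCOME of a genotype-based linear model.
w_hat = fit_ridge(G_ref, expr_ref, lam=10.0)

# Step 2: predicted expression (GReX) is the PREDICTOR in the association test.
grex = (G_gwas - G_gwas.mean(axis=0)) @ w_hat
r = np.corrcoef(grex, pheno)[0, 1]
z_stat = r * np.sqrt(n_gwas - 2) / np.sqrt(1.0 - r**2)
print(f"association statistic: {z_stat:.2f}")
```

In this sketch the ridge weights play both roles at once: they decide which SNPs matter and how they are combined into a single score.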

  • To split or not to split that is the question: From cross validation to debiased machine learning

    Date: 2023-01-13

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Data splitting is a ubiquitous method in statistics, with examples ranging from cross validation to cross-fitting. However, despite its prevalence, theoretical guidance regarding its use is still lacking. In this talk we will explore two examples and establish asymptotic theory for each. In the first part of this talk, we study the cross-validation method, a ubiquitous method for risk estimation, and establish its asymptotic properties for a large class of models and with an arbitrary number of folds. Under stability conditions, we establish a central limit theorem and Berry-Esseen bounds for the cross-validated risk, which enable us to compute asymptotically accurate confidence intervals. Using our results, we study the statistical speed-up offered by cross validation compared to a train-test split procedure. We reveal some surprising behavior of the cross-validated risk and establish the statistically optimal choice for the number of folds. In the second part of this talk, we study the role of cross fitting in the generalized method of moments with moments that also depend on some auxiliary functions. Recent lines of work show how one can use generic machine learning estimators for these auxiliary problems, while maintaining asymptotic normality and root-n consistency of the target parameter of interest. The literature typically requires that these auxiliary problems are fitted on a separate sample or in a cross-fitting manner. We show that when these auxiliary estimation algorithms satisfy natural leave-one-out stability properties, sample splitting is not required. This allows for sample re-use, which can be beneficial in moderately sized sample regimes.
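
To make the two risk estimators in the first part concrete, here is a minimal sketch comparing a K-fold cross-validated risk with a single train-test split for least squares. The simulated data, the helper `kfold_losses`, and the naive normal-approximation interval are assumptions for illustration; the interval is not the variance estimator analyzed in the talk.

```python
import numpy as np


def kfold_losses(X, y, n_folds, rng):
    """Return one out-of-fold squared-error loss per observation."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    losses = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        beta, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
        losses[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
    return losses


rng = np.random.default_rng(1)
n, p = 400, 5
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

# K-fold cross-validated risk with a naive normal-approximation interval.
losses = kfold_losses(X, y, n_folds=5, rng=rng)
cv_risk = losses.mean()
half_width = 1.96 * losses.std(ddof=1) / np.sqrt(n)
ci = (cv_risk - half_width, cv_risk + half_width)

# Single train-test split: fit on the first half, evaluate on the second.
split = n // 2
beta_split, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
split_risk = np.mean((y[split:] - X[split:] @ beta_split) ** 2)

print(f"CV risk {cv_risk:.3f}, interval ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"train-test split risk {split_risk:.3f}")
```

Cross validation averages over all `n` held-out losses, while the split estimator only uses half of them, which is the kind of efficiency gap the abstract refers to as a statistical speed-up.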

  • Optimal One-pass Nonparametric Estimation Under Memory Constraint

    Date: 2022-11-18

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    For nonparametric regression in the streaming setting, where data constantly flow in and require real-time analysis, a main challenge is that data are cleared from the computer system once processed due to limited computer memory and storage. We tackle the challenge by proposing a novel one-pass estimator based on penalized orthogonal basis expansions and developing a general framework to study the interplay between statistical efficiency and memory consumption of estimators. We show that the proposed estimator is statistically optimal under memory constraint and has an asymptotically minimal memory footprint among all one-pass estimators of the same estimation quality. Numerical studies demonstrate that the proposed one-pass estimator is nearly as efficient as its non-streaming counterpart that has access to all historical data.
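
The idea of a one-pass estimator built on an orthogonal basis expansion can be illustrated with a simplified sketch. It assumes a uniform design on [0, 1], a cosine basis, and a fixed shrinkage rule; the class `OnePassSeriesEstimator` and its parameters are hypothetical and much simpler than the estimator proposed in the talk. Each observation updates running means of the basis coefficients and can then be discarded, so memory use depends only on the number of basis functions.

```python
import numpy as np


class OnePassSeriesEstimator:
    """Streaming cosine-series regression on [0, 1], assuming uniform design."""

    def __init__(self, n_basis, penalty=0.0):
        self.n_basis = n_basis
        self.penalty = penalty
        self.coef_mean = np.zeros(n_basis)  # running means of y * phi_j(x)
        self.n_seen = 0

    def _basis(self, x):
        # Orthonormal cosine basis: phi_0 = 1, phi_j = sqrt(2) * cos(pi * j * x).
        x = np.atleast_1d(np.asarray(x, dtype=float))
        j = np.arange(self.n_basis)
        phi = np.sqrt(2.0) * np.cos(np.pi * j * x[:, None])
        phi[:, 0] = 1.0
        return phi

    def update(self, x, y):
        # Constant-memory update; (x, y) is not stored.
        self.n_seen += 1
        phi = self._basis(x)[0]
        self.coef_mean += (y * phi - self.coef_mean) / self.n_seen

    def predict(self, x):
        # Illustrative roughness-type shrinkage of high-frequency coefficients.
        j = np.arange(self.n_basis)
        shrink = 1.0 / (1.0 + self.penalty * j**4)
        return self._basis(x) @ (shrink * self.coef_mean)


def true_f(x):
    return np.cos(2 * np.pi * x) + x


rng = np.random.default_rng(2)
est = OnePassSeriesEstimator(n_basis=8, penalty=1e-5)
for _ in range(5000):
    x = rng.uniform()
    y = true_f(x) + 0.3 * rng.normal()
    est.update(x, y)

grid = np.linspace(0.0, 1.0, 101)
mse = np.mean((est.predict(grid) - true_f(grid)) ** 2)
print(f"mean squared error on grid: {mse:.4f}")
```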

  • Automated Inference on Sharp Bounds

    Date: 2022-11-11

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Many causal parameters involving the joint distribution of potential outcomes in treated and control states cannot be point-identified, but can only be bounded from above and below. The bounds can be further tightened by conditioning on pre-treatment covariates, and the sharp version of the bounds corresponds to using the full covariate vector. This paper gives a method for estimation and inference on sharp bounds determined by a linear system of under-identified equalities (e.g., as in Heckman et al (ReSTUD, 1997)). In the sharp bounds case, the RHS of this system involves a nuisance function of (many) covariates (e.g., the conditional probability of employment in the treated or control state). Combining Neyman-orthogonality and sample splitting, I provide an asymptotically Gaussian estimator of the sharp bound that does not require solving the linear system in closed form. I demonstrate the method in an empirical application to Connecticut’s Jobs First welfare reform experiment.

  • Max-linear Graphical Models for Extreme Risk Modelling

    Date: 2022-11-04

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    Graphical models can represent multivariate distributions in an intuitive way and, hence, facilitate statistical analysis of high-dimensional data. Such models are usually modular so that high-dimensional distributions can be described and handled by careful combination of lower dimensional factors. Furthermore, graphs are natural data structures for algorithmic treatment. Moreover, graphical models can allow for causal interpretation, often provided through a recursive system on a directed acyclic graph (DAG); the max-linear Bayesian network we introduced in [1] is a specific example. This talk contributes to the recently emerged topic of graphical models for extremes, in particular to max-linear Bayesian networks, which are max-linear graphical models on DAGs. Generalized MLEs are derived in [2]. In this context, the Latent River Problem has emerged as a flagship problem for causal discovery in extreme value statistics. In [3] we provide a simple and efficient algorithm, QTree, to solve the Latent River Problem. QTree returns a directed graph and achieves almost perfect recovery on the Upper Danube, the existing benchmark dataset, as well as on new data from the Lower Colorado River in Texas. It can handle missing data and has an automated parameter tuning procedure. In our paper, we also show that, under a max-linear Bayesian network model for extreme values with propagating noise, the QTree algorithm returns asymptotically a.s. the correct tree. Here we use the fact that the non-noisy model has a left-sided atom for every bivariate marginal distribution when there is a directed edge between the nodes.
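
As a small illustration of the recursive system behind a max-linear Bayesian network, the sketch below simulates the non-noisy model on a toy DAG, where each node is the maximum of its own innovation and the weighted values of its parents. The graph, the edge weights, the unit Fréchet innovations and the function name `simulate_max_linear` are assumptions chosen for the example; this is not the QTree algorithm.

```python
import numpy as np


def simulate_max_linear(edge_weights, n_nodes, n_samples, rng):
    """Simulate X_i = max(Z_i, max over parents j of c_ji * X_j).

    edge_weights maps (parent, child) to a positive weight; nodes are assumed
    to be topologically ordered, so every parent index is below its child.
    """
    for parent, child in edge_weights:
        assert parent < child, "nodes must be topologically ordered"
    # Unit Frechet innovations: the reciprocal of a standard exponential.
    Z = 1.0 / rng.exponential(size=(n_samples, n_nodes))
    X = Z.copy()
    for child in range(n_nodes):
        for (parent, node), weight in edge_weights.items():
            if node == child:
                X[:, child] = np.maximum(X[:, child], weight * X[:, parent])
    return X


rng = np.random.default_rng(3)
edges = {(0, 1): 0.8, (1, 2): 0.5, (0, 3): 0.6}  # hypothetical toy network
X = simulate_max_linear(edges, n_nodes=4, n_samples=2000, rng=rng)

# Along a directed edge the ratio child / parent is bounded below by the edge
# weight, and the bound is attained with positive probability (an atom).
ratio = X[:, 1] / X[:, 0]
atom_share = np.mean(np.isclose(ratio, 0.8))
print(f"min ratio {ratio.min():.3f}, share of samples at the bound {atom_share:.2f}")
```

The share of samples sitting exactly at the lower bound of the ratio reflects the atom in the bivariate distribution along a directed edge that the abstract mentions for the non-noisy model.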

  • A Conformal-Based Two-Sample Conditional Distribution Test

    Date: 2022-10-21

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    We consider the problem of testing the equality of the conditional distribution of a response variable given a set of covariates between two populations. Such a testing problem is related to transfer learning and causal inference. We develop a nonparametric procedure by combining recent advances in conformal prediction with some new ingredients such as a novel choice of conformity score and data-driven choices of weight and score functions. To our knowledge, this is the first successful attempt at using conformal prediction for testing statistical hypotheses beyond exchangeability. The final test statistic reveals a natural connection between conformal inference and the classical rank-sum test. Our method is suitable for modern machine learning scenarios where the data are high-dimensional and the sample size is large, and can be effectively combined with existing classification algorithms to find good weight and score functions. The performance of the proposed method is demonstrated in synthetic and real data examples.

  • Some steps towards causal representation learning

    Date: 2022-10-07

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    High-dimensional unstructured data such as images or sensor data can often be collected cheaply in experiments, but are challenging to use in a causal inference pipeline without extensive engineering and domain knowledge to extract underlying latent factors. The long-term goal of causal representation learning is to find appropriate assumptions and methods to disentangle latent variables and learn the causal mechanisms that explain a system’s behaviour. In this talk, I’ll present results from a series of recent papers that describe how we can leverage assumptions about a system’s causal mechanisms to provably disentangle latent factors. I will also talk about the limitations of a commonly used injectivity assumption, and discuss a hierarchy of settings that relax this assumption.

  • Statistical Inference for Functional Linear Quantile Regression

    Date: 2022-09-16

    Time: 15:20-16:20 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    We propose inferential tools for functional linear quantile regression where the conditional quantile of a scalar response is assumed to be a linear functional of a functional covariate. In contrast to conventional approaches, we employ kernel convolution to smooth the original loss function. The coefficient function is estimated under a reproducing kernel Hilbert space framework. A gradient descent algorithm is designed to minimize the smoothed loss function with a roughness penalty. With the aid of the Banach fixed-point theorem, we show the existence and uniqueness of our proposed estimator as the minimizer of the regularized loss function in an appropriate Hilbert space. Furthermore, we establish the convergence rate as well as the weak convergence of our estimator. As far as we know, this is the first weak convergence result for a functional quantile regression model. Pointwise confidence intervals and a simultaneous confidence band for the true coefficient function are then developed based on these theoretical properties. Numerical studies including both simulations and a data application are conducted to investigate the performance of our estimator and inference tools in finite samples.
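
The combination of a kernel-smoothed quantile loss with gradient descent can be sketched in a much simpler setting than the talk's. The example below assumes a Gaussian kernel, for which the convolution of the check loss has a closed form, and a scalar covariate instead of a functional one; there is no RKHS or roughness penalty. The function names, bandwidth and step size are hypothetical choices.

```python
import math

import numpy as np

# Standard normal CDF, vectorized through math.erf to avoid extra dependencies.
norm_cdf = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))


def smoothed_check_loss(u, tau, h):
    """Check loss u * (tau - 1{u < 0}) convolved with a N(0, h^2) kernel."""
    x = u / h
    abs_smooth = h * (np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x**2)
                      + x * (2.0 * norm_cdf(x) - 1.0))
    return 0.5 * abs_smooth + (tau - 0.5) * u


def smoothed_check_grad(u, tau, h):
    """Derivative of the smoothed loss with respect to the residual u."""
    return norm_cdf(u / h) - (1.0 - tau)


def fit_smoothed_quantile(X, y, tau, h=0.3, step=0.5, n_iter=500):
    """Plain gradient descent on the average smoothed loss of y - X @ beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        resid = y - X @ beta
        grad = -X.T @ smoothed_check_grad(resid, tau, h) / len(y)
        beta -= step * grad
    return beta


rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

beta_hat = fit_smoothed_quantile(X, y, tau=0.5)
print("estimated median-regression coefficients:", np.round(beta_hat, 3))
```

Smoothing makes the loss differentiable everywhere, which is what allows a plain gradient step here; as the bandwidth `h` shrinks, the smoothed loss approaches the original check loss.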

  • Markov-Switching State Space Models For Uncovering Musical Interpretation

    Date: 2022-09-09

    Time: 15:30-16:30 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    For concertgoers, musical interpretation is the most important factor in determining whether or not we enjoy a classical performance. Every performance includes mistakes—intonation issues, a lost note, an unpleasant sound—but these are all easily forgotten (or unnoticed) when a performer engages her audience, imbuing a piece with novel emotional content beyond the vague instructions inscribed on the printed page. In this research, we use data from the CHARM Mazurka Project—forty-six professional recordings of Chopin’s Mazurka Op. 68 No. 3 by consummate artists—with the goal of elucidating musically interpretable performance decisions. We focus specifically on each performer’s use of musical tempo by examining the inter-onset intervals of the note attacks in the recording. To explain these tempo decisions, we develop a switching state space model and estimate it by maximum likelihood combined with prior information gained from music theory and performance practice. We use the estimated parameters to quantitatively describe individual performance decisions and compare recordings. These comparisons suggest methods for informing music instruction, discovering listening preferences, and analyzing performances.

  • Enriched post-selection models for high dimensional data

    Date: 2022-04-08

    Time: 15:35-16:35 (Montreal time)


    Meeting ID: 834 3668 6293

    Passcode: 12345


    High dimensional data are rapidly growing in many domains, for example, in microarray gene expression studies, fMRI data analysis, large-scale healthcare analytics, text/image analysis, natural language processing and astronomy, to name but a few. In the last two decades regularisation approaches have become the methods of choice for analysing high dimensional data. However, obtaining accurate estimates and predictions as well as valid statistical inference remains a major challenge in high dimensional situations. In this talk, we present enriched post-selection models that aim to improve parameter estimation and prediction, and to facilitate statistical inferences in high dimensional regression models. The enriched post-selection method enables us to construct valid post-selection inference for regression parameters in high dimensions. We discuss the empirical and asymptotic properties of the enriched post-selection method.