2020 Fall - McGill Statistics Seminars
  • Quasi-random sampling for multivariate distributions via generative neural networks

    Date: 2020-12-04

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    A novel approach based on generative neural networks is introduced for constructing quasi-random number generators for multivariate models with any underlying copula in order to estimate expectations with variance reduction. So far, quasi-random number generators for multivariate distributions required a careful design, exploiting specific properties (such as conditional distributions) of the implied copula or the underlying quasi-Monte Carlo point set, and were only tractable for a small number of models. Utilizing specific generative neural networks allows one to construct quasi-random number generators for a much larger variety of multivariate distributions without such restrictions. Once trained with a pseudo-random sample, these neural networks only require a multivariate standard uniform randomized quasi-Monte Carlo point set as input and are thus fast in estimating expectations under dependence with variance reduction. Reproducible numerical examples are considered to demonstrate the approach. Emphasis is put on ideas rather than mathematical proofs.
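    As a rough illustration of the pipeline described above, the sketch below pushes a scrambled Sobol (randomized quasi-Monte Carlo) point set through a fixed map to obtain dependent samples and estimate an expectation. The "generator" here is a simple Gaussian-copula map standing in for a trained neural network; the correlation parameter and the target expectation are illustrative, not from the talk.

```python
import numpy as np
from scipy.stats import norm, qmc

def toy_generator(u, rho=0.7):
    """Stand-in for a trained generative network: maps U(0,1)^2 points
    to a bivariate Gaussian-copula sample with correlation rho."""
    z = norm.ppf(u)                      # uniforms -> independent normals
    a = np.array([[1.0, 0.0],
                  [rho, np.sqrt(1 - rho**2)]])
    return z @ a.T                       # introduce dependence

# Randomized (scrambled) Sobol points as the low-discrepancy input
sobol = qmc.Sobol(d=2, scramble=True, seed=0)
u = sobol.random(2**12)
x = toy_generator(u)

# Estimate an expectation under dependence; here E[X1 * X2] = rho = 0.7
est = np.mean(x[:, 0] * x[:, 1])
```

    Once a network has been trained, sampling amounts to exactly this: generate an RQMC point set and evaluate the network on it, which is why estimation is fast.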

  • Probabilistic Approaches to Machine Learning on Tensor Data

    Date: 2020-11-27

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

In contemporary scientific research, it is often of great interest to predict a categorical response based on a high-dimensional tensor (i.e., a multi-dimensional array). Motivated by applications in science and engineering, we propose two probabilistic methods for machine learning on tensor data in the supervised and the unsupervised context, respectively. For supervised problems, we develop a comprehensive discriminant analysis model, called the CATCH model. The CATCH model integrates the information from the tensor and additional covariates to predict the categorical outcome with high accuracy. We further consider unsupervised problems, where no categorical response is available even on the training data. A doubly-enhanced EM (DEEM) algorithm is proposed for model-based tensor clustering, in which both the E-step and the M-step are carefully tailored for tensor data. CATCH and DEEM are developed under explicit statistical models with clear interpretations. They aggressively take advantage of the tensor structure and sparsity to tackle the new computational and statistical challenges arising from the intimidating tensor dimensions. Efficient algorithms are developed to solve the related optimization problems. Under mild conditions, CATCH and DEEM are shown to be consistent even when the dimension of each mode grows at an exponential rate in the sample size. Numerical studies also strongly support the application of CATCH and DEEM.

  • Modeling viral rebound trajectories after analytical antiretroviral treatment interruption

    Date: 2020-11-20

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

Despite the success of combined antiretroviral therapy (ART) in achieving sustained control of viral replication, concerns about side effects, drug-drug interactions, drug resistance, and cost motivate the search for strategies to achieve HIV eradication or an ART-free remission. Following ART withdrawal, patients’ viral load levels usually increase rapidly to a peak followed by a dip, and then stabilize at a viral load set point. Characterizing features of the viral rebound trajectories (e.g., time to viral rebound and viral set points) and identifying host, virological, and immunological factors that are predictive of these features requires addressing analytical challenges such as non-linear viral rebound trajectories, coarsened data due to the assay’s limit of quantification, and intermittent measurements of viral load values. We first introduce a parametric nonlinear mixed effects (NLME) model for the viral rebound trajectory and compare its performance to a mechanistic modeling approach. We then develop a smoothed simulated pseudo maximum likelihood method for fitting NLME models that permits flexible specification of random effects distributions. Finally, we investigate the association between the time to viral suppression after ART initiation and the time to viral rebound after ART interruption through a Cox proportional hazards regression model where both the outcome and the covariate are interval-censored observations.
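    To make the trajectory shape concrete, here is one hypothetical parametric form for a log10 viral load curve that is flat at the assay limit, rises rapidly to a peak, and then settles at a set point. The specific function and all parameter values below are illustrative only, not the NLME model from the talk.

```python
import numpy as np

def rebound_trajectory(t, setpoint=4.0, amp=8.0, k_rise=1.5, k_decay=0.4,
                       limit=1.3):
    """Hypothetical log10 viral load at time t (days since ART interruption)."""
    v = setpoint * (1 - np.exp(-k_decay * t)) + amp * (
        np.exp(-k_decay * t) - np.exp(-k_rise * t))
    # Values below the assay's limit of quantification are left-censored
    return np.maximum(v, limit)

t = np.linspace(0, 60, 241)
v = rebound_trajectory(t)
```

    In an NLME formulation, parameters such as the set point and rate constants would carry subject-specific random effects, and censoring at the quantification limit would enter the likelihood rather than being imposed as above.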

  • Approximate Cross-Validation for Large Data and High Dimensions

    Date: 2020-11-13

    Time: 15:30-16:30

    Zoom Link

    Abstract:

    The error or variability of statistical and machine learning algorithms is often assessed by repeatedly re-fitting a model with different weighted versions of the observed data. The ubiquitous tools of cross-validation (CV) and the bootstrap are examples of this technique. These methods are powerful in large part due to their model agnosticism but can be slow to run on modern, large data sets due to the need to repeatedly re-fit the model. We use a linear approximation to the dependence of the fitting procedure on the weights, producing results that can be faster than repeated re-fitting by orders of magnitude. This linear approximation is sometimes known as the “infinitesimal jackknife” (IJ) in the statistics literature, where it has mostly been used as a theoretical tool to prove asymptotic results. We provide explicit finite-sample error bounds for the infinitesimal jackknife in terms of a small number of simple, verifiable assumptions. Without further modification, though, we note that the IJ deteriorates in accuracy in high dimensions and incurs a running time roughly cubic in dimension. We additionally show, then, how dimensionality reduction can be used to successfully run the IJ in high dimensions when data is sparse or low rank. Simulated and real-data experiments support our theory.
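    The core idea admits a short sketch. For ridge regression, instead of exactly refitting with observation i removed, the infinitesimal jackknife takes one Newton step from the full-data fit: theta_{-i} ≈ theta_hat - H^{-1} x_i r_i, with H the regularized Hessian and r_i the residual. The data and penalty below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 1.0
X = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

theta_hat = ridge_fit(X, y, lam)
H_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))     # inverse Hessian at full fit
resid = y - X @ theta_hat

# IJ approximation to all n leave-one-out fits in one linear-algebra pass
theta_loo_ij = theta_hat - (H_inv @ (X * resid[:, None]).T).T   # shape (n, p)

# Exact leave-one-out refits, for comparison
theta_loo_exact = np.array([
    ridge_fit(np.delete(X, i, axis=0), np.delete(y, i), lam) for i in range(n)
])
max_err = np.abs(theta_loo_ij - theta_loo_exact).max()
```

    The approximation replaces n separate refits with one Hessian solve plus matrix multiplies, which is the source of the order-of-magnitude speedups; the cubic-in-dimension cost mentioned above comes from forming and inverting H.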

  • Generalized Energy-Based Models

    Date: 2020-11-06

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

I will introduce Generalized Energy-Based Models (GEBMs) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the “generator”). In particular, while the energy function is analogous to the GAN critic function, it is not discarded after training. GEBMs are trained by alternating between learning the energy and the base, much like a GAN. Both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. GEBMs also achieve state-of-the-art performance on density modelling tasks when using base measures with an explicit form.

  • Test-based integrative analysis of randomized trial and real-world data for treatment heterogeneity estimation

    Date: 2020-10-30

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

Parallel randomized clinical trial (RCT) and real-world data (RWD) are becoming increasingly available for treatment evaluation. Given the complementary features of the RCT and RWD, we propose a test-based integrative analysis of the RCT and RWD for accurate and robust estimation of the heterogeneity of treatment effect (HTE), which lies at the heart of precision medicine. When the RWD are not subject to bias, e.g., due to unmeasured confounding, our approach combines the RCT and RWD for optimal estimation by exploiting semiparametric efficiency theory. Utilizing the design advantage of RCTs, we construct a built-in test procedure to gauge the reliability of the RWD and decide whether or not to use RWD in an integrative analysis. We characterize the asymptotic distribution of the test-based integrative estimator under local alternatives, which provides a better approximation of the finite-sample behaviors of the test and estimator when the idealistic assumption required for the RWD is weakly violated. We provide a data-adaptive procedure to select the threshold of the test statistic that yields the smallest mean squared error of the proposed estimator of the HTE. Lastly, we construct an adaptive confidence interval that has a good finite-sample coverage property. We apply the proposed method to characterize who can benefit from adjuvant chemotherapy in patients with stage IB non-small cell lung cancer.
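    A toy sketch of the test-then-pool idea: estimate the treatment effect from the RCT alone and from the RWD, test whether they disagree, and pool by inverse-variance weighting only if the test does not reject. The fixed threshold `c` and the simulated data below are purely illustrative; the paper's threshold is chosen data-adaptively.

```python
import numpy as np

rng = np.random.default_rng(1)
tau_true = 2.0
rct = tau_true + 0.5 * rng.standard_normal(400)    # unbiased but noisier
rwd = tau_true + 0.1 * rng.standard_normal(5000)   # here: unbiased RWD

def integrative_estimate(rct, rwd, c=3.84):        # 3.84 ~ chi^2_1 5% cutoff
    t_r, v_r = rct.mean(), rct.var(ddof=1) / len(rct)
    t_w, v_w = rwd.mean(), rwd.var(ddof=1) / len(rwd)
    T = (t_r - t_w) ** 2 / (v_r + v_w)             # discrepancy test statistic
    if T < c:                                      # RWD looks compatible: pool
        w = (1 / v_r) / (1 / v_r + 1 / v_w)
        return w * t_r + (1 - w) * t_w             # inverse-variance weighting
    return t_r                                     # RWD rejected: RCT only

tau_hat = integrative_estimate(rct, rwd)
```

    If the RWD were biased (e.g., shifted by unmeasured confounding), the statistic T would be large with high probability and the estimator would fall back on the RCT alone, which is the robustness the built-in test provides.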

  • Linear Regression and its Inference on Noisy Network-linked Data

    Date: 2020-10-23

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

Linear regression on a set of observations linked by a network has been an essential tool in modeling the relationship between response and covariates with additional network data. Despite its wide range of applications in many areas, such as social sciences and health-related research, the problem has not been well-studied in statistics so far. Previous methods either lack inference tools or rely on restrictive assumptions on social effects, and usually treat the network structure as precisely observed, which is too good to be true in many problems. We propose a linear regression model with nonparametric social effects. Our model does not assume the relational data or network structure to be accurately observed; thus, our method can be provably robust to a certain level of perturbation of the network structure. We establish a full set of computationally efficient asymptotic inference tools under a general requirement of the perturbation and then study the robustness of our method in the specific setting when the perturbation is from random network models. We discover a phase-transition phenomenon of inference validity concerning the network density when no prior knowledge about the network model is available, while also showing the significant improvement achieved by knowing the network model. A by-product of our analysis is a rate-optimal concentration bound about subspace projection that may be of independent interest. We conduct extensive simulation studies to verify our theoretical observations and demonstrate the advantage of our method compared to a few benchmarks under different data-generating models. The method is then applied to adolescent network data to study the gender and racial differences in social activities.

  • Adaptive MCMC For Everyone

    Date: 2020-10-16

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Markov chain Monte Carlo (MCMC) algorithms, such as the Metropolis Algorithm and the Gibbs Sampler, are an extremely useful and popular method of approximately sampling from complicated probability distributions. Adaptive MCMC attempts to automatically modify the algorithm while it runs, to improve its performance on the fly. However, such adaptation often destroys the ergodicity properties necessary for the algorithm to be valid. In this talk, we first illustrate MCMC algorithms using simple graphical examples. We then discuss adaptive MCMC, and present examples and theorems concerning its ergodicity and efficiency.
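    A minimal sketch of the adaptive idea: a random-walk Metropolis sampler whose proposal scale is tuned on the fly toward a target acceptance rate, with adaptation steps that shrink over time (diminishing adaptation) so the kernel stabilizes and ergodicity is preserved. The standard normal target and all tuning constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * x * x            # standard normal, up to a constant

n_iter, x, log_scale = 50_000, 0.0, 0.0
target_acc = 0.44                              # common 1-d acceptance target
samples = np.empty(n_iter)

for t in range(n_iter):
    prop = x + np.exp(log_scale) * rng.standard_normal()
    accepted = np.log(rng.random()) < log_target(prop) - log_target(x)
    if accepted:
        x = prop
    # Diminishing adaptation: gamma_t -> 0, so changes to the kernel
    # eventually vanish and the adaptive chain remains valid.
    gamma = 1.0 / (t + 1) ** 0.6
    log_scale += gamma * (float(accepted) - target_acc)
    samples[t] = x

post_mean = samples[n_iter // 2:].mean()       # discard first half as burn-in
post_var = samples[n_iter // 2:].var()
```

    Naive adaptation (e.g., letting the proposal depend on the full history without such a decay) can destroy ergodicity, which is exactly the failure mode the talk's examples and theorems address.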

  • Machine Learning and Neural Networks: Foundations and Some Fundamental Questions

    Date: 2020-10-09

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

Statistical learning theory is by now a mature branch of data science that hosts a vast variety of practical techniques for tackling data-related problems. In this talk we present some fundamental concepts upon which statistical learning theory has been based. Different approaches to statistical inference will be discussed and the main problem of learning from Vapnik’s point of view will be explained. Further, we discuss the topic of function estimation as the heart of Vapnik-Chervonenkis theory. There exist several state-of-the-art methods for estimating functional dependencies, such as maximum margin estimators and artificial neural networks. While for some of these methods, e.g., the support vector machines, a profound theory has already been developed, others require more investigation. Accordingly, we pay closer attention to the so-called mapping neural networks and try to shed some light on certain theoretical aspects of them. We highlight some of the fundamental challenges that have attracted the attention of researchers and are yet to be fully resolved. One of these challenges is estimation of the intrinsic dimension of data, which will be discussed in detail. Another challenge is inferring causal direction when the training data set is not representative of the target population.
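    To illustrate the intrinsic-dimension problem mentioned above, the sketch below embeds a 3-dimensional linear structure in a 50-dimensional ambient space and recovers the dimension by counting how many principal components are needed to explain most of the variance. This is only the simplest (linear, PCA-based) estimator; the talk's methods may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_intrinsic, d_ambient = 1000, 3, 50

# Data on a random 3-dimensional linear subspace of R^50, plus small noise
basis = rng.standard_normal((d_intrinsic, d_ambient))
X = rng.standard_normal((n, d_intrinsic)) @ basis
X += 0.01 * rng.standard_normal((n, d_ambient))

# Count the principal components needed to explain 99% of the variance
Xc = X - X.mean(axis=0)
sing_vals = np.linalg.svd(Xc, compute_uv=False)
var_explained = np.cumsum(sing_vals**2) / np.sum(sing_vals**2)
d_hat = int(np.searchsorted(var_explained, 0.99) + 1)
```

    For data on a curved manifold rather than a linear subspace, this global PCA count overestimates the dimension, which is one reason the problem remains challenging.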

  • Data Science, Classification, Clustering and Three-Way Data

    Date: 2020-10-02

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    Data science is discussed along with some historical perspective. Selected problems in classification are considered, either via specific datasets or general problem types. In each case, the problem is introduced before one or more potential solutions are discussed and applied. The problems discussed include data with outliers, longitudinal data, and three-way data. The proposed approaches are generally mixture model-based.
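    As a minimal sketch of the mixture model-based approach to clustering, the code below fits a two-component univariate Gaussian mixture by EM and assigns each point to the component with the highest posterior probability. The toy data, initialization, and iteration count are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])

# Initial guesses: mixing weights, component means, component variances
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):                              # EM iterations
    # E-step: posterior probability (responsibility) of each component
    dens = pi * np.exp(-(data[:, None] - mu)**2 / (2 * var)) \
              / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted updates of the mixture parameters
    nk = resp.sum(axis=0)
    pi = nk / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    var = (resp * (data[:, None] - mu)**2).sum(axis=0) / nk

labels = resp.argmax(axis=1)                     # hard cluster assignment
```

    The longitudinal, outlier, and three-way settings in the talk extend this template by changing the component densities (e.g., heavier-tailed or matrix-variate distributions) while keeping the same EM machinery.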