McGill Statistics Seminars
  • Genomics like it's 1960: Inferring human history

    Date: 2017-09-08

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    A central goal of population genetics is the inference of the biological, evolutionary and demographic forces that shaped human diversity. Large-scale sequencing experiments provide fantastic opportunities to learn about human history and biology if we can overcome computational and statistical challenges. I will discuss how simple mid-century statistical approaches, such as the jackknife and Kolmogorov equations, can be combined in unexpected ways to solve partial differential equations, optimize genomic study design, and learn about the spread of modern humans since our common African origins.
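
    The jackknife named above is one of those mid-century tools. Below is a minimal sketch of the delete-one jackknife in Python; the statistic, data, and constants are illustrative assumptions, not material from the talk.

    ```python
    import numpy as np

    def jackknife(estimator, data):
        """Delete-one jackknife: bias-corrected estimate and standard error.

        `estimator` maps a 1-D array of observations to a scalar statistic.
        """
        n = len(data)
        theta_full = estimator(data)
        # Recompute the statistic on each leave-one-out subsample.
        replicates = np.array([estimator(np.delete(data, i)) for i in range(n)])
        theta_bar = replicates.mean()
        bias = (n - 1) * (theta_bar - theta_full)
        se = np.sqrt((n - 1) / n * np.sum((replicates - theta_bar) ** 2))
        return theta_full - bias, se

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=100)
    est, se = jackknife(np.var, x)  # jackknife the biased plug-in variance
    print(f"bias-corrected variance: {est:.3f} +/- {se:.3f}")
    ```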

  • Distributed kernel regression for large-scale data

    Date: 2017-03-31

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In modern scientific research, massive datasets with huge numbers of observations are frequently encountered. To facilitate the computational process, a divide-and-conquer scheme is often used for the analysis of big data. In such a strategy, a full dataset is first split into several manageable segments; the final output is then aggregated from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown whether such a distributed strategy provides valid theoretical inferences for the original data and, if so, how efficiently it works. In this talk, I address these fundamental issues for non-parametric distributed kernel regression, where accurate prediction is the main learning task. I will begin with the naive simple-averaging algorithm and then talk about an improved approach via ADMM. The promising performance of these methods is supported by both simulation and real-data examples.
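
    A minimal sketch of the naive simple-averaging algorithm, using scikit-learn's kernel ridge regression; the split count, kernel, and toy data are assumptions for illustration, and the ADMM refinement is not shown.

    ```python
    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    def averaged_kernel_regression(X, y, X_test, n_splits=10, **kr_args):
        """Divide-and-conquer kernel regression by simple averaging:
        fit one kernel machine per segment, then average the predictions."""
        segments = np.array_split(np.random.permutation(len(X)), n_splits)
        preds = [
            KernelRidge(kernel="rbf", **kr_args).fit(X[i], y[i]).predict(X_test)
            for i in segments
        ]
        return np.mean(preds, axis=0)

    # Toy example: a noisy sine curve with 5,000 observations.
    rng = np.random.default_rng(1)
    X = rng.uniform(0, 10, size=(5000, 1))
    y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(5000)
    X_test = np.linspace(0, 10, 200)[:, None]
    y_hat = averaged_kernel_regression(X, y, X_test, n_splits=10, alpha=1.0)
    ```

    Averaging keeps each kernel solve at roughly (n/m)^3 cost instead of n^3 for m segments, which is the computational appeal of the divide-and-conquer scheme.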

  • Bayesian sample size determination for clinical trials

    Date: 2017-03-24

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    The sample size determination problem is an important task in the planning of clinical trials, and it may be formulated formally in statistical terms. The most frequently used methods are based on the required size and power of the trial for a specified treatment effect. In contrast to the Bayesian decision-theoretic approach, they involve no explicit balancing of the cost of a possible increase in the size of the trial against the benefit of the more accurate information it would give. In this talk a fully Bayesian approach to the sample size determination problem is discussed. This approach treats the problem as a decision problem and employs a utility function to find the optimal sample size of a trial. Furthermore, we assume that a regulatory authority, which decides whether or not to grant a licence to a new treatment, uses a frequentist approach. The optimal sample size for the trial is then found by maximising the expected net benefit, which is the expected benefit of subsequent use of the new treatment minus the cost of the trial.
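
    A Monte Carlo sketch of this maximisation, under a deliberately simplified utility in which benefit accrues only if the regulator's one-sided frequentist test rejects; every prior, cost, and gain below is a hypothetical placeholder, not a value from the talk.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    def expected_net_benefit(n, n_sims=20_000,
                             prior_mean=0.3, prior_sd=0.2,  # prior on treatment effect
                             sigma=1.0, alpha=0.025,        # regulator's one-sided test
                             gain=1e6, cost_per_patient=500.0):
        """Expected net benefit of a two-arm trial with n patients per arm.

        Benefit is earned only when the regulator's frequentist test rejects H0.
        """
        theta = rng.normal(prior_mean, prior_sd, n_sims)  # effects drawn from the prior
        se = sigma * np.sqrt(2.0 / n)                     # SE of the difference in means
        z = rng.normal(theta / se, 1.0)                   # simulated test statistics
        licence = z > stats.norm.ppf(1 - alpha)           # does the regulator reject?
        return gain * licence.mean() - cost_per_patient * 2 * n

    candidates = np.arange(25, 1001, 25)
    best_n = max(candidates, key=expected_net_benefit)    # maximise over the grid
    ```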

  • High-throughput single-cell biology: The challenges and opportunities for machine learning scientists

    Date: 2017-03-10

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    The immune system does a lot more than kill “foreign” invaders. It’s a powerful sensory system that can detect stress levels, infections, wounds, and even cancer tumors. However, due to the complex interplay between different cell types and signaling pathways, the amount of data produced to characterize all different aspects of the immune system (tens of thousands of genes measured and hundreds of millions of cells, just from a single patient) completely overwhelms existing bioinformatics tools. My laboratory specializes in the development of machine learning techniques that address the unique challenges of high-throughput single-cell immunology. Sharing our lab space with a clinical and an immunological research laboratory, my students and fellows are directly exposed to the real-world challenges and opportunities of bringing machine learning and immunology to the (literal) bedside.

  • The first pillar of statistical wisdom

    Date: 2017-02-24

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    This talk will provide an introduction to the first of the pillars in Stephen Stigler’s 2016 book The Seven Pillars of Statistical Wisdom, namely “Aggregation.” It will focus on early instances of the sample mean in scientific work, on the early error distributions, and on how their “centres” were fitted.

    Speaker

    James A. Hanley is a Professor in the Department of Epidemiology, Biostatistics and Occupational Health, at McGill University.

  • Building end-to-end dialogue systems using deep neural architectures

    Date: 2017-02-17

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    The ability of a computer to converse in a natural and coherent manner with a human has long been held as one of the important steps towards solving artificial intelligence. In this talk I will present recent results on building dialogue systems from large corpora using deep neural architectures. I will highlight several challenges related to data acquisition, algorithmic development, and performance evaluation.

  • Sparse envelope model: Efficient estimation and response variable selection in multivariate linear regression

    Date: 2017-02-10

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    The envelope model is a method for efficient estimation in multivariate linear regression. In this article, we propose the sparse envelope model, which is motivated by applications where some response variables are invariant to changes of the predictors and have zero regression coefficients. The envelope estimator is consistent but not sparse, and in many situations it is important to identify the response variables for which the regression coefficients are zero. The sparse envelope model performs variable selection on the responses and preserves the efficiency gains offered by the envelope model. Response variable selection arises naturally in many applications, but has not been studied as thoroughly as predictor variable selection. In this article, we discuss response variable selection in both the standard multivariate linear regression and the envelope contexts. In response variable selection, even if a response has zero coefficients, it still should be retained to improve the estimation efficiency of the nonzero coefficients. This is different from the practice in predictor variable selection. We establish consistency and the oracle property, and obtain the asymptotic distribution of the sparse envelope estimator.
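
    The sparse envelope estimator itself does not fit in a short snippet, but the flavour of response variable selection can be illustrated with a cruder stand-in: a group lasso that zeroes out entire response columns of the coefficient matrix in a multivariate linear regression. Unlike the sparse envelope model, this sketch does not use the inactive responses to improve efficiency; it is an illustrative assumption, not the authors' estimator.

    ```python
    import numpy as np

    def response_group_lasso(X, Y, lam=0.05, n_iter=500):
        """Proximal gradient for (1/2n)||Y - XB||_F^2 + lam * sum_j ||B[:, j]||_2,
        with one penalty group per response, so whole columns of B go to zero."""
        n = X.shape[0]
        B = np.zeros((X.shape[1], Y.shape[1]))
        step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
        for _ in range(n_iter):
            B -= step * (X.T @ (X @ B - Y) / n)  # gradient step on the smooth part
            norms = np.linalg.norm(B, axis=0, keepdims=True)
            B *= np.maximum(1 - step * lam / np.maximum(norms, 1e-12), 0.0)  # group soft-threshold
        return B

    # Toy data: the last two of four responses have zero coefficients.
    rng = np.random.default_rng(3)
    X = rng.standard_normal((200, 5))
    B_true = np.zeros((5, 4))
    B_true[:, :2] = rng.standard_normal((5, 2))
    Y = X @ B_true + 0.1 * rng.standard_normal((200, 4))
    B_hat = response_group_lasso(X, Y)
    print(np.linalg.norm(B_hat, axis=0).round(2))  # last two column norms are ~0
    ```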

  • MM algorithms for variance component models

    Date: 2017-02-03

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Variance components estimation and mixed model analysis are central themes in statistics with applications in numerous scientific disciplines. Despite the best efforts of generations of statisticians and numerical analysts, maximum likelihood estimation and restricted maximum likelihood estimation of variance component models remain numerically challenging. In this talk, we present a novel iterative algorithm for variance components estimation based on the minorization-maximization (MM) principle. The MM algorithm is trivial to implement and competitive on large data problems. The algorithm readily extends to more complicated problems such as linear mixed models, multivariate response models possibly with missing data, maximum a posteriori estimation, and penalized estimation. We demonstrate, both numerically and theoretically, that it converges faster than the classical EM algorithm when the number of variance components is greater than two.
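
    For the simplest case of two variance components, y ~ N(0, s1*V1 + s2*V2), the MM literature gives multiplicative updates of the form sketched below, which shows why the iteration is trivial to implement; the toy data and dimensions are assumptions for illustration.

    ```python
    import numpy as np

    def mm_variance_components(y, V1, V2, n_iter=200, init=1.0):
        """MM iteration for ML estimation of (s1, s2) in y ~ N(0, s1*V1 + s2*V2).

        Each update multiplies the current value by the square root of a
        quadratic-form-to-trace ratio, which keeps the iterates positive and,
        by the MM principle, never decreases the likelihood.
        """
        s1, s2 = init, init
        for _ in range(n_iter):
            Omega_inv = np.linalg.inv(s1 * V1 + s2 * V2)
            w = Omega_inv @ y
            s1 *= np.sqrt((w @ V1 @ w) / np.trace(Omega_inv @ V1))
            s2 *= np.sqrt((w @ V2 @ w) / np.trace(Omega_inv @ V2))
        return s1, s2

    # Toy random-effects model: y = Z u + e with Var(u) = 2*I, Var(e) = I.
    rng = np.random.default_rng(4)
    n, q = 300, 10
    Z = rng.standard_normal((n, q))
    y = Z @ (np.sqrt(2.0) * rng.standard_normal(q)) + rng.standard_normal(n)
    s_g, s_e = mm_variance_components(y, Z @ Z.T, np.eye(n))  # the two component estimates
    ```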

  • Order selection in multidimensional finite mixture models

    Date: 2017-01-20

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Finite mixture models provide a natural framework for analyzing data from heterogeneous populations. In practice, however, the number of hidden subpopulations in the data may be unknown. The problem of estimating the order of a mixture model, namely the number of subpopulations, is thus crucial for many applications. In this talk, we present a new penalized likelihood solution to this problem, which is applicable to models with a multidimensional parameter space. The order of the model is estimated by starting with a large number of mixture components, which are clustered and then merged via two penalty functions. Doing so estimates the unknown parameters of the mixture at the same time as its order. We will present extensive simulation studies, showing that our approach outperforms many of the most common methods for this problem, such as the Bayesian Information Criterion. Real data examples involving normal and multinomial mixtures further illustrate its performance.
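
    For comparison, the BIC baseline mentioned above is easy to reproduce with scikit-learn: fit mixtures of increasing order and keep the fit that minimises BIC. The toy data and candidate range are illustrative; the penalized clustering-and-merging method from the talk is not shown.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy sample from a two-component bivariate normal mixture.
    rng = np.random.default_rng(5)
    data = np.vstack([
        rng.multivariate_normal([0, 0], np.eye(2), 300),
        rng.multivariate_normal([4, 4], np.eye(2), 200),
    ])

    # Fit each candidate order and select the BIC-minimising one.
    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(data).bic(data)
           for k in range(1, 8)}
    k_hat = min(bic, key=bic.get)  # should recover order 2 here
    ```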

  • (Sparse) exchangeable graphs

    Date: 2017-01-13

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Many popular statistical models for network-valued datasets fall under the remit of the graphon framework, which (implicitly) assumes the networks are densely connected. However, this assumption rarely holds for the real-world networks of practical interest. We introduce a new class of models for random graphs that generalises the dense graphon models to the sparse graph regime, and we argue that this meets many of the desiderata one would demand of a model to serve as the foundation for a statistical analysis of real-world networks. The key insight is to define the models by way of a novel notion of exchangeability; this is analogous to the specification of conditionally i.i.d. models by way of de Finetti’s representation theorem. We further develop this model class by explaining the foundations of sampling and estimation of network models in this setting. The latter result can be understood as the (sparse) graph analogue of estimation via the empirical distribution in the i.i.d. sequence setting.
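
    A minimal sketch of sampling from a dense graphon, the framework these models generalise; the particular W below is an arbitrary illustrative choice, and the sparse generalisation via the novel exchangeability notion is not shown.

    ```python
    import numpy as np

    def sample_graphon(W, n, rng):
        """Sample an n-node simple graph from a graphon W: [0,1]^2 -> [0,1].

        Each node receives a latent uniform label; each edge is an independent
        Bernoulli draw with probability W(u_i, u_j)."""
        u = rng.uniform(size=n)
        P = W(u[:, None], u[None, :])                     # edge probability matrix
        upper = np.triu(rng.uniform(size=(n, n)) < P, 1)  # sample the upper triangle
        A = upper.astype(int)
        return A + A.T                                    # symmetric, no self-loops

    rng = np.random.default_rng(6)
    A = sample_graphon(lambda x, y: 0.5 * np.exp(-(x + y)), 100, rng)
    print(A.sum() / (100 * 99))  # edge density is O(1) in n: the dense regime
    ```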