Past Seminar Series - McGill Statistics Seminars
  • Sparse Penalized Quantile Regression: Method, Theory, and Algorithm

    Date: 2018-02-23

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Sparse penalized quantile regression is a useful tool for variable selection, robust estimation, and heteroscedasticity detection in high-dimensional data analysis. We discuss the variable selection and estimation properties of the lasso and folded concave penalized quantile regression via non-asymptotic arguments. We also consider consistent parameter tuning therein. The computational issues of sparse penalized quantile regression have not yet been fully resolved in the literature, owing to the non-smoothness of the quantile regression loss function. We introduce fast alternating direction method of multipliers (ADMM) algorithms for computing the sparse penalized quantile regression. Numerical examples demonstrate the competitive performance of our algorithm: it significantly outperforms several other fast solvers for high-dimensional penalized quantile regression.
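
    For reference, the lasso-penalized quantile regression estimator described above takes the standard form below (the folded concave variants replace the l1 penalty with a folded concave penalty; the exact formulation used in the talk may differ):

        \hat{\beta} = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^{\top}\beta\right) + \lambda \|\beta\|_1,
        \qquad \rho_\tau(u) = u\,\{\tau - I(u < 0)\},

    where the check loss \rho_\tau is non-differentiable at zero; this non-smoothness is what complicates the optimization and motivates the ADMM algorithms discussed in the talk.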

  • The Law of Large Populations: The return of the long-ignored N and how it can affect our 2020 vision

    Date: 2018-02-16

    Time: 15:30-16:30

    Location: McGill University, OTTO MAASS 217

    Abstract:

    For over a century now, we statisticians have successfully convinced ourselves, and almost everyone else, that in statistical inference the size of the population, N, can be ignored, especially when it is large. Instead, we focused on the size of the sample, n, the key driving force for both the Law of Large Numbers and the Central Limit Theorem. We were thus taught that the statistical error (standard error) goes down with n, typically at the rate of 1/√n. However, all of this relies on the presumption that our data have perfect quality, in the sense of being equivalent to a probabilistic sample. A largely overlooked statistical identity, a potential counterpart to the Euler identity in mathematics, reveals a Law of Large Populations (LLP), a law that we should all be afraid of. That is, once we lose control over data quality, the systematic error (bias) in the usual estimators, relative to the benchmarking standard error from simple random sampling, goes up with N at the rate of √N. The coefficient in front of √N can be viewed as a data defect index: the simple Pearson correlation between the reporting/recording indicator and the value reported/recorded. Because of the multiplier √N, a seemingly tiny correlation, say 0.005, can have a detrimental effect on the quality of inference. Without an understanding of this LLP, “big data” can do more harm than good because of drastically inflated precision assessments and hence gross overconfidence, setting us up to be caught by surprise when reality unfolds, as we all experienced during the 2016 US presidential election. Data from the Cooperative Congressional Election Study (CCES; conducted by Stephen Ansolabehere, Douglas Rivers and others, and analyzed by Shiro Kuriwaki) are used to estimate the data defect index for the 2016 US election, with the aim of gaining a clearer vision for the 2020 election and beyond.
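
    The identity behind the LLP can be written, in one common form, as follows (a sketch consistent with the abstract; the notation is assumed rather than taken from the talk):

        \bar{Y}_n - \bar{Y}_N \;=\; \rho_{R,Y} \,\times\, \sqrt{\frac{N-n}{n}} \,\times\, \sigma_Y,

    where \bar{Y}_n is the mean of the recorded data, \bar{Y}_N the population mean, R the reporting/recording indicator, \rho_{R,Y} the Pearson correlation between R and Y (the data defect index), and \sigma_Y the population standard deviation of Y. Dividing by the benchmark standard error under simple random sampling of the same size shows the relative error growing like \rho_{R,Y}\sqrt{N}.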

  • Methodological considerations for the analysis of relative treatment effects in multi-drug-resistant tuberculosis from fused observational studies

    Date: 2018-02-09

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Multi-drug-resistant tuberculosis (MDR-TB) is defined as strains of tuberculosis that do not respond to at least the two most commonly used anti-TB drugs. After diagnosis, the intensive treatment phase for MDR-TB involves taking several alternative antibiotics concurrently. The Collaborative Group for Meta-analysis of Individual Patient Data in MDR-TB has assembled a large, fused dataset of over 30 observational studies comparing the effectiveness of 15 antibiotics. The particular challenges that we have considered in the analysis of this dataset are the large number of potential drug regimens, the resistance of MDR-TB strains to specific antibiotics, and the identifiability of a generalized parameter of interest even though most drugs were not observed in all studies. In this talk, I describe causal inference theory and methodology that we have appropriated or developed for the estimation of treatment importance and the relative effectiveness of different antibiotic regimens, with a particular emphasis on targeted learning approaches.
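
    As a point of reference only, the kind of causal parameter underlying a "relative effectiveness" comparison can be written, for a single binary drug indicator A, an outcome Y, and measured confounders W, as (a generic average treatment effect; the regimen-level parameters and targeted learning estimators in the talk are more elaborate):

        \psi = E_W\left[ E(Y \mid A = 1, W) - E(Y \mid A = 0, W) \right],

    and identifying such a parameter across the fused studies is complicated by the fact that most drugs were not observed in all studies.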

  • A new approach to model financial data: The Factorial Hidden Markov Volatility Model

    Date: 2018-02-02

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    A new process, the factorial hidden Markov volatility (FHMV) model, is proposed to model financial returns or realized variances. This process is constructed based on a factorial hidden Markov model structure and corresponds to a parsimoniously parametrized hidden Markov model that includes thousands of volatility states. The transition probability matrix of the underlying Markov chain is structured so that the multiplicity of its second largest eigenvalue can be greater than one. This distinctive feature allows for a better representation of volatility persistence in financial data. Jumps and a leverage effect are also incorporated into the model and statistical properties are discussed. An empirical study on six financial time series shows that the FHMV process compares favorably to state-of-the-art volatility models in terms of in-sample fit and out-of-sample forecasting performance over time horizons ranging from one to one hundred days.
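
    For orientation, a generic hidden Markov volatility model for returns r_t takes the form (a simplified sketch, not the exact FHMV specification):

        r_t = \sigma(C_t)\,\varepsilon_t, \qquad \varepsilon_t \overset{iid}{\sim} N(0,1),

    where C_t is an unobserved finite-state Markov chain and \sigma(\cdot) maps each state to a volatility level; the factorial construction builds C_t from several independent components, which is how the FHMV model obtains thousands of volatility states from a parsimonious parametrization.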

  • Back to the future: why I think REGRESSION is the new black in genetic association studies

    Date: 2018-01-26

    Time: 15:30-16:30

    Location: ROOM 6254 Pavillon Andre-Aisenstadt 2920, UdeM

    Abstract:

    Linear regression remains an important framework in the era of big and complex data. In this talk I present some recent examples where we resort to the classical simple linear regression model and its celebrated extensions in novel settings. The Eureka moment came while reading Wu and Guan’s (2015) comments on our generalized Kruskal-Wallis (GKW) test (Elif Acar and Sun 2013, Biometrics). Wu and Guan presented an alternative “rank linear regression model and derived the proposed GKW statistic as a score test statistic”, and astutely pointed out that “the linear model approach makes the derivation more straightforward and transparent, and leads to a simplified and unified approach to the general rank based multi-group comparison problem.” More recently, we turned our attention to extending Levene’s variance test for data with group uncertainty and sample correlation. While a direct modification of the original statistic is indeed challenging, I will demonstrate that a two-stage regression framework makes the ensuing development quite straightforward, eventually leading to a generalized joint location-scale test (David Soave and Sun 2017, Biometrics). Finally, I will discuss ongoing work, with graduate student Lin Zhang, on developing an allele-based association test that is robust to the assumption of Hardy-Weinberg equilibrium and is generalizable to complex data structures. The crux of this work is, again, reformulating the problem as a regression!
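
    To make the two-stage regression idea concrete, here is a minimal sketch in Python of a classical Levene-type scale test for plain group labels (no group uncertainty or sample correlation); the generalized joint location-scale test of Soave and Sun (2017) extends this considerably:

        import numpy as np
        from scipy import stats

        def two_stage_scale_test(y, group):
            """Levene-type test for unequal spread via two-stage regression.

            Stage 1: a location model -- remove group means from y.
            Stage 2: a scale model -- one-way ANOVA of the absolute
                     stage-1 residuals on group, which tests for
                     heteroscedasticity across groups.
            """
            y = np.asarray(y, dtype=float)
            group = np.asarray(group)
            levels = np.unique(group)

            # Stage 1: residuals after removing group-specific means
            resid = y.copy()
            for g in levels:
                resid[group == g] -= y[group == g].mean()

            # Stage 2: compare |residuals| across groups with an F test
            abs_resid = [np.abs(resid[group == g]) for g in levels]
            return stats.f_oneway(*abs_resid)

        # Toy example: the third group has a larger variance
        rng = np.random.default_rng(1)
        y = np.concatenate([rng.normal(0, 1, 50),
                            rng.normal(2, 1, 50),
                            rng.normal(1, 3, 50)])
        group = np.repeat([0, 1, 2], 50)
        print(two_stage_scale_test(y, group))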

  • Generalized Sparse Additive Models

    Date: 2018-01-19

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    I will present a unified approach to the estimation of generalized sparse additive models in high-dimensional regression problems. Our approach is based on combining structure-inducing and sparsity penalties in a single regression problem. It allows for the use of a large family of structure-inducing penalties: those characterized by semi-norm constraints. This includes finite-dimensional linear subspaces, Sobolev and Hölder classes, and classes with bounded total variation, among others. We give an efficient computational algorithm to fit this family of models that easily scales to thousands of observations and features. In addition, we develop a framework for proving convergence bounds on these estimators and show that our estimators converge at the minimax optimal rate under suitable conditions. We also compare the performance of existing methods in an empirical study and discuss directions for future work.
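
    In generic form, the estimator combines a sparsity penalty with a structure-inducing semi-norm penalty in one problem (illustrative notation, not necessarily that of the talk):

        \min_{f_1,\dots,f_p} \; \frac{1}{n}\sum_{i=1}^{n} \ell\Big(y_i,\, \sum_{j=1}^{p} f_j(x_{ij})\Big) \;+\; \sum_{j=1}^{p}\Big\{ \lambda_1 \|f_j\|_n + \lambda_2\, P(f_j) \Big\},

    where \ell is a loss (e.g. squared error or a GLM negative log-likelihood), \|f_j\|_n is an empirical norm that induces sparsity across components, and P is a semi-norm encoding structure (a finite-dimensional linear subspace, a Sobolev or Hölder ball, bounded total variation, and so on).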

  • Modelling RNA stability for decoding the regulatory programs that drive human diseases

    Date: 2018-01-12

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    The key determinant of the identity and behaviour of the cell is gene regulation, i.e. which genes are active and which genes are inactive in a particular cell. One of the least understood aspects of gene regulation is RNA stability: genes produce RNA molecules to carry their genetic information – the more stable these RNA molecules are, the longer they can function within the cell, and the less stable they are, the more rapidly they are removed from the pool of active molecules. The cell can effectively switch genes on and off by regulating RNA stability. However, we do not know which genes are regulated at the level of RNA stability, and what factors affect their stability. The focus of our research is the development of novel computational methods that enable the measurement of RNA stability and decay rates from functional genomics data, and the inference of models that explain how human cells regulate RNA stability. We are particularly interested in how defects in the regulation of RNA stability can lead to the development and progression of various human diseases, such as cancer.
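
    For context, the simplest kinetic picture behind "stability" and "decay rate" is first-order decay (a textbook model; the computational methods in the talk infer such rates from functional genomics data rather than assuming this exact form):

        \frac{dM(t)}{dt} = \beta - \alpha\, M(t), \qquad M^{*} = \frac{\beta}{\alpha}, \qquad t_{1/2} = \frac{\ln 2}{\alpha},

    where M(t) is the RNA abundance, \beta the production (transcription) rate, \alpha the decay rate, M^{*} the steady-state level, and t_{1/2} the half-life; more stable transcripts have smaller \alpha and hence longer half-lives.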

  • Fisher’s method revisited: set-based genetic association and interaction studies

    Date: 2017-12-01

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Fisher’s method, also known as Fisher’s combined probability test, is commonly used in meta-analyses to combine p-values from the same test applied to K independent samples to evaluate a common null hypothesis. Here we propose to use it to combine p-values from different tests applied to the same sample in two settings: when jointly analyzing multiple genetic variants in set-based genetic association studies, or when jointly capturing main and interaction effects when one of the interacting variables is missing. In the first setting, we show that many existing methods (e.g. the so-called burden test and SKAT) can be classified into a class of linear statistics and a class of quadratic statistics, where each class is powerful only in part of the high-dimensional parameter space. In the second setting, we show that the class of scale tests for heteroscedasticity can be utilized to indirectly identify unspecified interaction effects, complementing the class of location tests designed for detecting main effects only. In both settings, we show that the two classes of tests are asymptotically independent of each other under the global null hypothesis. Thus, we can evaluate the significance of the resulting Fisher’s test statistic using the chi-squared distribution with four degrees of freedom; this is a desirable feature for analyzing big data. In addition to analytical results, we provide empirical evidence to show that the new class of joint tests is not only robust but can also have better power than the individual tests. This is based on joint work with former graduate students Andriy Derkach (Derkach et al. 2013, Genetic Epidemiology; Derkach et al. 2014, Statistical Science) and David Soave (Soave et al. 2015, The American Journal of Human Genetics; Soave and Sun 2017, Biometrics).
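
    For reference, Fisher’s combined probability statistic for K independent p-values is

        T = -2\sum_{k=1}^{K} \log p_k \;\sim\; \chi^{2}_{2K} \quad \text{under the global null hypothesis},

    so combining the two asymptotically independent tests above (K = 2) yields the chi-squared distribution with four degrees of freedom mentioned in the abstract.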

  • 150 years (and more) of data analysis in Canada

    Date: 2017-11-24

    Time: 15:30-16:30

    Location: LEA 232

    Abstract:

    As Canada celebrates its 150th anniversary, it may be good to reflect on the past and future of data analysis and statistics in this country. In this talk, I will review the Victorian Statistics Movement and its effect in Canada, data analysis by a Montréal physician in the 1850s, a controversy over data analysis in the 1850s and 60s centred in Montréal, John A. Macdonald’s use of statistics, the Canadian insurance industry and the use of statistics, the beginning of mathematical statistics in Canada, the Fisherian revolution, the influence of Fisher, Neyman and Pearson, the computer revolution, and the emergence of data science.

  • A log-linear time algorithm for constrained changepoint detection

    Date: 2017-11-17

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Changepoint detection is a central problem in time series and genomic data analysis. For some applications, it is natural to impose constraints on the directions of changes. One example is ChIP-seq data, for which adding an up-down constraint improves peak detection accuracy but makes the optimization problem more complicated. In this talk I will explain how a recently proposed functional pruning algorithm can be generalized to solve such constrained changepoint detection problems. Our proposed log-linear time algorithm achieves state-of-the-art peak detection accuracy in a benchmark of several genomic data sets, and is orders of magnitude faster than our previous quadratic time algorithm. Our implementation is available as the PeakSegPDPA function in the PeakSegOptimal R package: https://cran.r-project.org/package=PeakSegOptimal
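
    Schematically, penalized changepoint detection solves a problem of the following form (an illustrative formulation, not the exact one from the talk):

        \min_{m_1,\dots,m_T} \; \sum_{t=1}^{T} \gamma(x_t, m_t) \;+\; \lambda \sum_{t=2}^{T} I(m_t \neq m_{t-1}),

    where \gamma is a pointwise loss (e.g. a Poisson negative log-likelihood for count data) and each change incurs a penalty \lambda. The up-down constraint restricts the signs of successive changes so that an increase must be followed by a decrease (a peak), and the generalized functional pruning algorithm solves this constrained problem in log-linear time.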