2018 Winter - McGill Statistics Seminars
  • Methodological challenges in using point-prevalence versus cohort data in risk factor analyses of hospital-acquired infections

    Date: 2018-04-27

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    We explore the impact of length-biased sampling on the evaluation of risk factors for nosocomial infections in point-prevalence studies. We used cohort data with full information, including the exact date of the nosocomial infection, and mimicked an artificial one-day prevalence study by drawing a sample from this cohort. Based on the cohort data, we studied the underlying multi-state model, which accounts for nosocomial infection as an intermediate event and discharge/death as competing events. Simple formulas are derived that display the relationships between risk, hazard, and prevalence odds ratios. Due to length-biased sampling, long-stay and thus sicker patients are more likely to be sampled. In addition, patients with nosocomial infections usually stay longer in hospital. We explored mechanisms which are, by design, hidden in prevalence data. In our example, we showed that prevalence odds ratios were usually less pronounced than risk odds ratios but more pronounced than hazard ratios. Thus, to avoid misinterpretation, knowledge of the mechanisms from the underlying multi-state model is essential for the interpretation of risk factors derived from point-prevalence data.
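
    As an editorial aside, the toy simulation below (a minimal sketch, not the authors' model; every rate and effect size is an invented illustration value) shows the length-bias mechanism at work: patients present in hospital on a single prevalence day are, on average, longer-stay and more often infected than patients in the full admission cohort.

    ```python
    # Toy illustration of length-biased sampling in a one-day prevalence study.
    import numpy as np

    rng = np.random.default_rng(2018)
    n = 200_000                                  # admissions over the study period
    admit = rng.uniform(0, 365, n)               # admission day within one year
    t_inf = rng.exponential(30, n)               # time to nosocomial infection
    t_out = rng.exponential(7, n)                # time to discharge/death
    infected = t_inf < t_out                     # infection occurs while in hospital
    # Assume (purely for illustration) that infection prolongs the remaining stay.
    stay = np.where(infected, t_inf + rng.exponential(14, n), t_out)
    discharge = admit + stay

    # Cohort infection proportion vs. the proportion among patients present on one day.
    prev_day = 200.0
    in_hospital = (admit <= prev_day) & (discharge > prev_day)
    print("cohort infection proportion:   %.3f" % infected.mean())
    print("prevalence-day proportion:     %.3f" % infected[in_hospital].mean())
    print("mean stay, cohort vs. sampled: %.1f vs %.1f days"
          % (stay.mean(), stay[in_hospital].mean()))
    ```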

  • Kernel Nonparametric Overlap-based Syncytial Clustering

    Date: 2018-04-20

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Standard clustering algorithms can find regular-structured clusters such as ellipsoidally- or spherically-dispersed groups, but are more challenged with groups lacking formal structure or definition. Syncytial clustering is the name that we introduce for methods that merge groups obtained from standard clustering algorithms in order to reveal complex group structure in the data. Here, we develop a distribution-free fully-automated syncytial algorithm that can be used with the computationally efficient k-means or other algorithms. Our approach computes the cumulative distribution function of the normed residuals from an appropriately fit k-groups model and calculates the nonparametric overlap between all pairs of groups. Groups with high pairwise overlap are merged as long as the generalized overlap decreases. Our methodology is always a top performer in identifying groups with regular and irregular structures in many datasets. We use our method to identify the distinct kinds of activation in a functional Magnetic Resonance Imaging study.
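
    A skeletal sketch of the general syncytial idea only: run k-means, score every pair of clusters by an overlap measure, and merge highly overlapping pairs. The overlap used below is a Monte Carlo estimate of the Gaussian pairwise misclassification probability and the stopping rule is a simple threshold; neither is the kernel nonparametric overlap nor the generalized-overlap stopping rule of the talk, and all tuning values are placeholders.

    ```python
    # Merge k-means clusters by a (stand-in) pairwise overlap score.
    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.cluster import KMeans

    def pairwise_overlap(mu_a, cov_a, mu_b, cov_b, n_draws=4000, seed=0):
        """Monte Carlo estimate of the two-way Gaussian misclassification overlap."""
        rng = np.random.default_rng(seed)
        fa = multivariate_normal(mu_a, cov_a, allow_singular=True)
        fb = multivariate_normal(mu_b, cov_b, allow_singular=True)
        xa = rng.multivariate_normal(mu_a, cov_a, n_draws)
        xb = rng.multivariate_normal(mu_b, cov_b, n_draws)
        return np.mean(fb.pdf(xa) > fa.pdf(xa)) + np.mean(fa.pdf(xb) > fb.pdf(xb))

    def merge_kmeans_clusters(X, k=10, threshold=0.05, seed=0):
        """k-means, then repeatedly merge the most-overlapping pair of clusters."""
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        groups = {g: np.where(labels == g)[0] for g in np.unique(labels)}
        while len(groups) > 1:
            # Gaussian summaries per current group (assumes >1 point per group).
            stats = {g: (X[idx].mean(axis=0),
                         np.cov(X[idx], rowvar=False) + 1e-8 * np.eye(X.shape[1]))
                     for g, idx in groups.items()}
            keys = list(groups)
            pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
            overlaps = {p: pairwise_overlap(*stats[p[0]], *stats[p[1]], seed=seed)
                        for p in pairs}
            (a, b), best = max(overlaps.items(), key=lambda kv: kv[1])
            if best <= threshold:      # stop once no pair overlaps enough
                break
            groups[a] = np.concatenate([groups[a], groups.pop(b)])
        merged = np.empty(len(X), dtype=int)
        for new_id, idx in enumerate(groups.values()):
            merged[idx] = new_id
        return merged
    ```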

  • Empirical likelihood and robust regression in diffusion tensor imaging data analysis

    Date: 2018-04-06

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    With the development of modern technology, functional responses are frequently observed in various scientific fields, including neuroimaging data analysis. Empirical likelihood, as a nonparametric data-driven technique, has become an important statistical inference methodology. In this paper, motivated by diffusion tensor imaging (DTI) data, we propose three generalized empirical likelihood-based methods for the varying coefficient model with functional responses that accommodate within-curve dependence and embed a robust regression idea. To avoid a loss of efficiency in statistical inference, we take the within-curve variance-covariance matrix into consideration in the subjectwise and elementwise empirical likelihood methods. We develop several statistical inference procedures for maximum empirical likelihood estimators (MELEs) and empirical log-likelihood (ELL) ratio functions, and systematically study their asymptotic properties. We first establish the weak convergence of the MELEs and the ELL ratio processes, and derive a nonparametric version of the Wilks theorem for the limiting distributions of the ELLs at any given design point. We propose a global test for linear hypotheses of varying coefficient functions, construct simultaneous confidence bands for each individual effect curve based on MELEs, and construct simultaneous confidence regions for varying coefficient functions based on ELL ratios. A Monte Carlo simulation is conducted to examine the finite-sample performance of the proposed procedures. Finally, we illustrate the estimation and inference procedures by applying the MELEs of the varying coefficient model to diffusion tensor imaging data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study. Joint work with Xingcai Zhou (Nanjing Audit University), Rohana Karunamuni and Adam Kashlak (University of Alberta).
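
    The functional-response machinery is beyond a short example, but the scalar-mean case below (a standard empirical likelihood construction, not the methods of the talk) shows concretely what an empirical log-likelihood ratio and its Wilks-type chi-square calibration look like.

    ```python
    # Empirical log-likelihood ratio for a scalar mean, with chi-square calibration.
    import numpy as np
    from scipy.optimize import brentq
    from scipy.stats import chi2

    def ell_ratio(x, mu0):
        """Return -2 log R(mu0), the empirical log-likelihood ratio for the mean."""
        d = x - mu0
        if d.min() >= 0 or d.max() <= 0:     # mu0 outside the convex hull of the data
            return np.inf
        lo = -1.0 / d.max() + 1e-8           # keep all 1 + lam * d_i > 0
        hi = -1.0 / d.min() - 1e-8
        score = lambda lam: np.sum(d / (1.0 + lam * d))
        lam = brentq(score, lo, hi)          # score is monotone decreasing on (lo, hi)
        return 2.0 * np.sum(np.log1p(lam * d))

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=2.0, size=200)
    stat = ell_ratio(x, mu0=1.0)             # evaluated at the true mean
    print("ELL statistic: %.3f, chi2(1) 95%% cutoff: %.3f" % (stat, chi2.ppf(0.95, 1)))
    ```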

  • Some development on dynamic computer experiments

    Date: 2018-03-23

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Computer experiments refer to the study of real systems using complex simulation models. They have been widely used as efficient, economical alternatives to physical experiments. Computer experiments with time series outputs are called dynamic computer experiments. In this talk, we consider two problems in such experiments: emulation of large-scale dynamic computer experiments and the inverse problem. For the first problem, we propose a computationally efficient modelling approach that sequentially finds a set of local design points based on a new criterion specifically designed for emulating dynamic computer simulators. Gaussian process models based on the singular value decomposition are built with the sequentially chosen local data. To update the models efficiently, an empirical Bayesian approach is introduced. The second problem aims to extract an optimal input of a dynamic computer simulator whose response matches a field observation as closely as possible. A sequential design approach is employed and a novel expected improvement criterion is proposed. A real application is discussed to demonstrate the efficiency of the proposed approaches.
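
    For orientation, here is a generic (global, non-sequential) sketch of SVD-based Gaussian process emulation of time-series simulator output; the local-design criterion and the empirical Bayes updating from the talk are not reproduced, and the toy simulator is made up.

    ```python
    # SVD-based Gaussian process emulation of a toy dynamic simulator.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    def simulator(x, t):
        """Toy 'dynamic' simulator: scalar input x, output observed on a time grid t."""
        return np.sin(2 * np.pi * t * (1 + x)) * np.exp(-x * t)

    rng = np.random.default_rng(1)
    t_grid = np.linspace(0, 1, 100)
    X_train = rng.uniform(0, 1, size=(30, 1))                  # design points
    Y = np.array([simulator(x[0], t_grid) for x in X_train])   # 30 x 100 output matrix

    # Low-rank representation of the response curves via the SVD.
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), 0.99) + 1
    scores = U[:, :k] * s[:k]                                  # coefficients per run

    # One GP per retained SVD coefficient, as a function of the simulator inputs.
    kernel = ConstantKernel(1.0) * RBF(length_scale=0.2)
    gps = [GaussianProcessRegressor(kernel=kernel, normalize_y=True)
           .fit(X_train, scores[:, j]) for j in range(k)]

    x_new = np.array([[0.37]])
    pred_scores = np.array([gp.predict(x_new)[0] for gp in gps])
    y_pred = Y.mean(axis=0) + pred_scores @ Vt[:k]             # reconstruct the curve
    print("max abs emulation error:", np.max(np.abs(y_pred - simulator(0.37, t_grid))))
    ```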

  • Statistical Genomics for Understanding Complex Traits

    Date: 2018-03-16

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Over the last decade, advances in measurement technologies have enabled researchers to generate multiple types of high-dimensional “omics” datasets for large cohorts. These data provide an opportunity to derive a mechanistic understanding of human complex traits. However, inferring meaningful biological relationships from these data is challenging due to high dimensionality, noise, and an abundance of confounding factors. In this talk, I’ll describe statistical approaches for robust analysis of genomic data from large population studies, with a focus on 1) understanding the nature of confounding factors and approaches for addressing them and 2) understanding the genomic correlates of aging and dementia.

  • Sparse Penalized Quantile Regression: Method, Theory, and Algorithm

    Date: 2018-02-23

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Sparse penalized quantile regression is a useful tool for variable selection, robust estimation, and heteroscedasticity detection in high-dimensional data analysis. We discuss the variable selection and estimation properties of the lasso and folded concave penalized quantile regression via non-asymptotic arguments. We also consider consistent parameter tuning therein. The computational issue of sparse penalized quantile regression has not yet been fully resolved in the literature, due to the non-smoothness of the quantile regression loss function. We introduce fast alternating direction method of multipliers (ADMM) algorithms for computing sparse penalized quantile regression. Numerical examples demonstrate the competitive performance of our algorithms: they significantly outperform several other fast solvers for high-dimensional penalized quantile regression.
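
    The ADMM algorithms from the talk are not reproduced here; as a hedged point of comparison, the sketch below solves the same lasso-penalized quantile regression objective by brute force as a linear program with SciPy's HiGHS solver (the simulation settings are invented).

    ```python
    # Lasso-penalized quantile regression as a linear program (baseline, not ADMM).
    import numpy as np
    from scipy.optimize import linprog

    def sparse_quantile_lp(X, y, tau=0.5, lam=1.0):
        """Minimize sum_i rho_tau(y_i - x_i'b) + lam * ||b||_1 via an LP."""
        n, p = X.shape
        # Variables: [b (free), w >= |b|, u = positive residuals, v = negative residuals]
        c = np.concatenate([np.zeros(p), lam * np.ones(p),
                            tau * np.ones(n), (1 - tau) * np.ones(n)])
        A_eq = np.hstack([X, np.zeros((n, p)), np.eye(n), -np.eye(n)])    # Xb + u - v = y
        Ip = np.eye(p)
        A_ub = np.vstack([np.hstack([ Ip, -Ip, np.zeros((p, 2 * n))]),    #  b - w <= 0
                          np.hstack([-Ip, -Ip, np.zeros((p, 2 * n))])])   # -b - w <= 0
        bounds = [(None, None)] * p + [(0, None)] * (p + 2 * n)
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * p), A_eq=A_eq, b_eq=y,
                      bounds=bounds, method="highs")
        return res.x[:p]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    beta = np.zeros(20); beta[:3] = [2.0, -1.5, 1.0]                      # sparse truth
    y = X @ beta + rng.standard_t(df=3, size=100)                         # heavy-tailed noise
    print(np.round(sparse_quantile_lp(X, y, tau=0.5, lam=2.0), 2))
    ```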

  • The Law of Large Populations: The return of the long-ignored N and how it can affect our 2020 vision

    Date: 2018-02-16

    Time: 15:30-16:30

    Location: McGill University, OTTO MAASS 217

    Abstract:

    For over a century now, we statisticians have successfully convinced ourselves, and almost everyone else, that in statistical inference the size of the population N can be ignored, especially when it is large. Instead, we focused on the size of the sample, n, the key driving force for both the Law of Large Numbers and the Central Limit Theorem. We were thus taught that the statistical error (standard error) goes down with n, typically at the rate of 1/√n. However, all of this relies on the presumption that our data have perfect quality, in the sense of being equivalent to a probabilistic sample. A largely overlooked statistical identity, a potential counterpart to the Euler identity in mathematics, reveals a Law of Large Populations (LLP), a law that we should all be afraid of. That is, once we lose control over data quality, the systematic error (bias) in the usual estimators, relative to the benchmarking standard error from simple random sampling, goes up with N at the rate of √N. The coefficient in front of √N can be viewed as a data defect index, which is the simple Pearson correlation between the reporting/recording indicator and the value reported/recorded. Because of the multiplier √N, a seemingly tiny correlation, say 0.005, can have a detrimental effect on the quality of inference. Without an understanding of this LLP, “big data” can do more harm than good because of the drastically inflated precision assessment and hence gross overconfidence, setting us up to be caught by surprise when the reality unfolds, as we all experienced during the 2016 US presidential election. Data from the Cooperative Congressional Election Study (CCES, conducted by Stephen Ansolabehere, Douglas Rivers and others, and analyzed by Shiro Kuriwaki) are used to estimate the data defect index for the 2016 US election, with the aim of gaining a clearer vision for the 2020 election and beyond.
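
    A hedged reconstruction of the identity alluded to above (it appears to be the decomposition in Meng, 2018, Annals of Applied Statistics), with R the reporting/recording indicator and Y the quantity of interest:

    ```latex
    \underbrace{\bar{Y}_n - \bar{Y}_N}_{\text{estimation error}}
      \;=\; \underbrace{\hat{\rho}_{R,Y}}_{\text{data quality}}
      \;\times\; \underbrace{\sqrt{\frac{N-n}{n}}}_{\text{data quantity}}
      \;\times\; \underbrace{\sigma_Y}_{\text{problem difficulty}} .
    ```

    Dividing by the benchmark standard error under simple random sampling, roughly σ_Y √((N−n)/(nN)), leaves ρ̂_{R,Y} √N: the √N blow-up described in the abstract, with the data defect index as its coefficient.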

  • Methodological considerations for the analysis of relative treatment effects in multi-drug-resistant tuberculosis from fused observational studies

    Date: 2018-02-09

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Multi-drug-resistant tuberculosis (MDR-TB) is defined as strains of tuberculosis that do not respond to at least the two most commonly used anti-TB drugs. After diagnosis, the intensive treatment phase for MDR-TB involves taking several alternative antibiotics concurrently. The Collaborative Group for Meta-analysis of Individual Patient Data in MDR-TB has assembled a large, fused dataset of over 30 observational studies comparing the effectiveness of 15 antibiotics. The particular challenges that we have considered in the analysis of this dataset are the large number of potential drug regimens, the resistance of MDR-TB strains to specific antibiotics, and the identifiability of a generalized parameter of interest given that most drugs were not observed in all studies. In this talk, I describe causal inference theory and methodology that we have appropriated or developed for the estimation of treatment importance and relative effectiveness of different antibiotic regimens, with a particular emphasis on targeted learning approaches.

  • A new approach to model financial data: The Factorial Hidden Markov Volatility Model

    Date: 2018-02-02

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    A new process, the factorial hidden Markov volatility (FHMV) model, is proposed to model financial returns or realized variances. This process is constructed based on a factorial hidden Markov model structure and corresponds to a parsimoniously parametrized hidden Markov model that includes thousands of volatility states. The transition probability matrix of the underlying Markov chain is structured so that the multiplicity of its second largest eigenvalue can be greater than one. This distinctive feature allows for a better representation of volatility persistence in financial data. Jumps and a leverage effect are also incorporated into the model and statistical properties are discussed. An empirical study on six financial time series shows that the FHMV process compares favorably to state-of-the-art volatility models in terms of in-sample fit and out-of-sample forecasting performance over time horizons ranging from one to one hundred days.
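
    A toy sketch (not the FHMV specification from the talk; all parameter values are invented) of the factorial idea itself: a few small independent Markov chains, multiplied together, generate up to 2^K volatility states from only a handful of parameters.

    ```python
    # Toy factorial-chain volatility process: K binary chains multiply into many states.
    import numpy as np

    rng = np.random.default_rng(42)
    T, K = 1000, 4                       # sample length, number of binary components
    persist = 0.98                       # P(stay in the same state) for every chain
    highs = np.array([3.0, 2.2, 1.6, 1.3])   # multiplicative 'high-volatility' factors
    sigma2 = 1.0                         # baseline variance level

    states = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        switch = rng.random(K) > persist
        states[t] = np.where(switch, 1 - states[t - 1], states[t - 1])

    # Component k contributes factor highs[k] when its chain is in state 1, else 1.
    factors = np.where(states == 1, highs, 1.0)
    vol2 = sigma2 * factors.prod(axis=1)          # up to 2**K = 16 volatility levels
    returns = np.sqrt(vol2) * rng.standard_normal(T)
    print("distinct volatility levels visited:", len(np.unique(np.round(vol2, 8))))
    ```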

  • Back to the future: why I think REGRESSION is the new black in genetic association studies

    Date: 2018-01-26

    Time: 15:30-16:30

    Location: ROOM 6254 Pavillon Andre-Aisenstadt 2920, UdeM

    Abstract:

    Linear regression remains an important framework in the era of big and complex data. In this talk I present some recent examples where we resort to the classical simple linear regression model and its celebrated extensions in novel settings. The Eureka moment came while reading Wu and Guan’s (2015) comments on our generalized Kruskal-Wallis (GKW) test (Elif Acar and Sun 2013, Biometrics). Wu and Guan presented an alternative “rank linear regression model and derived the proposed GKW statistic as a score test statistic”, and astutely pointed out that “the linear model approach makes the derivation more straightforward and transparent, and leads to a simplified and unified approach to the general rank based multi-group comparison problem.” More recently, we turned our attention to extending Levene’s variance test to data with group uncertainty and sample correlation. While a direct modification of the original statistic is indeed challenging, I will demonstrate that a two-stage regression framework makes the ensuing development quite straightforward, eventually leading to a generalized joint location-scale test (David Soave and Sun 2017, Biometrics). Finally, I will discuss ongoing work, with graduate student Lin Zhang, on developing an allele-based association test that is robust to the assumption of Hardy-Weinberg equilibrium and is generalizable to complex data structures. The crux of this work is, again, reformulating the problem as a regression!
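
    The generalized joint location-scale test itself is not reproduced here, but the sketch below (with invented simulated data) illustrates the classical two-stage regression view of Levene's test that it builds on: model the location first, then test the group effect on the absolute residuals. With group means in stage 1, this reproduces scipy's levene(center='mean'), so the two printed p-values coincide.

    ```python
    # Two-stage regression view of Levene's variance test.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    group = np.repeat([0, 1, 2], 100)
    # Simulated phenotype: equal means but a group-specific variance for group 2.
    y = rng.normal(loc=0.0, scale=np.where(group == 2, 2.0, 1.0))

    # Stage 1: fit the location (here simply group-wise means) and take |residuals|.
    group_means = np.array([y[group == g].mean() for g in np.unique(group)])
    abs_resid = np.abs(y - group_means[group])

    # Stage 2: one-way ANOVA of the absolute residuals on group.
    _, p_two_stage = stats.f_oneway(*(abs_resid[group == g] for g in np.unique(group)))
    _, p_levene = stats.levene(*(y[group == g] for g in np.unique(group)), center='mean')
    print("two-stage p-value: %.4g   scipy levene p-value: %.4g" % (p_two_stage, p_levene))
    ```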