McGill Statistics Seminar - McGill Statistics Seminars
  • Model-based methods of classification with applications

    Date: 2014-11-28

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Model-based clustering via finite mixture models is a popular clustering method for finding hidden structures in data. The model is often assumed to be a finite mixture of multivariate normal distributions; however, flexible extensions have been developed over recent years. This talk demonstrates some methods employed in unsupervised, semi-supervised, and supervised classification that include skew-normal and skew-t mixture models. Both real and simulated data sets are used to demonstrate the efficacy of these techniques.
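    As a rough illustration of the normal-mixture starting point mentioned in the abstract, the sketch below fits a two-component univariate normal mixture by the EM algorithm and assigns each observation to its most probable component. It is a deliberate simplification (univariate, two components, no skewness), not the speaker's method; the function names, starting values, and toy data are assumptions made for this example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_normals(data, mu, sigma, weight, iterations=50):
    """Fit a two-component univariate normal mixture by EM.

    mu and sigma hold starting values for the two components;
    weight is the starting mixing proportion of component 0.
    """
    mu, sigma = list(mu), list(sigma)
    n = len(data)
    for _ in range(iterations):
        # E-step: posterior probability that each point came from component 0.
        resp = []
        for x in data:
            a = weight * normal_pdf(x, mu[0], sigma[0])
            b = (1.0 - weight) * normal_pdf(x, mu[1], sigma[1])
            resp.append(a / (a + b))
        # M-step: weighted means, standard deviations, and mixing proportion.
        n0 = sum(resp)
        n1 = n - n0
        mu[0] = sum(r * x for r, x in zip(resp, data)) / n0
        mu[1] = sum((1.0 - r) * x for r, x in zip(resp, data)) / n1
        sigma[0] = math.sqrt(
            sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, data)) / n0)
        sigma[1] = math.sqrt(
            sum((1.0 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, data)) / n1)
        weight = n0 / n
    # Cluster label = component with the larger posterior probability.
    labels = [0 if r >= 0.5 else 1 for r in resp]
    return mu, sigma, weight, labels

# Toy data (made up for illustration): two well-separated groups.
data = [-2.1, -1.9, -2.3, -1.7, -2.0, 2.2, 1.8, 2.0, 2.4, 1.6]
mu, sigma, weight, labels = em_two_normals(
    data, mu=[-1.0, 1.0], sigma=[1.0, 1.0], weight=0.5)
print(labels, mu, weight)
```

    In the semi-supervised and supervised settings, some or all of the posterior probabilities in the E-step would be replaced by known labels; the skew-normal and skew-t extensions replace the component density.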

  • Estimating by solving nonconvex programs: Statistical and computational guarantees

    Date: 2014-11-21

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Many statistical estimators are based on solving nonconvex programs. Although the practical performance of such methods is often excellent, the associated theory is frequently incomplete, due to the potential gaps between global and local optima. In this talk, we present theoretical results that apply to all local optima of various regularized M-estimators, where both loss and penalty functions are allowed to be nonconvex. Our theory covers a broad class of nonconvex objective functions, including corrected versions of the Lasso for errors-in-variables linear models; regression in generalized linear models using nonconvex regularizers such as SCAD and MCP; and graph and inverse covariance matrix estimation. Under suitable regularity conditions, our theory guarantees that any local optimum of the composite objective function lies within statistical precision of the true parameter vector. This result closes the gap between theory and practice for these methods.

  • Bridging the gap: A likelihood function approach for the analysis of ranking data

    Date: 2014-11-14

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In the parametric setting, the notion of a likelihood function forms the basis for the development of tests of hypotheses and estimation of parameters. Tests in connection with the analysis of variance stem entirely from considerations of the likelihood function. On the other hand, nonparametric procedures have generally been derived without any formal mechanism and are often the result of clever intuition. In this talk, we propose a more formal approach for deriving tests involving the use of ranks. Specifically, we define a likelihood function motivated by characteristics of the ranks of the data and demonstrate that this leads to well-known tests of hypotheses. We also point to various areas of further exploration.

  • Bayesian regression with B-splines under combinations of shape constraints and smoothness properties

    Date: 2014-11-07

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    We approach the problem of shape constrained regression from a Bayesian perspective. A B-spline basis is used to model the regression function. The smoothness of the regression function is controlled by the order of the B-splines and the shape is controlled by the shape of an associated control polygon. Controlling the shape of the control polygon reduces to some inequality constraints on the spline coefficients. Our approach enables us to take into account combinations of shape constraints and to localize each shape constraint on a given interval. The performance of our method is investigated through a simulation study. Applications to real data sets from the food industry and global warming are provided.
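    To make the link between coefficient constraints and shape concrete, the sketch below evaluates a clamped cubic B-spline with the Cox-de Boor recursion and checks numerically that nondecreasing coefficients give a nondecreasing curve. It covers only this one shape constraint and involves no Bayesian fitting; the knots, coefficients, and function names are assumptions chosen for illustration.

```python
def bspline_basis(i, k, knots, x):
    """Cox-de Boor recursion: i-th B-spline of order k (degree k - 1) at x."""
    if k == 1:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k - 1] > knots[i]:
        left = ((x - knots[i]) / (knots[i + k - 1] - knots[i])
                * bspline_basis(i, k - 1, knots, x))
    right = 0.0
    if knots[i + k] > knots[i + 1]:
        right = ((knots[i + k] - x) / (knots[i + k] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, knots, x))
    return left + right

def spline_value(coefs, k, knots, x):
    """Regression function: linear combination of the B-spline basis."""
    return sum(c * bspline_basis(i, k, knots, x) for i, c in enumerate(coefs))

order = 4  # cubic B-splines
knots = [0.0] * 4 + [0.25, 0.5, 0.75] + [1.0] * 4  # clamped knots on [0, 1]

# Inequality constraint on the coefficients: each one is >= the previous one.
coefs = [0.0, 0.1, 0.1, 0.5, 0.6, 0.9, 1.0]

grid = [j / 100 for j in range(100)]  # evaluation points in [0, 1)
values = [spline_value(coefs, order, knots, x) for x in grid]

# The resulting curve should be nondecreasing across the grid.
is_monotone = all(b >= a - 1e-12 for a, b in zip(values, values[1:]))
print(is_monotone)
```

    Other shapes correspond to other linear inequalities on the coefficients (for example, constraints on second differences for convexity with suitably spaced knots), and restricting the inequalities to a subset of coefficients localizes the constraint to part of the domain.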

  • A copula-based model for risk aggregation

    Date: 2014-10-31

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    A flexible approach is proposed for risk aggregation. The model consists of a tree structure, bivariate copulas, and marginal distributions. The construction relies on a conditional independence assumption whose implications are studied. Selection of the tree structure, estimation, and model validation are illustrated using data from a Canadian property and casualty insurance company.

    Speaker

    Marie-Pier Côté is a PhD student in the Department of Mathematics and Statistics at McGill University.

  • PREMIER: Probabilistic error-correction using Markov inference in error reads

    Date: 2014-10-24

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Next-generation sequencing (NGS) is a technology revolutionizing genetics and biology. Compared with the old Sanger sequencing method, the throughput is astounding and has fostered a slew of innovative sequencing applications. Unfortunately, the error rates are also higher, complicating many downstream analyses. For example, de novo assembly of genomes is less accurate and slower when reads include many errors. We develop a probabilistic model for NGS reads that can detect and correct errors without a reference genome while flexibly modeling and estimating the error properties of the sequencing machine. It uses a penalized likelihood to enforce our prior belief that the k-mer spectrum (the collection of k-length strings observed in the reads) generated from a genome is sparse when k is sufficiently large. The model formalizes core ideas that are used in many ad hoc algorithmic approaches to error correction. We show our method can detect and remove more errors from sequencing reads than existing methods. Though our method carries a higher computational burden than the best algorithmic approaches, the probabilistic approach is extensible, flexible, and well-positioned to support downstream statistical analysis of the increasing volume of sequence data.
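    The k-mer spectrum referred to in the abstract can be computed by simple counting. The sketch below is only an illustration of that definition and of what sparsity means here; it is not the PREMIER model, and the reads and function name are made up for the example.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count every k-length substring (k-mer) observed across the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy reads (made up): the third read differs from the first by one base.
reads = ["ACGTACGT", "CGTACGTA", "ACGTACCT"]
k = 4
spectrum = kmer_spectrum(reads, k)

# Sparsity: fraction of the 4**k possible k-mers that actually occur.
occupancy = len(spectrum) / 4 ** k

# k-mers seen only once are the usual suspects in counting-based heuristics.
rare = sorted(kmer for kmer, count in spectrum.items() if count == 1)
print(len(spectrum), occupancy, rare)
```

    A single substitution error creates up to k new, low-count k-mers, which is why a sparse spectrum for large k is a useful prior belief when separating true sequence from sequencing errors.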

  • Patient privacy, big data, and specimen pooling: Using an old tool for new challenges

    Date: 2014-10-17

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In the recent past, electronic health records and distributed data networks have emerged as a viable resource for medical and scientific research. As the use of confidential patient information from such sources becomes more common, maintaining the privacy of patients is of utmost importance. For a binary disease outcome of interest, we show that the techniques of specimen pooling can be applied to the analysis of large and/or distributed data while respecting patient privacy. I will review the pooled analysis for a binary outcome and then show how it can be used for distributed data. Aggregate-level data are passed from the nodes of the network to the analysis center and can be used very easily with logistic regression to estimate the disease odds ratio associated with a set of categorical or continuous covariates. The pooling approach allows for consistent estimation of the parameters of a logistic regression that can include confounders. Additionally, since the individual covariate values can be accessed within a network, effect modifiers can be accommodated and consistently estimated. Since pooling effectively reduces the size of the dataset by creating pools, or sets of individuals, the resulting dataset can be analyzed much more quickly than an original dataset that is too big for the computing environment.

  • A margin-free clustering algorithm appropriate for dependent maxima in the domain of attraction of an extreme-value copula

    Date: 2014-10-10

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Extracting relevant information in complex spatial-temporal data sets is of paramount importance in statistical climatology. This is especially true when identifying spatial dependencies between quantitative extremes like heavy rainfall. The paper of Bernard et al. (2013) develops a fast and simple clustering algorithm for finding spatial patterns appropriate for extremes. They develop their algorithm by adapting multivariate extreme-value theory to the context of spatial clustering. This is done by relating the variogram, a well-known distance used in geostatistics, to the extremal coefficient of a pair of joint maxima. This gives rise to a straightforward nonparametric estimator of this distance using the empirical distribution function. Their clustering approach is used to analyze weekly maxima of hourly precipitation recorded in France, and a spatial pattern consistent with existing weather models arises. This applied talk is devoted to the validation and extension of this clustering approach. A simulation study using the multivariate logistic distribution as well as max-stable random fields shows that this approach provides accurate clustering when the maxima belong to an extreme-value distribution. Furthermore, this clustering distance can be viewed as an average absolute rank difference, implying that it is appropriate for margin-free clustering of dependent variables. In particular, it is appropriate for dependent maxima in the domain of attraction of an extreme-value copula.
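    The "average absolute rank difference" reading of the distance can be sketched in a few lines. The code below assumes the distance is the F-madogram, d = (1/2) E|F(X) - G(Y)|, estimated by replacing F and G with empirical distribution functions (rank divided by n), and assumes the usual relation theta = (1 + 2d) / (1 - 2d) between d and the pairwise extremal coefficient for max-stable pairs. The function names and data are illustrative; this is not the authors' code.

```python
import math

def ranks(values):
    """Return 1-based ranks; assumes no ties (continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def rank_distance(x, y):
    """Estimate d = (1/2) E|F(X) - G(Y)| using empirical CDFs (rank / n)."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    return sum(abs(a - b) for a, b in zip(rx, ry)) / (2 * n * n)

def extremal_coefficient(d):
    """Assumed relation for max-stable pairs: 1 = full dependence, 2 = independence."""
    return (1 + 2 * d) / (1 - 2 * d)

# Toy maxima at two sites (made up for illustration).
x = [0.3, 1.2, 0.7, 2.5, 1.9]
y = [1.1, 0.4, 0.9, 3.0, 2.2]
d = rank_distance(x, y)

# Margin-free: increasing transformations of either margin leave d unchanged.
d_transformed = rank_distance([math.exp(v) for v in x], [v ** 3 for v in y])
print(d, d_transformed, extremal_coefficient(d))
```

    Because only ranks enter the estimate, the pairwise distances can be passed to any distance-based clustering routine without first modeling the marginal distributions.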

  • Statistical exploratory data analysis in the modern era

    Date: 2014-10-03

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Major challenges arising from today’s “data deluge” include how to handle the commonly occurring situation of different types of variables (say, continuous and categorical) being simultaneously measured, as well as how to assess the accompanying flood of questions. Based on information theory, a bias-corrected mutual information (BCMI) measure of association that is valid and estimable between all basic types of variables has been proposed. It has the advantage of being able to identify non-linear as well as linear relationships. Based on the BCMI measure, a novel exploratory approach to finding associations in data sets having a large number of variables of different types has been developed. These associations can be used as a basis for downstream analyses such as finding clusters and networks. The application of this exploratory approach is very general. Comparisons will also be made with other measures. Illustrative examples include exploring relationships (i) in clinical and genomic (say, gene expression and genotypic) data, and (ii) between social, economic, health and political indicators from the World Health Organisation.

  • Analysis of palliative care studies with joint models for quality-of-life measures and survival

    Date: 2014-09-26

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In palliative care studies, the primary outcomes are often health-related quality-of-life (HRQL) measures. Randomized trials and prospective cohorts typically recruit patients with an advanced stage of disease and follow them until death or the end of the study. An important feature of such studies is that, by design, some patients, but not all, are likely to die during the course of the study. This affects the interpretation of the conventional analysis of palliative care trials and suggests the need for specialized methods of analysis. We have developed a “terminal decline model” for palliative care trials that, by jointly modeling the time until death and the HRQL measures, leads to flexible interpretation and efficient analysis of the trial data (Li, Tosteson, Bakitas, STMED 2012).