McGill Statistics Seminar - McGill Statistics Seminars
  • Topics in statistical inference for the semiparametric elliptical copula model

    Date: 2015-09-25

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    This talk addresses aspects of the statistical inference problem for the semiparametric elliptical copula model, the family of distributions whose dependence structures are specified by parametric elliptical copulas but whose marginal distributions are left unspecified. An elliptical copula is uniquely characterized by a characteristic generator and a copula correlation matrix Sigma. In the first part of this talk, I will consider the estimation of Sigma. A natural estimate for Sigma is the plug-in estimator Sigmahat based on Kendall’s tau statistic. I will first exhibit a sharp bound on the operator norm of Sigmahat - Sigma. I will then consider a factor model of Sigma, for which I will propose a refined estimator Sigmatilde obtained by fitting a low-rank matrix plus a diagonal matrix to Sigmahat using least squares with a nuclear norm penalty on the low-rank matrix. The bound on the operator norm of Sigmahat - Sigma serves to scale the penalty term, and I will present the finite-sample oracle inequalities that we obtain for Sigmatilde. In the second part of this talk, we will look at the classification of two distributions that have the same Gaussian copula but that are otherwise arbitrary in high dimensions. Under this semiparametric Gaussian copula setting, I will give an accurate semiparametric estimator of the log-density ratio, which leads to an empirical decision rule and a bound on its associated excess risk. Our estimation procedure takes advantage of the potential sparsity as well as the low noise condition in the problem, which allows us to achieve a faster convergence rate of the excess risk than is possible in the existing literature on semiparametric Gaussian copula classification. I will demonstrate the efficiency of our semiparametric empirical decision rule by showing that the bound on the excess risk nearly achieves a convergence rate of 1 over square-root-n in the simple setting of Gaussian distribution classification.
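    As an illustration of the plug-in step described in this abstract, the sketch below computes pairwise Kendall's tau and applies the sine transform sin(pi/2 * tau) that relates Kendall's tau to the correlation parameter of an elliptical copula. It is not the speaker's code: the function names, the toy two-dimensional Gaussian copula data, and the choice of marginal transforms are assumptions made for the example, and the refined estimator Sigmatilde is not covered.

```python
import math
import random

def kendall_tau(x, y):
    """Kendall's tau by direct pair counting (O(n^2); assumes no ties)."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return 2.0 * score / (n * (n - 1))

def plugin_correlation(columns):
    """Entrywise transform sin(pi/2 * tau_jk) of the Kendall's tau matrix."""
    d = len(columns)
    sigma_hat = [[1.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(j + 1, d):
            tau = kendall_tau(columns[j], columns[k])
            sigma_hat[j][k] = sigma_hat[k][j] = math.sin(0.5 * math.pi * tau)
    return sigma_hat

# Toy data (hypothetical): Gaussian copula with correlation rho,
# observed through non-Gaussian margins.
random.seed(0)
rho, n = 0.6, 400
z1 = [random.gauss(0.0, 1.0) for _ in range(n)]
z2 = [rho * a + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0) for a in z1]
x1 = [math.exp(a) for a in z1]  # increasing transform of the first margin
x2 = [a ** 3 for a in z2]       # increasing transform of the second margin

sigma_hat = plugin_correlation([x1, x2])
print(sigma_hat[0][1])
```

    Because Kendall's tau depends only on ranks, the estimate is unaffected by the increasing marginal transforms, which is why the margins can be left unspecified.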

  • A unified algorithm for fitting penalized models with high-dimensional data

    Date: 2015-09-18

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In light of high-dimensional problems, research on penalized models has received much interest. Correspondingly, several algorithms have been developed for solving penalized high-dimensional models. I will describe fast and efficient unified algorithms for computing the solution path for a collection of penalized models. In particular, we will look at an algorithm for solving L1-penalized learning problems and an algorithm for solving group-lasso learning problems. These algorithms take advantage of a majorization-minimization trick to make each update simple and efficient. The algorithms also enjoy a proven convergence property. To demonstrate the generality of these algorithms, I extend them to a class of elastic net penalized large margin classification methods and to elastic net penalized Cox proportional hazards models. These algorithms have been implemented in three R packages, gglasso, gcdnet and fastcox, which are publicly available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/web/packages. On simulated and real data, our algorithms consistently outperform the existing software in speed for computing penalized models and often deliver better quality solutions.
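    To make the majorization-minimization idea concrete, here is a minimal sketch for L1-penalized logistic regression (labels in {-1, +1}, no intercept). It is not the gcdnet implementation. It assumes the standard curvature bound of 1/4 on the logistic loss, so that each coordinate update minimizes a quadratic upper bound in closed form by soft-thresholding. The function names, the simulated data, and the three-point lambda grid are all hypothetical.

```python
import math
import random

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def neg_sigmoid(m):
    """1 / (1 + exp(m)), written to avoid overflow for large |m|."""
    if m >= 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))

def objective(X, y, beta, lam):
    """Mean logistic loss plus lam * ||beta||_1."""
    loss = 0.0
    for xi, yi in zip(X, y):
        m = yi * sum(a * b for a, b in zip(xi, beta))
        loss += max(-m, 0.0) + math.log1p(math.exp(-abs(m)))
    return loss / len(y) + lam * sum(abs(b) for b in beta)

def mm_sweeps(X, y, lam, beta, n_sweeps=50):
    """Cyclic coordinate updates, each minimizing a quadratic majorizer."""
    n, p = len(X), len(X[0])
    beta = list(beta)
    margins = [sum(a * b for a, b in zip(xi, beta)) for xi in X]
    # Curvature bounds: the logistic loss has second derivative at most 1/4.
    M = [sum(xi[j] ** 2 for xi in X) / (4.0 * n) for j in range(p)]
    history = [objective(X, y, beta, lam)]
    for _ in range(n_sweeps):
        for j in range(p):
            if M[j] == 0.0:
                continue
            grad = -sum(yi * xi[j] * neg_sigmoid(yi * mi)
                        for xi, yi, mi in zip(X, y, margins)) / n
            new = soft_threshold(M[j] * beta[j] - grad, lam) / M[j]
            delta = new - beta[j]
            if delta != 0.0:
                margins = [mi + delta * xi[j] for mi, xi in zip(margins, X)]
                beta[j] = new
        history.append(objective(X, y, beta, lam))
    return beta, history

# Simulated data (hypothetical): two active predictors out of five.
random.seed(1)
n, p = 200, 5
true_beta = [2.0, -1.5, 0.0, 0.0, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = []
for xi in X:
    eta = sum(a * b for a, b in zip(xi, true_beta))
    y.append(1 if random.random() < neg_sigmoid(-eta) else -1)

# Solution path with warm starts, from the smallest lambda giving beta = 0.
lam_max = max(abs(sum(yi * xi[j] for xi, yi in zip(X, y)))
              for j in range(p)) / (2.0 * n)
path = []
beta = [0.0] * p
for lam in [lam_max, 0.5 * lam_max, 0.1 * lam_max]:
    beta, history = mm_sweeps(X, y, lam, beta)
    path.append((lam, beta, history))
```

    Since every update minimizes an upper bound that touches the objective at the current iterate, the recorded objective values cannot increase from one sweep to the next, which is the descent property such algorithms rely on.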

  • Bias correction in multivariate extremes

    Date: 2015-09-11

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    The estimation of the extremal dependence structure of a multivariate extreme-value distribution is hampered by bias, which increases with the number of observations used for the estimation. Already known in the univariate setting, the bias correction procedure is studied in this talk under the multivariate framework. New families of estimators of the stable tail dependence function are obtained. They are asymptotically unbiased versions of the empirical estimator introduced by Huang (1992). Given that the new estimators have a regular behavior with respect to the number of observations, it is possible to deduce aggregated versions so that the choice of threshold is substantially simplified. An extensive simulation study is provided as well as an application to real data.
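    For orientation, the sketch below implements only the uncorrected empirical estimator of the stable tail dependence function, in the order-statistic form commonly attributed to Huang (1992): an observation is counted when at least one of its coordinates is among the floor(k * x_j) largest values of that margin, and the count is divided by k. The function name and the toy data are assumptions, rounding conventions vary between references, and none of the bias-corrected estimators from the talk are reproduced here.

```python
import math

def empirical_stdf(data, k, x):
    """Empirical stable tail dependence function at the point x.

    data: list of columns, one list of n observations per margin
    k:    number of upper order statistics used (the threshold choice)
    x:    evaluation point, one nonnegative coordinate per margin
    """
    n = len(data[0])
    thresholds = []
    for col, xj in zip(data, x):
        m = math.floor(k * xj)  # how many top observations count in this margin
        order = sorted(col)
        # The m-th largest value, or +inf when no observation should count.
        thresholds.append(order[n - m] if m >= 1 else math.inf)
    count = 0
    for i in range(n):
        if any(col[i] >= thr for col, thr in zip(data, thresholds)):
            count += 1
    return count / k

# Toy checks on two extreme cases (hypothetical data).
u = [float(i) for i in range(1, 101)]
comonotone = [u, list(u)]               # perfectly dependent margins
countermonotone = [u, [-a for a in u]]  # large values never coincide

l_dep = empirical_stdf(comonotone, k=10, x=(1.0, 1.0))
l_indep_tail = empirical_stdf(countermonotone, k=10, x=(1.0, 1.0))
print(l_dep, l_indep_tail)
```

    The two toy cases hit the bounds max(x1, x2) and x1 + x2 that any stable tail dependence function must satisfy. The dependence of the estimate on k is the threshold-choice issue that the abstract says the aggregated estimators simplify.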

  • Some new classes of bivariate distributions based on conditional specification

    Date: 2015-05-14

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    A bivariate distribution can sometimes be characterized completely by properties of its conditional distributions. In this talk, we will discuss models of bivariate distributions whose conditionals are members of prescribed parametric families of distributions. Some relevant models with specified conditionals will be discussed, including the normal and lognormal cases, the skew-normal and other families of distributions. Finally, some conditionally specified densities will be shown to provide convenient flexible conjugate prior families in certain multiparameter Bayesian settings.

  • Testing for network community structure

    Date: 2015-03-20

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Networks provide a useful means to summarize sparse yet structured massive datasets, and so are an important aspect of the theory of big data. A key question in this setting is to test for the significance of community structure or what in social networks is termed homophily, the tendency of nodes to be connected based on similar characteristics. Network models where a single parameter per node governs the propensity of connection are popular in practice, because they are simple to understand and analyze. They frequently arise as null models to indicate a lack of community structure, since they cannot readily describe the division of a network into groups of nodes whose aggregate links behave in a block-like manner. Here we discuss asymptotic regimes under families of such models, and show their potential for enabling hypothesis tests in this setting. As an important special case, we treat network modularity, which summarizes the difference between observed and expected within-community edges under such null models, and which has seen much success in practical applications of large-scale network analysis. Our focus here is on statistical rather than algorithmic properties, however, in order to yield new insights into the canonical problem of testing for network community structure.
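    As a small worked example of the modularity statistic mentioned in this abstract, the sketch below assumes the usual Newman-Girvan form, in which the expected number of within-community edges comes from a degree-based null model (one parameter per node). The function name and the toy graph are hypothetical, and the code says nothing about the asymptotic tests discussed in the talk.

```python
def modularity(edges, community):
    """Modularity of a partition of a simple undirected graph.

    Q = sum over communities c of [ e_c / m - (d_c / (2 m))^2 ],
    where m is the number of edges, e_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Observed within-community edges.
    within = sum(1 for u, v in edges if community[u] == community[v])
    # Expected within-community edges under the degree-based null model.
    degree_sum = {}
    for node, k in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + k
    expected = sum(d * d for d in degree_sum.values()) / (4.0 * m)
    return (within - expected) / m

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single = {node: "a" for node in range(6)}

q_split = modularity(edges, split)    # 6/7 - 2 * (7/14)^2 = 5/14
q_single = modularity(edges, single)  # one community: observed equals expected
print(q_split, q_single)
```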

  • Bayesian approaches to causal inference: A lack-of-success story

    Date: 2015-03-13

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Although Bayesian inferential methods are almost universally accepted across most fields of statistics, and Bayesian arguments were a core component of early developments in causal inference, they have yet to break through to widespread use in that field. Some quasi-Bayesian procedures have been proposed, but often these approaches rely on heuristic, sometimes flawed, arguments. In this talk I will discuss some formulations of classical causal inference problems from the perspective of standard Bayesian representations, and propose some inferential solutions. This is joint work with Olli Saarela, Dalla Lana School of Public Health, University of Toronto, Erica Moodie, Department of Epidemiology, Biostatistics and Occupational Health, McGill University, and Marina Klein, Division of Infectious Diseases, Faculty of Medicine, McGill University.

  • A novel statistical framework to characterize antigen-specific T-cell functional diversity in single-cell expression data

    Date: 2015-02-27

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    I will talk about COMPASS, a new Bayesian hierarchical framework for characterizing functional differences in antigen-specific T cells by leveraging high-throughput, single-cell flow cytometry data. In particular, I will illustrate, using a variety of data sets, how COMPASS can reveal subtle and complex changes in antigen-specific T-cell activation profiles that correlate with biological endpoints. When applied to data from the RV144 (“the Thai trial”) HIV clinical trial, COMPASS identified novel T-cell subsets that were inverse correlates of HIV infection risk. I also developed intuitive metrics for summarizing multivariate antigen-specific T-cell activation profiles for endpoint analysis. In addition, COMPASS identified correlates of latent infection in an immune study of tuberculosis among South African adolescents. COMPASS is available as an R package and is sufficiently general that it can be adapted to new high-throughput data types, such as Mass Cytometry (CyTOF) and single-cell gene expression, enabling interdisciplinary collaboration, which I will also highlight in my talk.

  • Comparison and assessment of particle diffusion models in biological fluids

    Date: 2015-02-20

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Rapidly progressing particle tracking techniques have revealed that foreign particles in biological fluids exhibit rich and at times unexpected behavior, with important consequences for disease diagnosis and drug delivery. Yet, there remains a frustrating lack of coherence in the description of these particles’ motion. Largely this is due to a reliance on functional statistics (e.g., mean-squared displacement) to perform model selection and assess goodness-of-fit. However, not only are such functional characteristics typically estimated with substantial variability, but also they may fail to distinguish between a number of stochastic processes — each making fundamentally different predictions for relevant quantities of scientific interest. In this talk, I will describe a detailed Bayesian analysis of leading candidate models for subdiffusive particle trajectories in human pulmonary mucus. Efficient and scalable computational strategies will be proposed. Model selection will be achieved by way of intrinsic Bayes factors, which avoid both non-informative priors and “using the data twice”. Goodness-of-fit will be evaluated via second-order criteria along with exact model residuals. Our findings suggest that a simple model of fractional Brownian motion describes the data just as well as a first-principles physical model of visco-elastic subdiffusion.
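    For readers unfamiliar with the functional statistic criticized in this abstract, the sketch below computes a time-averaged mean-squared displacement (MSD) for a one-dimensional trajectory and reads off a scaling exponent from a log-log fit. The function names and the two toy trajectories are assumptions; the code does not implement the Bayesian model comparison from the talk.

```python
import math
import random

def time_averaged_msd(path, max_lag):
    """MSD(lag) = average over t of (path[t + lag] - path[t])^2."""
    n = len(path)
    msd = []
    for lag in range(1, max_lag + 1):
        squares = [(path[t + lag] - path[t]) ** 2 for t in range(n - lag)]
        msd.append(sum(squares) / len(squares))
    return msd

def loglog_slope(msd):
    """Least-squares slope of log MSD against log lag (lags 1, 2, ...)."""
    xs = [math.log(lag) for lag in range(1, len(msd) + 1)]
    ys = [math.log(v) for v in msd]
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    num = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    den = sum((a - xbar) ** 2 for a in xs)
    return num / den

# Two toy trajectories (hypothetical): ordinary diffusion and ballistic motion.
random.seed(3)
walk = [0.0]
for _ in range(5000):
    walk.append(walk[-1] + random.gauss(0.0, 1.0))
ballistic = [float(t) for t in range(200)]

alpha_walk = loglog_slope(time_averaged_msd(walk, 10))            # near 1
alpha_ballistic = loglog_slope(time_averaged_msd(ballistic, 10))  # equal to 2
print(alpha_walk, alpha_ballistic)
```

    A subdiffusive trajectory would give an exponent below 1. Many different stochastic processes share the same exponent, which is the limitation of MSD-based model selection that the abstract points out.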

  • Tuning parameters in high-dimensional statistics

    Date: 2015-02-13

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    High-dimensional statistics is the basis for analyzing large and complex data sets that are generated by cutting-edge technologies in genetics, neuroscience, astronomy, and many other fields. However, Lasso, Ridge Regression, Graphical Lasso, and other standard methods in high-dimensional statistics depend on tuning parameters that are difficult to calibrate in practice. In this talk, I present two novel approaches to overcome this difficulty. My first approach is based on a novel testing scheme that is inspired by Lepski’s idea for bandwidth selection in non-parametric statistics. This approach provides tuning parameter calibration for estimation and prediction with the Lasso and other standard methods and is to date the only way to ensure high performance, fast computations, and optimal finite sample guarantees. My second approach is based on the minimization of an objective function that avoids tuning parameters altogether. This approach provides accurate variable selection in regression settings and, additionally, opens up new possibilities for the estimation of gene regulation networks, microbial ecosystems, and many other network structures.

  • A fast unified algorithm for solving group Lasso penalized learning problems

    Date: 2015-02-05

    Time: 15:30-16:30

    Location: BURN 1B39

    Abstract:

    We consider a class of group-lasso learning problems where the objective function is the sum of an empirical loss and the group-lasso penalty. For a class of loss functions satisfying a quadratic majorization condition, we derive a unified algorithm called groupwise-majorization-descent (GMD) for efficiently computing the solution paths of the corresponding group-lasso penalized learning problem. GMD allows for general design matrices, without requiring the predictors to be group-wise orthonormal. As illustrative examples, we develop concrete algorithms for solving the group-lasso penalized least squares and several group-lasso penalized large margin classifiers. These group-lasso models have been implemented in an R package gglasso publicly available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/web/packages/gglasso. On simulated and real data, gglasso consistently outperforms the existing software for computing the group-lasso that implements either the classical groupwise descent algorithm or Nesterov’s method. An application in risk segmentation of insurance business is illustrated by analysis of an auto insurance claim dataset.
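    The sketch below illustrates a groupwise majorization update for group-lasso penalized least squares. It is written from the description in this abstract rather than taken from gglasso. Two simplifications are assumptions of the example: the curvature constant for each group is the trace of its block of X'X/n (an upper bound on the largest eigenvalue, so the quadratic still majorizes the loss), and each group's penalty weight is the square root of its size. All names and the simulated data are hypothetical.

```python
import math
import random

def gmd_group_lasso(X, y, groups, lam, n_sweeps=100):
    """Minimize (1/2n)||y - X b||^2 + lam * sum_k w_k ||b_k||_2 by
    cycling through groups and minimizing a quadratic majorizer."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # y - X beta, kept up to date
    # gamma_k >= largest eigenvalue of X_k' X_k / n (trace bound).
    gamma = [sum(X[i][j] ** 2 for i in range(n) for j in g) / n for g in groups]
    weights = [math.sqrt(len(g)) for g in groups]
    for _ in range(n_sweeps):
        for g, gam, w in zip(groups, gamma, weights):
            if gam == 0.0:
                continue
            # v = negative gradient of the loss for this group + gamma * beta_k.
            v = [sum(X[i][j] * resid[i] for i in range(n)) / n + gam * beta[j]
                 for j in g]
            norm_v = math.sqrt(sum(a * a for a in v))
            # Groupwise soft-thresholding: the whole block is zeroed together.
            shrink = max(0.0, 1.0 - lam * w / norm_v) if norm_v > 0.0 else 0.0
            for j, vj in zip(g, v):
                new = shrink * vj / gam
                delta = new - beta[j]
                if delta != 0.0:
                    for i in range(n):
                        resid[i] -= delta * X[i][j]
                    beta[j] = new
    return beta

# Simulated data (hypothetical): only the first group is active, and the
# predictors are not group-wise orthonormal.
random.seed(2)
n = 100
groups = [[0, 1], [2, 3], [4, 5]]
true_beta = [3.0, -2.0, 0.0, 0.0, 0.0, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(n)]
y = [sum(a * b for a, b in zip(xi, true_beta)) + 0.5 * random.gauss(0.0, 1.0)
     for xi in X]

beta_hat = gmd_group_lasso(X, y, groups, lam=1.0)
print(beta_hat)
```

    Using a scalar curvature bound per group is what removes the need for orthonormal blocks: the update has a closed form for any design, at the price of somewhat more conservative steps when the bound is loose.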