Past Seminar Series - McGill Statistics Seminars
  • Changbao Wu: Analysis of complex survey data with missing observations

    Date: 2013-02-22

    Time: 14:30-15:30

    Location: CRM, Université de Montréal, Pav. André-Aisenstadt, salle 1360

    Abstract:

    In this talk, we first provide an overview of issues arising from and methods dealing with complex survey data in the presence of missing observations, with a major focus on the estimating equation approach for analysis and imputation methods for missing data. We then propose a semiparametric fractional imputation method for handling item nonresponses, assuming certain baseline auxiliary variables can be observed for all units in the sample. The proposed strategy combines the strengths of conventional single imputation and multiple imputation methods, and is easy to implement even with a large number of auxiliary variables available, which is typically the case for large scale complex surveys. Simulation results and some general discussion on related issues will also be presented.

  • Eric Cormier: Data Driven Nonparametric Inference for Bivariate Extreme-Value Copulas

    Date: 2013-02-15

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    It is often crucial to know whether the dependence structure of a bivariate distribution belongs to the class of extreme-value copulas. In this talk, I will describe a graphical tool that allows judgment regarding the existence of extreme-value dependence. I will also present a data-driven nonparametric estimator of the Pickands dependence function. This estimator, which is constructed from constrained B-splines, is intrinsic and differentiable, thereby enabling sampling from the fitted model. I will illustrate its properties via simulation. This will lead me to highlight some of the limitations associated with currently available tests of extremeness.

  • Celia Greenwood: Multiple testing and region-based tests of rare genetic variation

    Date: 2013-02-08

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    In the context of univariate association tests between a trait of interest and common genetic variants (SNPs) across the whole genome, corrections for multiple testing have been well studied. Due to the patterns of correlation (i.e. linkage disequilibrium), the number of independent tests remains close to 1 million, even when many more common genetic markers are available. With the advent of the DNA sequencing era, however, newly identified genetic variants tend to be rare or even unique, and consequently single-variant tests of association have little power. As a result, region-based tests of association are being developed that examine associations between the trait and all the genetic variability in a small pre-defined region of the genome. However, coping with multiple testing in this situation has received little attention. I will discuss two aspects of multiple testing for region-based tests. First, I will describe a method for estimating the effective number of independent tests, and second, I will discuss an approach for controlling type I error that is based on stratified false discovery rates, where strata are defined by external information such as genomic annotation.

  • Daniela Witten: Structured learning of multiple Gaussian graphical models

    Date: 2013-02-01

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    I will consider the task of estimating high-dimensional Gaussian graphical models (or networks) corresponding to a single set of features under several distinct conditions. In other words, I wish to estimate several distinct but related networks. I assume that most aspects of the networks are shared, but that there are some structured differences between them. The goal is to exploit the similarity among the networks in order to obtain more accurate estimates of each individual network, as well as to identify the differences between the networks.

  • Mylène Bédard: On the empirical efficiency of local MCMC algorithms with pools of proposals

    Date: 2013-01-25

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    In an attempt to improve on the Metropolis algorithm, various MCMC methods with auxiliary variables, such as the multiple-try and delayed rejection Metropolis algorithms, have been proposed. These methods generate several candidates in a single iteration; accordingly they are computationally more intensive than the Metropolis algorithm. It is usually difficult to provide a general estimate for the computational cost of a method without being overly conservative; potentially efficient methods could thus be overlooked by relying on such estimates. In this talk, we describe three algorithms with auxiliary variables - the multiple-try Metropolis (MTM) algorithm, the multiple-try Metropolis hit-and-run (MTM-HR) algorithm, and the delayed rejection Metropolis algorithm with antithetic proposals (DR-A) - and investigate the net performance of these algorithms in various contexts. To allow for a fair comparison, the study is carried out under optimal mixing conditions for each of these algorithms. The DR-A algorithm, whose proposal scheme introduces correlation in the pool of candidates, seems particularly promising. The algorithms are used in the contexts of Bayesian logistic regressions and classical inference for a linear regression model. This talk is based on work in collaboration with M. Mireuta, E. Moulines, and R. Douc.
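
    To make the idea of a pool of candidates concrete, here is a minimal sketch of one textbook form of the multiple-try Metropolis update. It is not the speaker's implementation, and it does not cover the MTM-HR or DR-A variants. The standard normal target, the Gaussian random-walk proposal, and the names `log_target`, `mtm_step` and `mtm_sample` are all assumptions made for illustration. With a symmetric proposal, setting each candidate's weight equal to the target density is one valid choice.

```python
import math
import random


def log_target(x):
    """Unnormalised log-density of a standard normal (illustrative target only)."""
    return -0.5 * x * x


def mtm_step(x, k, step, rng):
    """One multiple-try Metropolis update using k Gaussian random-walk candidates."""
    # Draw a pool of k candidates around the current state.
    cands = [x + rng.gauss(0.0, step) for _ in range(k)]
    w_cands = [math.exp(log_target(y)) for y in cands]
    # Select one candidate with probability proportional to its weight.
    y = rng.choices(cands, weights=w_cands, k=1)[0]
    # Reference pool: k - 1 draws around the selected candidate, plus the current state.
    refs = [y + rng.gauss(0.0, step) for _ in range(k - 1)] + [x]
    w_refs = [math.exp(log_target(z)) for z in refs]
    # Generalised acceptance ratio: candidate weights over reference weights.
    if rng.random() < min(1.0, sum(w_cands) / sum(w_refs)):
        return y
    return x


def mtm_sample(n, k=5, step=2.0, seed=0):
    """Run n multiple-try Metropolis updates starting from 0 and return the chain."""
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n):
        x = mtm_step(x, k, step, rng)
        chain.append(x)
    return chain
```

    Each update evaluates the target 2k - 1 times (the weight of the current state can be cached), which is the extra computational cost per iteration that the abstract refers to.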

  • Victor Chernozhukov: Inference on treatment effects after selection amongst high-dimensional controls

    Date: 2013-01-18

    Time: 14:30-15:30

    Location: BURN 306

    Abstract:

    We propose robust methods for inference on the effect of a treatment variable on a scalar outcome in the presence of very many controls. Our setting is a partially linear model with possibly non-Gaussian and heteroscedastic disturbances. Our analysis allows the number of controls to be much larger than the sample size. To make informative inference feasible, we require the model to be approximately sparse; that is, we require that the effect of confounding factors can be controlled for up to a small approximation error by conditioning on a relatively small number of controls whose identities are unknown. The latter condition makes it possible to estimate the treatment effect by selecting approximately the right set of controls. We develop a novel estimation and uniformly valid inference method for the treatment effect in this setting, called the “post-double-selection” method. Our results apply to Lasso-type methods used for covariate selection as well as to any other model selection method that is able to find a sparse model with good approximation properties.

  • Ana Best: Risk-set sampling, left truncation, and Bayesian methods in survival analysis

    Date: 2013-01-11

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    Statisticians are often faced with budget concerns when conducting studies. The collection of some covariates, such as genetic data, is very expensive. Other covariates, such as detailed histories, might be difficult or time-consuming to measure. This helped bring about the invention of the nested case-control study, and its more generalized version, risk-set sampled survival analysis. The literature has a good discussion of the properties of risk-set sampling in standard right-censored survival data. My interest is in extending the methods of risk-set sampling to left-truncated survival data, which arise in prevalent longitudinal studies. Since prevalent studies are easier and cheaper to conduct than incident studies, this extension is extremely practical and relevant. I will introduce the partial likelihood in this scenario, and briefly discuss the asymptotic properties of my estimator. I will also introduce Bayesian methods for standard survival analysis, and discuss methods for analyzing risk-set-sampled survival data using Bayesian methods.

  • What percentage of children in the U.S. are eating a healthy diet? A statistical approach

    Date: 2012-12-14

    Time: 14:30-15:30

    Location: Concordia, Room LB 921-04

    Abstract:

    In the United States, the preferred method of obtaining dietary intake data is the 24-hour dietary recall, yet the measure of most interest is usual or long-term average daily intake, which is impossible to measure. Thus, usual dietary intake is assessed with considerable measurement error. Also, diet represents numerous foods, nutrients and other components, each of which has distinctive attributes. Sometimes, it is useful to examine intake of these components separately, but increasingly nutritionists are interested in exploring them collectively to capture overall dietary patterns and their effect on various diseases. Consumption of these components varies widely: some are consumed by almost everyone every day, while others are episodically consumed so that 24-hour recall data are zero-inflated. In addition, they are often correlated with each other. Finally, it is often preferable to analyze the amount of a dietary component relative to the amount of energy (calories) in a diet because dietary recommendations often vary with energy level.

  • Sample size and power determination for multiple comparison procedures aiming at rejecting at least r among m false hypotheses

    Date: 2012-12-07

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    Multiple testing problems arise in a variety of situations, notably in clinical trials with multiple endpoints. In such cases, it is often of interest to reject either all hypotheses or at least one of them. More generally, the question arises as to whether one can reject at least r out of m hypotheses. Statistical tools addressing this issue are rare in the literature. In this talk, I will recall well-known hypothesis testing concepts, both in a single- and in a multiple-hypothesis context. I will then present general power formulas for three important multiple comparison procedures: the Bonferroni and Hochberg procedures, as well as Holm’s sequential procedure. Next, I will describe an R package that we developed for sample size calculations in multiple-endpoint trials where it is desired to reject at least r out of m hypotheses. This package covers the case where all the variables are continuous and four common variance-covariance patterns. I will show how to use this package to compute the sample size needed in a real-life application.
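
    For readers unfamiliar with the three procedures named above, the following sketch shows their standard textbook decision rules applied to a list of p-values, together with a check for "at least r rejections". It is unrelated to the R package mentioned in the abstract and does not compute power or sample size. The function names are hypothetical.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i whenever p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm step-down: walk up the sorted p-values and stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def hochberg(pvals, alpha=0.05):
    """Hochberg step-up: find the largest sorted p-value under its threshold."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank in range(m - 1, -1, -1):  # start from the largest p-value
        if pvals[order[rank]] <= alpha / (m - rank):
            # Reject this hypothesis and every one with a smaller p-value.
            for i in order[: rank + 1]:
                reject[i] = True
            break
    return reject


def rejects_at_least(reject, r):
    """True when at least r of the m hypotheses are rejected."""
    return sum(reject) >= r
```

    With p-values 0.01, 0.02, 0.03, 0.04 and alpha = 0.05, Bonferroni and Holm each reject one hypothesis, whereas Hochberg rejects all four, so only Hochberg satisfies an "at least r = 2" criterion here.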

  • Sharing confidential datasets using differential privacy

    Date: 2012-11-30

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    While statistical agencies would like to share their data with researchers, they must also protect the confidentiality of the data provided by their respondents. To satisfy these two conflicting objectives, agencies use various techniques to restrict and modify the data before publication. Most of these techniques, however, share a common flaw: their confidentiality protection cannot be rigorously measured. In this talk, I will present the criterion of differential privacy, a rigorous measure of the protection offered by such methods. Designed to guarantee confidentiality even in a worst-case scenario, differential privacy protects the information of any individual in the database against an adversary with complete knowledge of the rest of the dataset. I will first give a brief overview of recent and current research on the topic of differential privacy. I will then focus on the publication of differentially private synthetic contingency tables and present some of my results on the methods for the generation and proper analysis of such datasets.
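
    As a rough illustration of how a differentially private release can work, the sketch below applies the Laplace mechanism, a standard technique that the abstract does not name and that may differ from the speaker's methods, to the cell counts of a contingency table. It assumes that adding or removing one respondent changes exactly one cell by 1 (sensitivity 1). The function names and the `epsilon` privacy parameter are illustrative.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw, obtained as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Add independent Laplace noise of scale sensitivity / epsilon to every cell."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [[c + laplace_noise(scale, rng) for c in row] for row in counts]
```

    Smaller values of `epsilon` give stronger protection and noisier cells. The output may contain negative or non-integer counts, so it is not yet a synthetic table; turning it into one requires further processing that this sketch does not attempt.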