Past Seminar Series - McGill Statistics Seminars
  • Outlier detection for functional data using principal components

    Date: 2016-02-11

    Time: 16:00-17:00

    Location: CRM 6254 (U. de Montréal)

    Abstract:

    Principal components analysis is a widely used technique that provides an optimal lower-dimensional approximation to multivariate observations. In the functional case, a new characterization of elliptical distributions on separable Hilbert spaces allows us to obtain an equivalent stochastic optimality property for the principal component subspaces of random elements on separable Hilbert spaces. This property holds even when second moments do not exist. These lower-dimensional approximations can be very useful in identifying potential outliers among high-dimensional or functional observations. In this talk we propose a new class of robust estimators for principal components, which is consistent for elliptical random vectors, and Fisher-consistent for elliptically distributed random elements on arbitrary Hilbert spaces. We illustrate our method on two real functional data sets, where the robust estimator is able to discover atypical observations in the data that would have been missed otherwise. This talk is the result of recent collaborations with Graciela Boente (Buenos Aires, Argentina) and David Tyler (Rutgers, USA).
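
    As a rough illustration of how these lower-dimensional approximations can flag atypical curves, the Python sketch below projects discretized functional data onto ordinary (non-robust) principal components and flags observations with a large reconstruction error. It is only a stand-in for the robust estimator discussed in the talk; the cutoff rule and all names are illustrative.

      import numpy as np

      def pca_outlier_scores(X, n_components=3):
          """Reconstruction error of each discretized curve after projecting
          onto the leading principal component subspace.
          X : (n_curves, n_gridpoints) array."""
          Xc = X - X.mean(axis=0)                        # centre the curves
          U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
          V = Vt[:n_components].T                        # leading eigenfunctions as columns
          scores = Xc @ V                                # principal component scores
          residual = Xc - scores @ V.T                   # part not captured by the subspace
          return np.sqrt((residual ** 2).sum(axis=1))    # one distance per curve

      # Flag curves whose reconstruction error sits far above the bulk.
      rng = np.random.default_rng(0)
      grid = np.linspace(0, 1, 50)
      X = np.sin(2 * np.pi * grid) + 0.1 * rng.standard_normal((100, 50))
      X[0] += 2.0 * (grid > 0.5)                         # one artificially atypical curve
      d = pca_outlier_scores(X)
      cutoff = np.median(d) + 3 * (np.quantile(d, 0.75) - np.quantile(d, 0.25))
      print(np.where(d > cutoff)[0])                     # indices of flagged curves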

  • The Bayesian causal effect estimation algorithm

    Date: 2016-02-05

    Time: 15:30-16:30

    Location: BURN 1214

    Abstract:

    Estimating causal exposure effects in observational studies ideally requires the analyst to have a vast knowledge of the domain of application. Investigators often bypass difficulties related to the identification and selection of confounders through the use of fully adjusted outcome regression models. However, since such models likely contain more covariates than required, the variance of the regression coefficient for exposure may be unnecessarily large. Instead of using a fully adjusted model, model selection can be attempted. Most classical statistical model selection approaches, such as Bayesian model averaging, do not readily address causal effect estimation. We present a new model-averaged approach to causal inference, Bayesian causal effect estimation (BCEE), which is motivated by the graphical framework for causal inference. BCEE aims to estimate the causal effect of a continuous exposure on a continuous outcome without bias, while being more efficient than a fully adjusted approach.
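
    BCEE itself is not specified in the abstract, but the generic ingredient it builds on, averaging the exposure coefficient over candidate adjustment sets with data-driven weights, can be sketched as follows. This is plain BIC-weighted model averaging, not BCEE's actual prior or algorithm, and all names are illustrative.

      import itertools
      import numpy as np
      import statsmodels.api as sm

      def bma_exposure_effect(y, exposure, covariates):
          """Average the exposure coefficient over all outcome models that
          adjust for a subset of the candidate covariates, weighting each
          model by a BIC-based approximate posterior probability."""
          p = covariates.shape[1]
          effects, bics = [], []
          for k in range(p + 1):
              for subset in itertools.combinations(range(p), k):
                  idx = np.asarray(subset, dtype=int)
                  X = np.column_stack([np.ones_like(exposure), exposure,
                                       covariates[:, idx]])
                  fit = sm.OLS(y, X).fit()
                  effects.append(fit.params[1])          # coefficient on the exposure
                  bics.append(fit.bic)
          bics = np.asarray(bics)
          w = np.exp(-0.5 * (bics - bics.min()))         # approximate model posterior
          w /= w.sum()
          return float(np.dot(w, effects))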

  • Estimating high-dimensional networks with hubs with an application to microbiome data

    Date: 2016-01-29

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In this talk, we investigate the problem of estimating high-dimensional networks in which there are a few highly connected "hub" nodes. Methods based on L1-regularization have been widely used for performing sparse selection in the graphical modelling context. However, the L1 penalty penalizes each edge equally and independently, without taking any structural information into account. We introduce a new method for estimating undirected graphical models with hubs, called the hubs weighted graphical lasso (HWGL). This is a two-step procedure with a hub screening step, followed by network reconstruction in the second step using a weighted lasso approach that incorporates the inferred network topology. Empirically, we show that the HWGL outperforms competing methods and illustrate the methodology with an application to microbiome data.
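
    The two-step idea (screen hubs, then re-estimate with edge-specific penalties) can be imitated with node-wise lasso regressions, since a per-feature penalty weight is equivalent to rescaling the corresponding column. The sketch below is a neighbourhood-selection stand-in with an ad hoc degree-based hub screen, not the HWGL's precision-matrix estimator or its actual weights.

      import numpy as np
      from sklearn.linear_model import Lasso

      def neighbourhood_graph(X, alpha=0.1):
          """Node-wise lasso: edge (i, j) is kept if either regression
          selects the other variable (the 'or' rule)."""
          n, p = X.shape
          A = np.zeros((p, p), dtype=bool)
          for j in range(p):
              others = np.delete(np.arange(p), j)
              coef = Lasso(alpha=alpha).fit(X[:, others], X[:, j]).coef_
              A[j, others] = coef != 0
          return A | A.T

      def hub_weighted_graph(X, alpha=0.1, hub_quantile=0.9, hub_discount=0.3):
          """Step 1: screen hubs as high-degree nodes of an initial graph.
          Step 2: refit node-wise lassos with smaller penalties on edges to
          hubs, implemented by rescaling the hub columns."""
          n, p = X.shape
          degree = neighbourhood_graph(X, alpha).sum(axis=0)
          hubs = degree >= np.quantile(degree, hub_quantile)
          w = np.where(hubs, hub_discount, 1.0)          # hub columns penalised less
          A = np.zeros((p, p), dtype=bool)
          for j in range(p):
              others = np.delete(np.arange(p), j)
              Xs = X[:, others] / w[others]              # rescaling <=> per-feature penalty
              coef = Lasso(alpha=alpha).fit(Xs, X[:, j]).coef_
              A[j, others] = coef != 0
          return A | A.T, np.where(hubs)[0]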

  • Robust estimation in the presence of influential units in surveys

    Date: 2016-01-22

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Influential units are those which make classical estimators (e.g., the Horvitz-Thompson estimator or calibration estimators) very unstable. The problem of influential units is particularly important in business surveys, which collect economic variables whose distributions are highly skewed (heavy right tail). In this talk, we will attempt to answer the following questions:

    (1) What is an influential value in surveys? (2) How do we measure the influence of a unit? (3) How can we reduce the impact of influential units at the estimation stage?
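
    As a purely numerical illustration of question (2), one can ask how much a single sampled unit moves the Horvitz-Thompson total; the drop-one measure below is just that, not the influence measure that will be presented in the talk.

      import numpy as np

      def horvitz_thompson_total(y, pi):
          """Horvitz-Thompson estimator of a population total:
          sum of y_i / pi_i over the sample."""
          return np.sum(y / pi)

      def drop_one_influence(y, pi):
          """Change in the HT total when each sampled unit is removed in
          turn; large values point to units that dominate the estimate."""
          total = horvitz_thompson_total(y, pi)
          return np.array([total - horvitz_thompson_total(np.delete(y, i),
                                                          np.delete(pi, i))
                           for i in range(len(y))])

      # A skewed, business-survey-like sample with one very large enterprise.
      y = np.array([12.0, 8.0, 15.0, 9.0, 640.0])
      pi = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
      print(drop_one_influence(y, pi))                   # the last unit contributes 3200 on its own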

  • Causal discovery with confidence using invariance principles

    Date: 2015-12-10

    Time: 15:30-16:30

    Location: UdeM, Pav. Roger-Gaudry, Room S-116

    Abstract:

    What is interesting about causal inference? One of the most compelling aspects is that any prediction under a causal model is valid in environments that are possibly very different to the environment used for inference. For example, variables can be actively changed and predictions will still be valid and useful. This invariance is very useful but still leaves open the difficult question of inference. We propose to turn this invariance principle around and exploit the invariance for inference. If we observe a system in different environments (or under different but possibly not well specified interventions) we can identify all models that are invariant. We know that any causal model has to be in this subset of invariant models. This allows causal inference with valid confidence intervals. We propose different estimators, depending on the nature of the interventions and depending on whether hidden variables and feedbacks are present. Some empirical examples demonstrate the power and possible pitfalls of this approach.
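
    A crude version of "keep only the invariant models" fits in a few lines: for every candidate set of predictors, fit a pooled regression and test whether the residuals look the same in every environment, then intersect the accepted sets. The tests used below (ANOVA on residual means plus Levene on spreads) are simplistic placeholders, not the procedure's actual invariance tests or confidence statements.

      import itertools
      import numpy as np
      from scipy import stats

      def invariant_sets(X, y, env, alpha=0.05):
          """Return all predictor subsets whose pooled-regression residuals
          show no detectable shift in mean or spread across environments."""
          n, p = X.shape
          accepted = []
          for k in range(p + 1):
              for S in itertools.combinations(range(p), k):
                  idx = np.asarray(S, dtype=int)
                  Z = np.column_stack([np.ones(n), X[:, idx]])
                  beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
                  r = y - Z @ beta
                  groups = [r[env == e] for e in np.unique(env)]
                  p_mean = stats.f_oneway(*groups).pvalue
                  p_var = stats.levene(*groups).pvalue
                  if min(p_mean, p_var) > alpha / 2:     # Bonferroni over the two tests
                      accepted.append(set(S))
          return accepted

      def causal_predictors(accepted):
          """Variables present in every invariant set: the causal parents
          must lie in this (possibly empty) intersection."""
          return set.intersection(*accepted) if accepted else set()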

  • Inference regarding within-family association in disease onset times under biased sampling schemes

    Date: 2015-11-26

    Time: 15:30-16:30

    Location: BURN 306

    Abstract:

    In preliminary studies of the genetic basis for chronic conditions, interest routinely lies in the within-family dependence in disease status. When probands are selected from disease registries and their respective families are recruited, a variety of ascertainment bias-corrected methods of inference are available, typically based on models for correlated binary data. This approach, however, ignores the ages of family members at the time of assessment. We consider copula-based models for assessing the within-family dependence in the disease onset time and disease progression, based on right-censored and current status observations of the non-probands. Inferences based on likelihood, composite likelihood and estimating functions are each discussed and compared in terms of asymptotic and empirical relative efficiency. This is joint work with Yujie Zhong.
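
    To make the likelihood contributions concrete, the sketch below writes out the four right-censoring patterns for one (proband, relative) pair under a Clayton survival copula; the choice of copula, the shared marginal, and the omission of current status data are simplifications made here, not choices from the talk.

      import numpy as np

      def clayton_C(u, v, theta):
          """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta);
          Kendall's tau equals theta / (theta + 2)."""
          return (u**-theta + v**-theta - 1.0)**(-1.0 / theta)

      def clayton_dC_du(u, v, theta):
          """Partial derivative of the Clayton copula with respect to u."""
          return u**(-theta - 1) * (u**-theta + v**-theta - 1.0)**(-1.0 / theta - 1)

      def clayton_density(u, v, theta):
          """Clayton copula density c(u, v)."""
          return ((theta + 1) * (u * v)**(-theta - 1)
                  * (u**-theta + v**-theta - 1.0)**(-1.0 / theta - 2))

      def pair_loglik(t1, d1, t2, d2, theta, S, f):
          """Log likelihood contribution of one pair of onset times with
          right censoring; d = 1 means the onset was observed, d = 0 means
          censored. S and f are the marginal survival and density functions,
          assumed known or estimated beforehand."""
          u, v = S(t1), S(t2)
          if d1 and d2:
              return np.log(clayton_density(u, v, theta) * f(t1) * f(t2))
          if d1:
              return np.log(clayton_dC_du(u, v, theta) * f(t1))
          if d2:
              return np.log(clayton_dC_du(v, u, theta) * f(t2))
          return np.log(clayton_C(u, v, theta))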

  • Prevalent cohort studies: Length-biased sampling with right censoring

    Date: 2015-11-13

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Logistical or other constraints often preclude the possibility of conducting incident cohort studies. A feasible alternative in such cases is to conduct a cross-sectional prevalent cohort study for which we recruit prevalent cases, i.e., subjects who have already experienced the initiating event, say the onset of a disease. When the interest lies in estimating the lifespan between the initiating event and a terminating event, say death, such subjects may be followed prospectively until the terminating event or loss to follow-up, whichever happens first. It is well known that prevalent cases have, on average, longer lifespans. As such, they do not form a representative random sample from the target population; they comprise a biased sample. If the initiating events are generated from a stationary Poisson process, the so-called stationarity assumption, this bias is called length bias. I present the basics of nonparametric inference using length-biased right censored failure time data. I'll then discuss some recent progress and current challenges. Our study is mainly motivated by challenges and questions raised in analyzing survival data collected on patients with dementia as part of a nationwide study in Canada, called the Canadian Study of Health and Aging (CSHA). I'll use these data throughout the talk to discuss and motivate our methodology and its applications.
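
    To make the length-bias point concrete: under the stationarity assumption, a prevalent case with lifetime t is sampled with probability proportional to t, so the observed density is g(t) = t f(t) / mu and E[1/T] under g equals 1/mu. The short simulation below (which ignores right censoring entirely) shows the naive sample mean overshooting while the harmonic-type correction recovers the true mean.

      import numpy as np

      rng = np.random.default_rng(1)
      mu = 2.0
      population = rng.exponential(mu, size=200_000)     # true lifetimes, mean 2

      # Length-biased sampling: selection probability proportional to the lifetime.
      probs = population / population.sum()
      sample = rng.choice(population, size=5_000, p=probs)

      naive = sample.mean()                    # biased upwards (equals 2*mu for exponential lifetimes)
      corrected = 1.0 / np.mean(1.0 / sample)  # valid because E[1/T] under length bias is 1/mu
      print(f"naive {naive:.2f}  corrected {corrected:.2f}  true {mu:.2f}")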

  • Bayesian analysis of non-identifiable models, with an example from epidemiology and biostatistics

    Date: 2015-11-06

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Most regression models in biostatistics assume identifiability, which means that each point in the parameter space corresponds to a unique likelihood function for the observable data. Recently there has been interest in Bayesian inference for non-identifiable models, which can better represent uncertainty in some contexts. One example is in the field of epidemiology, where the investigator is concerned with bias due to unmeasured confounders (omitted variables). In this talk, I will illustrate Bayesian analysis of a non-identifiable model from epidemiology using government administrative data from British Columbia. I will show how to use Stan, new software developed by Andrew Gelman and others in the USA. Stan allows the careful study of posterior distributions in a vast collection of Bayesian models, including non-identifiable models for bias in epidemiology, which are poorly suited to conventional Gibbs sampling.
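
    A toy example of the phenomenon (not the epidemiological bias model from the talk): if the data inform only the sum of two parameters, the likelihood is flat along a ridge and the posterior along that direction is driven entirely by the prior. The grid computation below makes this visible without any MCMC; a Stan program for the same model would show the same ridge in its posterior draws.

      import numpy as np

      # Data are informative only about theta1 + theta2: a non-identifiable model.
      rng = np.random.default_rng(2)
      y = rng.normal(loc=1.5, scale=1.0, size=50)        # true theta1 + theta2 = 1.5

      grid = np.linspace(-3, 3, 200)
      t1, t2 = np.meshgrid(grid, grid, indexing="ij")
      loglik = -0.5 * ((y[:, None, None] - (t1 + t2))**2).sum(axis=0)
      logprior = -0.5 * (t1**2 + t2**2)                  # independent N(0, 1) priors
      logpost = loglik + logprior
      post = np.exp(logpost - logpost.max())
      post /= post.sum()

      # The posterior concentrates near the line theta1 + theta2 = 1.5, but each
      # marginal is far wider than the posterior of the (identified) sum.
      mean_t1 = np.sum(post * t1)
      print("posterior sd of theta1:", np.sqrt(np.sum(post * t1**2) - mean_t1**2))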

  • A knockoff filter for controlling the false discovery rate

    Date: 2015-10-30

    Time: 16:00-17:00

    Location: Room 1360, Pavillon André-Aisenstadt, Université de Montréal

    Abstract:

    The big data era has created a new scientific paradigm: collect data first, ask questions later. Imagine that we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR) - the expected fraction of false discoveries among all discoveries - is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. We introduce the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method works by constructing fake variables ("knockoffs") that can then be used as controls for the true variables; the method achieves exact FDR control in finite-sample settings regardless of the design or covariates, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and it does not require any knowledge of the noise level. This is joint work with Rina Foygel Barber.
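
    Once per-variable statistics W_j are in hand (a large positive W_j is evidence that variable j, rather than its knockoff copy, matters), the selection step is only a few lines. The data-dependent threshold below is the "knockoff+" form of the rule; constructing the knockoff variables themselves, which is the heart of the method, is not shown.

      import numpy as np

      def knockoff_plus_select(W, q=0.10):
          """Select variables with W_j >= T, where T is the smallest threshold
          at which the estimated false discovery proportion
          (1 + #{W_j <= -t}) / #{W_j >= t} drops below the target level q."""
          candidates = np.sort(np.abs(W[W != 0]))
          for t in candidates:
              fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
              if fdp_hat <= q:
                  return np.where(W >= t)[0]             # indices of selected variables
          return np.array([], dtype=int)                 # nothing passes the threshold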

  • Robust mixture regression and outlier detection via penalized likelihood

    Date: 2015-10-23

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Finite mixture regression models have been widely used for modeling mixed regression relationships arising from a clustered and thus heterogeneous population. The classical normal mixture model, despite its simplicity and wide applicability, may fail dramatically in the presence of severe outliers. We propose a robust mixture regression approach based on a sparse, case-specific, and scale-dependent mean-shift parameterization, for simultaneously conducting outlier detection and robust parameter estimation. A penalized likelihood approach is adopted to induce sparsity among the mean-shift parameters so that the outliers are distinguished from the good observations, and a thresholding-embedded Expectation-Maximization (EM) algorithm is developed to enable stable and efficient computation. The proposed penalized estimation approach is shown to have strong connections with other robust methods, including the trimmed likelihood and M-estimation methods. Compared with several existing methods, the proposed method shows outstanding performance in numerical studies.
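
    To show the mean-shift mechanism in the simplest possible setting, a single regression component rather than a mixture and a fixed error scale, the sketch below alternates least squares on the shifted responses with soft-thresholding of the case-specific shift parameters; observations with a nonzero shift are the flagged outliers. The mixture version in the talk embeds this same thresholding step inside an EM algorithm.

      import numpy as np

      def soft_threshold(z, lam):
          return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

      def mean_shift_regression(X, y, lam=2.5, n_iter=100):
          """Alternate between (1) least squares on y - gamma and
          (2) soft-thresholding the residuals to update the case-specific
          mean shifts gamma; nonzero gammas mark suspected outliers."""
          n = len(y)
          Z = np.column_stack([np.ones(n), X])
          gamma = np.zeros(n)
          for _ in range(n_iter):
              beta, *_ = np.linalg.lstsq(Z, y - gamma, rcond=None)
              gamma = soft_threshold(y - Z @ beta, lam)
          return beta, np.nonzero(gamma)[0]

      # Example with a handful of grossly contaminated responses.
      rng = np.random.default_rng(3)
      X = rng.standard_normal((200, 2))
      y = 1.0 + X @ np.array([2.0, -1.0]) + rng.standard_normal(200)
      y[:5] += 10.0                                      # five gross outliers
      beta, outliers = mean_shift_regression(X, y)
      print(beta, outliers)                              # 'outliers' should include indices 0-4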