2015 Fall - McGill Statistics Seminars
  • Causal discovery with confidence using invariance principles

    Date: 2015-12-10

    Time: 15:30-16:30

    Location: UdeM, Pav. Roger-Gaudry, salle S-116

    Abstract:

    What is interesting about causal inference? One of the most compelling aspects is that any prediction under a causal model is valid in environments that are possibly very different from the environment used for inference. For example, variables can be actively changed and predictions will still be valid and useful. This invariance is very useful but still leaves open the difficult question of inference. We propose to turn this invariance principle around and exploit the invariance for inference. If we observe a system in different environments (or under different but possibly not well specified interventions) we can identify all models that are invariant. We know that any causal model has to be in this subset of invariant models. This allows causal inference with valid confidence intervals. We propose different estimators, depending on the nature of the interventions and depending on whether hidden variables and feedbacks are present. Some empirical examples demonstrate the power and possible pitfalls of this approach.

  • Inference regarding within-family association in disease onset times under biased sampling schemes

    Date: 2015-11-26

    Time: 15:30-16:30

    Location: BURN 306

    Abstract:

    In preliminary studies of the genetic basis for chronic conditions, interest routinely lies in the within-family dependence in disease status. When probands are selected from disease registries and their respective families are recruited, a variety of ascertainment bias-corrected methods of inference are available which are typically based on models for correlated binary data. This approach ignores the age that family members are at the time of assessment. We consider copula-based models for assessing the within-family dependence in the disease onset time and disease progression, based on right-censored and current status observation of the non-probands. Inferences based on likelihood, composite likelihood and estimating functions are each discussed and compared in terms of asymptotic and empirical relative efficiency. This is joint work with Yujie Zhong.

  • Prevalent cohort studies: Length-biased sampling with right censoring

    Date: 2015-11-13

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Logistical or other constraints often preclude the possibility of conducting incident cohort studies. A feasible alternative in such cases is to conduct a cross-sectional prevalent cohort study for which we recruit prevalent cases, i.e., subjects who have already experienced the initiating event, say the onset of a disease. When the interest lies in estimating the lifespan between the initiating event and a terminating event, say death, such subjects may be followed prospectively until the terminating event or loss to follow-up, whichever happens first. It is well known that prevalent cases have, on average, longer lifespans. As such, they do not form a representative random sample from the target population; they comprise a biased sample. If the initiating events are generated from a stationary Poisson process, the so-called stationarity assumption, this bias is called length bias. I present the basics of nonparametric inference using length-biased right censored failure time data. I’ll then discuss some recent progress and current challenges. Our study is mainly motivated by challenges and questions raised in analyzing survival data collected on patients with dementia as part of a nationwide study in Canada, called the Canadian Study of Health and Aging (CSHA). I’ll use these data throughout the talk to discuss and motivate our methodology and its applications.
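
The claim that prevalent cases have longer lifespans on average can be checked on a toy example. The sketch below is not the speaker's method: it ignores right censoring entirely and assumes a made-up two-point lifespan distribution. It reweights that distribution in proportion to lifespan (length bias) and then recovers the target mean with the harmonic-mean correction 1 / E_LB[1/X]. The function names are hypothetical.

```python
from fractions import Fraction

def mean_of(pmf):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def length_biased_pmf(pmf):
    """Reweight each lifespan x in proportion to x: P_LB(x) = x * P(x) / E[X]."""
    mean = mean_of(pmf)
    return {x: x * p / mean for x, p in pmf.items()}

def harmonic_corrected_mean(pmf_lb):
    """Recover the target-population mean from a length-biased law: 1 / E_LB[1/X]."""
    return 1 / sum(p / x for x, p in pmf_lb.items())

# Hypothetical target population: lifespans of 1 or 3 years, equally likely.
target = {Fraction(1): Fraction(1, 2), Fraction(3): Fraction(1, 2)}
biased = length_biased_pmf(target)

target_mean = mean_of(target)                # mean lifespan in the target population
biased_mean = mean_of(biased)                # mean lifespan among prevalent cases
corrected = harmonic_corrected_mean(biased)  # bias-corrected mean

print(target_mean, biased_mean, corrected)
```

In this toy population the prevalent-case mean exceeds the target mean, and the harmonic correction returns the target mean exactly. With real data, censoring has to be handled as well, which is what the nonparametric methods in the talk address.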

  • Bayesian analysis of non-identifiable models, with an example from epidemiology and biostatistics

    Date: 2015-11-06

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Most regression models in biostatistics assume identifiability, which means that each point in the parameter space corresponds to a unique likelihood function for the observable data. Recently there has been interest in Bayesian inference for non-identifiable models, which can better represent uncertainty in some contexts. One example is in the field of epidemiology, where the investigator is concerned with bias due to unmeasured confounders (omitted variables). In this talk, I will illustrate Bayesian analysis of a non-identifiable model from epidemiology using government administrative data from British Columbia. I will show how to use the software STAN, which is new software developed by Andrew Gelman and others in the USA. STAN allows the careful study of posterior distributions in a vast collection of Bayesian models, including non-identifiable models for bias in epidemiology, which are poorly suited to conventional Gibbs sampling.
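
To make the definition of identifiability concrete, here is a minimal sketch of a non-identifiable model. It is a made-up toy, not the model from the talk: observations are Gaussian with unit variance and mean a + b, where a might stand for an effect of interest and b for bias from an unmeasured confounder. The parameter names and the data are assumptions for illustration only.

```python
import math

def log_likelihood(a, b, data):
    """Gaussian log-likelihood with unit variance and mean a + b.

    Only the sum a + b enters the likelihood, so a and b are not
    separately identifiable from the data.
    """
    mu = a + b
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2 for y in data)

data = [2.5, 3.1, 3.4]  # made-up observations

ll_first = log_likelihood(1.0, 2.0, data)   # a = 1, b = 2
ll_second = log_likelihood(2.0, 1.0, data)  # a = 2, b = 1, same sum
ll_other = log_likelihood(0.0, 0.0, data)   # different sum

print(ll_first == ll_second, ll_other < ll_first)
```

Two distinct parameter points give the same likelihood, so the data alone cannot separate them. In a Bayesian analysis the prior on b is what keeps the posterior for a proper.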

  • A knockoff filter for controlling the false discovery rate

    Date: 2015-10-30

    Time: 16:00-17:00

    Location: Salle 1360, Pavillon André-Aisenstadt, Université de Montréal

    Abstract:

    The big data era has created a new scientific paradigm: collect data first, ask questions later. Imagine that we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR) - the expected fraction of false discoveries among all discoveries - is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. We introduce the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method works by constructing fake variables, knockoffs, which can then be used as controls for the true variables; the method achieves exact FDR control in finite-sample settings no matter the design or covariates, the number of variables in the model, and the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. This is joint work with Rina Foygel Barber.
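
The quantity being controlled can be illustrated with a small sketch. The code below does not implement the knockoff filter; it only computes the false discovery proportion for one hypothetical selection, assuming the set of true signals is known (as it would be in a simulation). The FDR is the expectation of this proportion over repeated data sets. The function name and the index sets are made up.

```python
def false_discovery_proportion(selected, true_signals):
    """Fraction of selected variables that are not true signals.

    By convention the proportion is 0 when nothing is selected.
    """
    selected = set(selected)
    if not selected:
        return 0.0
    false_discoveries = selected - set(true_signals)
    return len(false_discoveries) / len(selected)

# Hypothetical simulation: variables 0-3 truly affect the response.
true_signals = {0, 1, 2, 3}
selected = {0, 1, 2, 5}  # variable 5 is a false discovery

fdp = false_discovery_proportion(selected, true_signals)
fdp_empty = false_discovery_proportion(set(), true_signals)

print(fdp, fdp_empty)
```

A procedure controls the FDR at level q if the average of this proportion over repeated experiments is at most q.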

  • Robust mixture regression and outlier detection via penalized likelihood

    Date: 2015-10-23

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Finite mixture regression models have been widely used for modeling mixed regression relationships arising from a clustered and thus heterogeneous population. The classical normal mixture model, despite its simplicity and wide applicability, may fail dramatically in the presence of severe outliers. We propose a robust mixture regression approach based on a sparse, case-specific, and scale-dependent mean-shift parameterization, for simultaneously conducting outlier detection and robust parameter estimation. A penalized likelihood approach is adopted to induce sparsity among the mean-shift parameters so that the outliers are distinguished from the good observations, and a thresholding-embedded Expectation-Maximization (EM) algorithm is developed to enable stable and efficient computation. The proposed penalized estimation approach is shown to have strong connections with other robust methods including the trimmed likelihood and the M-estimation methods. Compared with several existing methods, the proposed methods show outstanding performance in numerical studies.

  • Estimating high-dimensional multi-layered networks through penalized maximum likelihood

    Date: 2015-10-16

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Gaussian graphical models are a good tool for capturing interactions between nodes that represent the underlying random variables. However, in many applications in biology one is interested in modeling associations both between and within molecular compartments (e.g., interactions between genes and proteins/metabolites). To this end, inferring multi-layered network structures from high-dimensional data provides insight into understanding the conditional relationships among nodes within layers, after adjusting for and quantifying the effects of nodes from other layers. We propose an integrated algorithmic approach for estimating multi-layered networks that incorporates a screening step for significant variables, an optimization algorithm for estimating the key model parameters and a stability selection step for selecting the most stable effects. The proposed methodology offers an efficient way of estimating the edges within and across layers iteratively, by solving an optimization problem constructed based on penalized maximum likelihood (under a Gaussianity assumption). The optimization is solved on a reduced parameter space that is identified through screening, which remedies the instability in high dimensions. Theoretical properties are considered to ensure identifiability and consistent estimation of the parameters and convergence of the optimization algorithm, despite the lack of global convexity. The performance of the methodology is illustrated on synthetic data sets and on an application to gene and metabolic expression data for patients with renal disease.

  • Parameter estimation of partial differential equations over irregular domains

    Date: 2015-10-09

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    Spatio-temporal data are abundant in many scientific fields; examples include daily satellite images of the earth, hourly temperature readings from multiple weather stations, and the spread of an infectious disease over a particular region. In many instances the spatio-temporal data are accompanied by mathematical models expressed in terms of partial differential equations (PDEs). These PDEs determine the theoretical aspects of the behavior of the physical, chemical or biological phenomena considered. Azzimonti (2013) showed that including the associated PDE as a regularization term as opposed to the conventional two-dimensional Laplacian provides a considerable improvement in the estimation accuracy. The PDE parameters often have interesting interpretations, although they are typically unknown and must be inferred from expert knowledge of the phenomena considered. In this talk I will discuss extending the profiling with a parameter cascading procedure outlined in Ramsay et al. (2007) to incorporate PDE parameter estimation. I will also show how, following Sangalli et al. (2013), the estimation procedure can be extended to include finite-element methods (FEMs). This allows the proposed method to account for attributes of the geometry of the physical problem such as irregularly shaped domains, external and internal boundary features, as well as strong concavities. Thus this talk will introduce a methodology for data-driven estimates of the parameters of PDEs defined over irregular domains.

  • Estimating covariance matrices of intermediate size

    Date: 2015-10-02

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    In finance, the covariance matrix of many assets is a key component of financial portfolio optimization and is usually estimated from historical data. Much research in the past decade has focused on improving estimation by studying the asymptotics of large covariance matrices in the so-called high-dimensional regime, where the dimension p grows at the same pace as the sample size n, and this approach has been very successful. This choice of growth makes sense in part because, based on results for eigenvalues, it appears that there are only two limits: the high-dimensional one when p grows like n, and the classical one, when p grows more slowly than n. In this talk, I will present evidence that this binary view is false, and that there could be hidden intermediate regimes lying in between. In turn, this allows for corrections to the sample covariance matrix that are more appropriate when the dimension is large but moderate with respect to the sample size, as is often the case; this can also lead to better optimization for portfolio volatility in many situations of interest.

  • Topics in statistical inference for the semiparametric elliptical copula model

    Date: 2015-09-25

    Time: 15:30-16:30

    Location: BURN 1205

    Abstract:

    This talk addresses aspects of the statistical inference problem for the semiparametric elliptical copula model. The semiparametric elliptical copula model is the family of distributions whose dependence structures are specified by parametric elliptical copulas but whose marginal distributions are left unspecified. An elliptical copula is uniquely characterized by a characteristic generator and a copula correlation matrix Sigma. In the first part of this talk, I will consider the estimation of Sigma. A natural estimate for Sigma is the plug-in estimator Sigmahat with Kendall’s tau statistic. I will first exhibit a sharp bound on the operator norm of Sigmahat - Sigma. I will then consider a factor model of Sigma, for which I will propose a refined estimator Sigmatilde by fitting a low-rank matrix plus a diagonal matrix to Sigmahat using least squares with a nuclear norm penalty on the low-rank matrix. The bound on the operator norm of Sigmahat - Sigma serves to scale the penalty term, and I will present the finite-sample oracle inequalities that we obtained for Sigmatilde. In the second part of this talk, we will look at the classification of two distributions that have the same Gaussian copula but that are otherwise arbitrary in high dimensions. Under this semiparametric Gaussian copula setting, I will give an accurate semiparametric estimator of the log-density ratio, which leads to an empirical decision rule and a bound on its associated excess risk. Our estimation procedure takes advantage of the potential sparsity as well as the low noise condition in the problem, which allows us to achieve a faster convergence rate of the excess risk than is possible in the existing literature on semiparametric Gaussian copula classification. I will demonstrate the efficiency of our semiparametric empirical decision rule by showing that the bound on the excess risk nearly achieves a convergence rate of 1/sqrt(n) in the simple setting of Gaussian distribution classification.
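
The plug-in estimator Sigmahat mentioned in the first part can be sketched in a few lines. The abstract does not spell out the transform; the code assumes the standard relation for elliptical copulas, Sigma_jk = sin(pi/2 * tau_jk), where tau_jk is Kendall's tau between coordinates j and k. It covers only the plug-in step, not the refined estimator Sigmatilde, and assumes data without ties. The function names and sample values are hypothetical.

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau for paired samples, assuming no ties."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)  # +1 if concordant, -1 if discordant
    return score / (n * (n - 1) / 2)

def plug_in_correlation(columns):
    """Plug-in estimate with entries sin(pi/2 * tau_jk) and unit diagonal."""
    d = len(columns)
    return [
        [
            1.0 if j == k
            else math.sin(math.pi / 2 * kendall_tau(columns[j], columns[k]))
            for k in range(d)
        ]
        for j in range(d)
    ]

# Made-up sample of four observations on three coordinates.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]   # one discordant pair with x, so tau = 2/3
z = [4, 3, 2, 1]   # perfectly discordant with x, so tau = -1

sigma_hat = plug_in_correlation([x, y, z])
for row in sigma_hat:
    print([round(v, 4) for v in row])
```

Because Kendall's tau depends only on ranks, this estimate is unchanged by monotone transformations of each coordinate, which is why it suits a model whose marginals are left unspecified.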