2012 Fall - McGill Statistics Seminars
  • What percentage of children in the U.S. are eating a healthy diet? A statistical approach

    Date: 2012-12-14

    Time: 14:30-15:30

    Location: Concordia, Room LB 921-04

    Abstract:

    In the United States, the preferred method of obtaining dietary intake data is the 24-hour dietary recall, yet the measure of most interest is usual or long-term average daily intake, which is impossible to measure directly. Thus, usual dietary intake is assessed with considerable measurement error. Also, diet comprises numerous foods, nutrients and other components, each of which has distinctive attributes. Sometimes it is useful to examine intake of these components separately, but increasingly nutritionists are interested in exploring them collectively to capture overall dietary patterns and their effect on various diseases. Consumption of these components varies widely: some are consumed by almost everyone nearly every day, while others are consumed only episodically, so that 24-hour recall data are zero-inflated. In addition, the components are often correlated with each other. Finally, it is often preferable to analyze the amount of a dietary component relative to the amount of energy (calories) in a diet, because dietary recommendations often vary with energy level.
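
    A minimal sketch of the zero-inflated structure described above, using simulated 24-hour recall data for an episodically consumed component; the two-part decomposition, parameter values and variable names are illustrative and do not reproduce the speaker's measurement-error model.

        # Minimal two-part sketch (not the speaker's model): 24-hour recall data for an
        # episodically consumed component. Part 1 is the probability of any consumption
        # on a given day; part 2 is the amount consumed given consumption. All parameter
        # values and names are illustrative.
        import numpy as np

        rng = np.random.default_rng(0)
        n_people, n_recalls = 500, 2        # two 24-hour recalls per person

        p_consume = 0.3                     # chance the component is eaten on a given day
        mean_log_amount, sd_log_amount = 3.0, 0.8

        eaten = rng.random((n_people, n_recalls)) < p_consume
        amount = np.where(eaten,
                          rng.lognormal(mean_log_amount, sd_log_amount, (n_people, n_recalls)),
                          0.0)

        # Naive moment estimates of the two parts (a full measurement-error model, unlike
        # this sketch, would also separate within-person from between-person variation).
        p_hat = eaten.mean()
        amount_hat = amount[eaten].mean()
        print(f"P(consumption) ~ {p_hat:.2f}, mean amount given consumption ~ {amount_hat:.1f}")
        print(f"naive average daily intake ~ {p_hat * amount_hat:.1f}")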

  • Sample size and power determination for multiple comparison procedures aiming at rejecting at least r among m false hypotheses

    Date: 2012-12-07

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    Multiple testing problems arise in a variety of situations, notably in clinical trials with multiple endpoints. In such cases, it is often of interest to reject either all hypotheses or at least one of them. More generally, the question arises as to whether one can reject at least r out of m hypotheses. Statistical tools addressing this issue are rare in the literature. In this talk, I will recall well-known hypothesis testing concepts, both in a single- and in a multiple-hypothesis context. I will then present general power formulas for three important multiple comparison procedures: the Bonferroni and Hochberg procedures, as well as Holm’s sequential procedure. Next, I will describe an R package that we developed for sample size calculations in multiple-endpoint trials where it is desired to reject at least r out of m hypotheses. The package covers the case where all the variables are continuous, under four common variance-covariance patterns. I will show how to use this package to compute the sample size needed in a real-life application.
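
    A small Monte Carlo sketch of the power question the talk addresses, computed for a Bonferroni procedure on m correlated normal endpoints; the effect size, compound-symmetry correlation and noncentrality approximation are assumptions for illustration, and this is not the speaker's R package.

        # Illustrative Monte Carlo sketch (not the speaker's R package): power of a
        # Bonferroni procedure to reject at least r of m one-sided hypotheses, for m
        # correlated normal endpoints with a common standardized effect and
        # compound-symmetry correlation. Effect size, rho and alpha are assumptions.
        import numpy as np
        from scipy import stats

        def power_at_least_r(n, m=4, r=2, effect=0.3, rho=0.5, alpha=0.05,
                             n_sim=20000, seed=1):
            rng = np.random.default_rng(seed)
            corr = np.full((m, m), rho) + (1 - rho) * np.eye(m)   # correlation of the z-statistics
            mean = np.full(m, effect * np.sqrt(n))                # approximate noncentrality
            z = rng.multivariate_normal(mean, corr, size=n_sim)
            z_crit = stats.norm.ppf(1 - alpha / m)                # Bonferroni critical value
            return np.mean((z > z_crit).sum(axis=1) >= r)

        # Smallest n giving roughly 80% power to reject at least 2 of 4 hypotheses.
        for n in range(20, 200, 5):
            if power_at_least_r(n) >= 0.80:
                print("approximate sample size:", n)
                break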

  • Sharing confidential datasets using differential privacy

    Date: 2012-11-30

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    While statistical agencies would like to share their data with researchers, they must also protect the confidentiality of the data provided by their respondents. To satisfy these two conflicting objectives, agencies use various techniques to restrict and modify the data before publication. Most of these techniques, however, share a common flaw: their confidentiality protection cannot be rigorously measured. In this talk, I will present the criterion of differential privacy, a rigorous measure of the protection offered by such methods. Designed to guarantee confidentiality even in a worst-case scenario, differential privacy protects the information of any individual in the database against an adversary with complete knowledge of the rest of the dataset. I will first give a brief overview of recent and current research on the topic of differential privacy. I will then focus on the publication of differentially private synthetic contingency tables and present some of my results on the methods for the generation and proper analysis of such datasets.
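
    As a minimal illustration of the differential privacy criterion (not the speaker's synthetic-data method), the sketch below applies the Laplace mechanism to a small contingency table of counts, assuming each record contributes to exactly one cell so the table has sensitivity 1.

        # Minimal sketch of the Laplace mechanism on a table of counts (illustrative;
        # not the speaker's synthetic-data method). With each record contributing to
        # exactly one cell, the table has L1 sensitivity 1, so Laplace(1/eps) noise on
        # every cell yields eps-differential privacy.
        import numpy as np

        rng = np.random.default_rng(0)
        table = np.array([[120., 45.],
                          [ 60., 30.]])     # true counts, e.g. exposure by outcome

        def laplace_mechanism(counts, eps):
            return counts + rng.laplace(loc=0.0, scale=1.0 / eps, size=counts.shape)

        private_table = laplace_mechanism(table, eps=0.5)
        print(np.round(private_table, 1))
        # A proper analysis of the released table should account for the added noise,
        # e.g. by treating it as known measurement error.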

  • A nonparametric Bayesian model for local clustering

    Date: 2012-11-23

    Time: 14:30-15:30

    Location: BURN 107

    Abstract:

    We propose a nonparametric Bayesian local clustering (NoB-LoC) approach for heterogeneous data. Using genomics data as an example, the NoB-LoC clusters genes into gene sets and simultaneously creates multiple partitions of samples, one for each gene set. In other words, the sample partitions are nested within the gene sets. Inference is guided by a joint probability model on all random elements. Biologically, the model formalizes the notion that biological samples cluster differently with respect to different genetic processes, and that each process is related to only a small subset of genes. This local structure differs in an important way from global clustering approaches, such as hierarchical clustering, which create a single partition of the samples that applies to all genes in the data set. Furthermore, the NoB-LoC includes a special cluster of genes that do not give rise to any meaningful partition of samples. These genes could be irrelevant to the disease conditions under investigation. Similarly, for a given gene set, the NoB-LoC includes a subset of samples that do not co-cluster with other samples. The samples in this special cluster could, for example, be those whose disease subtype is not characterized by the particular gene set.
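
    A toy illustration of the nested structure the model infers, written as a plain data structure rather than as the NoB-LoC probability model; all gene, sample and cluster labels below are made up.

        # Toy illustration (not the NoB-LoC model itself) of the kind of nested structure
        # it infers: genes are grouped into gene sets, each gene set carries its own
        # partition of the samples, one gene set collects "inactive" genes that induce no
        # meaningful sample partition, and some samples may be left unclustered for a
        # given gene set.
        gene_sets = {
            "set_1": {"genes": ["g1", "g4", "g7"],
                      "sample_partition": {"subtype_A": ["s1", "s2"],
                                           "subtype_B": ["s3", "s5"],
                                           "unclustered": ["s4"]}},
            "set_2": {"genes": ["g2", "g6"],
                      "sample_partition": {"subtype_A": ["s1", "s3", "s4"],
                                           "subtype_B": ["s2", "s5"],
                                           "unclustered": []}},
            "inactive": {"genes": ["g3", "g5"],          # genes with no informative partition
                         "sample_partition": None},
        }

        # A global method such as hierarchical clustering would instead return a single
        # sample partition shared by every gene.
        for name, gs in gene_sets.items():
            print(name, gs["genes"], gs["sample_partition"])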

  • Copula-based regression estimation and inference

    Date: 2012-11-16

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    In this paper we investigate a new approach to estimating a regression function based on copulas. The main idea behind this approach is to write the regression function in terms of a copula and marginal distributions. Once the copula and the marginal distributions are estimated, we use the plug-in method to construct the new estimator. Because various methods are available in the literature for estimating both a copula and a distribution, this idea provides a rich and flexible alternative to many existing regression estimators. We provide some asymptotic results for this copula-based regression modeling when the copula is estimated via profile likelihood and the marginals are estimated nonparametrically. We also study the finite sample performance of the estimator and illustrate its usefulness by analyzing data from air pollution studies.
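
    A brief numerical sketch of the plug-in idea, with the copula and marginals taken as known (Gaussian copula, standard normal marginals) rather than estimated as in the paper; in that special case E[Y | X = x] = rho * x, which provides a check on the integral.

        # Numerical check of the plug-in representation (copula and marginals taken as
        # known here): E[Y | X = x] = integral of y * c(F_X(x), F_Y(y)) dF_Y(y).
        # With a Gaussian copula and standard normal marginals this equals rho * x.
        import numpy as np
        from scipy import stats, integrate

        rho = 0.6

        def gaussian_copula_density(u, v, rho):
            a, b = stats.norm.ppf(u), stats.norm.ppf(v)
            return np.exp(-(rho**2 * (a**2 + b**2) - 2 * rho * a * b)
                          / (2 * (1 - rho**2))) / np.sqrt(1 - rho**2)

        def copula_regression(x):
            u = stats.norm.cdf(x)
            def integrand(y):
                return y * gaussian_copula_density(u, stats.norm.cdf(y), rho) * stats.norm.pdf(y)
            return integrate.quad(integrand, -6, 6)[0]

        x = 1.5
        print(copula_regression(x), rho * x)   # both approximately 0.9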

  • The multidimensional edge: Seeking hidden risks

    Date: 2012-11-09

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    Assessing tail risks with the asymptotic models provided by multivariate extreme value theory carries a danger: when asymptotic independence is present (as with the Gaussian copula model), the asymptotic model assigns probability zero to joint tail regions. In diverse applications such as finance, telecommunications, insurance and environmental science, it may be difficult to believe in the absence of risk contagion. This problem can be partly ameliorated by using hidden regular variation, which assumes lower-order asymptotic behavior on a subcone of the state space. The theory can be made more flexible by extensions in the following directions: (i) higher dimensions than two; (ii) lower-order variation on a subcone of an extreme value type different from regular variation; and (iii) extending the concept to search for lower-order behavior on the complement of the support of the limit measure of regular variation. We discuss some challenges and potential applications of this ongoing effort.
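
    A small simulation of the motivating problem, comparing empirical joint tail behavior under a Gaussian copula (asymptotic independence) and a bivariate Student-t model (asymptotic dependence); thresholds, sample size and parameters are arbitrary, and none of the hidden-regular-variation machinery from the talk is reproduced.

        # Comparing empirical joint tails under asymptotic independence (Gaussian) and
        # asymptotic dependence (Student-t with the same correlation); all settings are
        # arbitrary choices for the illustration.
        import numpy as np

        rng = np.random.default_rng(0)
        n, rho, df = 1_000_000, 0.7, 4
        cov = np.array([[1.0, rho], [rho, 1.0]])

        z = rng.multivariate_normal([0.0, 0.0], cov, size=n)      # Gaussian pair
        t = z / np.sqrt(rng.chisquare(df, size=(n, 1)) / df)      # Student-t pair

        for sample, label in [(z, "Gaussian "), (t, "Student-t")]:
            u = np.argsort(np.argsort(sample, axis=0), axis=0) / n   # ranks -> uniform margins
            for q in (0.95, 0.99, 0.999):
                both = np.mean((u[:, 0] > q) & (u[:, 1] > q))
                print(f"{label} q={q}: P(both extreme | one extreme) ~ {both / (1 - q):.3f}")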

  • Multivariate extremal dependence: Estimation with bias correction

    Date: 2012-11-02

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    Estimating extreme risks in a multivariate framework is closely connected with the estimation of the extremal dependence structure. This structure can be described via the stable tail dependence function L, for which several estimators have been introduced. Asymptotic normality is available for empirical estimates of L, with rate of convergence √k, where k denotes the number of upper order statistics used in the estimation. Choosing a larger k can improve the accuracy of the estimation, but may also increase the asymptotic bias. We provide a bias correction procedure for the estimation of L: estimators of L are combined in such a way that the asymptotic bias term disappears. The new estimator of L is shown to allow more flexibility in the choice of k. Its asymptotic behavior is examined, and a simulation study is provided to assess its small-sample behavior. This is joint work with Cécile Mercadier (Université Lyon 1) and Laurens de Haan (Erasmus University Rotterdam).
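
    A minimal sketch of the rank-based empirical estimator of L on simulated Gaussian data, illustrating the role of k; the bias-corrected combination of estimators proposed in the talk is not reproduced here.

        # Rank-based empirical estimator of the stable tail dependence function L,
        # evaluated at (1, 1) on simulated bivariate Gaussian data (rho = 0.5);
        # L(1,1) = 2 under independence and 1 under complete dependence.
        import numpy as np

        def empirical_L(data, k, x, y):
            n = data.shape[0]
            rank = np.argsort(np.argsort(data, axis=0), axis=0) + 1   # ranks 1..n per margin
            exceed = (rank[:, 0] > n - k * x) | (rank[:, 1] > n - k * y)
            return exceed.sum() / k

        rng = np.random.default_rng(0)
        z = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=5000)

        for k in (50, 200, 500):   # larger k: smaller variance, potentially larger bias
            print(k, empirical_L(z, k, 1.0, 1.0))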

  • Simulation model calibration and prediction using outputs from multi-fidelity simulators

    Date: 2012-10-26

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    Computer simulators are widely used to describe physical processes in lieu of physical observations. In some cases, more than one computer code can be used to explore the same physical system, each with a different degree of fidelity. In this work, we combine field observations with model runs from deterministic multi-fidelity computer simulators to build a predictive model for the real process. The resulting model can be used to perform sensitivity analysis for the system and to make predictions with associated measures of uncertainty. Our approach is Bayesian and will be illustrated through a simple example, as well as a real application in predictive science at the Center for Radiative Shock Hydrodynamics at the University of Michigan.
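
    A hedged sketch of one common way to link two fidelities, an autoregressive co-kriging structure in which the expensive code is modelled as a scaled cheap code plus a discrepancy term; the toy simulators, the fixed scaling factor and the use of scikit-learn Gaussian processes are assumptions for this illustration, not the speakers' Bayesian calibration with field data.

        # Autoregressive two-fidelity sketch: f_high(x) is modelled as rho * f_low(x)
        # plus a discrepancy term, each part a Gaussian process. Toy simulators, a
        # known rho, and scikit-learn GPs are assumptions made for this illustration.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        def f_low(x):  return np.sin(8 * x)               # cheap, biased simulator
        def f_high(x): return np.sin(8 * x) + 0.3 * x     # expensive simulator

        x_low = np.linspace(0, 1, 25)[:, None]            # many cheap runs
        x_high = np.linspace(0, 1, 6)[:, None]            # few expensive runs

        gp_low = GaussianProcessRegressor(RBF(0.2)).fit(x_low, f_low(x_low).ravel())

        rho = 1.0                                          # fidelity scaling, taken as known here
        resid = f_high(x_high).ravel() - rho * gp_low.predict(x_high)
        gp_delta = GaussianProcessRegressor(RBF(0.2)).fit(x_high, resid)

        x_new = np.array([[0.37]])
        print(rho * gp_low.predict(x_new) + gp_delta.predict(x_new), f_high(x_new).ravel())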

  • Observational studies in healthcare: are they any good?

    Date: 2012-10-19

    Time: 14:30-15:30

    Location: UdeM

    Abstract:

    Observational healthcare data, such as administrative claims and electronic health records, play an increasingly prominent role in healthcare. Pharmacoepidemiologic studies in particular routinely estimate temporal associations between medical product exposure and subsequent health outcomes of interest, and such studies influence prescribing patterns and healthcare policy more generally. Some authors have questioned the reliability and accuracy of such studies, but few previous efforts have attempted to measure their performance.

  • Modeling operational risk using a Bayesian approach to EVT

    Date: 2012-10-12

    Time: 14:30-15:30

    Location: BURN 1205

    Abstract:

    Extreme value theory has been widely used to assess risk from highly unusual events, either via block maxima or via the peaks-over-threshold (POT) method. One of the main drawbacks of the POT method, however, is the choice of threshold, which plays an important role since the parameter estimates depend strongly on this value. Bayesian inference is an alternative for handling these difficulties: the threshold can be treated as another parameter in the estimation, avoiding the classical empirical approach. In addition, it is possible to incorporate internal and external observations in combination with expert opinion, providing a natural, probabilistic framework in which to evaluate risk models. In this talk, we analyze operational risk data using a mixture model that combines a parametric form for the center of the distribution with a generalized Pareto distribution (GPD) for the tail, using all observations for inference about the unknown parameters of both distributions, the threshold included. A Bayesian analysis is performed, and inference is carried out through Markov chain Monte Carlo (MCMC) methods in order to determine the minimum capital requirement for operational risk.
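
    A minimal sketch of the spliced body-tail likelihood described above, with a lognormal body below the threshold u and a generalized Pareto tail above it, and u carried as a parameter of the likelihood; the distributional choices are illustrative, and the priors and MCMC sampler from the talk are not reproduced.

        # Spliced loss model: lognormal body below the threshold u, generalized Pareto
        # tail above it, with u a parameter of the likelihood. Distributional choices
        # and parameter values are illustrative.
        import numpy as np
        from scipy import stats

        def log_likelihood(params, losses):
            mu, sigma, u, xi, beta = params
            if sigma <= 0 or beta <= 0 or u <= losses.min() or u >= losses.max():
                return -np.inf
            body, tail = losses[losses <= u], losses[losses > u]
            p_tail = stats.lognorm.sf(u, s=sigma, scale=np.exp(mu))   # body mass beyond u
            ll_body = stats.lognorm.logpdf(body, s=sigma, scale=np.exp(mu)).sum()
            ll_tail = (np.log(p_tail)
                       + stats.genpareto.logpdf(tail, c=xi, loc=u, scale=beta)).sum()
            return ll_body + ll_tail

        rng = np.random.default_rng(0)
        losses = rng.lognormal(mean=1.0, sigma=0.9, size=2000)        # simulated losses
        print(log_likelihood((1.0, 0.9, np.quantile(losses, 0.95), 0.2, 1.5), losses))
        # A Bayesian analysis places priors on (mu, sigma, u, xi, beta), explores the
        # posterior by MCMC, and reads off capital as a high posterior quantile.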