CRM-Colloquium - McGill Statistics Seminars
  • Structure learning for extremal graphical models

    Date: 2022-02-18

    Time: 15:30-16:30 (Montreal time)

    https://umontreal.zoom.us/j/85105423917?pwd=enM3MGpFNkZKU2daMjRITmo0N0JUUT09

    Meeting ID: 851 0542 3917

    Passcode: 403790

    Abstract:

    Extremal graphical models are sparse statistical models for multivariate extreme events. The underlying graph encodes conditional independencies and enables a visual interpretation of the complex extremal dependence structure. For the important case of tree models, we provide a data-driven methodology for learning the graphical structure. We show that sample versions of the extremal correlation and a new summary statistic, which we call the extremal variogram, can be used as weights for a minimum spanning tree to consistently recover the true underlying tree. Remarkably, this implies that extremal tree models can be learned in a completely non-parametric fashion by using simple summary statistics and without the need to assume discrete distributions, existence of densities, or parametric models for marginal or bivariate distributions. Extensions to more general graphs are also discussed.
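    As a rough illustration of the tree-learning idea (not the authors' implementation), the sketch below weights each pair of variables by an empirical extremal correlation and recovers a tree via a minimum spanning tree; the function names and the exceedance level `q` are assumptions for this sketch.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def empirical_chi(x, y, q=0.95):
    # Empirical extremal correlation: chance that y is extreme given x is,
    # estimated from joint exceedances of the marginal q-quantiles.
    ux, uy = np.quantile(x, q), np.quantile(y, q)
    return np.mean((x > ux) & (y > uy)) / (1.0 - q)

def extremal_tree(X, q=0.95):
    """Strong extremal dependence -> small edge weight, then take the
    minimum spanning tree of the pairwise weight matrix."""
    d = X.shape[1]
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            chi = empirical_chi(X[:, i], X[:, j], q)
            W[i, j] = max(1.0 - chi, 1e-9)  # keep weights strictly positive
    mst = minimum_spanning_tree(W)
    return sorted(zip(*mst.nonzero()))
```

    Only summary statistics of pairwise exceedances enter the weights, which mirrors the non-parametric flavour of the result described in the abstract.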

  • Risk assessment, heavy tails, and asymmetric least squares techniques

    Date: 2022-01-28

    Time: 15:30-16:30 (Montreal time)

    https://umontreal.zoom.us/j/93983313215?pwd=clB6cUNsSjAvRmFMME1PblhkTUtsQT09

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    Statistical risk assessment, in particular in finance and insurance, requires estimating simple indicators to summarize the risk incurred in a given situation. Of particular interest is inferring extreme levels of risk so as to be able to manage high-impact rare events such as extreme climate episodes or stock market crashes. A standard procedure in this context, whether in academic, industrial or regulatory circles, is to estimate a well-chosen single quantile (or Value-at-Risk). One drawback of quantiles is that they only take into account the frequency of an extreme event, and in particular give no idea of the typical magnitude of such an event. Another issue is that they do not induce a coherent risk measure, which is a serious concern in actuarial and financial applications. In this talk, after giving a leisurely tour of extreme quantile estimation, I will explain how, starting from the formulation of a quantile as the solution of an optimization problem, one may come up with two alternative families of risk measures, called expectiles and extremiles, that address these two drawbacks. I will give a broad overview of their properties, as well as of their estimation at extreme levels in heavy-tailed models, and explain why they constitute sensible alternatives for risk assessment, using real data applications. This is based on joint work with Abdelaati Daouia, Irène Gijbels, Stéphane Girard, Simone Padoan and Antoine Usseglio-Carleve.
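    To make the expectile idea concrete: the tau-expectile is the minimizer of an asymmetrically weighted squared loss, which yields a fixed-point characterization as a weighted mean. The sketch below is a generic illustration of that characterization, not code from the talk.

```python
import numpy as np

def expectile(y, tau, tol=1e-10, max_iter=100):
    """Sample tau-expectile via iteratively reweighted means: the expectile
    is the fixed point of a weighted average with weight tau on observations
    above the current value and (1 - tau) on those below."""
    y = np.asarray(y, dtype=float)
    theta = y.mean()  # tau = 0.5 recovers the mean
    for _ in range(max_iter):
        w = np.where(y > theta, tau, 1.0 - tau)
        new = np.sum(w * y) / np.sum(w)
        if abs(new - theta) < tol:
            break
        theta = new
    return theta
```

    Raising tau pushes the expectile into the upper tail, which is what makes extreme-level expectile estimation a natural alternative to extreme quantiles.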

  • Adventures with Partial Identifications in Studies of Marked Individuals

    Date: 2021-11-26

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    Monitoring marked individuals is a common strategy in studies of wild animals (referred to as mark-recapture or capture-recapture experiments) and hard-to-track human populations (referred to as multi-list methods or multiple-systems estimation). A standard assumption of these techniques is that individuals can be identified uniquely and without error, but this can be violated in many ways. In some cases, it may not be possible to identify individuals uniquely because of the study design or the choice of marks. Other times, errors may occur so that individuals are incorrectly identified. I will discuss work with my collaborators over the past 10 years developing methods to account for problems that arise when individuals are only partially identified. I will present theoretical aspects of this research, including an introduction to the latent multinomial model and algebraic statistics, and also describe applications to studies of species ranging from the golden mantella (an endangered frog endemic to Madagascar measuring only 20 mm) to the whale shark (the largest known species of fish, measuring up to 19 m).

  • Opinionated practices for teaching reproducibility: motivation, guided instruction and practice

    Date: 2021-10-29

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    In the data science courses at the University of British Columbia, we define data science as the study, development and practice of reproducible and auditable processes to obtain insight from data. While reproducibility is core to our definition, most data science learners enter the field with other aspects of data science in mind, for example predictive modelling, which is often one of the most interesting topics to novices. This fact, along with the highly technical nature of the industry-standard reproducibility tools currently employed in data science, presents out-of-the-gate challenges in teaching reproducibility in the data science classroom. Put simply, students are not as intrinsically motivated to learn this topic, and it is not an easy one for them to learn. What can a data science educator do? Over several iterations of teaching courses focused on reproducible data science tools and workflows, we have found that providing extra motivation, guided instruction and lots of practice are key to effectively teaching this challenging, yet important subject. Here we present examples of how we deeply motivate, effectively guide and provide ample practice opportunities to data science students to effectively engage them in learning about this topic.

  • Deep down, everyone wants to be causal

    Date: 2021-09-24

    Time: 15:00-16:00 (Montreal time)

    https://mcgill.zoom.us/j/9791073141

    Meeting ID: 979 107 3141


  • Nonparametric Tests for Informative Selection in Complex Surveys

    Date: 2021-03-12

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    Informative selection, in which the distribution of response variables given that they are sampled is different from their distribution in the population, is pervasive in complex surveys. Failing to take such informativeness into account can produce severe inferential errors, including biased and inconsistent estimation of population parameters. While several parametric procedures exist to test for informative selection, these methods are limited in scope and their parametric assumptions are difficult to assess. We consider two classes of nonparametric tests of informative selection. The first class is motivated by classic nonparametric two-sample tests. We compare weighted and unweighted empirical distribution functions and obtain tests for informative selection that are analogous to the Kolmogorov-Smirnov and Cramér-von Mises tests. For the second class of tests, we adapt a kernel-based learning method that compares distributions based on their maximum mean discrepancy. The asymptotic distributions of the test statistics are established under the null hypothesis of noninformative selection. Simulation results show that our tests have power competitive with existing parametric tests in a correctly specified parametric setting, and better than those tests under model misspecification. A recreational angling application illustrates the methodology.
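    A minimal sketch of the first class of tests (assumed details: `w` holds the survey design weights, and only the Kolmogorov-Smirnov-type contrast is shown): compare the unweighted empirical distribution function of the sampled responses with its design-weighted counterpart.

```python
import numpy as np

def ks_informative_selection(y, w):
    """KS-type statistic comparing the unweighted EDF of the sample with
    the survey-weighted EDF; a large gap suggests the selection mechanism
    depends on the response (informative selection)."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    n = len(y)
    F_unw = np.arange(1, n + 1) / n    # unweighted EDF at the sorted points
    F_w = np.cumsum(w) / np.sum(w)     # weighted EDF at the same points
    return np.max(np.abs(F_unw - F_w))
```

    Under noninformative selection the two curves estimate the same distribution, so the statistic concentrates near zero; calibrating its null distribution is the technical content the abstract refers to.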

  • Spatio-temporal methods for estimating subsurface ocean thermal response to tropical cyclones

    Date: 2021-02-12

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    Tropical cyclones (TCs), driven by heat exchange between the air and sea, pose a substantial risk to many communities around the world. Accurate characterization of the subsurface ocean thermal response to TC passage is crucial for accurate TC intensity forecasts and for understanding the role TCs play in the global climate system, yet that characterization is complicated by the high-noise ocean environment, correlations inherent in spatio-temporal data, relative scarcity of in situ observations and the entanglement of the TC-induced signal with seasonal signals. We present a general methodological framework that addresses these difficulties, integrating existing techniques in seasonal mean field estimation, Gaussian process modeling, and nonparametric regression into a functional ANOVA model. Importantly, we improve upon past work by properly handling seasonality, providing rigorous uncertainty quantification, and treating time as a continuous variable, rather than producing estimates that are binned in time. This functional ANOVA model is estimated using in situ subsurface temperature profiles from the Argo fleet of autonomous floats through a multi-step procedure, which (1) characterizes the upper ocean seasonal shift during the TC season; (2) models the variability in the temperature observations; (3) fits a thin plate spline using the variability estimates to account for heteroskedasticity and correlation between the observations. This spline fit reveals the ocean thermal response to TC passage. Through this framework, we obtain new scientific insights into the interaction between TCs and the ocean on a global scale, including a three-dimensional characterization of the near-surface and subsurface cooling along the TC storm track and the mixing-induced subsurface warming on the track’s right side. Joint work with Addison Hu, Ann Lee, Donata Giglio and Kimberly Wood.
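    Step (3) of the procedure can be loosely illustrated with an off-the-shelf thin plate spline; the synthetic coordinates, the anomaly signal, and the use of a per-observation `smoothing` argument as a stand-in for the variability-based weighting are all assumptions of this sketch, not the authors' code.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(1)
# Hypothetical design: (days since TC passage, depth in m) coordinates and
# temperature anomalies whose signal decays with time, observed with noise.
coords = rng.uniform(low=[-2.0, 0.0], high=[20.0, 200.0], size=(300, 2))
signal = -0.5 * np.exp(-coords[:, 0] / 10.0)
anomaly = signal + 0.05 * rng.normal(size=300)

# Thin plate spline fit; the per-point smoothing vector plays the role of
# the heteroskedasticity weighting described in the talk (constant here).
noise_sd = np.full(300, 0.05)
spline = RBFInterpolator(coords, anomaly, kernel="thin_plate_spline",
                         smoothing=noise_sd)
fitted = spline(coords)  # smoothed estimate of the thermal response
```

    In the real analysis the variability estimates from step (2) would vary over space and time, downweighting noisy Argo profiles instead of a constant.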

  • Small Area Estimation in Low- and Middle-Income Countries

    Date: 2021-01-29

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    The under-five mortality rate (U5MR) is a key barometer of the health of a nation. Unfortunately, many people living in low- and middle-income countries are not covered by civil registration systems. This makes estimation of the U5MR, particularly at the subnational level, difficult. In this talk, I will describe models that have been developed to produce the official United Nations (UN) subnational U5MR estimates in 22 countries. Estimation is based on household surveys, which use stratified, two-stage cluster sampling. I will describe a range of area- and unit-level models and describe the rationale for the modeling we carry out. Data sparsity in time and space is a key challenge, and smoothing models are vital. I will discuss the advantages and disadvantages of discrete and continuous spatial models, in the context of estimation at the scale at which health interventions are made. Other issues that will be touched upon include: design-based versus model-based inference; adjustments for HIV epidemics; the inclusion of so-called indirect (summary birth history) data; reproducibility through software availability; benchmarking; how to deal with incomplete geographical data; and working with the UN to produce estimates.

  • Approximate Cross-Validation for Large Data and High Dimensions

    Date: 2020-11-13

    Time: 15:30-16:30

    Zoom Link

    Abstract:

    The error or variability of statistical and machine learning algorithms is often assessed by repeatedly re-fitting a model with different weighted versions of the observed data. The ubiquitous tools of cross-validation (CV) and the bootstrap are examples of this technique. These methods are powerful in large part due to their model agnosticism but can be slow to run on modern, large data sets due to the need to repeatedly re-fit the model. We use a linear approximation to the dependence of the fitting procedure on the weights, producing results that can be faster than repeated re-fitting by orders of magnitude. This linear approximation is sometimes known as the “infinitesimal jackknife” (IJ) in the statistics literature, where it has mostly been used as a theoretical tool to prove asymptotic results. We provide explicit finite-sample error bounds for the infinitesimal jackknife in terms of a small number of simple, verifiable assumptions. Without further modification, though, we note that the IJ deteriorates in accuracy in high dimensions and incurs a running time roughly cubic in dimension. We additionally show, then, how dimensionality reduction can be used to successfully run the IJ in high dimensions when data is sparse or low rank. Simulated and real-data experiments support our theory.
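    For a concrete (hypothetical) instance of the idea: with ridge regression, the IJ linearization yields approximate leave-one-out predictions from a single fit, because the derivative of the estimate with respect to an observation's weight has a closed form. The sketch below is an illustration of the technique, not the paper's implementation.

```python
import numpy as np

def approx_loo_ridge(X, y, lam=1.0):
    """Infinitesimal-jackknife approximation to leave-one-out CV error for
    ridge regression: one fit plus a first-order correction per point,
    instead of n re-fits."""
    n, p = X.shape
    H = X.T @ X + lam * np.eye(p)      # Hessian of the ridge objective
    Hinv = np.linalg.inv(H)
    theta = Hinv @ X.T @ y             # full-data fit
    resid = y - X @ theta
    loo_pred = np.empty(n)
    for i in range(n):
        # Dropping point i (weight 1 -> 0) shifts theta by -Hinv @ x_i * r_i.
        theta_i = theta - Hinv @ (X[i] * resid[i])
        loo_pred[i] = X[i] @ theta_i
    return np.mean((y - loo_pred) ** 2)
```

    The approximation error grows with the leverage of individual points, which is one way to see why accuracy deteriorates in high dimensions.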

  • Data Science, Classification, Clustering and Three-Way Data

    Date: 2020-10-02

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    Data science is discussed along with some historical perspective. Selected problems in classification are considered, either via specific datasets or general problem types. In each case, the problem is introduced before one or more potential solutions are discussed and applied. The problems discussed include data with outliers, longitudinal data, and three-way data. The proposed approaches are generally mixture model-based.
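    As a toy example of the mixture-model-based family the talk draws on (a generic two-component Gaussian mixture fit by EM, not any of the talk's specific models):

```python
import numpy as np

def em_gmm_1d(x, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture by EM, the simplest member
    of the model-based clustering family."""
    mu = np.array([x.min(), x.max()], dtype=float)  # spread-out initialization
    sigma = np.full(2, x.std())
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from each component
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of weights, means, variances
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma
```

    Handling outliers, longitudinal structure, or three-way data, as in the talk, amounts to replacing the Gaussian components with more flexible densities while keeping this same EM skeleton.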