2020 Winter - McGill Statistics Seminars
  • A gentle introduction to generalized structured component analysis and its recent developments

    Date: 2020-03-27

    Time: 15:30-16:30

    Location: BURNSIDE 1205

    Abstract:

    Generalized structured component analysis (GSCA) was developed as a component-based approach to structural equation modeling, in which constructs are represented by components, or weighted composites of observed variables, rather than (common) factors. Unlike another long-standing component-based approach, partial least squares path modeling, GSCA is a full-information method that optimizes a single criterion to estimate model parameters simultaneously, utilizing all information available in the entire system of equations. Over the past decade, this approach has been refined and extended in various ways to enhance its data-analytic capability. I will briefly discuss the theoretical underpinnings of GSCA and demonstrate the use of an R package for GSCA, gesca. Moreover, I will outline some recent developments in GSCA, including GSCA_M for estimating models with factors and integrated GSCA (IGSCA) for estimating models with both factors and components.
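
    Not the gesca package itself, but a toy sketch in Python (assuming numpy and scipy are available) of the core idea stated above: constructs as weighted composites of their indicator blocks, with all parameters estimated by minimizing a single least-squares criterion over the whole system. The data, model, and variable names are hypothetical.

    ```python
    # Toy GSCA-style estimation: minimize one least-squares criterion jointly.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n = 200
    eta1 = rng.normal(size=n)                              # "true" component scores
    eta2 = 0.6 * eta1 + rng.normal(scale=0.8, size=n)
    X1 = np.outer(eta1, [0.9, 0.8, 0.7]) + rng.normal(scale=0.5, size=(n, 3))
    X2 = np.outer(eta2, [0.8, 0.9]) + rng.normal(scale=0.5, size=(n, 2))
    X1 = (X1 - X1.mean(0)) / X1.std(0)                     # standardized indicators
    X2 = (X2 - X2.mean(0)) / X2.std(0)

    def criterion(theta):
        # Components are weighted composites of their indicator blocks.
        w1, w2, b = theta[:3], theta[3:5], theta[5]
        g1 = X1 @ w1
        g1 = g1 / g1.std()
        g2 = X2 @ w2
        g2 = g2 / g2.std()
        # Loadings given the components (per-block least squares).
        c1 = X1.T @ g1 / n
        c2 = X2.T @ g2 / n
        # Single criterion: measurement residuals plus structural residual.
        return (np.sum((X1 - np.outer(g1, c1)) ** 2)
                + np.sum((X2 - np.outer(g2, c2)) ** 2)
                + np.sum((g2 - b * g1) ** 2))

    fit = minimize(criterion, x0=np.ones(6), method="Nelder-Mead",
                   options={"maxiter": 5000})
    print("estimated path coefficient:", fit.x[5])
    ```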

  • Informative Prior Elicitation from Historical Individual Patient Data

    Date: 2020-03-20

    Time: 15:30-16:30

    Location: BURNSIDE 1205

    Abstract:

    Historical data from previous studies may be utilized to strengthen statistical inference. Under the Bayesian framework, the incorporation of information obtained from any source other than the current data is facilitated through the construction of an informative prior. Existing methodology for defining an informative prior based on historical data relies on measuring similarity to the current data at the study level, which can result in discarding useful individual patient data (IPD). In this talk, I present a family of priors that utilize IPD to strengthen statistical inference. IPD-based priors can be obtained as a weighted likelihood of the historical data, where each individual’s weight is a function of their distance to the current study population. It is demonstrated that the proposed prior construction approach can considerably improve estimation accuracy and precision compared with existing methods.
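
    A minimal sketch in Python (assuming numpy and scipy) of the weighted-likelihood construction described above: each historical patient contributes to the prior with a weight that depends on their distance to the current study population. The outcome model, distance, and weight function below are simple placeholder choices, not the ones from the talk.

    ```python
    # IPD-based prior as a weighted likelihood of historical data (illustrative).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    hist_x = rng.normal(2.0, 1.0, size=100)   # historical covariate (hypothetical)
    hist_y = rng.normal(1.0, 1.0, size=100)   # historical outcome (hypothetical)
    curr_x = rng.normal(2.5, 0.8, size=50)    # covariate in the current study

    # Weight each historical patient by how close their covariate is to the
    # current study population (Gaussian kernel around the current mean).
    bandwidth = curr_x.std()
    w = np.exp(-0.5 * ((hist_x - curr_x.mean()) / bandwidth) ** 2)

    def log_prior(theta, sigma=1.0):
        # log pi(theta) = sum_i w_i * log f(y_i | theta), up to a constant
        return np.sum(w * norm.logpdf(hist_y, loc=theta, scale=sigma))

    grid = np.linspace(0.0, 2.0, 201)
    print("approximate prior mode:", grid[np.argmax([log_prior(t) for t in grid])])
    ```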

  • Geometry-based Data Exploration

    Date: 2020-03-13

    Time: 15:30-16:30

    Location: BURNSIDE 1205

    Abstract:

    High-throughput data collection technologies are becoming increasingly common in many fields, especially in biomedical applications involving single-cell data (e.g., scRNA-seq and CyTOF). These technologies introduce a rising need for exploratory analysis to reveal and understand hidden structure in the collected (high-dimensional) Big Data. A crucial aspect of such analysis is the separation of intrinsic data geometry from data distribution, as (a) the latter is typically biased by collection artifacts and data availability, and (b) rare subpopulations and sparse transitions between meta-stable states are often of great interest in biomedical data analysis. In this talk, I will show several tools that leverage manifold learning, graph signal processing, and harmonic analysis for biomedical (in particular, genomic/proteomic) data exploration, with emphasis on visualization, data generation/augmentation, and nonlinear feature extraction. A common thread in the presented tools is the construction of a data-driven diffusion geometry that both captures intrinsic structure in the data and provides a generalization of Fourier harmonics on it. These, in turn, are used to process data features along the data geometry for denoising and generative purposes. Finally, I will relate this approach to the recently proposed geometric scattering transform, which generalizes Mallat’s scattering to non-Euclidean domains and provides a mathematical framework for theoretical understanding of the emerging field of geometric deep learning.
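
    A generic sketch in Python (assuming numpy and scipy) of the data-driven diffusion geometry mentioned above: build an affinity kernel, row-normalize it into a diffusion operator, and use its leading eigenvectors as Fourier-like harmonics to low-pass (denoise) a feature. This illustrates the general construction only, not the specific tools presented in the talk.

    ```python
    # Diffusion geometry on hypothetical data: noisy points along a 1-D curve in 3-D.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(2)
    t = np.sort(rng.uniform(0, 2 * np.pi, 300))
    X = np.c_[np.cos(t), np.sin(t), 0.3 * t] + 0.05 * rng.normal(size=(300, 3))

    D = squareform(pdist(X))                      # pairwise distances
    eps = np.median(D) ** 2                       # kernel bandwidth (heuristic)
    K = np.exp(-D ** 2 / eps)                     # Gaussian affinity kernel
    P = K / K.sum(axis=1, keepdims=True)          # row-stochastic diffusion operator

    # Eigenvectors of P act as Fourier-like harmonics on the data geometry.
    vals, vecs = np.linalg.eig(P)
    harmonics = vecs[:, np.argsort(-vals.real)].real

    # Denoise a noisy feature by keeping only its low-frequency components.
    feature = np.sin(t) + 0.5 * rng.normal(size=300)
    coeffs = np.linalg.lstsq(harmonics[:, :10], feature, rcond=None)[0]
    denoised = harmonics[:, :10] @ coeffs
    print("residual variance after low-pass filtering:", np.var(feature - denoised))
    ```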

  • Neyman-Pearson classification: parametrics and sample size requirement

    Date: 2020-02-28

    Time: 15:30-16:30

    Location: BURNSIDE 1104

    Abstract:

    The Neyman-Pearson (NP) paradigm in binary classification seeks classifiers that achieve a minimal type II error while keeping the prioritized type I error below some user-specified level alpha. This paradigm arises naturally in applications such as severe disease diagnosis and spam detection, where there are clear priorities between the two error types. Recently, Tong, Feng and Li (2018) proposed a nonparametric umbrella algorithm that adapts all scoring-type classification methods (e.g., logistic regression, support vector machines, random forest) to respect the given upper bound alpha on the type I error (i.e., the conditional probability of classifying a class 0 observation as class 1 under the 0-1 coding) with high probability, without specific distributional assumptions on the features and the responses. Universal as the umbrella algorithm is, it demands an explicit minimum sample size requirement on class 0, which is often the scarcer class, as in rare disease diagnosis applications. In this work, we employ the parametric linear discriminant analysis (LDA) model and propose a new parametric thresholding algorithm that does not need a minimum sample size requirement on class 0 observations and is thus suitable for small-sample applications such as rare disease diagnosis. Leveraging both the existing nonparametric and the newly proposed parametric thresholding rules, we propose four LDA-based NP classifiers for both low- and high-dimensional settings. On the theoretical front, we prove NP oracle inequalities for one proposed classifier, where the rate for the excess type II error benefits from the explicit parametric model assumption. Furthermore, as NP classifiers involve a sample-splitting step on class 0 observations, we construct a new adaptive sample-splitting scheme that can be applied universally to NP classifiers, and this adaptive strategy reduces their type II error. The proposed NP classifiers are implemented in the R package nproc.
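
    A sketch in Python (assuming numpy and scipy) of an umbrella-style threshold choice: take an order statistic of held-out class-0 scores such that the type I error exceeds alpha only with probability at most delta. This illustrates the thresholding idea; it is not the implementation in the nproc package.

    ```python
    # Order-statistic threshold with high-probability type I error control.
    import numpy as np
    from scipy.stats import binom

    def np_threshold(class0_scores, alpha=0.05, delta=0.05):
        s = np.sort(class0_scores)
        n = len(s)
        for k in range(1, n + 1):
            # P(type I error > alpha) when thresholding at the k-th order statistic
            # equals P(Binomial(n, 1 - alpha) >= k).
            if binom.sf(k - 1, n, 1 - alpha) <= delta:
                return s[k - 1]
        raise ValueError("class 0 sample too small for the requested alpha and delta")

    rng = np.random.default_rng(3)
    scores0 = rng.normal(size=200)            # held-out class-0 scores (hypothetical)
    thr = np_threshold(scores0)
    print("classify as class 1 when the score exceeds", thr)
    ```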

  • Non-central squared copulas: properties and applications

    Date: 2020-02-21

    Time: 15:30-16:30

    Location: BURNSIDE 1205

    Abstract:

    The goal of this presentation is to introduce new families of multivariate copulas, extending the chi-square copulas, the Fisher copula, and squared copulas. The new families are constructed from existing copulas by first transforming their margins to standard Gaussian distributions, then transforming these variables into non-central chi-square variables with one degree of freedom, and finally considering the copula associated with these new variables. It is shown that by varying the non-centrality parameters, one can model non-monotonic dependence, and that when one or several non-centrality parameters lie outside a given hyper-rectangle, the copula is almost the same as the one obtained when these parameters are infinite. For these new families, the tail behavior and the monotonicity of dependence measures such as Kendall’s tau and Spearman’s rho are investigated, and estimation is discussed. Some examples will illustrate the usefulness of these new copula families.
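
    A small sketch in Python (assuming numpy and scipy) of the construction described above: sample from an existing copula, push the margins to standard Gaussians, shift by non-centrality parameters and square to obtain non-central chi-square variables with one degree of freedom, then take the copula of the result. The starting copula and parameter values are illustrative.

    ```python
    # Simulating from a non-central squared copula built on a Gaussian copula.
    import numpy as np
    from scipy.stats import norm, kendalltau

    rng = np.random.default_rng(4)
    n, rho = 5000, 0.7

    # Step 1: a sample (U1, U2) from an existing copula (here, Gaussian, correlation rho).
    Z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    U = norm.cdf(Z)

    # Step 2: standard Gaussian margins, shifted by a1, a2 and squared, giving
    # non-central chi-square(1) variables with non-centrality a1^2 and a2^2.
    a1, a2 = 0.5, 2.0
    X1 = (norm.ppf(U[:, 0]) + a1) ** 2
    X2 = (norm.ppf(U[:, 1]) + a2) ** 2

    # Step 3: the copula of (X1, X2) is the non-central squared copula; its
    # dependence changes with (a1, a2) even though rho is fixed.
    print("Kendall's tau of the transformed pair:", kendalltau(X1, X2)[0])
    ```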

  • Sharing Sustainable Mobility in Smart Cities

    Date: 2020-02-14

    Time: 15:30-16:30

    Location: BURNSIDE 1205

    Abstract:

    Many cities worldwide are embracing electric vehicle (EV) sharing as a flexible and sustainable means of urban transit. However, it remains challenging for operators to charge the fleet due to limited or costly access to charging facilities. In this work, we focus on answering the core question: how to charge the fleet to make EV sharing viable and profitable. Our work is motivated by the recent setback that struck San Diego, California, where car2go ceased its EV sharing operations. We integrate charging infrastructure planning and vehicle repositioning operations, which were often considered separately in the literature. More interestingly, our modeling emphasizes the operator-controlled charging operations and customers’ EV-picking behavior, both of which are central to EV sharing but have been largely overlooked. Motivated by actual car2go data, our model explicitly characterizes how customers endogenously pick EVs based on energy levels, and how the operator dispatches EV charging under a targeted charging policy. We formulate the integrated model as a nonlinear optimization program with fractional constraints. We then develop both lower- and upper-bound formulations as mixed-integer second-order cone programs, which are computationally tractable with a small optimality gap. Contrary to car2go’s practice, we find that the viability of EV sharing can be enhanced by concentrating limited charger resources at selected locations. Charging EVs proactively (rather than following car2go’s policy of charging EVs only when their energy level drops below 20%) can boost profit by 10.7%. Given the demand profile in San Diego, the fleet size may be reduced by up to 34% without incurring a significant profit loss. Moreover, sufficient charger availability is crucial when collaborating with a public charger network. Finally, increasing the charging power relieves the charger resource constraint, whereas extending per-charge range or adopting unmanned repositioning improves profitability. In summary, our work demonstrates a data-verified, high-granularity modeling approach. Both the high-level planning guidelines and the operational policies can be useful for practitioners. We also highlight the value of jointly managing demand fulfilment and EV charging.
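
    A standard route from fractional constraints to second-order cone programs is that a quadratic-over-linear term is cone-representable: for y > 0, x^2 / y <= t holds exactly when ||(2x, y - t)||_2 <= y + t. The quick numerical check below (Python, assuming numpy) verifies this identity; it is general background, not the EV-sharing model itself.

    ```python
    # Check the SOC representation of a fractional constraint on random points.
    import numpy as np

    rng = np.random.default_rng(7)
    for _ in range(5):
        x = rng.normal()
        y = rng.uniform(0.1, 2.0)
        t = rng.uniform(0.0, 2.0)
        fractional = x ** 2 / y <= t
        second_order_cone = np.hypot(2 * x, y - t) <= y + t
        print(fractional, second_order_cone)   # the two conditions agree
    ```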

  • Adapting black-box machine learning methods for causal inference

    Date: 2020-01-31

    Time: 15:30-16:30

    Location: BURNSIDE 1104

    Abstract:

    I’ll discuss the use of observational data to estimate the causal effect of a treatment on an outcome. This task is complicated by the presence of “confounders” that influence both treatment and outcome, inducing observed associations that are not causal. Causal estimation is achieved by adjusting for this confounding using observed covariate information. I’ll discuss the case where we observe covariates that carry sufficient information for the adjustment, but where explicit models relating treatment, outcome, covariates, and confounding are not available.
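
    A generic plug-in adjustment sketch in Python (assuming numpy and scikit-learn), not the speaker's method: fit a flexible, black-box outcome model E[Y | T, X] and average its predictions under treatment and control (the g-formula) on simulated data with a known effect.

    ```python
    # Covariate adjustment with a black-box regressor on simulated data.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(5)
    n = 2000
    X = rng.normal(size=(n, 3))                       # observed confounders
    propensity = 1 / (1 + np.exp(-X[:, 0]))           # treatment depends on X
    T = rng.binomial(1, propensity)
    Y = 2.0 * T + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # true effect = 2

    outcome_model = GradientBoostingRegressor().fit(np.c_[T, X], Y)
    effect = np.mean(outcome_model.predict(np.c_[np.ones(n), X])
                     - outcome_model.predict(np.c_[np.zeros(n), X]))
    print("adjusted effect estimate:", round(effect, 2))
    ```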

  • Estimation and inference for changepoint models

    Date: 2020-01-13

    Time: 15:30-16:30

    Location: BURNSIDE 1205

    Abstract:

    This talk is motivated by statistical challenges that arise in the analysis of calcium imaging data, a new technology in neuroscience that makes it possible to record from huge numbers of neurons at single-neuron resolution. In the first part of this talk, I will consider the problem of estimating a neuron’s spike times from calcium imaging data. A simple and natural model suggests a non-convex optimization problem for this task. I will show that by recasting the non-convex problem as a changepoint detection problem, we can efficiently solve it for the global optimum using a clever dynamic programming strategy.
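
    A generic changepoint dynamic program in Python (assuming numpy), showing the optimal-partitioning recursion that makes this kind of global optimization efficient; the spike-estimation problem in the talk uses a related but more specialized recursion that is not reproduced here.

    ```python
    # Optimal partitioning: minimize within-segment squared error + lam per changepoint.
    import numpy as np

    def optimal_partitioning(y, lam):
        n = len(y)
        csum = np.cumsum(np.r_[0.0, y])
        csum2 = np.cumsum(np.r_[0.0, y ** 2])

        def seg_cost(i, j):                # squared error of y[i:j] around its mean
            s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
            return s2 - s ** 2 / m

        F = np.full(n + 1, np.inf)         # F[t] = best cost of y[:t]
        F[0] = -lam
        last = np.zeros(n + 1, dtype=int)  # optimal start of the segment ending at t
        for t in range(1, n + 1):
            costs = [F[s] + seg_cost(s, t) + lam for s in range(t)]
            last[t] = int(np.argmin(costs))
            F[t] = costs[last[t]]
        cps, t = [], n                     # backtrack the changepoints
        while t > 0:
            t = last[t]
            if t > 0:
                cps.append(t)
        return sorted(cps)

    rng = np.random.default_rng(6)
    y = np.r_[rng.normal(0, 1, 100), rng.normal(3, 1, 100), rng.normal(1, 1, 100)]
    print("estimated changepoints:", optimal_partitioning(y, lam=20.0))
    ```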