/post/index.xml Past Seminar Series - McGill Statistics Seminars
  • Test-based integrative analysis of randomized trial and real-world data for treatment heterogeneity estimation

    Date: 2020-10-30

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Parallel randomized clinical trial (RCT) and real-world data (RWD) are becoming increasingly available for treatment evaluation. Given the complementary features of the RCT and RWD, we propose a test-based integrative analysis of the RCT and RWD for accurate and robust estimation of the heterogeneity of treatment effect (HTE), which lies at the heart of precision medicine. When the RWD are not subject to bias, e.g., due to unmeasured confounding, our approach combines the RCT and RWD for optimal estimation by exploiting semiparametric efficiency theory. Utilizing the design advantage of RTs, we construct a built-in test procedure to gauge the reliability of the RWD and decide whether or not to use RWD in an integrative analysis. We characterize the asymptotic distribution of the test-based integrative estimator under local alternatives, which provides a better approximation of the finite-sample behaviors of the test and estimator when the idealistic assumption required for the RWD is weakly violated. We provide a data-adaptive procedure to select the threshold of the test statistic that promises the smallest mean square error of the proposed estimator of the HTE. Lastly, we construct an adaptive confidence interval that has a good finite-sample coverage property. We apply the proposed method to characterize who can benefit from adjuvant chemotherapy in patients with stage IB non-small cell lung cancer.

  • Linear Regression and its Inference on Noisy Network-linked Data

    Date: 2020-10-23

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Linear regression on a set of observations linked by a network has been an essential tool in modeling the relationship between response and covariates with additional network data. Despite its wide range of applications in many areas, such as social sciences and health-related research, the problem has not been well-studied in statistics so far. Previous methods either lack of inference tools or rely on restrictive assumptions on social effects, and usually treat the network structure as precisely observed, which is too good to be true in many problems. We propose a linear regression model with nonparametric social effects. Our model does not assume the relational data or network structure to be accurately observed; thus, our method can be provably robust to a certain level of perturbation of the network structure. We establish a full set of computationally efficient asymptotic inference tools under a general requirement of the perturbation and then study the robustness of our method in the specific setting when the perturbation is from random network models. We discover a phase-transition phenomenon of inference validity concerning the network density when no prior knowledge about the network model is available, while also show the significant improvement achieved by knowing the network model. A by-product of our analysis is a rate-optimal concentration bound about subspace projection that may be of independent interest. We conduct extensive simulation studies to verify our theoretical observations and demonstrate the advantage of our method compared to a few benchmarks under different data-generating models. The method is then applied to adolescent network data to study the gender and racial differences in social activities.

  • Adaptive MCMC For Everyone

    Date: 2020-10-16

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Markov chain Monte Carlo (MCMC) algorithms, such as the Metropolis Algorithm and the Gibbs Sampler, are an extremely useful and popular method of approximately sampling from complicated probability distributions. Adaptive MCMC attempts to automatically modify the algorithm while it runs, to improve its performance on the fly. However, such adaptation often destroys the ergodicity properties necessary for the algorithm to be valid. In this talk, we first illustrate MCMC algorithms using simple graphical examples. We then discuss adaptive MCMC, and present examples and theorems concerning its ergodicity and efficiency.

  • Machine Learning and Neural Networks: Foundations and Some Fundamental Questions

    Date: 2020-10-09

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    Statistical learning theory is by now a mature branch of data science that hosts a vast variety of practical techniques for tackling data-related problems. In this talk we present some fundamental concepts upon which statistical learning theory has been based. Different approaches to statistical inference will be discussed and the main problem of learning from Vapnik’s point of view will be explained. Further we discuss the topic of function estimation as the heart of Vapnik-Chervonenkis theory. There exist several state-of-the-art methods for estimating functional dependencies, such as maximum margin estimator and artificial neural networks. While for some of these methods, e.g., the support vector machines, there has already been developed a profound theory, others require more investigation. Accordingly, we pay a closer attention to the so-called mapping neural networks and try to shed some light on certain theoretical aspects of them. We highlight some of the fundamental challenges that have attracted the attention of researcher and they are yet to be fully resolved. One of these challenges is estimation of the intrinsic dimension of data that will be discussed in detail. Another challenge is inferring causal direction when the training data set is not representative of the target population.

  • Data Science, Classification, Clustering and Three-Way Data

    Date: 2020-10-02

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    Data science is discussed along with some historical perspective. Selected problems in classification are considered, either via specific datasets or general problem types. In each case, the problem is introduced before one or more potential solutions are discussed and applied. The problems discussed include data with outliers, longitudinal data, and three-way data. The proposed approaches are generally mixture model-based.

  • Large-scale Network Inference

    Date: 2020-09-25

    Time: 14:00-15:00

    Zoom Link

    Meeting ID: 939 4707 7997

    Passcode: no password

    Abstract:

    Network data is prevalent in many contemporary big data applications in which a common interest is to unveil important latent links between different pairs of nodes. Yet a simple fundamental question of how to precisely quantify the statistical uncertainty associated with the identification of latent links still remains largely unexplored. In this paper, we propose the method of statistical inference on membership profiles in large networks (SIMPLE) in the setting of degree-corrected mixed membership model, where the null hypothesis assumes that the pair of nodes share the same profile of community memberships. In the simpler case of no degree heterogeneity, the model reduces to the mixed membership model for which an alternative more robust test is also proposed. Both tests are of the Hotelling-type statistics based on the rows of empirical eigenvectors or their ratios, whose asymptotic covariance matrices are very challenging to derive and estimate. Nevertheless, their analytical expressions are unveiled and the unknown covariance matrices are consistently estimated. Under some mild regularity conditions, we establish the exact limiting distributions of the two forms of SIMPLE test statistics under the null hypothesis and contiguous alternative hypothesis. They are the chi-square distributions and the noncentral chi-square distributions, respectively, with degrees of freedom depending on whether the degrees are corrected or not. We also address the important issue of estimating the unknown number of communities and establish the asymptotic properties of the associated test statistics. The advantages and practical utility of our new procedures in terms of both size and power are demonstrated through several simulation examples and real network applications.

  • BdryGP: a boundary-integrated Gaussian process model for computer code emulation

    Date: 2020-09-18

    Time: 15:30-16:30

    Zoom Link

    Meeting ID: 924 5390 4989

    Passcode: 690084

    Abstract:

    With advances in mathematical modeling and computational methods, complex phenomena (e.g., universe formations, rocket propulsion) can now be reliably simulated via computer code. This code solves a complicated system of equations representing the underlying science of the problem. Such simulations can be very time-intensive, requiring months of computation for a single run. Gaussian processes (GPs) are widely used as predictive models for “emulating” this expensive computer code. Yet with limited training data on a high-dimensional parameter space, such models can suffer from poor predictive performance and physical interpretability. Fortunately, in many physical applications, there is additional boundary information on the code beforehand, either from governing physics or scientific knowledge. We propose a new BdryGP model which incorporates such boundary information for prediction. We show that BdryGP not only enjoys improved convergence rates over standard GP models which do not incorporate boundaries, but is also more resistant to the ``curse-of-dimensionality’’ in nonparametric regression. We then demonstrate the improved predictive performance and posterior contraction of the BdryGP model on several test problems in the literature.

  • Machine Learning for Causal Inference

    Date: 2020-09-11

    Time: 16:00-17:00

    Zoom Link

    Meeting ID: 965 2536 7383

    Passcode: 421254

    Abstract:

    Given advances in machine learning over the past decades, it is now possible to accurately solve difficult non-parametric prediction problems in a way that is routine and reproducible. In this talk, I’ll discuss how machine learning tools can be rigorously integrated into observational study analyses, and how they interact with classical statistical ideas around randomization, semiparametric modeling, double robustness, etc. I’ll also survey some recent advances in methods for treatment heterogeneity. When deployed carefully, machine learning enables us to develop causal estimators that reflect an observational study design more closely than basic linear regression based methods.

  • A gentle introduction to generalized structured component analysis and its recent developments

    Date: 2020-03-27

    Time: 15:30-16:30

    Location: BURNSIDE 1205

    Abstract:

    Generalized structured component analysis (GSCA) was developed as a component-based approach to structural equation modeling, where constructs are represented by components or weighted composites of observed variables, rather than (common) factors. Unlike another long-lasting component-based approach – partial least squares path modeling, GSCA is a full-information method that optimizes a single criterion to estimate model parameters simultaneously, utilizing all information available in the entire system of equations. Over the decade, this approach has been refined and extended in various ways to enhance its data-analytic capability. I will briefly discuss the theoretical underpinnings of GSCA and demonstrate the use of an R package for GSCA - gesca. Moreover, I will outline some recent developments in GSCA, which include GSCA_M for estimating models with factors and integrated GSCA (IGSCA) for estimating models with both factors and components.

  • Informative Prior Elicitation from Historical Individual Patient Data

    Date: 2020-03-20

    Time: 15:30-16:30

    Location: BURNSIDE 1205

    Abstract:

    Historical data from previous studies may be utilized to strengthen statistical inference. Under the Bayesian framework incorporation of information obtained from any source other than the current data is facilitated through construction of an informative prior. The existing methodology for defining an informative prior based on historical data relies on measuring similarity to the current data at the study level that can result in discarding useful individual patient data (IPD). In this talk I present a family of priors that utilize IPD to strengthen statistical inference. IPD-based priors can be obtained as a weighted likelihood of the historical data where each individual’s weight is a function of their distance to the current study population. It is demonstrated that the proposed prior construction approach can considerably improve estimation accuracy and precision in compare with existing methods.