McGill Statistics Seminar - McGill Statistics Seminars
  • Variational Bayes for high-dimensional linear regression with sparse priors

    Date: 2021-11-12

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    A core problem in Bayesian statistics is approximating difficult-to-compute posterior distributions. In variational Bayes (VB), a method from machine learning, one approximates the posterior through optimization, which is typically faster than Markov chain Monte Carlo. We study a mean-field (i.e., factorizable) VB approximation to Bayesian model selection priors, including the popular spike-and-slab prior, in sparse high-dimensional linear regression. We establish convergence rates for this VB approach, studying conditions under which it provides good estimation. We also discuss some computational issues and study the empirical performance of the algorithm.
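
    For concreteness, the setup described above can be written as follows (the notation is ours, and the Laplace slab is one common choice, not necessarily the speaker's):

      y = X\theta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n), \qquad \theta \in \mathbb{R}^p, \; p \gg n,

      \theta_j \mid z_j \sim z_j\,\mathrm{Laplace}(\lambda) + (1 - z_j)\,\delta_0, \qquad z_j \sim \mathrm{Bernoulli}(w),

    and the mean-field VB approximation solves

      \hat{Q} = \arg\min_{Q = \prod_{j=1}^{p} Q_j} \mathrm{KL}\big(Q \,\|\, \Pi(\cdot \mid y)\big),

    replacing sampling from the posterior \Pi(\cdot \mid y) by optimization over the factorized family.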

  • Model-assisted analyses of cluster-randomized experiments

    Date: 2021-10-22

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Cluster-randomized experiments are widely used due to their logistical convenience and policy relevance. To analyze them properly, we must address the fact that the treatment is assigned at the cluster level instead of the individual level. Standard analytic strategies are regressions based on individual data, cluster averages, and cluster totals, which differ when the cluster sizes vary. These methods are often motivated by models with strong and unverifiable assumptions, and the choice among them can be subjective. Without any outcome modeling assumption, we evaluate these regression estimators and the associated robust standard errors from a design-based perspective where only the treatment assignment itself is random and controlled by the experimenter. We demonstrate that regression based on cluster averages targets a weighted average treatment effect, regression based on individual data is suboptimal in terms of efficiency, and regression based on cluster totals is consistent and more efficient with a large number of clusters. We highlight the critical role of covariates in improving estimation efficiency, and illustrate the efficiency gain via both simulation studies and data analysis. Moreover, we show that the robust standard errors are convenient approximations to the true asymptotic standard errors under the design-based perspective. Our theory holds even when the outcome models are misspecified, so it is model-assisted rather than model-based. We also extend the theory to a wider class of weighted average treatment effects.
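
    To make the comparison concrete, here is a minimal simulation sketch (in Python, with illustrative cluster counts, sizes, and effect values that are not from the talk) of the three regression estimators discussed above:

      import numpy as np
      import pandas as pd
      import statsmodels.api as sm

      rng = np.random.default_rng(0)

      # Simulate a cluster-randomized experiment with unequal cluster sizes:
      # the treatment indicator z is assigned at the cluster level.
      n_clusters = 60
      sizes = rng.integers(5, 50, size=n_clusters)
      z_cluster = rng.permutation([1] * (n_clusters // 2) + [0] * (n_clusters // 2))

      frames = []
      for g in range(n_clusters):
          cluster_shift = rng.normal(0.0, 1.0)
          y = 1.0 * z_cluster[g] + cluster_shift + rng.normal(0.0, 1.0, size=sizes[g])
          frames.append(pd.DataFrame({"cluster": g, "z": z_cluster[g], "y": y}))
      df = pd.concat(frames, ignore_index=True)

      def ols_slope(y, z):
          # OLS of outcome on treatment, with HC2 robust standard errors.
          fit = sm.OLS(y, sm.add_constant(z)).fit(cov_type="HC2")
          return fit.params.iloc[1], fit.bse.iloc[1]

      # (1) individual-level data, (2) cluster averages, (3) cluster totals.
      averages = df.groupby("cluster").mean()
      totals = df.groupby("cluster").sum()
      totals["z"] = averages["z"]  # treatment is an indicator, not a sum

      print("individual:", ols_slope(df["y"], df["z"]))
      print("averages:  ", ols_slope(averages["y"], averages["z"]))
      print("totals:    ", ols_slope(totals["y"], totals["z"]))

    With unequal cluster sizes the three estimates generally differ, which is the starting point of the design-based comparison above.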

  • Imbalanced learning using actuarial modified loss function in tree-based models

    Date: 2021-10-08

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Tree-based models have gained momentum in insurance claim loss modeling; however, the point mass at zero and the heavy tail of the insurance loss distribution pose a challenge for applying conventional methods directly to claim loss modeling. With a simple illustrative dataset, we first demonstrate how the traditional tree-based algorithm’s splitting function fails to cope with a large proportion of data with zero responses. To address the imbalance issue present in such loss modeling, this paper aims to modify the traditional splitting function of Classification and Regression Trees (CART). In particular, we propose two novel actuarial modified loss functions, namely, the weighted sum of squared error and the sum of squared Canberra error. These modified loss functions impose a significant penalty on grouping observations with non-zero responses together with those with zero responses during the splitting procedure, and thus significantly enhance their separation. Finally, we examine and compare the predictive performance of these actuarial modified tree-based models with the traditional model on synthetic datasets that imitate insurance losses. The results show that such modification leads to substantially different tree structures and improved prediction performance.
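
    As a rough Python sketch of how a modified splitting criterion of this kind could be evaluated for a candidate split (the exact weighting and Canberra-style formulas used in the talk may differ; this is only an illustration):

      import numpy as np

      def sse(y):
          # Conventional CART criterion: squared error around the node mean.
          if len(y) == 0:
              return 0.0
          return np.sum((y - y.mean()) ** 2)

      def squared_canberra_error(y, eps=1e-12):
          # Illustrative "sum of squared Canberra error": relative deviation from
          # the node mean, so mixing zero and non-zero responses is penalized heavily.
          if len(y) == 0:
              return 0.0
          m = y.mean()
          return np.sum(((y - m) / (np.abs(y) + np.abs(m) + eps)) ** 2)

      def best_split(x, y, criterion=squared_canberra_error):
          # Exhaustive search over split points of a single numeric feature.
          best_loss, best_t = np.inf, None
          for t in np.unique(x)[:-1]:
              left, right = y[x <= t], y[x > t]
              loss = criterion(left) + criterion(right)
              if loss < best_loss:
                  best_loss, best_t = loss, t
          return best_t, best_loss

    Swapping sse for squared_canberra_error in best_split changes which splits are preferred when most responses are exactly zero, which is the effect the abstract describes.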

  • The HulC: Hull based Confidence Regions

    Date: 2021-10-01

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    We develop and analyze the HulC, an intuitive and general method for constructing confidence sets using the convex hull of estimates constructed from subsets of the data. Unlike classical methods which are based on estimating the (limiting) distribution of an estimator, the HulC is often simpler to use and effectively bypasses this step. In comparison to the bootstrap, the HulC requires fewer regularity conditions and succeeds in many examples where the bootstrap provably fails. Unlike subsampling, the HulC does not require knowledge of the rate of convergence of the estimators on which it is based. The validity of the HulC requires knowledge of the (asymptotic) median-bias of the estimators. We further analyze a variant of our basic method, called the Adaptive HulC, which is fully data-driven and estimates the median-bias using subsampling. We show that the Adaptive HulC retains the aforementioned strengths of the HulC. In certain cases where the underlying estimators are pathologically asymmetric, the HulC and Adaptive HulC can fail to provide useful confidence sets. We discuss these methods in the context of several challenging inferential problems which arise in parametric, semi-parametric, and non-parametric inference. Although our focus is on validity under weak regularity conditions, we also provide some general results on the width of the HulC confidence sets, showing that in many cases the HulC confidence sets have near-optimal width.
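
    A minimal Python sketch of the basic construction described above, for a scalar parameter and assuming the estimator is (asymptotically) median-unbiased (this is a simplification; the Adaptive HulC additionally estimates the median-bias by subsampling):

      import numpy as np

      def hulc_interval(data, estimator, alpha=0.05, rng=None):
          # Split the data into B folds, estimate on each fold, and return the
          # convex hull (min, max) of the fold estimates. For a median-unbiased
          # estimator, B = ceil(log2(2/alpha)) gives miscoverage about 2^(1-B) <= alpha.
          rng = np.random.default_rng(rng)
          B = int(np.ceil(np.log2(2.0 / alpha)))
          idx = rng.permutation(len(data))
          folds = np.array_split(idx, B)
          estimates = [estimator(data[f]) for f in folds]
          return min(estimates), max(estimates)

      # Example: confidence interval for a mean.
      x = np.random.default_rng(1).normal(loc=2.0, size=500)
      print(hulc_interval(x, np.mean))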

  • On the Minimal Error of Empirical Risk Minimization

    Date: 2021-09-17

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    In recent years, highly expressive machine learning models, i.e., models that can express rich classes of functions, have become increasingly common due to their success in both regression and classification tasks; examples include deep neural networks and kernel machines. From the point of view of classical statistical theory (minimax theory), rich models tend to have a higher minimax rate, i.e., any estimator must have a high risk (a “worst case scenario” error). Therefore, it seems that for modern models the classical theory may be too conservative and strict. In this talk, we consider the most popular procedure for the regression task, Empirical Risk Minimization with squared loss (ERM), and analyze its minimal squared error in both the random and the fixed design settings, under the assumption of a convex family of functions; that is, the minimal squared error that ERM attains when estimating any function in our class, in each setting. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in the random design setting, the ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds for the performance of ERM for both Donsker and non-Donsker classes. This talk is based on joint work with Alexander Rakhlin.
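
    For reference, the procedure and the two error criteria contrasted above can be written as (standard notation, ours):

      \hat{f}_n \in \arg\min_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2,

    with fixed design error

      \frac{1}{n}\sum_{i=1}^{n}\big(\hat{f}_n(x_i) - f^{*}(x_i)\big)^2,

    and random design error

      \mathbb{E}_{X}\big(\hat{f}_n(X) - f^{*}(X)\big)^2,

    where f^{*} \in \mathcal{F} is the regression function and \mathcal{F} is a convex class.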

  • Weighted empirical processes

    Date: 2021-09-10

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Empirical processes concern the uniform behavior of averaged sums over a sample of observations, where the sums are indexed by a class of functions. Classical empirical processes typically study the empirical distribution function over the real line, while more modern empirical processes study much more general indexing function classes (e.g., Vapnik-Chervonenkis classes, smoothness classes); typical results include moment bounds and deviation inequalities. In this talk we will survey some of these results, but for the weighted empirical process obtained by weighting the original process by a factor related to its standard deviation, which makes the resulting process more difficult to bound. Applications to multivariate rank order statistics and residual empirical processes will be discussed.
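
    In standard empirical-process notation (ours, not necessarily the speaker's), the two objects are the empirical process

      \mathbb{G}_n(f) = \sqrt{n}\,\big(\mathbb{P}_n f - P f\big), \qquad \mathbb{P}_n f = \frac{1}{n}\sum_{i=1}^{n} f(X_i), \quad f \in \mathcal{F},

    and the weighted process obtained by standardizing with a factor related to the standard deviation,

      f \mapsto \frac{\mathbb{G}_n(f)}{\sigma(f)}, \qquad \sigma^{2}(f) = P f^{2} - (P f)^{2},

    which is harder to bound uniformly because \sigma(f) can be arbitrarily small over the class.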

  • Dependence Modeling of Mixed Insurance Claim Data

    Date: 2021-04-09

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    Multivariate claim data are common in insurance applications, e.g., the claims of each policyholder for different types of insurance coverage. Understanding the dependencies among such multivariate risks is essential for the solvency and profitability of insurers. Effectively modeling insurance claim data is challenging due to their special complexities. At the policyholder level, claim data usually follow a two-part mixed distribution: a probability mass at zero corresponding to no claim, and an otherwise positive claim amount drawn from a skewed and long-tailed distribution. Copula models are often employed to simultaneously model the relationship between outcomes and covariates while flexibly quantifying the dependencies among the different outcomes. However, the mixed nature of the data has made specifying copula models difficult. We fill this gap by developing a consistent nonparametric copula estimator for mixed data. Under our framework, both the models for (i) the marginal relationship between covariates and claims and (ii) the dependence structure between claims can be chosen in a principled way. We show the uniform convergence of the proposed nonparametric copula estimator. Using claim data from the Wisconsin Local Government Property Insurance Fund, we illustrate that our nonparametric copula estimator can assist analysts in identifying important features of the underlying dependence structure, revealing how different claims or risks are related to one another.
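
    In illustrative shorthand (our notation), the two-part margins and the copula decomposition referred to above are

      Y_j = \begin{cases} 0 & \text{with probability } \pi_j(\mathbf{x}), \\ Z_j > 0 & \text{otherwise, with } Z_j \text{ skewed and long-tailed,} \end{cases}

      F(y_1, \dots, y_d \mid \mathbf{x}) = C\big(F_1(y_1 \mid \mathbf{x}), \dots, F_d(y_d \mid \mathbf{x})\big),

    where the copula C is the object the proposed nonparametric estimator targets; part of the difficulty is that each margin F_j has an atom at zero, so Sklar's theorem does not pin down C uniquely without additional care.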

  • Learning Causal Structures via Continuous Optimization

    Date: 2021-03-26

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    There has been a recent surge of interest in the machine learning community in developing causal models that handle the effect of interventions in a system. In this talk, I will consider the problem of learning (estimating) a causal graphical model from data. The search over possible directed acyclic graphs modeling the causal structure is inherently combinatorial, but I will describe our recent work, which uses gradient-based continuous optimization to learn the parameters of the distribution and the causal graph jointly, and can be combined naturally with flexible parametric families based on neural networks.
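
    One widely used way to turn the combinatorial search over directed acyclic graphs into a continuous problem is the trace-exponential acyclicity constraint popularized by NOTEARS, shown here as a generic illustration (the work described in the talk may differ in its details):

      \min_{W \in \mathbb{R}^{d \times d},\, \phi} \; \mathcal{L}(W, \phi; \text{data}) \quad \text{subject to} \quad h(W) = \operatorname{tr}\big(e^{W \circ W}\big) - d = 0,

    where h(W) = 0 holds exactly when the weighted adjacency matrix W encodes an acyclic graph, so the constrained problem can be attacked with gradient-based methods such as an augmented Lagrangian, and \phi collects the parameters of a flexible conditional model (e.g., neural networks).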

  • Measuring timeliness of annual reports filing by jump additive models

    Date: 2021-03-19

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    Foreign private issuers (FPIs) are required by the Securities and Exchange Commission (SEC) to file Form 20-F as comprehensive annual reports. In an effort to increase the usefulness of 20-Fs, the SEC recently enacted a regulation to accelerate the deadline of 20-F filing from six months to four months after the fiscal year-end. The rationale is that the shortened reporting lag would improve the informational relevance of 20-Fs. In this work we propose a jump additive model to evaluate the SEC’s rationale by investigating the relationship between the timeliness of 20-F filings and their decision usefulness using market data. The proposed model extends conventional additive models to allow possible discontinuities in the regression functions. We suggest a two-step jump-preserving estimation procedure and show that it is statistically consistent. By applying the procedure to the 20-F study, we find a moderate positive association between the magnitude of the market reaction and the filing timeliness when the acceleration is less than 17 days. We also find that the market considers the filings significantly more informative when the acceleration is more than 18 days, and this reaction tapers off when the acceleration exceeds 40 days.
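
    In illustrative notation (ours), the model class referred to above is an additive regression model whose component functions may contain discontinuities:

      Y = \beta_0 + \sum_{j=1}^{p} g_j(X_j) + \varepsilon,

    where each g_j is smooth except possibly at a finite number of unknown jump points, and the two-step procedure estimates the components while preserving, rather than smoothing over, the detected jumps.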

  • CoinPress: Practical Private Point Estimation and Confidence Intervals

    Date: 2021-02-26

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    We consider point estimation and generation of confidence intervals under the constraint of differential privacy. We provide a simple and practical framework for these tasks in relatively general settings. Our investigation addresses a novel challenge that arises in the differentially private setting, which involves the cost of weak a priori bounds on the parameters of interest. This framework is applied to the problems of Gaussian mean and covariance estimation. Despite the simplicity of our method, we are able to achieve minimax near-optimal rates for these problems. Empirical evaluations, on the problems of mean estimation, covariance estimation, and principal component analysis, demonstrate significant improvements in comparison to previous work.
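
    As a rough illustration of the role of the a priori bounds mentioned above, here is a generic Gaussian-mechanism sketch in Python (this is not the CoinPress algorithm itself; the paper's iterative refinement and privacy accounting differ, and all numbers below are illustrative):

      import numpy as np

      def private_mean_step(x, center, radius, eps, delta, rng=None):
          # One (eps, delta)-DP Gaussian-mechanism estimate of the mean, given an
          # a priori ball B(center, radius) assumed to contain the data.
          rng = np.random.default_rng(rng)
          n, d = x.shape
          # Project points onto the ball so the clipped mean has bounded sensitivity.
          diff = x - center
          norms = np.maximum(np.linalg.norm(diff, axis=1, keepdims=True), 1e-12)
          clipped = center + diff * np.minimum(1.0, radius / norms)
          sensitivity = 2.0 * radius / n  # L2 sensitivity of the clipped mean
          sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
          return clipped.mean(axis=0) + rng.normal(0.0, sigma, size=d)

      # CoinPress-style usage (sketch): run a few steps, splitting the privacy
      # budget, and shrink the radius around each successive estimate.
      x = np.random.default_rng(0).normal(loc=3.0, size=(2000, 2))
      center, radius = np.zeros(2), 100.0
      for _ in range(3):
          center = private_mean_step(x, center, radius, eps=0.5 / 3, delta=1e-6 / 3)
          radius = radius / 5.0  # illustrative shrinkage schedule, not the paper's
      print(center)

    The point of the sketch is that a very weak (large-radius) a priori bound forces a lot of noise in a single step, which is the cost the framework above is designed to reduce.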