Past Seminar Series - McGill Statistics Seminars
  • Imbalanced learning using actuarial modified loss function in tree-based models

    Date: 2021-10-08

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Tree-based models have gained momentum in insurance claim loss modeling; however, the point mass at zero and the heavy tail of the insurance loss distribution make it difficult to apply conventional methods directly to claim loss modeling. With a simple illustrative dataset, we first demonstrate how the traditional tree-based algorithm’s splitting function fails to cope with a large proportion of data with zero responses. To address the imbalance issue presented in such loss modeling, this paper aims to modify the traditional splitting function of the Classification and Regression Tree (CART) algorithm. In particular, we propose two novel actuarial modified loss functions, namely, the weighted sum of squared error and the sum of squared Canberra error. These modified loss functions impose a significant penalty on grouping observations of non-zero response with those of zero response during the splitting procedure, and thus significantly enhance their separation. Finally, we examine and compare the predictive performance of such actuarial modified tree-based models to the traditional model on synthetic datasets that imitate insurance loss. The results show that such modification leads to substantially different tree structures and improved prediction performance.
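
The abstract does not give the formulas for the modified loss functions, so the following is only a rough sketch of the idea. It assumes that the "sum of squared Canberra error" of a node is the sum of squared Canberra distances between each response and the node mean; the function names and the toy data are hypothetical. On this toy example, the traditional sum of squared errors prefers a split that lumps zeros with small claims, while the assumed Canberra-based loss prefers the split that separates zeros from non-zeros.

```python
def sse(values):
    """Traditional CART node impurity: sum of squared errors around the node mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def squared_canberra(values):
    """ASSUMED form of the modified loss: sum over the node of
    ((y - mean) / (|y| + |mean|))^2. Not taken from the paper."""
    mean = sum(values) / len(values)
    total = 0.0
    for y in values:
        denom = abs(y) + abs(mean)
        if denom > 0:  # a node containing only zeros contributes nothing
            total += ((y - mean) / denom) ** 2
    return total


def split_cost(impurity, left, right):
    """Cost of a candidate split: impurity of the left child plus the right child."""
    return impurity(left) + impurity(right)


# Hypothetical toy responses: seven zero claims and three positive claims.
zeros = [0.0] * 7
split_a = (zeros, [1.0, 2.0, 10.0])          # separates zeros from non-zeros
split_b = (zeros + [1.0, 2.0], [10.0])       # groups zeros with small claims

for name, impurity in [("SSE", sse), ("squared Canberra", squared_canberra)]:
    cost_a = split_cost(impurity, *split_a)
    cost_b = split_cost(impurity, *split_b)
    preferred = "A" if cost_a < cost_b else "B"
    print(f"{name}: cost A = {cost_a:.3f}, cost B = {cost_b:.3f}, prefers split {preferred}")
```

Under this assumed form, a zero response sitting in a node with a positive mean always contributes the maximal error of 1, which is what drives the penalty on mixing zero and non-zero responses.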

  • The HulC: Hull based Confidence Regions

    Date: 2021-10-01

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    We develop and analyze the HulC, an intuitive and general method for constructing confidence sets using the convex hull of estimates constructed from subsets of the data. Unlike classical methods which are based on estimating the (limiting) distribution of an estimator, the HulC is often simpler to use and effectively bypasses this step. In comparison to the bootstrap, the HulC requires fewer regularity conditions and succeeds in many examples where the bootstrap provably fails. Unlike subsampling, the HulC does not require knowledge of the rate of convergence of the estimators on which it is based. The validity of the HulC requires knowledge of the (asymptotic) median-bias of the estimators. We further analyze a variant of our basic method, called the Adaptive HulC, which is fully data-driven and estimates the median-bias using subsampling. We show that the Adaptive HulC retains the aforementioned strengths of the HulC. In certain cases where the underlying estimators are pathologically asymmetric, the HulC and Adaptive HulC can fail to provide useful confidence sets. We discuss these methods in the context of several challenging inferential problems which arise in parametric, semi-parametric, and non-parametric inference. Although our focus is on validity under weak regularity conditions, we also provide some general results on the width of the HulC confidence sets, showing that in many cases the HulC confidence sets have near-optimal width.
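
As a rough illustration of the basic construction in one dimension, the sketch below splits the data into B disjoint batches, computes the estimator on each batch, and returns the interval from the smallest to the largest estimate (the convex hull of the estimates). It assumes the estimator has zero median-bias, in which case the estimates all fall on one side of the target with probability 2^(1-B), so B is chosen with 2^(1-B) <= alpha. The function names, the random batching, and the simulated data are illustrative assumptions; the randomized choice of B and the Adaptive HulC are not shown.

```python
import math
import random
import statistics


def num_batches(alpha):
    """Smallest B with 2^(1 - B) <= alpha (assumes a median-unbiased estimator)."""
    return math.ceil(math.log2(2 / alpha))


def hulc_interval(data, estimator, alpha=0.05, seed=0):
    """One-dimensional sketch: convex hull (min, max) of the batch estimates."""
    B = num_batches(alpha)
    if len(data) < B:
        raise ValueError("need at least one observation per batch")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)            # random assignment to batches
    batches = [shuffled[i::B] for i in range(B)]     # B disjoint subsets
    estimates = [estimator(batch) for batch in batches]
    return min(estimates), max(estimates)


# Hypothetical example: interval for the mean of simulated Gaussian data.
rng = random.Random(1)
sample = [rng.gauss(3.0, 1.0) for _ in range(600)]
low, high = hulc_interval(sample, statistics.mean, alpha=0.05)
print(f"HulC-style interval for the mean: [{low:.3f}, {high:.3f}]")
```

Note that no variance estimate or limiting distribution is used anywhere; only the estimator itself is needed.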

  • Deep down, everyone wants to be causal

    Date: 2021-09-24

    Time: 15:00-16:00 (Montreal time)

    https://mcgill.zoom.us/j/9791073141

    Meeting ID: 979 107 3141

    Abstract:

    In the data science courses at the University of British Columbia, we define data science as the study, development and practice of reproducible and auditable processes to obtain insight from data. While reproducibility is core to our definition, most data science learners enter the field with other aspects of data science in mind, for example predictive modelling, which is often one of the most interesting topics to novices. This fact, along with the highly technical nature of the industry-standard reproducibility tools currently employed in data science, presents out-of-the-gate challenges in teaching reproducibility in the data science classroom. Put simply, students are not as intrinsically motivated to learn this topic, and it is not an easy one for them to learn. What can a data science educator do? Over several iterations of teaching courses focused on reproducible data science tools and workflows, we have found that providing extra motivation, guided instruction and lots of practice are key to effectively teaching this challenging, yet important subject. Here we present examples of how we deeply motivate, effectively guide and provide ample practice opportunities to data science students to effectively engage them in learning about this topic.

  • On the Minimal Error of Empirical Risk Minimization

    Date: 2021-09-17

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    In recent years, highly expressive machine learning models, i.e. models that can express rich classes of functions, have become increasingly common due to their success in both regression and classification tasks; examples include deep neural nets and kernel machines. From the point of view of classical statistical theory (the minimax theory), rich models tend to have a higher minimax rate, i.e. any estimator must have a high risk (a “worst case scenario” error). Therefore, it seems that for modern models the classical theory may be too conservative and strict. In this talk, we consider the most popular procedure for regression tasks, namely Empirical Risk Minimization with squared loss (ERM), and analyze its minimal squared error in both the random and the fixed design settings, under the assumption of a convex family of functions; that is, the minimal squared error that the ERM attains when estimating any function in our class in both settings. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, the ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds for the performance of ERM for both Donsker and non-Donsker classes. This talk is based on joint work with Alexander Rakhlin.

  • Weighted empirical processes

    Date: 2021-09-10

    Time: 15:30-16:30 (Montreal time)

    https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

    Meeting ID: 834 3668 6293

    Passcode: 12345

    Abstract:

    Empirical processes concern the uniform behavior of averaged sums over a sample of observations where the sums are indexed by a class of functions. Classical empirical processes typically study the empirical distribution function over the real line, while more modern empirical processes study much more general indexing function classes (e.g., Vapnik-Chervonenkis classes, smoothness classes); typical results include moment bounds and deviation inequalities. In this talk we will survey some of these results, but for the weighted empirical process obtained by weighting the original process by a factor related to the standard deviation of the process, which makes the resulting process more difficult to bound. Applications to multivariate rank order statistics and residual empirical processes will be discussed.

  • Dependence Modeling of Mixed Insurance Claim Data

    Date: 2021-04-09

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    Multivariate claim data are common in insurance applications, e.g. claims of each policyholder for different types of insurance coverages. Understanding the dependencies among such multivariate risks is essential for the solvency and profitability of insurers. Effectively modeling insurance claim data is challenging due to their special complexities. At the policyholder level, claims data usually follow a two-part mixed distribution: a probability mass at zero corresponding to no claim and an otherwise positive claim from a skewed and long-tailed distribution. Copula models are often employed in order to simultaneously model the relationship between outcomes and covariates while flexibly quantifying the dependencies among the different outcomes. However, due to the mixed data feature, specification of copula models has been a problem. We fill this gap by developing a consistent nonparametric copula estimator for mixed data. Under our framework, both the models for the i) marginal relationship between covariates and claims and ii) dependence structure between claims can be chosen in a principled way. We show the uniform convergence of the proposed nonparametric copula estimator. Using the claim data from the Wisconsin Local Government Property Insurance Fund, we illustrate that our nonparametric copula estimator can assist analysts in identifying important features of the underlying dependence structure, revealing how different claims or risks are related to one another.

  • Learning Causal Structures via Continuous Optimization

    Date: 2021-03-26

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    There has been a recent surge of interest in the machine learning community in developing causal models that handle the effect of interventions in a system. In this talk, I will consider the problem of learning (estimating) a causal graphical model from data. The search over possible directed acyclic graphs modeling the causal structure is inherently combinatorial, but I’ll describe our recent work which uses gradient-based continuous optimization for learning both the parameters of the distribution and the causal graph jointly, and can be combined naturally with flexible parametric families that use neural networks.
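
The abstract does not say how the combinatorial acyclicity requirement is made continuous. One well-known device from the continuous structure-learning literature is the smooth penalty h(W) = tr(exp(W∘W)) - d on a d×d weighted adjacency matrix W, which is zero exactly when W encodes a directed acyclic graph. Whether the talk relies on this particular function is an assumption; the sketch below only illustrates the idea, approximating the matrix exponential by a truncated power series in plain Python. All names and example matrices are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity_penalty(W, terms=30):
    """Approximate h(W) = tr(exp(W∘W)) - d = sum_{k>=1} tr((W∘W)^k) / k!
    using the first `terms` terms of the series."""
    d = len(W)
    A = [[w * w for w in row] for row in W]            # elementwise square, W∘W
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    h, factorial = 0.0, 1.0
    for k in range(1, terms + 1):
        power = matmul(power, A)                       # now equals A^k
        factorial *= k
        h += sum(power[i][i] for i in range(d)) / factorial
    return h


# Hypothetical 3-node examples: a chain 0 -> 1 -> 2 and a cycle 0 -> 1 -> 2 -> 0.
dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]
cycle = [[0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [1.0, 0.0, 0.0]]

print("penalty for the chain:", acyclicity_penalty(dag))
print("penalty for the cycle:", acyclicity_penalty(cycle))
```

Because the penalty is differentiable in W, it can be added to a likelihood-based objective and handled by gradient-based optimizers, which is the general sense in which the graph search becomes a continuous problem.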

  • Measuring timeliness of annual reports filing by jump additive models

    Date: 2021-03-19

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    Foreign private issuers (FPIs) are required by the Securities and Exchange Commission (SEC) to file Form 20-F as comprehensive annual reports. In an effort to increase the usefulness of 20-Fs, the SEC recently enacted a regulation to accelerate the deadline of 20-F filing from six months to four months after the fiscal year-end. The rationale is that the shortened reporting lag would improve the informational relevance of 20-Fs. In this work we propose a jump additive model to evaluate the SEC’s rationale by investigating the relationship between the timeliness of 20-F filing and its decision usefulness using the market data. The proposed model extends the conventional additive models to allow possible discontinuities in the regression functions. We suggest a two-step jump-preserving estimation procedure and show that it is statistically consistent. By applying the procedure to the 20-F study, we find a moderate positive association between the magnitude of the market reaction and the filing timeliness when the acceleration is less than 17 days. We also find that the market considers the filings significantly more informative when the acceleration is more than 18 days and such reaction tapers off when the acceleration exceeds 40 days.

  • Nonparametric Tests for Informative Selection in Complex Surveys

    Date: 2021-03-12

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 939 8331 3215

    Passcode: 096952

    Abstract:

    Informative selection, in which the distribution of response variables given that they are sampled is different from their distribution in the population, is pervasive in complex surveys. Failing to take such informativeness into account can produce severe inferential errors, including biased and inconsistent estimation of population parameters. While several parametric procedures exist to test for informative selection, these methods are limited in scope and their parametric assumptions are difficult to assess. We consider two classes of nonparametric tests of informative selection. The first class is motivated by classic nonparametric two-sample tests. We compare weighted and unweighted empirical distribution functions and obtain tests for informative selection that are analogous to the Kolmogorov-Smirnov and Cramér-von Mises tests. For the second class of tests, we adapt a kernel-based learning method that compares distributions based on their maximum mean discrepancy. The asymptotic distributions of the test statistics are established under the null hypothesis of noninformative selection. Simulation results show that our tests have power competitive with existing parametric tests in a correctly specified parametric setting, and better than those tests under model misspecification. A recreational angling application illustrates the methodology.
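
To make the first class of tests concrete, the sketch below computes a Kolmogorov-Smirnov-type statistic: the largest absolute gap between the survey-weighted and the unweighted empirical distribution functions of the sampled responses. It assumes the weights are inverse inclusion probabilities; the function name and the toy data are hypothetical. Only the statistic is shown, not the null distribution or critical values established in the talk.

```python
def ks_weighted_vs_unweighted(y, w):
    """Sup-distance between the weighted and unweighted EDFs of the responses y.

    Both EDFs are step functions that jump only at observed values, so the
    supremum is attained at one of the (distinct) observed values.
    """
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    total_weight = sum(w)
    cumulative_weight = 0.0
    statistic = 0.0
    for k, i in enumerate(order):
        cumulative_weight += w[i]
        # With ties, compare the EDFs only after the last of the tied values.
        last_of_tie = (k == n - 1) or (y[order[k + 1]] != y[i])
        if last_of_tie:
            weighted_edf = cumulative_weight / total_weight
            unweighted_edf = (k + 1) / n
            statistic = max(statistic, abs(weighted_edf - unweighted_edf))
    return statistic


# Hypothetical toy sample: the largest response carries a much larger weight.
responses = [1.0, 2.0, 3.0, 4.0]
equal_weights = [1.0, 1.0, 1.0, 1.0]
unequal_weights = [1.0, 1.0, 1.0, 5.0]

print("equal weights:  ", ks_weighted_vs_unweighted(responses, equal_weights))
print("unequal weights:", ks_weighted_vs_unweighted(responses, unequal_weights))
```

When the weights are unrelated to the response, the two EDFs estimate the same distribution and the statistic should be small; a large gap suggests that selection depends on the response.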

  • CoinPress: Practical Private Point Estimation and Confidence Intervals

    Date: 2021-02-26

    Time: 15:30-16:30 (Montreal time)

    Zoom Link

    Meeting ID: 843 0865 5572

    Passcode: 690084

    Abstract:

    We consider point estimation and generation of confidence intervals under the constraint of differential privacy. We provide a simple and practical framework for these tasks in relatively general settings. Our investigation addresses a novel challenge that arises in the differentially private setting, which involves the cost of weak a priori bounds on the parameters of interest. This framework is applied to the problems of Gaussian mean and covariance estimation. Despite the simplicity of our method, we are able to achieve minimax near-optimal rates for these problems. Empirical evaluations, on the problems of mean estimation, covariance estimation, and principal component analysis, demonstrate significant improvements in comparison to previous work.