Past Seminar Series - McGill Statistics Seminars

- Oct 29, 2021
- post
Opinionated practices for teaching reproducibility: motivation, guided instruction and practice

Tiffany Timbers · Oct 29, 2021
Date: 2021-10-29

Time: 15:30-16:30 (Montreal time)

Zoom Link

Meeting ID: 939 8331 3215

Passcode: 096952

Abstract:

In the data science courses at the University of British Columbia, we define data science as the study, development and practice of reproducible and auditable processes to obtain insight from data. While reproducibility is core to our definition, most data science learners enter the field with other aspects of data science in mind, for example predictive modelling, which is often one of the most interesting topics to novices. This fact, along with the highly technical nature of the industry-standard reproducibility tools currently employed in data science, presents out-of-the-gate challenges in teaching reproducibility in the data science classroom. Put simply, students are not as intrinsically motivated to learn this topic, and it is not an easy one for them to learn. What can a data science educator do? Over several iterations of teaching courses focused on reproducible data science tools and workflows, we have found that providing extra motivation, guided instruction and lots of practice are key to effectively teaching this challenging, yet important subject. Here we present examples of how we deeply motivate, effectively guide and provide ample practice opportunities to data science students to effectively engage them in learning about this topic.

- Oct 22, 2021
- post
Model-assisted analyses of cluster-randomized experiments

Peng Ding · Oct 22, 2021
Date: 2021-10-22

Time: 15:30-16:30 (Montreal time)

https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

Meeting ID: 834 3668 6293

Passcode: 12345

Abstract:

Cluster-randomized experiments are widely used due to their logistical convenience and policy relevance. To analyze them properly, we must address the fact that the treatment is assigned at the cluster level instead of the individual level. Standard analytic strategies are regressions based on individual data, cluster averages, and cluster totals, which differ when the cluster sizes vary. These methods are often motivated by models with strong and unverifiable assumptions, and the choice among them can be subjective. Without any outcome modeling assumption, we evaluate these regression estimators and the associated robust standard errors from a design-based perspective where only the treatment assignment itself is random and controlled by the experimenter. We demonstrate that regression based on cluster averages targets a weighted average treatment effect, regression based on individual data is suboptimal in terms of efficiency, and regression based on cluster totals is consistent and more efficient with a large number of clusters. We highlight the critical role of covariates in improving estimation efficiency, and illustrate the efficiency gain via both simulation studies and data analysis. Moreover, we show that the robust standard errors are convenient approximations to the true asymptotic standard errors under the design-based perspective. Our theory holds even when the outcome models are misspecified, so it is model-assisted rather than model-based. We also extend the theory to a wider class of weighted average treatment effects.
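The three regression strategies named in the abstract can be illustrated in their simplest form. Without covariates, the OLS coefficient on a binary treatment indicator equals a difference in means, so the sketch below computes three differences in means. The function name `cluster_estimators`, the input layout, and the scaling of cluster totals by (number of clusters / number of individuals) are assumptions made for illustration; the talk's estimators, covariate adjustment and robust standard errors are not reproduced here.

```python
from statistics import mean


def cluster_estimators(clusters):
    """Difference-in-means versions of three analyses of a cluster-randomized trial.

    `clusters` is a list of (treated, outcomes) pairs: `treated` is a bool
    assigned at the cluster level, `outcomes` lists the individual outcomes.
    """
    n_clusters = len(clusters)
    n_individuals = sum(len(outcomes) for _, outcomes in clusters)
    # Assumed scaling so that cluster totals are on the per-individual scale.
    scale = n_clusters / n_individuals

    individual = {True: [], False: []}
    averages = {True: [], False: []}
    totals = {True: [], False: []}
    for treated, outcomes in clusters:
        individual[treated].extend(outcomes)           # pool individuals
        averages[treated].append(mean(outcomes))       # one average per cluster
        totals[treated].append(scale * sum(outcomes))  # one scaled total per cluster

    def diff_in_means(by_arm):
        return mean(by_arm[True]) - mean(by_arm[False])

    return {
        "individual": diff_in_means(individual),
        "cluster_average": diff_in_means(averages),
        "cluster_total": diff_in_means(totals),
    }


# Hypothetical toy data with unequal cluster sizes.
toy = [(True, [2, 4]), (True, [6, 6, 6, 6]), (False, [1]), (False, [2, 2, 2])]
estimates = cluster_estimators(toy)
```

With equal cluster sizes the three numbers coincide; with the unequal sizes in `toy` they differ, which is the situation the abstract addresses.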

- Oct 8, 2021
- post
Imbalanced learning using actuarial modified loss function in tree-based models

Zhiyu Quan · Oct 8, 2021
Date: 2021-10-08

Time: 15:30-16:30 (Montreal time)

https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

Meeting ID: 834 3668 6293

Passcode: 12345

Abstract:

Tree-based models have gained momentum in insurance claim loss modeling; however, the point mass at zero and the heavy tail of insurance loss distributions make it challenging to apply conventional methods directly to claim loss modeling. With a simple illustrative dataset, we first demonstrate how the traditional tree-based algorithm’s splitting function fails to cope with a large proportion of data with zero responses. To address the imbalance issue presented in such loss modeling, this paper aims to modify the traditional splitting function of the Classification and Regression Tree (CART). In particular, we propose two novel actuarial modified loss functions, namely, the weighted sum of squared error and the sum of squared Canberra error. These modified loss functions impose a significant penalty on grouping observations of non-zero response with those of zero response during the splitting procedure, and thus significantly enhance their separation. Finally, we examine and compare the predictive performance of such actuarial modified tree-based models to the traditional model on synthetic datasets that imitate insurance loss. The results show that such modification leads to substantially different tree structures and improved prediction performance.

- Oct 1, 2021
- post
The HulC: Hull based Confidence Regions

Arun Kumar · Oct 1, 2021
Date: 2021-10-01

Time: 15:30-16:30 (Montreal time)

https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

Meeting ID: 834 3668 6293

Passcode: 12345

Abstract:

We develop and analyze the HulC, an intuitive and general method for constructing confidence sets using the convex hull of estimates constructed from subsets of the data. Unlike classical methods which are based on estimating the (limiting) distribution of an estimator, the HulC is often simpler to use and effectively bypasses this step. In comparison to the bootstrap, the HulC requires fewer regularity conditions and succeeds in many examples where the bootstrap provably fails. Unlike subsampling, the HulC does not require knowledge of the rate of convergence of the estimators on which it is based. The validity of the HulC requires knowledge of the (asymptotic) median-bias of the estimators. We further analyze a variant of our basic method, called the Adaptive HulC, which is fully data-driven and estimates the median-bias using subsampling. We show that the Adaptive HulC retains the aforementioned strengths of the HulC. In certain cases where the underlying estimators are pathologically asymmetric, the HulC and Adaptive HulC can fail to provide useful confidence sets. We discuss these methods in the context of several challenging inferential problems which arise in parametric, semi-parametric, and non-parametric inference. Although our focus is on validity under weak regularity conditions, we also provide some general results on the width of the HulC confidence sets, showing that in many cases the HulC confidence sets have near-optimal width.
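The convex-hull idea in the abstract can be sketched for a scalar parameter, where the convex hull of the batch estimates is simply the interval from their minimum to their maximum. The sketch assumes an estimator with zero median bias; in that case the truth falls outside the hull of B independent batch estimates only when all B land on the same side, which happens with probability 2 · (1/2)^B. The function names and the fixed rule for choosing the number of batches are illustrative assumptions, and the talk's method may differ in such details.

```python
import math
import random
import statistics


def hulc_num_batches(alpha):
    """Smallest B with 2 * (1/2)**B <= alpha, assuming zero median bias."""
    return math.ceil(math.log2(2.0 / alpha))


def hulc_interval(data, estimator, alpha=0.05, seed=0):
    """HulC-style interval for a scalar parameter (illustrative sketch)."""
    n_batches = hulc_num_batches(alpha)
    if len(data) < n_batches:
        raise ValueError("need at least one observation per batch")

    # Shuffle a copy so the split into disjoint batches is random.
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)

    # Deal the observations round-robin into n_batches disjoint batches.
    batches = [shuffled[i::n_batches] for i in range(n_batches)]
    estimates = [estimator(batch) for batch in batches]

    # In one dimension the convex hull of the estimates is [min, max].
    return min(estimates), max(estimates)


# Hypothetical usage: interval for the median of simulated Gaussian data.
rng = random.Random(1)
sample = [rng.gauss(2.0, 1.0) for _ in range(600)]
low, high = hulc_interval(sample, statistics.median, alpha=0.05)
```

No estimate of the estimator's limiting distribution or rate of convergence appears anywhere in the computation, which is the point the abstract makes when it contrasts the HulC with classical methods and subsampling.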

- Sep 24, 2021
- post
Deep down, everyone wants to be causal

Jennifer Hill · Sep 24, 2021
Date: 2021-09-24

Time: 15:00-16:00 (Montreal time)

https://mcgill.zoom.us/j/9791073141

Meeting ID: 979 107 3141

Abstract:


- Sep 17, 2021
- post
On the Minimal Error of Empirical Risk Minimization

Gil Kur · Sep 17, 2021
Date: 2021-09-17

Time: 15:30-16:30 (Montreal time)

https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

Meeting ID: 834 3668 6293

Passcode: 12345

Abstract:

In recent years, highly expressive machine learning models, i.e. models that can express rich classes of functions, have become more and more commonly used due to their success in both regression and classification tasks; examples include deep neural nets and kernel machines. From the point of view of classical statistical theory (minimax theory), rich models tend to have a higher minimax rate, i.e. any estimator must have a high risk (a “worst case scenario” error). Therefore, it seems that for modern models the classical theory may be too conservative and strict. In this talk, we consider the most popular procedure for regression tasks, namely Empirical Risk Minimization (ERM) with squared loss, and analyze its minimal squared error in both the random and the fixed design settings, under the assumption of a convex family of functions. By minimal squared error we mean the smallest squared error that the ERM attains when estimating any function in our class, in both settings. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, the ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds for the performance of ERM for both Donsker and non-Donsker classes. This talk is based on joint work with Alexander Rakhlin.

- Sep 10, 2021
- post
Weighted empirical processes

Yue Zhao · Sep 10, 2021
Date: 2021-09-10

Time: 15:30-16:30 (Montreal time)

https://mcgill.zoom.us/j/83436686293?pwd=b0RmWmlXRXE3OWR6NlNIcWF5d0dJQT09

Meeting ID: 834 3668 6293

Passcode: 12345

Abstract:

Empirical processes concern the uniform behavior of averaged sums over a sample of observations where the sums are indexed by a class of functions. Classical empirical processes typically study the empirical distribution function over the real line, while more modern empirical processes study much more general indexing function classes (e.g., Vapnik-Chervonenkis class, smoothness class); typical results include moment bounds and deviation inequalities. In this talk we will survey some of these results, but for the weighted empirical process that is obtained by weighting the original process by a factor related to the standard deviation of the process, which will make the resulting process more difficult to bound. Applications to multivariate rank order statistics and residual empirical processes will be discussed.
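For orientation, the classical case on the real line gives a concrete instance of such a weighting. The notation below is an assumption chosen for illustration (i.i.d. observations $X_1, \dots, X_n$ with distribution function $F$ and empirical distribution function $F_n$); the weights considered in the talk may be more general.

```latex
F_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le t\}, \qquad
\mathbb{G}_n(t) = \sqrt{n}\,\bigl(F_n(t) - F(t)\bigr), \qquad
\operatorname{Var}\bigl(\mathbb{G}_n(t)\bigr) = F(t)\,\bigl(1 - F(t)\bigr),

\mathbb{W}_n(t) = \frac{\mathbb{G}_n(t)}{\sqrt{F(t)\,\bigl(1 - F(t)\bigr)}} .
```

The weight $1/\sqrt{F(t)(1 - F(t))}$ grows without bound as $t$ moves into either tail, which suggests why uniform bounds for the weighted process are harder to obtain than for $\mathbb{G}_n$ itself.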

- Apr 9, 2021
- post
Dependence Modeling of Mixed Insurance Claim Data

Lu Yang · Apr 9, 2021
Date: 2021-04-09

Time: 15:30-16:30 (Montreal time)

Zoom Link

Meeting ID: 843 0865 5572

Passcode: 690084

Abstract:

Multivariate claim data are common in insurance applications, e.g. claims of each policyholder for different types of insurance coverages. Understanding the dependencies among such multivariate risks is essential for the solvency and profitability of insurers. Effectively modeling insurance claim data is challenging due to their special complexities. At the policyholder level, claims data usually follow a two-part mixed distribution: a probability mass at zero corresponding to no claim and an otherwise positive claim from a skewed and long-tailed distribution. Copula models are often employed in order to simultaneously model the relationship between outcomes and covariates while flexibly quantifying the dependencies among the different outcomes. However, due to the mixed data feature, specification of copula models has been a problem. We fill this gap by developing a consistent nonparametric copula estimator for mixed data. Under our framework, both the models for the i) marginal relationship between covariates and claims and ii) dependence structure between claims can be chosen in a principled way. We show the uniform convergence of the proposed nonparametric copula estimator. Using the claim data from the Wisconsin Local Government Property Insurance Fund, we illustrate that our nonparametric copula estimator can assist analysts in identifying important features of the underlying dependence structure, revealing how different claims or risks are related to one another.

- Mar 26, 2021
- post
Learning Causal Structures via Continuous Optimization

Simon Lacoste-Julien · Mar 26, 2021
Date: 2021-03-26

Time: 15:30-16:30 (Montreal time)

Zoom Link

Meeting ID: 843 0865 5572

Passcode: 690084

Abstract:

There has been a recent surge of interest in the machine learning community in developing causal models that handle the effect of interventions in a system. In this talk, I will consider the problem of learning (estimating) a causal graphical model from data. The search over possible directed acyclic graphs modeling the causal structure is inherently combinatorial, but I’ll describe our recent work that uses gradient-based continuous optimization for learning both the parameters of the distribution and the causal graph jointly, and that can be combined naturally with flexible parametric families that use neural networks.
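The abstract does not say how the combinatorial acyclicity requirement is made amenable to gradient-based methods. One known device from the literature on continuous DAG learning, shown here purely as an assumed illustration and not as the speaker's method, is a smooth score of a weighted adjacency matrix W that is zero exactly when the graph is acyclic: tr((I + A/d)^d) − d, where A is the elementwise square of W and d is the number of nodes. The function name `acyclicity_score` is hypothetical.

```python
def acyclicity_score(weights):
    """Return tr((I + A/d)^d) - d, with A the elementwise square of `weights`.

    The score is non-negative and equals zero exactly when the directed graph
    with an edge i -> j wherever weights[i][j] != 0 has no directed cycle.
    """
    d = len(weights)
    identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    # base = I + (W * W) / d; every entry is non-negative.
    base = [[identity[i][j] + weights[i][j] ** 2 / d for j in range(d)]
            for i in range(d)]

    # Raise `base` to the d-th power by repeated matrix multiplication.
    power = identity
    for _ in range(d):
        power = [[sum(power[i][k] * base[k][j] for k in range(d))
                  for j in range(d)] for i in range(d)]

    return sum(power[i][i] for i in range(d)) - d


# Hypothetical examples: a chain 0 -> 1 -> 2 and a two-node cycle.
chain = [[0.0, 1.5, 0.0], [0.0, 0.0, -2.0], [0.0, 0.0, 0.0]]
cycle = [[0.0, 1.0], [1.0, 0.0]]
chain_score = acyclicity_score(chain)
cycle_score = acyclicity_score(cycle)
```

Because the score is a polynomial in the entries of W, it is differentiable, so it can be added to a training objective as a penalty or constraint and handled with standard gradient-based optimizers.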

- Mar 19, 2021
- post
Measuring timeliness of annual reports filing by jump additive models

Yicheng Kang · Mar 19, 2021
Date: 2021-03-19

Time: 15:30-16:30 (Montreal time)

Zoom Link

Meeting ID: 843 0865 5572

Passcode: 690084

Abstract:

Foreign public issuers (FPIs) are required by the Securities and Exchange Commission (SEC) to file Form 20-F as comprehensive annual reports. In an effort to increase the usefulness of 20-Fs, the SEC recently enacted a regulation to accelerate the deadline of 20-F filing from six months to four months after the fiscal year-end. The rationale is that the shortened reporting lag would improve the informational relevance of 20-Fs. In this work we propose a jump additive model to evaluate the SEC’s rationale by investigating the relationship between the timeliness of 20-F filing and its decision usefulness using the market data. The proposed model extends the conventional additive models to allow possible discontinuities in the regression functions. We suggest a two-step jump-preserving estimation procedure and show that it is statistically consistent. By applying the procedure to the 20-F study, we find a moderate positive association between the magnitude of the market reaction and the filing timeliness when the acceleration is less than 17 days. We also find that the market considers the filings significantly more informative when the acceleration is more than 18 days and such reaction tapers off when the acceleration exceeds 40 days.

