Date: 2024-11-01
Time: 15:30-16:30 (Montreal time)
Location: In person, Burnside 1104
https://mcgill.zoom.us/j/81284191962
Meeting ID: 812 8419 1962
Passcode: None
Abstract:
Mixtures of experts (MoEs), a class of statistical machine learning models that combine multiple models, known as experts, to form more complex and accurate models, have been combined into deep learning architectures to improve the ability of these architectures and AI models to capture the heterogeneity of the data and to scale up these architectures without increasing the computational cost. In mixtures of experts, each expert specializes in a different aspect of the data, which is then combined with a gating function to produce the final output. Therefore, parameter and expert estimates play a crucial role by enabling statisticians and data scientists to articulate and make sense of the diverse patterns present in the data. However, the statistical behaviors of parameters and experts in a mixture of experts have remained unsolved, which is due to the complex interaction between gating function and expert parameters.
In the first part of the talk, we investigate the performance of the least squares estimators (LSE) under a deterministic MoEs model where the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed-forward networks with activation functions sigmoid(·) and tanh(·), are substantially faster than those of polynomial experts, which we show to exhibit a surprising slow estimation rate.
In the second part of the talk, we show that the insights from theories shed light into understanding and improving important practical applications in machine learning and artificial intelligence (AI), in- cluding effectively scaling up massive AI models with several billion parameters, efficiently finetuning large-scale AI models for downstream tasks, and enhancing the performance of Transformer model, state-of-the-art deep learning architecture, with a novel self-attention mechanism.
Speaker
Nhat Ho is currently an Assistant Professor of Statistics and Data Science at the University of Texas at Austin. He is a core member of the University of Texas, Austin Machine Learning Laboratory and senior personnel of the Institute for Foundations of Machine Learning. He is currently associate editor of Electronic Journal of Statistics and area chair of ICML, ICLR, AISTATS, etc. His current research focuses on the interplay of four principles of statistics and data science: (1) Heterogeneity of complex data, including mixture and hierarchical models and Bayesian nonparametrics; (2) Stability and optimality of optimization and sampling algorithms for solving statistical machine learning models; (3) Scalability and efficiency of optimal transport for machine learning and deep learning applications; (4) Interpretability, efficiency, and robustness of massive and complex machine learning models.