Multi-Teacher Distillation: An Ensemble-Then-Distill Approach

Prof. Lili Mou

University of Alberta

The NLP Reading Group is delighted to have Prof. Lili Mou give a talk titled “Multi-Teacher Distillation: An Ensemble-Then-Distill Approach”.

Talk Description

Knowledge distillation (KD) aims to transfer the knowledge in a large model (called a teacher) into a small one (called a student), and has become an increasingly important research topic as deep learning models keep growing in size. Today, large models such as ChatGPT, LLaMa, and T5 are readily available, so it is natural to ask: Can we distill the knowledge from multiple teachers? At first glance, multi-teacher KD appears easy, as we can simply train the student on the union of the teachers’ predictions. However, I would argue that such a naïve attempt may not work well. Traditional KD adopts the cross-entropy loss, and cross-entropy against the union of teachers’ predictions amounts to fitting the average of their distributions, which tends to yield an overly smooth student distribution when the teachers disagree. In this talk, I will present a novel ensemble-then-distill approach, which builds an ensemble of teacher models to train the student. I will also discuss applications to text generation and syntactic parsing.
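
To make the contrast concrete, below is a minimal sketch of the two objectives on a toy vocabulary. The setup is illustrative only: soft-label distillation is assumed, and the normalized product-of-experts combination stands in for the ensemble, which is just one possible way to combine teachers and not necessarily the method presented in the talk.

```python
# Illustrative sketch: naive multi-teacher KD vs. ensemble-then-distill.
# Hypothetical toy setup; the product-of-experts ensemble is only one
# way to combine teachers, not necessarily the talk's method.
import torch
import torch.nn.functional as F

vocab_size = 5
teacher_logits = torch.randn(3, vocab_size)      # 3 teachers, one toy example
student_logits = torch.randn(vocab_size, requires_grad=True)

teacher_probs = teacher_logits.softmax(dim=-1)   # shape (3, vocab_size)
student_logp = F.log_softmax(student_logits, dim=-1)

# Naive union-of-teachers KD: average the cross-entropy to each teacher.
# Since cross-entropy is linear in its target, this equals cross-entropy
# against the *mean* teacher distribution, i.e., a smoothed-out mixture.
naive_target = teacher_probs.mean(dim=0)
naive_loss = -(naive_target * student_logp).sum()

# Ensemble-then-distill: first combine the teachers into a single,
# sharper target (here a normalized product of experts), then distill
# the student from that one target.
ensemble_target = teacher_probs.prod(dim=0)
ensemble_target = ensemble_target / ensemble_target.sum()
ensemble_loss = -(ensemble_target * student_logp).sum()

naive_entropy = -(naive_target * naive_target.log()).sum().item()
ensemble_entropy = -(ensemble_target * ensemble_target.log()).sum().item()
print(f"entropy of averaged target:  {naive_entropy:.3f}")
print(f"entropy of ensemble target:  {ensemble_entropy:.3f}")
```

Typically the ensemble target has lower entropy than the averaged one, which illustrates the smoothing problem that the naive union-of-teachers objective runs into.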

Speaker Bio

Dr. Lili Mou is an Assistant Professor in the Department of Computing Science, University of Alberta. He is also an Alberta Machine Intelligence Institute (Amii) Fellow and a Canada CIFAR AI (CCAI) Chair. Lili received his BS and PhD degrees in 2012 and 2017, respectively, from the School of EECS, Peking University. After that, he worked as a postdoctoral fellow at the University of Waterloo. His research interests lie mainly in designing novel machine learning algorithms and frameworks for NLP. He has publications in top conferences and journals, including ACL, EMNLP, TACL, ICML, ICLR, and NeurIPS. He also presented tutorials at EMNLP’19 and ACL’20. He received an AAAI New Faculty Highlight Award in 2021.

Logistics

Date: April 18th
Time: 11:00 AM
Location: Auditorium 2 or via Zoom (see email)