Distilling an End-to-End Voice Assistant Without Instruction Training Data

William Held

Georgia Tech

The NLP Reading Group is excited to host William Held, a PhD student at Georgia Tech, who will be speaking remotely on Zoom on Friday, November 22nd about “Distilling an End-to-End Voice Assistant Without Instruction Training Data”.

Talk Description

In this talk, I’ll cover the methods we used to train our Distilled Voice Assistant (DiVA) model. Recent efforts to simplify spoken NLP with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models “forgetting” capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the responses of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed directly with ASR data. We show that our Distilled Voice Assistant (DiVA) generalizes to unseen tasks and improves user experience, achieving a 72% win rate compared with state-of-the-art open models like Qwen 2 Audio. Finally, I’ll cover the open-source efforts we’ve made to support training and demoing Speech LLM systems.
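
To give a rough sense of the self-supervision idea described above, here is a minimal sketch of one common way to realize it: treat the frozen text-only LLM’s next-token distribution on the ASR transcript as the target, and train the speech model to match it with a KL-divergence loss. This is an illustrative sketch under those assumptions, not DiVA’s actual implementation (the exact objective is covered in the talk), and the function and tensor names below are hypothetical.

    import torch
    import torch.nn.functional as F

    def cross_modal_distillation_loss(
        speech_logits: torch.Tensor,  # (batch, seq, vocab) from the speech LLM given audio
        text_logits: torch.Tensor,    # (batch, seq, vocab) from the frozen text LLM given the transcript
    ) -> torch.Tensor:
        """KL divergence pulling the speech model's next-token
        distribution toward the text model's distribution."""
        # The text LLM is the teacher: detach so no gradient flows into it.
        teacher_log_probs = F.log_softmax(text_logits.detach(), dim=-1)
        student_log_probs = F.log_softmax(speech_logits, dim=-1)
        # KL(teacher || student); target is given as log-probabilities.
        return F.kl_div(student_log_probs, teacher_log_probs,
                        log_target=True, reduction="batchmean")

    # Toy usage with random logits standing in for real model outputs.
    batch, seq, vocab = 2, 16, 32000
    student = torch.randn(batch, seq, vocab, requires_grad=True)
    teacher = torch.randn(batch, seq, vocab)
    loss = cross_modal_distillation_loss(student, teacher)
    loss.backward()
    print(loss.item())

The appeal of an objective in this family is that the target comes from the text LLM’s own responses to transcripts, so no instruction-labeled speech data is required; plain ASR audio–transcript pairs suffice.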

Speaker Bio

William Held is a Machine Learning PhD student at Georgia Tech, advised by Diyi Yang in the Stanford NLP Group. Before that, he was an early engineer at Sunshine. His research focuses on enabling inclusive language technology by modeling linguistic variation.

Logistics

Date: Friday, November 22nd
Time: 11:30 AM
Location: Zoom (see email)