Decoder-only Modeling for Speech

Invited Talk at NTU and SJTU

In natural language processing, foundation models have typically come in three flavors: Encoder-only (e.g., BERT), Encoder-Decoder (e.g., T5), and Decoder-only (e.g., GPT-*, LLaMA, PaLM). Encoder-only and Encoder-Decoder variants have been particularly effective when a pretrained model is finetuned for specific downstream tasks. However, when it comes to zero-shot generalization and multi-turn generation, Decoder-only models have proven more successful. Decoder-only models have also reliably demonstrated in-context learning and potentially better parameter efficiency. In this talk, we will present a few works showing that the decoder-only architecture is a promising candidate for many speech and language tasks. We will also show that it can learn across different tasks, be more parameter efficient, and integrate a text LLM seamlessly, with the capability of in-context learning.