Advancing end-to-end automatic speech recognition

MSR-TR-2021-32

Published by Microsoft

Keynote talk at the conference on Computational Linguistics and Speech Processing, 2021.

Recently, the speech community has seen a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve state-of-the-art results on most benchmarks in terms of ASR accuracy, hybrid models still dominate commercial ASR systems at the current time. Many practical factors affect the decision to deploy a model in production. Traditional hybrid models, having been optimized for production for decades, usually handle these factors well. Unless E2E models provide excellent solutions to all of these factors, it is hard for them to be widely commercialized.

In this talk, I will give an overview of recent advances in E2E models, with a focus on technologies that address those challenges from an industry perspective. Specifically, I will describe methods for 1) building high-accuracy, low-latency E2E models, 2) building a single E2E model to serve all multilingual users, 3) customizing and adapting E2E models to a new domain, and 4) extending E2E models for multi-talker ASR, among others. Finally, I will conclude the talk with some challenges we should address in the future.