Developing streaming end-to-end models for automatic speech recognition in industry

MSR-TR-2020-45

Published by Microsoft

Recently, the speech community has seen a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). Because E2E models are more data hungry, building industry-scale ASR systems provides a good platform for E2E technology development. On the other hand, developing ASR systems in industry comes with additional practical requirements. In this talk, I will share our journey at Microsoft of developing high-accuracy, low-latency streaming RNN-T models that surpass high-performance hybrid models, can be customized to a new domain using text-only data, and can be personalized with only minutes of speaker data. To retain high accuracy while avoiding the heavy runtime cost of typical Transformer models, we designed a streamable, low-latency, and low-cost Transformer Transducer with a “masking is all you need” strategy. I will conclude the talk with the recently developed streaming unmixing and recognition transducer, the first low-latency streaming multi-talker E2E model.
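
The “masking is all you need” idea refers to controlling look-ahead purely through the self-attention mask, so that one Transformer Transducer encoder can run in a streaming, low-latency fashion. As an illustration only, the sketch below builds a chunk-wise attention mask of the kind commonly used for streaming Transformers; the function name, the chunk and left-context parameters, and the use of PyTorch are assumptions made for this example, not the exact recipe described in the talk.

```python
import torch

def chunk_attention_mask(num_frames: int, chunk_size: int, left_chunks: int) -> torch.Tensor:
    """Build a boolean mask for chunk-wise streaming self-attention (illustrative sketch).

    Each frame may attend to frames in its own chunk and in up to `left_chunks`
    preceding chunks; there is no look-ahead beyond the current chunk.
    True = attention allowed.
    """
    chunk_idx = torch.arange(num_frames) // chunk_size  # chunk id of each frame
    q_chunk = chunk_idx.unsqueeze(1)                     # (T, 1) query chunk ids
    k_chunk = chunk_idx.unsqueeze(0)                     # (1, T) key chunk ids
    # Allowed if the key's chunk is not in the future and not too far in the past.
    return (k_chunk <= q_chunk) & (k_chunk >= q_chunk - left_chunks)

# Example: 10 frames, chunks of 4 frames, 1 left-context chunk.
mask = chunk_attention_mask(10, chunk_size=4, left_chunks=1)
# The mask would typically be applied to attention scores, e.g.
# scores.masked_fill(~mask, float("-inf")), before the softmax.
```

Because only the mask changes, the same model weights can in principle serve different latency budgets by varying the chunk size and allowed left context at inference time.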