Large-Scale Streaming End-to-End Speech Translation

Invited Talk at NTU and SJTU

We propose to use the Transformer-Transducer (T-T) for streaming end-to-end (E2E) speech translation (ST). Compared with cascaded ST, which performs automatic speech recognition (ASR) followed by text-based machine translation (MT), the proposed model enjoys low inference latency and computational cost while approaching the quality of cascaded ST. We then extend it to build a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into the text of a target language. We further extend SM2 to multiple output languages; this extension adds only a small number of parameters to the original model and enables truly zero-shot capability for unseen {source-speech, target-text} pairs. A non-erasing decoding method will also be introduced, which completely solves the stability issue for online streaming systems. Finally, the model is further improved by simultaneously generating ASR and ST results.
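To make the non-erasing idea concrete, here is a minimal sketch of a streaming display policy that only ever appends tokens, never revising text already shown to the user. The prefix-agreement commit rule used here is an illustrative assumption for this sketch, not the method presented in the talk:

```python
def non_erasing_stream(partial_hypotheses):
    """Commit tokens monotonically: once a token is displayed, it is never revised.

    At each step, we commit only the part of the new partial hypothesis that
    agrees with the previous partial hypothesis and extends what is already
    committed (hypothetical prefix-agreement rule; the actual stability
    criterion in the talk may differ). The displayed output can only grow.
    """
    committed = []  # tokens already shown to the user (never retracted)
    prev = []       # previous partial hypothesis from the decoder
    for hyp in partial_hypotheses:
        # Longest common prefix between consecutive partial hypotheses.
        stable = 0
        while stable < len(prev) and stable < len(hyp) and prev[stable] == hyp[stable]:
            stable += 1
        # Append only the newly stabilized tokens beyond what is committed.
        if stable > len(committed):
            committed.extend(hyp[len(committed):stable])
            yield list(committed)
        prev = hyp
    # Flush: commit the remainder of the final hypothesis at end of stream.
    if len(prev) > len(committed):
        committed.extend(prev[len(committed):])
        yield list(committed)
```

In contrast, a naive streaming display re-renders the full latest hypothesis at every step, so earlier words can flicker or disappear when the decoder revises them; the append-only policy above trades a little latency for a stable on-screen transcript.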