Achieving Human Parity in Conversational Speech Recognition

  • Geoff Zweig
  • Jasha Droppo
  • Xuedong Huang
  • Frank Seide
  • Mike Seltzer
  • Andreas Stolcke
  • Dong Yu

Invited talk, IEEE SLT Workshop

The Switchboard-1 Telephone Speech Corpus was originally collected by Texas Instruments in 1990-91, under DARPA sponsorship, and marked the beginning of more than 25 years of intensive effort in conversational speech recognition. Recently, we measured how accurately professional transcribers transcribe this kind of data, and found that our latest systems have reached the same level of performance. In this talk, I will describe the key technological advances that have made this possible: the systematic use of CNN and LSTM models in both acoustic and language modeling, as well as the extensive use of system combination. The talk will also provide an analysis of the errors made by people and computers, which shows substantially similar error patterns, with the exception of confusions between backchannel acknowledgments and hesitations.