Now, that improvement has been deployed to the world. Microsoft is updating the Microsoft Audio Video Indexing Service with new algorithms that enable customers to take advantage of the improved accuracy detailed in a paper that Yu, Seide, and Gang Li, also of Microsoft Research Asia, delivered in Florence, Italy, during Interspeech 2011, the 12th annual Conference of the International Speech Communication Association.
The algorithms represent the first time a company has released a deep-neural-network (DNN)-based speech-recognition algorithm in a commercial product.
It’s a big deal. The benefits, says Behrooz Chitsaz, director of Intellectual Property Strategy for Microsoft Research, are improved accuracy and faster processing.
He says that tests have demonstrated that the algorithm provides a 10- to 20-percent relative error reduction and uses about 30 percent less processing time than the best-of-breed speech-recognition algorithms based on so-called Gaussian Mixture Models.
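To see what a relative reduction means in practice, here is a minimal sketch of the arithmetic. The word-error rates below are hypothetical; only the 10- to 20-percent relative figure comes from the reported tests.

```python
# Illustrative arithmetic only: what a "relative" error reduction means.
# The word-error rates below are hypothetical examples, not reported results.
baseline_wer = 0.20            # e.g., a GMM-based system at a 20% word-error rate
relative_reduction = 0.15      # midpoint of the reported 10-20 percent range

dnn_wer = baseline_wer * (1 - relative_reduction)
print(f"Baseline WER: {baseline_wer:.1%}")   # 20.0%
print(f"DNN WER:      {dnn_wer:.1%}")        # 17.0%
```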
Importantly, deep neural networks achieve these gains without the need for “speaker adaptation.” In comparison, today’s state-of-the-art technology operates in “speaker-adaptive” mode, in which an audio file is recognized multiple times, and after each time, the recognizer “tunes” itself a little more closely to the specific speaker or speakers in the file, so that the next time, it gets better—an expensive process.
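To make that contrast concrete, here is a schematic sketch of the multi-pass loop described above. Every function in it is a hypothetical placeholder, not the recognizer’s actual API; the point is the shape of the loop, and why it is expensive.

```python
# Schematic sketch of multi-pass, speaker-adaptive decoding. All functions
# here are hypothetical placeholders; the expense comes from decoding the
# same audio file once per pass.

def recognize(audio, model):
    """Decode the audio with the current acoustic model (placeholder)."""
    return f"transcript of {audio} using {model}"

def adapt(model, audio, transcript):
    """Tune the model toward this speaker using the latest pass (placeholder)."""
    return f"{model}+adapted"

def speaker_adaptive_decode(audio, model, passes=3):
    transcript = None
    for _ in range(passes):              # each pass re-decodes the full file
        transcript = recognize(audio, model)
        model = adapt(model, audio, transcript)
    return transcript

# A speaker-independent DNN system, by contrast, would call recognize() once.
print(speaker_adaptive_decode("meeting.wav", "speaker-independent-model"))
```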
The ultimate goal of automatic speech recognition, Chang’s story indicates, is out-of-the-box speaker-independent services that don’t require user training. Such services are critical in mobile scenarios, at call centers, and in web services for speech-to-speech translation. It’s difficult to overstate the impact that this technology will have as it rolls out across the breadth of Microsoft’s other services and applications that employ speech recognition.
Artificial neural networks are mathematical models of low-level circuits in the human brain. They have been used in speech recognition for more than 20 years, but only in the past few years have computer scientists gained access to enough computing power to build models fine-grained and complex enough to show promise in automatic speech recognition.
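As a rough illustration of what such a model computes, here is a minimal sketch of a deep feedforward network. The layer sizes, activations, and random weights are arbitrary stand-ins, not the configuration of the system described in this article.

```python
import numpy as np

# Minimal sketch of a deep feedforward network: stacked nonlinear layers
# mapping an acoustic feature vector to scores over speech units. All sizes
# and weights here are arbitrary illustrations.

rng = np.random.default_rng(0)
layer_sizes = [39, 256, 256, 256, 100]   # features -> 3 hidden layers -> outputs

weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """One forward pass: sigmoid hidden layers, softmax output scores."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))   # sigmoid hidden activation
    logits = x @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max())
    return e / e.sum()                            # softmax over output units

scores = forward(rng.standard_normal(39))        # one 39-dim acoustic frame
print(scores.shape, scores.sum())                # (100,) 1.0
```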
George Dahl, an intern at Microsoft Research Redmond who is now at the University of Toronto, contributed insights into the workings of DNNs and experience in training them. His work helped Yu and teammates produce a paper called Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition.
In October 2010, Yu presented the paper during a visit to Microsoft Research Asia. Seide was intrigued by the research results, and the two joined forces in a collaboration that has scaled up the new DNN-based algorithms to thousands of hours of training data.