MAVIS

Established: October 13, 2008

The Microsoft Audio Video Indexing Service (MAVIS) uses state-of-the-art speech recognition technology developed at Microsoft Research to enable searching of audio and video files that contain speech. Additionally, MAVIS automatically generates closed captions and keywords, which can increase the accessibility and discoverability of audio and video files with speech content. MAVIS is available as a cloud service running on the Windows Azure platform.

MAVIS is now available programmatically through Azure Media Services, where it is referred to as the “Azure Media Services Indexer” (Indexer). The blog post “Introducing: Azure Media Indexer” describes how to submit media files for processing and retrieve the results. The MAVIS Portal can be used to try out the service.

At this time the Indexer supports English speech content.

For more information on MAVIS please contact us.

Search audio for spoken words – MAVIS generates a binary file that can be searched in Microsoft SQL Server using full-text search. The user experience is much like searching for text in documents and on the web, as demonstrated on the MAVIS trial site. Users type in search terms and receive a set of links; clicking a link starts playback of the audio or video from the point where those terms were spoken.

Highly accurate audio search results – MAVIS uses state-of-the-art Deep Neural Net (DNN) based speech recognition technology developed at Microsoft Research to convert digital audio signals into words. Furthermore, MAVIS reduces errors in speech recognition by automatically expanding its vocabulary and by storing word alternatives, using a technique referred to as Probabilistic Word-Lattice Indexing, explained in the technical background. These techniques help provide highly accurate search results.
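
The idea of storing word alternatives can be illustrated with a minimal sketch. This is not the MAVIS implementation: the data layout, the sample words, the probabilities, and the function names `build_index` and `search` are all assumptions made for illustration, and a real word lattice is a graph of competing hypotheses rather than the flat list of time slots used here.

```python
from collections import defaultdict

# Hypothetical recognizer output: for each time offset (in seconds), a list
# of (word, posterior probability) alternatives. A real lattice is a graph;
# this flat list of slots is a simplification for illustration only.
lattice = [
    (12.0, [("crimea", 0.55), ("crime", 0.40)]),
    (12.6, [("river", 0.90), ("reefer", 0.05)]),
    (47.3, [("crime", 0.85), ("climb", 0.10)]),
]

def build_index(slots):
    """Map each candidate word to every (offset, probability) where it may occur."""
    index = defaultdict(list)
    for offset, alternatives in slots:
        for word, prob in alternatives:
            index[word].append((offset, prob))
    return index

def search(index, term, min_prob=0.2):
    """Return (offset, probability) hits for `term`, most likely first."""
    hits = [(o, p) for o, p in index.get(term.lower(), []) if p >= min_prob]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

index = build_index(lattice)
print(search(index, "crime"))  # [(47.3, 0.85), (12.0, 0.4)]
```

Because the second-best alternative at 12.0 seconds is kept in the index, a search for "crime" still finds that position even though the recognizer's top choice there was a different word.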

Closed Captions – Closed captions can make audio and video content accessible to the hearing impaired, and can be translated so that the content reaches a broader audience in different languages. MAVIS generates closed captions in the SAMI and TTML formats. The accuracy of closed captions generated by MAVIS depends mainly on the clarity of speech in the media content. A number of subtitle editing tools available on the web can be used to edit the closed captions generated by MAVIS for improved accuracy. MAVIS provides an estimated level of accuracy to help determine whether post-editing is required.
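
As a rough illustration of working with TTML captions, the sketch below reads caption text and timings using Python's standard library. The sample document is hand-written for this example and is not actual MAVIS output; the helper names are hypothetical, and the code assumes `begin` and `end` use the `hh:mm:ss.fff` clock-time form, which is only one of the time expressions TTML allows.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C TTML vocabulary.
TTML_NS = {"tt": "http://www.w3.org/ns/ttml"}

# Hand-written sample for illustration; not actual MAVIS output.
SAMPLE_TTML = """<tt xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div>
      <p begin="00:00:01.200" end="00:00:03.900">Welcome to the archive.</p>
      <p begin="00:00:04.100" end="00:00:07.000">Today we discuss speech search.</p>
    </div>
  </body>
</tt>"""

def to_seconds(clock):
    """Convert an hh:mm:ss.fff clock time to seconds (assumed format)."""
    hours, minutes, seconds = clock.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def read_captions(ttml_text):
    """Return a list of (begin_seconds, end_seconds, text) tuples."""
    root = ET.fromstring(ttml_text)
    captions = []
    for p in root.iterfind(".//tt:p", TTML_NS):
        text = "".join(p.itertext()).strip()
        captions.append((to_seconds(p.get("begin")), to_seconds(p.get("end")), text))
    return captions

for begin, end, text in read_captions(SAMPLE_TTML):
    print(f"{begin:7.2f} - {end:7.2f}  {text}")
```

A list of timed captions like this is a convenient starting point for post-editing or for handing the text to a translation step.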

Keyword generation – MAVIS generates keywords from the speech content. The keywords are stored in an XML file with frequency and offset information. The keywords generated by MAVIS can be used to perform speech analytics, or exposed to search engines such as Bing, Google or Microsoft SharePoint to make the media files more discoverable, or used to deliver more relevant ads.
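
The kind of speech analytics mentioned above might look like the following sketch. The schema of the MAVIS keyword file is not described here, so the element and attribute names (`keywords`, `keyword`, `text`, `frequency`, `offset`, `seconds`) are entirely hypothetical, as are the helper functions; only the presence of frequency and offset information comes from the description above.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL schema invented for illustration; the real keyword file
# produced by MAVIS may be structured differently.
SAMPLE_KEYWORDS = """<keywords>
  <keyword text="budget" frequency="3">
    <offset seconds="14.2"/>
    <offset seconds="95.0"/>
    <offset seconds="301.7"/>
  </keyword>
  <keyword text="archive" frequency="1">
    <offset seconds="42.5"/>
  </keyword>
</keywords>"""

def load_keywords(xml_text):
    """Return {keyword: {"frequency": int, "offsets": [float, ...]}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for kw in root.iterfind("keyword"):
        offsets = [float(o.get("seconds")) for o in kw.iterfind("offset")]
        result[kw.get("text")] = {
            "frequency": int(kw.get("frequency")),
            "offsets": offsets,
        }
    return result

def top_keywords(keywords, n=5):
    """Return up to n keywords, most frequent first."""
    return sorted(keywords, key=lambda k: keywords[k]["frequency"], reverse=True)[:n]

keywords = load_keywords(SAMPLE_KEYWORDS)
print(top_keywords(keywords, n=1))     # ['budget']
print(keywords["budget"]["offsets"])   # [14.2, 95.0, 301.7]
```

The most frequent keywords could be published as page metadata for search engines, while the offsets allow a player to jump to each mention.
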

As the role of multimedia continues to grow in the enterprise, in government, and on the Internet, the need for technologies that better enable discovery and search of such content becomes all the more important.

Microsoft Research has been working in the area of speech recognition for over two decades, and speech-recognition technology is integrated in a number of Microsoft products, such as the Microsoft Windows OS, TellMe.com, Exchange 2010, and Office OneNote. Using integrated speech-recognition technology in the Windows OS, users can dictate into applications like Microsoft Word, or use speech to interact with their Windows system. The TellMe.com service allows mobile users to get directory services using speech while on the go. Microsoft Exchange 2010 and later provides an automatic transcript of incoming voicemails, and in Office OneNote users can search their speech recordings using keywords.

MAVIS adds to the list of Microsoft applications and services that use speech recognition. MAVIS is designed to enable searching of hundreds or even tens of thousands of hours of conversational speech from different speakers on different topics. As illustrated below, the user can type in a search term or phrase and get back links to where those words were spoken.

Searching media files requires the installation of the MAVIS SQL add-on on a machine running Microsoft SQL Server 2008 or later. The MAVIS SQL add-on includes the software components that perform full-text indexing of the binary files generated by MAVIS, an API for searching media files, and a sample web application to help develop a media search site, as illustrated by the MAVIS trial site.

MAVIS also generates closed captions in the SAMI and TTML formats, which can be used to make audio and video files accessible to people with hearing disabilities. Additionally, MAVIS generates keywords from the speech content, which are stored in an XML-formatted file with offset and frequency information.

MAVIS Architecture
- Sept 10, 2014: MAVIS announced publicly as Microsoft Azure Media Services Indexer
- Deep-Neural-Network Speech Recognition Debuts
- CERN multimedia searchable on DOE ScienceCinema using MAVIS
- ScienceCinema uses MAVIS to enable searching of videos from the US Department of Energy
- Microsoft Research’s video search hits the DOE
- TechFlash, SeattlePI and Seattle Times blogs on MAVIS and the State of Washington Digital Archives
- Technical background
- State of Georgia case study
- Unlocking Audio/Video Content with Speech Recognition (Mix 2010 talk)
- Questions or comments? Send us email at mmms@microsoft.com
