SpeechX

Neural Codec Language Model as a Versatile Speech Transformer

SpeechX is a versatile speech generation model leveraging audio and text prompts, which can deal with both clean and noisy speech inputs and perform zero-shot TTS and various tasks involving transforming the input speech. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting. This enables unified treatment of various tasks in an extensible manner, providing a consistent way of leveraging text input for speech enhancement and transformation. The current model, trained on 60K hours of speech audio, can perform zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing, where the spoken content can be altered while preserving the speaker and background sounds.

Read the paper

Model overview

SpeechX is a neural codec language model based on audio and text prompts and incorporates special tokens for task-dependent prompting, which helps the model to determine what a desired output is. The current model was trained on 60K hours of data from LibriLight.

Multiple tasks with one model

SpeechX deals with various input-output transformation relationships by employing a generic language modeling architecture using acoustic and textual tokens.

Applications / demo

Below, we included audio samples demonstrating how SpeechX performs in various speech-processing tasks. The audio files were normalized in amplitude and resampled at 16 kHz for listening. The speech samples and transcripts were taken from LibriSpeech test-clean. The speech samples below are provided for the sole purpose of illustrating SpeechX.

Speech Generation Tasks

Zero-shot TTS (Text To Speech)

SpeechX synthesizes speech in the style specified by an audio prompt

Text	Prompt (speaker)	SpeechX output	Ground truth
miss de graf said kenneth noticing the boy’s face critically as he stood where the light from the passage fell upon it
the paris plant like that at the crystal palace was a temporary exhibit
that summer’s emigration however being mainly from the free states greatly changed the relative strength of the two parties
it is my heart hung in the sky and no clouds ever float between the grave flowers and my heart on high

Spoken content editing

SpeechX helps correct misspoken words.

Original Text	Edited Text	Original speech	SpeechX output
cotton is a wonderful thing is it not boys she said rather primly	cotton is a wonderful thing with its soft and breathable texture she said rather primly
gold is the most common metal in the land of oz and is used for many purposes because it is soft and pliable	in the land of oz gold is the prevalent metal and is utilized for various reasons because it is soft and pliable.
its jaw is enormous and according to naturalists it is armed with no less than one hundred and eighty two teeth	its jaw is enormous and according to naturalists it is armed with no more than five hundred and thirty three teeth
shame on you citizens cried he i blush for my fellows of nottingham	citizens you should be ashamed cried he i blush for my fellows of nottingham

Background-preserving spoken content editing

SpeechX naturally corrects misspoken words by preserving original ambience

Original Text	Edited Text	Original speech	SpeechX output
we will go out together to the bower there is a way down to the court from my window	we will go out together to the bower there is a pathway down to the courtyard
she has been dead these twenty years	she has been dead thirty nine years
its origin was small a germ an insignificant seed hardly to be thought of as likely to arouse opposition	its origin was small a germ an insignificant seed always expected to provoke strong opposition
he was a fanatic on formality and he only addressed me in the third person to the point where it got tiresome	he was obsessive about protocol and always spoke to me in the third person to the point where it got tiresome

Noise suppression

SpeechX removes unwanted background sounds that have been mixed into your recordings. Text input is optional, but it helps.

Text	Noisy speech	SpeechX output	Ground truth
secure as he thought in the careful administration of justice in that city and the character of its well disposed inhabitants the good hidalgo was far from thinking that any disaster could befal his family
one day when the boy was sent by his grandfather with a message to a relation he passed along a street in which there was a great concourse of horsemen
the ideas also remain but they have become types in nature forms of men animals birds fishes
the salient features of this development of domestic service have already been indicated

Target speaker extraction

SpeechX zeros in on one person in a mixture of voices

Text	Prompt (speaker)	Mixed speech	SpeechX output	Ground truth
i allude to the goddess
at that moment the gentleman entered bearing a huge object concealed by a piece of green felt
I knew nothing of the doctrine of faith because we were taught sophistry instead of certainty and nobody understood spiritual boasting
it is my heart hung in the sky and no clouds ever float between the grave flowers and my heart on high

Speech removal

SpeechX can naturally erase human voices for audio redaction

Noisy speech	SpeechX output	Ground truth

Ethics statement

SpeechX could synthesize speech that maintains speaker identity and could be used for educational learning, entertainment, journalistic, self-authored content, accessibility features, interactive voice response systems, translation, chatbot, and so on. While SpeechX can speak in a voice like the voice talent, the similarity, and naturalness depend on the length and quality of the speech prompt, the background noise, as well as other factors. It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.