VALL-E Family

Neural codec language models for speech synthesis

VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Building upon the foundation laid by its predecessor, VALL-E, the new iteration introduces two significant enhancements to elevate its performance.

Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue encountered in VALL-E.

Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling.

Our experiments, conducted on the LibriSpeech and VCTK datasets, have shown that VALL-E 2 surpasses previous zero-shot TTS systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases.
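As a rough illustration of the repetition aware sampling idea described above, the Python sketch below draws a token with nucleus sampling and, if that token already dominates a recent window of the decoding history, redraws from the full distribution. The function names, the window size, the repetition threshold, and the fallback to plain random sampling are assumptions made for illustration. This page does not specify them, and the sketch is not the VALL-E 2 implementation.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample a token id from the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    rng: random.Random,
    top_p: float = 0.8,    # assumed value, not taken from the page
    window: int = 10,      # assumed history window (must be > 0)
    max_ratio: float = 0.1 # assumed repetition threshold
) -> int:
    """Nucleus sampling with a repetition check against the recent decoding history."""
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history[-window:])
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Assumed fallback: the candidate repeats too often, so redraw
            # from the full distribution to give other tokens a chance.
            token = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return token


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.9, 0.05, 0.05]  # toy distribution over three hypothetical codec tokens
    history = [0] * 10         # decoding is stuck on token 0
    print([repetition_aware_sample(probs, history, rng, top_p=0.5) for _ in range(20)])
```

With `top_p=0.5` in this toy setup, plain nucleus sampling can only ever return token 0, so decoding would loop on it. The repetition check is what lets tokens 1 and 2 appear again.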

This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.


Grouped code modeling not only accelerates inference by reducing the sequence length but also improves performance by mitigating the long context modeling problem. Based on the token repetition in the decoding history, repetition aware sampling enhances the stability of the decoding process and circumvents the infinite loop issue encountered in VALL-E.

VALL-E 2 achieves human parity zero-shot TTS performance for the first time. In this context, human parity indicates that the robustness, naturalness, and similarity metrics of VALL-E 2 surpass those of the ground truth samples (WER(GroundTruth) − WER(VALL-E 2) > 0, CMOS(VALL-E 2) − CMOS(GroundTruth) > 0, and SMOS(VALL-E 2) − SMOS(GroundTruth) > 0), meaning that VALL-E 2 can generate accurate, natural speech in the exact voice of the original speaker, comparable to human performance. It is important to note that this conclusion is drawn solely from experimental results on the LibriSpeech and VCTK datasets.
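To make the sequence-length effect of grouped code modeling concrete, here is a minimal Python sketch that partitions a flat sequence of codec codes into consecutive groups. The helper names `group_codes` and `ungroup_codes`, the padding value, and the choice to pad the final group are assumptions for illustration only. How VALL-E 2 embeds and predicts each group is not described on this page and is not modeled here.

```python
from typing import List, Sequence


def group_codes(codes: Sequence[int], group_size: int, pad_id: int = -1) -> List[List[int]]:
    """Partition a flat code sequence into consecutive groups of `group_size`.

    The last group is padded with `pad_id` (an assumed convention) so that
    every group has the same length.
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    padded = list(codes)
    remainder = len(padded) % group_size
    if remainder:
        padded.extend([pad_id] * (group_size - remainder))
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


def ungroup_codes(groups: Sequence[Sequence[int]], pad_id: int = -1) -> List[int]:
    """Flatten groups back into a code sequence, dropping padding entries."""
    return [code for group in groups for code in group if code != pad_id]


if __name__ == "__main__":
    codes = list(range(10))  # ten hypothetical codec codes
    for group_size in (1, 2, 4):
        groups = group_codes(codes, group_size)
        # The model would step over len(groups) positions instead of len(codes).
        print(group_size, len(groups), groups)
```

With ten codes, group sizes of 1, 2, and 4 give 10, 5, and 3 modeling steps respectively, which is the sense in which grouping shortens the sequence the language model has to handle.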

Audio Samples

VALL-E 2 can synthesize personalized speech even for the hard texts from ELLA-V. The speaker prompts are sampled from the LibriSpeech dataset.

Text	Speaker Prompt	VALL-E	VALL-E 2
F one F two F four F eight H sixteen H thirty two H sixty four
Clever cats carefully crafted colorful collages creating cheerful compositions
Curious koalas curiously climbed curious curious climbers
Sad snakes sadly sighed sad sad sighs
Joyful jaguars joyfully jumped joyful joyful jumps
Noisy newts nonsensically nibbled noisy noisy nibbles
Crafting a symphony of flavors the skilled chef orchestrated a culinary masterpiece that left an indelible mark mark mark mark mark on the palates of the discerning diners
The future belongs to belongs to belongs to belongs to belongs to those who believe in the beauty of the beauty of the beauty of the beauty of the beauty of their dreams

VALL-E 2 can perform zero-shot speech continuation with the first 3-second prefix as the speaker prompt, and speech synthesis with a reference utterance of an unseen speaker as the speaker prompt. The audio and transcriptions are sampled from the LibriSpeech dataset.

Text	Speaker Prompt (Prefix/Ref)	VALL-E	VALL-E 2 (GroupSize=1)	VALL-E 2 (GroupSize=2)	VALL-E 2 (GroupSize=4)
They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission

And lay me down in thy cold bed and leave my shining lot
And lay me down in thy cold bed and leave my shining lot
Number ten fresh nelly is waiting on you good night husband
Number ten fresh nelly is waiting on you good night husband
Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech

Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid

Zero-shot TTS from 3-second, 5-second and 10-second speaker prompts. The audio and transcriptions are sampled from the VCTK dataset.

Text	Speaker Prompt (3s/5s/10s)	VALL-E	VALL-E 2 (GroupSize=1)	VALL-E 2 (GroupSize=2)	VALL-E 2 (GroupSize=4)
We have to reduce the number of plastic bags


So what is the campaign about


My life has changed a lot


Nothing is yet confirmed


I could hardly move for the next couple of days

Ethics Statement

VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public. VALL-E 2 could synthesize speech that maintains speaker identity and could be used for education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, chatbots, and so on. While VALL-E 2 can speak in a voice like that of the voice talent, the similarity and naturalness depend on the length and quality of the speech prompt, the background noise, and other factors. The model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model. If you suspect that VALL-E 2 is being used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.