VALL-E

A neural codec language model for speech synthesis

VALL-E 2 is the latest advancement in neural codec language models and marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Building upon its predecessor, VALL-E, the new iteration introduces two significant enhancements:

  • Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It stabilizes decoding and circumvents the infinite loop issue encountered in VALL-E.
  • Grouped Code Modeling organizes codec codes into groups to shorten the sequence length, which boosts inference speed and addresses the challenges of long sequence modeling.

Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous zero-shot TTS systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases.
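To make the idea of Repetition Aware Sampling more concrete, here is a minimal Python sketch. This page does not specify the exact rule, so the details are assumptions made for illustration: the function names, the window size, the repetition threshold, and the choice to fall back to sampling from the full distribution when a token repeats too often. It should not be read as the authors' implementation.

```python
import random

def nucleus_sample(probs, top_p, rng):
    """Sample a token index from the smallest set of tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            max_ratio=0.5, rng=random):
    """Nucleus sampling with a repetition check on the decoding history.

    window and max_ratio are hypothetical hyperparameters for this sketch.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Assumed fallback: resample from the full distribution so the
            # decoder can escape a token it keeps repeating.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token

# Example call: token 0 dominates the distribution and the recent history.
rng = random.Random(0)
next_token = repetition_aware_sample([0.9, 0.05, 0.05], history=[0] * 10, rng=rng)
```

With a peaked distribution, plain nucleus sampling can only ever return the dominant token, which is one way a decoder could get stuck in a loop. The repetition check gives lower-probability tokens a chance once the history is saturated.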

This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.

Grouped code modeling not only accelerates inference by reducing the sequence length but also improves performance by mitigating the long context modeling problem. Based on the token repetition in the decoding history, repetition aware sampling enhances the stability of the decoding process and circumvents the infinite loop issue encountered in VALL-E.
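The effect of grouping on sequence length can be illustrated with a short Python sketch. It only shows the bookkeeping: partitioning a flat sequence of codec codes into consecutive groups so the language model takes one step per group. The helper names and the padding token are assumptions for illustration. How the model embeds and predicts each group is not described on this page and is not modeled here.

```python
def group_codes(codes, group_size, pad_token=0):
    """Partition a flat code sequence into consecutive groups of group_size.

    pad_token is a hypothetical filler used when the length is not a
    multiple of group_size.
    """
    codes = list(codes)
    remainder = len(codes) % group_size
    if remainder:
        codes += [pad_token] * (group_size - remainder)
    return [tuple(codes[i:i + group_size])
            for i in range(0, len(codes), group_size)]

def ungroup_codes(groups, original_length):
    """Flatten the groups and drop any padding added by group_codes."""
    flat = [code for group in groups for code in group]
    return flat[:original_length]

codes = list(range(10))            # stand-in for 10 codec codes
for group_size in (1, 2, 4):       # the group sizes used in the samples on this page
    groups = group_codes(codes, group_size)
    print(group_size, len(groups))  # number of decoding steps per group size
```

For a sequence of 10 codes, group sizes of 1, 2, and 4 give 10, 5, and 3 decoding steps respectively, which is where the inference speedup comes from.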

VALL-E 2 achieves human-parity zero-shot TTS performance for the first time. In this context, human parity indicates that the robustness (WER), naturalness (CMOS), and similarity (SMOS) metrics of VALL-E 2 surpass those of the ground truth samples (WER(GroundTruth) − WER(VALL-E 2) > 0, CMOS(VALL-E 2) − CMOS(GroundTruth) > 0, and SMOS(VALL-E 2) − SMOS(GroundTruth) > 0), meaning that VALL-E 2 can generate accurate, natural speech in the exact voice of the original speaker, comparable to human performance. It is important to note that this conclusion is drawn solely from experimental results on the LibriSpeech and VCTK datasets.
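The three inequalities can be written as a small Python check. The function name and dictionary keys are hypothetical, and the numbers in the example are placeholders chosen for illustration, not results reported for VALL-E 2.

```python
def reaches_human_parity(system, ground_truth):
    """Return True if all three criteria from the text hold.

    Lower WER is better, so the WER difference is taken the other way round.
    """
    return (
        ground_truth["wer"] - system["wer"] > 0
        and system["cmos"] - ground_truth["cmos"] > 0
        and system["smos"] - ground_truth["smos"] > 0
    )

# Placeholder scores for illustration only.
ground_truth = {"wer": 2.0, "cmos": 0.0, "smos": 4.0}
candidate = {"wer": 1.5, "cmos": 0.1, "smos": 4.2}
print(reaches_human_parity(candidate, ground_truth))
```

All three conditions must hold at once: a system that is more natural than the ground truth but has a higher word error rate would not count as reaching parity under this definition.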

Hard Examples

  • Text | Speaker Prompt | VALL-E | VALL-E 2
    F one F two F four F eight H sixteen H thirty two H sixty four
    Clever cats carefully crafted colorful collages creating cheerful compositions
    Curious koalas curiously climbed curious curious climbers
    Sad snakes sadly sighed sad sad sighs
    Joyful jaguars joyfully jumped joyful joyful jumps
    Noisy newts nonsensically nibbled noisy noisy nibbles
    Crafting a symphony of flavors the skilled chef orchestrated a culinary masterpiece that left an indelible mark mark mark mark mark on the palates of the discerning diners
    The future belongs to belongs to belongs to belongs to belongs to those who believe in the beauty of the beauty of the beauty of the beauty of the beauty of their dreams

LibriSpeech Samples

  • Text | Speaker Prompt (Prefix/Ref) | VALL-E | VALL-E 2 (GroupSize=1) | VALL-E 2 (GroupSize=2) | VALL-E 2 (GroupSize=4)
    They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission
    And lay me down in thy cold bed and leave my shining lot
    Number ten fresh nelly is waiting on you good night husband
    Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech
    Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid

VCTK Samples

  • Text | Speaker Prompt (3s/5s/10s) | VALL-E | VALL-E 2 (GroupSize=1) | VALL-E 2 (GroupSize=2) | VALL-E 2 (GroupSize=4)
    We have to reduce the number of plastic bags
    So what is the campaign about
    My life has changed a lot
    Nothing is yet confirmed
    I could hardly move for the next couple of days

Ethics Statement

VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public. VALL-E 2 could synthesize speech that maintains speaker identity and could be used for education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, chatbots, and so on. While VALL-E 2 can speak in a voice like the voice talent, the similarity and naturalness depend on the length and quality of the speech prompt, the background noise, and other factors. The model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model. If you suspect that VALL-E 2 is being used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.