Neural TTS Stylization with Adversarial and Collaborative Games

Shuang Ma; Daniel McDuff; Yale Song

International Conference on Learning Representations (ICLR) | April 2019

The modeling of style when synthesizing natural human speech from text has been the focus of significant attention. Some state-of-the-art approaches train an encoder-decoder network on paired text and audio samples ⟨x_txt, x_aud⟩ by encouraging its output to reconstruct x_aud. The synthesized audio waveform is expected to contain the verbal content of x_txt and the auditory style of x_aud. Unfortunately, modeling style in TTS is somewhat under-determined, and training models with a reconstruction loss alone is insufficient to disentangle content and style from other factors of variation. In this work, we introduce an end-to-end TTS model that offers enhanced content-style disentanglement ability and controllability. We achieve this by combining a pairwise training procedure, an adversarial game, and a collaborative game into one training scheme. The adversarial game concentrates the generated samples on the true data distribution, and the collaborative game minimizes the distance between real samples and generated samples in both the original space and the latent space. As a result, our model delivers a highly controllable generator with a disentangled representation. Benefiting from the separate modeling of style and content, our model can generate human-fidelity speech that satisfies the desired style conditions. Our model achieves state-of-the-art results across multiple tasks, including style transfer (content and style swapping), emotion modeling, and identity transfer (fitting a new speaker's voice).
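
To make the combined training scheme more concrete, the following is a minimal sketch of how a generator objective could mix the adversarial term with the two collaborative distances (original space and latent space). It is an illustration under stated assumptions, not the paper's implementation: the function name `generator_loss`, the L1 distance in the original space, the squared distance in the latent space, the non-saturating adversarial term, and the weights `w_rec`, `w_latent`, and `w_adv` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def generator_loss(real_audio, fake_audio, real_latent, fake_latent,
                   disc_logits_fake, w_rec=1.0, w_latent=1.0, w_adv=1.0):
    """Hypothetical combined generator objective (illustration only).

    real_audio / fake_audio:   audio features of real and generated samples
    real_latent / fake_latent: latent codes of real and generated samples
    disc_logits_fake:          discriminator logits for the generated samples
    """
    # Collaborative game, original space: mean absolute (L1) distance.
    # The choice of L1 is an assumption.
    rec = np.mean(np.abs(real_audio - fake_audio))

    # Collaborative game, latent space: mean squared distance.
    # The choice of squared distance is an assumption.
    latent = np.mean((real_latent - fake_latent) ** 2)

    # Adversarial game: non-saturating generator loss,
    # -log(sigmoid(logit)) == log(1 + exp(-logit)), computed stably.
    adv = np.mean(np.logaddexp(0.0, -disc_logits_fake))

    total = w_rec * rec + w_latent * latent + w_adv * adv
    return total, {"rec": rec, "latent": latent, "adv": adv}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_audio = rng.normal(size=(2, 80))    # e.g. 2 samples, 80 features
    fake_audio = real_audio + 0.1            # generated output, slightly off
    real_latent = rng.normal(size=(2, 16))
    fake_latent = real_latent.copy()         # latents match exactly
    logits = np.zeros(2)                     # discriminator is undecided
    total, parts = generator_loss(real_audio, fake_audio,
                                  real_latent, fake_latent, logits)
    print(total, parts)
```

Returning the individual terms alongside the total makes it easier to see which game dominates during training; in practice the relative weights would need tuning.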