End-To-End Generation of Talking Faces from Noisy Speech

Acoustic cues are not the only component in speech communication; if the visual counterpart is present, it is shown to benefit speech comprehension. In this work, we propose an end-to-end (no pre- or post-processing) system that can generate talking faces from arbitrarily long noisy speech. We propose a mouth region mask to encourage the network to focus on mouth movements rather than speech irrelevant movements. In addition, we use generative adversarial network (GAN) training to improve the image quality and mouth-speech synchronization. Furthermore, we employ noise-resilient training to make our network robust to unseen non-stationary noise. We evaluate our system with image quality and mouth shape (landmark) measures on noisy speech utterances with five types of unseen non-stationary noise between -10 dB and 30 dB signal-to-noise ratio (SNR) with increments of 1 dB SNR. Results show that our system outperforms a state-of-the-art baseline system significantly, and our noise-resilient training improves performance for noisy speech in a wide range of SNR.