Adversarial Training for Speech Super-Resolution

IEEE Journal of Selected Topics in Signal Processing | , Vol 13(2): pp. 347-358

Publication

Speech super-resolution or speech bandwidth expansion aims to upsample a given speech signal by generating the missing high-frequency content. In this paper, we propose a deep neural network approach exploiting the adversarial training ideas that have been shown effective in image super-resolution. Specifically, our proposed network follows the generative adversarial networks setup, where the generator network uses a convolutional autoencoder architecture with one-dimensional convolution kernels to generate high-frequency log-power spectra from the low-frequency log-power spectra of the input speech. We propose to use both the reconstruction loss and the adversarial loss for training, and we employ a recent regularization method that penalizes the gradient norms of the discriminator to stabilize the training. We compare our proposed approach with two state-of-the-art neural network baselines and evaluate these methods with both objective speech quality measures and subjective perceptual and intelligibility tests. Results show that our proposed method outperforms both baselines in terms of both objective and subjective evaluations. To gain insights of the network architecture, we analyze key parameters of the proposed network including the number of layers, the number of convolution kernels, and the relative weight of the reconstruction and adversarial losses. Besides, we analyze the computational complexity of our method and the baselines and discuss ways for phase estimation. We further develop a noise-resilient version of the proposed approach by training the network with noisy speech inputs. Objective evaluation validates the noise-resilient property on unseen noise types.