Efficient Integration of Fixed Beamformers and Speech Separation Networks for Multi-Channel Far-Field Speech Separation

  • Zhuo Chen,
  • Takuya Yoshioka,
  • Xiong Xiao,
  • ,
  • Michael L. Seltzer,
  • Yifan Gong

ICASSP 2018 | Published by IEEE

Speech separation research has progressed significantly in recent years thanks to rapid advances in deep learning technology. However, the performance of recently proposed single-channel neural network-based speech separation methods remains limited, especially in reverberant environments. To push the performance limit, we recently developed a method that integrates beamforming with single-channel speech separation. This paper proposes a novel architecture that integrates multi-channel beamforming and speech separation much more efficiently than our previous method. The proposed architecture comprises a set of fixed beamformers, a beam prediction network, and a speech separation network based on permutation invariant training (PIT). The beam prediction network takes the beamformed audio signals as input and estimates the best beam for each speaker constituting the input mixture. Two variants of the PIT-based speech separation network are proposed. Our approach is evaluated on reverberant speech mixtures under three different mixing conditions, covering cases where the speakers overlap only partially or one speaker's utterance is very short. The experimental results show that the proposed system significantly outperforms the conventional single-channel PIT system, matching the performance of a single-channel system that uses oracle masks.
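The core idea behind PIT, as referenced in the abstract, is to resolve the speaker-output ambiguity by evaluating the loss under every possible assignment of network outputs to reference speakers and training on the best one. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function name, array shapes, and the use of a simple mean-squared error are all assumptions made for the example.

```python
from itertools import permutations

import numpy as np

def pit_mse(estimates, references):
    """Permutation-invariant MSE (illustrative sketch).

    estimates, references: arrays of shape (num_speakers, frames, bins).
    Tries every assignment of estimated outputs to reference speakers
    and returns the minimum mean-squared error together with the
    permutation that achieves it.
    """
    n = estimates.shape[0]
    best_err, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Reorder the estimates according to this candidate assignment.
        err = np.mean((estimates[list(perm)] - references) ** 2)
        if err < best_err:
            best_err, best_perm = err, perm
    return best_err, best_perm

# Toy two-speaker example: the estimates come out in swapped order,
# which a permutation-sensitive loss would heavily penalize.
ref = np.stack([np.ones((4, 3)), np.zeros((4, 3))])
est = ref[::-1].copy()
err, perm = pit_mse(est, ref)
print(err, perm)  # error drops to 0 once the swap (1, 0) is undone
```

In a real training loop the selected permutation's loss would be backpropagated through the separation network; the exhaustive search over permutations is cheap for the two- or three-speaker mixtures typically considered.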