Efficient integration of fixed beamformers and speech separation networks for multi-channel far-field speech separation

Zhuo Chen; Takuya Yoshioka; Xiong Xiao; Jinyu Li; Michael L. Seltzer; Yifan Gong

ICASSP | April 2018

Published by IEEE

Speech separation research has significantly progressed in recent years thanks to the rapid advances in deep learning technology.
However, the performance of recently proposed single-channel neural network-based speech separation methods is still limited, especially in reverberant environments.
To push the performance limit,
we recently developed a method of integrating
beamforming and single-channel speech separation approaches.
This paper proposes a novel architecture that integrates
multi-channel beamforming and speech separation in
a much more efficient way than our previous method.
The proposed architecture comprises
a set of fixed beamformers,
a beam prediction network,
and a speech separation network based on permutation
invariant training (PIT).
The beam prediction network takes in the beamformed audio signals
and estimates the best beam for each speaker constituting the input mixture.
Two variants of PIT-based speech separation networks are proposed.
Our approach is evaluated on reverberant speech mixtures
under three different mixing conditions, covering cases
where speakers partially overlap or one speaker’s utterance is very short.
The experimental results show that the proposed system
significantly outperforms the conventional single-channel PIT system and matches the performance of a single-channel system that uses oracle masks.
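
As an illustration of the permutation invariant training (PIT) criterion mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes the network outputs and the reference signals are NumPy arrays of shape (speakers, frames, features) and uses mean squared error as the per-assignment cost; the function name `pit_mse` and these shapes are hypothetical choices made for the example.

```python
from itertools import permutations

import numpy as np


def pit_mse(estimates, references):
    """Return (loss, best_perm): the lowest mean squared error over all
    speaker orderings, and the ordering that achieves it.

    Both inputs are assumed to have shape (speakers, frames, features).
    """
    n_spk = estimates.shape[0]
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n_spk)):
        # Pair estimate i with reference perm[i] and score the assignment.
        loss = float(np.mean((estimates - references[list(perm)]) ** 2))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Hypothetical toy data: two speakers whose outputs come back in swapped order.
rng = np.random.default_rng(0)
references = rng.standard_normal((2, 5, 3))
estimates = references[::-1].copy()
loss, perm = pit_mse(estimates, references)
```

With the swapped toy data, the search selects the assignment `(1, 0)` with zero loss, which is the point of PIT: the training target does not depend on the order in which the network emits the speakers. Exhaustive search over orderings is only practical for a small number of speakers.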