An End-to-end Architecture of Online Multi-channel Speech Separation

  • Jian Wu,
  • Zhuo Chen,
  • ,
  • Takuya Yoshioka,
  • Zhili Tan,
  • Ed Lin,
  • Yi Luo,
  • Lei Xie

Interspeech

Although mask-based adaptive beamforming benefits speech recognition
in far-field, noisy, and multi-talker scenarios, it depends on a long
temporal context to estimate target and interference statistics. When
it is applied in applications with low-latency requirements, its
performance therefore usually drops drastically. In contrast, fixed
beamformers introduce no time delay but usually have limited
capability to cancel interfering sources. In this work, we propose a
novel multi-channel speech separation system that targets overlapped
speech recognition with low-latency processing and comprises four
jointly optimized components: a pre-separator, a set of fixed
beamformers, an attentional selection module, and a neural
post-filter. With the proposed model, low-latency processing is
achieved by exploiting the known microphone geometry, while
high-quality separation is maintained through neural post-filtering
and end-to-end optimization. In our experiments, we show that the
proposed system achieves performance comparable to the mask-based
MVDR and speech extraction systems in offline evaluation, while
yielding remarkable improvements in online evaluation.
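
To illustrate why fixed beamformers add no algorithmic latency, the sketch below builds a bank of delay-and-sum beamformers from a known array geometry and soft-selects among their outputs frame by frame. It is a minimal illustration under stated assumptions, not the system described above: the circular array, the 18 look directions, the delay-and-sum design, and the magnitude-based query are all hypothetical choices, and the function names (`circular_array`, `fixed_beamformer_weights`, `apply_beamformers`, `attention_select`) are invented for this example. The abstract does not specify the beamformer design or the attention mechanism, and the pre-separator and neural post-filter are omitted entirely.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed room-temperature value)


def circular_array(num_mics, radius):
    """Hypothetical geometry: mics evenly spaced on a circle, (M, 2) in metres."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)


def fixed_beamformer_weights(mic_xy, look_dirs, freqs):
    """Delay-and-sum weights of shape (B, F, M) for far-field look directions.

    The weights depend only on geometry, so they are computed once, offline.
    """
    unit = np.stack([np.cos(look_dirs), np.sin(look_dirs)], axis=1)  # (B, 2)
    # Arrival delay at each mic relative to the array centre, in seconds.
    delays = -(unit @ mic_xy.T) / SPEED_OF_SOUND  # (B, M)
    # Steering vector d = exp(-j 2 pi f tau); delay-and-sum uses w = d / M.
    steer = np.exp(-2j * np.pi * freqs[None, :, None] * delays[:, None, :])
    return steer / mic_xy.shape[0]


def apply_beamformers(weights, stft):
    """Apply y = w^H x per frame: stft (M, F, T) -> beams (B, F, T)."""
    return np.einsum("bfm,mft->bft", weights.conj(), stft)


def attention_select(beams, query):
    """Soft-select beams per frame using a real-valued (F, T) query.

    The query stands in for whatever a pre-separator might provide;
    here it is just a magnitude spectrogram (an assumption).
    """
    mag = np.abs(beams)  # (B, F, T)
    scores = np.einsum("bft,ft->bt", mag, query) / np.sqrt(mag.shape[1])
    scores = scores - scores.max(axis=0, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=0, keepdims=True)  # (B, T), sums to 1 over beams
    combined = np.einsum("bt,bft->ft", attn, beams)  # (F, T)
    return combined, attn
```

Every step here operates on the current frame only, which is the property that makes a fixed beamformer bank attractive for online use. An adaptive design such as mask-based MVDR would instead need spatial covariance estimates accumulated over many frames.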