Towards Real-Time Single-Channel Speech Separation in Noisy and Reverberant Environments

2023 International Conference on Acoustics, Speech, and Signal Processing

Real-time single-channel speech separation aims to unmix an audio stream captured from a single microphone that contains multiple people talking at once, environmental noise, and reverberation into multiple de-reverberated and noise-free speech tracks, each containing only one talker. While large state-of-the-art DNNs can achieve excellent separation from anechoic mixtures of speech, the main challenge is to create compact and causal models that can separate reverberant mixtures at inference time. In this paper, we explore low-complexity, resource-efficient, causal DNN architectures for real-time separation of two or more simultaneous speakers. A cascade of three neural network modules is trained to sequentially perform noise suppression, separation, and de-reverberation. For comparison, a larger end-to-end model is trained to output two anechoic speech signals directly from noisy reverberant speech mixtures. We propose an efficient single-decoder architecture with “subtractive” separation for real-time recursive speech separation of two or more speakers. Evaluation on real monophonic recordings of speech mixtures, using speech separation and perceptual measures such as SI-SDR, a newly proposed channel separation metric, and DNS-MOS, shows that these compact causal models can separate speech mixtures with low latency and perform on par with large offline state-of-the-art models like SepFormer.
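For readers unfamiliar with SI-SDR, the following is a minimal sketch of the commonly used definition (scale-invariant signal-to-distortion ratio, in dB). It is not taken from the paper's evaluation code; the function name `si_sdr`, the mean removal, and the small `eps` stabilizer are assumptions made for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference.

    Illustrative sketch only; `eps` is an assumed numerical stabilizer.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so that a DC offset does not influence the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything in the estimate that is not explained by the target.
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)
```

Because the reference is rescaled to best match the estimate before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.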

Sound examples

The following audio examples are drawn from several public datasets. Note that the baseline (SepFormer) is an offline method and 10x more complex, while our proposed methods run in real time with only 20 ms of lookahead. We show several model variations:

  • end-to-end: a single-stage model that performs all tasks jointly.
  • cascade: a chain of separate models aiming to perform noise reduction -> dereverberation -> speech separation.
  • cascade-subtractive: the same as cascade, but separating sources one by one, iteratively. The first separated source is subtracted from the input, and the system processes the remaining signal again to extract the next source, and so on.
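The subtractive scheme in the last variation can be sketched as a simple loop. This is a minimal offline illustration under stated assumptions, not the authors' streaming implementation: `extract_one` is a hypothetical stand-in for the trained separation network, and `toy_extract_dominant_tone` is a toy extractor (it keeps only the strongest FFT bin) used here purely so the loop can be run.

```python
import numpy as np

def subtractive_separation(mixture, extract_one, num_sources):
    """Extract sources one at a time, subtracting each from the running input.

    `extract_one` is a hypothetical callable mapping a signal to one estimated
    source; in the described system this role is played by a neural network.
    """
    residual = np.asarray(mixture, dtype=np.float64).copy()
    sources = []
    for _ in range(num_sources):
        source = extract_one(residual)   # estimate one source from what is left
        sources.append(source)
        residual = residual - source     # remove it before the next pass
    return sources, residual

def toy_extract_dominant_tone(signal):
    """Toy stand-in for a network: keep only the strongest frequency bin."""
    spectrum = np.fft.rfft(signal)
    strongest = np.argmax(np.abs(spectrum))
    kept = np.zeros_like(spectrum)
    kept[strongest] = spectrum[strongest]
    return np.fft.irfft(kept, n=len(signal))

# Usage: a mixture of two tones with different levels.
n = 1600
t = np.arange(n) / n
tone_a = 1.0 * np.sin(2 * np.pi * 5 * t)
tone_b = 0.5 * np.sin(2 * np.pi * 12 * t)
sources, residual = subtractive_separation(tone_a + tone_b, toy_extract_dominant_tone, 2)
```

One appeal of this design is that the same extractor is reused for every pass, so the number of speakers only changes the number of iterations rather than the number of output heads.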

DNS-challenge 4: real single microphone recording

REAL-M dataset: real single microphone recording

Dereverberation target example

The following examples demonstrate the effect of choosing the anechoic signal as the dereverberation target versus a target that also includes the early reflections.
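To make the two target choices concrete, here is a minimal sketch of one common way such targets can be built from a dry speech signal and a room impulse response (RIR). This page does not state how the targets were actually generated, so everything here is an assumption: the function names, locating the direct path at the largest RIR peak, and the 50 ms early-reflection window are all hypothetical.

```python
import numpy as np

def truncate_rir(rir, sample_rate, keep_ms):
    """Keep the direct path plus `keep_ms` milliseconds of the RIR after it.

    Assumption: the direct path is the sample with the largest magnitude.
    """
    rir = np.asarray(rir, dtype=np.float64)
    direct = int(np.argmax(np.abs(rir)))
    end = direct + int(round(keep_ms * 1e-3 * sample_rate)) + 1
    truncated = np.zeros_like(rir)
    truncated[direct:end] = rir[direct:end]  # everything else is zeroed
    return truncated

def make_targets(dry_speech, rir, sample_rate, early_ms=50.0):
    """Return (anechoic_target, early_reflection_target), trimmed to input length.

    `early_ms=50.0` is an assumed window, not a value taken from the paper.
    """
    dry_speech = np.asarray(dry_speech, dtype=np.float64)
    n = len(dry_speech)
    # Direct path only: a delayed, scaled copy of the dry speech.
    anechoic = np.convolve(dry_speech, truncate_rir(rir, sample_rate, 0.0))[:n]
    # Direct path plus early reflections; the late reverberation tail is dropped.
    early = np.convolve(dry_speech, truncate_rir(rir, sample_rate, early_ms))[:n]
    return anechoic, early
```

With this construction, both targets stay time-aligned with the reverberant input, and the only difference between them is whether energy arriving shortly after the direct sound is kept.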

Another interesting observation is the convergence behavior of the recurrent network: for the first few seconds, hardly any dereverberation is audible for either target, and then the dereverberation kicks in surprisingly suddenly.