Continuous speech separation: dataset and analysis

  • Zhuo Chen
  • Takuya Yoshioka
  • Liang Lu
  • Tianyan Zhou
  • Zhong Meng
  • Yi Luo
  • Jian Wu
  • Xiong Xiao

ICASSP

This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior speech separation studies use pre-segmented audio signals, which are typically generated by mixing speech utterances on computers so that they fully overlap. In addition, the separation algorithms have often been evaluated with signal-based metrics such as the signal-to-distortion ratio. However, in natural conversations, speech signals are continuous and contain both overlapped and overlap-free regions, and the signal-based metrics correlate only weakly with automatic speech recognition (ASR) accuracy. Not only does this make it hard to assess the practical relevance of the tested algorithms, but it also hinders researchers from developing systems that can be readily applied to real scenarios. In this paper, we define continuous speech separation (CSS) as the task of generating a set of non-overlapped speech signals from a continuous audio stream that contains multiple utterances that partially overlap to varying degrees. A new real recording dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate conversations and capturing the audio replays with far-field microphones. A Kaldi-based ASR evaluation protocol is established by using a well-trained multi-conditional acoustic model. A recently proposed speaker-independent CSS algorithm is investigated using LibriCSS. The dataset and evaluation scripts are made available to facilitate research in this direction.
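
To make the session construction concrete, below is a minimal Python/NumPy sketch of how pre-segmented utterances could be stitched into a continuous stream with both overlapped and overlap-free regions. The function name `simulate_session`, the overlap model (a per-utterance overlap probability rather than a target overlapped-time fraction), and all parameter values are illustrative assumptions; they do not reproduce the actual LibriCSS generation pipeline, which additionally replays the audio in a real room and records it with far-field microphones.

```python
# Hypothetical sketch: build a continuous, partially overlapped session
# from pre-segmented utterances. Not the actual LibriCSS generation script.
import numpy as np

def simulate_session(utterances, overlap_ratio, silence_range=(0.0, 1.0),
                     sample_rate=16000, rng=None):
    """Concatenate utterances into one continuous waveform.

    With probability `overlap_ratio`, an utterance starts before the previous
    one ends (overlapping by a random fraction of its length); otherwise it is
    separated from the previous one by a random silence drawn from
    `silence_range` (in seconds).
    """
    rng = rng or np.random.default_rng()
    # Upper bound on the session length: all utterances plus maximal gaps.
    total_len = int(sum(len(u) for u in utterances) +
                    sample_rate * silence_range[1] * len(utterances))
    session = np.zeros(total_len, dtype=np.float32)

    cursor = 0  # end position of the previously placed utterance
    for utt in utterances:
        if cursor > 0 and rng.random() < overlap_ratio:
            # Start inside the previous utterance: partially overlapped region.
            overlap = int(rng.uniform(0.1, 0.5) * len(utt))
            start = max(0, cursor - overlap)
        else:
            # Overlap-free region: insert a short silence gap.
            gap = int(rng.uniform(*silence_range) * sample_rate)
            start = cursor + gap
        session[start:start + len(utt)] += utt
        cursor = start + len(utt)

    return session[:cursor]
```

For example, `simulate_session(utts, overlap_ratio=0.3)` would yield a stream in which roughly 30% of utterance transitions overlap; the released LibriCSS sessions likewise span a range of overlap conditions, from overlap-free to heavily overlapped, so that CSS systems can be evaluated across that spectrum.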