Sequence Modeling in Unsupervised Single-Channel Overlapped Speech Recognition

2018 IEEE International Conference on Acoustics, Speech, and Signal Processing

Published by IEEE

Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). The problem can be modularized into three sub-problems: frame-wise interpreting, sequence-level speaker tracing, and speech recognition. Nevertheless, previous acoustic models formulate the correlation between sequential labels only implicitly, which limits modeling power. In this work, we model the sequential label correlation explicitly during training, conditioning on both the feature sequence and the output of the previous frame. Moreover, we propose to integrate linguistic information into the assignment decision of permutation invariant training (PIT): a senone-level neural network language model (NNLM) trained on clean-speech alignments is incorporated, while the objective function remains cross-entropy. The proposed methods can be combined with an improved version of PIT and sequence discriminative training, yielding a further relative WER improvement of over 10% on the artificially overlapped Switchboard and hub5e-swb datasets.
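The assignment decision at the heart of PIT can be sketched as follows: the model emits one stream of senone posteriors per output channel, and the training loss is the minimum utterance-level cross-entropy over all permutations of output-to-speaker assignments. The sketch below is a minimal NumPy illustration of that baseline criterion (before the NNLM term is added); all function and variable names are illustrative, not from the paper.

```python
import itertools
import numpy as np

def cross_entropy(logits, labels):
    # Frame-wise softmax cross-entropy, summed over the utterance.
    # logits: [T, C] array of unnormalized scores; labels: [T] int array.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()

def pit_loss(outputs, references):
    """outputs: list of [T, C] logit arrays, one per model output channel.
    references: list of [T] senone-label arrays, one per speaker.
    Returns (loss, assignment): the minimum summed cross-entropy over all
    permutations, and the winning output-to-speaker assignment."""
    best = None
    for perm in itertools.permutations(range(len(references))):
        # Under this permutation, output channel i is scored
        # against the reference labels of speaker perm[i].
        loss = sum(cross_entropy(outputs[i], references[p])
                   for i, p in enumerate(perm))
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best
```

In the paper's proposal, the per-permutation score would additionally include a senone-level NNLM term so that the assignment decision reflects linguistic plausibility as well as acoustic fit; that term is omitted here for brevity.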