Accent Issues in Large Vocabulary Continuous Speech Recognition (LVCSR)

  • Chao Huang
  • Eric Chang
  • Tao Chen

MSR-TR-2001-69

Speech recognition has improved greatly in recent years. However, robustness remains a major problem: recognition performance fluctuates sharply from speaker to speaker, especially when the speaker has a strong accent that is not covered in the training corpus. In this report, we first present the results of our cross-accent experiments, which show a 30% increase in error rate when accent-independent models are used instead of accent-dependent ones. We then organize the report into three parts that address the problem. In the first of these parts, we investigate speaker variability, relate the well-known parameter representations to the physical characteristics of the speaker, especially accent, and confirm once more that accent is one of the main factors causing speaker variability. We then offer two kinds of solutions for accent variability. One is adaptation, including pronunciation dictionary adaptation and acoustic model adaptation, which together capture the dominant changes among accented speaker groups and the detailed style of a specific speaker within each group. The other is to build accent-specific models, as we do in the cross-accent experiments. The key to this second method is an automatic mechanism for choosing the accent-dependent model, which is explored in the final part of the report, where we propose a fast and efficient GMM-based accent identification method. The three parts are outlined as follows. Analysis and modeling of speaker variability, such as gender, accent, age, speaking rate, and phone realizations, are important issues in speech recognition. Existing feature representations that describe speaker variations are known to be high dimensional.
In the third part of this report, we introduce two powerful multivariate statistical analysis methods, principal component analysis (PCA) and independent component analysis (ICA), as tools to analyze this variability and extract a low-dimensional feature representation. Our findings are as follows: (1) the first two principal components correspond to gender and accent, respectively; (2) ICA-based features yield better classification performance than PCA-based ones. Using a 2-dimensional ICA representation, we achieve error rates of 6.1% and 13.3% in gender and accent classification, respectively, for 980 speakers. In the fourth part, we present a method of accent modeling through Pronunciation Dictionary Adaptation (PDA). We derive the pronunciation variation between canonical speaker groups and accent groups and add an encoding of the differences to a canonical dictionary, creating a new, adapted dictionary that reflects the accent characteristics. The pronunciation variation information is then integrated with the acoustic and language models in a one-pass search framework. We assume that acoustic deviation and pronunciation variation are independent but complementary phenomena that cause poor performance for accented speakers. Therefore, MLLR, an efficient model adaptation technique, is also presented both alone and in combination with PDA. With PDA, MLLR, and their combination, error rate reductions of 13.9%, 24.1%, and 28.4%, respectively, are achieved. Speaker variability caused by accent is well known to be an important factor in speech recognition, and some major accents in China differ so much that the problem becomes very severe. In part 5, we propose a Gaussian mixture model (GMM) based Mandarin accent identification method, in which a number of GMMs are trained to identify the most likely accent given test utterances.
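The idea of training one GMM per accent and picking the accent whose model best explains a test utterance can be sketched as follows. This is a minimal illustration, not the report's implementation: the diagonal covariances, the number of mixture components, the plain EM loop, the synthetic 2-dimensional "frames", and all names (`fit_diag_gmm`, `identify_accent`, `accent_A`, `accent_B`) are assumptions made for the example.

```python
import numpy as np


def _log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def _component_log_probs(X, weights, means, variances):
    """log(weight_k * N(x | mean_k, diag(var_k))) for every frame and component."""
    diff = X[:, None, :] - means[None, :, :]                   # (frames, k, dims)
    mahal = np.sum(diff ** 2 / variances, axis=2)              # (frames, k)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (k,)
    return np.log(weights) - 0.5 * (mahal + log_det)


def fit_diag_gmm(X, n_components=2, n_iter=30, seed=0):
    """Fit a diagonal-covariance GMM to the rows of X with plain EM (assumed setup)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialise means from random frames, variances from the global variance.
    means = X[rng.choice(n, size=n_components, replace=False)]
    variances = np.tile(X.var(axis=0) + 1e-6, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each frame.
        log_p = _component_log_probs(X, weights, means, variances)
        resp = np.exp(log_p - _log_sum_exp(log_p, axis=1)[:, None])
        # M-step: re-estimate weights, means, and (floored) variances.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        second_moment = (resp.T @ (X ** 2)) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, 1e-6)
    return weights, means, variances


def avg_log_likelihood(X, gmm):
    """Average per-frame log-likelihood of the frames X under one GMM."""
    return float(_log_sum_exp(_component_log_probs(X, *gmm), axis=1).mean())


def identify_accent(frames, models):
    """Return the accent label whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda name: avg_log_likelihood(frames, models[name]))


def demo(seed=0):
    """Train two hypothetical accent models on synthetic frames and classify one utterance."""
    rng = np.random.default_rng(seed)

    def frames(centers, n_per_center):
        return np.vstack([rng.normal(c, 0.5, size=(n_per_center, 2)) for c in centers])

    centers = {
        "accent_A": [(0.0, 0.0), (3.0, 3.0)],
        "accent_B": [(-4.0, 4.0), (6.0, -3.0)],
    }
    models = {name: fit_diag_gmm(frames(c, 200)) for name, c in centers.items()}
    test_utterance = frames(centers["accent_A"], 50)
    return identify_accent(test_utterance, models)


if __name__ == "__main__":
    print(demo())
```

In a real system the frames would be acoustic feature vectors extracted from speech rather than synthetic points, and the label returned by `identify_accent` would be used to choose which accent-dependent recognizer to run.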
The identified accent type can be used to select an accent-dependent model for speech recognition. A multi-accent Mandarin corpus covering four typical accents in China was developed for the task, with 1,440 speakers (1,200 for training, 240 for testing). We explore experimentally the effect of the num