Improved Source-Channel Models for Chinese Word Segmentation

Published by Association for Computational Linguistics

This paper presents a Chinese word segmentation system that uses improved source-channel models of Chinese sentence generation. Chinese words are defined as one of the following four types: lexicon words, morphologically derived words, factoids, and named entities. Our system provides a unified approach to the four fundamental features of word-level Chinese language processing: (1) word segmentation, (2) morphological analysis, (3) factoid detection, and (4) named entity recognition. The performance of the system is evaluated on a manually annotated test set, and is also compared with several state-of-the-art systems, taking into account the fact that the definition of Chinese words often varies from system to system.