POS Tagging of English-Hindi Code-Mixed Social Media Content

Yogarshi Vyas; Spandana Gella; Jatin Sharma; Kalika Bali; Monojit Choudhury

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) | October 2014

Published by Association for Computational Linguistics

Code-mixing is frequently observed in user-generated content on social media, especially from multilingual users. The linguistic complexity of such content is compounded by the presence of spelling variations, transliteration, and non-adherence to formal grammar. We describe our initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text collated from Facebook forums, and explore language identification, back-transliteration, normalization, and POS tagging of this data. Our results show that language identification and transliteration for Hindi are two major challenges that impact POS tagging accuracy.
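The four steps named in the abstract (language identification, back-transliteration, normalization, POS tagging) can be pictured as a per-token pipeline. The sketch below is only an illustration under strong assumptions and is not the authors' system: every lookup table, the function names `identify_language` and `tag_sentence`, and the tag labels are hypothetical stand-ins for what would be trained models in practice.

```python
# Toy resources, invented for illustration only; none come from the paper.
EN_VOCAB = {"movie", "was", "good", "gud"}        # English forms, incl. a noisy spelling
HI_ROMAN = {"bahut", "bahot", "thi", "yaar"}      # Romanized Hindi forms

# Normalization of noisy English spellings (hypothetical table).
EN_NORMALIZE = {"gud": "good"}

# Back-transliteration: Romanized Hindi -> Devanagari.
# A lookup table stands in for a real transliteration model.
HI_BACK_TRANSLIT = {"bahut": "बहुत", "bahot": "बहुत", "thi": "थी", "yaar": "यार"}

# Per-language "taggers", again just dictionaries for the sketch.
EN_TAGS = {"movie": "NOUN", "was": "VERB", "good": "ADJ"}
HI_TAGS = {"बहुत": "ADV", "थी": "VERB", "यार": "NOUN"}


def identify_language(token):
    """Word-level language identification by dictionary lookup."""
    if token in EN_VOCAB:
        return "en"
    if token in HI_ROMAN:
        return "hi"
    return "other"


def tag_sentence(tokens):
    """Return (raw token, language, canonical form, POS tag) per token."""
    result = []
    for raw in tokens:
        token = raw.lower()
        lang = identify_language(token)
        if lang == "en":
            canonical = EN_NORMALIZE.get(token, token)   # normalize spelling
            tag = EN_TAGS.get(canonical, "X")
        elif lang == "hi":
            canonical = HI_BACK_TRANSLIT[token]          # back-transliterate
            tag = HI_TAGS.get(canonical, "X")
        else:
            canonical, tag = token, "X"                  # unknown: residual tag
        result.append((raw, lang, canonical, tag))
    return result


if __name__ == "__main__":
    for row in tag_sentence(["Movie", "bahot", "gud", "thi", "yaar"]):
        print(row)
```

Because each token's tag depends on the language label and canonical form computed before it, a mistake in either early step carries through to the tagger. This is one way to read the abstract's observation that language identification and Hindi transliteration affect POS tagging accuracy.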