Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique

  • Shruti Rijhwani
  • Royal Sequiera
  • Monojit Choudhury
  • Chandra Maddila

Proc. of ACL 2017

Published by ACL

Word-level language detection is necessary for analyzing code-switched text, where multiple languages may be mixed within a single sentence. Existing models are restricted to code-switching between two specific languages and fail in real-world scenarios, because text input rarely comes with a priori information about the languages used. We present a novel unsupervised word-level language detection technique for code-switched text that handles an arbitrarily large number of languages and does not require any manually annotated training data. Our experiments with tweets in seven languages show a 74% relative error reduction in word-level labeling with respect to competitive baselines. We then use this system to conduct a large-scale quantitative analysis of code-switching patterns on Twitter, both global and region-specific, over 58M tweets.
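
As a reading aid for the "74% relative error reduction" figure, the sketch below shows how a relative error reduction is conventionally computed from two error rates. The function name and the two error rates are hypothetical, chosen only to illustrate the arithmetic; they are not numbers reported in the paper, and the abstract does not state the exact formula used.

```python
def relative_error_reduction(baseline_error: float, system_error: float) -> float:
    """Fraction of the baseline's error that the new system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - system_error) / baseline_error


# Hypothetical word-level error rates, for illustration only
# (assumed values, not results from the paper).
baseline_error = 0.20   # baseline mislabels 20% of words
system_error = 0.052    # proposed system mislabels 5.2% of words

rer = relative_error_reduction(baseline_error, system_error)
print(f"Relative error reduction: {rer:.0%}")
```

Under these assumed rates the absolute improvement is 14.8 percentage points, while the relative reduction is measured against the baseline's own error, which is why the two figures differ.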