A language identification method applied to Twitter data

Singh A.K.; Goyal P.

Abstract

This paper presents the results of experiments using a simple algorithm, aided by a few heuristics, for language identification on Twitter data. The experiments were conducted as part of a shared task focused on this problem. The core algorithm is an n-gram based distance metric algorithm that has previously been shown to work very well on normal text. The distance metric used is symmetric cross entropy.
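
To make the core idea concrete, the following is a minimal sketch of n-gram based language identification with symmetric cross entropy as the distance. It is an illustration under stated assumptions, not the authors' implementation: the use of character n-grams up to length 3, the `floor` probability assigned to unseen n-grams, and the function names `ngram_profile`, `symmetric_cross_entropy`, and `identify` are all hypothetical choices, and the Twitter-specific heuristics mentioned in the abstract are not modeled.

```python
import math
from collections import Counter


def ngram_profile(text, n_max=3):
    """Relative frequencies of character n-grams of length 1..n_max.

    n_max=3 is an assumed setting, not taken from the paper.
    """
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def symmetric_cross_entropy(p, q, floor=1e-6):
    """H(p, q) + H(q, p) over the union of n-grams in both profiles.

    Unseen n-grams get the assumed probability `floor` so that the
    logarithm is defined; the profiles are not renormalized afterwards.
    """
    grams = set(p) | set(q)
    h_pq = -sum(p.get(g, 0.0) * math.log(q.get(g, floor)) for g in grams)
    h_qp = -sum(q.get(g, 0.0) * math.log(p.get(g, floor)) for g in grams)
    return h_pq + h_qp


def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    test = ngram_profile(text)
    return min(profiles, key=lambda lang: symmetric_cross_entropy(test, profiles[lang]))
```

In use, one profile would be built per language from training text, for example `profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}` with a hypothetical `training` dictionary, and a tweet would be labeled with `identify(tweet, profiles)`. The distance is symmetric by construction, so it does not matter which profile is treated as the reference.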