Advancing Language Identification in Code-Mixed Tulu Texts: Harnessing Deep Learning Techniques

Chanda S.; Mishra A.; Pal S.

Abstract

This study addresses word-level language identification in code-mixed Tulu-English text, a task that matters for handling the linguistic diversity observed on social media platforms. The CoLI-Tunglish shared task gave multiple teams a common setting in which to tackle this problem and to improve the handling of code-mixed language data. Our system uses Multilingual BERT (mBERT) to produce word embeddings and a Bi-LSTM model for sequence representation. It achieved a Precision of 0.74, indicating that most of the language labels it predicted were correct. Its Recall of 0.571, however, shows that the system does not yet recover all language labels, particularly in multilingual contexts, and leaves room for improvement. The resulting F1 score, which balances Precision and Recall, was 0.602, indicating reasonable overall performance. Our work contributes to advancing language understanding in multilingual digital communication. © 2023 Copyright for this paper by its authors.
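The three scores above can be illustrated with a small worked example. The abstract does not say how the scores were averaged, so the sketch below assumes macro averaging over language labels. The label names (`tulu`, `english`, `mixed`), the toy data, and the function `macro_scores` are hypothetical and are not taken from the shared task. Under that assumption, the F1 score is the mean of the per-label F1 values and need not equal the harmonic mean of the averaged Precision and Recall. This is one possible reading of why the reported 0.602 lies below the harmonic mean of 0.74 and 0.571 (about 0.645).

```python
from collections import Counter


def macro_scores(gold, pred):
    """Return macro-averaged (precision, recall, f1) over the labels in gold.

    Assumption: scores are computed per label and then averaged with
    equal weight; the source does not confirm this averaging scheme.
    """
    labels = sorted(set(gold))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong for this word
            fn[g] += 1  # gold label g was missed for this word

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        prec = tp[label] / p_den if p_den else 0.0
        rec = tp[label] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical word-level tags for ten words (not real shared-task data).
gold = ["tulu"] * 4 + ["english"] * 3 + ["mixed"] * 3
pred = ["tulu"] * 4 + ["english", "english", "tulu"] + ["tulu", "tulu", "mixed"]

precision, recall, f1 = macro_scores(gold, pred)
harmonic = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"macro_f1={f1:.3f} harmonic_mean={harmonic:.3f}")
```

In this toy case the tagger over-predicts the majority label, so Precision stays high while Recall on the rarer labels drops, and the macro F1 comes out lower than the harmonic mean of the two averaged scores.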