Word Level Language Identification from Social Media Code-Mixed Data Leveraging Transformer-Based Models

Chanda S.; Pal S.

doi:https://doi.org/10.1007/s42979-025-04377-4

dc.contributor.author	Chanda S.
dc.contributor.author	Pal S.
dc.date.accessioned	2026-06-24T07:26:42Z
dc.date.issued	2025
dc.description	This paper was published with affiliation IIT (BHU), Varanasi, in open access mode.
dc.description.Volume	6
dc.description.abstract	Code-mixing is the mixing of two or more languages in a statement or a conversation. Multilingual communities all over the world do this regularly, especially when communicating on social media. People mix their mother tongue with other national and international languages, such as English. Code-mixing is a serious problem in verbal communication, and it is no easier to handle in written communication. In informal written communication, people use multiple languages without changing the script, i.e. words from two or more languages occur next to each other in a single script. For an intelligent system, automatic language identification in such scenarios is an essential task. Language identification is treated here as a token classification problem: every word in a sentence receives a language tag in a supervised setup. For this task, we leverage pre-trained Bidirectional Encoder Representations from Transformers (BERT) to obtain contextual representations of sentences. We evaluate several combinations of deep learning models and input representations. Character embeddings, sub-word embeddings, and their combination are first considered as inputs to CNN- and LSTM-based models; a BERT model with an LSTM layer is then used. Using three datasets, ICON_POS, ICON_SAIL, and LinCE, we conduct the language identification (LID) task on Bengali-English (BN-EN), Hindi-English (HI-EN), and Spanish-English (ES-EN) code-mixed sentences. Our proposed method, a Bi-LSTM model on top of BERT neural representations of code-mixed data, outperforms the existing state-of-the-art techniques. For the Bengali-English language pair, we observe performance gains of 8.12% and 4.23% on the ICON_POS and ICON_SAIL datasets, respectively. For the Hindi-English language pair, we demonstrate performance gains of 6.41%, 0.68%, and 7.83% on the ICON_POS, ICON_SAIL, and LinCE datasets, respectively. We also show an improvement of 6.08% in language identification for the Spanish-English language pair on the LinCE dataset. © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2025.
dc.description.issue	7
dc.identifier.doi	https://doi.org/10.1007/s42979-025-04377-4
dc.identifier.issn	2662995X
dc.identifier.uri	https://idr-sdlib.iitbhu.ac.in/handle/123456789/24302
dc.language.iso	en
dc.publisher	Springer
dc.relation.ispartofseries	SN Computer Science
dc.subject	Computer Science and Engineering
dc.title	Word Level Language Identification from Social Media Code-Mixed Data Leveraging Transformer-Based Models
dc.type	Article

Files

Original bundle

Name: Word-Level-Language-Identification-from-Social-Media-CodeMixed-Data-Leveraging-TransformerBased-Models_2025_Springer.pdf
Size: 4.47 MB
Format: Adobe Portable Document Format

Collections

Department of Computer Science and Engineering