Institutional Digital Repository
Shreenivas Deshpande Library, IIT (BHU), Varanasi

Word Level Language Identification from Social Media Code-Mixed Data Leveraging Transformer-Based Models

dc.contributor.author: Chanda S.
dc.contributor.author: Pal S.
dc.date.accessioned: 2026-06-24T07:26:42Z
dc.date.issued: 2025
dc.description: This paper was published with the affiliation IIT (BHU), Varanasi in open access mode.
dc.description.Volume: 6
dc.description.abstract: Code-mixing is the mixing of two or more languages in a statement or a conversation. Multilingual communities all over the world do this on a regular basis, especially when communicating on social media. People mix their mother tongue with other national and international languages, such as English. While code-mixing is a serious problem in verbal communication, it is no easier to handle in written communication. In informal written communication, people use multiple languages without changing the script, i.e. words from two or more languages occur next to each other in a single script. For an intelligent system, automatic language identification in such scenarios is an essential task. Language identification is treated here as a token classification problem: every word in a sentence receives a language tag in a supervised setup. For this task, we leverage pre-trained Bidirectional Encoder Representations from Transformers (BERT) to obtain contextual representations of sentences. We evaluate several combinations of deep learning models and input representations. Character embeddings, sub-word embeddings, and their combination are first considered for CNN- and LSTM-based models; a BERT model with an LSTM is then used. Using three datasets, ICON_POS, ICON_SAIL, and LinCE, we conduct the language identification (LID) task on Bengali-English (BN-EN), Hindi-English (HI-EN), and Spanish-English (ES-EN) code-mixed sentences. Our proposed method, a Bi-LSTM model on top of BERT neural representations of code-mixed data, outperforms the existing state-of-the-art techniques in terms of scores. For the two Bengali-English datasets, ICON_POS and ICON_SAIL, we observe performance gains of 8.12% and 4.23%, respectively. For the three Hindi-English datasets, ICON_POS, ICON_SAIL, and LinCE, we demonstrate performance gains of 6.41%, 0.68%, and 7.83%, respectively. We also show an improvement of 6.08% in language identification for the Spanish-English language pair in the LinCE dataset. © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2025.
dc.description.issue: 7
dc.identifier.doi: https://doi.org/10.1007/s42979-025-04377-4
dc.identifier.issn: 2662995X
dc.identifier.uri: https://idr-sdlib.iitbhu.ac.in/handle/123456789/24302
dc.language.iso: en
dc.publisher: Springer
dc.relation.ispartofseries: SN Computer Science
dc.subject: Computer Science and Engineering
dc.title: Word Level Language Identification from Social Media Code-Mixed Data Leveraging Transformer-Based Models
dc.type: Article

Files

Original bundle

Name: Word-Level-Language-Identification-from-Social-Media-CodeMixed-Data-Leveraging-TransformerBased-Models_2025_Springer.pdf
Size: 4.47 MB
Format: Adobe Portable Document Format