Word Level Language Identification from Social Media Code-Mixed Data Leveraging Transformer-Based Models
| dc.contributor.author | Chanda S. | |
| dc.contributor.author | Pal S. | |
| dc.date.accessioned | 2026-06-24T07:26:42Z | |
| dc.date.issued | 2025 | |
| dc.description | This paper was published with the affiliation IIT (BHU), Varanasi, in open access mode. | |
| dc.description.Volume | 6 | |
| dc.description.abstract | Code-mixing is the mixing of two or more languages in a statement or a conversation. Multilingual communities all over the world use it regularly, especially when communicating on social media. People mix their mother tongue with other national and international languages, such as English. While code-mixing is a serious problem in verbal communication, it is not easy to handle in written communication either. In informal written communication, people use multiple languages without changing the script, i.e., words from two or more languages occur next to each other in a single script. For an intelligent system, automatic language identification in such scenarios is an essential task. Language identification is treated here as a token classification problem: every word in a sentence receives a linguistic tag in a supervised setup. For this task, we leverage pre-trained Bidirectional Encoder Representations from Transformers (BERT) to obtain contextual representations of sentences. We evaluate several combinations of deep learning models and input representations. Character embeddings, sub-word embeddings, and their combination are considered first for CNN- and LSTM-based models; a BERT with LSTM model is used afterwards. Using three datasets, ICON_POS, ICON_SAIL, and LinCE, we conduct the language identification (LID) task on Bengali-English (BN-EN), Hindi-English (HI-EN), and Spanish-English (ES-EN) code-mixed sentences. Our proposed method, a Bi-LSTM model on top of BERT neural representations of code-mixed data, outperforms the existing state-of-the-art techniques in terms of evaluation scores. For the two Bengali-English datasets, ICON_POS and ICON_SAIL, we observe performance gains of 8.12% and 4.23%, respectively. For the three Hindi-English datasets, ICON_POS, ICON_SAIL, and LinCE, we demonstrate performance gains of 6.41%, 0.68%, and 7.83%, respectively. We also show an improvement of 6.08% in language identification for the Spanish-English language pair in the LinCE dataset. © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2025. | |
| dc.description.issue | 7 | |
| dc.identifier.doi | https://doi.org/10.1007/s42979-025-04377-4 | |
| dc.identifier.issn | 2662995X | |
| dc.identifier.uri | https://idr-sdlib.iitbhu.ac.in/handle/123456789/24302 | |
| dc.language.iso | en | |
| dc.publisher | Springer | |
| dc.relation.ispartofseries | SN Computer Science | |
| dc.subject | Computer Science and Engineering | |
| dc.title | Word Level Language Identification from Social Media Code-Mixed Data Leveraging Transformer-Based Models | |
| dc.type | Article | |
Files
Original bundle
- Name: Word-Level-Language-Identification-from-Social-Media-CodeMixed-Data-Leveraging-TransformerBased-Models_2025_Springer.pdf
- Size: 4.47 MB
- Format: Adobe Portable Document Format