A Study on Corpus-based Stopword Lists in Indian Language IR

Sahu S.S.; Pal S.

doi:https://doi.org/10.1145/3606262

dc.contributor.author	Sahu S.S.; Pal S.
dc.date.accessioned	2025-05-23T11:17:10Z
dc.description.abstract	We explore and evaluate the effect of different stopword lists (non-corpus-based and corpus-based) on information retrieval (IR) tasks in Bengali, Marathi, Gujarati, Hindi, and English. The issue was investigated from three viewpoints. Is there any performance difference between non-corpus-based and corpus-based stopword removal in the chosen Indian languages? Can corpus-based stopword lists improve performance in Indian-language IR? If yes, to what extent? Among the different corpus-based stopword lists, which one provides the best IR performance? Does the length of a corpus-based stopword list affect retrieval performance in Indian languages? If yes, to what extent? It was observed that a corpus-based stopword list provides better retrieval performance than a non-corpus-based stopword list in the Indian languages studied. Among the corpus-based stopword lists generated and tested, the Zipf's law-based (idf-based) stopword list provides the best retrieval performance in the various Indian languages. The aggregation1-based stopword list provides better retrieval than the aggregation2-based list in the Indian languages, but in English the aggregation2-based list performs better than the aggregation1-based list. The best-performing idf-based stopword list improves the MAP score over the corresponding baseline by 5.43% in Bengali, 1.91% in Marathi, 5.4% in Gujarati, 1.5% in Hindi, and 2.12% in English. The probabilistic retrieval models (BM25 and TF-IDF) perform best in the different Indian languages. Shorter corpus-based stopword lists perform better than longer non-corpus-based stopword lists for all the Indian languages considered. The proposed schemes demonstrate that a stopword list can be generated heuristically with a language-independent statistical method and used effectively for IR tasks, with performance comparable to, or even better than, that of non-corpus-based stopword lists. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
dc.identifier.doi	https://doi.org/10.1145/3606262
dc.identifier.uri	http://172.23.0.11:4000/handle/123456789/7106
dc.relation.ispartofseries	ACM Transactions on Asian and Low-Resource Language Information Processing
dc.title	A Study on Corpus-based Stopword Lists in Indian Language IR
