Institutional Digital Repository
Shreenivas Deshpande Library, IIT (BHU), Varanasi

Effect of Stopwords and Stemming Techniques in Urdu IR

dc.contributor.author: Sahu S.S.; Dutta D.; Pal S.; Rasheed I.
dc.date.accessioned: 2025-05-23T11:17:47Z
dc.description.abstract: This paper explores and evaluates the effect of different stopword removal and stemming techniques in Urdu IR. The issues are examined from four viewpoints. Is there any performance difference between non-corpus-based and corpus-based stopword removal in Urdu IR? Can corpus-based stopword lists improve performance in Urdu IR? Among the different corpus-based stopword lists, which gives the best performance in the IR domain? Do language-independent stemmers improve the performance of Urdu IR? Which is the best stemmer for Urdu IR: clustering-based (YASS), fast corpus-based (FCB), co-occurrence-based (SNS), graph-based (GRAS), or Trunc-n-based indexing? We observed that a shorter corpus-based stopword list outperforms a longer non-corpus-based stopword list in the IR domain. Among the different corpus-based stopword lists, the Zipf's law-based stopword list (nidf approach) provides the best performance and improves the mean average precision (MAP) score by 4.9% compared to baseline approaches. During stemming evaluation, we observed that language-independent stemming techniques improve retrieval performance in Urdu IR. Among the different stemming techniques, the FCB V-1-based stemmer performs best and improves the MAP score by 1.41% compared to the no-stemming approach. The Trunc-n-based indexing strategy provides performance comparable to the language-independent stemming approaches. In both the stopword removal and stemming strategies, the BB2 retrieval model outperforms the other models in the IR domain. © 2023, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
dc.identifier.doi: https://doi.org/10.1007/s42979-023-01953-4
dc.identifier.uri: http://172.23.0.11:4000/handle/123456789/7756
dc.relation.ispartofseries: SN Computer Science
dc.title: Effect of Stopwords and Stemming Techniques in Urdu IR
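
The abstract refers to Trunc-n indexing and to mean average precision (MAP) without defining them. The following is a minimal sketch of both under their standard textbook definitions, not the authors' implementation. The function names, the default n = 5, and the document IDs are hypothetical, chosen only for illustration. Truncation here operates on Unicode code points, which may split a base letter from its combining marks in Urdu text.

```python
from typing import Iterable, Sequence, Set, Tuple


def trunc_n(token: str, n: int = 5) -> str:
    """Trunc-n: keep only the first n characters of a token.

    n = 5 is an illustrative default, not a value taken from the paper.
    """
    return token[:n]


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list.

    Sums precision at each rank where a relevant document appears,
    then divides by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """MAP: the mean of average precision over all queries."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical tokens and document IDs, for illustration only.
    print(trunc_n("retrieval", 4))
    print(mean_average_precision([
        (["d1", "d2"], {"d1"}),
        (["d2", "d1"], {"d1"}),
    ]))
```

With Trunc-n, every indexed term and query term is passed through `trunc_n` before matching, so word forms sharing the same first n characters are conflated without any language-specific rules.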
