Institutional Digital Repository
Shreenivas Deshpande Library, IIT (BHU), Varanasi

Building a text retrieval system for the Sanskrit language: Exploring indexing, stemming, and searching issues



Abstract

Stemming is an important pre-processing step in text analysis domains such as text mining, text summarization and information retrieval (IR). In this study, we build a Sanskrit text collection and explore different indexing, stemming and searching strategies for Sanskrit. We also propose two stemmers, a ‘light’ one and an ‘aggressive’ one, and evaluate their effectiveness in the text analysis task. The performance of the stemmers is evaluated in two ways: a direct evaluation and an indirect, IR-based evaluation. In the direct evaluation, we found that the stemmers are effective. In the indirect evaluation, we apply different retrieval models, such as BM25, TF–IDF, Divergence from Randomness (DFR) based models and language models. The proposed stemmers are compared with the GRAS stemmer, language-independent indexing (trunc-n) and a no-stemming approach. Among the different stemming methods, aggressive stemmers provide the best performance. The Hiemstra language model outperforms the other retrieval models we experimented with. In statistical analysis, we found that the proposed stemming approaches produce significantly better results than the no-stemming approach. © 2023 Elsevier Ltd
