Institutional Digital Repository
Shreenivas Deshpande Library, IIT (BHU), Varanasi

Evaluating user story quality with LLMs: a comparative study

Abstract

Evaluating the quality of user stories is crucial for the success of agile software development. This paper investigates the efficacy of Large Language Models (LLMs) in assessing the quality of individual user stories using the Quality User Story (QUS) framework, which categorizes quality criteria into syntactic, semantic, and pragmatic dimensions. Leveraging three state-of-the-art LLMs—GPT-4o, GPT-4-Turbo, and GPT-3.5-Turbo—this study employs two prompting strategies, context-minimal and context-rich, to gauge performance across eight user story quality criteria. To ensure robust validation, we generated 960 user stories using alternative LLMs (Gemini and Meta AI's LLaMA3), which were then assessed for quality by 69 postgraduate students. The quality assessments were further verified by a team comprising a research scholar and a senior postgraduate student. The evaluation of these 960 user stories by the three LLMs under study reveals significant insights into their relative strengths and weaknesses. The results demonstrate that GPT-4o and GPT-4-Turbo exhibit superior performance in evaluating user stories, particularly excelling in syntactic and pragmatic criteria with minimal impact from additional contextual details. Conversely, GPT-3.5-Turbo shows noticeable limitations, struggling to maintain effectiveness, particularly when handling richer contextual inputs. This research marks a pivotal step towards automated quality assessment in requirements engineering, highlighting both the potential and areas for improvement in leveraging LLMs for robust user story evaluation. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
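The abstract's context-minimal versus context-rich prompting distinction can be pictured as two variants of the same per-criterion evaluation prompt, where the rich variant prepends role and framework background before the user story. The sketch below is a hypothetical illustration: the criterion names are drawn from the QUS framework, but the exact prompt wording, criterion subset, and function names used in the study are assumptions.

```python
# Hypothetical sketch of the two prompting strategies described in the
# abstract. The eight criteria listed here are plausible QUS criteria;
# the study's actual subset and prompt text are not given in the abstract.
QUS_CRITERIA = ["well-formed", "atomic", "minimal", "full-sentence",
                "estimatable", "unambiguous", "conceptually-sound",
                "problem-oriented"]

def build_prompt(story: str, criterion: str, context_rich: bool = False) -> str:
    """Build an evaluation prompt for one user story and one QUS criterion."""
    base = (f"Assess whether the following user story satisfies the QUS "
            f"criterion '{criterion}'. Answer Yes or No with a brief reason.\n"
            f"User story: {story}")
    if context_rich:
        # Context-rich variant: prepend an expert role and QUS background.
        preamble = ("You are a requirements-engineering expert. The Quality "
                    "User Story (QUS) framework groups quality criteria into "
                    "syntactic, semantic, and pragmatic dimensions.\n")
        return preamble + base
    return base

story = ("As a librarian, I want to search the catalogue "
         "so that I can locate books quickly.")
minimal_prompt = build_prompt(story, "atomic")
rich_prompt = build_prompt(story, "atomic", context_rich=True)
```

Each of the 960 user stories would be scored once per criterion under each strategy, so the design choice here is to keep the base prompt identical across both variants and vary only the prepended context, isolating the effect of contextual detail that the abstract reports.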
