Institutional Digital Repository
Shreenivas Deshpande Library, IIT (BHU), Varanasi

Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance

dc.contributor.author: Luo I.
dc.contributor.author: Graber-Naidich A.
dc.contributor.author: Zhang M.
dc.contributor.author: Kaushik R.
dc.contributor.author: Nieda G.M.
dc.contributor.author: Chen T.
dc.date.accessioned: 2026-06-24T07:26:42Z
dc.date.issued: 2025
dc.description: This paper was published with affiliation IIT (BHU), Varanasi, in open access mode.
dc.description.Volume: 8
dc.description.abstract: Accurate smoking documentation in electronic health records (EHRs) is crucial for risk assessment and patient monitoring. However, key information is often missing or inaccurately recorded. Large language models (LLMs) present a promising solution for interpreting clinical narratives to extract comprehensive smoking data. We developed a framework utilizing LLMs combined with rule-based longitudinal smoothing techniques to enhance data quality. We compared generative LLMs (Gemini-1.5-Flash, PaLM-2-Text-Bison, GPT-4) against BERT-based models using 1683 manually annotated clinical notes from 518 patients across Stanford and Sutter Health systems. Generative LLMs achieved superior performance (> 96% accuracy) across seven smoking variables, with external validation showing robust generalizability (97.5–98.8% accuracy). We deployed Gemini-1.5-Flash to 79,408 notes from 4792 lung cancer patients, demonstrating that risk model-based surveillance incorporating smoking factors outperformed NCCN Guidelines in identifying second malignancies. Our study highlights the potential of generative LLMs to improve smoking history documentation quality, enhancing lung cancer surveillance and broader clinical applications. © The Author(s) 2025.
dc.description.issue: 1
dc.identifier.doi: https://doi.org/10.1038/s41746-025-02009-y
dc.identifier.issn: 23986352
dc.identifier.uri: https://idr-sdlib.iitbhu.ac.in/handle/123456789/24313
dc.language.iso: en
dc.publisher: Nature Research
dc.relation.ispartofseries: npj Digital Medicine
dc.subject: Computer Science and Engineering
dc.title: Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance
dc.type: Article

Files

Original bundle

Name: Leveraging-large-language-models-to-extract-smoking-history-from-clinical-notes-for-lung-cancer-surveillance_2025_Nature-Research.pdf
Size: 1.77 MB
Format: Adobe Portable Document Format