Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance

Luo I.; Graber-Naidich A.; Zhang M.; Kaushik R.; Nieda G.M.; Chen T.

DOI: https://doi.org/10.1038/s41746-025-02009-y

dc.contributor.author	Luo I.
dc.contributor.author	Graber-Naidich A.
dc.contributor.author	Zhang M.
dc.contributor.author	Kaushik R.
dc.contributor.author	Nieda G.M.
dc.contributor.author	Chen T.
dc.date.accessioned	2026-06-24T07:26:42Z
dc.date.issued	2025
dc.description	This paper was published with affiliation IIT (BHU), Varanasi, in open access mode.
dc.description.Volume	8
dc.description.abstract	Accurate smoking documentation in electronic health records (EHRs) is crucial for risk assessment and patient monitoring. However, key information is often missing or inaccurately recorded. Large language models (LLMs) present a promising solution for interpreting clinical narratives to extract comprehensive smoking data. We developed a framework utilizing LLMs combined with rule-based longitudinal smoothing techniques to enhance data quality. We compared generative LLMs (Gemini-1.5-Flash, PaLM-2-Text-Bison, GPT-4) against BERT-based models using 1683 manually annotated clinical notes from 518 patients across Stanford and Sutter Health systems. Generative LLMs achieved superior performance (> 96% accuracy) across seven smoking variables, with external validation showing robust generalizability (97.5–98.8% accuracy). We deployed Gemini-1.5-Flash to 79,408 notes from 4792 lung cancer patients, demonstrating that risk model-based surveillance incorporating smoking factors outperformed NCCN Guidelines in identifying second malignancies. Our study highlights the potential of generative LLMs to improve smoking history documentation quality, enhancing lung cancer surveillance and broader clinical applications. © The Author(s) 2025.
dc.description.issue	1
dc.identifier.doi	https://doi.org/10.1038/s41746-025-02009-y
dc.identifier.issn	23986352
dc.identifier.uri	https://idr-sdlib.iitbhu.ac.in/handle/123456789/24313
dc.language.iso	en
dc.publisher	Nature Research
dc.relation.ispartofseries	npj Digital Medicine
dc.subject	Computer Science and Engineering
dc.title	Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance
dc.type	Article

Files

Original bundle

Name: Leveraging-large-language-models-to-extract-smoking-history-from-clinical-notes-for-lung-cancer-surveillance_2025_Nature-Research.pdf
Size: 1.77 MB
Format: Adobe Portable Document Format

Collections

Department of Computer Science and Engineering