Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance
| dc.contributor.author | Luo I. | |
| dc.contributor.author | Graber-Naidich A. | |
| dc.contributor.author | Zhang M. | |
| dc.contributor.author | Kaushik R. | |
| dc.contributor.author | Nieda G.M. | |
| dc.contributor.author | Chen T. | |
| dc.date.accessioned | 2026-06-24T07:26:42Z | |
| dc.date.issued | 2025 | |
| dc.description | This paper was published with affiliation IIT (BHU), Varanasi in open access mode. | |
| dc.description.Volume | 8 | |
| dc.description.abstract | Accurate smoking documentation in electronic health records (EHRs) is crucial for risk assessment and patient monitoring. However, key information is often missing or inaccurately recorded. Large language models (LLMs) present a promising solution for interpreting clinical narratives to extract comprehensive smoking data. We developed a framework utilizing LLMs combined with rule-based longitudinal smoothing techniques to enhance data quality. We compared generative LLMs (Gemini-1.5-Flash, PaLM-2-Text-Bison, GPT-4) against BERT-based models using 1683 manually annotated clinical notes from 518 patients across Stanford and Sutter Health systems. Generative LLMs achieved superior performance (> 96% accuracy) across seven smoking variables, with external validation showing robust generalizability (97.5–98.8% accuracy). We deployed Gemini-1.5-Flash to 79,408 notes from 4792 lung cancer patients, demonstrating that risk model-based surveillance incorporating smoking factors outperformed NCCN Guidelines in identifying second malignancies. Our study highlights the potential of generative LLMs to improve smoking history documentation quality, enhancing lung cancer surveillance and broader clinical applications. © The Author(s) 2025. | |
| dc.description.issue | 1 | |
| dc.identifier.doi | https://doi.org/10.1038/s41746-025-02009-y | |
| dc.identifier.issn | 23986352 | |
| dc.identifier.uri | https://idr-sdlib.iitbhu.ac.in/handle/123456789/24313 | |
| dc.language.iso | en | |
| dc.publisher | Nature Research | |
| dc.relation.ispartofseries | npj Digital Medicine | |
| dc.subject | Computer Science and Engineering | |
| dc.title | Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance | |
| dc.type | Article |
Files
Original bundle
- Name: Leveraging-large-language-models-to-extract-smoking-history-from-clinical-notes-for-lung-cancer-surveillance_2025_Nature-Research.pdf
- Size: 1.77 MB
- Format: Adobe Portable Document Format