Screening for Emergency Department Missed Opportunities for Diagnosis Using Large Language Models

Tuesday, May 19, 2026 2:00 PM to 2:08 PM · 8 min. (America/New_York)

M101: Level M

Abstracts

Informatics/Data Science/AI

Information

Abstract Number

220

Background and Objectives

Missed opportunities for diagnosis (MODs) are a major source of morbidity and mortality in the emergency department (ED). Traditional electronic triggers (eTriggers), such as 72-hour returns with admission, identify cases with elevated error risk but have low positive predictive value (PPV). Large language models (LLMs) may enhance MOD detection and improve the efficiency of quality review.

Methods

We performed a retrospective observational cohort study of ED encounters (March 2015–June 2025) from 10 EDs in a single US health system. Emergency physicians adjudicated random samples of encounters identified by three validated eTriggers (72-hour return admission, 10-day ICU return, and floor-to-ICU escalation). A novel hybrid eTrigger combining an LLM adjudicator with a rules engine was evaluated for 9-day return admissions with emergency care–sensitive conditions (ECSCs). Claude Sonnet 4 was prompted with an iteratively developed SaferDx-based schema to identify MODs. Primary measures were PPV, sensitivity, specificity, negative predictive value (NPV), and number needed to screen (NNS). The secondary outcome was stakeholder ratings of LLM case summaries.
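The primary measures all follow from a 2x2 table that compares the LLM's call with the physician adjudication. The following is a minimal sketch under the standard definitions, assuming NNS is taken as the reciprocal of PPV (the abstract does not state its formula). The names `ScreeningMetrics` and `screening_metrics` and the example counts are hypothetical and are not study data.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    nns: float  # assumption: number needed to screen = 1 / PPV


def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> ScreeningMetrics:
    """Compute screening metrics from 2x2 counts.

    tp: LLM flagged, physician confirmed a MOD
    fp: LLM flagged, physician found no MOD
    fn: LLM did not flag, physician confirmed a MOD
    tn: LLM did not flag, physician found no MOD
    """
    ppv = tp / (tp + fp)
    return ScreeningMetrics(
        sensitivity=tp / (tp + fn),
        specificity=tn / (tn + fp),
        ppv=ppv,
        npv=tn / (tn + fn),
        nns=1 / ppv,
    )


# Hypothetical counts for illustration only (not from the study).
m = screening_metrics(tp=8, fp=32, fn=2, tn=58)
print(
    f"sens={m.sensitivity:.2f} spec={m.specificity:.2f} "
    f"ppv={m.ppv:.2f} npv={m.npv:.2f} nns={m.nns:.1f}"
)
```

The sketch does not guard against empty rows or columns (for example, no flagged cases), which would raise a division-by-zero error and would need explicit handling in real use.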

Results

Among 357 encounters (mean age 65.2 years; 47.1% female), MOD PPVs for traditional eTriggers ranged from 11.0% to 18.6%. For 72-hour returns, the LLM achieved sensitivity 85.7%, specificity 56.8%, PPV 19.8%, and NPV 97.0%. For 10-day ICU returns, sensitivity was 100%, specificity 43.5%, PPV 27.8%, and NPV 100%. For floor-to-ICU escalations, sensitivity was 55.6%, specificity 64.6%, PPV 26.3%, and NPV 86.4%. The hybrid ECSC eTrigger flagged 110 of 207 cases (53.1%) as MODs; blinded review estimated a PPV of 45% and an NPV of 100%. LLM summaries were rated actionable for clinician feedback (mean 4.1/5) but less so for systems-level improvement (1.4/5).
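To give a rough sense of what the reported PPVs mean for reviewer workload, the sketch below converts them to an approximate number needed to screen. It assumes NNS = 1 / PPV, which the abstract does not confirm. The resulting figures are derived arithmetic, not values reported by the study, and the dictionary labels are shorthand introduced here.

```python
# PPVs as reported in the Results; labels are shorthand for this sketch.
reported_ppv = {
    "traditional eTriggers (low end)": 0.110,
    "traditional eTriggers (high end)": 0.186,
    "LLM, 72-hour returns": 0.198,
    "LLM, 10-day ICU returns": 0.278,
    "LLM, floor-to-ICU escalations": 0.263,
    "hybrid ECSC eTrigger (blinded estimate)": 0.45,
}

# Assumption: NNS = 1 / PPV, i.e. charts reviewed per confirmed MOD.
nns = {label: 1 / ppv for label, ppv in reported_ppv.items()}

for label, value in nns.items():
    print(f"{label}: about {value:.1f} charts per MOD")
```

Under this assumption, a PPV of 11.0% corresponds to roughly 9 charts reviewed per confirmed MOD, and the estimated 45% for the hybrid eTrigger to roughly 2.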

Conclusion

LLM adjudication showed high NPVs and substantially improved screening efficiency. LLM-generated narratives may support scalable diagnostic quality oversight and clinician-focused feedback in the ED.

CME

0.75