Feasibility and Accuracy of an Emergency Department Chatbot for Real-Time Operational Support

Tuesday, May 19, 2026 4:48 PM to 5:00 PM · 12 min. (America/New_York)

International Hall 7: Level I

Abstracts

Informatics/Data Science/AI

Background and Objectives

Operational questions disrupt emergency department (ED) workflows by requiring clinicians to search for policies, paging pathways, and order logistics during active patient care. Internal chatbots have been proposed as a solution, but their accuracy for real-time operational guidance remains unclear. We evaluated the feasibility and accuracy of an internal ED chatbot and assessed whether restructuring policies to be more machine-interpretable would meaningfully improve performance.

Methods

Semi-structured interviews with 7 attendings, 3 fellows, and 5 residents identified common on-shift operational pain points. Nineteen operational topics were selected, and each was assessed with a single multiple-choice question (four options, one correct) reflecting a realistic on-shift scenario. The chatbot, a system-wide internal tool that incorporates institutional policies, was evaluated in three stages: (1) at baseline with no preparation, (2) after being provided the relevant policies, and (3) after the policies were restructured to emphasize key actions, state explicit timing rules, and remove ambiguous language. The primary outcome was accuracy, defined as the proportion of the 19 questions answered correctly. Proportions are reported with 95% Wilson confidence intervals.
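For readers unfamiliar with the Wilson score interval, the following is a minimal sketch of how such an interval can be computed for a proportion from a small sample. It is illustrative only and is not the authors' analysis code; the function name `wilson_interval` and the rounded critical value z = 1.96 are assumptions made for this example.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a binomial proportion.

    z = 1.96 is the usual rounded critical value for a two-sided 95% interval
    (an assumption of this sketch, not a value stated in the abstract).
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p_hat = successes / n          # observed proportion
    z2 = z * z
    denom = 1 + z2 / n             # shared denominator of the Wilson formula
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Example: 7 correct answers out of 19 questions.
low, high = wilson_interval(7, 19)
print(f"{7 / 19:.1%} (95% CI {low:.1%} to {high:.1%})")
# prints: 36.8% (95% CI 19.1% to 59.0%)
```

The Wilson interval is often preferred over the simple normal-approximation interval when the sample is small, as with 19 questions, because its bounds stay within 0 and 1 and its coverage is closer to the nominal level.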

Results

Baseline accuracy was 7/19 (36.8%, 95% CI 19.1–59.0). After providing the policies, accuracy increased to 12/19 (63.2%, 95% CI 41.0–80.9). After policy restructuring, accuracy improved to 14/19 (73.7%, 95% CI 51.6–88.0). Despite this improvement, the remaining errors followed consistent patterns, particularly in time-dependent workflows and in identifying the correct first step. In several cases, the chatbot selected responses that sounded reasonable or “safe” and supported them with confident explanations that were inconsistent with policy.

Conclusion

An internal ED chatbot can answer some operational questions, and its accuracy improves with policy access and policy restructuring, but performance remains below what dependable real-time guidance requires. In operational contexts where a single incorrect answer can delay care, misdirect paging, or trigger the wrong pathway, accuracy in the 70–75% range is insufficient for unsupervised clinical use. Future work should examine whether more structured policy modifications or alternative models can improve reliability before the tool is evaluated in real-time clinical use.

CPE

CME: 1.25