

The Role of Large Language Models in Emergency Care: A Comprehensive Benchmarking Study
Tuesday, May 19, 2026 2:08 PM to 2:16 PM · 8 min. (America/New_York)
M101: Level M
Abstracts
Informatics/Data Science/AI
Information
Abstract Number
221
Background and Objectives
Emergency departments (EDs) face compounding crises of overcrowding, workforce shortages, financial instability, and increasingly complex multimorbid patients. Large Language Models (LLMs) offer the potential to streamline workflows and support clinical decision-making. This study evaluated LLMs’ emergency medicine knowledge and their ability to perform simulated ED tasks.
Methods
This two-part study first tested factual knowledge of 18 LLMs using a curated MedMCQA subset covering 12 ED chief complaints, assessing accuracy, precision, and recall. Five models (GPT-5, GPT-4, Claude 3.5, Claude 4, and LLaMA 3.1) were then evaluated on patient summaries, Emergency Severity Index scoring, investigative questioning, management planning, and differential diagnosis across 12 simulated ED cases presented through four sequential information levels. Physicians rated outputs for accuracy, safety, and clinical relevance, with performance differences analyzed statistically.
Results
LLaMA-4 Maverick achieved the highest factual accuracy (90.7%), followed by LLaMA-3.1-70B (90.1%). In clinical tasks, GPT-5 outperformed all models (Level 2 onward, p < 0.05), with performance stable or improving as complexity increased. Claude 3.5 ranked next, while Claude 4 performed slightly lower but remained stable with complexity. LLaMA-3.1 and GPT-4 ranked lowest and showed the greatest degradation. All models undertriaged except Claude 3.5, which initially overtriaged.
Conclusion
GPT-5 demonstrated the strongest applied clinical reasoning and scaled best with case complexity, while LLaMA models excelled in factual recall. These findings highlight a generational leap in reasoning performance, with GPT-5 showing early promise for ED decision support; however, real-world validation remains essential prior to deployment.
CME
0.75
Disclosures
Access the following link to view disclosures of session presenters, presenting authors, organizers, moderators, and planners:

