

The Role of Large Language Models in Emergency Care: A Comprehensive Benchmarking Study
Tuesday, May 19, 2026 2:08 PM to 2:16 PM · 8 min. (America/New_York)
M101: Level M
Abstracts
Informatics/Data Science/AI
Information
Abstract Number
221
Background and Objectives
Emergency departments (EDs) face compounding crises of overcrowding, workforce shortages, financial instability, and increasingly complex multimorbid patients. Large Language Models (LLMs) offer the potential to streamline workflows and support clinical decision-making. This study evaluated LLMs’ emergency medicine knowledge and their ability to perform simulated ED tasks.
Methods
This two-part study first tested factual knowledge of 18 LLMs using a curated MedMCQA subset covering 12 ED chief complaints, assessing accuracy, precision, and recall. Five models (GPT-5, GPT-4, Claude 3.5, Claude 4, and LLaMA 3.1) were then evaluated on patient summaries, Emergency Severity Index scoring, investigative questioning, management planning, and differential diagnosis across 12 simulated ED cases presented through four sequential information levels. Physicians rated outputs for accuracy, safety, and clinical relevance, with performance differences analyzed statistically.
Results
LLaMA-4 Maverick achieved the highest factual accuracy (90.7%), followed by LLaMA-3.1-70B (90.1%). In clinical tasks, GPT-5 outperformed all models (Level 2 onward, p < 0.05), with performance stable or improving as complexity increased. Claude 3.5 ranked next, while Claude 4 performed slightly lower but remained stable with complexity. LLaMA-3.1 and GPT-4 ranked lowest and showed the greatest degradation. All models undertriaged except Claude 3.5, which initially overtriaged.
Conclusion
GPT-5 demonstrated the strongest applied clinical reasoning and scaled best with case complexity, while LLaMA models excelled in factual recall. These findings highlight a generational leap in reasoning performance, with GPT-5 showing early promise for ED decision support; however, real-world validation remains essential prior to deployment.
CME
0.75
Disclosures
Access the following link to view disclosures of session presenters, presenting authors, organizers, moderators, and planners:

