Evaluating the Performance of a Large Language Model in Reviewing EMTALA Cases Involving Disruptive Behavior

Wednesday, May 20, 2026 1:16 PM to 1:24 PM · 8 min. (America/New_York)

International Hall 7: Level I

Abstracts

Health Policy

Information

Number

491

Background and Objectives

Hospitals can find it challenging to comply with the medical screening exam (MSE) requirement of the Emergency Medical Treatment and Labor Act (EMTALA) when patients exhibit disruptive behavior such as violence, verbal abuse, or threats toward staff. Prior work identified many EMTALA citations for failure to provide an MSE to patients exhibiting disruptive behavior, but that work required extensive manual review. This study assesses how well large language models (LLMs) identify disruptive behavior in EMTALA inspection texts.

Methods

A program was developed to review inspection texts from EMTALA violations for failure to provide an MSE, using Google’s Gemini 2.5 Flash-Lite. The LLM was prompted to determine whether each inspection text described disruptive behavior, excluding mental health crises not directed at others, suicidal ideation, and self-inflicted violence. Iterative prompt engineering was used to assess model validity under various prompt conditions. Model outputs were compared against the determinations of trained human reviewers, which served as the ground truth. The dataset consisted of failure-to-MSE violations from 2023 that contained keywords suggestive of disruptive behavior and had been reviewed by two independent trained reviewers in a prior study. Model performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under the ROC curve (AUC), and Cohen’s kappa.
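The classification step described above can be sketched roughly as follows. This is an illustrative sketch only: the abstract does not publish the prompt or the program, so the prompt wording, the function names (build_prompt, parse_label, classify), and the YES/NO answer format are all assumptions. The model call is passed in as a plain callable so that no particular SDK is implied; a stub stands in for Gemini 2.5 Flash-Lite here.

```python
from typing import Callable

# Exclusion criteria taken from the Methods text.
EXCLUSIONS = (
    "mental health crises not directed at others",
    "suicidal ideation",
    "self-inflicted violence",
)

TEXT_MARKER = "Inspection text:\n"


def build_prompt(inspection_text: str) -> str:
    """Assemble a hypothetical classification prompt (wording is assumed)."""
    exclusions = "; ".join(EXCLUSIONS)
    return (
        "You are reviewing an EMTALA inspection text for a failure to "
        "provide a medical screening exam (MSE).\n"
        "Does the text describe disruptive behavior by the patient, such as "
        "violence, verbal abuse, or threats toward staff?\n"
        f"Do not count: {exclusions}.\n"
        "Answer with exactly one word: YES or NO.\n\n"
        f"{TEXT_MARKER}{inspection_text}"
    )


def parse_label(reply: str) -> bool:
    """Map the model's reply to True (disruptive) or False (not disruptive)."""
    token = reply.strip().upper()
    if token.startswith("YES"):
        return True
    if token.startswith("NO"):
        return False
    raise ValueError(f"Unexpected model reply: {reply!r}")


def classify(inspection_text: str, call_model: Callable[[str], str]) -> bool:
    """Classify one inspection text using any prompt-in, text-out callable."""
    return parse_label(call_model(build_prompt(inspection_text)))


def fake_model(prompt: str) -> str:
    """Stub used in place of a real LLM call, for demonstration only."""
    text = prompt.split(TEXT_MARKER, 1)[1].lower()
    return "YES" if "threatened" in text else "NO"


label_a = classify("Patient threatened the triage nurse.", fake_model)
label_b = classify("Patient reported chest pain.", fake_model)
print(label_a, label_b)
```

Keeping the model call behind a callable makes it straightforward to swap prompt variants or models during iterative prompt engineering without touching the parsing logic.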

Results

The LLM reviewed 50 inspection texts from EMTALA violations for failure to provide an MSE. Several human-generated prompt conditions (e.g., inclusionary vs. exclusionary classification) were explored, but an LLM-generated prompt performed best. The LLM achieved a sensitivity of 83.3%, specificity of 87.5%, PPV of 78.9%, NPV of 90.3%, AUC of 0.854, and Cohen’s kappa of 0.7.
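The reported metrics can be cross-checked with a short calculation. The abstract does not give the confusion matrix; the counts used below (15 true positives, 3 false negatives, 4 false positives, 28 true negatives) are an inference, chosen because they sum to 50 and reproduce the reported percentages. The AUC line additionally assumes one yes/no output per text, in which case the ROC curve has a single operating point and the area under it equals the mean of sensitivity and specificity. The function name binary_metrics is hypothetical.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute standard agreement metrics from a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value
    npv = tn / (tn + fn)          # negative predictive value

    # AUC for a single binary operating point (assumption: no score threshold).
    auc = (sensitivity + specificity) / 2

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (observed - expected) / (1 - expected)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "auc": auc,
        "kappa": kappa,
    }


# Inferred counts (not stated in the abstract); they total 50 texts.
metrics = binary_metrics(tp=15, fn=3, fp=4, tn=28)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these inferred counts, the calculation gives a sensitivity of 15/18, specificity of 28/32, PPV of 15/19, and NPV of 28/31, which round to the reported figures, and a kappa that rounds to 0.7.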

Conclusion

Our findings demonstrate the potential of LLMs to improve the speed and accuracy of identifying disruptive behavior in EMTALA-related research and to supplement manual review. In addition, we found that an LLM-generated prompt outperformed human-generated prompts. LLMs have the potential to support further study of EMTALA enforcement where disruptive behavior is involved. Such work may help inform future health policy guidance so that hospitals can maintain EMTALA compliance while protecting staff and patients from the direct and indirect consequences of disruptive behavior.

CPE

CME

0.75