Reliability and Usability of a Case-Simulation Platform for Pulmonary Embolism Diagnostic Practices

Wednesday, May 20, 2026 11:16 AM to 11:24 AM · 8 min. (America/New_York)

M101: Level M

Abstracts

Informatics/Data Science/AI

Information

Number

371

Background and Objectives

Despite well-established clinical decision pathways for pulmonary embolism (PE) workup, there is little clinical guidance on when these pathways should be initiated. Here, we sought to test the reliability, usability, and validity of a novel large language model (LLM)-driven online case simulation platform probing the decision to initiate PE workup using clinical vignettes.

Methods

We recruited emergency medicine residents from the University of Michigan and Vanderbilt University Medical Center to complete simulation cases powered by GPT-4.1 from OpenAI. Each resident completed six cases: one high-risk for PE (positive control), one ankle sprain (negative control), and four test cases randomly selected from a pool with varied feature combinations (chest pain versus back pain, pleuritic quality of pain, consolidative x-ray findings) that expert consensus identified as potentially modifying suspected PE probability. Residents interacted with virtual patients through free-text dialogue, obtaining a history, eliciting physical exam findings, and ordering tests. The platform's reliability was measured as concordance between the underlying case specifications prompted to the LLM and the actual LLM outputs. Usability was assessed via the System Usability Scale (SUS; 0-100), with a score of ≥60 considered acceptable. Validity was assessed using rates of PE workup in the positive control and negative control cases.
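For readers unfamiliar with the SUS, the sketch below shows the standard scoring rule for the ten-item questionnaire (each item rated 1 to 5) and how a median and a share at or above the threshold of 60 could be summarized. The function names and the example responses are hypothetical and are not taken from the study's analysis code.

```python
from statistics import median


def sus_score(responses):
    """Score one ten-item SUS questionnaire (each item rated 1-5) on a 0-100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten responses, each an integer from 1 to 5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded: contribution is rating - 1.
            total += rating - 1
        else:
            # Even-numbered items are negatively worded: contribution is 5 - rating.
            total += 5 - rating
    # The summed contributions (0-40) are rescaled to 0-100.
    return total * 2.5


def summarize_sus(scores, threshold=60):
    """Return the median score and the share of scores at or above the threshold."""
    share_acceptable = sum(s >= threshold for s in scores) / len(scores)
    return median(scores), share_acceptable


# Hypothetical responses, for illustration only.
example_scores = [
    sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]),
    sus_score([3, 3, 3, 3, 3, 3, 3, 3, 3, 3]),
    sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]),
]
print(summarize_sus(example_scores))
```

Because odd and even items are scored in opposite directions, a respondent who answers every item with the neutral rating of 3 receives a score of 50, below the acceptability threshold used here.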

Results

Twenty residents completed 120 simulation cases. The platform reproduced case features with 100% concordance to the underlying case specifications. Users reported high usability (median SUS 75; IQR 70–85; 95% of participants scored ≥60). PE workup was initiated in 95% of positive control cases, 0% of negative control cases, and 58% of test cases. Participants varied considerably in their propensity to obtain a D-dimer and/or CT pulmonary angiography (CTPA) across test cases (median 62.5%; range 0 to 100%).
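As a rough illustration of how a per-participant workup propensity could be computed, the sketch below assumes each resident's four test-case decisions are recorded as booleans (True when a D-dimer and/or CTPA was ordered). The identifiers and the data are invented for the example and are not study data. With four test cases per resident, an individual rate can only be 0, 25, 50, 75, or 100%, so a median of 62.5% can arise, for example, when the two middle residents sit at 50% and 75%.

```python
from statistics import median


def workup_rate(decisions):
    """Percentage of a resident's test cases in which PE workup was initiated."""
    return 100.0 * sum(decisions) / len(decisions)


# Hypothetical decisions for four residents (not study data).
decisions_by_resident = {
    "resident_a": [True, True, True, True],
    "resident_b": [True, True, True, False],
    "resident_c": [True, True, False, False],
    "resident_d": [False, False, False, False],
}

rates = sorted(workup_rate(d) for d in decisions_by_resident.values())
summary = {"median": median(rates), "min": rates[0], "max": rates[-1]}
print(summary)
```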

Conclusion

This LLM-driven simulation platform achieved high case reliability, usability, and validity in a two-site resident pilot, and participants showed wide variation in their decisions to initiate PE workup. Future studies should test generalizability and link simulation decisions to clinical practice.

CPE

CME

0.75