Reliability and Usability of a Case-Simulation Platform for Pulmonary Embolism Diagnostic Practices

Wednesday, May 20, 2026 11:16 AM to 11:24 AM · 8 min. (America/New_York)
M101: Level M
Abstracts
Informatics/Data Science/AI

Number
371
Background and Objectives
Despite well-established clinical decision pathways for pulmonary embolism (PE) workup, there is little clinical guidance on when these pathways should be initiated. We sought to test the reliability, usability, and validity of a novel large language model (LLM)-driven online case simulation platform that uses clinical vignettes to probe the decision to initiate PE workup.
Methods
We recruited emergency medicine residents from the University of Michigan and Vanderbilt University Medical Center to complete simulation cases powered by GPT-4.1 (OpenAI). Each resident completed six cases: one high-risk for PE (positive control), one ankle sprain (negative control), and four test cases randomly selected from a pool with varied feature combinations (chest pain versus back pain, pleuritic quality of pain, consolidative x-ray findings) chosen by expert consensus as potentially modifying the suspected probability of PE. Residents interacted with virtual patients through free-text dialogue, obtaining a history, eliciting physical exam findings, and ordering tests. Platform reliability was measured as concordance between the underlying case specifications prompted to the LLM and the actual LLM outputs. Usability was assessed with the System Usability Scale (SUS; 0-100), with a score of ≥60 considered acceptable. Validity was assessed using rates of PE workup in the positive control and negative control cases.
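The case allocation and usability scoring described above can be sketched in Python. This is an illustration only, not the study's code: the abstract does not describe the size or composition of the test-case pool, so the full 2x2x2 feature grid, sampling without replacement, and all function and field names here are assumptions. The SUS function follows the standard published scoring rule for the ten-item questionnaire.

```python
import itertools
import random

# Hypothetical feature grid. The abstract names these three features but not
# the actual pool, so a full factorial design (8 cases) is an assumption.
PAIN_LOCATION = ("chest", "back")
PLEURITIC = (True, False)
CONSOLIDATION_ON_XRAY = (True, False)


def build_test_pool():
    """Return every combination of the three features as a candidate test case."""
    return [
        {"pain_location": loc, "pleuritic": pl, "consolidation": cons}
        for loc, pl, cons in itertools.product(
            PAIN_LOCATION, PLEURITIC, CONSOLIDATION_ON_XRAY
        )
    ]


def assign_cases(rng, n_test=4):
    """Build one resident's six cases: two controls plus n_test random test cases.

    Sampling without replacement is an assumption, not stated in the abstract.
    """
    cases = [{"role": "positive_control"}, {"role": "negative_control"}]
    for features in rng.sample(build_test_pool(), n_test):
        cases.append({"role": "test", **features})
    return cases


def sus_score(responses):
    """Standard SUS scoring for ten items, each rated 1 to 5.

    Odd-numbered items contribute (rating - 1), even-numbered items
    contribute (5 - rating); the sum is multiplied by 2.5 to give 0-100.
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten ratings, each an integer from 1 to 5")
    total = 0
    for index, rating in enumerate(responses):
        # index 0, 2, 4, ... are the odd-numbered questionnaire items
        total += (rating - 1) if index % 2 == 0 else (5 - rating)
    return total * 2.5


if __name__ == "__main__":
    for case in assign_cases(random.Random(2026)):
        print(case)
    print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

Seeding the random generator per resident would make an allocation reproducible, which helps when checking that each transcript matches the case specification it was prompted with.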
Results
Twenty residents completed 120 simulation cases. The platform reproduced case features with 100% concordance to the underlying case specifications. Users reported high usability (median SUS 75; IQR 70–85; 95% of participants scored ≥60). PE workup was initiated in 95% of positive control cases, 0% of negative control cases, and 58% of test cases. Participants varied considerably in their propensity to obtain a D-dimer and/or CT pulmonary angiography (CTPA) across test cases (median 62.5%; range 0% to 100%).
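The per-participant propensity figure can be read as the share of each resident's four test cases in which a PE workup was ordered, summarized across residents. The sketch below shows one way to compute it. The function names and the toy decisions are hypothetical and are not study data. With four test cases per resident, an individual propensity can only be 0%, 25%, 50%, 75%, or 100%, so with an even number of residents a median such as 62.5% arises as the midpoint of 50% and 75%.

```python
from statistics import median


def workup_propensity(decisions):
    """Percent of one participant's test cases with a D-dimer and/or CTPA ordered."""
    if not decisions:
        raise ValueError("at least one test-case decision is required")
    return 100.0 * sum(decisions) / len(decisions)


def summarize_propensity(decisions_by_participant):
    """Median and range of workup propensity across participants."""
    rates = [workup_propensity(d) for d in decisions_by_participant.values()]
    return {"median": median(rates), "min": min(rates), "max": max(rates)}


# Toy input (hypothetical, NOT study data): True means a PE workup was ordered.
toy_decisions = {
    "resident_a": [False, False, False, False],  # 0%
    "resident_b": [True, True, False, False],    # 50%
    "resident_c": [True, True, True, False],     # 75%
    "resident_d": [True, True, True, True],      # 100%
}

summary = summarize_propensity(toy_decisions)
print(summary)  # {'median': 62.5, 'min': 0.0, 'max': 100.0}
```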
Conclusion
This LLM-driven simulation platform achieved high case reliability, usability, and validity in a two-site resident pilot, and participants showed wide practice variation. Future studies should test generalizability and link simulation decisions to clinical practice.
CPE
0
CME
0.75
