Highly Accurate Automated Abstraction of Potential High-Risk Features From Pulmonary Embolism Imaging Reports

Wednesday, May 20, 2026 4:56 PM to 5:04 PM · 8 min. (America/New_York)
International Hall 9: Level I
Abstracts
Informatics/Data Science/AI

Abstract Number
664
Background and Objectives
Recently, we reported that CT pulmonary embolism (CTPE) findings widely believed to confer high risk, such as saddle PE and RV strain, increased hospitalization of low-risk acute PE patients at our academic medical center without affecting their rate of adverse clinical outcomes. To validate these findings broadly and motivate confirmatory prospective trials, we are building an automated registry of acute low-risk PE patients spanning the 50+ EDs of the Michigan Emergency Department Improvement Collaborative (MEDIC). As a first step, we evaluated large language models (LLMs) for extraction of structured data from free-text CTPE reports. Our objectives were (1) to compare the performance of a proprietary, cloud-based LLM with a locally hosted, open-source LLM that would be easier to implement across multiple sites, and (2) to determine which CTPE features were most amenable to accurate abstraction.
Methods
We evaluated LLM performance on an existing dataset of 400 PE-positive CTPE reports manually labeled for six fields: laterality, most proximal location, RV:LV ratio, pulmonary infarct, and the presence and description of interventricular septal abnormalities. We compared U-M GPT-4o, a proprietary LLM, with Mistral-7B-Instruct, an open-source model, both at baseline and paired with hidden Chain-of-Thought (CoT) prompting. Accuracy, macro-F1 score, and mean absolute error (MAE) were used to assess model output. Token usage and cost were also recorded.
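The three evaluation metrics named above are standard. As an illustrative sketch only (not the study's actual evaluation pipeline, and with hypothetical toy labels), they can be computed over extracted field values like so:

```python
def accuracy(y_true, y_pred):
    """Fraction of reports where the extracted value matches the manual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def mae(y_true, y_pred):
    """Mean absolute error for a numeric field such as the RV:LV ratio."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical laterality labels for three reports (for demonstration only).
truth = ["bilateral", "left", "right"]
model = ["bilateral", "left", "left"]
print(round(accuracy(truth, model), 3))  # 2 of 3 correct
```

Macro-F1 averages per-class F1 without weighting by class frequency, which is why it is a stricter summary than accuracy for imbalanced fields such as pulmonary infarct.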
Results
U-M GPT-4o performed well across all six fields, with near-perfect accuracy (>98%) when paired with CoT. Macro-F1 scores ranged from 0.938 to 0.997 across categorical fields, with an MAE of 0.004 for RV:LV. Mistral-7B with CoT performed nearly as well (accuracy >95%) for PE laterality, RV:LV, and septal abnormalities, but struggled with PE location (83% accuracy, 0.83 macro-F1) and pulmonary infarct (34% accuracy, 0.41 macro-F1). Cost was ~$15 for U-M GPT-4o across 2,400 calls, while Mistral-7B required only 300 calls, at a fraction of the cost.
Conclusion
Highly accurate, automated abstraction of structured data elements from CTPE reports is feasible with existing LLMs. While U-M GPT-4o performed best, Mistral-7B, a locally hosted model that would be easier to implement across the MEDIC network, achieved similar accuracy for some data fields. Hidden Chain-of-Thought prompting improved the performance of both models.
CME
1.25
