Comparative Evaluation of Machine Translation Accuracy of Emergency Department Discharge Instructions: A Noninferiority Study

Tuesday, May 19, 2026 4:24 PM to 4:36 PM · 12 min. (America/New_York)

International Hall 7: Level I

Abstracts

Informatics/Data Science/AI

Information

Background and Objectives

Patients with limited English proficiency (LEP) face disproportionate risks at emergency department (ED) discharge. Professional interpretation improves outcomes, but real-time written translations remain difficult to provide in many EDs. Modern transformer-based large language models (LLMs) may offer improved translation quality compared with older systems, yet their performance on ad hoc provider-written ED discharge instructions is not well established.

Methods

We conducted a blinded cross-sectional non-inferiority study of English-language ED discharge instructions translated into Spanish, Brazilian Portuguese, and Simplified Chinese, comparing Google Translate and ChatGPT-4o with professional medical interpreters. Fifty-three randomly selected provider-written instructions (100–500 words, with spelling and grammar errors preserved) were translated, yielding 477 unique translations. Professional medical interpreters, blinded to translation method, independently scored each translation on fluency, adequacy, meaning, and severity using a five-point Likert scale. Inter-rater reliability between the interpreters' evaluations was calculated. A 0.5-point non-inferiority margin was pre-specified. For each language and each accuracy dimension, adjusted mean differences in Likert ratings between translation methods were estimated with mixed effects models and compared against this margin. The proportion of clinically significant translation errors was also compared between methods, as was the ability of evaluators to guess the translation method.
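The non-inferiority comparison described above can be illustrated with a minimal Python sketch. It is not the study's analysis code. It assumes that higher Likert scores are better, that the difference is computed as machine minus professional, and that non-inferiority is declared when the lower bound of the confidence interval lies above the negative margin; the abstract does not state these conventions. The `Estimate` class, the `is_non_inferior` function, and the numbers in the usage lines are hypothetical and are not study results. Only the 0.5-point margin and the counts (53 instructions, 3 languages, 477 translations) come from the abstract.

```python
from dataclasses import dataclass

# Pre-specified non-inferiority margin in Likert points (from the abstract).
MARGIN = 0.5


@dataclass
class Estimate:
    """Hypothetical container for one adjusted comparison.

    diff is the adjusted mean Likert difference (machine minus professional);
    ci_lower and ci_upper bound its confidence interval. The sign convention
    is an assumption, not something the abstract specifies.
    """
    diff: float
    ci_lower: float
    ci_upper: float


def is_non_inferior(est: Estimate, margin: float = MARGIN) -> bool:
    # Assumed decision rule: the whole interval must sit above -margin.
    return est.ci_lower > -margin


# Arithmetic check: 53 instructions x 3 languages x 3 translation methods
# (two machine methods plus professional) is consistent with the
# reported 477 translations.
N_TRANSLATIONS = 53 * 3 * 3

# Illustrative values only, not data from the study.
example_ok = Estimate(diff=-0.2, ci_lower=-0.4, ci_upper=0.0)
example_fail = Estimate(diff=-0.6, ci_lower=-0.9, ci_upper=-0.3)

print(N_TRANSLATIONS)
print(is_non_inferior(example_ok), is_non_inferior(example_fail))
```

Under this rule, a comparison whose interval extends below the negative margin fails to show non-inferiority. How the study separated "inferior" from "inconclusive" results is not stated in the abstract, so the sketch does not attempt that distinction.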

Results

Inter-rater reliability was high across languages. Both machine translation methods were non-inferior to professional interpreters for adequacy, meaning, and severity in Spanish and Portuguese, and for all four domains in Chinese. For fluency, Google Translate and ChatGPT-4o were inferior in Spanish and Portuguese but non-inferior in Chinese. The frequency of clinically significant errors did not differ significantly by translation method. Evaluators, blinded to method, frequently mis-identified machine translations as professional.

Conclusion

In this multi-language evaluation of real-world ED discharge instructions, Google Translate and ChatGPT-4o were non-inferior to professional interpreters for most domains of translation accuracy.

CPE

CME

1.25