Editorial Type: Letters to the Editor
Online Publication Date: 21 Nov 2025

Stratifying Errors in Anthropic’s Chatbot Claude–Simplified Pathology Reports

Page Range: 1061–1064

To the Editor.—Anatomic pathology reports are foundational to patient care and treatment planning. Yet their technical language may leave patients confused about their condition.1,2 Artificial intelligence–powered chatbots, such as Anthropic’s Claude, offer a potential solution for bridging this communication gap. Building on a prior evaluation of GPT-4 and Bard,3 we analyzed Claude’s ability to simplify pathology reports into patient-friendly explanations and systematically stratified its errors to better understand the potential risks and limitations. We used Claude 3 Opus (Anthropic, released February 2024) with default system settings and no additional parameters or custom system instructions. Reports were submitted programmatically, and a new chat instance was initiated for each report. The following prompt was used: “Please explain the following pathology report to someone without any high-school education. Be concise: [pathology report].”
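For readers wishing to reproduce the submission pipeline, the sketch below shows one way to implement it with the Anthropic Python SDK. The function name and the max_tokens value are our own illustrative assumptions; the letter specifies only the model, the prompt, and the use of a fresh chat instance per report.

    import anthropic

    PROMPT = ("Please explain the following pathology report to someone "
              "without any high-school education. Be concise: {report}")

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def simplify_report(report_text: str) -> str:
        # Each call to messages.create carries no prior context, mirroring
        # the new chat instance initiated for each report.
        message = client.messages.create(
            model="claude-3-opus-20240229",  # Claude 3 Opus, February 2024 release
            max_tokens=1024,  # assumed value; not stated in the letter
            messages=[{"role": "user",
                       "content": PROMPT.format(report=report_text)}],
        )
        return message.content[0].text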

We evaluated a corpus of 1134 pathology reports spanning diverse procedures, organ systems, and conditions. Two pathology trainees (resident physicians) independently screened and flagged reports with potential issues, which were subsequently reviewed by a panel of 2 senior pathologists. Simplified reports were categorized as medically correct (no errors), partially correct (minor errors unlikely to affect patient management), or medically incorrect (significant errors with potential clinical implications).

Erroneous simplified reports were further stratified into error categories: anatomic localization errors; staging and stage-explanation errors; misunderstanding of an ambiguous report; misinterpretation or oversimplification of medical terms; and diagnostic interpretation errors. Readability metrics, including the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE), were recorded for original and simplified reports using a commercial web-based tool (readable.com).
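Both metrics are also reproducible outside the commercial tool. As an illustration, the open-source textstat package computes the same two scores from raw text; the package choice here is ours, not the study’s.

    import textstat

    def readability(text: str) -> tuple[float, float]:
        # Flesch-Kincaid Grade Level: higher values indicate harder text
        fkgl = textstat.flesch_kincaid_grade(text)
        # Flesch Reading Ease: higher values (0-100 scale) indicate easier text
        fre = textstat.flesch_reading_ease(text)
        return fkgl, fre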

Claude significantly improved readability. The mean FKGL decreased from 13.19 (95% CI, 12.98–13.41) in original reports to 8.74 (95% CI, 8.52–8.98) in simplified reports, while the mean FRE score increased from 10.32 (95% CI, 8.69–11.96) to 62.84 (95% CI, 62.30–63.39) (P < .001 for both). Despite these improvements, 57 of the 1134 reports (5.03%) contained errors, including 10 (0.88%) with significant errors that could have clinical implications. Of those 57 reports, common errors included misinterpretation or oversimplification of medical terms (17 reports; 29.82%) and anatomic localization errors (14 reports; 24.56%) (Figure). Select reports are shown in the Table.
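The letter does not name the statistical test behind P < .001. A paired comparison of per-report scores is a natural choice given the matched design, and the SciPy sketch below (which assumes a paired t test and t-based confidence intervals) shows how the means, 95% CIs, and P values could be derived.

    import numpy as np
    from scipy import stats

    def mean_ci(scores: np.ndarray) -> tuple[float, float, float]:
        # Mean with a 95% confidence interval based on the t distribution
        mean = scores.mean()
        lo, hi = stats.t.interval(0.95, df=len(scores) - 1,
                                  loc=mean, scale=stats.sem(scores))
        return mean, lo, hi

    def paired_p(original: np.ndarray, simplified: np.ndarray) -> float:
        # Each simplified report is matched to its original, so a paired test
        # (assumed here; the letter does not specify the test) is used.
        return stats.ttest_rel(original, simplified).pvalue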

Figure. Distribution of error categories in chatbot-simplified pathology reports (N = 57).

Table. Categorization of Errors in Chatbot-Simplified Pathology Reports.

Claude’s performance in simplifying pathology reports was largely accurate, with few errors and virtually no hallucinations. Its accuracy was comparable to that of GPT-4 and surpassed that of Bard.3 Most errors stemmed from either the chatbot’s unfamiliarity with the subject matter or inconsistencies in pathology reporting language. For example, in mismatch repair deficiency testing, variations in terminology (eg, “no loss of MLH1,” “microsatellite stable”) led to misinterpretations, such as equating “stable” with slow tumor progression. It is possible that standardizing pathology report language could reduce such errors, which is critical for patients relying on simplified explanations. Testing chatbot performance using only the synoptic portions of reports, without free text, could further explore this hypothesis. However, errors also occurred in interpreting standard reporting notations, such as those from the American Joint Committee on Cancer,4 for example equating pNx (no lymph nodes submitted) with pN0 (all nodes negative for malignancy), highlighting the need for refinement.

Although these tools are promising, significant errors in chatbot simplifications pose a risk of serious misinterpretation. Future research should prioritize both improving fine-tuned models and training foundation models, with a focus on minimizing clinically significant errors and assessing output consistency to determine model reliability. At the same time, the widespread use of general-purpose tools underscores the need to enhance the interpretive capabilities of these accessible platforms. Until these tools are further refined, patients should be advised to consult their health care providers for accurate interpretation of their reports.

Copyright: © 2025 College of American Pathologists

Contributor Notes

Corresponding author: Eric Steimetz, MD, MPH, Department of Pathology and Laboratory Medicine, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY 10065 (email: steimee@mskcc.org).

The authors have no relevant financial interest in the products or companies described in this article.

Accepted: 22 Aug 2025