Chatbots Getting Better at Making Final Diagnoses, But Clinical Reasoning Still Weak

General purpose large language model chatbots are getting better at identifying final diagnoses but remain weak at clinical reasoning, and are mostly unable to rule out other potential conditions and causes of symptoms, health researchers have found.
A Mass General Brigham study asked 21 different general purpose large language models “to play doctor” in a series of clinical scenarios.
The off-the-shelf models tested included the latest available at the time – GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash and Pro, and Grok 4. Three teams of medical students scored the chatbots at sequential stages of the standard clinical workflow, measuring performance across five domains: differential diagnosis, diagnostic testing, final diagnosis, management and miscellaneous clinical reasoning questions. The study ran from January to December 2025.
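The study's scoring materials aren't reproduced here, but the aggregation is easy to picture. A minimal sketch, assuming three reviewer marks per domain on a 0-to-1 scale – the five domain names come from the study, while the scores, scale and averaging scheme are hypothetical:

    from statistics import mean

    # Five evaluation domains named in the study; the scores below are
    # hypothetical marks (0 = poor, 1 = perfect) from three reviewers
    # for one model on one case.
    reviewer_scores = {
        "differential_diagnosis":  [0.2, 0.4, 0.3],
        "diagnostic_testing":      [0.5, 0.6, 0.4],
        "final_diagnosis":         [1.0, 0.9, 1.0],
        "management":              [0.8, 0.7, 0.9],
        "misc_clinical_reasoning": [0.6, 0.5, 0.7],
    }

    # One plausible summary: average the reviewers within each domain.
    domain_means = {domain: round(mean(marks), 2)
                    for domain, marks in reviewer_scores.items()}
    print(domain_means)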
Researchers fed each model clinical vignettes drawn from 29 published cases. To simulate the way clinical cases evolve, they provided the models with information gradually, beginning with basics such as patient age, gender and symptoms before adding physical examination findings and laboratory results.
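The study's prompting harness isn't published in the article, but the staged-disclosure design it describes is straightforward to sketch. A minimal illustration in Python, assuming an OpenAI-compatible chat API – the model name, system prompt and case details are placeholders, not the study's actual materials:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical vignette broken into stages, mirroring the study's design:
    # demographics and symptoms first, then exam findings, then lab results.
    stages = [
        "Patient: 54-year-old woman with two months of fatigue and weight loss.",
        "Physical exam: low-grade fever and palpable cervical lymphadenopathy.",
        "Labs: WBC 3.2, hemoglobin 9.8, markedly elevated LDH.",
    ]

    messages = [{
        "role": "system",
        "content": "You are acting as a physician. After each new piece of "
                   "information, state your current differential diagnosis.",
    }]

    for stage in stages:
        messages.append({"role": "user", "content": stage})
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        answer = reply.choices[0].message.content
        # Keep the model's answer in context so later stages build on it,
        # the way information accumulates over a real clinical encounter.
        messages.append({"role": "assistant", "content": answer})
        print(f"--- After: {stage}\n{answer}\n")

Querying the model after each stage, rather than only once the vignette is complete, is what lets reviewers separate early, open-ended differential diagnosis performance from final diagnosis performance.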
The most consistent weaknesses were in differential diagnosis and diagnostic testing. Differential diagnosis is one of the most important functions a doctor performs: the process of ruling out possible diagnoses when symptoms match more than one condition. The models failed to come up with an appropriate differential diagnosis 80% of the time. Their performance on final diagnosis – determining a condition based on test results meant to rule out alternatives – and on management was generally better, with a correct final diagnosis made 90% of the time.
The idea was to put models “in the position of a doctor,” said Arya Rao, the study’s lead author and an MD-PhD student at Harvard Medical School.
“These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information.” Clinicians don’t close off possibilities until the iterative process of differential diagnosis provides an answer, “whereas LLMs collapse prematurely onto single answers, a limitation that persists across model generations,” researchers found.
“The risk is not just that LLMs are sometimes wrong but that their reasoning is brittle precisely where uncertainty and nuance matter most,” the study concludes.
The findings caution against vendor claims that general purpose, off-the-shelf AI models are ready for patient-facing clinical use. “While strong performance on final diagnosis tasks may create that impression, persistent failures in generating differential diagnoses and navigating uncertainty show that LLMs cannot yet be trusted in frontline decision-making.”
The researchers advise limiting the presence of AI models in clinical settings to supervised use in tasks that involve little uncertainty. A separate report by patient safety research organization ECRI earlier this year identified AI chatbots as the number-one health technology hazard for 2026 (see: Chatbots, Outages, Devices Top 2026 Health Tech Hazards).
