Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

Andrew M. Bean; Rebecca Elizabeth Payne; Guy Parsons; Hannah Rose Kirk; Juan Ciro; Rafael Mosquera-Gomez; M. Sara Hincapie; Aruna S. Ekanayaka; Lionel Tarassenko; Luc Rocher; Adam Mahdi

doi:10.1038/s41591-025-04074-y

Journal article

Open access

Peer reviewed

Nature medicine, Vol.32(2), pp.609-615

01/02/2026

DOI: https://doi.org/10.1038/s41591-025-04074-y

PMID: 41663592

Subjects: Biochemistry & Molecular Biology; Cell Biology; Life Sciences & Biomedicine; Medicine, Research & Experimental; Research & Experimental Medicine; Science & Technology

Abstract

Global healthcare providers are exploring the use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested whether LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities before public deployments in healthcare.

Files and links (1)

url

https://doi.org/10.1038/s41591-025-04074-y

Published (Version of record) Open

Details

Title: Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
Creators - without role: Andrew M. Bean - University of Oxford
Rebecca Elizabeth Payne - Betsi Cadwaladr University Health Board
Guy Parsons - University of Oxford
Hannah Rose Kirk - University of Oxford
Juan Ciro - Contextual AI, Mountain View, CA USA
Rafael Mosquera-Gomez - Palo Alto Institute
M. Sara Hincapie - MLCommons, San Francisco, CA USA
Aruna S. Ekanayaka - Birmingham Women’s and Children’s NHS Foundation Trust
Lionel Tarassenko - University of Oxford
Luc Rocher - University of Oxford
Adam Mahdi - Science Oxford
Publication Details: Nature medicine, Vol.32(2), pp.609-615
Publisher: NATURE PORTFOLIO
Number of pages: 25
Grant note: MR/Y015711/1 / UKRI Future Leaders Fellowship; UK Research & Innovation (UKRI) RG\R2\232035 / Royal Society Research; Royal Society UK Oxford Internet Institute's Research Programme - Dieter Schwarz Stiftung gGmbH National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC); National Institutes of Health Research (NIHR)
Identifiers: 9922464109548
Academic Unit: Public policy
Language: English
Resource Type: Journal article
Date published: 01/02/2026
