“Thinking” AI Outperforms Human Doctors on Real-Life Data
- The o1-preview reasoning model significantly outperformed an older model, GPT-4, which itself outperformed human physicians on diagnostic tasks.
- The effect was strongest at the earliest stage of diagnostics, when patients are first admitted into clinical settings.
- The tested model is already no longer state-of-the-art; newer reasoning models should perform even better.
A new study has pitted an advanced large language model against human physicians in tasks involving complex reasoning, treatment recommendations, and messy real-world patient records [1].
Testing a “thinking” model
The dream of a ‘computer doctor’ has existed since at least 1959 [2], but until the recent rise of large language models, no computer program could come close to human physicians on complex clinical cases. LLMs ignited new hope and spawned numerous studies with encouraging results [3]. The next big step was the appearance of reasoning models, which maintain an internal chain of thought and can explain their decisions.
This has made the human-machine showdown much more interesting, and the first rigorous study directly pitting a reasoning LLM against human doctors has now been published in Science. Despite the study being fresh off the press, the head-spinning pace of progress in AI means that the model used – OpenAI’s first reasoning model, o1-preview – is already obsolete, and the newest models should perform even better.
Outperforming humans on hard cases
The researchers tested the model across six different physician-style tasks, comparing it against hundreds of physicians and against earlier models like GPT-4. First, they fed o1-preview the full text of 143 NEJM clinicopathological conferences (CPCs). A CPC is a commonly used teaching format in which a real, usually challenging, case is presented in detail to a discussant who works through it aloud, building a differential diagnosis and reasoning toward a final answer. The model was asked to produce a ranked list of possible diagnoses (a differential diagnosis), and two physicians independently scored the outputs.
o1-preview included the correct diagnosis somewhere in its differential in 78.3% of cases and named it as the top guess in 52% of cases. When “very close” answers were also counted as wins, accuracy reached 97.9%.
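To make the scoring concrete, here is a minimal sketch of how top-1 and in-differential accuracy can be computed from ranked lists. This is purely illustrative: the actual study relied on physician raters rather than string matching, and the cases below are invented.

```python
# Illustrative only: the study used physician raters, not string matching.
# Computes top-1 accuracy and "anywhere in the differential" accuracy
# from ranked lists of candidate diagnoses.

def score_differentials(cases):
    """cases: list of (ranked_differential, correct_diagnosis) tuples."""
    top1 = sum(1 for ranked, truth in cases if ranked and ranked[0] == truth)
    anywhere = sum(1 for ranked, truth in cases if truth in ranked)
    n = len(cases)
    return top1 / n, anywhere / n

# Toy example with three invented cases:
cases = [
    (["sarcoidosis", "lymphoma", "tuberculosis"], "sarcoidosis"),   # top-1 hit
    (["lupus", "vasculitis", "endocarditis"], "endocarditis"),      # in differential
    (["migraine", "tension headache"], "giant cell arteritis"),     # miss
]
top1_acc, anywhere_acc = score_differentials(cases)
print(f"top-1: {top1_acc:.1%}, in differential: {anywhere_acc:.1%}")
# -> top-1: 33.3%, in differential: 66.7%
```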
A critical concern with LLMs on published cases is memorization, as a model may have seen the case and its answer during training. The authors addressed this by comparing performance on cases that were published before and after o1-preview’s pretraining cutoff and found no significant difference, suggesting genuine reasoning rather than recall.
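As a back-of-the-envelope illustration of this kind of check (a sketch with made-up counts, not the authors’ actual analysis), one can compare hit rates on pre- and post-cutoff cases with a simple contingency-table test:

```python
# Hypothetical illustration of the memorization check, not the paper's analysis.
# Compare "correct diagnosis in the differential" rates for cases published
# before vs. after the model's pretraining cutoff.
from scipy.stats import fisher_exact

# Invented counts: [correct, incorrect] differentials in each group.
pre_cutoff = [90, 25]    # cases the model could have seen during training
post_cutoff = [22, 6]    # cases it could not have seen

odds_ratio, p_value = fisher_exact([pre_cutoff, post_cutoff])
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
# A large p-value means no detectable accuracy drop on unseen cases,
# which argues against simple memorization.
```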
GPT-4 performed meaningfully worse. More importantly, on a 101-case subset where responses from human physicians were previously documented, o1-preview outperformed humans in both top-1 and top-10 accuracy.
AI, what do you recommend?
Making a diagnosis is just the first step. Will the model be able to correctly recommend further actions? To answer this question, on 136 of the same CPCs, the authors asked o1-preview which diagnostic test it would order next. In 87.5% of cases, the model picked the correct test; in another 11%, it picked something the reviewers judged to be helpful; and in only 1.5% was the choice unhelpful.
Next, the team tested o1-preview on 20 cases from NEJM Healer, a virtual-patient educational tool, scoring responses across four domains of written clinical reasoning such as problem representation and differential justification. The model scored a perfect 10 on 78 of 80 responses, significantly outperforming GPT-4 (47/80), attending physicians (28/80), and residents (16/72). In one bright spot for human physicians, o1-preview was not meaningfully better at including “cannot-miss” diagnoses (the high-stakes possibilities that must be considered even when they are remote).
In another test of AI’s ability to make recommendations and not just diagnoses, the authors used five clinical vignettes from a prior study in which 25 expert physicians participated. o1-preview scored a median of 89% – dramatically better than GPT-4 alone (42%), physicians using GPT-4 (41%), and physicians using conventional resources (34%).
To more rigorously address memorization concerns, the authors used six diagnostic vignettes that were taken from a 1994 study and have never been publicly released. o1-preview scored a median of 97% compared to 92% for GPT-4, 76% for physicians + GPT-4, and 74% for physicians + conventional resources. However, none of the differences reached statistical significance due to the small case count.
Handling real clinical data with gusto
In the final experiment, the one that distinguishes this paper from earlier studies, the authors collected 76 real, randomly selected ER cases from Beth Israel Deaconess Medical Center, with all identifiers and unstructured notes intact – messy, real-world clinical data. They then constructed three “diagnostic touchpoints” representing the information available at successive stages of an ER visit: initial triage (minimal data), ER physician evaluation (history, exam, and initial labs available), and admission to the floor or ICU (the most complete data).
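To make the setup concrete, here is a hypothetical sketch of how the three touchpoints can be represented as cumulative views of a single record; all field names are invented for illustration.

```python
# Hypothetical sketch: the three "diagnostic touchpoints" as cumulative views
# of a single ER record. All field names are invented for illustration.
from dataclasses import dataclass

@dataclass
class ERRecord:
    triage_note: str = ""
    history_and_exam: str = ""
    initial_labs: str = ""
    floor_or_icu_notes: str = ""

def touchpoint_text(record: ERRecord, stage: str) -> str:
    """Return the raw, unprocessed text available at a given stage."""
    stages = {
        "triage": [record.triage_note],
        "er_evaluation": [record.triage_note, record.history_and_exam,
                          record.initial_labs],
        "admission": [record.triage_note, record.history_and_exam,
                      record.initial_labs, record.floor_or_icu_notes],
    }
    # Each later touchpoint strictly adds information; nothing is cleaned
    # or summarized before being handed to the model or the physicians.
    return "\n".join(part for part in stages[stage] if part)
```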
At each touchpoint, o1, GPT-4o, and two attending physicians independently produced differential diagnoses. Two separate attending physicians, blinded to source, scored every differential. Interestingly, the blinding worked extremely well: the raters correctly guessed whether a differential came from an AI or a human only 3-15% of the time and chose “Can’t tell” 84-94% of the time, indicating that o1’s outputs were stylistically indistinguishable from human ones.
o1 handily beat both attendings and GPT-4o. The advantage was largest at initial triage, where the least data is available and the stakes are highest. By admission, when the data is rich, the gap narrowed and was no longer statistically significant, suggesting that o1 extracts more diagnostic signal from sparse information than physicians do.
“We didn’t pre-process the data at all,” said Adam Rodman, MD, MPH, a hospitalist and clinical researcher at BIDMC. “The model is literally just processing data as it exists in the health record.” “I thought it was going to be a fun experiment but that it wouldn’t work that well. That was not at all what happened.”
“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” added co-senior author Arjun (Raj) Manrai, assistant professor of biomedical informatics at Harvard Medical School. “However, this does not mean AI will necessarily improve care – how and where it should be deployed remain understudied, and we desperately need rigorous prospective trials to evaluate the impact of AI on clinical practice.”
Literature
[1] Brodeur, P. G., et al. (2026). Performance of a large language model on the reasoning tasks of a physician. Science, 392, 524-527.
[2] Ledley, R. S., & Lusted, L. B. (1959). Reasoning foundations of medical diagnosis: Symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science, 130(3366), 9-21.
[3] Goh, E., Gallo, R., Hom, J., Strong, E., Weng, Y., Kerman, H., … & Chen, J. H. (2024). Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Network Open, 7(10), e2440969.