A Harvard Medical School and Beth Israel Deaconess Medical Center team tested OpenAI models against doctors in medical diagnosis, using real emergency room cases and other clinical tasks.
The strongest result came at the point where doctors usually have the least information. At initial ER triage, OpenAI o1 gave an exact or very close diagnosis in 67% of cases, while the two attending physicians reached 55% and 50%.
Researchers did not frame the result as a green light for AI to run emergency rooms. Instead, the Science study pointed to an “urgent need for prospective trials to evaluate these technologies in real-world patient care settings.”
That warning matters because the test stayed inside text-based records. The team noted that “existing studies suggest that current foundation models are more limited in reasoning over nontext inputs.” In plain terms, scans, images, physical exams, and bedside judgment still pose harder problems for AI diagnostic tools.
The study used 76 patients from the Beth Israel emergency room. OpenAI's o1 and 4o received the same electronic medical record details that were available at each diagnostic touchpoint. Harvard Medical School said researchers did not “pre-process the data at all,” so the models did not get cleaned-up summaries or extra help.
Two other attending physicians then graded the answers without knowing which diagnosis came from a human doctor and which came from AI.
The study said:
“At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians and 4o.”
It added that the gap looked clearest early in care, where pressure runs high and information stays thin:
“[Differences] were especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision.”
Arjun Manrai, who heads an AI lab at Harvard Medical School and helped lead the study, said:
“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines.”
Still, accountability remains a hard problem. Adam Rodman, a Beth Israel doctor and one of the study's lead authors, told the Guardian that there is “no formal framework right now for accountability” around AI diagnoses. He also said patients still “want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions.”