When I think of heroic doctors, I think of the hospital physician who is presented with a patient suffering from strange or vague symptoms and lands on the correct diagnosis just in time. It is the premise of almost every medical procedural on television, from House, M.D. to The Pitt. It is the mystique that has made doctors among the most respected professionals in society.
But what if a machine could make that call just as well, or even better? And what should we do about that here in the real world?
That question is increasingly urgent. According to a major new study published in Science, advanced AI programs often outperform human doctors in diagnosing people seeking emergency medical care.
AI has already, for better or worse, become part of modern medicine. Various programs are being used to do everything from taking medical notes to identifying promising new candidates for drug development. The authors of the Science study reported that their findings are strong evidence that AI could also be valuable in the emergency room, provided it is fully vetted in clinical trials for specific uses.
Lest the hype overtake the science, the authors made a point of saying they feared their research would be cited to justify replacing human doctors with software programs: “I’m a little uncomfortable with how some of these results could be used,” said co-author Dr. Adam Rodman, a general internist and medical educator at Beth Israel Deaconess Medical Center. They warned against such a simplistic view of their findings.
“No one should look at this and say we don’t need doctors,” Rodman said in a call with reporters.
At the same time, the researchers argued that AI had reached the point where it could be a genuine asset for doctors in certain situations, especially in emergency rooms, where doctors frequently deal with imperfect information. They called for clinical trials that would adequately evaluate the safety and effectiveness of using AI for those tasks, serving as a virtual second pair of eyes that could act as a check on human doctors, or help them when they encounter a case outside their experience or knowledge.
They said AI clearly can be a positive force in healthcare, as long as we recognize its limitations and use it alongside our human doctors, rather than replacing them.
“We are witnessing a really profound shift in technology that will reshape medicine,” said Arjun Manrai, who studies machine learning and statistical modeling for medical decision making at Harvard Medical School.
AI outperforms human doctors in making emergency diagnoses
The researchers evaluated OpenAI’s o1 reasoning model, a more specialized AI program than, say, ChatGPT, which works more deliberately and with an emphasis on internal logic. They ran the program through several experiments, evaluating its accuracy both on simulated and historical cases that have been used in medical training to test doctors’ critical thinking, and on real-world emergency cases at Beth Israel Deaconess Medical Center. The study then compared the o1 model’s performance with that of human doctors, ChatGPT, and human doctors using ChatGPT.
Evaluation of the training cases allowed the researchers to compare the performance of o1 with a very large sample of existing data from human doctors who performed the same tests. And in those different scenarios, the AI consistently outperformed those doctors and offered the correct diagnosis or a useful patient management plan in the vast majority of cases studied.
But its accuracy in evaluating raw electronic medical record data from real-world emergency cases was especially impressive. This is closer to the messy reality in which emergency doctors often must operate: they are dealing with a person who urgently needs rapid treatment, and they have incomplete and unfiltered information, if they have much information at all. When reviewing those cases, the o1 model identified the exact or a very close diagnosis 67 percent of the time at the patient’s initial presentation at triage (versus 50 and 55 percent, respectively, for the two expert doctors against whom the AI was measured) and 81 percent of the time once the patient was ready to be admitted to the hospital (versus 70 and 79 percent for the human doctors).
“We can definitely say…reasoning models can meet those criteria for performing diagnostic reasoning at the highest levels of human performance,” Rodman told reporters.
Two experts I consulted who were not affiliated with the study (Dr. Sanjay Basu of UC San Francisco and Nigam Shah of Stanford) praised its rigor but also noted its limitations. The pre-existing training cases were selected specifically to test clinician accuracy, so they may exaggerate how well the model would perform in the real world. And in one of the case study experiments, which included a set of “can’t miss” diagnoses in which the patient is at risk of serious harm or death, the AI model did not perform better than ChatGPT or human doctors.
Even the ER findings, which come closest to evaluating the o1 model’s performance under real-world conditions, were retrospective reviews of existing cases; the model was not actually asked to diagnose or manage patients in real time.
That is why even the Science study’s authors argued that the next step should not be to immediately put the OpenAI model in charge of emergency triage in hospitals across the country. Instead, they called for clinical trials that could evaluate the model’s performance (both its accuracy and its safety) under real-world conditions.
“There is a lot at stake in medicine… and we have ways to mitigate these risks. They are called clinical trials,” Rodman told reporters. “What these results support is a robust and ambitious research agenda.”
AI could be valuable for doctors, but patients need to be careful
Enthusiasm for AI, especially in medicine, is high right now. As I listened to the authors discuss their findings, what struck me was their own awareness that their research could be used as justification for cutting the human medical workforce and the risks that could end up creating for patients.
“There are a lot of so-called AI doctor companies that are trying to leave doctors out of the loop or have minimal clinical oversight,” Rodman said. “As one of the lead authors of the study, I don’t think these results support that.”
The authors emphasized that, based on their results, they would envision AI models in the emergency room being supervised by a real doctor. Making a diagnosis is only one part of treating a patient; care also includes developing a treatment plan and monitoring progress, as well as the human element. “Humans want to be guided in life and death decisions,” Manrai said.
Basu and Shah said they supported narrowly defined uses for AI in the ER, based on the collective research so far. It could offer second opinions when a patient is handed off to another doctor, or weigh in on specific high-risk situations (such as a patient presenting with a sepsis infection or stroke symptoms) where time is of the essence. It could also reduce paperwork for doctors, an application featured in the most recent season of The Pitt. Shah pointed to prior authorization, documentation, and scheduling as obvious areas where AI could help.
At the same time, AI models should not be deployed at all to diagnose and manage treatment autonomously, Basu said.
People should also be careful when using AI to make medical decisions. Other AI diagnostic studies have found worrying results, especially for consumer-facing models like ChatGPT. A paper published in Nature Medicine earlier this year evaluated ChatGPT’s performance when presented with scenarios ranging from non-urgent to emergent, and found that the model underestimated the severity of the patient’s condition in 52 percent of cases; patients who were on the verge of diabetic shock or respiratory failure were referred for follow-up 24 or 48 hours later. The model also repeatedly failed to identify clear signs of suicidal ideation.
As Shah told me, the Science paper represents a “ceiling” for the use of AI for diagnosis, while the Nature Medicine paper represents a floor. The two studies show how precise we need to be when considering using AI to make clinical decisions: while the more sophisticated o1 model performed well in the Science study’s curated cases, the consumer-facing ChatGPT, developed by the same company, OpenAI, underperformed in the other paper.
“Both things can be true,” Basu told me. “They both are.”
In the call with reporters, Manrai described both “green” (low-risk) scenarios in which an AI could be useful even to a layperson and “red” (high-risk) cases in which a medical professional should always be involved. A green use would be, for example, asking a model about a diet that could help control hypertension, or stretches that could relieve a recent back injury. Think of it more as lifestyle advice than strict clinical guidance.
A red use, on the other hand, would involve serious medical situations with life-or-death consequences: chest pain, to give one of many possible examples, is a reason to go directly to a doctor or hospital, not to consult ChatGPT.
We are getting closer to unlocking the incredible potential of these powerful programs to improve healthcare and make what was once science fiction a reality. But even these cutting-edge researchers agree that we must proceed with caution and keep the real experts, the doctors, in the loop.

