Our Take
Strong benchmark performance, but comparing AI to internal medicine docs doing emergency medicine work stacks the deck.
Why it matters
Medical AI benchmarks typically use curated datasets, but this study used raw electronic health records from actual ER visits. Healthcare organizations need real-world performance data before deploying diagnostic AI.
Do this week
Healthcare AI teams: benchmark your models against specialists in their actual practice areas, not generalists doing specialty work.
OpenAI o1 outperformed doctors in 76-case ER study
Harvard Medical School researchers tested OpenAI's o1 and GPT-4o models against two internal medicine attending physicians on real emergency room cases from Beth Israel Deaconess Medical Center (per a study reported in Science). The AI models received the same information that was available in the electronic medical record at each diagnostic decision point.
At initial triage, o1 achieved exact or very close diagnoses in 67% of cases versus 55% and 50% for the two human physicians (Harvard study data). Two independent attending physicians evaluated all diagnoses without knowing which came from AI versus humans. The performance gap was largest at triage, where limited patient information creates maximum diagnostic pressure.
The study covered 76 patients across multiple diagnostic touchpoints. Researchers emphasized they provided no data preprocessing, giving AI models the same raw EMR information physicians used in real time.
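For teams running a similar head-to-head evaluation, the blinding step is the part most worth getting right: graders should see diagnoses under opaque IDs, with source labels re-attached only at scoring time. A minimal Python sketch of that mechanic (the data shapes, names, and 0-2 grading scale are illustrative assumptions, not the study's actual protocol):

```python
import random
from dataclasses import dataclass

@dataclass
class Diagnosis:
    case_id: str
    source: str   # e.g. "o1", "gpt-4o", "physician_a", "physician_b"
    text: str

def blind(diagnoses: list[Diagnosis], seed: int = 0) -> dict[str, Diagnosis]:
    """Shuffle entries and assign opaque IDs so graders can't infer
    the source (AI vs. human) from ordering or labels."""
    rng = random.Random(seed)
    shuffled = diagnoses[:]
    rng.shuffle(shuffled)
    return {f"entry-{i:04d}": d for i, d in enumerate(shuffled)}

def score(blinded: dict[str, Diagnosis], grade_fn) -> dict[str, tuple[str, int]]:
    """grade_fn sees only (case_id, diagnosis text) and returns a grade,
    e.g. 2 = exact, 1 = very close, 0 = wrong. Sources come back
    only here, for the final tally."""
    return {oid: (d.source, grade_fn(d.case_id, d.text))
            for oid, d in blinded.items()}
```

Re-identifying sources only after grading is what makes "exact or very close" percentages comparable across AI and human entries.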
First benchmark using unprocessed hospital records
Most medical AI evaluations use cleaned datasets or standardized test cases. This study fed models messy, real-world electronic health records as they existed during actual patient visits. The difference matters because production medical AI must handle incomplete data, unclear symptoms, and time pressure.
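Reproducing the no-preprocessing condition is mostly a matter of discipline: pass the EMR snapshot through verbatim instead of extracting structured fields. A hedged sketch using the OpenAI Python client (the prompt wording and default model ID are assumptions for illustration; the study's actual prompts aren't reproduced here):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def diagnose(raw_emr_snapshot: str, model: str = "gpt-4o") -> str:
    """Send the unprocessed EMR text straight through -- no cleaning,
    de-duplication, or field extraction -- so the model faces the same
    messy record a physician would see at that decision point."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": ("You are assisting with diagnosis in an emergency "
                         "department. State your single most likely diagnosis.")},
            {"role": "user", "content": raw_emr_snapshot},
        ],
    )
    return response.choices[0].message.content
```

The temptation in production pipelines is to normalize the record first; doing so quietly tests a different, easier task than the one this study measured.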
However, the comparison has methodological issues. The study used internal medicine doctors, not emergency medicine specialists, to generate the human baseline diagnoses. As emergency physician Kristen Panthagani noted, internal medicine doctors don't typically practice emergency medicine, and an ER doctor's primary goal isn't a precise diagnosis in the first place: ruling out immediately life-threatening conditions takes priority over diagnostic accuracy.
The researchers acknowledged that accountability frameworks for clinical AI deployment don't yet exist, and that patients still prefer human guidance for life-or-death decisions.
Test AI against proper specialist baselines
Healthcare AI teams should benchmark models against physicians who actually practice the target specialty, not an adjacent field. This study's design likely inflates the AI's relative performance by grading internal medicine doctors on emergency medicine cases.
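One way to build that check into an evaluation harness is to filter out mismatched human baselines before computing any accuracy numbers. A small sketch (the data structures are hypothetical, not from the study):

```python
from collections import defaultdict

def specialty_matched_scores(results, case_specialty, clinician_specialty):
    """results: iterable of (case_id, source, correct: bool) tuples.
    Drop human-baseline entries where the clinician's practice specialty
    doesn't match the case's specialty -- e.g. exclude internal medicine
    attendings graded on emergency medicine cases."""
    kept = defaultdict(list)
    for case_id, source, correct in results:
        if source in clinician_specialty:  # human baseline entry
            if clinician_specialty[source] != case_specialty[case_id]:
                continue
        kept[source].append(correct)
    return {s: sum(c) / len(c) for s, c in kept.items() if c}
```

The point isn't the code itself: a specialty mismatch should surface as an empty human baseline, not as a quietly deflated one.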
The research points toward prospective trials in real clinical settings, but the models evaluated here processed only text-based information. Medical diagnosis often requires interpreting images, physical examination findings, and other non-text inputs, where current foundation models remain more limited (per the study authors).
Organizations planning diagnostic AI deployment need frameworks for clinical accountability and clear protocols for human oversight, especially in emergency settings where rapid triage decisions affect patient outcomes.