Our Take
Strong benchmark performance, but comparing AI to internal medicine docs doing emergency medicine work stacks the deck.
Why it matters
Medical AI benchmarks typically use curated datasets, but this study used raw electronic health records from actual ER visits. Healthcare organizations need real-world performance data before deploying diagnostic AI.
Do this week
Healthcare AI teams: benchmark your models against specialists in their actual practice areas, not generalists doing specialty work.
OpenAI o1 outperformed doctors in 76-case ER study
Harvard Medical School researchers tested OpenAI's o1 and GPT-4o models against two internal medicine attending physicians on real emergency room cases from Beth Israel Deaconess Medical Center (per a study reported in Science). The AI models received the same information that was available in the electronic medical record at each diagnostic decision point.
At initial triage, o1 achieved exact or very close diagnoses in 67% of cases versus 55% and 50% for the two human physicians (Harvard study data). Two independent attending physicians evaluated all diagnoses without knowing which came from AI versus humans. The performance gap was largest at triage, where limited patient information creates maximum diagnostic pressure.
The study covered 76 patients across multiple diagnostic touchpoints. Researchers emphasized they provided no data preprocessing, giving AI models the same raw EMR information physicians used in real time.
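For teams running a similar head-to-head evaluation, the blinding step is the part most worth getting right: graders should see diagnoses under opaque IDs, with source labels re-attached only at scoring time. A minimal Python sketch of that mechanic (the data shapes, names, and 0-2 grading scale are illustrative assumptions, not the study's actual protocol):

```python
import random
from dataclasses import dataclass

@dataclass
class Diagnosis:
    case_id: str
    source: str   # e.g. "o1", "gpt-4o", "physician_a", "physician_b"
    text: str

def blind(diagnoses: list[Diagnosis], seed: int = 0) -> dict[str, Diagnosis]:
    """Shuffle entries and assign opaque IDs so graders can't infer
    the source (AI vs. human) from ordering or labels."""
    rng = random.Random(seed)
    shuffled = diagnoses[:]
    rng.shuffle(shuffled)
    return {f"entry-{i:04d}": d for i, d in enumerate(shuffled)}

def score(blinded: dict[str, Diagnosis], grade_fn) -> dict[str, tuple[str, int]]:
    """grade_fn sees only (case_id, diagnosis text) and returns a grade,
    e.g. 2 = exact, 1 = very close, 0 = wrong. Sources come back
    only here, for the final tally."""
    return {oid: (d.source, grade_fn(d.case_id, d.text))
            for oid, d in blinded.items()}
```

Re-identifying sources only after grading is what makes "exact or very close" percentages comparable across AI and human entries.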
First benchmark using unprocessed hospital records
Most medical AI evaluations use cleaned datasets or standardized test cases. This study fed models messy, real-world electronic health records as they existed during actual patient visits. The difference matters because production medical AI must handle incomplete data, unclear symptoms, and time pressure.
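Reproducing the no-preprocessing condition is mostly a matter of discipline: pass the EMR snapshot through verbatim instead of extracting structured fields. A hedged sketch using the OpenAI Python client (the prompt wording and default model ID are assumptions for illustration; the study's actual prompts aren't reproduced here):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def diagnose(raw_emr_snapshot: str, model: str = "gpt-4o") -> str:
    """Send the unprocessed EMR text straight through -- no cleaning,
    de-duplication, or field extraction -- so the model faces the same
    messy record a physician would see at that decision point."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": ("You are assisting with diagnosis in an emergency "
                         "department. State your single most likely diagnosis.")},
            {"role": "user", "content": raw_emr_snapshot},
        ],
    )
    return response.choices[0].message.content
```

The temptation in production pipelines is to normalize the record first; doing so quietly tests a different, easier task than the one this study measured.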
However, the comparison has methodological issues. The study used internal medicine doctors, not emergency medicine specialists, to generate the human baseline diagnoses. As emergency physician Kristen Panthagani noted, internal medicine doctors don't typically practice emergency medicine, and an ER doctor's primary goal isn't a precise diagnosis in the first place: ruling out immediately life-threatening conditions takes priority over diagnostic accuracy.
The researchers acknowledged that accountability frameworks for clinical AI deployment don't yet exist, and that patients still prefer human guidance for life-or-death decisions.
Test AI against proper specialist baselines
Healthcare AI teams should benchmark models against physicians who actually practice the target specialty, not an adjacent field. This study's design likely inflates the AI's relative performance by grading internal medicine doctors on emergency medicine cases.
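One way to build that check into an evaluation harness is to filter out mismatched human baselines before computing any accuracy numbers. A small sketch (the data structures are hypothetical, not from the study):

```python
from collections import defaultdict

def specialty_matched_scores(results, case_specialty, clinician_specialty):
    """results: iterable of (case_id, source, correct: bool) tuples.
    Drop human-baseline entries where the clinician's practice specialty
    doesn't match the case's specialty -- e.g. exclude internal medicine
    attendings graded on emergency medicine cases."""
    kept = defaultdict(list)
    for case_id, source, correct in results:
        if source in clinician_specialty:  # human baseline entry
            if clinician_specialty[source] != case_specialty[case_id]:
                continue
        kept[source].append(correct)
    return {s: sum(c) / len(c) for s, c in kept.items() if c}
```

The point isn't the code itself: a specialty mismatch should surface as an empty human baseline, not as a quietly deflated one.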
The research points toward prospective trials in real clinical settings, but the models evaluated here processed only text-based information. Medical diagnosis often requires interpreting images, physical examination findings, and other non-text inputs, where current foundation models remain more limited (per the study authors).
Organizations planning diagnostic AI deployment need frameworks for clinical accountability and clear protocols for human oversight, especially in emergency settings where rapid triage decisions affect patient outcomes.