A judged panel of models beats every solo frontier system on deep research

breakthrough · Developer · Platform Strategy

Monday, June 15, 2026

Conviction

High

Time horizon

This quarter

Risk

Optimizing for benchmarks that reward ensembles over the single model you standardized on

The News

OpenRouter launched Fusion on Friday, a feature that runs a panel of models in parallel on one server-side API call and uses a judge model to synthesize a single answer, with web search and fetch enabled. On the DRACO deep-research benchmark, a Fable 5 + GPT-5.5 panel judged by Opus 4.8 scored 69.0%, ahead of every solo frontier model (solo Fable 5 scored 65.3%). OpenRouter notes 7 of the 100 DRACO tasks went unscored because Fable 5's content filters blocked them from running. Source: OpenRouter; see the DRACO benchmark chart (mean normalized score across 100 deep-research tasks, via OpenRouter).
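The panel-and-judge pattern described above can be sketched in a few lines. This is a minimal illustration of the general idea, not OpenRouter's implementation or API: `run_panel`, `model_a`, `model_b`, and `concat_judge` are hypothetical names, the "models" are local stand-in functions, and a real judge would be another model call rather than string concatenation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical type: any function that takes a prompt and returns text.
Model = Callable[[str], str]
Judge = Callable[[str, Dict[str, str]], str]


def run_panel(prompt: str, panel: Dict[str, Model], judge: Judge) -> str:
    """Query every panel model in parallel, then let the judge synthesize one answer."""
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        futures = {name: pool.submit(model, prompt) for name, model in panel.items()}
        drafts = {name: fut.result() for name, fut in futures.items()}
    return judge(prompt, drafts)


# Stand-in models for illustration only; a real setup would call model APIs here.
def model_a(prompt: str) -> str:
    return f"A says: {prompt.upper()}"


def model_b(prompt: str) -> str:
    return f"B says: {prompt.lower()}"


def concat_judge(prompt: str, drafts: Dict[str, str]) -> str:
    # Assumption: a real judge would be a model that reads the drafts and writes
    # a synthesis. This placeholder just joins the drafts in name order.
    return " | ".join(drafts[name] for name in sorted(drafts))


answer = run_panel("Compare X and Y", {"a": model_a, "b": model_b}, concat_judge)
print(answer)
```

The point of the sketch is the shape: the panel fans out, the judge fans in, and the caller sees a single answer. Every panel member and the judge each consume tokens, which is where the cost multiple comes from.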

The Read

The headline isn't that Fusion wins. It's that a budget panel of Gemini 3 Flash, Kimi K2.6 and DeepSeek V4 Pro scored 64.7%, just 0.6 points behind solo Fable 5 (65.3%), at a fraction of the per-task cost. For deep-research workloads the performance frontier is moving from "pick the best model" to "orchestrate a panel and judge the outputs," which puts the defensible layer in the routing-and-synthesis logic rather than in access to any one frontier API. It also lands an awkward footnote beside today's lead: the same Fable 5 content filters Washington isn't satisfied with already blocked 7 of this benchmark's 100 tasks. The model's own guardrails and the government's are pulling in the same direction from opposite ends.

Do This Week

Spend two hours running one real deep-research task through a two-model Fusion panel against your current single-model default. Log where the synthesized answer genuinely beat the solo run and where the judge just averaged the two drafts. That delta tells you whether orchestration earns its token multiple for your workload.
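One way to keep that log honest is to record quality and token spend side by side. The sketch below assumes you grade both answers yourself on the same 0-to-1 rubric. The `Trial` record, the `summarize` helper, and every number in the example are made up for illustration and are not drawn from the DRACO results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trial:
    task: str
    solo_score: float   # your rubric grade for the single-model answer (0-1)
    panel_score: float  # same rubric applied to the synthesized answer
    solo_tokens: int    # tokens billed for the solo run
    panel_tokens: int   # tokens billed across all panel models plus the judge

    @property
    def delta(self) -> float:
        """Quality gained (or lost) by using the panel."""
        return self.panel_score - self.solo_score

    @property
    def token_multiple(self) -> float:
        """How many times more tokens the panel run consumed."""
        return self.panel_tokens / self.solo_tokens


def summarize(trials: List[Trial]) -> Dict[str, float]:
    n = len(trials)
    return {
        "mean_delta": sum(t.delta for t in trials) / n,
        "mean_token_multiple": sum(t.token_multiple for t in trials) / n,
        "panel_wins": sum(1 for t in trials if t.delta > 0),
    }


# Illustrative, made-up entries showing the shape of the log.
trials = [
    Trial("vendor comparison", 0.60, 0.75, 10_000, 32_000),
    Trial("market sizing", 0.70, 0.70, 8_000, 24_000),
]
summary = summarize(trials)
print(summary)
```

A delta near zero alongside a token multiple around three suggests the judge is mostly averaging the drafts. A consistent positive delta is the case for paying the multiple.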

For Developer