Open Agent Leaderboard measures full systems, not just models—costs matter

Six benchmarks, one protocol, visible cost gaps

Hugging Face and IBM Research launched the Open Agent Leaderboard, a benchmark suite that evaluates full agent systems (model, planner, memory, tool selection, error recovery) on six established benchmarks: SWE-Bench Verified (code fixes), BrowseComp+ (web research), AppWorld (personal task automation), tau2-Bench Airline and tau2-Bench Retail (customer service policy adherence), and tau2-Bench Telecom (technical support). Each benchmark already existed independently; the leaderboard runs them under a single protocol so that different agents can be tested without benchmark-specific tuning.

The results show that the same model, paired with different agent architectures, produces sharply different success rates and costs. The top three configurations all use the same frontier model but differ in both quality and cost, and the most efficient system in the top five runs at a fraction of the price of the strongest. Because the leaderboard publishes both success rate and cost per task, a practitioner can plot configurations on those two axes and read the tradeoff directly.
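One way to read a quality-versus-cost plot is to keep only the configurations that no other configuration beats on both axes (the Pareto frontier). The sketch below is illustrative only: the `Config` type, the agent names, and every number in it are hypothetical placeholders, not leaderboard data.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    success_rate: float   # fraction of tasks solved, between 0 and 1
    cost_per_task: float  # average cost per task, in USD


def pareto_frontier(configs):
    """Return the configs that no other config beats on both cost and success."""
    frontier = []
    best_success = -1.0
    # Walk from cheapest to most expensive; on equal cost, higher success first.
    ordered = sorted(configs, key=lambda c: (c.cost_per_task, -c.success_rate))
    for c in ordered:
        # A config earns its place only if it solves more than every cheaper one.
        if c.success_rate > best_success:
            frontier.append(c)
            best_success = c.success_rate
    return frontier


# Hypothetical numbers, chosen only to illustrate the idea.
configs = [
    Config("agent-a", 0.70, 4.00),
    Config("agent-b", 0.66, 1.20),
    Config("agent-c", 0.60, 2.50),
    Config("agent-d", 0.45, 0.40),
]

for c in pareto_frontier(configs):
    print(f"{c.name}: {c.success_rate:.0%} success at ${c.cost_per_task:.2f}/task")
```

In this made-up data, agent-c drops out because agent-b is both cheaper and more successful; the remaining three are the real choices, and picking among them is a budget decision.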

The framework is open from day one. Hugging Face released the Exgentic evaluation platform (for running and reproducing tests), the paper describing the methodology and results, and the leaderboard itself. The team invites the community to submit new agents, benchmarks, and models via pull request.

Agent design is starting to move the needle alongside model choice

The key finding is structural: model choice still explains most of the variance in agent performance, but agent architecture already produces measurable gains. Tool shortlisting (narrowing the agent's view to the relevant tools instead of having it scan the entire set) improved performance across every model tested and turned otherwise-failing configurations into working ones.

A second insight applies directly to operational cost. Failed agent runs cost 20–54% more than successful ones (per the paper). For anyone running agents at scale in production, failure behavior therefore shapes the bill as much as success rate does. That effect is invisible in traditional benchmarks that report only accuracy.
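To see how the failure premium feeds into the bill, consider a rough calculation of cost per solved task. This is a simplified sketch under stated assumptions: each task is attempted once with no retries, every failed run costs a fixed premium over a successful one, and the success rate and dollar figures are hypothetical. Only the 20% and 54% premiums come from the range quoted above.

```python
def cost_per_solved_task(success_rate, success_cost, failure_premium):
    """Average spend needed to get one solved task.

    success_rate:    fraction of attempts that succeed (0 < rate <= 1)
    success_cost:    cost of one successful run, in USD
    failure_premium: extra cost of a failed run, e.g. 0.20 for 20% more
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    failure_cost = success_cost * (1 + failure_premium)
    # Expected cost of a single attempt, mixing successes and failures.
    cost_per_attempt = (
        success_rate * success_cost + (1 - success_rate) * failure_cost
    )
    # Spread the total spend over the attempts that actually succeeded.
    return cost_per_attempt / success_rate


# Hypothetical agent: 50% success rate, $1.00 per successful run.
low = cost_per_solved_task(0.5, 1.00, 0.20)
high = cost_per_solved_task(0.5, 1.00, 0.54)
print(f"20% premium: ${low:.2f} per solved task")   # $2.20
print(f"54% premium: ${high:.2f} per solved task")  # $2.54
```

With the same accuracy, the agent whose failures are more expensive costs noticeably more per solved task, which is why a leaderboard that reports accuracy alone cannot separate the two.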

A third result challenges the assumption that general-purpose agents must sacrifice performance. In several cases, agents with no benchmark-specific tuning matched or exceeded systems built directly for those tasks. Generality is no longer a proxy for weakness.

Open-weight models (DeepSeek V3.2 and Kimi K2.5) trail closed-source frontier models by 18–29 percentage points on average (company-reported), a meaningful but not insurmountable gap.

Unpack the agent, not just the model, when selecting a system

If you are evaluating an agent for production, do not rely on the published numbers alone: rerun a candidate configuration yourself before committing. Use Exgentic to run it against one or two benchmarks that resemble your own tasks, and record both cost and success rate. Do not assume that a cheaper model with a better agent design will lose to an expensive model with a vanilla design.

When you compare vendors or open-source agents, ask for cost per task in addition to success rate. Failure mode matters: does the agent fail fast and cheap, or does it burn through long, expensive runs before failing? The leaderboard makes both visible for the first time.

Consider tool shortlisting as a baseline optimization for any agent you deploy. The paper shows it improved performance across every model tested, and it is cheaper than moving to a larger model. Implement it before tuning prompts or context management.
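As a rough illustration of what tool shortlisting can look like, the sketch below ranks tools by word overlap between the task and each tool's name and description, then passes only the top few to the agent. This is a deliberately naive stand-in, not the method used in the paper or in Exgentic; the `Tool` type, the stopword list, and the example tools are all hypothetical. A production system would more likely use embeddings or a model-based selector.

```python
import re
from dataclasses import dataclass

# Tiny, hypothetical stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "and", "the", "for", "by", "my", "of", "to"}


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def _tokens(text):
    """Lowercase alphanumeric words in text, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def shortlist_tools(task, tools, k=3):
    """Return up to k tools whose name or description overlaps the task most."""
    task_tokens = _tokens(task)
    scored = []
    for tool in tools:
        overlap = len(task_tokens & _tokens(f"{tool.name} {tool.description}"))
        scored.append((overlap, tool))
    # Highest overlap first; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop tools with no overlap at all rather than padding the shortlist.
    return [tool for overlap, tool in scored[:k] if overlap > 0]


# Hypothetical tool set for a customer-service agent.
tools = [
    Tool("search_flights", "Search available flights by origin, destination and date"),
    Tool("cancel_booking", "Cancel an existing flight booking and issue a refund"),
    Tool("update_address", "Update the customer's mailing address"),
    Tool("get_weather", "Get the weather forecast for a city"),
]

shortlist = shortlist_tools("Cancel my flight booking and refund the ticket", tools, k=2)
print([tool.name for tool in shortlist])
```

The agent then sees only the shortlisted tools in its prompt, which shrinks the context it has to reason over on every step.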
