Analysis · May 19, 2026 · 3 min read

Open Agent Leaderboard measures full systems, not just models—costs matter

Hugging Face and IBM Research released the Open Agent Leaderboard, which runs six real-world benchmarks across five models and five agents. Results show that the cost of running the same model can vary by as much as 5x depending on agent design, and that in several cases generalist systems now match specialized ones.

Our Take

Agent architecture is no longer invisible. The leaderboard shows that tool shortlisting and system design shift results measurably, although model choice still explains most of the variance.

Why it matters

Anyone deploying agents in production needs to know that failed runs cost 20–54% more than successful ones, and that a cheaper configuration may work just as well as an expensive one. This is the first open benchmark that shows both.

Do this week

Practitioners: pull the Exgentic framework and reproduce one benchmark result on your agent before evaluating it on production tasks, so you can see where your actual cost-quality tradeoffs sit.

Six benchmarks, one protocol, visible cost gaps

Hugging Face and IBM Research launched the Open Agent Leaderboard, a benchmark suite that evaluates full agent systems (model, planner, memory, tool selection, error recovery) across six established benchmarks: SWE-Bench Verified (code fixes), BrowseComp+ (web research), AppWorld (personal task automation), tau2-Bench Airline and tau2-Bench Retail (customer service policy adherence), and tau2-Bench Telecom (technical support). Each benchmark already existed independently; the leaderboard unifies them under a single protocol so that different agents can be tested without benchmark-specific tuning.

Results show the same model paired with different agent architectures produced success rates and costs that diverged sharply. The top three configurations all use the same frontier model but vary in both quality and cost. The most efficient system in the top five runs at a fraction of the price of the strongest. The leaderboard publishes both success rate and cost per task, so a practitioner can plot configurations by quality and cost and see the tradeoff surface directly.
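One way to read a quality-versus-cost plot is to keep only the configurations that no other configuration beats on both axes (the Pareto frontier). The sketch below is illustrative: the `Config` record, the agent names, and every number in it are made up for the example and are not leaderboard results.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    success_rate: float   # fraction of tasks solved, between 0 and 1
    cost_per_task: float  # average cost per task, in USD


def pareto_frontier(configs):
    """Return the configs that are not dominated by any other config.

    A config is dominated when another one is at least as good on both
    success rate and cost, and strictly better on at least one of them.
    """
    frontier = []
    for c in configs:
        dominated = any(
            o.success_rate >= c.success_rate
            and o.cost_per_task <= c.cost_per_task
            and (o.success_rate > c.success_rate
                 or o.cost_per_task < c.cost_per_task)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    # Cheapest first, so the list reads left to right along the cost axis.
    return sorted(frontier, key=lambda c: c.cost_per_task)


# Hypothetical numbers for illustration only.
configs = [
    Config("agent-a", 0.70, 4.00),
    Config("agent-b", 0.68, 1.00),
    Config("agent-c", 0.60, 1.50),  # worse and pricier than agent-b
    Config("agent-d", 0.50, 0.40),
]
frontier = pareto_frontier(configs)
for c in frontier:
    print(f"{c.name}: {c.success_rate:.0%} at ${c.cost_per_task:.2f}/task")
```

In this toy data, agent-c drops out because agent-b is both cheaper and more successful. The remaining three are genuine tradeoffs, and choosing among them depends on how much an extra point of success rate is worth.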

The framework has been open from day one. Hugging Face released the Exgentic evaluation platform (for running and reproducing tests), the paper describing methodology and results, and the leaderboard itself. The team is inviting the community to submit new agents, benchmarks, and models via pull request.

Agent design is starting to move the needle alongside model choice

The key finding is structural: model choice still explains most of the variance in agent performance. But agent architecture is already making visible gains. Tool shortlisting (helping the agent focus on relevant tools instead of scanning the entire set) improved performance across every model tested and converted otherwise-failing configurations into working ones.

A second insight applies directly to operational cost. Failed agent runs cost 20–54% more than successful ones (per the paper). For anyone running agents at scale in production, failure behavior shapes the bill as much as success rate does. This is invisible in traditional benchmarks that report only accuracy.
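To see how that overhead feeds into a budget, the following back-of-the-envelope sketch computes the cost per solved task when failed runs are more expensive than successful ones. Only the 20–54% range comes from the paper as reported above; the function name, the 50% success rate, and the $1.00 base cost are assumptions chosen for illustration, and the model assumes failed tasks are not retried.

```python
def cost_per_solved_task(success_rate, success_cost, failure_overhead):
    """Average spend needed to obtain one solved task.

    success_rate:     fraction of runs that succeed (must be in (0, 1])
    success_cost:     average cost of a successful run
    failure_overhead: extra cost of a failed run relative to a successful
                      one, e.g. 0.20 means failures cost 20% more
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    failure_cost = success_cost * (1 + failure_overhead)
    # Expected cost of one attempted task, successful or not.
    mean_cost = success_rate * success_cost + (1 - success_rate) * failure_cost
    # Spread the total spend over the tasks that were actually solved.
    return mean_cost / success_rate


# Hypothetical agent: 50% success rate, $1.00 per successful run.
baseline = cost_per_solved_task(0.5, 1.00, 0.00)  # failures cost the same
low = cost_per_solved_task(0.5, 1.00, 0.20)       # low end of reported range
high = cost_per_solved_task(0.5, 1.00, 0.54)      # high end of reported range
print(f"{baseline:.2f} {low:.2f} {high:.2f}")
```

Under these assumptions the cost per solved task rises from about $2.00 to between roughly $2.20 and $2.54, and the effect grows as the success rate falls.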

A third result challenges the assumption that general-purpose agents must sacrifice performance. In several cases, agents with no benchmark-specific tuning matched or exceeded systems built directly for those tasks. Generality is no longer a proxy for weakness. Open-weight models (DeepSeek V3.2 and Kimi K2.5) trail closed-source frontier models by 18–29 percentage points on average (company-reported), a meaningful but not insurmountable gap.

Unpack the agent, not just the model, when selecting a system

If you are evaluating an agent for production, reproduce a leaderboard result with that agent before committing to it. Use Exgentic to run a configuration against one or two benchmarks; you will see both cost and success rate. Do not assume that a cheaper model with better agent design will lose to an expensive model with a vanilla design.

When you compare vendors or open-source agents, ask for cost per task in addition to success rate. Failure mode matters: does the agent fail fast and cheap, or does it burn through long, expensive runs before failing? The leaderboard makes both visible for the first time.
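If a vendor can share per-run records, the failure profile can be checked directly. This is a minimal sketch that assumes each run is available as a `(succeeded, cost_usd)` pair; that record format and the sample values are hypothetical, not something Exgentic or the leaderboard is known to emit.

```python
from statistics import mean


def summarize_runs(runs):
    """Summarize a non-empty list of (succeeded: bool, cost_usd: float) runs."""
    ok = [cost for succeeded, cost in runs if succeeded]
    failed = [cost for succeeded, cost in runs if not succeeded]
    summary = {
        "success_rate": len(ok) / len(runs),
        "mean_cost": mean(cost for _, cost in runs),
        "mean_cost_success": mean(ok) if ok else None,
        "mean_cost_failure": mean(failed) if failed else None,
    }
    if ok and failed:
        # Positive values mean failures are more expensive than successes.
        summary["failure_overhead"] = mean(failed) / mean(ok) - 1
    return summary


# Hypothetical run log for illustration only.
runs = [(True, 1.0), (True, 1.0), (False, 1.5), (False, 1.5)]
summary = summarize_runs(runs)
print(summary)
```

A large positive `failure_overhead` suggests the agent burns through long runs before giving up, while a value near or below zero suggests it fails fast.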

Consider tool shortlisting as a baseline optimization for any agent you deploy. The paper shows it improves performance across every model tested and is cheaper than scaling model size. Implement it before tuning prompts or context management.
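As a rough illustration of the idea, the sketch below ranks tools by word overlap with the task and passes only the top few to the agent. This is not the method used in the paper, which this article does not describe in detail; the scoring rule, the stopword list, and the tool names are all assumptions, and a production system would more likely use embeddings or a learned ranker.

```python
import re

# Small, hand-picked stopword list; an assumption for this example.
STOPWORDS = {"a", "an", "and", "the", "to", "for", "on", "of", "my"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def shortlist_tools(task, tools, k=3):
    """Return up to k tool names whose name or description overlaps the task.

    tools maps a tool name to a one-line description.
    """
    task_words = tokenize(task)
    scored = []
    for name, description in tools.items():
        overlap = len(task_words & tokenize(name + " " + description))
        scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:k] if overlap > 0]


# Hypothetical tool catalog for illustration only.
tools = {
    "search_flights": "find available flights between two airports on a date",
    "cancel_booking": "cancel an existing flight booking and issue a refund",
    "send_email": "send an email message to a contact",
    "get_weather": "look up the weather forecast for a city",
}
shortlist = shortlist_tools("Cancel my flight booking and refund the ticket",
                            tools, k=2)
print(shortlist)
```

Here only `cancel_booking` survives, so the agent's prompt would carry one tool description instead of four. Exact word matching is brittle (it misses "flights" versus "flight", for example), which is one reason to treat this as a starting point only.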
