News · May 6, 2026 · 2 min read

Harvey opens legal agent test bench with 1,200 tasks

Open-source benchmark tests agents across 24 practice areas using 75,000 expert-written criteria, backed by major AI labs.

Our Take

Testing infrastructure matters more than the scores: this creates the first standardized way to compare legal agents beyond vendor demos.

Why it matters

Legal tech buyers need objective agent performance data as autonomous tools move from pilots to production deployments.

Do this week

Legal ops teams: test your current agent tools on LAB before your next procurement cycle so you can benchmark vendor claims against standardized metrics.

Harvey launches open-source agent benchmark

Harvey released Legal Agent Benchmark (LAB), an open-source testing platform for autonomous legal AI systems. The benchmark includes 1,200 agent tasks across 24 legal practice areas, evaluated using 75,000 expert-written rubric criteria (company-reported).

Major AI companies back the project: Nvidia, OpenAI, Anthropic, Mistral, and DeepMind. Additional collaborators include LangChain, Fireworks AI, Stanford Liftlab, and Snorkel.

The benchmark tests agents across three core functions: planning, interacting, and adapting. Tasks mirror real legal work, such as M&A deals where agents must locate key provisions in synthetic data, assess their importance, and generate reports. "You can bring your agents to solve tasks," Niko Grupen, Head of Applied Research at Harvey, told Artificial Lawyer.
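Harvey has not published LAB's task schema in this piece. As a rough sketch of what a task record with expert-written rubric criteria could look like, the Python below uses invented names (LegalTask, RubricCriterion) modeled on the M&A example above; it is illustrative, not Harvey's actual data model.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a LAB-style task record; class and field
# names are invented for illustration, not Harvey's actual schema.
@dataclass
class RubricCriterion:
    description: str   # an expert-written check, e.g. "locates clause X"
    weight: float = 1.0

@dataclass
class LegalTask:
    task_id: str
    practice_area: str            # one of the 24 practice areas
    phase: str                    # "planning", "interacting", or "adapting"
    prompt: str                   # instructions handed to the agent
    materials: list[str]          # synthetic documents the agent works over
    rubric: list[RubricCriterion] = field(default_factory=list)

# Example task modeled on the M&A scenario described above.
task = LegalTask(
    task_id="mna-001",
    practice_area="M&A",
    phase="planning",
    prompt="Locate the key provisions in the data room, assess their "
           "importance, and generate a summary report.",
    materials=["synthetic_spa.txt", "synthetic_disclosure_schedule.txt"],
    rubric=[
        RubricCriterion("Locates the indemnification cap provision"),
        RubricCriterion("Flags the missing change-of-control consent", weight=2.0),
    ],
)
```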

Harvey plans to launch a public leaderboard within weeks, showing which AI systems perform best on specific legal tasks. The platform accepts any agent for testing, including Harvey's own tools and custom-built systems.
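The article doesn't spell out the harness details, but in comparable benchmarks "accepts any agent" usually means implementing a thin adapter interface. The Protocol below is a guess at that shape, with hypothetical names throughout; it is not LAB's published API.

```python
from typing import Protocol

# Illustrative adapter interface; LAB's actual harness API may differ.
class LegalAgent(Protocol):
    def run(self, prompt: str, materials: list[str]) -> str:
        """Execute one benchmark task and return the agent's output."""
        ...

class MyCustomAgent:
    """Trivial stand-in showing how a custom-built system could plug in."""
    def run(self, prompt: str, materials: list[str]) -> str:
        # A real agent would plan, interact with the documents, and adapt.
        return f"Report covering {len(materials)} documents."

# Any object matching the Protocol could be submitted for testing.
agent: LegalAgent = MyCustomAgent()
print(agent.run("Summarize key provisions.", ["synthetic_spa.txt"]))
```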

Standardized testing fills evaluation gap

Legal AI evaluation has relied on vendor-supplied benchmarks and anecdotal case studies. LAB provides the first standardized framework for comparing agent performance across real legal work, similar to how coding benchmarks measure programming agents.

The benchmark includes deliberate errors in test materials to assess whether agents spot issues, mimicking the quality control demands of actual legal practice. This addresses a critical concern: agents that complete tasks incorrectly create liability risks that traditional AI tools do not.
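The article doesn't say how error-spotting is scored. One plausible mechanism, sketched below with invented names and a deliberately naive substring match, is a rubric check that asks whether the agent's report surfaces the planted defect; LAB's expert-written rubrics are presumably far richer than this.

```python
# Hypothetical scoring sketch: each test document carries a planted
# defect, and a check looks for evidence the agent caught it.
# Names and logic are illustrative, not LAB's implementation.

PLANTED_DEFECTS = {
    "synthetic_spa.txt": ["governing law clause cites a repealed statute"],
}

def spotted_defects(report: str, materials: list[str]) -> float:
    """Fraction of planted defects the agent's report mentions."""
    defects = [d for m in materials for d in PLANTED_DEFECTS.get(m, [])]
    if not defects:
        return 1.0
    hits = sum(1 for d in defects if d.lower() in report.lower())
    return hits / len(defects)

report = "Issue flagged: the governing law clause cites a repealed statute."
print(spotted_defects(report, ["synthetic_spa.txt"]))  # 1.0
```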

Independent testing infrastructure helps legal buyers move beyond pilot programs. As Grupen noted, coding agents "had a big impact on engineering" as their capabilities improved through systematic measurement.

Open access enables direct comparison

Legal teams can test agents before procurement decisions rather than relying on vendor demonstrations. The open-source structure allows customization of evaluation criteria to match specific practice needs.
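As one illustration of that customization, an open-source rubric could be extended with firm-specific criteria. The snippet below is a hypothetical, self-contained sketch reusing the invented RubricCriterion shape from earlier, not the repo's actual extension API.

```python
from dataclasses import dataclass

# Hypothetical illustration of tailoring rubric criteria to a firm's
# practice needs; the open-source repo's real extension points may differ.
@dataclass
class RubricCriterion:
    description: str
    weight: float = 1.0

base_rubric = [
    RubricCriterion("Locates the indemnification cap provision"),
]

# A firm focused on cross-border deals might weight jurisdictional
# analysis more heavily than the stock rubric does.
firm_rubric = base_rubric + [
    RubricCriterion("Analyzes conflict-of-laws exposure", weight=2.0),
]

total_weight = sum(c.weight for c in firm_rubric)
print(f"{len(firm_rubric)} criteria, total weight {total_weight}")
```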

The benchmark supports both commercial and research use cases. Law firms can evaluate custom agent configurations while AI companies can measure progress against standardized legal tasks.

Harvey encourages contributions to expand the benchmark: "We want model providers, startups, researchers, legal AI companies, and law firms to run the benchmark, audit the rubrics, improve the harness, contribute new task families."

Tags: Legal AI · Agents · Developer Tools · Enterprise AI