News · May 6, 2026 · 2 min read

Harvey opens legal agent test bench with 1,200 tasks

Open-source benchmark tests agents across 24 practice areas using 75,000 expert-written criteria, backed by major AI labs.

Our Take

Testing infrastructure matters more than the scores: this creates the first standardized way to compare legal agents beyond vendor demos.

Why it matters

Legal tech buyers need objective agent performance data as autonomous tools move from pilots to production deployments.

Do this week

Legal ops teams: test your current agent tools on LAB before your next procurement cycle so you can benchmark vendor claims against standardized metrics.

Harvey launches open-source agent benchmark

Harvey released Legal Agent Benchmark (LAB), an open-source testing platform for autonomous legal AI systems. The benchmark includes 1,200 agent tasks across 24 legal practice areas, evaluated using 75,000 expert-written rubric criteria (company-reported).

Major AI companies back the project: Nvidia, OpenAI, Anthropic, Mistral, and DeepMind. Additional collaborators include LangChain, Fireworks AI, Stanford Liftlab, and Snorkel.

The benchmark tests agents across three core functions: planning, interacting, and adapting. Tasks mirror real legal work, such as M&A deals where agents must locate key provisions in synthetic data, assess their importance, and generate reports. "You can bring your agents to solve tasks," Niko Grupen, Head of Applied Research at Harvey, told Artificial Lawyer.
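Harvey has not published LAB's task schema in this piece. As a rough sketch of what a task record with expert-written rubric criteria could look like, the Python below uses invented names (LegalTask, RubricCriterion) modeled on the M&A example above; it is illustrative, not Harvey's actual data model.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a LAB-style task record; class and field
# names are invented for illustration, not Harvey's actual schema.
@dataclass
class RubricCriterion:
    description: str   # an expert-written check, e.g. "locates clause X"
    weight: float = 1.0

@dataclass
class LegalTask:
    task_id: str
    practice_area: str            # one of the 24 practice areas
    phase: str                    # "planning", "interacting", or "adapting"
    prompt: str                   # instructions handed to the agent
    materials: list[str]          # synthetic documents the agent works over
    rubric: list[RubricCriterion] = field(default_factory=list)

# Example task modeled on the M&A scenario described above.
task = LegalTask(
    task_id="mna-001",
    practice_area="M&A",
    phase="planning",
    prompt="Locate the key provisions in the data room, assess their "
           "importance, and generate a summary report.",
    materials=["synthetic_spa.txt", "synthetic_disclosure_schedule.txt"],
    rubric=[
        RubricCriterion("Locates the indemnification cap provision"),
        RubricCriterion("Flags the missing change-of-control consent", weight=2.0),
    ],
)
```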

Harvey plans to launch a public leaderboard within weeks, showing which AI systems perform best on specific legal tasks. The platform accepts any agent for testing, including Harvey's own tools and custom-built systems.
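The article doesn't spell out the harness details, but in comparable benchmarks "accepts any agent" usually means implementing a thin adapter interface. The Protocol below is a guess at that shape, with hypothetical names throughout; it is not LAB's published API.

```python
from typing import Protocol

# Illustrative adapter interface; LAB's actual harness API may differ.
class LegalAgent(Protocol):
    def run(self, prompt: str, materials: list[str]) -> str:
        """Execute one benchmark task and return the agent's output."""
        ...

class MyCustomAgent:
    """Trivial stand-in showing how a custom-built system could plug in."""
    def run(self, prompt: str, materials: list[str]) -> str:
        # A real agent would plan, interact with the documents, and adapt.
        return f"Report covering {len(materials)} documents."

# Any object matching the Protocol could be submitted for testing.
agent: LegalAgent = MyCustomAgent()
print(agent.run("Summarize key provisions.", ["synthetic_spa.txt"]))
```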

Standardized testing fills evaluation gap

Legal AI evaluation has relied on vendor-supplied benchmarks and anecdotal case studies. LAB provides the first standardized framework for comparing agent performance across real legal work, similar to how coding benchmarks measure programming agents.

The benchmark includes deliberate errors in test materials to assess whether agents spot issues, mimicking the quality control demands of actual legal practice. This addresses a critical concern: agents that complete tasks incorrectly create liability risks that traditional AI tools do not.
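The article doesn't say how error-spotting is scored. One plausible mechanism, sketched below with invented names and a deliberately naive substring match, is a rubric check that asks whether the agent's report surfaces the planted defect; LAB's expert-written rubrics are presumably far richer than this.

```python
# Hypothetical scoring sketch: each test document carries a planted
# defect, and a check looks for evidence the agent caught it.
# Names and logic are illustrative, not LAB's implementation.

PLANTED_DEFECTS = {
    "synthetic_spa.txt": ["governing law clause cites a repealed statute"],
}

def spotted_defects(report: str, materials: list[str]) -> float:
    """Fraction of planted defects the agent's report mentions."""
    defects = [d for m in materials for d in PLANTED_DEFECTS.get(m, [])]
    if not defects:
        return 1.0
    hits = sum(1 for d in defects if d.lower() in report.lower())
    return hits / len(defects)

report = "Issue flagged: the governing law clause cites a repealed statute."
print(spotted_defects(report, ["synthetic_spa.txt"]))  # 1.0
```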

Independent testing infrastructure helps legal buyers move beyond pilot programs. As Grupen noted, coding agents "had a big impact on engineering" as their capabilities improved through systematic measurement.

Open access enables direct comparison

Legal teams can test agents before procurement decisions rather than relying on vendor demonstrations. The open-source structure allows customization of evaluation criteria to match specific practice needs.
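As one illustration of that customization, an open-source rubric could be extended with firm-specific criteria. The snippet below is a hypothetical, self-contained sketch reusing the invented RubricCriterion shape from earlier, not the repo's actual extension API.

```python
from dataclasses import dataclass

# Hypothetical illustration of tailoring rubric criteria to a firm's
# practice needs; the open-source repo's real extension points may differ.
@dataclass
class RubricCriterion:
    description: str
    weight: float = 1.0

base_rubric = [
    RubricCriterion("Locates the indemnification cap provision"),
]

# A firm focused on cross-border deals might weight jurisdictional
# analysis more heavily than the stock rubric does.
firm_rubric = base_rubric + [
    RubricCriterion("Analyzes conflict-of-laws exposure", weight=2.0),
]

total_weight = sum(c.weight for c in firm_rubric)
print(f"{len(firm_rubric)} criteria, total weight {total_weight}")
```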

The benchmark supports both commercial and research use cases. Law firms can evaluate custom agent configurations while AI companies can measure progress against standardized legal tasks.

Harvey encourages contributions to expand the benchmark: "We want model providers, startups, researchers, legal AI companies, and law firms to run the benchmark, audit the rubrics, improve the harness, contribute new task families."

Tags: Legal AI · Agents · Developer Tools · Enterprise AI