News · May 20, 2026 · 4 min read

Harvey Launches LAB: 1,200 Legal Tasks to Measure Real Work by AI Agents

Harvey released an open-source benchmark with more than 1,200 legal tasks across 24 practice areas, graded against more than 75,000 expert-written rubric criteria. Law firms can now test where AI agents actually work, and where they don't.

Our Take

LAB is the first credible public yardstick for legal agent capability, but it was built by Harvey, the market leader, so its definition of "good legal work" matters more than the open-source label suggests.

Why it matters

Law firms have spent two years in vendor demos without a shared way to determine where agents can actually be deployed. A benchmark structured around real deliverables (deal memos, risk assessments) instead of multiple-choice reasoning could finally make that conversation concrete. But only if the leaderboard methodology doesn't get buried.

Do this week

Legal tech buyers: Request LAB scores from your AI vendors on the 3–4 practice areas you're evaluating before the leaderboard launches, so you can compare apples to apples instead of relying on vendor case studies.

Harvey Open-Sources Benchmark for Long-Horizon Legal Agent Work

Harvey, valued at $11 billion, released the Legal Agent Benchmark (LAB) on May 6 as an open-source evaluation framework. The initial release contains more than 1,200 tasks spanning 24 legal practice areas, graded in total against more than 75,000 expert-written rubric criteria. Code and a portion of the dataset are available on GitHub.

LAB differs from existing legal AI benchmarks (LegalBench, CUAD, LEXam, and Harvey's own BigLaw Bench), which measure short-horizon reasoning like contract reading, case comparison, or argument analysis. LAB structures tasks around actual law firm work units: a partner-to-associate instruction (averaging 50 words), a closed document environment the agent must navigate, reviewable legal work product as the output, and atomic pass/fail verification against expert rubrics.

A fictional M&A example illustrates the approach. An agent reviews eight material contracts plus adjacent documents (a 10-K and a deferred compensation plan) in a virtual data room, identifies change-of-control provisions, assesses deal risk, recommends next steps, and produces a draft board memo. The rubric for that single task contains 57 criteria covering nine legal issues. LAB uses "all-pass" grading: a task passes only if every criterion passes, with no partial credit. As Harvey notes, a deal memo catching 8 of 10 material risks is not 80% useful; a single missed issue could blow up the transaction or surface post-closing.
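To make the difference between all-pass grading and partial credit concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not LAB's actual code or schema: the `CriterionResult` shape, the function names, and the sample criterion text are all hypothetical, and treating an empty rubric as a failure is a choice made for this sketch. Only the 57-criterion count and the every-criterion-must-pass rule come from the description above.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """One atomic pass/fail rubric check (hypothetical shape, not LAB's schema)."""
    description: str
    passed: bool


def all_pass(results: list[CriterionResult]) -> bool:
    """A task passes only if every criterion passes; an empty rubric fails."""
    return bool(results) and all(r.passed for r in results)


def partial_credit(results: list[CriterionResult]) -> float:
    """Fraction of criteria passed, shown only for contrast with all_pass."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Illustrative memo: 57 criteria, of which 56 pass and 1 is missed.
memo = [CriterionResult(f"criterion {i}", True) for i in range(56)]
memo.append(CriterionResult("flags change-of-control consent", False))

print(all_pass(memo))                  # False
print(round(partial_credit(memo), 3))  # 0.982
```

Under partial credit this memo would score about 98%, yet under all-pass grading it fails outright, which is the behavior the article describes.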

Harvey intentionally launched LAB without a leaderboard. The company plans to work with research partners to produce baseline results and publish normalization standards before any rankings appear, citing the expectation that the dataset will evolve over time.

Law Firms Need a Shared Standard for AI Agent ROI

Benchmarks in other domains, such as SWE-Bench Verified for coding agents, GDPval for economically valuable knowledge work, and FinanceAgent for financial analysis, have marked capability inflection points. Harvey argues that a legal-specific benchmark could serve the same function: revealing where agents are ready to deploy under a "review pattern" and where they need heavy human-in-the-loop work.

For managing partners and innovation leads, that clarity addresses a two-year-old vacuum. Every firm has fielded vendor demos and pilots. Few have a way to ask "where, specifically, can we put these to work?" without relying on vendor case studies.

LAB could serve multiple constituencies. Law firms evaluating competing products could request vendor performance on specific practice areas instead of comparing demos. Vendors gain a public yardstick for capability claims. Researchers gain longer-horizon, domain-specific tasks for evaluation and fine-tuning. Journalists and analysts gain a way to test vendor claims independently.

There is, however, a structural caveat. LAB is built by a market participant—Harvey is a dominant, well-funded vendor. The choices embedded in LAB's task definitions and rubric criteria reflect Harvey's judgment about what good legal work looks like, made in consultation with research partners (including Anthropic, OpenAI, Nvidia, Google DeepMind, Mistral, LangChain, and Stanford LIFTLab). That does not invalidate the benchmark, but it shapes what gets measured and how.

The question of "open source" in this context also warrants skepticism. Legal AI projects from well-funded vendors like Harvey typically remain maintained almost exclusively by in-house staff, rarely graduating to community codebases with outside contributors. How much room LAB leaves for external voices to shape what gets measured will determine whether it becomes the shared standard Harvey intends or another vendor showcase.

How to Use LAB Now and When the Leaderboard Ships

Before any public leaderboard exists, LAB serves as a reference taxonomy. Legal teams evaluating AI agents can use the 1,200 tasks as a framework for vendor conversations. Ask which practice areas a vendor has tested against LAB tasks. Request performance data on the areas that matter to your firm.

When the leaderboard launches, transparency matters more than the benchmark's existence. Watch how Harvey normalizes submissions from different vendors (different inference speeds, training data, model sizes). Ask whether the rubric criteria can be audited independently. If normalization stays opaque, the leaderboard becomes marketing—not measurement.

For firms not yet in vendor pilots, LAB is a signal that the market is shifting from "Can agents do legal work?" to "Where should we deploy them first?" That shift often precedes buying decisions by six to twelve months.
