Our Take
Crosby is measuring what matters: negotiation as judgment under pressure, not isolated clause edits. The results say AI still can't adapt when leverage shifts, which is the actual job.
Why it matters
Fixed-fee legal models (Crosby's business) need workflow automation with quality control to survive. This benchmark tells you whether today's frontier models can handle that judgment layer. So far, they mostly can't.
Do this week
Legal ops: test your contract workflow against Crosby's benchmark criteria (decision-making, leverage assessment, counterparty modeling) before betting automation ROI on current LLM outputs.
Crosby Publishes Contract Negotiation Results
Crosby, a NewMod law firm backed by $85 million in funding (Sequoia, Index Ventures, Lux Capital, and others), launched Multi-turn Negotiation Bench (also called Redline), a benchmark co-published with micro1. The test measures how frontier models perform the workflow of senior commercial lawyers in live contract negotiation settings.
The benchmark frames redlining as a multi-turn sequence, not isolated edits. Each turn requires the model or attorney to decide what matters, what to leave alone, how hard to push, and how to adapt as the negotiation evolves. Crosby found that contract work requires understanding the deal context and each party's commercial leverage, making legally sound edits, anticipating the counterparty's response, and preserving momentum toward execution.
Results (per Crosby's published findings):
- ChatGPT 5.5: 50.5% overall
- Claude Fable 5: 47.3% overall (limited test window; retest planned)
- Gemini 3.5 Flash: 45.1% overall
- Claude Opus 4.8: 44.4% overall
The score spread was narrow: about six points separate the top score (50.5%) from the bottom (44.4%). More telling: human lawyers consistently outperformed all models on finding new routes to resolution. The models got stuck on their initial positions, which suggests the judgment layer remains absent from current systems.
Crosby also announced Crosby Intelligence, a research group of lawyers, applied AI engineers, and researchers. Its stated mission is to build agentic attorneys and release benchmarks for legal domains where judgment matters most. The group plans to cut time-to-signature from weeks to hours.
The Judgment Gap Is Real, and It Matters
Crosby's score distribution tells you something important: frontier models can handle isolated contract editing tasks at a basic level, but none reliably treats negotiation as a dynamic, adaptive process. The finding that human lawyers still dominate on "new routes to resolution" is a confirmation, not a surprise. It means the models cannot reframe a deal when the first approach stalls.
For Crosby's business model, this is critical. It operates on fixed fees, not hourly billing, so every hour saved through automation (with quality gates) goes straight to margin, and every hour of inefficiency erodes it. Traditional firms that bill by the hour don't face the same pressure and have historically moved slowly on agent-based workflows.
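The margin argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the fee, cost, and hour figures are hypothetical and do not come from Crosby, and the function names are invented for this example.

```python
def fixed_fee_margin(fee: float, hours: float, cost_per_hour: float) -> float:
    """Margin as a fraction of a fixed fee: hours spent are pure cost."""
    if fee <= 0:
        raise ValueError("fee must be positive")
    return (fee - hours * cost_per_hour) / fee


def hourly_margin(billing_rate: float, cost_per_hour: float) -> float:
    """Margin under hourly billing: revenue scales with hours, so hours cancel out."""
    if billing_rate <= 0:
        raise ValueError("billing_rate must be positive")
    return (billing_rate - cost_per_hour) / billing_rate


# Hypothetical numbers for illustration only.
FEE = 1000.0            # fixed fee per contract
COST_PER_HOUR = 200.0   # internal cost of one lawyer hour

manual = fixed_fee_margin(FEE, hours=4.0, cost_per_hour=COST_PER_HOUR)
automated = fixed_fee_margin(FEE, hours=1.5, cost_per_hour=COST_PER_HOUR)

print(f"fixed fee, 4.0 hours: {manual:.0%} margin")
print(f"fixed fee, 1.5 hours: {automated:.0%} margin")
print(f"hourly billing:       {hourly_margin(400.0, COST_PER_HOUR):.0%} margin at any hour count")
```

With these made-up inputs, cutting lawyer time from 4 hours to 1.5 lifts the fixed-fee margin from 20% to 70%, while the hourly-billing margin does not depend on hours at all. That asymmetry is why a fixed-fee firm has a much stronger incentive to automate.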
Crosby's investment in benchmarking and research also signals competitive territory-marking. Crosby Intelligence is jointly funding fellows with OpenAI, a signal that the legal vertical matters to LLM vendors. The legal services market is large enough for many players, but efficiency will sort winners from the rest.
What to Do Now
If you run legal ops or purchase legal tech, use Crosby's benchmark as a stress test. Does your contract automation handle multi-step negotiation, or just single-turn redline suggestions? Can the tool reason about commercial leverage and counterparty incentives, or does it stop at clause-level edits?
The benchmark itself is vendor-published with no independent reproducer, so treat the absolute scores as Crosby's own data, not independently verified results. The structure of the test (multi-turn, judgment-driven, context-aware) is what to apply to your own evaluation. If your current vendor touts "AI-powered redlining" but delivers clause-by-clause suggestions without deal reasoning, you have your answer.
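One way to borrow that structure for an in-house evaluation is a small harness that replays a scripted negotiation turn by turn and records a reviewer's scores. The sketch below is not Crosby's methodology or code. Every name in it (Scenario, Turn, run_scenario, the criteria labels, the stub tool and grader) is a hypothetical stand-in, and the grader is a placeholder for a reviewing attorney's judgment.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical rubric labels, taken from the criteria named in this article.
CRITERIA = ("decision_making", "leverage_assessment", "counterparty_modeling")


@dataclass
class Turn:
    counterparty_message: str  # what the other side sends in this round


@dataclass
class Scenario:
    deal_context: str          # commercial background the tool is given
    turns: List[Turn]


# A tool takes the deal context plus the full history and returns its response.
Tool = Callable[[str, List[str]], str]
# A grader scores one response from 0.0 to 1.0 on each criterion.
Grader = Callable[[int, str], Dict[str, float]]


def run_scenario(scenario: Scenario, tool: Tool, grader: Grader) -> Dict[str, float]:
    """Replay every turn in order and return the mean score per criterion."""
    if not scenario.turns:
        raise ValueError("scenario needs at least one turn")
    history: List[str] = []
    collected: Dict[str, List[float]] = {c: [] for c in CRITERIA}
    for index, turn in enumerate(scenario.turns):
        history.append(turn.counterparty_message)
        response = tool(scenario.deal_context, list(history))
        history.append(response)  # later turns see earlier responses
        scores = grader(index, response)
        for criterion in CRITERIA:
            collected[criterion].append(scores[criterion])
    return {c: mean(values) for c, values in collected.items()}


def stubborn_tool(deal_context: str, history: List[str]) -> str:
    """Stub tool that never moves off its opening position."""
    return "We keep our original liability cap."


def demo_grader(turn_index: int, response: str) -> Dict[str, float]:
    """Placeholder reviewer: credit on the opening turn, none for repeating it."""
    score = 1.0 if turn_index == 0 else 0.0
    return {c: score for c in CRITERIA}


scenario = Scenario(
    deal_context="Hypothetical SaaS agreement; buyer has two competing bids.",
    turns=[
        Turn("We need an uncapped liability clause."),
        Turn("We can accept a cap if you extend the warranty period."),
    ],
)
result = run_scenario(scenario, stubborn_tool, demo_grader)
print(result)
```

The point of the design is that the tool sees the whole history on every turn, so a product that only produces single-turn redline suggestions has nowhere to hide: it will either adapt to the second message or repeat itself. In practice the grader would be replaced by a reviewing attorney's scores entered per turn.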
Crosby Intelligence's stated roadmap includes a monthly speaker series with scholars and practitioners, plus additional benchmarks on open problems in legal AI. Follow that output if you're evaluating agent-based legal work.