Crosby Benchmarks AI Contract Negotiation; ChatGPT 5.5 Scores 50.5%

Crosby Publishes Contract Negotiation Results

Crosby, a NewMod law firm backed by $85 million in funding (Sequoia, Index Ventures, Lux Capital, and others), launched Multi-turn Negotiation Bench (also called Redline), a benchmark co-published with micro1. The test measures how well frontier models perform the work of senior commercial lawyers in live contract negotiations.

The benchmark frames redlining as a multi-turn sequence, not a set of isolated edits. At each turn, the model or attorney must decide what matters, what to leave alone, how hard to push, and how to adapt as the negotiation evolves. Crosby found that contract work requires understanding the deal context and each party's commercial leverage, making legally sound edits, anticipating the counterparty's response, and preserving momentum toward execution.

Results (per Crosby's published findings):

ChatGPT 5.5: 50.5% overall
Claude Fable 5: 47.3% overall (limited test window; retest planned)
Gemini 3.5 Flash: 45.1% overall
Claude Opus 4.8: 44.4% overall

The score spread was narrow: 6.1 percentage points separated the top and bottom results. More telling, human lawyers consistently outperformed all models at finding new routes to resolution. The models tended to stay stuck on their initial positions, which suggests current systems still lack the judgment layer.

Crosby also announced Crosby Intelligence, a research group of lawyers, applied AI engineers, and researchers. Its stated mission is to build agentic attorneys and release benchmarks for the legal domains where judgment matters most. The group aims to cut time-to-signature from weeks to hours.

The Judgment Gap Is Real, and It Matters

Crosby's score distribution suggests that frontier models can handle isolated contract editing tasks at a basic level, but none reliably treats negotiation as a dynamic, adaptive process. The finding that human lawyers still dominate on "new routes to resolution" is a confirmation, not a surprise. It means the models cannot reframe a deal when the first approach stalls.

For Crosby's business model, this is critical. The firm charges fixed fees rather than billing hourly, so more automation (with quality gates) improves margins, while inefficiency erodes them. Traditional law firms that bill by the hour don't face this pressure and have historically moved slowly on agent-based workflows.
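The fixed-fee logic above can be made concrete with a small arithmetic sketch. Every figure below (the fee, the hours, the hourly delivery cost) is hypothetical and chosen only for illustration; none of it comes from Crosby.

```python
def fixed_fee_margin(fee: float, hours: float, cost_per_hour: float) -> float:
    """Margin as a fraction of the fee: (fee - delivery cost) / fee."""
    if fee <= 0:
        raise ValueError("fee must be positive")
    return (fee - hours * cost_per_hour) / fee


# Hypothetical numbers: a 1,000 fixed fee and a 150-per-hour delivery cost.
baseline = fixed_fee_margin(fee=1000, hours=4, cost_per_hour=150)     # 0.4
automated = fixed_fee_margin(fee=1000, hours=2, cost_per_hour=150)    # 0.7
inefficient = fixed_fee_margin(fee=1000, hours=8, cost_per_hour=150)  # -0.2

print(baseline, automated, inefficient)
```

Because the fee does not move, every hour saved goes straight to margin and every extra hour comes straight out of it. Under hourly billing the fee would rise with the hours instead.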

Crosby's investment in benchmarking and research is also a way of marking competitive territory. Crosby Intelligence is jointly funding fellows with OpenAI, which suggests the legal vertical matters to LLM vendors. The legal services market is large enough for many players, but efficiency will sort winners from the rest.

What to Do Now

If you run legal ops or purchase legal tech, use Crosby's benchmark as a stress test. Does your contract automation handle multi-step negotiation, or just single-turn redline suggestions? Can the tool reason about commercial leverage and counterparty incentives, or does it stop at clause-level edits?

The benchmark itself is vendor-published with no independent reproducer, so treat the absolute scores as Crosby's own data rather than independently verified results. The structure of the test (multi-turn, judgment-driven, context-aware) is what to apply to your own evaluation. If your current vendor touts "AI-powered redlining" but delivers clause-by-clause suggestions without deal reasoning, you have your answer.
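One way to apply that structure in-house is a simple reviewer scorecard kept per negotiation turn. The sketch below is a minimal illustration, not Crosby's methodology: the article does not describe how Redline is scored, and the dimension names, `TurnScore`, `negotiation_score`, and `drop_off` are all hypothetical, with the dimensions paraphrased from the skills the article lists.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical rubric dimensions, paraphrased from the article's list of skills.
# These are NOT Crosby's actual scoring criteria.
DIMENSIONS = (
    "deal_context",
    "commercial_leverage",
    "legally_sound_edits",
    "anticipates_counterparty",
    "preserves_momentum",
)


@dataclass
class TurnScore:
    """A reviewer's pass/fail marks for one negotiation turn."""
    marks: dict[str, bool]

    def fraction(self) -> float:
        """Share of rubric dimensions passed on this turn (missing = failed)."""
        return sum(self.marks.get(d, False) for d in DIMENSIONS) / len(DIMENSIONS)


def negotiation_score(turns: list[TurnScore]) -> float:
    """Average the per-turn fractions across the whole negotiation."""
    if not turns:
        raise ValueError("need at least one scored turn")
    return sum(t.fraction() for t in turns) / len(turns)


def drop_off(turns: list[TurnScore]) -> float:
    """First-turn fraction minus the mean of later turns.

    A large positive value hints that a tool handles a single-turn redline
    but degrades as the negotiation continues.
    """
    if len(turns) < 2:
        raise ValueError("need at least two turns to measure drop-off")
    later = turns[1:]
    return turns[0].fraction() - sum(t.fraction() for t in later) / len(later)


# Illustrative marks only: strong first turn, weaker follow-ups.
turns = [
    TurnScore({d: True for d in DIMENSIONS}),
    TurnScore({"legally_sound_edits": True}),
    TurnScore({"legally_sound_edits": True, "deal_context": True}),
]
print(round(negotiation_score(turns), 3), round(drop_off(turns), 3))
```

Scoring each turn separately, rather than only the final draft, is what distinguishes a tool that merely suggests clause edits from one that adapts as the negotiation moves.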

Crosby Intelligence's stated roadmap includes a monthly speaker series with scholars and practitioners, plus additional benchmarks on open problems in legal AI. Follow that output if you're evaluating agent-based legal work.
