Analysis · June 23, 2026 · 3 min read

Same Model, Different Results: Legal AI Scaffold Beats Raw Model Power

Legal Nodes study shows Claude Opus 4.8 performs differently across three scaffolds. Workflow engineering, not just base models, drives real legal AI performance.

Our Take

Model-only benchmarks miss the actual lever: the scaffold around the model can shape legal AI output quality more than the model itself.

Why it matters

Legal teams are pouring resources into fine-tuning while overlooking faster wins in prompt engineering, retrieval setup, and workflow logic. This study suggests the return on scaffold investment may outpace the return on post-training in the near term, especially as token costs climb.

Do this week

Legal ops lead: audit your current legal AI setup against the scaffold dimensions the study names (starting with context layer, agentic loops, and tool calling) so you can identify quick wins before committing to model fine-tuning.

Legal Nodes Tests One Model Across Three Scaffolds

A benchmarking study by Legal Nodes, a tech-enabled legal consultancy, evaluated Claude Opus 4.8 on the same 40 legal tasks (data protection and digital operational resilience) using three different publicly available environments: Claude Chat, Cowork with Legal Plugin, and MikeOSS. The same model produced materially different outputs depending on which scaffold wrapped it.

The finding contradicts the current industry focus on model selection and post-training. Nestor Dubnevych, legal AI expert at Legal Nodes, stated the study shows "model-only evaluation gives an incomplete picture of legal AI performance." Output quality depends on context layering, workflow logic, prompt engineering, planning, agentic loops, retrieval, and tool calling—not the base model alone.

MikeOSS, the open-source scaffold, achieved performance slightly lower than Claude Chat and Cowork but at 60% cost savings versus Cowork and 90% savings versus Claude Chat (company-reported), using the same underlying model. Will Chen, MikeOSS creator, noted the results demonstrate "satisfactory performance standards" with "significant cost savings," though he acknowledged the current focus has been transactional tasks like contract review.

Scaffold Engineering May Deliver Faster ROI Than Fine-Tuning

The legal tech market's recent pivot toward post-trained models (Harvey and Kirkland & Ellis announced model training partnerships) assumes that fine-tuning is the bottleneck. This study suggests it isn't—at least not first. For many legal teams, optimizing the scaffold around an existing model may be the faster path to measurable performance gains.

Token cost pressure is sharpening the trade-off. As inference costs climb, teams cannot afford to default to the largest or most expensive setup. Cowork and Claude Chat scored somewhat higher, but MikeOSS achieved functional performance at a fraction of the cost. This becomes a material decision when legal workflows run thousands of inferences per month.

The study also flags a gap in current legal AI benchmarking. Harvey and Crosby's LAB and RedlineBench leaderboards test models in isolation, not in a real deployment context. Legal Nodes' decision to test the same model across three scaffolds addresses that gap and raises a second-order question: if scaffold choice matters this much, why do legal teams still optimize for model selection first?

Audit Your Scaffold Before You Fine-Tune

Map your legal AI setup against the scaffold dimensions Legal Nodes identified: context layer (how you feed case law, internal docs, regulatory updates), workflow logic (branching, escalation, human review gates), prompt engineering (available skills and instruction sets), planning (multi-step reasoning chains), agentic loops (iteration and refinement), retrieval (vector stores, keyword search, hybrid), and tool calling (API integrations for document assembly, case research, compliance checks).

Most legal teams will find gaps in one or more of these layers. Closing them does not require model retraining or vendor lock-in—it requires design work and integration. A properly structured retrieval system or a well-engineered planning loop can move the needle faster than waiting for your vendor's next post-trained release.
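One way to make the audit concrete is to score each dimension and list the weakest ones first. The sketch below is illustrative only: the 0-to-3 maturity scale, the threshold, and the `find_gaps` helper are assumptions for this example, not part of the Legal Nodes study.

```python
# The seven scaffold dimensions named in the article.
DIMENSIONS = [
    "context layer",
    "workflow logic",
    "prompt engineering",
    "planning",
    "agentic loops",
    "retrieval",
    "tool calling",
]


def find_gaps(scores: dict[str, int], threshold: int = 2) -> list[str]:
    """Return dimensions scored below `threshold`, weakest first.

    `scores` maps a dimension name to a self-assessed maturity score on a
    hypothetical 0 (absent) to 3 (mature) scale. Unscored dimensions count as 0.
    """
    unknown = set(scores) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    # Sort by score, then by the article's ordering to keep ties stable.
    rated = sorted((scores.get(name, 0), i, name) for i, name in enumerate(DIMENSIONS))
    return [name for score, _, name in rated if score < threshold]


if __name__ == "__main__":
    # Made-up self-assessment for a single team.
    example = {
        "context layer": 3,
        "workflow logic": 2,
        "prompt engineering": 2,
        "planning": 1,
        "agentic loops": 0,
        "retrieval": 1,
        "tool calling": 3,
    }
    print(find_gaps(example))
```

With the made-up scores above, the gaps come back as agentic loops, planning, and retrieval, in that order. The scoring itself is a judgment call; the value is in forcing a rating for every layer before any fine-tuning budget is approved.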

Cost pressure makes this urgent. If MikeOSS can deliver performance close to Cowork and Claude Chat at 10% to 40% of the price (depending on the comparison), the savings compound. Assuming an illustrative baseline of $1 per task, a 1,000-task-per-month workflow would save $600 to $900 per month through scaffold choice alone. Those savings can be reinvested in improving your context layer or agentic loops.
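The arithmetic behind an estimate like this is easy to rerun with your own numbers. In the sketch below, the 60% and 90% figures are the company-reported savings quoted earlier; the $1.00-per-task baseline is a hypothetical value chosen for illustration, not a number from the study, and in practice the two comparison scaffolds would not share the same per-task cost.

```python
# Company-reported savings of MikeOSS relative to each comparison scaffold.
REPORTED_SAVINGS = {"Cowork": 0.60, "Claude Chat": 0.90}


def monthly_savings(tasks_per_month: int,
                    baseline_cost_per_task: float,
                    savings_fraction: float) -> float:
    """Estimate monthly savings from switching scaffolds.

    `baseline_cost_per_task` is what the current scaffold costs per task;
    `savings_fraction` is the share of that cost the cheaper scaffold removes.
    """
    if not 0.0 <= savings_fraction <= 1.0:
        raise ValueError("savings_fraction must be between 0 and 1")
    return round(tasks_per_month * baseline_cost_per_task * savings_fraction, 2)


if __name__ == "__main__":
    ASSUMED_BASELINE = 1.00  # hypothetical dollars per task, not from the study
    for scaffold, fraction in REPORTED_SAVINGS.items():
        saved = monthly_savings(1000, ASSUMED_BASELINE, fraction)
        print(f"vs {scaffold}: ${saved:,.2f} saved per month")
```

Under that assumed baseline, 1,000 tasks per month works out to $600 saved versus Cowork and $900 versus Claude Chat. Replace the baseline with your actual per-task spend before using the result in a budget discussion.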
