Same Model, Different Results: Legal AI Scaffold Beats Raw Model Power

Legal Nodes Tests One Model Across Three Scaffolds

A benchmarking study by Legal Nodes, a tech-enabled legal consultancy, evaluated Claude Opus 4.8 on the same 40 legal tasks (data protection and digital operational resilience) using three different publicly available environments: Claude Chat, Cowork with Legal Plugin, and MikeOSS. The same model produced materially different outputs depending on which scaffold wrapped it.

The finding challenges the industry's current focus on model selection and post-training. Nestor Dubnevych, legal AI expert at Legal Nodes, stated the study shows "model-only evaluation gives an incomplete picture of legal AI performance." Output quality depends on context layering, workflow logic, prompt engineering, planning, agentic loops, retrieval, and tool calling, not on the base model alone.

MikeOSS, the open-source scaffold, performed slightly below Claude Chat and Cowork but cost 60% less than Cowork and 90% less than Claude Chat (company-reported), using the same underlying model. Will Chen, MikeOSS creator, noted the results demonstrate "satisfactory performance standards" with "significant cost savings," though he acknowledged that the project's focus so far has been transactional tasks like contract review.

Scaffold Engineering May Deliver Faster ROI Than Fine-Tuning

The legal tech market's recent pivot toward post-trained models (Harvey and Kirkland & Ellis announced model training partnerships) assumes that fine-tuning is the bottleneck. This study suggests it is not, or at least not the first one. For many legal teams, optimizing the scaffold around an existing model may be the faster path to measurable performance gains.

Token cost pressure is sharpening the trade-off. As inference costs climb, teams cannot afford to default to the largest or most expensive model. Cowork and Claude Chat produced higher-quality outputs, but MikeOSS achieved functional performance at a fraction of the cost. This becomes a material decision when legal workflows run thousands of inferences per month.

The study also flags a gap in current legal AI benchmarking. Harvey and Crosby's LAB and RedlineBench leaderboards test models in isolation, not in a real deployment context. Legal Nodes' decision to test the same model across three scaffolds addresses that gap and raises a second-order question: if scaffold choice matters this much, why do legal teams still optimize for model selection first?

Audit Your Scaffold Before You Fine-Tune

Map your legal AI setup against the scaffold dimensions Legal Nodes identified: context layer (how you feed case law, internal docs, regulatory updates), workflow logic (branching, escalation, human review gates), prompt engineering (available skills and instruction sets), planning (multi-step reasoning chains), agentic loops (iteration and refinement), retrieval (vector stores, keyword search, hybrid), and tool calling (API integrations for document assembly, case research, compliance checks).

Most legal teams will find gaps in one or more of these layers. Closing them does not require model retraining or vendor lock-in. It requires design work and integration. A properly structured retrieval system or a well-engineered planning loop can move the needle faster than waiting for your vendor's next post-trained release.
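One way to make this audit concrete is to score each of the seven dimensions and list the ones that fall short. The sketch below is illustrative only: the 0-3 maturity scale, the threshold, and the names `DimensionAudit` and `find_gaps` are assumptions made for this example, not part of the Legal Nodes study.

```python
from dataclasses import dataclass

# The seven scaffold dimensions named in the article.
SCAFFOLD_DIMENSIONS = [
    "context layer",
    "workflow logic",
    "prompt engineering",
    "planning",
    "agentic loops",
    "retrieval",
    "tool calling",
]


@dataclass
class DimensionAudit:
    dimension: str
    score: int  # hypothetical scale: 0 = absent, 3 = well engineered
    notes: str = ""


def find_gaps(audits, threshold=2):
    """Return dimensions scored below the threshold.

    Dimensions that were never audited count as score 0, so they
    always show up as gaps.
    """
    scored = {a.dimension: a.score for a in audits}
    return [d for d in SCAFFOLD_DIMENSIONS if scored.get(d, 0) < threshold]


# Hypothetical audit of a team that has only looked at three layers.
example_audit = [
    DimensionAudit("context layer", 3, "internal docs and regulatory feeds indexed"),
    DimensionAudit("retrieval", 1, "keyword search only"),
    DimensionAudit("tool calling", 2, "document assembly API wired in"),
]

gaps = find_gaps(example_audit)
print(gaps)
```

Treating unaudited dimensions as gaps is a deliberate choice here: a layer nobody has examined is more likely to be missing than to be quietly well engineered.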

Cost pressure makes this urgent. If MikeOSS can come close to Cowork and Claude Chat performance at 40% of Cowork's price and 10% of Claude Chat's (company-reported), the savings compound. As an illustration, at a baseline of $1 per task, a 1,000-task-per-month workflow would save roughly $600 to $900 per month from scaffold choice alone. That money can go back into engineering work on your context layer or agentic loops.
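The savings arithmetic can be checked with a short sketch. The article gives only percentage savings (60% versus Cowork, 90% versus Claude Chat, both company-reported). The $1-per-task baseline below is a hypothetical figure chosen for illustration, not a price from the study, and the two baselines would likely differ in practice.

```python
def monthly_savings(tasks_per_month, baseline_cost_per_task, savings_pct):
    """Monthly savings from switching to a cheaper scaffold.

    savings_pct is the percentage reduction relative to the baseline
    scaffold's per-task cost (for example, 60 means 60% cheaper).
    """
    baseline_monthly_cost = tasks_per_month * baseline_cost_per_task
    return baseline_monthly_cost * savings_pct / 100


TASKS_PER_MONTH = 1000
ASSUMED_BASELINE_USD = 1.00  # hypothetical per-task cost, not from the study

# Savings percentages as reported in the article (company-reported).
vs_cowork = monthly_savings(TASKS_PER_MONTH, ASSUMED_BASELINE_USD, 60)
vs_claude_chat = monthly_savings(TASKS_PER_MONTH, ASSUMED_BASELINE_USD, 90)

print(f"vs Cowork:      ${vs_cowork:,.2f} per month")
print(f"vs Claude Chat: ${vs_claude_chat:,.2f} per month")
```

Under that assumed baseline, the reported percentages work out to $600 and $900 per month. Substituting your own per-task costs for each scaffold gives the figure that matters for your workflow.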
