Our Take
This is not geopolitics dressed up as technology; it's rational cost control and competitive differentiation wearing a sovereignty label.
Why it matters
Large law firms and legal tech vendors are realizing that relying entirely on third-party models exposes them to price increases, availability risk, and commoditization of their expertise. As AI becomes table stakes, owning the training pipeline becomes a margin play.
Do this week
General Counsel: By January 31, audit which AI workflows are locked into a single vendor, then identify the top 3 candidates for internal retraining on proprietary datasets.
Law firms are training their own models
Thomson Reuters has begun training open-source LLMs on its own legal data rather than routing all queries through OpenAI or Anthropic. Kirkland & Ellis is building GPU clusters for model training in partnership with Palantir and has deliberately chosen to avoid commercial platforms. Harvey is working with law firms to train custom models on their proprietary workflows and client data.
These moves reflect a broader pattern Artificial Lawyer calls "AI sovereignty," though the term covers more than geopolitical independence from US AI vendors. It includes any effort by organizations to control their own AI infrastructure and avoid lock-in to third-party providers.
Control over infrastructure protects margins and independence
The stakes are concrete. First, token costs are rising. A firm that fine-tunes or post-trains a model on its own curated legal datasets can lower its per-query cost relative to querying GPT-4 or Claude at commercial rates. Second, regulatory risk is real. Anthropic's Fable model was temporarily banned; firms dependent on a single provider have no fallback when that happens. Third, data leverage matters. Thomson Reuters holds decades of legal precedent and filings; using that as a training base creates a defensible moat rather than feeding it into a commodity model accessible to competitors.
For Kirkland & Ellis, narrative control is an explicit goal. The firm wants to avoid the perception that its advice is simply ChatGPT output wrapped in billable hours. Building internal infrastructure is marketing as much as engineering.
Audit token spend and data ownership now
If your organization spends more than $50K annually on API calls to OpenAI, Anthropic, or Google, the spend is large enough that the math favors exploring open-source model training on your own data. Mistral, Llama, and other open models can be fine-tuned and run on commodity hardware or on cloud infrastructure such as AWS or Azure.
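To check where you stand against that $50K line, a rough annual-spend estimate is enough. The sketch below is one way to do it, not a pricing model: the `ApiUsage` fields, the provider names, the token volumes, and the per-million-token rates are all hypothetical placeholders, and you should substitute the figures from your own invoices or contracts.

```python
from dataclasses import dataclass

# Annual spend threshold taken from the text above.
REVIEW_THRESHOLD_USD = 50_000


@dataclass
class ApiUsage:
    provider: str
    monthly_input_tokens: int
    monthly_output_tokens: int
    usd_per_million_input: float   # hypothetical rate, not a real price
    usd_per_million_output: float  # hypothetical rate, not a real price


def annual_spend_usd(usage: ApiUsage) -> float:
    """Estimate yearly spend for one provider from monthly token volumes."""
    monthly = (
        usage.monthly_input_tokens / 1_000_000 * usage.usd_per_million_input
        + usage.monthly_output_tokens / 1_000_000 * usage.usd_per_million_output
    )
    return monthly * 12


def total_annual_spend_usd(usages) -> float:
    """Sum the yearly estimate across all providers."""
    return sum(annual_spend_usd(u) for u in usages)


# Illustrative numbers only.
example_usage = [
    ApiUsage("provider_a", 400_000_000, 100_000_000, 3.0, 15.0),
    ApiUsage("provider_b", 200_000_000, 50_000_000, 5.0, 20.0),
]

total = total_annual_spend_usd(example_usage)
print(f"Estimated annual API spend: ${total:,.0f}")
print("Above review threshold:", total > REVIEW_THRESHOLD_USD)
```

With these placeholder inputs the estimate comes to $56,400 a year, just over the threshold. The figure covers API charges only; it says nothing about what in-house training and hosting would cost, which has to be estimated separately.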
Start by cataloging three things: which workflows consume the most tokens, which workflows operate on proprietary or sensitive data, and which outputs require domain-specific accuracy (medical diagnostics, legal research, financial risk modeling). These are candidates for in-house retraining. Everything else can stay on commercial APIs.
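The catalog can live in a spreadsheet, but the triage logic is simple enough to sketch in a few lines. The version below assumes one possible rule: a workflow qualifies if it touches proprietary data or needs domain-specific accuracy, and qualifying workflows are ranked by token volume. The `Workflow` fields, the rule itself, and every workflow name and number are hypothetical and only illustrate the shape of the exercise.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    monthly_tokens: int
    uses_proprietary_data: bool
    needs_domain_accuracy: bool


def retraining_candidates(workflows, top_n=3):
    """Return up to top_n eligible workflows, highest token volume first.

    Assumed rule: a workflow is eligible if it uses proprietary data
    or requires domain-specific accuracy.
    """
    eligible = [
        w for w in workflows
        if w.uses_proprietary_data or w.needs_domain_accuracy
    ]
    return sorted(eligible, key=lambda w: w.monthly_tokens, reverse=True)[:top_n]


# Illustrative catalog; names and volumes are made up.
catalog = [
    Workflow("contract_review", 900_000_000, True, True),
    Workflow("legal_research", 600_000_000, False, True),
    Workflow("email_drafting", 1_200_000_000, False, False),
    Workflow("client_intake_summaries", 150_000_000, True, False),
    Workflow("marketing_copy", 80_000_000, False, False),
]

candidates = retraining_candidates(catalog)
candidate_names = [w.name for w in candidates]
stay_on_api = [w.name for w in catalog if w.name not in candidate_names]

print("Retraining candidates:", candidate_names)
print("Stay on commercial APIs:", stay_on_api)
```

Note that `email_drafting` has the highest token volume in this made-up catalog but is still left on a commercial API, because under the assumed rule volume alone does not justify retraining. A firm that weighs cost more heavily could change the eligibility test accordingly.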
The constraint is not technology; it is operational maturity. Training and deploying models requires in-house MLOps, data governance, and version control. If you lack that capability, start with a vendor partner (Harvey or a specialist boutique) and pilot one workflow before building a team.