Bridgewater fine-tuned a cheap open model to beat frontier LLMs on financial-news judgment

Verified · Developer · Finance

Thursday, July 2, 2026

Confidence

High · primary research with full figures

Evidence

Bridgewater AIA Labs + Thinking Machines Lab joint research post with published methodology and benchmark numbers

Bridgewater's AIA Labs and Thinking Machines Lab published a study showing that a fine-tuned open-weight model outperforms the best frontier LLMs tested at judging which financial news is relevant, at roughly a fourteenth of the inference cost.

The details

On replicating expert investor judgment about document relevance, naive prompts to frontier models averaged 47.2% accuracy ("a coin flip"), and even expert-crafted prompts only reached 77.2%, short of the ~80% investors say they need to trust a system in daily work, per Thinking Machines Lab.
Starting from Alibaba's open-weight Qwen3-235B (44.8% out of the box), the team used an expert-labeled dataset and on-policy distillation to reach 84.7% average accuracy, versus 78.2% for the best frontier model. That is 29.8% fewer mistakes, per Thinking Machines Lab.
The fine-tuned model runs at a 13.8x reduction in inference cost per task versus the frontier models tested, and the team notes that newer, pricier frontier releases added only a point or two of accuracy, per Bridgewater AIA Labs.
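The "29.8% fewer mistakes" figure is a relative reduction in error rate, not a gap in accuracy points. A minimal sketch of that arithmetic, using only the numbers quoted above (the function and variable names are illustrative, not from the study):

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline model's mistakes that the new model avoids."""
    baseline_err = 1.0 - baseline_acc  # error rate of the baseline
    new_err = 1.0 - new_acc            # error rate of the new model
    return (baseline_err - new_err) / baseline_err


# Figures quoted in the article.
best_frontier_acc = 0.782  # best frontier model
fine_tuned_acc = 0.847     # fine-tuned Qwen3-235B

reduction = relative_error_reduction(best_frontier_acc, fine_tuned_acc)
print(f"Relative error reduction: {reduction:.1%}")  # 29.8%

# A 13.8x cost reduction means each task costs about 1/13.8 of the frontier price.
cost_fraction = 1 / 13.8
print(f"Cost per task vs. frontier: {cost_fraction:.3f}")  # 0.072
```

The accuracy gap is 6.5 points, but the frontier model's error rate is only 21.8%, so removing 6.5 points of error eliminates roughly 30% of its mistakes.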

What it means

The clearest recent case against "just wait for the next frontier model." Expert-labeled data plus a fine-tuned open model beat every frontier LLM tested at roughly 14x lower inference cost: the moat comes from the data, not from model access. Building an internal classifier? Label a few thousand expert examples and fine-tune first. The catch: you need the experts.
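For teams weighing that advice, the first step is usually to measure how often a candidate model agrees with expert labels and compare that with the roughly 80% trust bar cited above. The sketch below assumes binary relevant/not-relevant labels. The toy data and all names are hypothetical and do not reproduce the study's evaluation.

```python
from typing import Sequence

TRUST_THRESHOLD = 0.80  # the ~80% bar investors cite in the article


def accuracy(predicted: Sequence[bool], expert: Sequence[bool]) -> float:
    """Share of documents where the model's relevance call matches the expert's."""
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert must have the same length")
    if not expert:
        raise ValueError("need at least one labeled example")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


# Hypothetical toy labels: True = relevant, False = not relevant.
expert_labels = [True, False, True, True, False, False, True, False, True, False]
model_labels = [True, False, True, False, False, False, True, False, True, False]

acc = accuracy(model_labels, expert_labels)
print(f"accuracy={acc:.0%}, meets trust bar: {acc >= TRUST_THRESHOLD}")
```

In practice the labeled set would be split, with one part held out for this check and the rest used for fine-tuning, so the measured accuracy reflects documents the model has not seen.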

Sources

Bridgewater fine-tuned a cheap open model to beat frontier LLMs on financial-news judgment
