Analysis · June 16, 2026 · 3 min read

Probably Raises $9M to Stop AI Hallucinations Before Users See Them

Probably closed a seed round from Andreessen Horowitz to build validation systems that catch LLM errors before they reach users. The approach runs smaller models locally, which the company says cuts token costs dramatically.

Our Take

The real insight is not the accuracy target. It is that better harness engineering lets you ship smaller, cheaper models without sacrificing correctness, which inverts the current economics of AI deployment.

Why it matters

Token costs are rising and companies are cutting AI budgets. A vendor targeting 99.99% accuracy (company-reported, not yet independently verified) on locally runnable models instead of frontier LLMs addresses the cost crisis directly, not theoretically.

Do this week

Audit your current LLM pipeline: map which queries actually need frontier models versus which could run on smaller models with deterministic validation layered in, then calculate the token savings before end of week.

Probably Closes $9M Seed to Engineer Accuracy Into Smaller Models

Probably, a newly funded startup founded by Peter Elias, raised $9 million in seed funding from Andreessen Horowitz. The company is building a system to prevent hallucinations and factual errors from reaching users by treating LLM accuracy as an engineering problem rather than a model-capacity problem.

The first product is a data science tool that returns answers from complex datasets. Each result includes citations and an audit trail. The critical innovation is what Elias calls a "data science mech suit": the LLM generates an initial answer, which is then checked against a deterministic validator that bounces back any results that don't match the underlying data. The LLM has been trained against this validator, and the entire system is optimized for speed and accuracy together.
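The generate-then-validate loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Probably's implementation: the SALES data, the validate_total check, and the stub model are all hypothetical stand-ins, and a real validator would cover far more than a single sum.

```python
from typing import Callable, Optional

# Hypothetical source data: the ground truth the validator checks against.
SALES = {"north": 120, "south": 80, "east": 100}


def validate_total(claimed: int, data: dict[str, int]) -> bool:
    """Deterministic check: the claimed total must equal the recomputed sum."""
    return claimed == sum(data.values())


def answer_with_validation(
    generate: Callable[[str, Optional[str]], int],
    question: str,
    data: dict[str, int],
    max_attempts: int = 3,
) -> Optional[int]:
    """Ask the generator, and bounce back any answer the validator rejects."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        claimed = generate(question, feedback)
        if validate_total(claimed, data):
            return claimed
        # Feed the rejection back so the next attempt can correct itself.
        feedback = f"{claimed} does not match the underlying data; try again."
    # Refuse rather than show the user an unverified answer.
    return None


def make_stub_model(outputs: list[int]) -> Callable[[str, Optional[str]], int]:
    """Stand-in for an LLM call (assumption): replays canned outputs in order."""
    remaining = iter(outputs)

    def generate(question: str, feedback: Optional[str]) -> int:
        return next(remaining)

    return generate


# The stub "hallucinates" 310 first, then returns the correct total.
result = answer_with_validation(
    make_stub_model([310, 300]), "What were total sales?", SALES
)
print(result)
```

The design point is that the user only ever sees an answer the validator has accepted. When every attempt fails, the function returns None instead of passing along an unchecked result.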

The result is counterintuitive. Elias reports that "the better your harness engineering is, the weaker the model can be." Probably's current version runs on a model that is "four classes weaker than the frontier models" (company-reported), meaning it can run on local hardware rather than cloud infrastructure. This cuts token costs substantially at a time when many organizations are reassessing AI spending.

The Economics Flip When Validation Replaces Model Scale

This approach addresses a gap that large AI labs are structurally incentivized to ignore. As Elias notes, major labs "make money the more times you have to correct the model," so they have no commercial reason to ship systems that reduce correction loops or lower inference costs through smaller models.

Probably's architecture decouples accuracy from model size by adding deterministic validation. When you constrain the LLM's context carefully enough, the model "does not have to work very hard to do the right thing," Elias explains. The payoff is local deployment, reduced latency, and dramatically lower token consumption. For precision-sensitive use cases (accounting, medical services, data analysis), this is material.

The $9 million round signals investor conviction in this approach, but the real test is whether smaller models with tight harnesses can match frontier-model accuracy in production. Probably claims the system targets 99.99% accuracy, matching deterministic systems, but this is not yet independently verified.

How to Evaluate This for Your Deployment

If you are running inference-heavy AI pipelines with high error sensitivity, the question is whether your use case can be reframed as a validation problem instead of a raw model-quality problem. Probably's approach suggests that if you can enumerate what the data says, you can trap most hallucinations without scaling to GPT-4 or Claude Opus.

This is particularly relevant if you have been migrating toward frontier models purely for reliability. Audit your error logs: how many failures are genuine model reasoning gaps versus factual inconsistencies with your source data? If the latter dominates, a smaller model plus deterministic validation may cut your costs substantially while holding accuracy roughly constant. The catch is that this approach works best when outputs must be verifiable against a fixed corpus. Open-ended reasoning tasks will still demand larger models.
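As a rough sketch of that audit, the snippet below tallies failures by type and estimates spend if the data-mismatch share moved to a smaller model. Everything in it is an assumption for illustration: the log entries, the "kind" labels (presumed to come from manual review), the token volume, and both per-million-token prices are placeholders, not figures from Probably or any vendor.

```python
from collections import Counter

# Hypothetical error-log entries, labeled by hand during review (assumption).
ERROR_LOG = [
    {"id": 1, "kind": "data_mismatch"},
    {"id": 2, "kind": "data_mismatch"},
    {"id": 3, "kind": "reasoning_gap"},
    {"id": 4, "kind": "data_mismatch"},
    {"id": 5, "kind": "data_mismatch"},
]


def failure_shares(log: list[dict]) -> dict[str, float]:
    """Return each failure kind as a fraction of all logged failures."""
    counts = Counter(entry["kind"] for entry in log)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


def monthly_cost(tokens: float, price_per_million: float) -> float:
    """Token spend for a month at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


# Placeholder inputs: substitute your own volume and pricing.
MONTHLY_TOKENS = 100_000_000
FRONTIER_PRICE = 10.0  # assumed price per million tokens
SMALL_PRICE = 1.0      # assumed price per million tokens

shares = failure_shares(ERROR_LOG)
# Crude proxy: treat the data-mismatch share as the routable share of traffic.
routable = shares.get("data_mismatch", 0.0)

cost_before = monthly_cost(MONTHLY_TOKENS, FRONTIER_PRICE)
cost_after = (
    monthly_cost(MONTHLY_TOKENS * (1 - routable), FRONTIER_PRICE)
    + monthly_cost(MONTHLY_TOKENS * routable, SMALL_PRICE)
)

print(f"data-mismatch share of failures: {routable:.0%}")
print(f"estimated monthly cost: {cost_before:.2f} -> {cost_after:.2f}")
```

Using the failure mix as a stand-in for the traffic mix is a simplification. A more careful audit would classify queries, not just failures, and would add the cost of building and running the validator.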
