Probably Raises $9M to Stop AI Hallucinations Before Users See Them

Probably Closes $9M Seed to Engineer Accuracy Into Smaller Models

Probably, a startup founded by Peter Elias, has raised $9 million in seed funding from Andreessen Horowitz. The company is building a system to keep hallucinations and factual errors from reaching users by treating LLM accuracy as an engineering problem rather than a model-capacity problem.

The first product is a data science tool that returns answers from complex datasets. Each result includes citations and an audit trail. The critical innovation is what Elias calls a "data science mech suit": the LLM generates an initial answer, which is then checked by a deterministic validator that bounces back any result that does not match the underlying data. The LLM has been trained against this validator, and the entire system is optimized for speed and accuracy together.
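As a rough illustration of the generate-then-validate pattern described above, here is a minimal Python sketch. It is not Probably's implementation, which has not been published. The dataset, the `flaky_model` stand-in for an LLM call, and the function names are all hypothetical, chosen only to show how a deterministic check can keep an unverified answer from reaching the user.

```python
from typing import Callable, Optional

# Hypothetical source data: the "underlying data" that answers are checked against.
SALES = {"north": 120, "south": 80, "east": 100}


def validate_total(claimed: int, data: dict[str, int]) -> bool:
    """Deterministic check: recompute the total from the data and compare."""
    return claimed == sum(data.values())


def answer_with_validation(
    generate: Callable[[int], int],
    data: dict[str, int],
    max_attempts: int = 3,
) -> Optional[int]:
    """Ask the generator for an answer and bounce anything the validator rejects."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if validate_total(candidate, data):
            return candidate  # only validated answers reach the user
    return None  # refuse rather than return an unverified answer


def flaky_model(attempt: int) -> int:
    """Stand-in for an LLM call (assumption): wrong on the first try, right after."""
    return 310 if attempt == 0 else 300


result = answer_with_validation(flaky_model, SALES)
```

In this sketch the first answer (310) is rejected and the retry (300) passes. A generator that never produces a matching answer yields `None`, so the failure mode is a refusal rather than a wrong number.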

The result is counterintuitive. Elias reports that "the better your harness engineering is, the weaker the model can be." Probably's current version runs on a model that is "four classes weaker than the frontier models" (company-reported), meaning it can run on local hardware rather than cloud infrastructure. This cuts token costs substantially at a time when many organizations are reassessing AI spending.

The Economics Flip When Validation Replaces Model Scale

This approach addresses a gap that large AI labs are structurally incentivized to ignore. As Elias notes, major labs "make money the more times you have to correct the model," so they have no commercial reason to ship systems that reduce correction loops or lower inference costs through smaller models.

Probably's architecture decouples accuracy from model size by adding deterministic validation. When you constrain the LLM's context carefully enough, the model "does not have to work very hard to do the right thing," Elias explains. The payoff is local deployment, reduced latency, and much lower token consumption. For precision-sensitive use cases (accounting, medical services, data analysis), those savings are material.

The $9 million round signals investor conviction in this approach, but the real test is whether smaller models with tight harnesses can match frontier-model accuracy in production. Probably claims the system targets 99.99% accuracy, matching deterministic systems, but this is not yet independently verified.

How to Evaluate This for Your Deployment

If you are running inference-heavy AI pipelines with high error sensitivity, the question is whether your use case can be reframed as a validation problem instead of a raw model-quality problem. Probably's approach suggests that if you can enumerate what the data says, you can trap most hallucinations without scaling to GPT-4 or Claude Opus.

This is particularly relevant if you have been migrating toward frontier models purely for reliability. Audit your error logs: how many failures are genuine model reasoning gaps versus factual inconsistencies with your source data? If the latter dominates, a smaller model plus deterministic validation may cut your costs substantially while holding accuracy constant, though the size of the saving will depend on your workload. The catch is that this approach works best when outputs must be verifiable against a fixed corpus. Open-ended reasoning tasks will still demand larger models.
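One way to run the error-log audit is to tally failures by cause. The Python sketch below assumes each logged failure has already been labeled by hand as either a mismatch with source data or a reasoning gap. The record fields and label names are hypothetical, and the sample records are made up for illustration only.

```python
from collections import Counter

# Hypothetical, hand-labeled failure records; field names and labels are assumptions.
error_log = [
    {"id": 1, "kind": "data_mismatch"},   # answer contradicted the source data
    {"id": 2, "kind": "data_mismatch"},
    {"id": 3, "kind": "reasoning_gap"},   # model could not work the problem out
    {"id": 4, "kind": "data_mismatch"},
]


def failure_shares(records: list[dict]) -> dict[str, float]:
    """Return each failure kind's share of all logged failures."""
    counts = Counter(record["kind"] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing logged, nothing to compare
    return {kind: n / total for kind, n in counts.items()}


shares = failure_shares(error_log)
```

If data mismatches dominate the shares, the workload is a plausible candidate for validation-first design. If reasoning gaps dominate, a validator alone is unlikely to close the gap.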
