LLMs fail at random sampling: new benchmark shows pass rates from near 0% to over 20%

A new benchmark exposes LLM distributional sampling as mostly broken

Researchers introduced UnpredictaBench, a test suite of 448 problems designed to measure whether large language models can sample from target distributions. The suite covers canonical statistical distributions (normal, exponential, etc.), distributions generated by stochastic programs, and natural-language scenarios describing random processes.

The evaluation uses KS@N, a metric built on the Kolmogorov-Smirnov test at sample size N. It measures how often a set of N samples from the model is not rejected by the test as coming from the ground-truth distribution. KS@100, the setting reported here, is that pass rate with samples of size 100.
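The metric described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's own code: the 5% significance level, the large-sample critical value 1.36 / sqrt(n), and the names `ks_statistic`, `passes_ks`, `ks_at_n`, `collapsed_sampler` and `spread_sampler` are all hypothetical choices made for this example.

```python
import math
from statistics import NormalDist


def ks_statistic(samples, cdf):
    """One-sample KS statistic: the largest gap between the empirical
    CDF of `samples` and the target `cdf`."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def passes_ks(samples, cdf, coeff=1.36):
    """True if the samples are NOT rejected at roughly the 5% level.
    Assumption: the large-sample critical value coeff / sqrt(n)."""
    return ks_statistic(samples, cdf) <= coeff / math.sqrt(len(samples))


def ks_at_n(problems, n=100):
    """Fraction of problems whose n samples pass the KS test.
    `problems` is a list of (sampler, cdf) pairs; sampler(n) returns n floats."""
    passed = sum(passes_ks(sampler(n), cdf) for sampler, cdf in problems)
    return passed / len(problems)


target = NormalDist(0.0, 1.0)


def collapsed_sampler(n):
    # Mode collapse: every draw is the single most likely value.
    return [0.0] * n


def spread_sampler(n):
    # Deterministic stand-in for a well-behaved sampler: evenly spaced quantiles.
    return [target.inv_cdf((i + 0.5) / n) for i in range(n)]


problems = [(collapsed_sampler, target.cdf), (spread_sampler, target.cdf)]
score = ks_at_n(problems, n=100)
print(f"KS@100 = {score:.0%}")
```

In this toy setup the collapsed sampler fails and the evenly spread one passes, so the score is one problem out of two. In a real audit each sampler would wrap repeated model calls, and a library routine such as `scipy.stats.kstest` could replace the hand-rolled statistic.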

Pass rates across open and proprietary models range widely, from near 0% to over 20% (per the paper), and no model exceeded 40% at KS@100. Adding chain-of-thought reasoning improved scores somewhat but did not solve the underlying problem.

Mode collapse breaks anything downstream that assumes variance

The failure matters because LLMs excel at generating a single high-confidence answer. When a model is asked to sample from a distribution, it often returns the mode (the most likely value) instead of covering the full range. That behavior is fine for classification, but it breaks simulation.

Consider an economic model that asks an LLM to estimate household spending across 100 scenarios. If the model always returns the median estimate instead of sampling from the posterior, the simulation underestimates volatility and correlation structure. A Monte Carlo that needs variance gets none. Forecasting systems built on top compound the error.
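A toy version of the household-spending scenario shows how the spread disappears. The sketch below assumes, purely for illustration, that spending follows a lognormal distribution with made-up parameters `MU` and `SIGMA`; `faithful_draw` and `collapsed_draw` are hypothetical stand-ins for a well-behaved sampler and a mode-collapsed model.

```python
import math
import random
import statistics

# Hypothetical lognormal parameters for spending; not taken from any dataset.
MU, SIGMA = 8.0, 0.5
rng = random.Random(0)


def faithful_draw():
    # Draws from the full assumed distribution.
    return rng.lognormvariate(MU, SIGMA)


def collapsed_draw():
    # Always returns the median of the lognormal, exp(MU).
    return math.exp(MU)


def simulate(draw, n_scenarios=100):
    """Run the 'simulation': one spending estimate per scenario."""
    return [draw() for _ in range(n_scenarios)]


faithful_runs = simulate(faithful_draw)
collapsed_runs = simulate(collapsed_draw)

print("faithful  std dev:", statistics.pstdev(faithful_runs))
print("collapsed std dev:", statistics.pstdev(collapsed_runs))
```

The collapsed run reports a standard deviation of zero, so any risk measure computed from it (tail percentiles, volatility, stress ranges) is degenerate even though every individual estimate looks plausible.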

The benchmark makes this failure concrete and measurable, isolating a "simplified but fundamental" problem that must be solved before LLMs can serve as substitutes for human judgment or traditional sampling methods in stochastic settings.

Audit your LLM's distributional output before using it for simulation

If you are using an LLM to generate scenarios, forecast ranges, or sample from distributions in any critical workflow, test it first against UnpredictaBench or a custom suite built from your own target distributions. Know your model's KS@N score. Pair the model with rejection sampling or explicit temperature/top-p tuning to force spread, but do not assume it will naturally output calibrated samples. Treat distributional sampling as a separate capability that must be validated independently of the model's general reasoning quality.
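For readers unfamiliar with the rejection sampling mentioned above, here is a minimal sketch of the textbook technique. It assumes the target density is known and bounded on an interval, and it uses a plain uniform RNG as the proposal source. That choice is an assumption of this example: the technique needs a proposal whose density is known, which raw LLM output does not provide. The names `rejection_sample` and `exp_pdf` are hypothetical.

```python
import math
import random


def rejection_sample(target_pdf, lo, hi, pdf_max, n, rng):
    """Draw n samples from target_pdf restricted to [lo, hi].
    Proposals are uniform on [lo, hi]; pdf_max must bound target_pdf there."""
    out = []
    while len(out) < n:
        x = rng.uniform(lo, hi)
        # Accept x with probability target_pdf(x) / pdf_max.
        if rng.uniform(0.0, pdf_max) <= target_pdf(x):
            out.append(x)
    return out


def exp_pdf(x, rate=1.0):
    # Density of an exponential distribution with the given rate.
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


rng = random.Random(42)
samples = rejection_sample(exp_pdf, 0.0, 10.0, 1.0, 1000, rng)
print("sample mean:", sum(samples) / len(samples))
```

Accepted samples follow the target density (here an exponential truncated to [0, 10]) regardless of how flat the proposal is, at the cost of discarding most proposals. The output should still be checked with a goodness-of-fit test before it feeds a simulation.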