Our Take
LLMs collapse toward a single answer when asked to sample across a probability distribution. This benchmark measures the failure but offers no fix.
Why it matters
As companies deploy LLMs in economic simulations, forecasting, and Monte Carlo sampling, mode collapse becomes a silent failure. A model that returns one confident wrong answer instead of a properly distributed range will corrupt any downstream analysis that depends on variance.
Do this week
Evaluator: test your LLM's distributional output on UnpredictaBench before shipping it into any stochastic simulation or scenario-generation pipeline.
A new benchmark exposes LLM distributional sampling as mostly broken
Researchers introduced UnpredictaBench, a test suite of 448 problems designed to measure whether large language models can sample from target distributions. The test includes canonical statistical distributions (normal, exponential, etc.), distributions generated by stochastic programs, and natural-language scenarios describing random processes.
The evaluation uses KS@N, a metric built on the Kolmogorov-Smirnov test at sample size N. It reports a pass rate: how often a set of N samples drawn from the model is not rejected as coming from the ground-truth distribution. The reported results use KS@100, meaning samples of size 100.
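To make the pass-rate idea concrete, here is a minimal sketch of a KS-style check in pure Python. It is not the benchmark's code. The significance level `alpha=0.05`, the asymptotic p-value approximation, and the function names are all assumptions made for illustration, and this summary does not say how the paper aggregates across problems. The two samplers are stand-ins for model output: one draws from the target, the other is nearly collapsed onto a single value. In practice a library routine such as `scipy.stats.kstest` would likely replace the hand-rolled test.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(samples, cdf):
    """One-sample KS statistic: largest gap between empirical and target CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Compare the target CDF with the empirical CDF just after and just before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the p-value is effectively 1 for very small statistics
    total = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def ks_pass_rate(sampler, cdf, n=100, trials=200, alpha=0.05):
    """Fraction of trials whose n samples are NOT rejected at level alpha (assumed)."""
    passes = 0
    for _ in range(trials):
        samples = [sampler() for _ in range(n)]
        if ks_pvalue(ks_statistic(samples, cdf), n) >= alpha:
            passes += 1
    return passes / trials


rng = random.Random(0)
target = NormalDist(0.0, 1.0)  # hypothetical ground-truth distribution

# Stand-ins for model output: one calibrated, one nearly collapsed onto the mode.
print("calibrated:", ks_pass_rate(lambda: rng.gauss(0.0, 1.0), target.cdf))
print("collapsed: ", ks_pass_rate(lambda: rng.gauss(0.0, 0.05), target.cdf))
```

A sampler that matches the target should pass most trials, while a collapsed one should fail essentially all of them, which is the gap a pass-rate metric is meant to expose.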
Pass rates across open and proprietary models vary widely, from near 0% to over 20% (per the paper), and no model scored above 40% at KS@100. Adding chain-of-thought reasoning improved scores somewhat but did not solve the underlying problem.
Mode collapse breaks anything downstream that assumes variance
The gap between answering and sampling matters precisely because LLMs excel at generating high-confidence single answers. Asked to "sample from this distribution," a model often returns the mode (the most likely value) instead of exploring the full range. That behavior works fine for classification. It breaks simulation.
Consider an economic model that asks an LLM to estimate household spending across 100 scenarios. If the model always returns the same central estimate instead of sampling from the posterior, the simulation understates volatility and distorts correlation structure. A Monte Carlo run that needs variance gets none. Forecasting systems built on top compound the error.
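The household-spending example can be sketched in a few lines. Everything numeric here is hypothetical: the mean and standard deviation, the 50 households per scenario, and the two sampler functions, which stand in for a calibrated model and a collapsed one. The point is only to show how the spread across scenarios vanishes when every draw is the same central value.

```python
import random
import statistics

rng = random.Random(42)
MU, SIGMA = 3000.0, 600.0  # hypothetical monthly spending parameters


def calibrated_draw():
    """Stand-in for a model that really samples from the target distribution."""
    return rng.gauss(MU, SIGMA)


def collapsed_draw():
    """Stand-in for a mode-collapsed model: always the central estimate."""
    return MU


def simulate_totals(draw, households=50, scenarios=100):
    """Total spending per scenario, summing one draw per household."""
    return [sum(draw() for _ in range(households)) for _ in range(scenarios)]


spread_calibrated = statistics.stdev(simulate_totals(calibrated_draw))
spread_collapsed = statistics.stdev(simulate_totals(collapsed_draw))

print("scenario spread, calibrated:", round(spread_calibrated, 1))
print("scenario spread, collapsed: ", spread_collapsed)
```

With the collapsed sampler every scenario total is identical, so any risk measure computed from the scenario spread reports no risk at all.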
The benchmark makes this failure concrete and measurable, isolating a "simplified but fundamental" problem that must be solved before LLMs can serve as substitutes for human judgment or traditional sampling methods in stochastic settings.
Audit your LLM's distributional output before using it for simulation
If you are using an LLM to generate scenarios, forecast ranges, or sample from distributions in any critical workflow, first test it against UnpredictaBench or a custom suite built from your own target distributions. Know your model's KS@N score. Rejection sampling or explicit temperature/top-p tuning can help force spread, but do not assume the model will naturally output calibrated samples. Treat distributional sampling as a separate capability that must be validated independently of the model's general reasoning quality.
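One way the rejection-sampling idea might look in practice is a histogram-based resampler: estimate how the model's samples are spread across bins, then keep each sample with probability proportional to the ratio of target mass to observed mass in its bin. This is a sketch under strong assumptions, not a recommended recipe. The target CDF must be known, the function name and bin settings are made up for illustration, and the narrow Gaussian below is a stand-in for over-concentrated model output.

```python
import random
import statistics
from statistics import NormalDist


def rejection_resample(samples, target_cdf, lo, hi, bins, rng):
    """Thin samples so their histogram over [lo, hi) tracks the target's bin masses."""
    width = (hi - lo) / bins

    def bin_of(x):
        return min(bins - 1, int((x - lo) / width))

    kept = [x for x in samples if lo <= x < hi]
    counts = [0] * bins
    for x in kept:
        counts[bin_of(x)] += 1

    # Importance ratio per occupied bin: target mass / observed mass.
    ratio = [0.0] * bins
    for b in range(bins):
        if counts[b]:
            p = target_cdf(lo + (b + 1) * width) - target_cdf(lo + b * width)
            ratio[b] = p / (counts[b] / len(kept))

    m = max(ratio)
    if m == 0.0:
        return []
    # Accept each sample with probability ratio / m (classic rejection step).
    return [x for x in kept if rng.random() < ratio[bin_of(x)] / m]


rng = random.Random(7)
target = NormalDist(0.0, 1.0)  # hypothetical target distribution
narrow = [rng.gauss(0.0, 0.6) for _ in range(20000)]  # stand-in for model output
accepted = rejection_resample(narrow, target.cdf, -3.0, 3.0, 24, rng)

print("raw spread:     ", round(statistics.stdev(narrow), 3))
print("accepted spread:", round(statistics.stdev(accepted), 3))
print("kept", len(accepted), "of", len(narrow))
```

Two limits are worth noting. The resampler throws away most samples when the model is badly concentrated, and it cannot create values in regions the model never visits, so a fully collapsed model leaves nothing to resample. Re-running a KS check on the accepted samples is still necessary.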