Four axioms reveal how LLMs fail at internal reasoning—independent of benchmark scores

Four axioms expose a representational gap in LLMs

A paper posted to arXiv proposes a framework for measuring the quality of latent thoughts independently of downstream accuracy. Most evaluations conflate representation quality with model capacity, making it impossible to tell whether a reasoning failure stems from weak internal representations or from weak processing of good ones.

The framework formalizes four functional axioms: Causality (representations capture causal structure), Minimality (irrelevant information is compressed), Separability (different questions within a task produce distinct representations), and Stability (representations persist meaningfully across related inputs). For each axiom, the authors defined a quantitative measure computed directly on the representation, not backward from accuracy.
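To make "computed directly on the representation" concrete, here is a minimal sketch of one possible separability proxy. It is not the paper's measure, which is not reproduced here; the function names, the choice of cosine distance, and the between/within ratio are all assumptions made for illustration.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def separability_score(reps_by_question):
    """Hypothetical proxy: mean between-question distance divided by
    mean within-question distance.

    reps_by_question maps a question id to a list of vectors, for example
    hidden states collected for paraphrases of that question. Each question
    needs at least two vectors, and they must not all be identical.
    """
    within, between = [], []
    groups = list(reps_by_question.values())
    for vecs in groups:
        within += [cosine_distance(u, v) for u, v in combinations(vecs, 2)]
    for vecs_a, vecs_b in combinations(groups, 2):
        between += [cosine_distance(u, v) for u in vecs_a for v in vecs_b]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return mean_between / mean_within
```

Under this proxy, a score well above 1 would suggest that different questions occupy distinct regions of representation space, while a score near or below 1 would suggest they overlap. No accuracy labels are involved at any point.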

The audit covered 23 reasoning tasks (spatial reasoning, factual QA, and others) across open-weight LLMs spanning dense, reasoning-distilled, and reinforcement-learning-trained families. Result: no candidate model satisfied all four axioms simultaneously. More tellingly, representations distinguished task type reliably but could not distinguish between two different questions within the same task. Representations also encoded little information beyond what was already present in the input embedding.
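One way to probe the task-versus-question contrast on your own data is to run the same simple classifier at two label granularities. The sketch below uses a leave-one-out nearest-centroid classifier; this is an illustrative stand-in, not the authors' procedure, and the toy vectors are hand-built to mimic the reported pattern rather than taken from any model.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def loo_centroid_accuracy(vectors, labels):
    """Leave-one-out nearest-centroid accuracy.

    Each vector is held out, centroids are rebuilt from the rest,
    and the held-out vector is assigned to the closest centroid.
    """
    correct = 0
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        groups = {}
        for j, (other, other_label) in enumerate(zip(vectors, labels)):
            if j != i:
                groups.setdefault(other_label, []).append(other)
        predicted = min(groups, key=lambda g: sq_dist(vec, centroid(groups[g])))
        correct += predicted == label
    return correct / len(vectors)

# Toy data (hypothetical): two well-separated tasks, but the two questions
# inside each task are interleaved in the same region.
toy_vectors = [[0, 0], [1, 1], [0, 1], [1, 0],
               [10, 10], [11, 11], [10, 11], [11, 10]]
task_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
question_labels = ["q1", "q1", "q2", "q2", "q3", "q3", "q4", "q4"]

print("task-level accuracy:", loo_centroid_accuracy(toy_vectors, task_labels))
print("question-level accuracy:", loo_centroid_accuracy(toy_vectors, question_labels))
```

On this toy layout the task-level probe is perfect while the question-level probe fails on every point. A large gap of that kind on real hidden states would be consistent with the pattern the audit describes, though a nearest-centroid probe is only one of many possible choices.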

The gap persisted across model sizes and training procedures, which suggests a structural property rather than a scaling or optimization artifact.

Benchmark success may not reflect internal reasoning capability

The conventional narrative holds that chain-of-thought prompting and reasoning-optimized training elicit richer internal representations. This study suggests otherwise: models may be outputting better answers without developing materially better internal thoughts. They are processing input more effectively, not thinking harder.

For practitioners building systems that depend on reasoning (code synthesis, mathematics, multi-step planning), this is consequential. Fine-tuning a model to score higher on a reasoning benchmark tells you the output improved; it does not tell you the internal representation improved. Scaling up retrieval context or adding more reasoning steps may yield diminishing returns if the bottleneck lies in the representation rather than in model capacity.

The finding also raises a practical question: if representations are thin across all tested families, incremental improvements to training or prompting may not be the lever. Architectural change may be necessary.

Audit reasoning models before deployment at scale

Do not assume that a high benchmark score means your model is reasoning internally. Before committing to chain-of-thought orchestration, retrieval augmentation, or reasoning-focused fine-tuning in production, instrument your candidate model to check representation quality directly. The four axioms (causality, minimality, separability, stability) are measurable without gold-label test sets; you can apply them to your own domain.
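As a starting point for a label-free check in your own domain, the sketch below computes a simple stability proxy: how much closer a representation sits to paraphrases of the same input than to unrelated inputs. The function names and the "gap" formulation are assumptions for illustration, not the paper's definition, and the vectors are expected to be hidden states you have already extracted from your model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def stability_gap(anchor, related, unrelated):
    """Hypothetical proxy: mean similarity of the anchor to related inputs
    (e.g. paraphrases) minus its mean similarity to unrelated inputs.

    A gap near zero would suggest the representation does not treat
    related inputs as any closer than arbitrary ones.
    """
    sim_related = sum(cosine_similarity(anchor, r) for r in related) / len(related)
    sim_unrelated = sum(cosine_similarity(anchor, x) for x in unrelated) / len(unrelated)
    return sim_related - sim_unrelated
```

A check like this needs only paired inputs you can write yourself (a prompt, a few rewordings, a few unrelated prompts), which is what makes it usable without a gold-label test set.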

If your model fails the separability axiom (that is, it cannot distinguish between two questions in the same domain), adding more context or longer prompts is unlikely to solve the problem. You may need to retrain or adopt a different architecture. Know this before you scale.

