LLM agents fail to update facts mid-conversation once memory is bounded

Frontier models fail to track updated facts without external memory

Researchers isolated a specific, measurable failure in LLM agents: the inability to discard superseded facts and answer from current values instead. On the knowledge-update subset of LongMemEval (real conversational data), gpt-5.4 dropped from 92% accuracy with full context to 77% when forced to maintain bounded, self-managed memory (paired McNemar test, p<0.005).
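For readers who want to run the same kind of paired comparison on their own evaluation, here is a minimal sketch of an exact two-sided McNemar test. It assumes you have one correct/incorrect outcome per question under each condition. The function names and the toy counts below are illustrative assumptions, not the researchers' code or data.

```python
from math import comb


def discordant_counts(full_ctx, bounded):
    """Count questions where exactly one of the two conditions was correct.

    full_ctx, bounded: equal-length sequences of booleans, one entry per
    question, in the same question order (this pairing is what makes the
    test a paired one).
    """
    b = sum(1 for f, m in zip(full_ctx, bounded) if f and not m)
    c = sum(1 for f, m in zip(full_ctx, bounded) if m and not f)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis the discordant pairs split 50/50, so the
    smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration only: 14 questions correct only with
# full context, 2 correct only with bounded memory.
p_value = mcnemar_exact(14, 2)
print(f"p = {p_value:.4f}")
```

Only the questions where the two conditions disagree carry information in this test, which is why it suits a before/after comparison on the same question set.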

The failure is not a compression problem. When researchers extended conversations 24x and granted agents proportionally more memory, accuracy still fell (68% to 28%), then stayed flat at 28% (n=25 trials). The bottleneck scales with conversation length, not memory ratio. Stronger models do not close the gap: it persists across model scale, while full-context accuracy saturates near 92%.

The researchers then built Supersede, an open reinforcement-learning environment on the verifiers/prime-rl stack that turns this measurement into a training signal. Agents are rewarded for answering from current fact values and penalized for stale ones. Fine-tuning Qwen2.5-3B (a small open model) on this environment nearly doubled held-out supersession accuracy on unseen conversations, from 9.0% to 16.7% (single run). Checkpoint gains were monotonic, which suggests the learned policy, not the training harness, carried the improvement.
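To make the reward idea concrete, here is a minimal sketch of a scoring function that rewards an answer for using the current value and penalizes it for using a stale one. This is not the Supersede reward and does not use the verifiers or prime-rl APIs. The function name, the +1/-1/0 magnitudes, and the substring matching are all assumptions chosen for illustration.

```python
def supersession_reward(answer, current, stale_values):
    """Score one answer against the current and superseded values of a fact.

    Returns +1.0 when the answer states only the current value, -1.0 when it
    states a stale value and not the current one, and 0.0 otherwise (no
    value given, or both given). Magnitudes are illustrative assumptions.

    Matching is a plain case-insensitive substring check, so values that are
    substrings of each other (for example "12" and "120") need a stricter
    matcher in practice.
    """
    text = answer.lower()
    has_current = current.lower() in text
    has_stale = any(value.lower() in text for value in stale_values)

    if has_current and not has_stale:
        return 1.0
    if has_stale and not has_current:
        return -1.0
    return 0.0


# Hypothetical example: the shipping address changed mid-conversation.
print(supersession_reward("Ship it to 42 Elm Street.", "42 Elm Street", ["17 Oak Avenue"]))
print(supersession_reward("Ship it to 17 Oak Avenue.", "42 Elm Street", ["17 Oak Avenue"]))
```

Scoring an answer that mentions both values as 0.0 is one possible design choice; it avoids rewarding an agent that hedges by listing every value it has seen.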

This is the first trainable solution to temporal fact-currency in agents

The gap is not a data problem or a scale problem. It is a behavioral one: models trained on next-token prediction have no built-in mechanism to flag when a fact should be discarded. Full context windows hide the deficit; they do not solve it. Any agent system relying on self-managed memory (long-horizon planning, multi-session customer interactions, conversational databases) will encounter this failure as context limits force trade-offs.

Prior work measured the failure. This work demonstrates that the missing behavior can be trained. That distinction matters. A team deploying a customer-service agent or plan-revision system can now measure the gap on their own data, then use open-source training infrastructure to close it, rather than accepting it as a model limitation.

Audit your agent's memory behavior on multi-fact updates

If you are deploying agents for conversations where facts change (e-commerce, customer support, planning), run a red-team test: inject a price update, address change, or plan revision mid-conversation and check whether the agent uses the old or new value on follow-up queries. If it fails more than ~15% of the time, the gap is real for your use case.
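The red-team check described above can be sketched as a small harness. It assumes your agent can be wrapped as a function that takes the list of conversation turns and returns a reply string. The `UpdateCase` type, the `failure_rate` function, and the two stub agents are hypothetical names introduced for this example, and the substring check is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UpdateCase:
    turns: List[str]  # conversation so far, including the injected update
    question: str     # follow-up query about the updated fact
    old: str          # superseded value
    new: str          # current value


def failure_rate(agent: Callable[[List[str]], str], cases: List[UpdateCase]) -> float:
    """Fraction of cases where the reply uses the old value or omits the new one."""
    if not cases:
        return 0.0
    failures = 0
    for case in cases:
        reply = agent(case.turns + [case.question]).lower()
        uses_new = case.new.lower() in reply
        uses_old = case.old.lower() in reply
        if uses_old or not uses_new:
            failures += 1
    return failures / len(cases)


# Stub agents standing in for a real deployment (assumptions for the demo).
def echo_first_turn(turns: List[str]) -> str:
    return turns[0]   # always repeats the original, now stale, statement


def echo_latest_fact(turns: List[str]) -> str:
    return turns[-2]  # repeats the turn just before the question


cases = [
    UpdateCase(["The price is $40.", "Update: the price is now $35."],
               "What is the price?", old="$40", new="$35"),
    UpdateCase(["Ship to 17 Oak Avenue.", "Actually, ship to 42 Elm Street instead."],
               "Where should the order ship?", old="17 Oak Avenue", new="42 Elm Street"),
]

THRESHOLD = 0.15  # the ~15% rule of thumb from the text
for name, agent in [("echo_first_turn", echo_first_turn), ("echo_latest_fact", echo_latest_fact)]:
    rate = failure_rate(agent, cases)
    print(name, rate, "gap is real" if rate > THRESHOLD else "ok")
```

A real audit would need many more cases than this, and ideally updates injected at varying distances from the follow-up question, since the reported failure grows with conversation length.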

Supersede code, environment, model, and dataset are open. If your agent fails the red-team test, evaluate the cost of fine-tuning a small model on this environment against the cost of adding external memory infrastructure or accepting the degradation.
