Analysis · June 29, 2026 · 2 min read

LLM agents fail to update facts mid-conversation, even with full context

Frontier models drop from 92% to 77% accuracy when forced to maintain their own memory of changing facts across long conversations. Researchers released a trainable environment to fix the gap.

Our Take

The problem is not comprehension or model scale. Agents don't know when to forget old facts, and more context won't teach them.

Why it matters

Any LLM agent handling multi-turn conversations where facts change (addresses, prices, plan revisions) is exposed to this failure today. This is the first evidence that the gap can be narrowed through training, rather than only measured.

Do this week

Enterprise AI teams: test your agent's behavior when a customer updates their address or price mid-conversation, then evaluate whether retraining on Supersede-style reward signals is cheaper than enforcing external memory.

Frontier models fail to track updated facts without external memory

Researchers isolated a specific, measurable failure in LLM agents: the inability to discard superseded facts and answer from current values instead. On the knowledge-update subset of LongMemEval (real conversational data), gpt-5.4 dropped from 92% accuracy with full context to 77% when forced to maintain bounded, self-managed memory (paired McNemar test, p<0.005).
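For readers who want to run the same kind of paired comparison on their own evaluation set, here is a minimal sketch of an exact two-sided McNemar test in plain Python. It assumes each question was answered under both conditions (full context and bounded memory) and scored as correct or incorrect. The function names are illustrative, and this is not the researchers' analysis code.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(full: Sequence[bool], bounded: Sequence[bool]) -> Tuple[int, int]:
    """Count the pairs where the two conditions disagree.

    b: correct with full context, wrong with bounded memory
    c: wrong with full context, correct with bounded memory
    """
    if len(full) != len(bounded):
        raise ValueError("paired outcomes must have the same length")
    b = sum(1 for f, m in zip(full, bounded) if f and not m)
    c = sum(1 for f, m in zip(full, bounded) if not f and m)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant pair is equally likely
    to fall in either direction, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant pairs matter: questions that both conditions get right, or both get wrong, say nothing about which condition is better.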

The failure is not a compression problem. When researchers extended conversations 24x and granted agents proportionally more memory, accuracy still fell further (68% to 28%) and then stayed flat at 28% (n=25 trials). The bottleneck scales with conversation length, not with the ratio of memory to conversation. Stronger models do not close the gap: it persists across model scale even as full-context accuracy saturates near 92%.

The researchers then built Supersede, an open reinforcement-learning environment on the verifiers/prime-rl stack that turns this measurement into a training signal. Agents are rewarded for answering from current fact values and penalized for answering from stale ones. Fine-tuning Qwen2.5-3B (a small open model) on this environment nearly doubled held-out supersession accuracy on unseen conversations, from 9.0% to 16.7% (single run). Checkpoint gains were monotonic, indicating that the learned policy, not the training harness, carried the improvement.
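To make the reward idea concrete, here is a minimal sketch of a scoring function that rewards the current value and penalizes a stale one. It is an illustration of the principle only: the function name, the +1/0/-1 scale, and the substring matching are assumptions, and it is not Supersede's actual reward or the verifiers API.

```python
from typing import Sequence


def supersession_reward(answer: str, current: str, stale: Sequence[str]) -> float:
    """Hypothetical reward: +1 for the current value, -1 for a stale one, 0 otherwise.

    Matching is a case-insensitive substring check, which is crude but
    enough to show the shape of the signal.
    """
    text = answer.casefold()
    if current.casefold() in text:
        # An answer that states the current value is rewarded, even if it
        # also mentions the old value in passing ("you moved from X to Y").
        return 1.0
    if any(old.casefold() in text for old in stale):
        # The agent answered from a superseded fact.
        return -1.0
    # Neither value appears: no credit, but no stale-fact penalty either.
    return 0.0
```

A real environment would likely need stricter answer parsing than substring matching, since a value such as "9" would match inside "19".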

This is the first trainable solution to temporal fact-currency in agents

The gap is not a data problem or a scale problem. It is a behavioral one: models trained on next-token prediction have no built-in mechanism to flag when a fact should be discarded. External context windows hide the deficit; they do not solve it. Any agent system relying on self-managed memory (long-horizon planning, multi-session customer interactions, conversational databases) will encounter this failure as context limits force trade-offs.

Prior work measured the failure. This work demonstrates that it can be trained away, at least in part, and that distinction matters. A team deploying a customer-service agent or plan-revision system can now measure the gap on its own data, then use open-source training infrastructure to narrow it, rather than accepting the gap as a model limitation.

Audit your agent's memory behavior on multi-fact updates

If you are deploying agents for conversations where facts change (e-commerce, customer support, planning), run a red-team test: inject a price update, address change, or plan revision mid-conversation and check whether the agent uses the old or new value on follow-up queries. If it fails more than ~15% of the time, the gap is real for your use case.
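The red-team test described above can be sketched as a small harness. Everything here is hypothetical scaffolding: `UpdateCase`, the `Agent` callable signature (a list of user messages in, a reply string out), and the substring check are assumptions to adapt to your own agent interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical agent interface: takes the user messages so far, returns a reply.
Agent = Callable[[Sequence[str]], str]


@dataclass(frozen=True)
class UpdateCase:
    turns: Tuple[str, ...]  # conversation turns, including the original fact and the update
    question: str           # follow-up query that depends on the updated fact
    current: str            # value the agent should use
    stale: str              # superseded value it should not use


def audit(agent: Agent, cases: Sequence[UpdateCase]) -> Tuple[float, float]:
    """Return (failure_rate, stale_rate) over the test cases.

    A case fails when the reply lacks the current value; it counts as
    stale when the reply also contains the superseded value.
    """
    if not cases:
        raise ValueError("need at least one test case")
    failures = 0
    stale_hits = 0
    for case in cases:
        reply = agent([*case.turns, case.question]).casefold()
        if case.current.casefold() not in reply:
            failures += 1
            if case.stale.casefold() in reply:
                stale_hits += 1
    return failures / len(cases), stale_hits / len(cases)
```

Comparing the first returned value against the roughly 15% threshold gives a quick pass or fail signal, while the second shows how many failures come specifically from answering with the old value.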

The Supersede code, environment, model, and dataset are openly available. If your agent fails the red-team test, weigh the cost of fine-tuning a small model on this environment against the cost of adding external memory infrastructure or accepting the degradation.
