A 90-year-old brain test exposes the models' attention ceiling

Developer

Wednesday, June 10, 2026

Signal

Medium

Time horizon

This quarter

Risk

Long-running agents degrading silently on extended tasks

The News

Researchers Suketu Chandrakant Patel, Hongbin Wang and Jin Fan published "Deficient executive control in transformer attention" in PNAS Nexus (June 10). They ran leading models (GPT-4o, Claude 3.5 Sonnet, GPT-5, Claude Opus 4.1 and Gemini 2.5) through the Stroop task, the classic psychology test of naming an ink color while ignoring a conflicting written word. Accuracy held at short lengths and then collapsed as the number of items grew: GPT-4o scored 91% at five items, 57% at ten, and 15% at forty; Claude 3.5 Sonnet stayed stable through twenty items before falling to 24% at forty. When mismatched color words were clustered together, some models dropped to "nearly zero." The paper frames this as a deficit in executive control: the ability to sustain focus and resist interference over an extended sequence.

The Read

This is the empirical floor under story 01's ceiling, and the two belong in the same issue. The benchmark sheet says the frontier can write 80% of a SWE-Bench Pro task; this study says the same class of system loses the thread when a task runs long and is full of conflicting signals, which is exactly the shape of a real agentic workflow, not a benchmark prompt. The practical translation for anyone shipping long-horizon agents: degradation isn't a cliff you'll see in a demo. It's a slope that sets in around the point where the context fills with competing instructions, and the agent quietly starts getting things wrong without flagging it. The study is peer-reviewed and vendor-independent, and it measures the failure mode that the launch-day benchmark explicitly doesn't. Read it as the counterweight before you let an agent run unsupervised across a forty-step process.

Do This Week

For every production agent you run unattended past ~10 sequential steps, add one control: a forced checkpoint where the agent's state is summarized and re-grounded before it continues. The Stroop result says the failure is driven by length and interference, so the cheapest mitigation is to cap the uninterrupted run length. Make it a one-line item in your next agent-deployment review.
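A minimal sketch of what such a forced checkpoint could look like in an agent loop. Everything here is hypothetical and not taken from the paper or any agent framework: the `AgentState` type, the `run_with_checkpoints` function, the `execute_step` and `summarize` callbacks, and the cap of 10 (borrowed from the rule of thumb above) are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed cap, taken from the "~10 sequential steps" rule of thumb; tune per agent.
MAX_UNINTERRUPTED_STEPS = 10


@dataclass
class AgentState:
    goal: str                                        # original task statement, never rewritten
    notes: List[str] = field(default_factory=list)   # working context that grows with each step
    checkpoints: int = 0                             # how many forced checkpoints have fired


def run_with_checkpoints(
    goal: str,
    steps: List[str],
    execute_step: Callable[[AgentState, str], str],  # hypothetical: runs one step, returns its result
    summarize: Callable[[List[str]], str],           # hypothetical: compresses notes into one summary
    max_run: int = MAX_UNINTERRUPTED_STEPS,
) -> AgentState:
    """Run steps in order, forcing a checkpoint after every `max_run` uninterrupted steps."""
    state = AgentState(goal=goal)
    since_checkpoint = 0
    for step in steps:
        if since_checkpoint >= max_run:
            # Forced checkpoint: replace the accumulated context with the
            # restated goal (re-grounding) plus a summary of progress so far.
            state.notes = [f"Goal: {state.goal}", summarize(state.notes)]
            state.checkpoints += 1
            since_checkpoint = 0
        state.notes.append(execute_step(state, step))
        since_checkpoint += 1
    return state
```

In a real deployment, `execute_step` would wrap the model or tool call and `summarize` could be a second model call or a deterministic digest. The design choice that matters is that the checkpoint is triggered by step count, not by the agent's own judgment, since the failure described above is one the agent does not flag.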

For Founder