Our Take
NVIDIA correctly diagnoses agent economics, but its solution is selling you its new platform.
Why it matters
Teams building agents face token costs up to 15x higher than chatbots, and current serving infrastructure can't sustain those economics at interactive speeds.
Do this week
Infrastructure teams: benchmark your agent workloads' cache hit rates to determine whether context management is driving your token costs.
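A minimal sketch of that measurement, assuming you log the per-request usage objects that Anthropic's Messages API returns when prompt caching is enabled (the cache_read_input_tokens and cache_creation_input_tokens counters); other providers expose equivalent fields under different names:

```python
# Sketch: estimate cache hit rate from logged per-request usage records.
# Assumes each record mirrors Anthropic's Messages API usage object with
# prompt caching enabled; adapt the field names for other providers.

def cache_hit_rate(usage_records: list[dict]) -> float:
    """Fraction of input tokens served from cache across a session."""
    cached = sum(r.get("cache_read_input_tokens", 0) for r in usage_records)
    uncached = sum(
        r.get("input_tokens", 0) + r.get("cache_creation_input_tokens", 0)
        for r in usage_records
    )
    total = cached + uncached
    return cached / total if total else 0.0

# Hypothetical three-request session log for illustration.
session = [
    {"input_tokens": 1_200, "cache_creation_input_tokens": 14_000,
     "cache_read_input_tokens": 0},
    {"input_tokens": 900, "cache_creation_input_tokens": 2_000,
     "cache_read_input_tokens": 15_200},
    {"input_tokens": 700, "cache_creation_input_tokens": 1_500,
     "cache_read_input_tokens": 18_100},
]
print(f"cache hit rate: {cache_hit_rate(session):.1%}")
```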
NVIDIA maps agent token consumption patterns
NVIDIA analyzed a 33-minute Claude Code session that issued 283 inference requests across a main agent and 225 sub-agent invocations. The context window grew from 15K tokens to 156K before compacting back to 20K. Anthropic estimates such multi-agent systems consume up to 15x more tokens than standard chat.
The session data reveals three distinct consumption patterns. Primary agents accumulate context quickly, averaging 85K tokens across the first 40 turns and processing 3.5 million input tokens before compaction. Sub-agents start with fresh context windows but add to total output volume. Context compaction then forces sharp drops in context size to avoid hitting window limits and to contain costs.
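A toy simulation of that growth-and-compaction pattern; the trigger and post-compaction sizes below are assumptions chosen to echo the measured session (156K dropping to 20K), not values from NVIDIA's data:

```python
# Illustrative model of the context-growth pattern NVIDIA describes:
# a primary agent accumulates tokens each turn until compaction
# collapses the window near the limit. Constants are assumptions.

COMPACTION_TRIGGER = 150_000   # assumed: compact when context nears limit
COMPACTED_SIZE = 20_000        # post-compaction size seen in the session

def simulate_context(turn_growth: list[int], start: int = 15_000) -> list[int]:
    """Return context size after each turn, compacting at the trigger."""
    sizes, ctx = [], start
    for growth in turn_growth:
        ctx += growth
        if ctx >= COMPACTION_TRIGGER:
            ctx = COMPACTED_SIZE  # a summary replaces accumulated history
        sizes.append(ctx)
    return sizes

# 40 turns at an assumed 4K tokens of accumulation per turn.
print(simulate_context([4_000] * 40)[-5:])
```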
Prompt caching becomes critical at these scales. The measured session sustained 95-98% cache hit rates, reducing input processing costs by roughly 85% compared to full reprocessing. Without caching, costs would be approximately 6x higher (per NVIDIA's analysis).
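The arithmetic is easy to check. The sketch below uses Anthropic's published prompt-caching price ratios (cache reads at roughly 0.1x the base input rate, cache writes at roughly 1.25x) as assumptions to verify for your model; at the session's measured hit rates it lands close to the ~85% savings and ~6x multiple:

```python
# Back-of-envelope check on the caching claim. Price ratios follow
# Anthropic's published prompt-caching pricing; treat both multipliers
# as assumptions to confirm for your provider and model.

CACHE_READ_MULT = 0.10    # cost of a cached input token vs. base rate
CACHE_WRITE_MULT = 1.25   # premium on tokens written into the cache

def effective_input_cost(hit_rate: float) -> float:
    """Blended input cost per token, relative to the uncached base rate."""
    return hit_rate * CACHE_READ_MULT + (1 - hit_rate) * CACHE_WRITE_MULT

for hit in (0.95, 0.96, 0.98):
    cost = effective_input_cost(hit)
    print(f"hit={hit:.0%}: {1 - cost:.0%} cheaper, {1 / cost:.1f}x vs uncached")
```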
Standard serving economics break under agent loads
Agent workloads create a performance bottleneck that standard infrastructure can't solve economically: these systems need high interactivity (low latency per user) while processing massive token volumes. Traditional serving optimizes for either high throughput with poor interactivity or low latency with collapsed throughput.
The economic constraint is real. Agents must operate on the high-interactivity side of the performance curve where throughput drops dramatically, making per-token costs prohibitive. The token consumption isn't just 15x higher in volume but structurally different: agents spawn unpredictable chains of tool calls and sub-agents rather than following linear chat patterns.
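To see why the curve bites, consider a toy cost model. The frontier shape (aggregate throughput collapsing as per-user token rate rises) and every constant here are illustrative assumptions, not measured serving data:

```python
# Toy model of the throughput/interactivity tradeoff. All constants and
# the decay curve are assumptions for illustration only.

GPU_HOUR_COST = 3.00          # assumed $/GPU-hour
PEAK_THROUGHPUT = 20_000      # assumed aggregate tokens/s at max batching
MAX_INTERACTIVITY = 200       # assumed tokens/s/user at batch size 1

def system_throughput(tokens_per_sec_per_user: float) -> float:
    """Aggregate tokens/s; decays toward zero as interactivity demands grow."""
    u = min(tokens_per_sec_per_user / MAX_INTERACTIVITY, 1.0)
    return PEAK_THROUGHPUT * (1 - u) ** 2  # convex collapse near the knee

def cost_per_million_tokens(tokens_per_sec_per_user: float) -> float:
    tput = system_throughput(tokens_per_sec_per_user)
    return GPU_HOUR_COST / 3600 / tput * 1e6

for rate in (10, 50, 100, 150):
    print(f"{rate:>3} tok/s/user -> ${cost_per_million_tokens(rate):.2f}/M tokens")
```

Even in this crude model, pushing interactivity from 10 to 150 tokens/s per user raises per-token cost by more than an order of magnitude.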
Context management becomes a systems problem rather than just an API feature. Sustaining high cache hit rates requires CPU-side KV cache management and high-capacity context storage to preserve long prefixes as sessions scale across multiple agents.
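A quick sizing sketch shows why. Under a hypothetical 70B-class configuration with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16; substitute your model's real shape), the session's 156K-token context alone implies tens of gigabytes of KV state per agent:

```python
# Sizing sketch for why long-prefix KV caches spill to CPU/storage tiers.
# Model shape is an assumed 70B-class config with grouped-query attention.

LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2  # fp16

def kv_cache_bytes(tokens: int) -> int:
    """KV cache size: keys + values across all layers for one sequence."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * tokens

for ctx in (15_000, 156_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB of KV cache")
```

Multiply that by concurrent sub-agents sharing a session and the case for dedicated context-storage tiers follows directly.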
Infrastructure demands shift to specialized hardware
NVIDIA positions its Vera Rubin NVL72 platform as the solution, claiming one-tenth the cost per million tokens of Blackwell for long-context workloads. The platform combines high-bandwidth memory for context storage with specialized processors for different phases of inference.
The analysis identifies specific bottlenecks: network and memory-system behavior directly affect user-perceived latency once cached tokens dominate processing. Multi-agent sessions require low-latency fabrics to keep shared context accessible and reduce recomputation penalties.
Teams running agents should audit their context patterns now. Track cache hit rates, measure context growth curves, and identify compaction triggers. The 15x token multiplier isn't uniform: it depends on task complexity, sub-agent spawning patterns, and context management strategies. Understanding your specific consumption profile is what determines whether a shift to specialized serving infrastructure is justified.
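A starting point for that audit, assuming you already log prompt token counts per request; the 50% drop threshold is an arbitrary heuristic to tune against your own traces:

```python
# Audit sketch: find compaction events in per-request context sizes.
# Assumes you log prompt token counts per request; the threshold is an
# assumed heuristic, not a value from NVIDIA's or Anthropic's data.

COMPACTION_DROP = 0.5  # assumed: a >50% drop marks a compaction event

def find_compactions(prompt_tokens: list[int]) -> list[int]:
    """Indices where context size fell sharply between consecutive requests."""
    return [
        i for i in range(1, len(prompt_tokens))
        if prompt_tokens[i] < prompt_tokens[i - 1] * COMPACTION_DROP
    ]

# Hypothetical trace echoing the session's 156K -> 20K compaction.
trace = [15_000, 48_000, 85_000, 120_000, 156_000, 20_000, 41_000]
print("compaction at request indices:", find_compactions(trace))
```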