Our Take
Real engineering work on the plumbing that makes agents usable, with measurable latency wins that matter more than the marketing would suggest.
Why it matters
Agent workflows break when reasoning gets dropped between turns or tool calls arrive in batches instead of streaming. Production deployments need this reliability layer.
Do this week
Infrastructure teams: audit your agent serving stack for reasoning preservation and streaming tool dispatch before your next production rollout.
NVIDIA Dynamo Gets Agent-Specific Parsing
NVIDIA released updates to Dynamo, its open-source distributed inference serving framework, specifically targeting multi-turn agent workflows. The changes address three core problems: unstable prompt prefixes that break KV cache reuse, reasoning context that gets dropped between agent turns, and tool calls that arrive in batches instead of streaming as they decode.
The most concrete fix targets Anthropic billing headers. Claude Code sends a session-specific header ("x-anthropic-billing-header: cc_version=0.2.93; cch=abc123def456==") that varies per session, poisoning KV cache reuse. NVIDIA's solution strips these headers before tokenization. On a 52K-token prompt deployment, this reduced time-to-first-token from 912ms to 169ms (company-reported), a 5x improvement.
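The header-stripping idea can be sketched in a few lines. This is an illustration, not Dynamo's implementation: the header name follows the example above, but the regex and function name are our own.

```python
import re

# Per-session billing headers make otherwise-identical prompts tokenize
# to different prefixes, defeating KV cache reuse. Stripping them before
# tokenization restores a stable, shared prefix.
UNSTABLE_HEADER = re.compile(r"x-anthropic-billing-header:[^\n]*\n?", re.IGNORECASE)

def stabilize_prompt_prefix(raw_prompt: str) -> str:
    """Remove session-specific headers so identical prompts map to
    identical token prefixes (illustrative helper, not Dynamo's API)."""
    return UNSTABLE_HEADER.sub("", raw_prompt)

# Two sessions that differ only in the billing header now share a prefix:
session_a = ("x-anthropic-billing-header: cc_version=0.2.93; cch=abc123def456==\n"
             "System: You are a coding agent.")
session_b = ("x-anthropic-billing-header: cc_version=0.2.93; cch=zzz999==\n"
             "System: You are a coding agent.")
```

After stripping, both prompts reduce to the same string, so the second session hits the cache instead of paying a cold prefill.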
The reasoning preservation fix addresses how models like Nemotron handle interleaved thinking and tool calls. Previous versions would group all reasoning together, then all tool calls, losing the sequence where specific reasoning explains specific tool actions. The new parser maintains the original interleaved structure: reasoning_0 → tool_call_0 → reasoning_1 → tool_call_1.
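A minimal sketch of order-preserving parsing, with made-up tag names (real models use model-specific markers that the reasoning parser understands):

```python
import re
from typing import List, Tuple

# Match reasoning and tool-call segments wherever they appear, so the
# output keeps the original interleaving instead of grouping by kind.
SEGMENT = re.compile(r"<think>(.*?)</think>|<tool_call>(.*?)</tool_call>", re.S)

def parse_interleaved(text: str) -> List[Tuple[str, str]]:
    """Return (kind, content) pairs in decode order:
    reasoning_0, tool_call_0, reasoning_1, tool_call_1, ..."""
    segments = []
    for m in SEGMENT.finditer(text):
        if m.group(1) is not None:
            segments.append(("reasoning", m.group(1).strip()))
        else:
            segments.append(("tool_call", m.group(2).strip()))
    return segments

turn = ("<think>Need the file list first</think>"
        "<tool_call>ls()</tool_call>"
        "<think>Now read the config</think>"
        "<tool_call>read('cfg')</tool_call>")
```

The earlier behavior would have returned both reasoning blocks, then both tool calls, losing which thought justified which action.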
Agent Reliability Depends On Context Plumbing
Agent workflows fail when the inference layer doesn't match the interaction model. Reasoning that explains why the agent chose a specific tool needs to persist into the next turn, but reasoning from casual chat turns often should be dropped to save context. Getting this wrong breaks agent decision-making in ways that are hard to debug.
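One simple retention rule consistent with the distinction above: keep reasoning attached to turns that issued tool calls, drop it from plain chat turns. This is our own policy sketch, not Dynamo's exact logic.

```python
# Prune history before re-sending it to the model: reasoning that
# explains a tool choice persists; casual-chat reasoning is dropped
# to save context (illustrative policy, not Dynamo's implementation).
def prune_history(turns):
    pruned = []
    for turn in turns:
        turn = dict(turn)  # don't mutate the caller's history
        if not turn.get("tool_calls"):
            turn.pop("reasoning_content", None)
        pruned.append(turn)
    return pruned

history = [
    {"role": "assistant", "content": "Hi!", "reasoning_content": "greet user"},
    {"role": "assistant", "content": "",
     "reasoning_content": "must grep before editing",
     "tool_calls": [{"name": "grep", "args": {"pattern": "TODO"}}]},
]
```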
The streaming issue matters for user experience. Tool calls that arrive only after the entire response completes make agents feel slow and unresponsive. Streaming tool dispatch lets complete tool calls start executing as soon as they decode, rather than waiting for the full turn to finish.
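Streaming dispatch can be sketched as a generator over decoded chunks that emits each tool call the moment its closing marker appears. The markers here are illustrative, not Dynamo's wire format.

```python
# Yield complete tool calls as they decode, instead of buffering the
# whole turn. Chunk boundaries may split a marker, so we accumulate
# and scan for complete <tool_call>...</tool_call> spans.
def stream_tool_calls(chunks):
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            start = buf.find("<tool_call>")
            end = buf.find("</tool_call>")
            if start == -1 or end == -1:
                break
            yield buf[start + len("<tool_call>"):end].strip()
            buf = buf[end + len("</tool_call>"):]

# A marker split across decode chunks still dispatches as soon as
# the call completes:
decoded = ["<tool_", "call>ls()</tool_call><tool_call>cat('a')", "</tool_call>"]
```

The first call (`ls()`) is yielded after the second chunk, before the turn finishes decoding, so the agent can start executing it immediately.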
The KV cache poisoning problem scales badly. Every new session with an unstable prefix forces a cold prefill instead of reusing cached computation. At roughly 744ms of added latency per request (company-reported), this quickly becomes a capacity and cost problem for production deployments.
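A back-of-envelope using the reported TTFT numbers shows why this compounds. The request rate below is a hypothetical load, not a reported figure.

```python
# Company-reported TTFT on a 52K-token prompt deployment:
cold_ttft_ms = 912   # cold prefill (unstable prefix, no cache hit)
warm_ttft_ms = 169   # after header stripping restores cache reuse

# ~743ms of avoidable prefill per affected request (the quoted
# ~744ms figure matches this delta up to rounding).
overhead_ms = cold_ttft_ms - warm_ttft_ms

requests_per_min = 100  # hypothetical load for illustration
wasted_s_per_min = overhead_ms * requests_per_min / 1000
```

At this (modest) load, unstable prefixes burn over a minute of prefill compute every minute of wall-clock time, which is why this shows up as a capacity problem, not just a latency one.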
Configuration Flags Target Agent Use Cases
NVIDIA added specific flags for agent deployments: --enable-anthropic-api for Messages API compatibility, --strip-anthropic-preamble for cache stability, and --enable-streaming-tool-dispatch for responsive tool execution. Worker-side parsing uses --dyn-tool-call-parser and --dyn-reasoning-parser to handle model-specific reasoning behaviors.
The reasoning preservation logic now checks whether the active chat template understands reasoning_content directly. Templates like Nemotron and Qwen3 handle this natively. Others fall back to inserting thinking blocks into regular content or dropping them based on model policy.
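The fallback decision can be sketched as follows. Which templates accept reasoning_content natively, and the policy flag, are simplified here for illustration; the template set below is assumed from the text.

```python
# Templates that consume reasoning_content directly (per the text:
# Nemotron and Qwen3; set membership is illustrative).
NATIVE_REASONING_TEMPLATES = {"nemotron", "qwen3"}

def render_reasoning(template: str, reasoning: str, content: str,
                     drop_policy: bool = False) -> dict:
    """Decide where reasoning goes for a given chat template
    (sketch of the fallback logic, not Dynamo's implementation)."""
    if template in NATIVE_REASONING_TEMPLATES:
        # Template understands reasoning_content natively.
        return {"content": content, "reasoning_content": reasoning}
    if drop_policy:
        # Model policy says history thinking is discarded.
        return {"content": content}
    # Fall back to inlining a thinking block into regular content.
    return {"content": f"<think>{reasoning}</think>{content}"}
```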
For agent workflows, Dynamo automatically sets truncate_history_thinking=false when reasoning parsers are active, preserving the context agents need while keeping the default behavior for regular chat workloads.