Our Take
Real engineering work on the plumbing that makes agents usable, with measurable latency wins that matter more than the marketing would suggest.
Why it matters
Agent workflows break when reasoning gets dropped between turns or tool calls arrive in batches instead of streaming. Production deployments need this reliability layer.
Do this week
Infrastructure teams: audit your agent serving stack for reasoning preservation and streaming tool dispatch before your next production rollout.
NVIDIA Dynamo Gets Agent-Specific Parsing
NVIDIA released updates to Dynamo, its open-source distributed inference serving framework, specifically targeting multi-turn agent workflows. The changes address three core problems: unstable prompt prefixes that break KV cache reuse, reasoning context that gets dropped between agent turns, and tool calls that arrive in batches instead of streaming as they decode.
The most concrete fix targets Anthropic billing headers. Claude Code sends a session-specific header ("x-anthropic-billing-header: cc_version=0.2.93; cch=abc123def456==") that varies per session, poisoning KV cache reuse. NVIDIA's solution strips these headers before tokenization. On a 52K-token prompt deployment, this reduced time-to-first-token from 912ms to 169ms (company-reported), a 5x improvement.
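The header-stripping idea can be sketched in a few lines. This is an illustration, not Dynamo's implementation: the header name follows the example above, but the regex and function name are our own.

```python
import re

# Per-session billing headers make otherwise-identical prompts tokenize
# to different prefixes, defeating KV cache reuse. Stripping them before
# tokenization restores a stable, shared prefix.
UNSTABLE_HEADER = re.compile(r"x-anthropic-billing-header:[^\n]*\n?", re.IGNORECASE)

def stabilize_prompt_prefix(raw_prompt: str) -> str:
    """Remove session-specific headers so identical prompts map to
    identical token prefixes (illustrative helper, not Dynamo's API)."""
    return UNSTABLE_HEADER.sub("", raw_prompt)

# Two sessions that differ only in the billing header now share a prefix:
session_a = ("x-anthropic-billing-header: cc_version=0.2.93; cch=abc123def456==\n"
             "System: You are a coding agent.")
session_b = ("x-anthropic-billing-header: cc_version=0.2.93; cch=zzz999==\n"
             "System: You are a coding agent.")
```

After stripping, both prompts reduce to the same string, so the second session hits the cache instead of paying a cold prefill.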
The reasoning preservation fix addresses how models like Nemotron handle interleaved thinking and tool calls. Previous versions would group all reasoning together, then all tool calls, losing the sequence where specific reasoning explains specific tool actions. The new parser maintains the original interleaved structure: reasoning_0 → tool_call_0 → reasoning_1 → tool_call_1.
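A minimal sketch of order-preserving parsing, with made-up tag names (real models use model-specific markers that the reasoning parser understands):

```python
import re
from typing import List, Tuple

# Match reasoning and tool-call segments wherever they appear, so the
# output keeps the original interleaving instead of grouping by kind.
SEGMENT = re.compile(r"<think>(.*?)</think>|<tool_call>(.*?)</tool_call>", re.S)

def parse_interleaved(text: str) -> List[Tuple[str, str]]:
    """Return (kind, content) pairs in decode order:
    reasoning_0, tool_call_0, reasoning_1, tool_call_1, ..."""
    segments = []
    for m in SEGMENT.finditer(text):
        if m.group(1) is not None:
            segments.append(("reasoning", m.group(1).strip()))
        else:
            segments.append(("tool_call", m.group(2).strip()))
    return segments

turn = ("<think>Need the file list first</think>"
        "<tool_call>ls()</tool_call>"
        "<think>Now read the config</think>"
        "<tool_call>read('cfg')</tool_call>")
```

The earlier behavior would have returned both reasoning blocks, then both tool calls, losing which thought justified which action.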
Agent Reliability Depends On Context Plumbing
Agent workflows fail when the inference layer doesn't match the interaction model. Reasoning that explains why the agent chose a specific tool needs to persist into the next turn, but reasoning from casual chat turns often should be dropped to save context. Getting this wrong breaks agent decision-making in ways that are hard to debug.
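One simple retention rule consistent with the distinction above: keep reasoning attached to turns that issued tool calls, drop it from plain chat turns. This is our own policy sketch, not Dynamo's exact logic.

```python
# Prune history before re-sending it to the model: reasoning that
# explains a tool choice persists; casual-chat reasoning is dropped
# to save context (illustrative policy, not Dynamo's implementation).
def prune_history(turns):
    pruned = []
    for turn in turns:
        turn = dict(turn)  # don't mutate the caller's history
        if not turn.get("tool_calls"):
            turn.pop("reasoning_content", None)
        pruned.append(turn)
    return pruned

history = [
    {"role": "assistant", "content": "Hi!", "reasoning_content": "greet user"},
    {"role": "assistant", "content": "",
     "reasoning_content": "must grep before editing",
     "tool_calls": [{"name": "grep", "args": {"pattern": "TODO"}}]},
]
```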
The streaming issue matters for user experience. Tool calls that arrive only after the entire response completes make agents feel slow and unresponsive. Streaming tool dispatch lets complete tool calls start executing as soon as they decode, rather than waiting for the full turn to finish.
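Streaming dispatch can be sketched as a generator over decoded chunks that emits each tool call the moment its closing marker appears. The markers here are illustrative, not Dynamo's wire format.

```python
# Yield complete tool calls as they decode, instead of buffering the
# whole turn. Chunk boundaries may split a marker, so we accumulate
# and scan for complete <tool_call>...</tool_call> spans.
def stream_tool_calls(chunks):
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            start = buf.find("<tool_call>")
            end = buf.find("</tool_call>")
            if start == -1 or end == -1:
                break
            yield buf[start + len("<tool_call>"):end].strip()
            buf = buf[end + len("</tool_call>"):]

# A marker split across decode chunks still dispatches as soon as
# the call completes:
decoded = ["<tool_", "call>ls()</tool_call><tool_call>cat('a')", "</tool_call>"]
```

The first call (`ls()`) is yielded after the second chunk, before the turn finishes decoding, so the agent can start executing it immediately.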
The KV cache poisoning problem scales badly. Every new session with an unstable prefix forces a cold prefill instead of reusing cached computation. At roughly 744ms of added latency per request (company-reported), this quickly becomes a capacity and cost problem for production deployments.
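A back-of-envelope using the reported TTFT numbers shows why this compounds. The request rate below is a hypothetical load, not a reported figure.

```python
# Company-reported TTFT on a 52K-token prompt deployment:
cold_ttft_ms = 912   # cold prefill (unstable prefix, no cache hit)
warm_ttft_ms = 169   # after header stripping restores cache reuse

# ~743ms of avoidable prefill per affected request (the quoted
# ~744ms figure matches this delta up to rounding).
overhead_ms = cold_ttft_ms - warm_ttft_ms

requests_per_min = 100  # hypothetical load for illustration
wasted_s_per_min = overhead_ms * requests_per_min / 1000
```

At this (modest) load, unstable prefixes burn over a minute of prefill compute every minute of wall-clock time, which is why this shows up as a capacity problem, not just a latency one.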
Configuration Flags Target Agent Use Cases
NVIDIA added specific flags for agent deployments: --enable-anthropic-api for Messages API compatibility, --strip-anthropic-preamble for cache stability, and --enable-streaming-tool-dispatch for responsive tool execution. Worker-side parsing uses --dyn-tool-call-parser and --dyn-reasoning-parser to handle model-specific reasoning behaviors.
The reasoning preservation logic now checks whether the active chat template understands reasoning_content directly. Templates like Nemotron and Qwen3 handle this natively. Others fall back to inserting thinking blocks into regular content or dropping them based on model policy.
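The fallback decision can be sketched as follows. Which templates accept reasoning_content natively, and the policy flag, are simplified here for illustration; the template set below is assumed from the text.

```python
# Templates that consume reasoning_content directly (per the text:
# Nemotron and Qwen3; set membership is illustrative).
NATIVE_REASONING_TEMPLATES = {"nemotron", "qwen3"}

def render_reasoning(template: str, reasoning: str, content: str,
                     drop_policy: bool = False) -> dict:
    """Decide where reasoning goes for a given chat template
    (sketch of the fallback logic, not Dynamo's implementation)."""
    if template in NATIVE_REASONING_TEMPLATES:
        # Template understands reasoning_content natively.
        return {"content": content, "reasoning_content": reasoning}
    if drop_policy:
        # Model policy says history thinking is discarded.
        return {"content": content}
    # Fall back to inlining a thinking block into regular content.
    return {"content": f"<think>{reasoning}</think>{content}"}
```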
For agent workflows, Dynamo automatically sets truncate_history_thinking=false when reasoning parsers are active, preserving the context agents need while keeping the default behavior for regular chat workloads.