News · June 3, 2026 · 4 min read

2x faster agent inference on Windows PCs via NVIDIA and Microsoft tools

NVIDIA and Microsoft unveiled agent sandboxing, security primitives, and optimized inference backends at GTC Taipei and Build 2026. Developers can now run autonomous agents locally on RTX hardware with built-in containment.

Our Take

This is infrastructure work, not a capability leap: the security story (MXC + OpenShell) matters more than the 2x inference claim, which relies on vendor optimization, not a new technique.

Why it matters

Developers shipping agents to the 100 million NVIDIA RTX PCs worldwide now have a native path to containment and policy enforcement, which removes a real blocker for consumer and enterprise deployments. The timing matches the shift of agent prototyping from cloud to edge.

Do this week

Windows developers: test Microsoft eXecution Containers (MXC) with a pilot agent this week before locking your security architecture, because the API surface may narrow once adoption picks up.

NVIDIA and Microsoft ship agent sandboxing and inference optimizations for Windows

At GTC Taipei and Microsoft Build 2026, NVIDIA and Microsoft announced three layers of tooling for on-device agents on Windows.

Security and containment: Microsoft eXecution Containers (MXC) define isolation and policy enforcement using native Windows OS constructs. NVIDIA is wrapping MXC into OpenShell, a runtime that bundles policy creation, inference routing, and personally identifiable information (PII) obfuscation. Open source agents OpenClaw and Hermes Agent are adopting the stack.
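To make the PII obfuscation step concrete, here is a minimal sketch of redacting text before it is routed to a model. It is not OpenShell's implementation or API: the function name `obfuscate_pii` and both regular expressions are assumptions made up for illustration, and they cover only email addresses and phone-like digit runs.

```python
import re

# Hypothetical patterns for illustration only. Real PII detection needs far
# broader coverage (names, addresses, IDs); this is not OpenShell's logic.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def obfuscate_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(obfuscate_pii("Mail jane.doe@example.com or call +1 (555) 010-2345 today"))
```

A runtime that does this in front of inference routing means the model never sees the raw values, whichever backend ends up serving the request.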

Inference speed: NVIDIA collaborated with llama.cpp and vLLM maintainers to ship two optimizations. Multi-Token Prediction (MTP) is a speculative decoding technique in which a lightweight drafter (in MTP, usually extra prediction heads trained with the model rather than a separate network) proposes several tokens ahead and the target model verifies them in one forward pass. Programmatic Dependent Launch (PDL) lets a dependent kernel begin launching before the preceding kernel on the same CUDA stream has finished, overlapping work that would otherwise run strictly back to back. On Qwen 3.5 and 3.6 27B dense models, llama.cpp now delivers 2x throughput (per NVIDIA benchmarking). vLLM reports 2.6x improvements with additional BF16 kernel selection and reduced CUDA Graphs overhead.
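The draft-then-verify loop behind speculative decoding can be sketched with toy models. This is a simplified greedy variant under stated assumptions, not the llama.cpp or vLLM implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the target model and the drafter, and the verify phase calls the target once per position where a real engine would score all positions in one batched pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except after token 4."""
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the longest run the target agrees with,
    then append one token chosen by the target itself."""
    # 1. Draft phase: propose k tokens autoregressively with the cheap model.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. Verify phase: compare each proposal with the target's own choice.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target contributes a bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_tokens]


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], 12))
```

With greedy verification the output is identical to what the target would produce alone. The gain comes from each verification pass emitting several tokens whenever the drafter guesses correctly.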

Multi-GPU scaling: llama.cpp now supports tensor parallelism (TP), allowing two GPUs to function as a single compute unit, yielding up to 1.8x token generation performance and 2x memory capacity (company-reported). ComfyUI integrates Classifier-Free Guidance (CFG) across two GPUs for up to 2x compute.
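As a rough illustration of tensor parallelism, the sketch below shards one weight matrix by output columns across two notional devices and concatenates the partial results. It is pure Python with hypothetical helper names (`split_columns`, `tensor_parallel_matvec`), not llama.cpp's code. It shows why each GPU needs to hold only half the weights.

```python
from typing import List

Matrix = List[List[float]]


def matvec(x: List[float], w: Matrix) -> List[float]:
    """Compute x @ w for a row vector x and a (len(x) x n) matrix w."""
    n_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_out)]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Shard w by output columns; each shard would live on one GPU."""
    n_out = len(w[0])
    step = n_out // parts  # assumes n_out is divisible by parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matvec(x: List[float], w: Matrix, parts: int = 2) -> List[float]:
    """Column-parallel layer: every device sees x, owns a slice of w."""
    shards = split_columns(w, parts)
    # On real hardware these partial products run concurrently, one per GPU.
    partials = [matvec(x, shard) for shard in shards]
    # Gather step: concatenate the per-device outputs in shard order.
    return [value for part in partials for value in part]


if __name__ == "__main__":
    x = [1.0, 2.0]
    w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    print(tensor_parallel_matvec(x, w))
```

The gather step is communication between devices, which is likely one reason reported scaling (up to 1.8x) falls short of the 2x gain in memory capacity.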

Hardware and models: NVIDIA RTX Spark desktops and laptops deliver 1 petaflop of AI power and up to 128 GB memory. Microsoft is shipping a Surface RTX Spark Dev Box preloaded with developer tools. NVIDIA NemoClaw now supports all NVIDIA client systems (GeForce RTX, RTX PRO, DGX Spark, DGX Station) via Windows and WSL. Hermes Agent released native Windows desktop and CLI support. H Company released Holo 3.1 models tuned for Computer Use (agent screen-and-click interaction), with quantized checkpoints that use 35% less memory than FP8; NVIDIA optimization yields over 2x GPU performance.

Windows platform maturity: Windows AI Foundry and Windows AI APIs are now GPU-accelerated on RTX hardware. The first supported model is Phi-Silica, a 3.3B SLM for summarization and code generation. Windows ML and TensorRT for RTX adoption continues, with four recent adopters: Voicemod (42% faster voice conversion), Topaz (20% faster upscaling with 3-4x smaller storage), DxO PhotoLab 9.7 (faster photo processing), and Camo Streamlight (real-time light autotune). Windows Subsystem for Linux Containers (WSL-C) allows native Windows apps to invoke Linux containers without manual WSL setup.

Containment unlocks consumer and enterprise adoption

Agents that access personal files and apps are exposed to prompt injection. MXC + OpenShell addresses this blocker by enforcing system-level isolation, so developers do not have to implement custom sandboxing. That shifts agents from "prototype" to "deployable."

The inference gains (2x on llama.cpp, 2.6x on vLLM) are real but come from known techniques (MTP is established speculative decoding; PDL is kernel scheduling). The novelty is that NVIDIA and open source maintainers are co-optimizing for consumer hardware rather than cloud infrastructure. That matters because always-on agents running on local hardware face different constraints than batch inference.
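A back-of-envelope model shows why a roughly 2x gain is plausible for speculative decoding. The sketch assumes each drafted token is accepted independently with the same probability, a simplification from the speculative decoding literature. The function name and the acceptance rates are illustrative assumptions, not measurements from NVIDIA's benchmarks.

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability accept_rate; the target always contributes one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if accept_rate == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    for rate in (0.5, 0.8):  # illustrative acceptance rates only
        print(rate, round(expected_tokens_per_step(rate, 4), 4))
```

This counts tokens per verification pass, not wall-clock speedup. Drafting cost and verification overhead eat into it, so measured gains sit below this figure.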

The hardware narrative (RTX Spark, 100 million existing RTX PCs) is vendor-controlled, but the Windows platform ecosystem (WSL-C, TensorRT for RTX, Phi-Silica integration) is genuine infrastructure maturation. Developers now have multiple paths to GPU-accelerated inference on Windows without proprietary APIs.

Audit your agent containment before shipping to end users

If you are shipping agents on Windows, MXC is not optional. Test the OpenShell integration now and plan to ship with policy enforcement from day one. The installation and setup tooling is not finalized (NVIDIA notes installer "enhancements" in NemoClaw), so early adopters will hit friction; document your experience and feed it back to the maintainers.

For inference, pin to llama.cpp or vLLM versions that include MTP and PDL support, then measure your own throughput: NVIDIA's benchmarks cover specific quantized models, so your results will differ. Multi-GPU tensor parallelism in llama.cpp is stable in LM Studio, so if you have two RTX cards, enable it before running larger models.
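One way to measure your own throughput is a small timing harness like the sketch below. It is backend-agnostic and calls no llama.cpp or vLLM API. The `generate` callable is a hypothetical wrapper you would write around your own setup, returning the number of tokens one generation produced.

```python
import time
from statistics import median
from typing import Callable, List


def tokens_per_second(generate: Callable[[], int], runs: int = 5,
                      warmup: int = 1,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Median generation throughput over several timed runs.

    `generate` is a hypothetical callable that performs one full generation
    and returns how many tokens it produced.
    """
    for _ in range(warmup):
        generate()  # discard warm-up runs (model load, caches, graph capture)
    rates: List[float] = []
    for _ in range(runs):
        start = clock()
        n_tokens = generate()
        elapsed = clock() - start
        rates.append(n_tokens / elapsed)
    return median(rates)


def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """Ratio of optimized to baseline throughput (2.0 means twice as fast)."""
    return optimized_tps / baseline_tps
```

Run it once against the build you use today and once against the pinned build with MTP and PDL enabled, keeping prompt, model, and quantization identical, then compare the two figures with `speedup`.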

Ignore the RTX Spark hardware marketing. The value is in the toolchain, not the silicon. You can test the entire stack on your existing RTX GPU.
