NVIDIA Nemotron 3 Ultra cuts agent costs 30% with 5x faster inference

NVIDIA ships Nemotron 3 Ultra with 5x throughput and token-efficient reasoning

NVIDIA released Nemotron 3 Ultra, a 550B-parameter Mixture-of-Experts model with 55B active parameters, as an open-source foundation for long-running agent systems. The model is available immediately via Hugging Face, with optimized deployments on NVIDIA NIM, OpenRouter, and multiple cloud providers.

On agentic benchmarks, Nemotron 3 Ultra reports leading scores: 91% on PinchBench (agent productivity), 54% on Terminal-Bench 2.0 (coding), and 33% on EnterpriseOps-Gym (long-horizon planning). The company claims 5x higher inference throughput than comparable open models in its class (per Blackbox endpoint measurement) and 30% lower token count to task completion on SWE-bench and Terminal-Bench 2.0 (company-reported).

The model uses a hybrid Mamba-Transformer architecture, combining Mamba layers for efficient sequence handling with Transformer layers for precise fact retrieval. NVIDIA introduced LatentMoE for smarter expert routing across reasoning, code, and tool-calling tasks, and multi-token prediction to reduce generation time by predicting multiple tokens per forward pass.

Training relied on Multi-Teacher On-Policy Distillation (MOPD), where the model learned from 10+ specialized teacher models while generating its own attempts. Teachers scored outputs in their domain, and improvements were merged iteratively. The base 10T-token pre-training was expanded with 212B domain-specific tokens: 4B synthetic legal data, 35B Wiki-based data, and 173B refreshed GitHub code through September 2025.

NVIDIA is releasing 10M supervised fine-tuning samples, 1M RL tasks, and 15 new RL environments. Models can be fine-tuned via LoRA, full supervised training, or reinforcement learning using the open NeMo libraries. Nemotron 3 Ultra runs on NVIDIA Hopper, Blackwell, and Ampere GPUs via a single NVFP4-quantized checkpoint, enabling 5x higher throughput per GPU on Blackwell at the same latency versus BF16.

Multi-turn agents amplify the token-efficiency problem

Single-turn chatbots do not accumulate context. Agents iterate: they plan, call tools, receive observations, validate results, delegate to sub-agents, and recover from errors. Each cycle adds to the context the model must process. Over 10–50 turns, token counts double or triple, driving up both inference latency and cost per task completion.
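To make the accumulation concrete, here is a rough sketch rather than a measurement. It assumes a hypothetical agent that starts with a 4,000-token prompt, appends a fixed 400 tokens of tool output and reasoning per turn, and resends the full context on every turn with no caching or summarization. The function name and all numbers are illustrative assumptions, not figures from NVIDIA.

```python
def context_growth(initial_tokens, tokens_per_turn, turns):
    """Return (context size at each turn, total input tokens processed).

    Assumes the full context is resent every turn and grows by a fixed
    amount per turn. Real agents vary widely; this is only a toy model.
    """
    sizes = []
    context = initial_tokens
    for _ in range(turns):
        sizes.append(context)       # tokens the model reads on this turn
        context += tokens_per_turn  # observations and reasoning appended
    return sizes, sum(sizes)


# Hypothetical numbers chosen for illustration only.
sizes, total = context_growth(initial_tokens=4000, tokens_per_turn=400, turns=20)
print(f"turn 1 context:  {sizes[0]} tokens")
print(f"turn 20 context: {sizes[-1]} tokens")
print(f"total processed: {total} tokens")
```

Under these assumptions the context at turn 20 is 11,600 tokens, nearly triple the starting prompt, and the model reads 156,000 input tokens in total across the run. That total is why per-turn savings compound over long task chains.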

Existing models, whether frontier-scale or efficiency-focused, are trained for single-turn quality. They do not prioritize reasoning decisions that remain stable across many turns, clear tool invocation, or efficient fact retrieval from large context windows. Custom orchestration and vector retrieval can mask the problem, but they add latency and operational complexity.

Nemotron 3 Ultra was post-trained on 2M RL tasks and 55 agent environments specifically designed for multi-turn workflows. The released recipes and open RL environments allow teams to adapt the model to their own domains without starting from commodity base models.

Evaluate Nemotron 3 Ultra for your longest task chains first

Start with SWE-bench or your own longest agent workflow. Measure the baseline token count per completion with your current model and prompt, then swap in Nemotron 3 Ultra without retraining and measure again. The MOPD training approach is public; if the off-the-shelf results fall short in your domain, the recipes are available in NeMo-RL to fine-tune on your own RL tasks and environments.
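One way to run that comparison is sketched below. It assumes you have already logged the total tokens consumed per completed task for each model; the function name and the token counts are hypothetical placeholders, and how you collect the counts depends on your serving stack.

```python
from statistics import mean


def token_reduction(baseline, candidate):
    """Fractional reduction in mean tokens per completed task.

    Both arguments are lists of per-task token totals. For a fair
    comparison, include only tasks that both models actually completed.
    """
    if not baseline or not candidate:
        raise ValueError("need at least one completed task per model")
    base_mean = mean(baseline)
    return (base_mean - mean(candidate)) / base_mean


# Hypothetical per-task token totals; replace with your own logs.
baseline_tokens = [120_000, 95_000, 143_000]   # current model + prompt
candidate_tokens = [90_000, 80_000, 100_000]   # swapped-in model

reduction = token_reduction(baseline_tokens, candidate_tokens)
print(f"mean token reduction: {reduction:.1%}")
```

A positive result means the candidate used fewer tokens on average. Token count is only half the picture: check task success rate alongside it, since a model that gives up early also uses fewer tokens.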

NVIDIA also released Nemotron 3.5 Content Safety (4B guardrail model, 23 safety categories, 12 languages) and Nemotron 3.5 ASR (multilingual voice input, 40+ languages, sub-100ms latency). Both integrate with Hermes Agent and OpenShell, NVIDIA's new secure runtime for autonomous code execution. If you are building always-on agentic systems, these are worth testing alongside the base model.

Deploy via NVIDIA NIM (managed service) or self-host on your own infrastructure. The company is moving Nemotron licensing to OpenMDW-1.1 (Linux Foundation), a change intended to give enterprises clearer legal footing for customization and redistribution.
