Our Take
Blackwell's clean sweep is real: NVIDIA submitted on every benchmark and won time-to-train across the board. But the story is the three-month software velocity: throughput on DeepSeek-V3 rose about 1.3x with no hardware changes, a reminder that published silicon specs mean little without a software stack that can exploit them.
Why it matters
Training time and per-GPU efficiency are the concrete metrics infrastructure operators use to justify spend. If NVIDIA's software keeps lifting throughput by roughly 30% per quarter, which shortens wall-clock training by a little over 20% each time, switching costs rise sharply, even as competing hardware catches up on raw FLOPS.
Do this week
Infrastructure teams: Before committing to multi-year GPU orders, reproduce a submitted MLPerf Training 6.0 configuration on your target workloads, so your cost models rest on independent time-to-train data.
NVIDIA took every MLPerf Training 6.0 category
NVIDIA delivered first-place finishes across MLPerf Training v6.0, the MLCommons benchmark suite, and was the only vendor to submit results in every test category (per the NVIDIA blog). Key results included training DeepSeek-V3 (a 671B-parameter Mixture of Experts model) in 2.02 minutes on 8,192 Blackwell GPUs, GPT-OSS 20B in 7.43 minutes on 512 GPUs, and Llama 3.1 405B in 7.07 minutes on 8,192 GPUs. NVIDIA also posted per-accelerator efficiency wins across dense and sparse model architectures.
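For cost modeling, it can help to convert those headline times into GPU-hours, since a short wall-clock time on a very large cluster is not the same thing as a cheap run. The sketch below uses only the times and GPU counts quoted above; the helper name gpu_hours is ours, and the figures cover the benchmark run to its target quality, not a full production training job.

```python
# Headline results quoted above: (model, minutes to train, GPU count).
RESULTS = [
    ("DeepSeek-V3", 2.02, 8192),
    ("GPT-OSS 20B", 7.43, 512),
    ("Llama 3.1 405B", 7.07, 8192),
]


def gpu_hours(minutes: float, gpus: int) -> float:
    """Total accelerator time consumed by one run, in GPU-hours."""
    return minutes * gpus / 60.0


for model, minutes, gpus in RESULTS:
    print(f"{model}: {gpu_hours(minutes, gpus):.1f} GPU-hours "
          f"({minutes} min on {gpus} GPUs)")
```

Multiplying each GPU-hour figure by your own hourly rate gives a rough cost for one benchmark run at that scale.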
The benchmark introduced new workloads designed to reflect production trends: DeepSeek-V3 and GPT-OSS 20B both use Mixture of Experts routing, a pattern that stresses network fabric and dynamic scheduling. NVIDIA was alone in submitting results on both new models. The company scaled submission clusters up to 8,192 Blackwell GPUs running in unison across distributed data centers, demonstrating production-grade scaling using Spectrum-X Ethernet and Quantum InfiniBand (per the blog).
Software caught up faster than hardware shipped
The second-order detail: NVIDIA's blog documents that training throughput on DeepSeek-V3 improved 1.3x in three months (from 1,298 TFLOPS/GPU to 1,648 TFLOPS/GPU, or 6,338 tokens/second per GPU) on the same GB300 hardware. This gain came from software-only optimizations across the CUDA-X stack: CuTe kernel fusions, CUDA graph rewrites for MoE token routing, MXFP8 attention quantization, and pipeline stage balancing (per the blog).
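A throughput multiple and a wall-clock saving are easy to conflate, so here is the arithmetic as a minimal sketch. It assumes the amount of training work is fixed and that time scales inversely with per-GPU throughput; the function names are ours.

```python
def speedup(before: float, after: float) -> float:
    """Throughput ratio between two measurements on the same hardware."""
    return after / before


def wall_clock_reduction(speedup_factor: float) -> float:
    """Fraction of run time saved for a fixed amount of work.

    Assumes time is inversely proportional to throughput.
    """
    return 1.0 - 1.0 / speedup_factor


# Per-GPU throughput figures quoted above, in TFLOPS.
before_tflops = 1298.0
after_tflops = 1648.0

s = speedup(before_tflops, after_tflops)
r = wall_clock_reduction(s)

print(f"speedup: {s:.2f}x")               # speedup: 1.27x
print(f"wall-clock reduction: {r:.1%}")   # wall-clock reduction: 21.2%
```

The quoted 1.3x is a rounded 1.27x, and it corresponds to roughly a fifth less training time, not 30% less.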
For infrastructure operators, this matters because published silicon peak FLOPS are only useful if the software stack can extract them. A GPU that delivers 30% more throughput next quarter without a hardware refresh changes cost-per-token and justifies keeping existing deployments alive longer. This is also a hidden switching cost: teams invested in NVIDIA's Megatron, Transformer Engine, and cuDNN stack get continuous goodput wins that don't show up in competitor datasheets until the next benchmark cycle.
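To see how a throughput gain feeds into cost-per-token, consider this rough sketch. The GPU hourly price is a hypothetical placeholder, not a quoted figure. The code also assumes that 6,338 tokens/second per GPU is the post-optimization number and that token throughput scaled in proportion to TFLOPS, which lets it back out an earlier tokens/second value the source does not state.

```python
def cost_per_million_tokens(gpu_hour_usd: float,
                            tokens_per_sec_per_gpu: float) -> float:
    """Training cost in USD per one million tokens processed by one GPU."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600.0
    return gpu_hour_usd / tokens_per_hour * 1_000_000


GPU_HOUR_USD = 10.0  # hypothetical price; substitute your own rate

after_tps = 6338.0  # tokens/s per GPU, as quoted above
# Assumption: tokens/s scaled with TFLOPS (same model, same FLOPs per token).
before_tps = after_tps * 1298.0 / 1648.0

cost_before = cost_per_million_tokens(GPU_HOUR_USD, before_tps)
cost_after = cost_per_million_tokens(GPU_HOUR_USD, after_tps)

print(f"before: ${cost_before:.3f} per million tokens")
print(f"after:  ${cost_after:.3f} per million tokens")
print(f"saving: {1.0 - cost_after / cost_before:.1%}")
```

Whatever price you plug in, the relative saving is the same, because it depends only on the throughput ratio.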
Verify MLPerf results on your own workloads before you buy
MLPerf Training times are standardized, but they do not predict your model's training cost. If you are training a proprietary MoE variant or a different-sized foundation model, run a submitted configuration on your own hardware and data before committing to multi-year GPU procurement. The software wins NVIDIA posted (8% from 1F1B all-to-all overlap, 8% from CUDA graphs, 4% from pipeline balancing, per the blog) apply to specific model architectures and may not transfer to your use case.
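One way to stress-test a cost model is to ask what the gain looks like if only some of those optimizations carry over to your architecture. The sketch below assumes the individual gains are independent and compound multiplicatively, which the source does not confirm; the dictionary keys and the function name are ours.

```python
from math import prod

# Per-optimization gains quoted above, as fractions.
REPORTED_GAINS = {
    "1f1b_all_to_all_overlap": 0.08,
    "cuda_graphs": 0.08,
    "pipeline_balancing": 0.04,
}


def combined_speedup(gains: dict[str, float], applicable: set[str]) -> float:
    """Compound only the gains expected to transfer to your workload.

    Assumes the gains are independent and multiply together.
    """
    return prod(1.0 + g for name, g in gains.items() if name in applicable)


all_three = combined_speedup(REPORTED_GAINS, set(REPORTED_GAINS))
graphs_only = combined_speedup(REPORTED_GAINS, {"cuda_graphs"})

print(f"all three transfer:  {all_three:.3f}x")    # 1.213x
print(f"only CUDA graphs:    {graphs_only:.3f}x")  # 1.080x
```

A dense model with no all-to-all traffic, for example, might see only part of the gain, and measuring on your own workload is the only way to know which part.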
Also, pull the actual MLPerf Training 6.0 submissions (available at mlcommons.org) and inspect the software versions, batch sizes, and cluster configurations used. A benchmark win at a specific scale does not guarantee cost-optimality at your target scale, especially if you are training at smaller cluster sizes where the networking optimizations have diminishing returns.