NVIDIA Blackwell Sweeps MLPerf Training 6.0 With New DeepSeek Models

NVIDIA took every MLPerf Training 6.0 category

NVIDIA delivered first-place finishes across MLPerf Training v6.0, the MLCommons benchmark suite, and was the only vendor to submit results in every test category (per the NVIDIA blog). Key results included training DeepSeek-V3 (a 671B-parameter Mixture of Experts model) in 2.02 minutes on 8,192 Blackwell GPUs, GPT-OSS 20B in 7.43 minutes on 512 GPUs, and Llama 3.1 405B in 7.07 minutes on 8,192 GPUs. NVIDIA also posted per-accelerator efficiency wins across dense and sparse model architectures.
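Because the three headline times were set on different cluster sizes, wall-clock minutes alone are hard to compare. A rough sketch, using only the figures quoted above, multiplies time by GPU count to get aggregate GPU-minutes. The `gpu_minutes` helper and the `RESULTS` table are illustrative names, not anything from MLCommons or NVIDIA, and GPU-minutes is a simplification that ignores differences in datasets and convergence targets between workloads.

```python
# Figures quoted in the article: (wall-clock minutes, GPU count).
RESULTS = {
    "DeepSeek-V3 671B": (2.02, 8192),
    "GPT-OSS 20B": (7.43, 512),
    "Llama 3.1 405B": (7.07, 8192),
}


def gpu_minutes(minutes: float, gpus: int) -> float:
    """Aggregate accelerator time: wall-clock minutes multiplied by GPU count."""
    return minutes * gpus


for name, (minutes, gpus) in RESULTS.items():
    print(f"{name}: {gpu_minutes(minutes, gpus):,.0f} GPU-minutes")
```

Multiplying the result by your own cost per GPU-minute gives a first-order estimate of what one benchmark run consumes at that scale.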

The benchmark introduced new workloads designed to reflect production trends: DeepSeek-V3 and GPT-OSS 20B both use Mixture of Experts routing, a pattern that stresses network fabric and dynamic scheduling. NVIDIA was alone in submitting results on both new models. The company scaled submission clusters up to 8,192 Blackwell GPUs running in unison across distributed data centers, demonstrating production-grade scaling using Spectrum-X Ethernet and Quantum InfiniBand (per the blog).

Software caught up faster than hardware shipped

The second-order detail: NVIDIA's blog explicitly documents that training throughput on DeepSeek-V3 improved about 1.3x in three months (from 1,298 TFLOPS/GPU to 1,648 TFLOPS/GPU, or 6,338 tokens/second per GPU) on the same GB300 hardware. This gain came from software-only optimizations across the CUDA-X stack: CuTe kernel fusions, CUDA graph rewrites for MoE token routing, MXFP8 attention quantization, and pipeline stage balancing (per the blog).

For infrastructure operators, this matters because published peak FLOPS are only useful if the software stack can extract them. A GPU that delivers roughly 30% more throughput next quarter without a hardware refresh lowers cost-per-token and can justify keeping existing deployments in service longer. This is also a hidden switching cost: teams invested in NVIDIA's Megatron, Transformer Engine, and cuDNN stack get continuous goodput wins that don't show up in competitor datasheets until the next benchmark cycle.
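To make the cost-per-token point concrete, here is a minimal sketch using the two per-GPU throughput figures quoted from the blog. It assumes the hourly cost of a GPU is fixed, so cost per token scales inversely with throughput. That assumption and the function names are illustrative and do not come from the blog.

```python
def speedup(before_tflops: float, after_tflops: float) -> float:
    """Ratio of new to old per-GPU throughput."""
    return after_tflops / before_tflops


def cost_per_token_change(ratio: float) -> float:
    """Fractional change in cost per token at a fixed hourly GPU cost.

    Assumes cost per token is inversely proportional to throughput.
    """
    return 1.0 / ratio - 1.0


ratio = speedup(1298, 1648)  # TFLOPS/GPU figures quoted in the article
print(f"throughput ratio: {ratio:.2f}x")
print(f"cost per token change: {cost_per_token_change(ratio):+.1%}")
```

Under that assumption, a throughput ratio near 1.27x corresponds to a cost-per-token reduction of about 21%, which is smaller than the headline speedup figure suggests.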

Verify MLPerf results on your own workloads before you buy

MLPerf Training times are standardized, but they do not predict your model's training cost. If you are training a proprietary MoE variant or a different-sized foundation model, run a submitted configuration on your own hardware and data before committing to multi-year GPU procurement. The software wins NVIDIA posted (8% from 1F1B all-to-all overlap, 8% from CUDA graphs, 4% from pipeline balancing) apply to specific model architectures and may not transfer to your use case (per the blog).
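As a rough check on how the three itemized gains combine, the sketch below compares a multiplicative and an additive reading. The article does not say how NVIDIA combined them, so both readings are assumptions, and the variable and function names are illustrative.

```python
from math import prod

# Itemized software gains quoted in the article, as fractions.
GAINS = {
    "1F1B all-to-all overlap": 0.08,
    "CUDA graphs": 0.08,
    "pipeline balancing": 0.04,
}


def compounded(gains: dict[str, float]) -> float:
    """Combined speedup if each gain multiplies the previous one (assumption)."""
    return prod(1.0 + g for g in gains.values())


def additive(gains: dict[str, float]) -> float:
    """Combined speedup if the gains simply add (assumption)."""
    return 1.0 + sum(gains.values())


print(f"compounded: {compounded(GAINS):.3f}x")
print(f"additive:   {additive(GAINS):.3f}x")
```

Either reading puts the three itemized gains at roughly 1.20x to 1.21x combined, below the ratio of the two quoted per-GPU throughput figures (1,648 / 1,298, about 1.27x). If one of the itemized optimizations does not apply to your architecture, the expected gain shrinks accordingly.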

Also, pull the actual MLPerf Training 6.0 submissions (available at mlcommons.org) and inspect the software versions, batch sizes, and cluster configurations used. A benchmark win at a specific scale does not guarantee cost-optimality at your target scale, especially if you are training at smaller cluster sizes where the networking optimizations have diminishing returns.
