Train Llama 3.1 405B 73% faster with NVIDIA NVFP4 on Blackwell

NVIDIA ships production NVFP4 quantization in MaxText

NVIDIA released a four-bit mixed-precision training recipe for the open-source MaxText framework, enabling sub-byte quantization during LLM pretraining on Blackwell hardware. The NVFP4 format (four-bit floating point with two-level microscaling) relies on native hardware support in the GB200 and GB300 Blackwell systems to accelerate matrix multiplications.

The recipe applies NVFP4 only to the MLP (feed-forward) layers of transformer models, leaving attention blocks in higher precision to avoid quantization noise on the softmax. Five techniques work together to preserve convergence: 16-element micro-block scaling, E4M3 block scale factors, a random Hadamard transform on weight gradients, 2D weight scaling, and stochastic rounding.
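To make the block-scaling and stochastic-rounding ideas concrete, here is a minimal pure-Python sketch. It is not the MaxText or Transformer Engine implementation. It assumes the standard E2M1 value grid for the four-bit elements, and it keeps each block scale as an ordinary Python float instead of rounding it to E4M3. The second-level per-tensor scale, 2D weight scaling, and the Hadamard transform are left out. All function names are illustrative.

```python
import random

# Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # number of elements that share one scale factor


def stochastic_round(mag, rng):
    """Round a magnitude to a neighbouring grid point, unbiased in expectation."""
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)  # closer to hi -> more likely to round up
            return hi if rng.random() < p_up else lo
    return E2M1_GRID[-1]  # clamp anything above the largest magnitude


def quantize_block(block, rng):
    """Return (scale, codes) so that scale * code approximates each input value."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Simplification: the scale stays a Python float; NVFP4 stores it in E4M3.
    scale = amax / E2M1_GRID[-1]
    codes = []
    for v in block:
        mag = stochastic_round(abs(v) / scale, rng)
        codes.append(mag if v >= 0 else -mag)
    return scale, codes


def fake_quantize(values, rng):
    """Quantize then dequantize a flat list, one 16-element block at a time."""
    out = []
    for i in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[i:i + BLOCK], rng)
        out.extend(scale * c for c in codes)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(32)]
    approx = fake_quantize(data, rng)
    print(max(abs(a - b) for a, b in zip(data, approx)))
```

Because each block gets its own scale, one large value only degrades the resolution of its 15 neighbours, not of the whole tensor. Stochastic rounding keeps the rounding error zero-mean, which matters when many small gradient contributions are accumulated.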

Benchmark results (company-reported, measured on Llama 3 8B and 3.1 405B with identical hyperparameters, batch size, and parallelism):

Llama 3 8B on GB200: 1.35x speedup reported over FP8 baseline (the quoted 149.7 to 171.3 TF/s per GPU works out to about 1.14x)
Llama 3 8B on GB300: 1.31x speedup (175.9 to 230.1 TF/s per GPU)
Llama 3.1 405B on GB200: 1.44x speedup (155.7 to 222.4 TF/s per GPU)
Llama 3.1 405B on GB300: 1.73x speedup (210.3 to 363.3 TF/s per GPU)
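The multipliers can be checked directly against the quoted throughput pairs. The short script below recomputes each ratio from the TF/s figures listed above; the dictionary layout and names are only for illustration.

```python
# (model, system) -> (FP8 TF/s per GPU, NVFP4 TF/s per GPU), as quoted above
REPORTED = {
    ("Llama 3 8B", "GB200"): (149.7, 171.3),
    ("Llama 3 8B", "GB300"): (175.9, 230.1),
    ("Llama 3.1 405B", "GB200"): (155.7, 222.4),
    ("Llama 3.1 405B", "GB300"): (210.3, 363.3),
}


def speedup(fp8_tflops, nvfp4_tflops):
    """Throughput ratio of the NVFP4 run over the FP8 baseline."""
    return nvfp4_tflops / fp8_tflops


if __name__ == "__main__":
    for (model, system), (fp8, nvfp4) in REPORTED.items():
        print(f"{model} on {system}: {speedup(fp8, nvfp4):.2f}x")
    # Llama 3 8B on GB200: 1.14x
    # Llama 3 8B on GB300: 1.31x
    # Llama 3.1 405B on GB200: 1.43x
    # Llama 3.1 405B on GB300: 1.73x
```

The GB300 rows match the stated multipliers. The 405B GB200 row differs only in the last digit (1.43 against the stated 1.44), while the 8B GB200 pair yields a noticeably lower ratio than the stated 1.35x, so treat that row with caution.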

Training loss tracked the FP8 baseline within 0.026 nats over 10,000 steps on the C4 dataset, a small gap that suggests little, if any, accuracy penalty. The recipe is available in the JAX-Toolbox GitHub repository with two modes: te_nvfp4 (with random Hadamard transform, recommended for safer convergence) and te_nvfp4_no_rht (lower overhead, potential convergence risk).
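The difference between the two modes is the random Hadamard transform. A rough pure-Python sketch of the idea follows. It is not the Transformer Engine kernel, and the choice of a 16-element transform is an assumption made for illustration. An orthogonal Hadamard rotation with random sign flips spreads a single outlier across the whole block before quantization, and because the rotation is orthogonal, dot products between two transformed vectors are unchanged.

```python
import math
import random


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def random_hadamard_transform(x, signs):
    """Apply y = H @ (signs * x) / sqrt(n), an orthogonal rotation of x."""
    n = len(x)
    h = hadamard(n)
    flipped = [s * v for s, v in zip(signs, x)]
    norm = 1.0 / math.sqrt(n)
    return [norm * sum(h[i][j] * flipped[j] for j in range(n)) for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    signs = [rng.choice((-1, 1)) for _ in range(16)]
    spiky = [0.0] * 15 + [8.0]  # one outlier dominates the block
    rotated = random_hadamard_transform(spiky, signs)
    print(max(abs(v) for v in spiky), max(abs(v) for v in rotated))  # 8.0 2.0
```

After the rotation every element has magnitude 2.0 instead of one element at 8.0, so a shared block scale wastes far less of the four-bit range. The extra matrix multiply is the overhead that te_nvfp4_no_rht avoids.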

Blackwell-specific advantage, not a general breakthrough

The speedups are real and come from genuine hardware efficiency: by NVIDIA's figures, NVFP4 on Blackwell delivers 7x the GEMM throughput of native FP8 on Hopper. But the recipe only works on Blackwell. Users with H100 or H200 clusters cannot adopt it, because it is hardware-locked.

The precision trick itself is not new. Four-bit training and microscaling have been published in prior work; what changed is the integration into a production training framework and the native Blackwell instruction set. The novelty is in execution, not method.

For users already committed to Blackwell, the wins compound. A 1.5x step time improvement shortens a 10-day training run to about 6.7 days, saving roughly 3.3 days of wall clock. For the 405B case, the 1.73x uplift on GB300 is substantial enough to matter in multi-month training schedules.
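The wall-clock arithmetic is simple enough to script. The helper below is a back-of-the-envelope sketch that assumes the speedup applies uniformly to every step and ignores checkpointing, restarts, and evaluation time; the function name is illustrative.

```python
def days_saved(baseline_days, speedup):
    """Wall-clock days saved if every step runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_days * (1.0 - 1.0 / speedup)


if __name__ == "__main__":
    print(f"{days_saved(10, 1.5):.2f}")   # 3.33
    print(f"{days_saved(10, 1.73):.2f}")  # 4.22
    print(f"{days_saved(90, 1.73):.1f}")  # 38.0
```

Savings grow with the speedup but with diminishing returns: even an infinite speedup can only save the full baseline duration.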

How to adopt NVFP4 without betting your convergence

Start with te_nvfp4_no_rht on a small synthetic run (50 steps) to measure overhead and baseline throughput. Pin the script, record TF/s per GPU, and compare to your current FP8 baseline on the same hardware.
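If your run logs step time but not TF/s, a rough conversion is possible. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass, which ignores attention FLOPs and may not match the accounting behind the published figures. The function and parameter names are hypothetical, not MaxText options.

```python
def approx_tflops_per_gpu(num_params, tokens_per_step, step_seconds, num_gpus):
    """Estimate achieved TF/s per GPU from step time (6 * N * D approximation)."""
    if step_seconds <= 0 or num_gpus <= 0:
        raise ValueError("step_seconds and num_gpus must be positive")
    flops_per_step = 6.0 * num_params * tokens_per_step
    return flops_per_step / step_seconds / num_gpus / 1e12


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not measurements.
    fp8 = approx_tflops_per_gpu(8e9, 1_000_000, step_seconds=2.5, num_gpus=128)
    nvfp4 = approx_tflops_per_gpu(8e9, 1_000_000, step_seconds=2.0, num_gpus=128)
    print(f"FP8 {fp8:.1f} TF/s, NVFP4 {nvfp4:.1f} TF/s, ratio {nvfp4 / fp8:.2f}x")
```

Whatever formula you use, apply the same one to both the FP8 and NVFP4 runs; the ratio is what matters for the comparison.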

If step time improves and you have production training lined up, switch to te_nvfp4 (with random Hadamard transform) on a full-scale run before committing Blackwell GPU weeks. The loss curve in the NVIDIA benchmark tracked the FP8 baseline closely, but verify on your own model and dataset before rolling out to production clusters.
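One way to make that verification concrete is to compare smoothed loss curves from an FP8 run and an NVFP4 run with identical data order and hyperparameters. The sketch below assumes you have already exported per-step losses as two equal-length lists. The helper names and the smoothing window are illustrative choices, and the acceptable gap is yours to set.

```python
def moving_average(xs, window):
    """Trailing moving average; early entries average over fewer points."""
    out = []
    total = 0.0
    for i, x in enumerate(xs):
        total += x
        if i >= window:
            total -= xs[i - window]
        out.append(total / min(i + 1, window))
    return out


def max_loss_gap(baseline, candidate, window=100):
    """Largest absolute gap between two smoothed per-step loss curves."""
    if len(baseline) != len(candidate):
        raise ValueError("loss curves must cover the same steps")
    b = moving_average(baseline, window)
    c = moving_average(candidate, window)
    return max(abs(x - y) for x, y in zip(b, c))


if __name__ == "__main__":
    # Synthetic curves for illustration: candidate sits 0.02 above baseline.
    fp8_loss = [3.0 - 0.001 * step for step in range(1000)]
    nvfp4_loss = [v + 0.02 for v in fp8_loss]
    gap = max_loss_gap(fp8_loss, nvfp4_loss)
    print(f"max smoothed gap: {gap:.3f}")
```

The 0.026 figure quoted for the NVIDIA benchmark is a useful reference point, but a gap that widens over time is a stronger warning sign than any single threshold.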

The recipe requires JAX, NVIDIA Transformer Engine, and the CUDA/cuDNN libraries. NVIDIA provides a container (ghcr.io/nvidia/jax:maxtext) with the dependencies pre-installed. Profile a run with Nsight Systems (enabled by default) to confirm that quantization overhead does not dominate with your interconnect or parallelism strategy.
