Our Take
NVFP4 is a real precision trick backed by published benchmarks on real hardware, but the speedup comes from Blackwell's native support, so it is useless on older GPUs, and the technique itself is not new.
Why it matters
LLM pretraining at scale is budgeted in GPU-hours. The reported 1.31x to 1.73x throughput gains correspond to roughly 24% to 42% less time per step, which can add up to weeks of saved compute on a long training run. Anyone training on Blackwell should test this immediately.
Do this week
Training teams: before committing Blackwell capacity to FP8 baselines, run the MaxText NVFP4 recipe on a small Llama 3 8B job this week so you can measure real step time and loss convergence on your own cluster.
NVIDIA ships production NVFP4 quantization in MaxText
NVIDIA released a four-bit mixed-precision training recipe for the open-source MaxText framework, enabling sub-byte quantization during LLM pretraining on Blackwell hardware. The NVFP4 format (four-bit floating point with two-level microscaling) uses native hardware support on the GB200 and GB300 Blackwell chips to accelerate matrix multiplications.
The recipe applies NVFP4 only to the MLP (feed-forward) layers of transformer models, leaving attention blocks in higher precision to avoid quantization noise on the softmax. Five techniques work together to preserve convergence: 16-element micro-block scaling, E4M3 block scale factors, a random Hadamard transform on weight gradients, 2D weight scaling, and stochastic rounding.
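To make the block-scaling and stochastic-rounding ideas concrete, here is a minimal pure-Python sketch, not NVIDIA's implementation. It assumes the standard E2M1 value grid for a 4-bit float and one scale per 16 elements. As a simplification, the block scale is kept as an ordinary Python float rather than rounded to E4M3, and the second (per-tensor) scaling level, the Hadamard transform, and 2D weight scaling are omitted. All function names are hypothetical.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (the sign bit is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # number of elements that share one scale factor


def round_to_grid(x, stochastic, rng):
    """Round a magnitude 0 <= x <= 6 onto E2M1_GRID."""
    hi_idx = next(i for i, g in enumerate(E2M1_GRID) if g >= x)
    hi = E2M1_GRID[hi_idx]
    if hi == x:
        return x
    lo = E2M1_GRID[hi_idx - 1]
    if stochastic:
        # Round up with probability equal to the fractional position between
        # lo and hi, which makes the rounding unbiased in expectation.
        p_up = (x - lo) / (hi - lo)
        return hi if rng.random() < p_up else lo
    # Round to nearest, ties going up.
    return hi if (x - lo) >= (hi - x) else lo


def quantize_block(block, stochastic, rng):
    """Return (scale, codes) such that block[i] is approximately scale * codes[i]."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Simplification: a real recipe would store this scale in E4M3.
    scale = amax / E2M1_GRID[-1]
    codes = []
    for v in block:
        mag = round_to_grid(min(abs(v) / scale, E2M1_GRID[-1]), stochastic, rng)
        codes.append(math.copysign(mag, v))
    return scale, codes


def quantize(values, stochastic=False, seed=0):
    """Quantize and immediately dequantize a flat list, block by block."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[start:start + BLOCK], stochastic, rng)
        out.extend(scale * c for c in codes)
    return out
```

Because the scale is set by the largest magnitude in each block, a single outlier coarsens the grid for its 15 neighbours, which is one reason small blocks matter. With stochastic=True, individual values are noisier, but the average over many roundings stays close to the original value.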
Benchmark results (company-reported, measured on Llama 3 8B and 3.1 405B with identical hyperparameters, batch size, and parallelism):
- Llama 3 8B on GB200: 1.35x speedup over FP8 baseline (149.7 to 171.3 TF/s per GPU as quoted; that pair works out to about 1.14x, so one of these figures is likely misreported)
- Llama 3 8B on GB300: 1.31x speedup (175.9 to 230.1 TF/s per GPU)
- Llama 3.1 405B on GB200: 1.44x speedup (155.7 to 222.4 TF/s per GPU)
- Llama 3.1 405B on GB300: 1.73x speedup (210.3 to 363.3 TF/s per GPU)
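Recomputing the ratios from the quoted throughput pairs is a cheap sanity check, since a headline multiplier and its underlying figures do not always agree. The sketch below uses only the numbers listed above; the helper names are illustrative. It also converts each throughput ratio into the fraction of step time removed, which is always smaller than the multiplier suggests.

```python
# (model, system) -> (FP8 TF/s per GPU, NVFP4 TF/s per GPU), copied from the list above.
QUOTED = {
    ("Llama 3 8B", "GB200"): (149.7, 171.3),
    ("Llama 3 8B", "GB300"): (175.9, 230.1),
    ("Llama 3.1 405B", "GB200"): (155.7, 222.4),
    ("Llama 3.1 405B", "GB300"): (210.3, 363.3),
}


def speedup(fp8_tflops, nvfp4_tflops):
    """Throughput ratio of NVFP4 over the FP8 baseline."""
    return nvfp4_tflops / fp8_tflops


def step_time_reduction(ratio):
    """Fraction of step time removed by a given throughput ratio."""
    return 1.0 - 1.0 / ratio


for (model, system), (fp8, nvfp4) in QUOTED.items():
    r = speedup(fp8, nvfp4)
    print(f"{model} on {system}: {r:.2f}x, step time reduced by {step_time_reduction(r):.0%}")
```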
Training loss tracked the FP8 baseline within 0.026 nats over 10,000 steps on the C4 dataset, a gap small enough to suggest little or no accuracy penalty at that training length. The recipe is available in the JAX-Toolbox GitHub repository with two modes: te_nvfp4 (with random Hadamard transform, recommended for safer convergence) and te_nvfp4_no_rht (lower overhead, potential convergence risk).
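The difference between the two modes is easier to see with a toy random Hadamard transform. The sketch below is an illustration only, not the Transformer Engine code path: it flips signs at random, applies an orthonormal Walsh-Hadamard transform to a 16-element vector, and shows that a single outlier gets spread evenly across the block. The function names and the block length of 16 are assumptions made for the example.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def random_signs(n, seed=0):
    """One random +1/-1 per element, fixed by the seed."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rht(vec, signs):
    """Random Hadamard transform: flip signs, then apply the Hadamard matrix."""
    return fwht([s * v for s, v in zip(signs, vec)])


def inverse_rht(vec, signs):
    """The normalized Hadamard matrix is its own inverse, so undo the steps in reverse."""
    return [s * v for s, v in zip(signs, fwht(vec))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


signs = random_signs(16, seed=0)
spike = [0.0] * 16
spike[3] = 8.0  # one outlier in an otherwise empty block
spread = rht(spike, signs)
print(max(abs(v) for v in spike), max(abs(v) for v in spread))
```

Because the transform is orthogonal, applying it to both operands along the contraction dimension leaves their dot product unchanged, so in exact arithmetic it does not alter the result of a GEMM. What it changes is the shape of the values being quantized: flatter blocks waste less of the 4-bit range on one large element. The extra transform is the overhead that te_nvfp4_no_rht avoids.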
Blackwell-specific advantage, not a general breakthrough
The speedups are real and come from genuine hardware efficiency: NVFP4 on Blackwell is credited with 7x the GEMM throughput of native FP8 on Hopper. But the recipe only works on Blackwell. Users with H100 or H200 clusters cannot adopt it, because the format depends on Blackwell's native instructions.
The precision trick itself is not new. Four-bit training and microscaling have been published in prior work; what changed is the integration into a production training framework and the native Blackwell instruction set. The novelty is in execution, not method.
For users already committed to Blackwell, the wins compound. A 1.5x throughput improvement on a 10-day training run cuts it to about 6.7 days, saving roughly 3.3 days of wall clock. For the 405B case, the 1.73x uplift on GB300 is substantial enough to matter in multi-month training schedules.
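The wall-clock arithmetic is simple enough to script when planning capacity. This sketch assumes the speedup applies uniformly to the whole run, which ignores checkpointing, evaluation, and data-loading time that quantization does not accelerate; the function name is illustrative.

```python
def days_saved(baseline_days, speedup):
    """Wall-clock days saved if the whole run gets faster by `speedup`."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_days * (1.0 - 1.0 / speedup)


# The 10-day, 1.5x case from the text, plus the reported 1.73x on a 90-day schedule.
print(f"{days_saved(10, 1.5):.1f} days saved on a 10-day run at 1.5x")
print(f"{days_saved(90, 1.73):.1f} days saved on a 90-day run at 1.73x")
```

Note that the saving grows with 1 - 1/speedup, so a 1.5x speedup removes a third of the schedule, not half.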
How to adopt NVFP4 without betting your convergence
Start with te_nvfp4_no_rht on a small synthetic run (50 steps) to measure overhead and baseline throughput. Pin the script, record TF/s per GPU, and compare to your current FP8 baseline on the same hardware.
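On a 50-step run, the first few steps are usually dominated by compilation and warm-up, so comparing raw averages can mislead. A minimal sketch of a fairer comparison, assuming you have collected per-step times in seconds from each run's logs (the log parsing, the function names, and the warm-up count of 10 are assumptions, not part of the recipe):

```python
import statistics


def steady_state_step_time(step_times_s, warmup_steps=10):
    """Median step time after discarding warm-up steps."""
    if len(step_times_s) <= warmup_steps:
        raise ValueError("need more steps than warmup_steps")
    return statistics.median(step_times_s[warmup_steps:])


def compare_runs(fp8_step_times_s, nvfp4_step_times_s, warmup_steps=10):
    """Summarize two runs made with identical batch size and parallelism."""
    fp8 = steady_state_step_time(fp8_step_times_s, warmup_steps)
    nvfp4 = steady_state_step_time(nvfp4_step_times_s, warmup_steps)
    return {"fp8_step_s": fp8, "nvfp4_step_s": nvfp4, "speedup": fp8 / nvfp4}


# Made-up timings for illustration only: slow warm-up steps, then a steady state.
fp8_times = [5.0] * 10 + [2.0] * 40
nvfp4_times = [6.0] * 10 + [1.6] * 40
print(compare_runs(fp8_times, nvfp4_times))
```

The median is used instead of the mean so that an occasional slow step, for example one that coincides with a checkpoint, does not skew the comparison.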
If step time improves and you have production training lined up, switch to te_nvfp4 (with random Hadamard transform) for a longer validation run before committing Blackwell GPU-weeks. Loss curves in the NVIDIA benchmark tracked the FP8 baseline closely, but verify on your own model and dataset before rolling out to production-scale clusters.
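One way to make "verify on your model" concrete is to compare smoothed loss curves from an FP8 run and an NVFP4 run that share seeds, data order, and hyperparameters. The sketch below is a hypothetical check: the default tolerance of 0.026 nats simply reuses the gap reported in the benchmark and is not a recommended threshold, and the smoothing window of 100 steps is likewise an assumption.

```python
def moving_average(values, window):
    """Simple trailing moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        out.append(running / window)
    return out


def max_loss_gap(baseline_losses, candidate_losses, window=100):
    """Largest absolute gap between the two smoothed loss curves, in nats."""
    if len(baseline_losses) != len(candidate_losses):
        raise ValueError("loss curves must cover the same steps")
    base = moving_average(baseline_losses, window)
    cand = moving_average(candidate_losses, window)
    return max(abs(c - b) for b, c in zip(base, cand))


def within_tolerance(baseline_losses, candidate_losses, tolerance_nats=0.026, window=100):
    """True if the candidate never drifts further than tolerance_nats from the baseline."""
    return max_loss_gap(baseline_losses, candidate_losses, window) <= tolerance_nats


# Synthetic curves for illustration only.
fp8_curve = [3.0 - 0.001 * i for i in range(500)]
nvfp4_curve = [loss + 0.01 for loss in fp8_curve]
print(within_tolerance(fp8_curve, nvfp4_curve))
```

Smoothing matters because per-step loss is noisy; comparing raw values step by step would flag gaps that are just batch-to-batch variance.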
The recipe requires JAX, NVIDIA Transformer Engine, and the CUDA/cuDNN libraries. NVIDIA provides a container (ghcr.io/nvidia/jax:maxtext) with dependencies pre-installed. Use runtime profiling with Nsight Systems (enabled by default) to confirm that quantization overhead does not dominate under your interconnect and parallelism strategy.