Our Take
Low-precision training speedups are shape-dependent, and quantization overhead can swallow kernel gains entirely. Benchmark your actual GEMMs before assuming FP4 will help.
Why it matters
Teams training large models face exploding GPU costs. A 1.48x GEMM speedup can collapse to 1.05x when overhead is included, so predicting end-to-end gains requires measurement, not theory.
Do this week
Training team: run the NVIDIA benchmark script on your model config and batch size, then compare autocast and prequantized results to decide whether FP4 or MXFP8 actually saves GPU hours.
NVIDIA published a profiling methodology for low-precision transformer training
NVIDIA released guidance on how to convert a transformer config (hidden size, layer count, batch size, sequence length) into concrete GEMM (matrix multiply) shapes, benchmark those shapes across BF16, FP8, and FP4 precisions, and estimate real training speedups before running a full training job. The guidance includes a benchmark script and two measurement modes: autocast (which includes dynamic quantization overhead) and prequantized (raw kernel performance only).
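The config-to-shape step can be sketched in a few lines. This is a minimal illustration, not the NVIDIA script: it assumes standard multi-head attention with a fused QKV projection and a non-gated two-matrix MLP, and the names `Gemm`, `layer_gemms`, and `ffn_hidden` are hypothetical. Gated MLPs, grouped-query attention, and the backward-pass GEMMs would change or add shapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gemm:
    name: str
    m: int  # rows: tokens per micro-batch (batch_size * seq_len)
    k: int  # reduction dimension (input features)
    n: int  # output features

    @property
    def flops(self) -> int:
        # One multiply and one add per (m, k, n) triple.
        return 2 * self.m * self.k * self.n


def layer_gemms(hidden: int, ffn_hidden: int, batch_size: int, seq_len: int) -> list[Gemm]:
    """Forward-pass GEMMs of one transformer layer under the assumed layout."""
    tokens = batch_size * seq_len
    return [
        Gemm("qkv_proj", tokens, hidden, 3 * hidden),
        Gemm("attn_out", tokens, hidden, hidden),
        Gemm("mlp_up", tokens, hidden, ffn_hidden),
        Gemm("mlp_down", tokens, ffn_hidden, hidden),
    ]


# Placeholder sizes for illustration only; these are not the CodonFM 5B values.
for g in layer_gemms(hidden=1024, ffn_hidden=4096, batch_size=2, seq_len=512):
    print(f"{g.name:9s} M={g.m} K={g.k} N={g.n} GFLOP={g.flops / 1e9:.2f}")
```

Each (M, K, N) triple is what gets benchmarked per precision. Multiplying per-layer times by the layer count gives a first estimate of total GEMM time per step.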
The CodonFM 5B model case study illustrates the variance. On the same hardware (NVIDIA B300), the NVFP4 speedup over MXFP8 ranged from 1.05x (attention output GEMM) to 1.66x (MLP down GEMM). In autocast mode, which is what actually runs during training, NVFP4 delivered 1.98x over BF16. In prequantized mode, the same FP4 kernels achieved 3.48x. The gap is quantization overhead: Hadamard transforms, block scaling, and amax computation.
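To put a number on that gap, the two speedups can be converted into a share of autocast time. The sketch below assumes both figures are measured against the same BF16 baseline and that everything in the difference is quantization overhead. The helper name `overhead_share` is hypothetical.

```python
def overhead_share(autocast_speedup: float, prequantized_speedup: float) -> float:
    """Fraction of autocast time not explained by raw kernel time.

    With the BF16 baseline time normalized to 1.0:
        autocast time     = 1 / autocast_speedup
        prequantized time = 1 / prequantized_speedup
    """
    if autocast_speedup <= 0 or prequantized_speedup <= 0:
        raise ValueError("speedups must be positive")
    autocast_time = 1.0 / autocast_speedup
    kernel_time = 1.0 / prequantized_speedup
    return (autocast_time - kernel_time) / autocast_time


share = overhead_share(1.98, 3.48)
print(f"{share:.0%} of autocast time is overhead")  # prints "43% of autocast time is overhead"
```

Under those assumptions, roughly two fifths of the NVFP4 autocast time in the case study goes to preparing operands, not multiplying them.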
The methodology also revealed that FP8 DelayedScaling outperformed MXFP8 in autocast mode (7.80 versus 8.98 ms/layer) but fell behind in prequantized mode (8.12 versus 6.81 ms/layer), showing that quantization cost and raw kernel performance can crown different winners depending on the measurement method.
Speedup promises and reality diverge without measurement
Tensor core specs promise 2x to 3x gains from FP4, but CodonFM saw 1.46x to 1.66x on large GEMMs and 1.05x on small ones. Once non-GEMM overhead and Wgrad (weight-gradient GEMM) times are included, the end-to-end gap compresses further. A team that assumes FP4 will deliver a 3x training speedup and commits to a week-long training run risks discovering too late that the actual gain is 1.2x or that kernels silently fell back to FP8.
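The compression from GEMM-level to end-to-end speedup follows Amdahl's law: only the GEMM share of step time gets faster. A rough sketch, where the GEMM shares are hypothetical values chosen for illustration and `end_to_end_speedup` is not part of any NVIDIA tool:

```python
def end_to_end_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """Amdahl's law: the non-GEMM share of step time is left unchanged."""
    if not 0.0 <= gemm_fraction <= 1.0:
        raise ValueError("gemm_fraction must be in [0, 1]")
    if gemm_speedup <= 0:
        raise ValueError("gemm_speedup must be positive")
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)


# Hypothetical GEMM shares of step time, paired with the 1.66x best-case GEMM speedup.
for fraction in (0.5, 0.7, 0.9):
    print(f"GEMM share {fraction:.0%}: {end_to_end_speedup(fraction, 1.66):.2f}x end to end")
```

If GEMMs were half of step time, even the best 1.66x GEMM speedup would yield only about 1.25x overall, which is why a 1.2x end-to-end result is plausible.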
Quantization overhead is not constant. NVFP4 incurs stochastic rounding, random Hadamard transforms, and 2D block scaling. MXFP8 does not. For layers where the GEMM itself is small (like attention output), these operations can dominate the wall-clock time, erasing the theoretical speedup. The benchmark script isolates this by running both autocast and prequantized modes on the same shapes, letting teams see exactly where the gap lives.
Before you commit to FP4, validate on your shapes
Run the benchmark on your exact model config. Do not assume that speedups published for other models apply to yours. Pay special attention to the comparison between autocast and prequantized results. If autocast speedup is much lower than prequantized, quantization overhead is the bottleneck, not kernel throughput. Check GPU memory usage between runs; identical memory suggests kernels fell back to a higher precision without warning.
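The memory check can be reduced to a tolerance comparison between two peak-memory readings. The sketch below is an illustration only: the 2% tolerance is an arbitrary assumption, the helper name is hypothetical, and the readings are made up. In PyTorch, such readings could come from `torch.cuda.max_memory_allocated()` after each run.

```python
def memory_looks_unchanged(baseline_bytes: int, low_precision_bytes: int,
                           rel_tol: float = 0.02) -> bool:
    """True when two peak-memory readings differ by less than rel_tol (relative)."""
    if baseline_bytes <= 0:
        raise ValueError("baseline_bytes must be positive")
    return abs(baseline_bytes - low_precision_bytes) / baseline_bytes < rel_tol


# Made-up readings in bytes: a higher-precision run and a supposed FP4 run.
if memory_looks_unchanged(40_000_000_000, 39_900_000_000):
    print("Peak memory barely moved: check whether the FP4 kernels actually dispatched.")
```

A match is only a hint that a fallback happened, so confirm it with kernel-level logs or a profiler trace.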
Use autocast results to predict training speedup. Use prequantized results to diagnose whether quantization or kernel selection is the limiter. If GEMM-level speedup is near 1.0 even in prequantized mode, FP4 kernels are either not dispatching or not beneficial for your shapes. Confirm with NVTE_LOG_LEVEL=1 or Nsight Systems before scaling to a production run.
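The triage above can be written as a small decision rule. This is a sketch under stated assumptions: the thresholds `no_gain` and `gap_ratio` are arbitrary placeholders to tune per project, and `diagnose` is a hypothetical helper, not part of the benchmark script.

```python
def diagnose(autocast_speedup: float, prequantized_speedup: float,
             no_gain: float = 1.1, gap_ratio: float = 1.3) -> str:
    """Classify the limiter for one GEMM shape from its two measured speedups."""
    if autocast_speedup <= 0 or prequantized_speedup <= 0:
        raise ValueError("speedups must be positive")
    if prequantized_speedup < no_gain:
        # Raw kernels show no gain: a dispatch fallback or an unfavorable shape.
        return "kernel: FP4 kernels not dispatching or not beneficial for this shape"
    if prequantized_speedup / autocast_speedup > gap_ratio:
        # Kernels are fast, but the gain is lost before it reaches training.
        return "quantization: overhead is the bottleneck, not kernel throughput"
    return "ok: autocast speedup tracks raw kernel speedup"


print(diagnose(1.98, 3.48))  # the CodonFM NVFP4 figures fall in the quantization bucket
```

Running this per shape, not on a model-wide average, keeps a slow small GEMM from hiding behind fast large ones.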