Benchmark your transformer before training on FP4 to avoid speedup collapse

NVIDIA published a profiling methodology for low-precision transformer training

NVIDIA released guidance on how to convert a transformer config (hidden size, layer count, batch size, sequence length) into concrete GEMM (matrix multiply) shapes, benchmark those shapes across BF16, FP8, and FP4 precisions, and estimate real training speedups before running a full training job. The guidance includes a benchmark script and two measurement modes: autocast (which includes dynamic quantization overhead) and prequantized (raw kernel performance only).
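The config-to-shape step can be sketched roughly as follows. This is not NVIDIA's script. It assumes a standard transformer layer with a fused QKV projection, full multi-head attention, and a non-gated MLP. `GemmShape`, `forward_gemm_shapes`, and the `ffn_hidden` parameter are hypothetical names introduced for illustration. Only forward-pass GEMMs are listed, and layer count simply repeats the same shapes per layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmShape:
    name: str
    m: int  # rows of the activation matrix (tokens in the micro-batch)
    n: int  # output features
    k: int  # input features (the reduction dimension)

    @property
    def flops(self) -> int:
        # One multiply-accumulate is counted as 2 floating-point operations.
        return 2 * self.m * self.n * self.k


def forward_gemm_shapes(hidden: int, ffn_hidden: int,
                        batch: int, seq_len: int) -> list[GemmShape]:
    """Forward GEMMs of one layer, assuming fused QKV and a non-gated MLP."""
    tokens = batch * seq_len
    return [
        GemmShape("qkv_proj", tokens, 3 * hidden, hidden),
        GemmShape("attn_out", tokens, hidden, hidden),
        GemmShape("mlp_up", tokens, ffn_hidden, hidden),
        GemmShape("mlp_down", tokens, hidden, ffn_hidden),
    ]


if __name__ == "__main__":
    # Illustrative values only; these are not the CodonFM 5B settings.
    for g in forward_gemm_shapes(hidden=4096, ffn_hidden=16384,
                                 batch=4, seq_len=2048):
        print(f"{g.name:9s} M={g.m} N={g.n} K={g.k} GFLOP={g.flops / 1e9:.1f}")
```

Each (M, N, K) triple is what a per-shape benchmark would then time at every precision. Models with grouped-query attention or gated MLPs would need different N and K values than the ones assumed here.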

The CodonFM 5B case study shows how much the speedup varies by GEMM shape and by measurement mode. On the same hardware (NVIDIA B300), the NVFP4 speedup over MXFP8 ranged from 1.05x (attention output GEMM) to 1.66x (MLP down GEMM). In autocast mode (what actually runs during training), NVFP4 delivered 1.98x over BF16. In prequantized mode, the same FP4 kernels achieved 3.48x. The gap is quantization overhead: Hadamard transforms, block scaling, and amax computation.
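As a back-of-the-envelope check, the two NVFP4 figures can be turned into an overhead share. The sketch below assumes both speedups are measured against the same BF16 baseline and that the autocast time is kernel time plus quantization overhead. `overhead_share` is a hypothetical helper, not part of the published script.

```python
def overhead_share(autocast_speedup: float, prequantized_speedup: float) -> float:
    """Fraction of autocast time not accounted for by the prequantized kernels.

    Assumes both speedups share one BF16 baseline. With BF16 time normalized
    to 1.0, autocast time is 1/autocast_speedup and kernel-only time is
    1/prequantized_speedup.
    """
    if autocast_speedup <= 0 or prequantized_speedup <= 0:
        raise ValueError("speedups must be positive")
    autocast_time = 1.0 / autocast_speedup
    kernel_time = 1.0 / prequantized_speedup
    return (autocast_time - kernel_time) / autocast_time


if __name__ == "__main__":
    share = overhead_share(1.98, 3.48)
    print(f"{share:.0%} of autocast time is outside the FP4 kernels")
```

Under those assumptions, roughly 43% of the NVFP4 autocast time in this case study is spent outside the FP4 kernels themselves.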

The methodology also revealed that FP8 DelayedScaling outperformed MXFP8 in autocast mode (7.80 versus 8.98 ms/layer) but fell behind in prequantized mode (8.12 versus 6.81 ms/layer). Quantization cost and raw kernel performance can crown different winners depending on which mode is measured.
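One way to keep such results from being read out of context is to store them keyed by measurement mode and pick a winner per mode. The snippet below is a small illustration using the four per-layer timings quoted above. The dictionary layout and the `fastest` helper are assumptions, not the format the benchmark script emits.

```python
# Per-layer timings in milliseconds, as quoted in the article.
ms_per_layer = {
    "autocast": {"FP8 DelayedScaling": 7.80, "MXFP8": 8.98},
    "prequantized": {"FP8 DelayedScaling": 8.12, "MXFP8": 6.81},
}


def fastest(timings: dict[str, float]) -> str:
    """Return the recipe with the lowest time in one measurement mode."""
    return min(timings, key=timings.get)


winners = {mode: fastest(timings) for mode, timings in ms_per_layer.items()}

if __name__ == "__main__":
    for mode, recipe in winners.items():
        print(f"{mode}: {recipe}")
```

The winner differs between the two modes, which is why a single "fastest recipe" claim is incomplete without naming the mode it was measured in.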

Speedup promises and reality diverge without measurement

Tensor core specs promise 2x to 3x gains from FP4, but CodonFM saw NVFP4 gains over MXFP8 of only 1.46x to 1.66x on large GEMMs and 1.05x on small ones. Once non-GEMM overhead and weight-gradient (Wgrad) times are included, the end-to-end gap compresses further. A team that assumes FP4 will deliver a 3x training speedup and commits to a week-long training run risks discovering too late that the actual gain is 1.2x, or that kernels silently fell back to FP8.
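The compression from GEMM-level to end-to-end speedup follows Amdahl's law. The sketch below is a generic estimate rather than the article's method. The 60% GEMM share used in the example is a made-up figure, so substitute the share measured in your own profile.

```python
def end_to_end_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """Amdahl's law: only the GEMM share of step time gets faster.

    gemm_fraction is the share of baseline step time spent in the GEMMs
    that the lower precision accelerates; everything else is unchanged.
    """
    if not 0.0 <= gemm_fraction <= 1.0:
        raise ValueError("gemm_fraction must be between 0 and 1")
    if gemm_speedup <= 0:
        raise ValueError("gemm_speedup must be positive")
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)


if __name__ == "__main__":
    # Hypothetical: 60% of step time in accelerated GEMMs, 1.66x on those GEMMs.
    print(f"{end_to_end_speedup(0.6, 1.66):.2f}x end to end")
```

With those illustrative inputs, the best GEMM-level gain in the case study (1.66x) would shrink to about 1.31x for the whole step.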

Quantization overhead is not constant. NVFP4 incurs stochastic rounding, random Hadamard transforms, and 2D block scaling. MXFP8 does not. For layers where the GEMM itself is small (like attention output), these operations can dominate the wall-clock time, erasing the theoretical speedup. The benchmark script isolates this by running both autocast and prequantized modes on the same shapes, letting teams see exactly where the gap lives.

Before you commit to FP4, validate on your shapes

Run the benchmark on your exact model config. Do not assume that speedups published for other models apply to yours. Pay special attention to the comparison between autocast and prequantized results. If the autocast speedup is much lower than the prequantized speedup, quantization overhead is the bottleneck, not kernel throughput. Also compare GPU memory usage across the precision runs; identical usage suggests the kernels silently fell back to a higher precision.

Use autocast results to predict training speedup. Use prequantized results to diagnose whether quantization or kernel selection is the limiter. If GEMM-level speedup is near 1.0 even in prequantized mode, FP4 kernels are either not dispatching or not beneficial for your shapes. Confirm with NVTE_LOG_LEVEL=1 or Nsight Systems before scaling to a production run.
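The decision rule above could be encoded as a simple triage function. This is a minimal sketch. The `flat` and `gap_ratio` thresholds are arbitrary placeholders chosen for illustration, and `diagnose` and its return labels are hypothetical names rather than anything from NVIDIA's tooling.

```python
def diagnose(autocast_speedup: float, prequantized_speedup: float,
             *, flat: float = 1.1, gap_ratio: float = 1.3) -> str:
    """Classify where a low-precision speedup is being lost.

    flat:      prequantized speedups below this count as "near 1.0".
    gap_ratio: prequantized/autocast ratios above this count as a large gap.
    Both thresholds are illustrative and should be tuned to your tolerance.
    """
    if autocast_speedup <= 0 or prequantized_speedup <= 0:
        raise ValueError("speedups must be positive")
    if prequantized_speedup < flat:
        # Even the raw kernels are no faster: check dispatch or shape fit.
        return "kernels_not_dispatching_or_not_beneficial"
    if prequantized_speedup / autocast_speedup > gap_ratio:
        # Kernels are fast, but training-time quantization eats the gain.
        return "quantization_overhead"
    return "kernel_bound"


if __name__ == "__main__":
    print(diagnose(1.98, 3.48))  # the NVFP4-over-BF16 figures from the case study
    print(diagnose(1.02, 1.04))  # hypothetical flat result
```

A result in the first category is the cue to inspect kernel dispatch before launching a production run.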
