Our Take
The speedup is real, but the 8% to 93% range shows that gains depend heavily on the workload: NVIDIA reports 8% on DeepSeek-V3 pre-training and 93% on GPT-OSS setups. Check how closely your workload matches the one tested before budgeting training time.
Why it matters
MoE models dominate large-scale training pipelines, and NVIDIA's fused kernels directly reduce the memory traffic and CPU overhead that slow their training iterations. If you're running DeepSeek-V3–scale training or custom sparse models, the kernels are already shipping in production libraries.
Do this week
MoE team lead: Benchmark the fused kernels in Transformer Engine (2.15+) or cuDNN Frontend (1.23.0+) against your current setup on a single training step before the next monthly optimization cycle, so you can quantify real wall-clock gains on your hardware.
Three Custom Kernels Eliminate MoE Training Bottlenecks
NVIDIA released three fused kernel patterns built in the CuTe DSL to address memory and CPU overhead in mixture-of-experts training. The kernels combine matrix multiplication (GEMM), activation functions (SwiGLU, GeGLU, sReLU), and quantization (MXFP8, NVFP4) into single operations, eliminating intermediate tensor reads and writes that leave GPU compute cores idle.
The three patterns are:
- GroupGemm + Quantize
- GroupGemm + Activation + Quantize/Transpose
- GroupGemm + dActivation + Quantize/Transpose (backward pass)
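To make the second pattern concrete, here is a minimal pure-Python sketch of the unfused path it replaces: three separate passes per expert, each writing out a full intermediate. Everything in it is an illustrative assumption rather than NVIDIA's implementation: the function names are hypothetical, the [gate | up] column layout for SwiGLU is one common convention, and the quantizer is an int8-style stand-in, not MXFP8 or NVFP4.

```python
import math

def matmul(a, b):
    """Plain dense matmul on nested lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def swiglu(h):
    """SwiGLU over rows laid out as [gate | up] (assumed layout): silu(gate) * up."""
    half = len(h[0]) // 2
    out = []
    for row in h:
        gate, up = row[:half], row[half:]
        out.append([g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)])
    return out

def quantize_blocks(x, block=4, qmax=127.0):
    """Stand-in block quantizer: one scale per `block` values, block max maps to qmax."""
    q, scales = [], []
    for row in x:
        qrow, srow = [], []
        for i in range(0, len(row), block):
            chunk = row[i:i + block]
            amax = max(abs(v) for v in chunk)
            scale = amax / qmax if amax > 0 else 1.0
            srow.append(scale)
            qrow.extend(round(v / scale) for v in chunk)
        q.append(qrow)
        scales.append(srow)
    return q, scales

def unfused_expert_mlp(tokens_by_expert, weights_by_expert):
    """Three separate passes per expert; `h` and `a` are fully materialized."""
    outputs = []
    for tokens, w in zip(tokens_by_expert, weights_by_expert):
        if not tokens:              # an expert may receive no tokens this step
            outputs.append(([], []))
            continue
        h = matmul(tokens, w)       # pass 1: GEMM result written out
        a = swiglu(h)               # pass 2: read h, write activation
        outputs.append(quantize_blocks(a))  # pass 3: read a, write q + scales
    return outputs
```

A fused kernel would produce the quantized output and scales directly from the GEMM accumulator, so `h` and `a` never round-trip through memory.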
At the kernel level, these achieve 1.3x–2.1x speedups over unfused paths (per NVIDIA microbenchmarks). The kernels also implement sync-free execution: they track tokens-per-expert in GPU memory instead of launching separate CPU-dependent kernel calls, which removes CPU synchronization stalls and enables full-iteration CUDA graphs.
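The bookkeeping behind tokens-per-expert can be sketched on the host side. The snippet below is only an illustration of the data layout a grouped GEMM needs (per-expert counts, group offsets, and a permutation that makes each expert's tokens contiguous); the function names are hypothetical, and in the sync-free design described above these values would stay in GPU memory rather than being read back by the CPU.

```python
from itertools import accumulate

def tokens_per_expert(expert_ids, num_experts):
    """Count how many tokens the router sent to each expert."""
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return counts

def group_offsets(counts):
    """Exclusive prefix sum: expert e owns rows [offsets[e], offsets[e + 1])."""
    return [0] + list(accumulate(counts))

def permute_by_expert(expert_ids):
    """Stable sort of token indices by expert so each group is contiguous."""
    return sorted(range(len(expert_ids)), key=lambda t: expert_ids[t])

expert_ids = [2, 0, 2, 1, 0, 2]           # router decision for six tokens
counts = tokens_per_expert(expert_ids, 4)  # [2, 1, 3, 0]
offsets = group_offsets(counts)            # [0, 2, 3, 6, 6]
order = permute_by_expert(expert_ids)      # [1, 4, 3, 0, 2, 5]
```

Because `counts` changes every step, a host that needs these numbers to size its launches has to wait for the GPU, which is the stall the sync-free kernels are described as avoiding.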
In internal testing, NVIDIA reports an 8% end-to-end speedup on DeepSeek-V3 pre-training and 93% on GPT-OSS setups (both figures company-reported). The kernels are available now in cuDNN Frontend (v1.23.0+), NVIDIA Transformer Engine (v2.15+), and Megatron-Core (26.04-alpha.rc2+), accessible via direct API calls or higher-level framework abstractions.
End-to-End Gains Depend Heavily on Workload Architecture
The wide gap between reported end-to-end speedups (8% vs. 93%) reflects differences in how tightly other components constrain training. DeepSeek-V3 likely already benefits from other NVIDIA optimizations and parallelism strategies that limit the fusion kernel's contribution. GPT-OSS, by contrast, appears to be a baseline setup where MoE block overhead is less masked by communication or synchronization elsewhere in the stack.
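A rough way to reason about that gap is Amdahl's law: if only a fraction of each training step runs through the fused kernels, the end-to-end gain is capped by that fraction. The sketch below is a back-of-the-envelope model, not NVIDIA's analysis. It assumes "93% speedup" means 1.93x throughput, and it ignores the sync-free and CUDA-graph effects, which are not captured by a per-kernel speedup.

```python
def end_to_end_speedup(fused_fraction, kernel_speedup):
    """Amdahl's law: only `fused_fraction` of step time gets `kernel_speedup` faster."""
    return 1.0 / ((1.0 - fused_fraction) + fused_fraction / kernel_speedup)

def fraction_needed(target_speedup, kernel_speedup):
    """Invert Amdahl's law: share of step time that must be accelerated."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / kernel_speedup)

# Share of step time the fused blocks would need to cover at a 2.1x kernel speedup.
low = fraction_needed(1.08, 2.1)    # about 0.14 for an 8% end-to-end gain
high = fraction_needed(1.93, 2.1)   # about 0.92 for a 93% end-to-end gain

# At the 1.3x end of the kernel range, 1.93x is unreachable (fraction > 1).
impossible = fraction_needed(1.93, 1.3) > 1.0
```

Under these assumptions, kernel fusion alone would need to cover roughly 92% of step time to explain a 93% gain, which suggests that removing CPU stalls contributes a large share in setups like the GPT-OSS one.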
The real value is architectural. By eliminating CPU launch overhead and enabling synchronization-free CUDA graphs, these kernels unlock concurrent execution of other GPU operations (all-gather, all-reduce) during the MoE block itself. This matters most in sparse models where expert dispatch typically creates unpredictable token-to-expert mappings and forces either CPU-side shape calculation or dynamic kernel launches.
For teams already using Transformer Engine or Megatron-Core, the kernels are a drop-in win. For custom sparse architectures, the CuTe DSL source is available for modification and contribution via GitHub.
How to Test and Integrate
Three integration points exist, in order of increasing abstraction. Lowest-level users can invoke kernels directly from cuDNN Frontend and manage caching themselves. PyTorch users can call operations via Transformer Engine's sequential pattern-matching API, which auto-selects the fused kernel. Megatron-Core users need only set configuration flags to enable fusion.
The kernels handle standard MoE activation functions natively and support optional feature scaling, clamping, and bias addition. If your activation function is not yet supported (for example, Mish or other ReLU variants), NVIDIA invites PRs against cuDNN Frontend on GitHub.
Start with a single-GPU microbenchmark of your forward and backward passes using your actual batch size, sequence length, and expert count. Measure kernel time and memory bandwidth before and after enabling fusion. Then run a full training iteration on a multi-GPU setup to capture the synchronization and graph-building wins, which dominate the end-to-end speedup in practice.
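A before/after comparison along those lines can be as simple as the timing harness sketched below. It is framework-agnostic and uses only the standard library; `time_step`, `compare`, and the `sync` hook are hypothetical names for illustration. On a GPU, `sync` should block until queued work has finished (with PyTorch, `torch.cuda.synchronize` serves this purpose), otherwise the timer only measures launch time.

```python
import statistics
import time

def time_step(step_fn, warmup=3, iters=10, sync=None):
    """Median wall-clock milliseconds per call of `step_fn`."""
    sync = sync or (lambda: None)   # no-op on CPU; pass a device sync on GPU
    for _ in range(warmup):         # warm-up calls are excluded from timing
        step_fn()
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step_fn()
        sync()                      # wait for queued work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

def compare(baseline_fn, candidate_fn, **kwargs):
    """Time both paths with identical settings and report the ratio."""
    base = time_step(baseline_fn, **kwargs)
    cand = time_step(candidate_fn, **kwargs)
    return {"baseline_ms": base, "candidate_ms": cand, "speedup": base / cand}
```

Note that synchronizing after every call hides part of what sync-free execution buys you, so for the full-iteration comparison it is more representative to time a run of several steps and synchronize once at the end.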