NVIDIA Fused MoE Kernels Cut Training Time 8–93% for DeepSeek-V3, GPT-OSS

Three Custom Kernels Eliminate MoE Training Bottlenecks

NVIDIA released three fused kernel patterns built in the CuTe DSL to address memory and CPU overhead in mixture-of-experts training. The kernels combine matrix multiplication (GEMM), activation functions (SwiGLU, GeGLU, sReLU), and quantization (MXFP8, NVFP4) into single operations, eliminating intermediate tensor reads and writes that leave GPU compute cores idle.

The three patterns are:

GroupGemm + Quantize
GroupGemm + Activation + Quantize/Transpose
GroupGemm + dActivation + Quantize/Transpose (backward pass)
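To make the second pattern concrete, here is a minimal unfused reference in plain Python, written as a sketch of the semantics rather than NVIDIA's implementation. The function names, the gate/value split inside SwiGLU, and the toy per-row scaling are assumptions for illustration; real MXFP8 and NVFP4 use per-block scales and low-bit float encodings. The point is that each stage materializes a full intermediate, which is the memory traffic a fused kernel avoids.

```python
import math

def matmul(a, b):
    """Plain row-major matrix multiply: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def swiglu(row):
    """SwiGLU on one row; assumes the first half is the gate, the second the value."""
    half = len(row) // 2
    gate, value = row[:half], row[half:]
    return [g / (1.0 + math.exp(-g)) * v for g, v in zip(gate, value)]

def quantize_row(row, qmax=448.0):
    """Toy stand-in for quantization: one scale per row, values mapped into
    [-qmax, qmax]. qmax=448 mirrors the largest finite FP8 E4M3 value."""
    amax = max(abs(v) for v in row) or 1.0
    scale = amax / qmax
    return [v / scale for v in row], scale

def grouped_gemm_act_quant(tokens, tokens_per_expert, expert_weights):
    """Unfused reference: tokens are sorted by expert, one weight matrix each."""
    out, start = [], 0
    for count, weight in zip(tokens_per_expert, expert_weights):
        group = tokens[start:start + count]
        hidden = matmul(group, weight)                 # intermediate 1: GEMM output
        activated = [swiglu(r) for r in hidden]        # intermediate 2: activation
        out.extend(quantize_row(r) for r in activated) # final: (scaled row, scale)
        start += count
    return out
```

A fused kernel would produce the same final rows and scales while keeping `hidden` and `activated` in registers or shared memory instead of writing them out between stages.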

At kernel level, these achieve 1.3x–2.1x speedup over unfused paths (per NVIDIA microbenchmarks). The kernels also implement sync-free execution: they track tokens-per-expert in GPU memory instead of launching separate CPU-dependent kernel calls, which removes CPU synchronization stalls and enables full-iteration CUDA graphs.
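The sync-free idea can be illustrated with a small sketch. Assuming tokens are already sorted by expert, the per-expert counts alone are enough to locate every group, so a kernel that reads those counts from device memory does not need the host to compute shapes first. The helper names below are hypothetical and the code runs on the CPU purely to show the bookkeeping.

```python
def expert_offsets(tokens_per_expert):
    """Exclusive prefix sum: the starting row of each expert's token group."""
    offsets, total = [], 0
    for count in tokens_per_expert:
        offsets.append(total)
        total += count
    return offsets

def expert_for_row(row, tokens_per_expert):
    """Find which expert owns a given row using only the counts.
    Returns None for rows past the routed tokens (for example, padding)."""
    end = 0
    for expert, count in enumerate(tokens_per_expert):
        end += count
        if row < end:
            return expert
    return None
```

Because the launch shape no longer depends on values the CPU has to read back, the same launch can be recorded once and replayed, which is what makes full-iteration CUDA graphs possible.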

NVIDIA reports an 8% end-to-end speedup on DeepSeek-V3 pre-training and 93% on GPT-OSS setups; both figures come from the company's internal testing. The kernels are available now in cuDNN Frontend (v1.23.0+), NVIDIA Transformer Engine (v2.15+), and Megatron-Core (26.04-alpha.rc2+), and can be reached through direct API calls or higher-level framework abstractions.

End-to-End Gains Depend Heavily on Workload Architecture

The wide gap between the reported end-to-end speedups (8% vs. 93%) reflects how much of each iteration the MoE block accounts for and how tightly other components constrain training. DeepSeek-V3 likely already benefits from other NVIDIA optimizations and parallelism strategies that limit the fused kernels' contribution. GPT-OSS, by contrast, appears to be a baseline setup where MoE block overhead is less masked by communication or synchronization elsewhere in the stack.
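A back-of-the-envelope Amdahl's-law calculation helps put the two numbers in context. The sketch below assumes that "8% speedup" and "93% speedup" mean overall throughput ratios of 1.08x and 1.93x, and that only the fused portion of the iteration gets faster; both are simplifying assumptions, not NVIDIA's stated methodology.

```python
def end_to_end_speedup(fraction, kernel_speedup):
    """Amdahl's law: overall speedup when `fraction` of the runtime
    is accelerated by `kernel_speedup` and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def implied_fraction(overall_speedup, kernel_speedup):
    """Invert Amdahl's law: the share of runtime that must be accelerated
    to reach `overall_speedup`. A result above 1.0 means kernel speed
    alone cannot explain the overall gain."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)

if __name__ == "__main__":
    for overall in (1.08, 1.93):
        for kernel in (1.3, 2.1):
            print(overall, kernel, round(implied_fraction(overall, kernel), 3))
```

Under these assumptions, a 1.08x overall gain at the 2.1x kernel speedup corresponds to roughly 14% of iteration time being accelerated, while a 1.93x gain would need about 92%, and at 1.3x it is not reachable at all. That is consistent with the view that the larger figure comes mainly from removing synchronization and launch overhead rather than from faster kernels alone.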

The real value is architectural. By eliminating CPU launch overhead and enabling synchronization-free CUDA graphs, these kernels unlock concurrent execution of other GPU operations (all-gather, all-reduce) during the MoE block itself. This matters most in sparse models where expert dispatch typically creates unpredictable token-to-expert mappings and forces either CPU-side shape calculation or dynamic kernel launches.

For teams already using Transformer Engine or Megatron-Core, the kernels are a drop-in win. For custom sparse architectures, the CuTe DSL source is available for modification and contribution via GitHub.

How to Test and Integrate

Three integration points exist, in order of increasing abstraction. Lowest-level users can invoke the kernels directly from cuDNN Frontend and manage caching themselves. PyTorch users can call operations via Transformer Engine's sequential pattern-matching API, which auto-selects the fused kernel. Megatron-Core users need only set configuration flags to enable fusion.

The kernels handle standard MoE activation functions natively and support optional feature scaling, clamping, and bias addition. If your activation function is not yet supported (for example, Mish or other ReLU variants), NVIDIA invites PRs against cuDNN Frontend on GitHub.

Start with a single-GPU microbenchmark on your forward and backward passes using your actual batch size, sequence length, and expert count. Measure kernel time and memory bandwidth before and after. Then run a full training iteration on multi-GPU to capture the synchronization and graph-building wins, which dominate the end-to-end speedup in practice.
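A before-and-after comparison can be structured with a small timing helper like the one below. This is a generic sketch using only the Python standard library; `median_time`, `speedup`, and the `sync` hook are hypothetical names, and the callables you pass in would wrap your own unfused and fused forward or backward passes.

```python
import statistics
import time

def median_time(fn, warmup=3, iters=20, sync=None):
    """Median wall-clock seconds per call of `fn`.
    `sync` is an optional callable that blocks until queued work finishes."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warmup work is done before timing starts
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # include asynchronous work in the measured interval
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_fn, candidate_fn, **kwargs):
    """Ratio of median baseline time to median candidate time."""
    return median_time(baseline_fn, **kwargs) / median_time(candidate_fn, **kwargs)
```

GPU work is launched asynchronously, so for PyTorch workloads you would pass something like `sync=torch.cuda.synchronize`; without it the timer measures launch cost, not kernel time. Wall-clock timing of this kind does not capture memory bandwidth, which needs a profiler.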
