DeepMind DiffusionGemma hits 4x faster text generation on GPUs

DeepMind releases DiffusionGemma, a parallel-decoding text model

DeepMind today released DiffusionGemma, a 26B Mixture of Experts model that replaces autoregressive token-by-token generation with parallel diffusion-based decoding. The model generates 256 tokens simultaneously per forward pass, reaching up to 1000 tokens per second on a single NVIDIA H100 GPU and 700+ tokens per second on an RTX 5090 (per DeepMind's report). The model weights are open under Apache 2.0 and available on Hugging Face.

DiffusionGemma activates only 3.8B of its 26B parameters at inference time and fits within 18GB VRAM on high-end consumer GPUs when quantized. Unlike autoregressive Gemma 4 models, which predict one token at a time from all prior tokens, DiffusionGemma starts with a canvas of random placeholder tokens and iteratively refines them across multiple passes, with each token attending to all others (bidirectional attention). This approach unlocks specific capabilities: reliably closing markdown structures, rendering code in near real-time, and handling tasks like code infilling or Sudoku where future tokens influence current predictions.
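The refinement loop described above can be illustrated with a toy sketch. This is not DeepMind's decoding algorithm: it assumes a simplified scheme in which every pass proposes a token for every position and only the most confident proposals are committed. The names `MASK`, `toy_predict`, and `refine` are hypothetical, a single mask token stands in for the random placeholders, and the "model" is a stub that reads from a known target.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, stands in for the random canvas


def toy_predict(canvas, target, rng):
    """Stub for one forward pass: a (token, confidence) proposal per position.

    A real model would derive these from bidirectional attention over the
    whole canvas. Here the tokens come from a fixed target and the
    confidences are random, purely to drive the loop.
    """
    return [(target[i], rng.random()) for i in range(len(canvas))]


def refine(target, steps=4, seed=0):
    """Fill a fully masked canvas over several parallel passes."""
    rng = random.Random(seed)
    n = len(target)
    canvas = [MASK] * n
    per_step = -(-n // steps)  # ceiling division: positions committed per pass
    history = []
    for _ in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        proposals = toy_predict(canvas, target, rng)
        # Commit the still-masked positions with the highest confidence.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history


if __name__ == "__main__":
    tokens = "def add ( a , b ) : return a + b".split()
    final, passes = refine(tokens, steps=4)
    for step, snapshot in enumerate(passes, start=1):
        print(step, " ".join(snapshot))
```

Each printed line shows the canvas after one pass. Positions fill in out of order, which is the property that lets later tokens inform earlier ones.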

DeepMind explicitly positions the release as experimental. Output quality is materially lower than Gemma 4's. The speedup is strongest at batch sizes 1-4 on a single accelerator, and the company warns that high-throughput cloud serving (where autoregressive batching already saturates compute) will see diminishing returns and potentially higher serving costs. The model runs on vLLM (including a Red Hat integration), MLX, and Hugging Face Transformers, with llama.cpp support forthcoming. NVIDIA optimized the model for consumer (RTX 4090/5090) and enterprise (Hopper/Blackwell with NVFP4 kernels) hardware.

Speed-critical local inference now has a hardware-efficient path forward

Autoregressive inference on a single user's GPU leaves most of the accelerator's arithmetic capacity unused. Every new token requires streaming the active model weights from memory, so memory bandwidth, not compute, limits throughput, and the arithmetic units sit idle while they wait for data. DiffusionGemma shifts the bottleneck toward compute by processing a full block of tokens per pass, making fuller use of the accelerator's arithmetic. This matters for interactive applications that run locally: in-line code completion, rapid iteration loops, and structured editing where latency directly impacts user experience.
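A back-of-envelope calculation makes the bandwidth argument concrete. The sketch below assumes that each forward pass must read the active weights from memory once, and it ignores KV-cache traffic, expert routing, and compute limits. The function name and every number in the example call (memory bandwidth, bytes per parameter, refinement passes per block) are illustrative assumptions, not figures from DeepMind's report; only the 3.8B active parameters and the 256-token block come from the article.

```python
def weight_bound_tokens_per_s(bandwidth_bytes_per_s, weight_bytes_per_pass,
                              tokens_per_pass=1, passes_per_block=1):
    """Upper bound on tokens/s when reading weights is the only cost.

    passes_per_s is how many times per second the weights can be streamed
    from memory. Each pass works on tokens_per_pass tokens, and a block
    needs passes_per_block passes before its tokens are final.
    """
    passes_per_s = bandwidth_bytes_per_s / weight_bytes_per_pass
    return passes_per_s * tokens_per_pass / passes_per_block


if __name__ == "__main__":
    bandwidth = 3.0e12          # assumed: ~3 TB/s of memory bandwidth
    active_weights = 3.8e9 * 2  # 3.8B active params at an assumed 2 bytes each

    # Autoregressive decoding: one token per pass, one pass per token.
    print(weight_bound_tokens_per_s(bandwidth, active_weights))

    # Block decoding: 256 tokens per pass, assumed 16 refinement passes.
    # A 256-token block likely touches more experts than a single token,
    # so the real weight traffic per pass may be higher than assumed here.
    print(weight_bound_tokens_per_s(bandwidth, active_weights,
                                    tokens_per_pass=256, passes_per_block=16))
```

The second bound is far above what any GPU can compute, which is the point: once many tokens share one weight read, memory bandwidth stops being the limit and arithmetic throughput takes over.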

The bidirectional attention pattern also enables new model behaviors. Tasks that autoregressive models struggle with (Sudoku, code infilling, markdown formatting) become easier when the model can see the entire output space at once and refine it iteratively. A fine-tuned DiffusionGemma successfully solved Sudoku puzzles, a task autoregressive models handle poorly because the correct value of each cell depends on cells that have not been generated yet.

However, the speed win does not apply universally. Cloud inference that batches thousands of user requests together already saturates compute on autoregressive models. DiffusionGemma's parallel decoding offers little or no speedup in that scenario and may increase serving costs. Practitioners should not assume the 4x speedup applies to their deployment.

Test locally before adopting; measure your batch size and hardware

Download DiffusionGemma and profile it on your actual workload at your expected batch size (1, 2, 4, or higher). If you are building single-user interactive tools on a dedicated GPU, run a latency benchmark against Gemma 4 and measure quality loss on your specific task. If you are deploying to the cloud with batch sizes above 4 or expect high QPS, stick with standard Gemma 4. Fine-tuning on task-specific data can recover lost quality; Unsloth and NVIDIA NeMo offer tutorials. Verify quantized VRAM usage on your target GPU before committing to production.
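A minimal harness for the profiling step might look like the sketch below. It is framework-agnostic on purpose: `generate_fn` is a hypothetical callable that you would wrap around whichever runtime you use (vLLM, Transformers, MLX), taking a list of prompts and returning one list of generated tokens per prompt. The function name, the result keys, and the injectable `clock` parameter are assumptions made for illustration.

```python
import statistics
import time


def benchmark(generate_fn, prompts, batch_size, runs=5, clock=time.perf_counter):
    """Time repeated generation calls at a fixed batch size.

    generate_fn(batch) must return one sequence of tokens per prompt.
    Returns the batch size used, the median latency per call, and the
    overall tokens per second across all runs.
    """
    batch = prompts[:batch_size]
    latencies = []
    total_tokens = 0
    for _ in range(runs):
        start = clock()
        outputs = generate_fn(batch)
        latencies.append(clock() - start)
        total_tokens += sum(len(out) for out in outputs)
    elapsed = sum(latencies)
    return {
        "batch_size": len(batch),
        "median_latency_s": statistics.median(latencies),
        "tokens_per_s": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Placeholder generator: replace with a call into your actual runtime.
    def fake_generate(batch):
        time.sleep(0.01)
        return [["tok"] * 256 for _ in batch]

    for size in (1, 2, 4, 8):
        print(benchmark(fake_generate, ["prompt"] * 8, batch_size=size))
```

Run the same harness against both models with identical prompts and batch sizes, and discard the first run if your runtime compiles kernels or warms caches on first use.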
