DiffusionGemma hits 1,000 tokens/sec on H100, cuts real-time AI latency

Parallel generation cuts tokens-per-second by 5–10x on NVIDIA hardware

Google DeepMind released DiffusionGemma, a 26B parameter model built on the Gemma 4 MoE architecture that generates 256 tokens in parallel per diffusion step instead of one token at a time. On an NVIDIA H100 Tensor Core GPU, the model achieves up to 1,000 tokens/sec (company-reported). On NVIDIA DGX Spark it reaches 150 tokens/sec; on DGX Station, up to 2,000 tokens/sec (company-reported).

The model supports up to 256K token context length, runs in BF16 and NVFP4 precision, and is available today on Hugging Face. NVIDIA has packaged it into NIM, a containerized inference microservice with OpenAI-compatible APIs, enabling one-command deployment to local, cloud, or hybrid infrastructure. Fine-tuning support arrives via NVIDIA NeMo AutoModel, which lets developers adapt the model without manual checkpoint conversion.

Developers can prototype on RTX 5090 or DGX Spark using Hugging Face Transformers, then scale to multi-user serving via vLLM. NVIDIA offers free GPU-accelerated endpoints for prototyping through its Developer Program (build.nvidia.com).

Speed solves the serving cost and concurrency problem—if your bottleneck is throughput

Autoregressive inference (the standard in most LLMs) generates one token per forward pass. For a 256-token response, that is 256 network round trips or 256 sequential GPU operations. DiffusionGemma collapses this into roughly 8–10 steps of parallel generation, cutting wall-clock latency and GPU memory occupancy per request.

In practice, this lowers the cost to serve high-concurrency workloads (many simultaneous users) and improves perceived responsiveness in interactive applications. The tradeoff: diffusion-based decoding introduces sampling variance. Model quality is preserved (per NVIDIA's benchmarks), but the generation process is not deterministic token-by-token.

The Day 0 support matters. NVIDIA has bundled DiffusionGemma into NIM at launch, meaning enterprises can move from research to production without rewriting inference stacks. For teams already running vLLM or Hugging Face, the on-ramp is a single Docker command.

Measure latency variance before migrating; parallel generation helps throughput, not tail latency

DiffusionGemma's speed gains are real but conditional. If your bottleneck is aggregate throughput (tokens per second across all users), parallel generation directly helps. If your bottleneck is tail latency (p99 response time for a single user) or strict output determinism, the gains are smaller.

Benchmark on your exact hardware and workload before committing. A single H100 in a lab is not the same as a distributed inference cluster or a CPU-bound serving layer. Test whether your application can tolerate the variance introduced by diffusion-based sampling.

For teams running large models at scale, consider whether cheaper inference (via throughput gains) justifies retraining or fine-tuning for your domain. For startups or small teams, the free prototyping tier on build.nvidia.com is a low-friction way to test the model on your data.

DiffusionGemma hits 1,000 tokens/sec on H100, cuts real-time AI latency

Our Take

Why it matters

Do this week

Parallel generation cuts tokens-per-second by 5–10x on NVIDIA hardware

Speed solves the serving cost and concurrency problem—if your bottleneck is throughput

Measure latency variance before migrating; parallel generation helps throughput, not tail latency

Related stories

Six in 10 workers skip reading employment contracts

Jury awards former Ameris Bank exec $80M in wrongful termination case

SpaceX IPO mints 4,400 millionaires. Here's how you compete for AI talent.