DFlash Lifts LLM Throughput Up to 15x on NVIDIA Blackwell

NVIDIA and UC San Diego release DFlash, a block-diffusion speculative decoder

Researchers at UC San Diego published DFlash in February 2026 as an upgrade to autoregressive speculative decoding for LLM inference. NVIDIA has productized the method across its GPU stacks: 20 pre-trained checkpoints are now live on Hugging Face with recipes for Blackwell and Hopper. Both vLLM and SGLang integrate DFlash through the open-source Speculators library, allowing teams to swap it in with config changes alone.

The core innovation replaces sequential token drafting with parallel block generation. Instead of a small draft model proposing one token at a time, DFlash's lightweight block-diffusion drafter predicts an entire block of masked future tokens in a single forward pass. The target model then verifies the whole block in parallel, which preserves the target model's output distribution while cutting drafting latency.
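The draft-then-verify step can be illustrated with a minimal sketch. This is not DFlash's implementation: it assumes greedy decoding, uses a hypothetical `target_next_token` callable in place of a real model, and calls the target once per position for readability, whereas a real system scores every drafted position in one forward pass. With sampling, production systems use a rejection-sampling acceptance rule rather than the exact-match test shown here.

```python
from typing import Callable, List, Sequence


def verify_block(
    context: Sequence[int],
    draft_block: Sequence[int],
    target_next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted by one greedy draft-then-verify step.

    Drafted tokens are accepted while they match what the target model
    would have produced. The first mismatch is replaced by the target's
    own token and the rest of the block is discarded.
    """
    emitted: List[int] = []
    for drafted in draft_block:
        # Hypothetical per-position call; real engines batch these checks.
        expected = target_next_token(list(context) + emitted)
        if drafted != expected:
            emitted.append(expected)
            return emitted
        emitted.append(drafted)
    # Whole block accepted: the target's next token comes for free.
    emitted.append(target_next_token(list(context) + emitted))
    return emitted


def toy_target(tokens: Sequence[int]) -> int:
    """Stand-in target model: always predicts (last token + 1) mod 10."""
    return (tokens[-1] + 1) % 10


# The drafter got two tokens right, then guessed 9 where the target wanted 6.
print(verify_block([1, 2, 3], [4, 5, 9, 7], toy_target))  # [4, 5, 6]
```

Because every emitted token is one the target itself would have chosen, the output matches plain greedy decoding of the target. The gain comes from emitting several tokens per verification step when the drafter guesses well.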

Performance varies by hardware and concurrency. On an 8-GPU DGX B300 system (16 Blackwell Ultra dies), DFlash reaches up to a 15x throughput gain over autoregressive decoding on gpt-oss-120b at high concurrency (500–600 tokens/sec per user). Single-GPU gains are more modest: NVIDIA's benchmarks show 5.8x on Gemma 4 31B via vLLM and 5.1x on Qwen3 8B via SGLang. Across multiple Speed-Bench datasets, DFlash outperforms EAGLE-3 speculative decoding by 1.5x to 1.7x on average (company-reported).

Blackwell's dual-die architecture makes block parallelism valuable

Each Blackwell Ultra GPU pairs two reticle-sized dies (160 SMs, 640 Tensor Cores, 15 PFLOPS dense FP4 compute) via 10 TB/s chip-to-chip interconnect. Traditional autoregressive inference is memory-bound in the decode phase: tokens are generated one-by-one, leaving compute on the table while the system waits for memory movement.

Block-diffusion drafting exposes parallelism that Blackwell can actually use. By generating multiple candidate tokens at once and verifying them in parallel, DFlash turns many sequential, memory-bound decode steps into fewer, larger block operations. On Blackwell, this shifts the bottleneck from memory-bound toward compute-bound, allowing the GPU to serve more concurrent users at the same per-user token latency.
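A back-of-envelope roofline estimate shows why processing a block per pass helps. The sketch below is an approximation and not NVIDIA's analysis. It assumes roughly 2 FLOPs per parameter per token, assumes weights are read from memory once per forward pass, and ignores KV-cache and activation traffic. The function names are illustrative.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward pass.

    Assumption: ~2 FLOPs per parameter per token, weights read once per
    pass. The parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def is_compute_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """Roofline test: compute-bound once intensity exceeds peak FLOPs / bandwidth."""
    return intensity >= peak_flops / mem_bandwidth


# FP4 weights occupy about 0.5 bytes per parameter.
for block_len in (1, 8, 16):
    print(block_len, arithmetic_intensity(block_len, bytes_per_param=0.5))
```

Under these assumptions, intensity grows linearly with the number of tokens scored per pass, so verifying a block moves each pass closer to the compute-bound regime. Pass your own hardware's peak FLOPs and memory bandwidth to `is_compute_bound` to see where a given batch size and block length land.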

The technique works across model families (Qwen, Llama, Gemma, gpt-oss) and datasets (coding, reasoning, RAG, multilingual). Smaller models benefit too: DFlash nearly doubles interactivity over EAGLE-3 on Llama 3.1 8B on the Speed-Bench multilingual dataset (company-reported).

Evaluate DFlash against your latency target before scaling

The integration path is low-friction. On vLLM, pin a DFlash checkpoint and update the speculative decoding algorithm in your config; no application code changes are required. SGLang users follow the same steps. Both paths are production-ready, with recipes for Blackwell and Hopper.

The catch: the 15x figure is system-level throughput on the 8-GPU DGX B300 at high concurrency. If you run a single Blackwell GPU or target lower concurrency (batch size 1), expect 2x to 6x gains instead. Measure your actual deployment topology and interactivity target before committing to DFlash's memory footprint: the drafter is an additional lightweight diffusion model that must be stored alongside the target model.

Checkpoints are available for Qwen, Kimi K2.6, Llama, Gemma, and gpt-oss families on Hugging Face. Start with a staging deployment on your most latency-sensitive workload (code generation or multi-turn reasoning) and compare the Pareto curve (throughput vs. per-user latency) against your current baseline.
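One way to run that comparison is to collect a (latency, throughput) point per configuration and concurrency level, then keep only the points that no other point beats on both axes. The sketch below assumes you already have such measurements. The `Run` type, the labels, and the numbers are made up for illustration and are not benchmark results.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    label: str
    latency_ms: float   # per-user latency per token, lower is better
    throughput: float   # total tokens/sec across users, higher is better


def pareto_front(runs: List[Run]) -> List[Run]:
    """Return runs not dominated on (lower latency, higher throughput)."""
    front: List[Run] = []
    best_throughput = float("-inf")
    # Walk from lowest latency up; a run survives only if it beats the
    # throughput of every run that is at least as fast.
    for run in sorted(runs, key=lambda r: (r.latency_ms, -r.throughput)):
        if run.throughput > best_throughput:
            front.append(run)
            best_throughput = run.throughput
    return front


# Made-up numbers for illustration only, not measurements.
runs = [
    Run("baseline@8", 20.0, 400.0),
    Run("baseline@64", 50.0, 1200.0),
    Run("candidate@8", 10.0, 800.0),
    Run("candidate@64", 30.0, 2500.0),
    Run("candidate@256", 80.0, 2400.0),
]
print([r.label for r in pareto_front(runs)])  # ['candidate@8', 'candidate@64']
```

If your current baseline still contributes points to the front at the latency you care about, the switch may not pay off for that workload.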
