Our Take
This is a rare case where hardware architecture forces fundamental changes to cluster software, making rack boundaries hard scheduling constraints rather than optimization hints.
Why it matters
Organizations deploying GB200 clusters will see job fragmentation and steep performance drops unless they update their workload managers; the hierarchical-tree topology assumptions baked into existing schedulers break down at rack scale.
Do this week
Cluster admins: upgrade to Slurm 23.11+ and configure topology/block plugin before deploying GB200 to avoid cross-rack performance penalties.
GB200 creates 130TB/s rack-scale domains with hard boundaries
NVIDIA GB200 NVL72 extends NVLink coherence across 72 Blackwell GPUs in 18 compute trays within a single rack, delivering 1.8TB/s bidirectional throughput per GPU and 130TB/s aggregate bandwidth (per NVIDIA). Communication crossing domain boundaries drops to 50GB/s through InfiniBand or Ethernet, creating a performance cliff that makes rack boundaries hard constraints rather than soft preferences.
This breaks traditional cluster schedulers that treat network fabric as a hierarchical tree where jobs can fragment across switches with modest performance impact. Slurm's topology/tree plugin, the standard for large-scale clusters, makes best-effort attempts to minimize switch spanning but will fragment allocations to reduce queue times.
NVIDIA and SchedMD developed the topology/block plugin in Slurm 23.11 to handle rack-scale architectures. The plugin treats each NVL72 domain as a rigid scheduling unit where jobs requesting 18 nodes or fewer stay within one block, preventing fragmentation.
Workload placement becomes binary: fast or 36x slower
The performance gap between intra-rack (1.8TB/s) and inter-rack (50GB/s) communication represents a 36x difference, making topology awareness mandatory rather than optional. Traditional schedulers face a bad trade-off: fragment jobs across racks and take the bandwidth cliff, or hold jobs for contiguous allocations and accept longer queue times.
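The 36x figure falls straight out of the bandwidth numbers. A back-of-envelope sketch (the 10GB payload is an illustrative assumption, not from the source):

```python
# Back-of-envelope: time to move a 10 GB tensor between GPUs,
# inside an NVL72 domain vs. across the rack boundary.
intra_rack_bw = 1.8e12   # bytes/s, NVLink per GPU within an NVL72 domain (per NVIDIA)
inter_rack_bw = 50e9     # bytes/s, InfiniBand/Ethernet across domains

payload = 10e9           # bytes, illustrative tensor size

t_intra = payload / intra_rack_bw   # seconds within the rack
t_inter = payload / inter_rack_bw   # seconds across racks

print(f"intra-rack: {t_intra * 1e3:.1f} ms")   # ~5.6 ms
print(f"inter-rack: {t_inter * 1e3:.1f} ms")   # 200.0 ms
print(f"slowdown:   {t_inter / t_intra:.0f}x") # 36x
```

Any collective that straddles the boundary is paced by its slowest link, so a single cross-rack hop drags the whole operation down to the inter-rack rate.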
Slurm introduced the --segment parameter to let applications specify their actual NVLink requirements. A job requesting 12 nodes with --segment=4 can split across up to three blocks, while jobs whose NVLink traffic stays within single nodes can use --segment=1 for maximum scheduling flexibility. Expert parallelism requires larger segments to keep all-to-all operations within a single NVLink domain.
The scheduler can assign multiple segments to the same block when possible. Using --segment=16 for a 32-node job ensures balanced allocation (16 nodes per block) rather than uneven splits like 18+14 nodes.
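At submission time this is a single flag. A sketch (job script names and node counts are illustrative; --segment requires the topology/block plugin):

```bash
# 12 nodes in contiguous 4-node NVLink segments; may land in up to 3 blocks
sbatch --nodes=12 --segment=4 train_tp4.sh

# 32 nodes as two balanced 16-node segments instead of an 18+14 split
sbatch --nodes=32 --segment=16 pretrain.sh
```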
Configure one block per NVL72 domain in topology.yaml
Administrators should define one block per GB200 NVL72 domain (18 nodes) using Slurm's topology.yaml file introduced in version 25.05. The configuration prevents jobs from fragmenting across NVLink boundaries while allowing larger jobs to span the minimum required blocks.
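For clusters on 23.11 through 24.x, the same layout is expressed in topology.conf; a sketch assuming hypothetical node names, with one 18-node block per rack (the 25.05 topology.yaml encodes the equivalent structure):

```
# slurm.conf
TopologyPlugin=topology/block

# topology.conf -- one block per GB200 NVL72 domain
BlockName=rack1 Nodes=gb200-[001-018]
BlockName=rack2 Nodes=gb200-[019-036]
BlockSizes=18,36
```

BlockSizes lists the allocation granularities the scheduler may use, so jobs larger than one rack still receive whole-block multiples rather than arbitrary fragments.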
Enable the switch/nvidia_imex plugin for driver-level isolation between jobs sharing the same NVLink domain. This prevents interference without requiring custom prolog/epilog scripts.
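Enabling the plugin is a one-line change (sketch; consult your Slurm version's documentation for IMEX-specific options):

```
# slurm.conf
SwitchType=switch/nvidia_imex
```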
Set segment size based on workload requirements: --segment=1 for maximum scheduling flexibility when NVLink is only needed within single nodes, --segment=4 or --segment=8 for tensor parallelism, and --segment=16 for balanced multi-block allocations. Avoid --segment=18 as it reduces scheduling opportunities when blocks have drained nodes.
Administrators can enforce segment policies through cli_filter/lua scripts that reject jobs not meeting cluster guidelines, ensuring users specify appropriate topology requirements rather than defaulting to full-rack allocations.
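A minimal cli_filter.lua sketch rejecting submissions that omit --segment. The `slurm_cli_pre_submit` hook and `slurm.SUCCESS`/`slurm.ERROR` returns are the plugin's documented interface; the `"segment"` option key is an assumption to verify against your Slurm build:

```lua
-- cli_filter.lua: require an explicit --segment on every submission
function slurm_cli_setup_defaults(options, early_pass)
    return slurm.SUCCESS
end

function slurm_cli_pre_submit(options, pack_offset)
    -- "segment" key assumed to mirror the --segment CLI option
    if options["segment"] == nil then
        slurm.log_error("jobs must set --segment; see cluster guidelines")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_cli_post_submit(offset, job_id, step_id)
    return slurm.SUCCESS
end
```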