Our Take
This is a rare case where hardware architecture forces fundamental changes to cluster software, making rack boundaries hard scheduling constraints rather than optimization hints.
Why it matters
Organizations deploying GB200 clusters will see job fragmentation and steep performance drops unless they update their workload managers; the hierarchical-tree topology assumptions baked into existing schedulers break down at rack scale.
Do this week
Cluster admins: upgrade to Slurm 23.11+ and configure topology/block plugin before deploying GB200 to avoid cross-rack performance penalties.
GB200 creates 130TB/s rack-scale domains with hard boundaries
NVIDIA GB200 NVL72 extends NVLink coherence across 72 Blackwell GPUs in 18 compute trays within a single rack, delivering 1.8TB/s bidirectional throughput per GPU and 130TB/s aggregate bandwidth (per NVIDIA). Communication crossing domain boundaries drops to 50GB/s through InfiniBand or Ethernet, creating a performance cliff that makes rack boundaries hard constraints rather than soft preferences.
This breaks traditional cluster schedulers that treat network fabric as a hierarchical tree where jobs can fragment across switches with modest performance impact. Slurm's topology/tree plugin, the standard for large-scale clusters, makes best-effort attempts to minimize switch spanning but will fragment allocations to reduce queue times.
NVIDIA and SchedMD developed the topology/block plugin in Slurm 23.11 to handle rack-scale architectures. The plugin treats each NVL72 domain as a rigid scheduling unit where jobs requesting 18 nodes or fewer stay within one block, preventing fragmentation.
Workload placement becomes binary: fast or 36x slower
The performance gap between intra-rack (1.8TB/s) and inter-rack (50GB/s) communication represents a 36x difference, making topology awareness mandatory rather than optional. Traditional schedulers face a bad trade-off: fragment jobs across racks and take the bandwidth cliff, or hold jobs for contiguous allocations and accept longer queue times.
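The 36x figure falls straight out of the bandwidth numbers. A back-of-envelope sketch (the 10GB payload is an illustrative assumption, not from the source):

```python
# Back-of-envelope: time to move a 10 GB tensor between GPUs,
# inside an NVL72 domain vs. across the rack boundary.
intra_rack_bw = 1.8e12   # bytes/s, NVLink per GPU within an NVL72 domain (per NVIDIA)
inter_rack_bw = 50e9     # bytes/s, InfiniBand/Ethernet across domains

payload = 10e9           # bytes, illustrative tensor size

t_intra = payload / intra_rack_bw   # seconds within the rack
t_inter = payload / inter_rack_bw   # seconds across racks

print(f"intra-rack: {t_intra * 1e3:.1f} ms")   # ~5.6 ms
print(f"inter-rack: {t_inter * 1e3:.1f} ms")   # 200.0 ms
print(f"slowdown:   {t_inter / t_intra:.0f}x") # 36x
```

Any collective that straddles the boundary is paced by its slowest link, so a single cross-rack hop drags the whole operation down to the inter-rack rate.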
Slurm introduced the --segment parameter to let applications specify their actual NVLink requirements. A job requesting 12 nodes with --segment=4 can split across up to three blocks, while jobs whose NVLink traffic stays within single nodes can use --segment=1 for maximum scheduling flexibility. Expert parallelism requires larger segments to keep all-to-all operations within a single NVLink domain.
The scheduler can assign multiple segments to the same block when possible. Using --segment=16 for a 32-node job ensures balanced allocation (16 nodes per block) rather than uneven splits like 18+14 nodes.
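At submission time this is a single flag. A sketch (job script names and node counts are illustrative; --segment requires the topology/block plugin):

```bash
# 12 nodes in contiguous 4-node NVLink segments; may land in up to 3 blocks
sbatch --nodes=12 --segment=4 train_tp4.sh

# 32 nodes as two balanced 16-node segments instead of an 18+14 split
sbatch --nodes=32 --segment=16 pretrain.sh
```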
Configure one block per NVL72 domain in topology.yaml
Administrators should define one block per GB200 NVL72 domain (18 nodes) using Slurm's topology.yaml file introduced in version 25.05. The configuration prevents jobs from fragmenting across NVLink boundaries while allowing larger jobs to span the minimum required blocks.
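For clusters on 23.11 through 24.x, the same layout is expressed in topology.conf; a sketch assuming hypothetical node names, with one 18-node block per rack (the 25.05 topology.yaml encodes the equivalent structure):

```
# slurm.conf
TopologyPlugin=topology/block

# topology.conf -- one block per GB200 NVL72 domain
BlockName=rack1 Nodes=gb200-[001-018]
BlockName=rack2 Nodes=gb200-[019-036]
BlockSizes=18,36
```

BlockSizes lists the allocation granularities the scheduler may use, so jobs larger than one rack still receive whole-block multiples rather than arbitrary fragments.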
Enable the switch/nvidia_imex plugin for driver-level isolation between jobs sharing the same NVLink domain. This prevents interference without requiring custom prolog/epilog scripts.
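Enabling the plugin is a one-line change (sketch; consult your Slurm version's documentation for IMEX-specific options):

```
# slurm.conf
SwitchType=switch/nvidia_imex
```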
Set segment size based on workload requirements: --segment=1 for maximum scheduling flexibility when NVLink is only needed within single nodes, --segment=4 or --segment=8 for tensor parallelism, and --segment=16 for balanced multi-block allocations. Avoid --segment=18 as it reduces scheduling opportunities when blocks have drained nodes.
Administrators can enforce segment policies through cli_filter/lua scripts that reject jobs not meeting cluster guidelines, ensuring users specify appropriate topology requirements rather than defaulting to full-rack allocations.
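A minimal cli_filter.lua sketch rejecting submissions that omit --segment. The `slurm_cli_pre_submit` hook and `slurm.SUCCESS`/`slurm.ERROR` returns are the plugin's documented interface; the `"segment"` option key is an assumption to verify against your Slurm build:

```lua
-- cli_filter.lua: require an explicit --segment on every submission
function slurm_cli_setup_defaults(options, early_pass)
    return slurm.SUCCESS
end

function slurm_cli_pre_submit(options, pack_offset)
    -- "segment" key assumed to mirror the --segment CLI option
    if options["segment"] == nil then
        slurm.log_error("jobs must set --segment; see cluster guidelines")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_cli_post_submit(offset, job_id, step_id)
    return slurm.SUCCESS
end
```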