Our Take
This is infrastructure mapping, not breakthrough research: useful reference material for teams already committed to AWS, but it won't change architectural decisions.
Why it matters
ML engineers scaling beyond single-GPU setups need concrete specs on memory bandwidth, NVLink domains, and storage tiers before committing to multi-million-dollar cluster builds.
Do this week
Infrastructure teams: audit your current NVLink domain size against communication patterns in MoE models before your next hardware refresh cycle.
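For teams that want a starting point, here is a minimal sketch of that audit, assuming nvidia-smi is available on the node; EXPERT_PARALLEL_SIZE is a placeholder for whatever value your own training config uses, not something from the guide:

```python
# Minimal sketch: compare the NVLink domain visible on one node against the
# expert-parallel group size you plan to use. EXPERT_PARALLEL_SIZE is a
# placeholder for a value from your own MoE training config.
import subprocess

import torch

EXPERT_PARALLEL_SIZE = 16  # hypothetical config value

local_gpus = torch.cuda.device_count()  # 8 on a standard P5/P6 node
print(f"GPUs visible on this node: {local_gpus}")

# Pairwise interconnect matrix; NV# entries mark NVLink paths, NODE/SYS mark PCIe hops.
topo = subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)
print(topo.stdout)

if EXPERT_PARALLEL_SIZE > local_gpus:
    print("Expert-parallel all-to-all will cross the NVLink boundary onto EFA.")
```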
Hugging Face details AWS infrastructure for distributed training
Hugging Face published a technical guide mapping open-source ML frameworks to AWS accelerated computing instances. The analysis covers how PyTorch, JAX, Kubernetes, and Slurm interact with AWS P5/P6 instances across pre-training, post-training, and inference workloads.
The guide provides detailed specifications for AWS GPU instances. P5.48xlarge instances pack eight H100 GPUs with 640 GB of total HBM3, 7.2 TB/s of aggregate NVLink bandwidth, and 400 GB/s of EFA networking. P6-B200.48xlarge instances double the Tensor Core throughput to 2.25 PFLOPS per GPU (company-reported) while expanding to 1,440 GB of HBM3e capacity.
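To make those numbers concrete, here is a rough back-of-envelope sketch (not a benchmark) of what the quoted bandwidths imply for a ring all-reduce; the model size is an arbitrary assumption, and real NCCL performance depends on message size, algorithm choice, and overlap:

```python
# Back-of-envelope sketch using the bandwidth figures quoted above. Treat the
# results as rough lower bounds, not measurements.
def ring_allreduce_seconds(payload_bytes: float, per_gpu_bw: float, n_gpus: int) -> float:
    """A ring all-reduce moves roughly 2 * (N - 1) / N of the payload per GPU."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes / per_gpu_bw

grad_bytes = 70e9 * 2  # hypothetical 70B-parameter model with BF16 gradients

# Inside one P5 node: 7.2 TB/s aggregate NVLink across 8 GPUs ~= 0.9 TB/s per GPU.
print(f"intra-node: {ring_allreduce_seconds(grad_bytes, 0.9e12, 8):.2f} s")

# Across two nodes: 400 GB/s of EFA per instance shared by 8 GPUs ~= 50 GB/s per GPU.
print(f"cross-node: {ring_allreduce_seconds(grad_bytes, 50e9, 16):.2f} s")
```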
AWS extends NVLink domains through UltraServers, which connect multiple instances via a dedicated accelerator interconnect. P6e-GB200 UltraServers expose up to 72 Blackwell GPUs within a single NVLink domain, cutting cross-node traffic for communication-heavy patterns such as expert parallelism in mixture-of-experts models.
NVLink domain size constrains MoE scaling
The guide identifies NVLink domain boundaries as a first-order constraint for workloads with high per-step communication intensity. When all-to-all token dispatch in MoE models spans many GPUs, staying within the NVLink fabric avoids EFA networking overhead.
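One way to see that boundary directly is to time the same all-to-all inside a node-local process group and across the whole job. The sketch below assumes a torchrun launch with the NCCL backend and uses arbitrary message sizes; it illustrates the measurement and is not code from the guide:

```python
# Sketch: time MoE-style token dispatch (all_to_all_single) within an NVLink-local
# subgroup versus across the full job, to expose the cost of crossing onto EFA.
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

gpus_per_node = torch.cuda.device_count()
# Every rank must create every group, in the same order.
node_groups = [dist.new_group(list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
               for n in range(world // gpus_per_node)]
local_group = node_groups[rank // gpus_per_node]

def time_all_to_all(group) -> float:
    n = dist.get_world_size(group)
    chunk = 4 * 1024 * 1024                        # 4M bf16 elements per peer (arbitrary)
    inp = torch.randn(n * chunk, device="cuda", dtype=torch.bfloat16)
    out = torch.empty_like(inp)
    dist.all_to_all_single(out, inp, group=group)  # warm-up / communicator init
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    dist.all_to_all_single(out, inp, group=group)
    torch.cuda.synchronize()
    return time.perf_counter() - t0

t_local, t_world = time_all_to_all(local_group), time_all_to_all(dist.group.WORLD)
if rank == 0:
    print(f"intra-node all_to_all: {t_local * 1e3:.1f} ms")
    print(f"global all_to_all:     {t_world * 1e3:.1f} ms")
```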
Three storage tiers support different access patterns: local NVMe SSD (30.72 TB per instance) for hot data, Amazon FSx for Lustre for shared high-throughput access, and S3 for durable checkpoint storage. The tiered approach addresses both distributed training data streaming and large-scale inference weight staging.
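A minimal sketch of how that tiering can look from a training script's point of view; the mount points and bucket name are placeholder assumptions, not paths from the guide:

```python
# Sketch of the three tiers: stage hot dataset shards from the shared FSx for
# Lustre mount onto local NVMe before the dataloader reads them, and treat S3 as
# the durable tier for checkpoints. All paths below are assumptions.
import shutil
from pathlib import Path

FSX_DATA = Path("/fsx/datasets/my-corpus")      # hypothetical shared Lustre mount
LOCAL_NVME = Path("/nvme/cache")                # hypothetical local instance-store path
S3_CHECKPOINTS = "s3://my-bucket/checkpoints/"  # hypothetical durable checkpoint prefix

def stage_shard(shard_name: str) -> Path:
    """Copy one shard from Lustre to NVMe so reads hit local SSD bandwidth."""
    dst = LOCAL_NVME / shard_name
    if not dst.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(FSX_DATA / shard_name, dst)
    return dst

local_path = stage_shard("shard-00000.parquet")  # hand this path to the dataloader
```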
EFA versions matter for collective communication performance. EFAv4 on P6 instances delivers 18% better collective communication performance than EFAv3 (company-reported), while EFAv3 reduces packet latency by 35% compared to EFAv2.
Choose instance families by communication patterns
Teams running MoE models should size NVLink domains to minimize cross-node all-to-all operations. Standard P5/P6 instances limit NVLink domains to 8 GPUs, forcing larger expert counts onto EFA networking. UltraServers expand this to 72 GPUs but require architectural planning.
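In practice this comes down to how the expert-parallel process group is built. A minimal sketch, assuming a torchrun/NCCL setup and an NVLINK_DOMAIN_SIZE value you set from your own topology (8 on standard P5/P6, up to 72 on a GB200 UltraServer):

```python
# Sketch: cap the expert-parallel group at the NVLink domain size so MoE
# all-to-all dispatch stays on the NVLink fabric. NVLINK_DOMAIN_SIZE is an
# assumption about your hardware, not a value the guide prescribes.
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()

NVLINK_DOMAIN_SIZE = 8                     # 72 on a P6e-GB200 UltraServer
ep_size = min(NVLINK_DOMAIN_SIZE, world)
assert world % ep_size == 0, "world size must be a multiple of the expert-parallel size"

# Every rank must create every group; each rank keeps the one it belongs to.
expert_group = None
for start in range(0, world, ep_size):
    group = dist.new_group(list(range(start, start + ep_size)))
    if start <= rank < start + ep_size:
        expert_group = group

# Pass expert_group to the MoE layer's dispatch/combine collectives.
```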
Storage tier selection depends on checkpoint frequency and dataset size. Models generating multi-terabyte checkpoints benefit from direct Lustre integration with S3 through Data Repository Associations, enabling automatic durability without explicit copy operations.
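A hedged sketch of what that looks like in a training loop, assuming an FSx for Lustre directory that already has a Data Repository Association with automatic export to S3 configured; the path and helper are illustrative only:

```python
# Sketch: write checkpoints to a DRA-linked Lustre directory so S3 durability
# comes from the filesystem's automatic export rather than an explicit upload.
# CKPT_DIR is a hypothetical mount path; the DRA itself is configured in FSx.
from pathlib import Path

import torch

CKPT_DIR = Path("/fsx/checkpoints/run-001")  # hypothetical DRA-linked directory

def save_checkpoint(model, optimizer, step: int) -> None:
    CKPT_DIR.mkdir(parents=True, exist_ok=True)
    tmp = CKPT_DIR / f"step-{step:08d}.pt.tmp"
    final = CKPT_DIR / f"step-{step:08d}.pt"
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    tmp.rename(final)  # write-then-rename so readers never see a partial file
```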
The guide positions itself as the first in a series covering resource orchestration, ML software stacks, and observability layers. Each layer builds on the infrastructure foundation detailed here.