Our Take
NVIDIA's benchmarks, run on its own hardware, show DeepSpeed Ulysses outperforming AllGather KV and Ring Attention on long-sequence workloads, but no independent reproduction exists yet.
Why it matters
Teams building media generation pipelines (video, high-res images) hit memory walls on single GPUs; TensorRT 11's native multi-GPU inference lets them scale a single model to 8-GPU setups while preserving production optimizations. The three context parallelism strategies carry different memory and communication trade-offs, so the choice matters for your latency budget.
Do this week
ML infrastructure lead: run the TensorRT 11 samples on your multi-GPU cluster before committing to a context parallelism strategy, because AllGather KV, Ring Attention, and Ulysses scale differently depending on sequence length and number of GPUs.
TensorRT 11 Adds Native Multi-GPU Inference with Three Parallelism Options
NVIDIA released TensorRT 11.0 with IDistCollectiveLayer primitives that let a single neural network execute across multiple GPUs using integrated distributed communication. The new feature supports eight collective operations: AllReduce, Broadcast, Reduce, AllGather, ReduceScatter, AlltoAll, Gather, and Scatter. All are powered by NVIDIA's NCCL library, which automatically selects the optimal transport (NVLink, NVSwitch, PCIe, InfiniBand) based on topology.
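To make the collective semantics concrete, here is a minimal pure-Python sketch that simulates four of the eight operations on in-memory lists, one list per rank. It is not TensorRT or NCCL code: the function names are hypothetical, sum is assumed as the reduction, and buffers are assumed to divide evenly across ranks.

```python
def all_reduce(bufs):
    """Every rank ends up with the element-wise sum of all buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]

def all_gather(bufs):
    """Every rank ends up with the concatenation of all buffers."""
    joined = [x for buf in bufs for x in buf]
    return [list(joined) for _ in bufs]

def reduce_scatter(bufs):
    """Buffers are summed, then rank r keeps only the r-th chunk of the sum."""
    total = [sum(vals) for vals in zip(*bufs)]
    chunk = len(total) // len(bufs)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(bufs))]

def all_to_all(bufs):
    """Rank dst receives the dst-th chunk from every source rank."""
    chunk = len(bufs[0]) // len(bufs)
    return [[x for src in bufs for x in src[dst * chunk:(dst + 1) * chunk]]
            for dst in range(len(bufs))]

ranks = [[1, 2], [10, 20]]      # two simulated GPUs, two elements each
print(all_reduce(ranks))        # [[11, 22], [11, 22]]
print(all_gather(ranks))        # [[1, 2, 10, 20], [1, 2, 10, 20]]
print(reduce_scatter(ranks))    # [[11], [22]]
print(all_to_all(ranks))        # [[1, 10], [2, 20]]
```

The all-to-all case is the one to study: it is a transpose across ranks, which is what lets a tensor sharded along one dimension become sharded along another.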
The release targets two main parallelism strategies. Tensor parallelism partitions layer weights across GPUs, which is necessary when a single layer exceeds one GPU's memory. Context parallelism partitions the input sequence along the token dimension, which is especially effective for long-sequence workloads like diffusion models.
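The difference between the two strategies comes down to which axis gets split. The sketch below applies both to one linear layer using plain Python lists, with each "GPU" simulated as a list entry. The function names are hypothetical, the column-wise weight split is just one possible tensor-parallel layout, and nothing here uses the TensorRT API.

```python
def matmul(x, w):
    """x: [tokens][in_features], w: [in_features][out_features]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def tensor_parallel(x, w, gpus):
    """Split the weights by output column; every GPU sees every token."""
    cols = len(w[0]) // gpus
    shards = [[row[g * cols:(g + 1) * cols] for row in w] for g in range(gpus)]
    partial = [matmul(x, shard) for shard in shards]
    # Stitch the feature slices back together (an AllGather in a real system).
    return [sum((p[i] for p in partial), []) for i in range(len(x))]

def context_parallel(x, w, gpus):
    """Split the tokens; every GPU holds the full weight matrix."""
    rows = len(x) // gpus
    shards = [x[g * rows:(g + 1) * rows] for g in range(gpus)]
    partial = [matmul(shard, w) for shard in shards]
    return [row for p in partial for row in p]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]    # 4 tokens, 2 input features
w = [[1, 0, 2, 1], [0, 1, 1, 3]]        # 2 input features, 4 output features
assert tensor_parallel(x, w, 2) == context_parallel(x, w, 2) == matmul(x, w)
```

For a token-wise layer like this one, context parallelism needs no communication at all. Attention is where it gets expensive, because every query has to see keys and values from every other shard.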
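For context parallelism, NVIDIA documented three implementations with different latency and memory trade-offs. AllGather KV gathers the key and value shards from every GPU before each attention block, so each GPU computes attention only for its own query shard, at the cost of one collective per attention layer. Ring Attention overlaps communication with computation by passing K and V blocks from GPU to GPU around a ring, and uses online softmax to accumulate results block by block so that no GPU has to hold the full K and V at once. DeepSpeed Ulysses uses two all-to-all collectives: the first re-partitions Q, K, and V from sequence shards to head shards so each GPU attends over the full sequence for a subset of heads, and the second returns the attention output to the sequence-sharded layout.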
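All three implementations should produce the same output as unsharded attention; they differ only in how data moves. The single-process sketch below simulates each one with plain Python lists so the data flow is easy to follow. It is an illustration under stated assumptions, not NVIDIA's implementation: all function names are hypothetical, collectives are modeled by list indexing, and the ring variant updates its online softmax one key at a time rather than one block at a time.

```python
import math

def attend(q, k, v):
    """Reference softmax attention for one head. q: [n][d]; k, v: [m][d]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        s = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(s)
        w = [math.exp(x - top) for x in s]
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / sum(w)
                    for c in range(len(v[0]))])
    return out

def split(rows, parts):
    """Shard a token-major list evenly across simulated GPUs."""
    size = len(rows) // parts
    return [rows[i * size:(i + 1) * size] for i in range(parts)]

def allgather_kv(q, k, v, gpus):
    qs, ks, vs = split(q, gpus), split(k, gpus), split(v, gpus)
    k_all = [row for shard in ks for row in shard]  # AllGather of K shards
    v_all = [row for shard in vs for row in shard]  # AllGather of V shards
    # Each GPU attends from its own queries to the full K and V.
    return [row for g in range(gpus) for row in attend(qs[g], k_all, v_all)]

def ring_attention(q, k, v, gpus):
    qs, ks, vs = split(q, gpus), split(k, gpus), split(v, gpus)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for g in range(gpus):
        # Per query: running max, running denominator, running weighted sum.
        state = [(-math.inf, 0.0, [0.0] * len(v[0])) for _ in qs[g]]
        for step in range(gpus):
            src = (g - step) % gpus  # K/V block that reaches GPU g at this step
            for i, qi in enumerate(qs[g]):
                top, denom, acc = state[i]
                for kj, vj in zip(ks[src], vs[src]):
                    s = scale * sum(a * b for a, b in zip(qi, kj))
                    new_top = max(top, s)
                    fix = math.exp(top - new_top)  # rescale earlier partial sums
                    p = math.exp(s - new_top)
                    denom = denom * fix + p
                    acc = [a * fix + p * x for a, x in zip(acc, vj)]
                    top = new_top
                state[i] = (top, denom, acc)
        out += [[a / denom for a in acc] for _, denom, acc in state]
    return out

def ulysses(q, k, v, gpus):
    """q, k, v: [tokens][heads][d]; heads must divide evenly across gpus."""
    heads = len(q[0])
    per_gpu = heads // gpus
    out = [[None] * heads for _ in q]
    for g in range(gpus):
        for h in range(g * per_gpu, (g + 1) * per_gpu):
            # First all-to-all (modeled by indexing): GPU g receives head h
            # of Q, K, and V for every token in the sequence.
            qh, kh, vh = ([tok[h] for tok in x] for x in (q, k, v))
            # Second all-to-all: outputs return to the token-sharded layout.
            for t, row in enumerate(attend(qh, kh, vh)):
                out[t][h] = row
    return out
```

Comparing each function's output against `attend` on the unsharded inputs is a useful sanity check before moving to real collectives. The ring variant never materializes the full K and V on one GPU, which is where its memory saving comes from.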
Vendor Benchmarks Show Ulysses Fastest for Tens of Thousands of Tokens
NVIDIA benchmarked context parallelism across 8 GPUs using two production models: NVIDIA Cosmos 3 for video generation and Flux.1 for text-to-image (company-reported). On Cosmos 3 video, Ulysses delivered the lowest end-to-end latency, with Ring Attention second and AllGather KV third. On Flux.1 image generation, Ulysses again won, though Ring Attention also scaled to 4 GPUs effectively.
The practical implication is that no single strategy dominates across the board. AllGather KV is the simplest but incurs one all-gather per attention block. Ring Attention saves memory but requires careful synchronization. Ulysses is fastest at extreme sequence lengths (tens of thousands of tokens), but its two all-to-all collectives can dominate communication time on smaller models or shorter sequences.
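A back-of-envelope model helps frame the communication side of that trade-off. The sketch below counts how many elements each GPU sends per attention layer under each strategy. It is a rough, bandwidth-only estimate, not a latency predictor: the function name is hypothetical, it assumes standard multi-head attention where Q, K, V, and the output are all the same size, and it ignores per-collective launch latency, compute/communication overlap, and NCCL's algorithm choice.

```python
def elements_sent_per_gpu(strategy, seq_len, hidden, gpus):
    """Elements each GPU sends per attention layer (bandwidth term only)."""
    local = seq_len // gpus * hidden  # one locally held tensor shard
    if strategy in ("allgather_kv", "ring"):
        # K and V shards are forwarded to, or past, the other gpus - 1 ranks.
        return 2 * local * (gpus - 1)
    if strategy == "ulysses":
        # Q, K, V, and the output each cross one all-to-all; a rank keeps
        # 1/gpus of each tensor and sends the rest.
        return 4 * local * (gpus - 1) // gpus
    raise ValueError(f"unknown strategy: {strategy}")

for name in ("allgather_kv", "ring", "ulysses"):
    print(name, elements_sent_per_gpu(name, seq_len=32768, hidden=4096, gpus=8))
```

Under these assumptions Ulysses moves fewer elements than the other two, and the gap widens with GPU count. Volume is only part of the picture, though: Ring Attention can hide its transfers behind compute, and the fixed cost of each all-to-all weighs more heavily when sequences are short, which is why measured latency can rank the strategies differently.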
No independent benchmark of these three strategies against each other exists yet. The comparison comes only from NVIDIA's own test bed on a single 8-GPU node.
Test All Three Strategies on Your Hardware Before Committing
The TensorRT 11 SDK includes a working C++ example and a PyTorch-to-TensorRT conversion path (Torch-TensorRT) so you can develop in PyTorch and deploy optimized engines in production. The workflow supports multi-device inference out of the box, but choosing the right parallelism strategy depends on your sequence length, number of GPUs, and interconnect topology.
NVIDIA's benchmarks assume single-node, 8-GPU setups. If you are targeting fewer GPUs, longer sequences, or a different interconnect (PCIe vs NVLink), the winner may differ. Run the provided samples on your own cluster before deploying to production, and monitor collective communication time separately from compute time to catch topology bottlenecks early.
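One lightweight way to keep the two numbers apart is to bucket wall-clock time by phase. The helper below is a generic sketch, not part of the TensorRT SDK: `PhaseTimer` and the phase names are hypothetical, and the loop bodies are stand-ins for real work. Because GPU work is launched asynchronously, host-side timers like this are only meaningful if the device is synchronized at each phase boundary; a GPU profiler gives a more faithful breakdown.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTimer:
    """Accumulates wall-clock seconds per named phase, e.g. "compute" or "comm"."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of total measured time spent in each phase."""
        total = sum(self.seconds.values())
        if not total:
            return {}
        return {name: s / total for name, s in self.seconds.items()}

timer = PhaseTimer()
for _ in range(3):
    with timer.phase("compute"):
        sum(i * i for i in range(10_000))  # stand-in for an attention block
    with timer.phase("comm"):
        time.sleep(0.001)                  # stand-in for a collective
print(timer.shares())
```

If the communication share climbs as you add GPUs or move from NVLink to PCIe, that points at the interconnect rather than the kernels.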