News · May 6, 2026 · 2 min read

OpenAI open-sources MRC networking protocol for AI training

New networking protocol spreads packets across hundreds of paths to prevent training failures from network congestion and switch outages.

By Agentic Daily · Verified Source: OpenAI

Our Take

MRC solves real problems at supercomputing scale, but adoption depends on convincing the entire networking stack to rebuild around multipath architectures.

Why it matters

Network failures routinely crash expensive training runs worth millions in GPU time. Organizations running large-scale AI training need resilience protocols that can route around failures in microseconds, not seconds.

Do this week

AI infrastructure teams: Review the MRC specification published through the Open Compute Project this month to evaluate multipath networking ahead of your next hardware refresh cycle.

OpenAI releases MRC protocol through Open Compute Project

OpenAI published the Multipath Reliable Connection (MRC) specification through the Open Compute Project, making available a networking protocol designed for large-scale AI training clusters. The protocol is already deployed across OpenAI's NVIDIA GB200 supercomputers, including systems at Oracle Cloud Infrastructure in Texas and Microsoft's Fairwater supercomputers.

MRC extends RDMA over Converged Ethernet (RoCE) by splitting single high-speed network interfaces into multiple smaller links. Instead of one 800Gb/s connection, MRC creates eight 100Gb/s links connecting to separate switches. This creates parallel network "planes" that can route around failures independently.
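The bandwidth math behind plane splitting is straightforward. A minimal illustrative sketch (the plane and switch names are hypothetical, not from the MRC spec) shows why losing one plane's switch costs only a fraction of a host's bandwidth:

```python
# Illustrative arithmetic only: splitting one interface into parallel planes.
NIC_SPEED_GBPS = 800   # one conventional high-speed interface
NUM_PLANES = 8         # split into eight independent planes

plane_speed = NIC_SPEED_GBPS // NUM_PLANES  # 100 Gb/s per plane

# Each plane's link lands on a different switch, so one switch failure
# removes 1/8 of the host's bandwidth instead of all of it.
planes = [{"plane": i, "gbps": plane_speed, "switch": f"plane-{i}-switch"}
          for i in range(NUM_PLANES)]

# Suppose plane 3's switch reboots: the other seven planes keep carrying traffic.
after_failure = sum(p["gbps"] for p in planes if p["plane"] != 3)
print(plane_speed, after_failure)  # 100 700
```

The host degrades gracefully to 700 Gb/s rather than dropping offline, which is the core resilience argument for the multipath design.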

The protocol spreads individual data transfers across hundreds of paths simultaneously. When packets arrive out of order, MRC delivers them directly to their final memory addresses. If a path becomes congested or fails, MRC stops using it within microseconds and redistributes traffic to other paths.
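The mechanics described above can be sketched in a few lines. This is a simplified simulation, not the MRC wire protocol: every function and path name here is illustrative, and real RDMA placement happens in hardware. The key idea it demonstrates is that each packet carries its destination offset, so out-of-order arrival needs no reassembly queue, and a failed path is simply dropped from the spray set:

```python
import random

def spray_transfer(data: bytes, paths: list[str], chunk_size: int = 4) -> bytearray:
    """Spread one transfer across many paths; place chunks by offset on arrival."""
    # Split the transfer into (offset, payload) chunks and assign paths round-robin.
    chunks = [(off, data[off:off + chunk_size]) for off in range(0, len(data), chunk_size)]
    assignments = [(chunks[i], paths[i % len(paths)]) for i in range(len(chunks))]

    # Simulate out-of-order arrival: each chunk already knows its offset,
    # so the receiver writes it straight into its final memory location.
    random.shuffle(assignments)
    buffer = bytearray(len(data))
    for (off, payload), _path in assignments:
        buffer[off:off + len(payload)] = payload
    return buffer

def fail_over(paths: list[str], failed: set[str]) -> list[str]:
    """Stop using failed paths; traffic redistributes over the survivors."""
    return [p for p in paths if p not in failed]

msg = b"all-reduce gradient shard"
paths = [f"path-{i}" for i in range(8)]
assert spray_transfer(msg, paths) == bytearray(msg)
# Two paths fail mid-run; the transfer still completes over the remaining six.
assert spray_transfer(msg, fail_over(paths, {"path-2", "path-5"})) == bytearray(msg)
```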

OpenAI developed MRC with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The company reports using it to train multiple frontier models and published implementation details in a research paper titled "Resilient AI Supercomputer Networking using MRC and SRv6."

Training jobs worth millions crash from single network failures

Large AI training runs involve millions of synchronized data transfers per training step. One delayed transfer can force thousands of GPUs to wait, rippling through the entire job. Traditional networking protocols assign each transfer to a single path, creating bottlenecks when multiple flows collide.

At supercomputer scale, network components fail constantly. OpenAI reports observing multiple link failures per minute during training runs. Previously, a single switch reboot would crash training jobs, forcing expensive restarts from saved checkpoints.

MRC's multipath approach allows training to continue even during major failures. During one frontier model training run, OpenAI rebooted four tier-1 switches without coordinating with training teams. The job experienced temporary slowdowns but recovered without human intervention.

Evaluate multipath networking for next hardware cycle

MRC requires rebuilding network architecture around multipath topologies. Organizations must split high-speed interfaces into multiple lower-speed connections and deploy switches that support the resulting configuration. This affects both hardware procurement and network operations.

The protocol uses IPv6 Segment Routing (SRv6) with static routing tables instead of dynamic protocols like BGP. This simplifies switch software but requires careful path planning during network design.
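Static routing of this kind can be pictured as a precomputed lookup table. The sketch below is a loose illustration of the idea, not OpenAI's implementation; the rack names and IPv6 segment IDs are invented for the example:

```python
# Hypothetical static path table: with SRv6, the sender encodes the whole
# route as an ordered list of IPv6 segment IDs chosen at network-design time,
# so switches carry no dynamic routing state (no BGP convergence to wait on).
STATIC_PATHS = {
    # (source rack, destination rack, plane) -> ordered segment IDs
    ("rack-a", "rack-b", 0): ["fd00:0:0:1::", "fd00:0:0:2::"],
    ("rack-a", "rack-b", 1): ["fd00:1:0:1::", "fd00:1:0:2::"],
}

def segment_list(src: str, dst: str, plane: int) -> list[str]:
    # The path is fully determined in advance; a lookup replaces route computation.
    return STATIC_PATHS[(src, dst, plane)]

assert segment_list("rack-a", "rack-b", 1) == ["fd00:1:0:1::", "fd00:1:0:2::"]
```

The trade-off the article names follows directly: the table must be planned carefully at design time, because nothing recomputes it at runtime.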

OpenAI's implementation connects more than 131,000 GPUs using only two switch tiers, rather than the three or four that conventional 800Gb/s networks require, according to the company's specifications. This reduces component count, power consumption, and failure modes, but it demands expertise in multipath network design.
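Back-of-envelope arithmetic shows why lower-speed planes flatten the fabric. Assuming a 51.2 Tb/s switch ASIC (a common current-generation size; the figure is our assumption, not from the article) and an idealized non-blocking two-tier Clos where each leaf splits its ports evenly between hosts and spines:

```python
# Illustrative capacity arithmetic under stated assumptions, not OpenAI's design.
ASIC_TBPS = 51.2  # assumed switch ASIC capacity

ports_800g = int(ASIC_TBPS * 1000 // 800)  # 64 ports per switch at 800 Gb/s
ports_100g = int(ASIC_TBPS * 1000 // 100)  # 512 ports per switch at 100 Gb/s

def two_tier_hosts(radix: int) -> int:
    # Idealized two-tier Clos: each leaf gives half its ports to hosts,
    # half to spines, so capacity scales as (radix/2) * radix.
    return (radix // 2) * radix

print(two_tier_hosts(ports_800g))  # 2048 hosts at 800G per port
print(two_tier_hosts(ports_100g))  # 131072 hosts at 100G per port
```

Under these assumptions, the 100 Gb/s planes reach roughly 131,000 endpoints in two tiers, which is consistent with the scale the article reports, while the 800 Gb/s fabric would need extra tiers to get there.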

#Developer Tools · #Enterprise AI · #Open Source · #Research