Our Take
MRC solves real problems at supercomputing scale, but adoption depends on convincing the entire networking stack to rebuild around multipath architectures.
Why it matters
Network failures routinely crash training runs worth millions of dollars in GPU time. Organizations running large-scale AI training need resilience protocols that can route around failures in microseconds, not seconds.
Do this week
AI infrastructure teams: Review the MRC specification on OCP this month to evaluate multipath networking for your next hardware refresh cycle.
OpenAI releases MRC protocol through Open Compute Project
OpenAI published the Multipath Reliable Connection (MRC) specification through the Open Compute Project, making available a networking protocol designed for large-scale AI training clusters. The protocol is already deployed across OpenAI's NVIDIA GB200 supercomputers, including systems at Oracle Cloud Infrastructure in Texas and Microsoft's Fairwater supercomputers.
MRC extends RDMA over Converged Ethernet (RoCE) by splitting a single high-speed network interface into multiple lower-speed links. Instead of one 800Gb/s connection, MRC presents eight 100Gb/s links, each connecting to a separate switch. This creates parallel network "planes" that can route around failures independently.
The protocol spreads individual data transfers across hundreds of paths simultaneously. When packets arrive out of order, MRC delivers them directly to their final memory addresses. If a path becomes congested or fails, MRC stops using it within microseconds and redistributes traffic to other paths.
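The two ideas in the paragraph above, spraying one transfer across many paths and writing out-of-order packets straight to their final memory location, can be sketched in a few lines of Python. This is a simplified illustration, not OpenAI's implementation; the path names, packet layout, and chunk size are invented for the example:

```python
import random

# Hypothetical set of network planes; a real deployment has many paths per plane.
PATHS = {f"plane-{i}": "up" for i in range(8)}

def send_transfer(data: bytes, chunk: int = 4):
    """Spray one transfer across all healthy paths as (offset, payload) packets."""
    packets = [(off, data[off:off + chunk]) for off in range(0, len(data), chunk)]
    healthy = [p for p, state in PATHS.items() if state == "up"]
    for pkt in packets:
        yield random.choice(healthy), pkt  # each packet may take a different path

def receive(buffer: bytearray, packets):
    """Each packet carries its destination offset, so arrival order never matters."""
    for _path, (off, payload) in packets:
        buffer[off:off + len(payload)] = payload  # direct placement at final address

data = b"all-reduce gradient shard"
buf = bytearray(len(data))
PATHS["plane-3"] = "down"        # a failed path is simply excluded from spraying
pkts = list(send_transfer(data))
random.shuffle(pkts)             # simulate out-of-order arrival
receive(buf, pkts)
assert bytes(buf) == data
```

Because every packet names its own destination offset, the receiver never has to buffer or reorder anything, which is what lets a real implementation drop a congested path within microseconds without stalling the transfer.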
OpenAI developed MRC with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The company reports using it to train multiple frontier models and published implementation details in a research paper titled "Resilient AI Supercomputer Networking using MRC and SRv6."
Training jobs worth millions crash from single network failures
Large AI training runs involve millions of synchronized data transfers per training step. One delayed transfer can force thousands of GPUs to wait, rippling through the entire job. Traditional networking protocols pin each transfer to a single path, so when multiple flows hash onto the same link they collide and create bottlenecks while other links sit idle.
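The single-path bottleneck can be shown with a toy ECMP-style hash assignment (the flow names and link count here are invented; a deterministic CRC stands in for a real switch's hash function):

```python
import zlib

# Toy ECMP-style assignment: each flow is hashed to exactly one of 8 links,
# so two heavy flows can land on the same link while others sit idle.
links = 8
flows = ["gpu0->gpu512", "gpu1->gpu513", "gpu2->gpu514", "gpu3->gpu515"]

def ecmp_link(flow: str) -> int:
    # The whole flow is pinned to one path for its entire lifetime.
    return zlib.crc32(flow.encode()) % links

load = [0] * links
for f in flows:
    load[ecmp_link(f)] += 1   # hash collisions concentrate load on one link

# A multipath protocol instead spreads every flow's packets over all links,
# keeping per-link load near the average rather than at the worst collision.
```

With per-flow hashing, worst-case load on a link grows with the number of colliding flows; with per-packet spraying it stays close to total traffic divided by the number of links.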
At supercomputer scale, network components fail constantly. OpenAI reports observing multiple link failures per minute during training runs. Previously, a single switch reboot would crash training jobs, forcing expensive restarts from saved checkpoints.
MRC's multipath approach allows training to continue even during major failures. During one frontier model training run, OpenAI rebooted four tier-1 switches without coordinating with training teams. The job experienced temporary slowdowns but recovered without human intervention.
Evaluate multipath networking for next hardware cycle
MRC requires rebuilding network architecture around multipath topologies. Organizations must split high-speed interfaces into multiple lower-speed connections and deploy switches that support the resulting configuration. This affects both hardware procurement and network operations.
The protocol uses IPv6 Segment Routing (SRv6) with static routing tables instead of dynamic protocols like BGP. This simplifies switch software but requires careful path planning during network design.
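What "static routing tables instead of BGP" means in practice can be sketched as a design-time path table. This is a hedged illustration only; the addresses, plane layout, and lookup shape are invented, and real SRv6 segment lists come from the network design process, not application code:

```python
# Static per-plane segment lists computed at network-design time, replacing
# dynamic route discovery (e.g. BGP). Each entry is the ordered list of
# switch segment IDs (SIDs) a packet visits to reach a destination rack.
STATIC_ROUTES = {
    # (plane, destination rack) -> ordered SRv6 segment list
    ("plane-0", "rack-17"): ["fc00:0:1::1", "fc00:0:2::17"],
    ("plane-1", "rack-17"): ["fc00:1:1::1", "fc00:1:2::17"],
}

def segment_list(plane: str, rack: str) -> list[str]:
    # Static lookup: no protocol convergence to wait for, but every path
    # must be planned up front, which is the trade-off noted above.
    return STATIC_ROUTES[(plane, rack)]

assert segment_list("plane-1", "rack-17")[0] == "fc00:1:1::1"
```

The design choice trades runtime flexibility for predictability: switches do no route computation, so there is nothing to converge or misconverge, but every failure-recovery path must exist in the table before it is needed.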
OpenAI's implementation connects over 131,000 GPUs using only two switch tiers instead of the three or four required by conventional 800Gb/s networks (per company specifications). This reduces component count, power consumption, and failure modes, but demands expertise in multipath network design.