What problem does the Multipath Reliable Connection protocol solve?

The protocol addresses data transfer delays and network congestion in large AI training clusters, which previously caused graphics processing units to remain idle and waste computational resources.

How does MRC change traditional network switch configurations?

MRC splits a single eight hundred gigabit per second interface into multiple smaller links, allowing a sixty-four port switch to connect five hundred twelve ports at one hundred gigabits per second and reducing the required switch tiers from three or four to two.

Which organizations collaborated to develop the MRC standard?

OpenAI partnered with AMD, NVIDIA, Intel, Microsoft, and Broadcom to design the protocol, which was subsequently released through the Open Compute Project for industry-wide adoption.

Where has the MRC protocol been deployed?

OpenAI has deployed the standard across supercomputers housing NVIDIA GB200 Blackwell graphics processing units, including facilities at Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft Fairwater supercomputers.

OpenAI and Tech Giants Launch MRC Protocol for AI Training

Christopher Holloway

May 06, 2026 - 18:30

OpenAI has collaborated with AMD, NVIDIA, Intel, Microsoft, and Broadcom to develop Multipath Reliable Connection, an open networking protocol that enhances GPU performance and resilience in large-scale AI training environments. Released through the Open Compute Project, the standard splits high-speed network interfaces into multiple parallel paths to prevent data transfer delays and reduce hardware idle time.

The rapid expansion of artificial intelligence has pushed hardware manufacturers and cloud providers to their architectural limits. As training workloads grow, the infrastructure that moves data between processors has become a primary bottleneck. A recent collaborative effort among major technology firms addresses this constraint by introducing a standardized networking protocol designed specifically for massive computational clusters.

What is the Multipath Reliable Connection protocol?

OpenAI announced the protocol in partnership with AMD, NVIDIA, Intel, Microsoft, and Broadcom. The coalition aims to accelerate large-scale artificial intelligence training through a new networking standard known as Multipath Reliable Connection. The protocol was released through the Open Compute Project to encourage adoption across the artificial intelligence industry. By establishing an open standard, the participating firms intend to remove proprietary barriers that have historically complicated hardware interoperability.

The Multipath Reliable Connection protocol extends the existing Remote Direct Memory Access over Converged Ethernet (RoCE) framework. This extension enables hardware-accelerated remote direct memory access for both graphics processing units and central processing units. The primary objective is to improve network performance and resilience within massive training clusters. When many processors operate simultaneously, the ability to move data efficiently becomes just as critical as raw computational power. The new standard addresses this requirement by changing how network interfaces handle data transmission.

Traditional networking architectures often treat a single high-speed interface as one continuous data channel. The new approach divides that channel into multiple smaller, independent links. For instance, a single network interface can now connect to eight different switches simultaneously. This configuration allows engineers to build eight separate parallel networks, each operating at a lower speed but functioning as a unified system. The architectural shift reduces the risk of a single point of failure disrupting the entire training process.
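The split described above can be pictured with a small model. The following is a minimal sketch, not taken from the MRC specification: the `PlaneLink` class and both helper functions are hypothetical names, and the only figures used are the eight hundred gigabit interface and the eight parallel networks mentioned in the text.

```python
from dataclasses import dataclass


@dataclass
class PlaneLink:
    """One of the smaller links carved out of a single interface (hypothetical model)."""
    plane_id: int        # which parallel network this link attaches to
    speed_gbps: int      # speed of this one link
    healthy: bool = True


def split_interface(total_gbps: int, num_links: int) -> list[PlaneLink]:
    """Split one interface into equal links, one per parallel network."""
    if num_links <= 0 or total_gbps % num_links != 0:
        raise ValueError("total speed must divide evenly across links")
    per_link = total_gbps // num_links
    return [PlaneLink(plane_id=i, speed_gbps=per_link) for i in range(num_links)]


def usable_gbps(links: list[PlaneLink]) -> int:
    """Bandwidth still available when only healthy links are counted."""
    return sum(link.speed_gbps for link in links if link.healthy)


links = split_interface(800, 8)   # eight links of 100 Gb/s each
print(usable_gbps(links))         # 800
links[3].healthy = False          # one parallel network goes down
print(usable_gbps(links))         # 700
```

In this model the loss of one parallel network costs one eighth of the interface's bandwidth instead of the whole connection, which is the sense in which the design avoids a single point of failure.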

The collaborative development process spanned approximately two years, during which engineers from each partner company contributed to the protocol design. OpenAI has already deployed this networking standard across its own supercomputers, which house NVIDIA GB200 Blackwell graphics processing units. These systems include infrastructure located at Oracle Cloud Infrastructure facilities in Abilene, Texas, as well as Microsoft Fairwater supercomputers. The deployment also fits within the next phase of the partnership between Microsoft and OpenAI, which continues to shape cloud infrastructure strategies.

The release of this standard through the Open Compute Project ensures that the technology remains accessible to independent researchers and enterprise developers. Open specifications allow third-party hardware manufacturers to design compatible network switches and interface cards without licensing restrictions. This accessibility accelerates the transition from experimental prototypes to production-ready infrastructure. The industry benefits from standardized testing methodologies and shared documentation that simplify integration efforts.

Why does network congestion matter for large AI clusters?

The development of this networking standard was driven by a specific operational challenge that has plagued the artificial intelligence industry. Training large language models requires continuous data transfer between thousands of processors. Even a single delayed data packet can disrupt the entire computational process. When data arrives late, the affected graphics processing units must remain idle while waiting for the missing information. This idle time represents a significant waste of computational resources and delays model completion.
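Why one late packet matters can be shown with simple arithmetic. The sketch below assumes a fully synchronous training step in which no processor can proceed until the slowest transfer has finished. The function name and all timings (one thousand workers, one hundred milliseconds of compute, ten millisecond transfers) are illustrative assumptions, not measurements from any real cluster.

```python
def step_time_ms(compute_ms: float, transfer_ms: list[float]) -> float:
    """Duration of one synchronous step: compute plus the slowest transfer."""
    return compute_ms + max(transfer_ms)


# Illustrative numbers only: 1,000 workers, each transfer normally takes 10 ms.
on_time = [10.0] * 1000
one_late = on_time[:-1] + [50.0]      # a single transfer is delayed to 50 ms

baseline = step_time_ms(100.0, on_time)    # 110.0
delayed = step_time_ms(100.0, one_late)    # 150.0

# Every worker waits out the extra delay, so the waste scales with cluster size.
wasted_worker_ms = (delayed - baseline) * len(one_late)
print(baseline, delayed, wasted_worker_ms)  # 110.0 150.0 40000.0
```

Under these assumptions a forty millisecond delay on one link costs forty seconds of combined processor time in a single step, because every worker sits idle for the same forty milliseconds.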

Network congestion is a primary source of these delays. As clusters grow in size, the volume of data moving between switches and processors rises sharply, and congestion occurs more frequently. Link failures and device malfunctions compound the problem, creating bottlenecks that slow down the entire training pipeline. Engineers have traditionally struggled to mitigate these issues because conventional routing methods cannot react quickly enough to sudden traffic spikes.

The impact of delayed data extends beyond simple performance degradation. When training workloads stall, the financial costs associated with cloud computing and hardware utilization rise sharply. Organizations investing billions in supercomputing infrastructure expect maximum throughput to justify their expenditures. Idle processors generate no value, yet they continue to consume power and require cooling. Solving the congestion problem directly translates to improved operational efficiency and reduced computational expenses for artificial intelligence developers.

Addressing these challenges requires a fundamental rethinking of network control planes. Older systems rely on complex routing tables and centralized management structures that struggle to scale. The new protocol simplifies network control by distributing traffic management across multiple paths. This distribution allows the system to route around failures in microseconds rather than seconds. The speed of failure recovery ensures that training workloads maintain continuous momentum without manual intervention.
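Routing around a failure at the sender, rather than waiting for routing tables to update, can be sketched as follows. This is a simplified illustration under stated assumptions, not the mechanism defined by the MRC specification: the `MultipathSender` class and its methods are hypothetical, and plain round-robin stands in for whatever path selection the real protocol uses.

```python
class MultipathSender:
    """Hypothetical sender that rotates packets across healthy paths only."""

    def __init__(self, num_paths: int) -> None:
        if num_paths <= 0:
            raise ValueError("need at least one path")
        self.num_paths = num_paths
        self.healthy = set(range(num_paths))
        self._cursor = 0  # next path to try

    def mark_failed(self, path: int) -> None:
        """Stop using a path as soon as a failure is detected."""
        self.healthy.discard(path)

    def mark_recovered(self, path: int) -> None:
        """Return a repaired path to the rotation."""
        if 0 <= path < self.num_paths:
            self.healthy.add(path)

    def pick_path(self) -> int:
        """Return the next healthy path in round-robin order."""
        if not self.healthy:
            raise RuntimeError("no healthy paths left")
        while True:
            candidate = self._cursor
            self._cursor = (self._cursor + 1) % self.num_paths
            if candidate in self.healthy:
                return candidate


sender = MultipathSender(8)
before = [sender.pick_path() for _ in range(8)]
sender.mark_failed(2)                      # path 2 goes down
after = [sender.pick_path() for _ in range(8)]
print(before)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(after)   # [0, 1, 3, 4, 5, 6, 7, 0]
```

The decision to skip the failed path is local to the sender and takes effect on the very next packet, which is why this style of recovery needs no manual intervention.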

Historical attempts to solve congestion have relied on increasing individual link speeds. While faster cables and switches help, they do not eliminate the fundamental problem of centralized routing bottlenecks. The multipath approach acknowledges that physical limitations will always exist in massive data centers. By embracing redundancy and parallelism, engineers can build systems that degrade gracefully rather than collapse under pressure. This philosophy aligns with modern distributed computing principles.

How does MRC restructure traditional networking architecture?

The architectural transformation enabled by the new standard involves splitting high-speed network interfaces into manageable segments. Instead of relying on a single eight hundred gigabit per second link, the protocol distributes traffic across numerous smaller connections. Each segment operates independently while contributing to the overall data throughput. This segmentation allows the network to maintain high speeds even when individual paths experience temporary degradation or congestion. The system dynamically balances the load across all available routes.
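One way to picture load balancing across segments of unequal health is to share traffic in proportion to each path's usable capacity. The sketch below is an assumption made for illustration. The article does not describe the balancing algorithm, and `split_load` is a hypothetical helper.

```python
def split_load(offered_gbps: float, capacity_gbps: dict[int, float]) -> dict[int, float]:
    """Share offered load across paths in proportion to usable capacity."""
    usable = sum(capacity_gbps.values())
    if usable <= 0:
        raise ValueError("no usable capacity")
    # Never place more traffic on the paths than they can carry in total.
    load = min(offered_gbps, usable)
    return {path: load * cap / usable for path, cap in capacity_gbps.items()}


# Eight 100 Gb/s segments, with segment 5 temporarily degraded to 50 Gb/s.
capacity = {path: 100 for path in range(8)}
capacity[5] = 50

shares = split_load(600, capacity)
print(shares[0], shares[5])  # 80.0 40.0
```

In this example the degraded segment carries half the share of a healthy one, so no single path is pushed past its limit while the interface as a whole still moves the full six hundred gigabits per second.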

This restructuring changes the physical layout of supercomputing facilities. A conventional switch capable of connecting sixty-four ports at eight hundred gigabits per second can instead connect five hundred twelve ports at one hundred gigabits per second, because each fast port divides into eight slower ones. The switch's total bandwidth stays the same, but it now reaches eight times as many devices. Engineers can build networks that fully connect approximately one hundred thirty-one thousand graphics processing units using only two tiers of switches.
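The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes a standard non-blocking two-tier layout in which each first-tier switch uses half its ports for devices and half for uplinks. The article does not state the exact topology, so that layout and the function names are assumptions.

```python
def split_ports(ports: int, port_gbps: int, link_gbps: int) -> int:
    """Number of ports a switch offers after each port is split into slower links."""
    if port_gbps % link_gbps != 0:
        raise ValueError("port speed must be a multiple of link speed")
    return ports * (port_gbps // link_gbps)


def two_tier_endpoints(radix: int) -> int:
    """Devices reachable in a non-blocking two-tier network of radix-port switches.

    Each first-tier switch gives radix/2 ports to devices and radix/2 to uplinks,
    and each second-tier switch can reach `radix` first-tier switches.
    """
    return radix * (radix // 2)


radix = split_ports(64, 800, 100)
print(radix)                      # 512
print(64 * 800 == radix * 100)    # True: total switch bandwidth is unchanged
print(two_tier_endpoints(radix))  # 131072
print(two_tier_endpoints(64))     # 2048
```

Under that assumption the result of 131,072 matches the article's figure of roughly one hundred thirty-one thousand, while the same two tiers built from unsplit sixty-four port switches would reach only 2,048 devices.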

Traditional networking configurations would require three or four switch tiers to achieve similar connectivity levels. Each additional tier introduces latency, increases power consumption, and complicates maintenance. The two-tier architecture simplifies the physical infrastructure while improving data flow efficiency. The streamlined design reduces the number of hops data must travel between processors. Fewer hops mean faster communication and reduced energy expenditure across the entire cluster.
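The tier counts can be estimated with the usual capacity formula for a non-blocking folded-Clos network, in which a network of t tiers built from k-port switches reaches 2 × (k/2)^t devices and a worst-case path crosses 2t − 1 switches. The sketch assumes that topology, which the article does not confirm, and the function names are hypothetical.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Devices reachable in a non-blocking folded-Clos network: 2 * (radix/2)**tiers."""
    return 2 * (radix // 2) ** tiers


def tiers_needed(radix: int, endpoints: int) -> int:
    """Smallest number of switch tiers that can connect the given device count."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


def worst_case_switch_hops(tiers: int) -> int:
    """Switches crossed on the longest path: up to the top tier and back down."""
    return 2 * tiers - 1


gpus = 131_072
for radix in (64, 512):
    t = tiers_needed(radix, gpus)
    print(radix, t, worst_case_switch_hops(t))
# 64 4 7
# 512 2 3
```

With unsplit sixty-four port switches, three tiers top out at 65,536 devices, so a cluster of this size needs four tiers and up to seven switch hops. With five hundred twelve ports per switch, two tiers and three hops suffice.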

The protocol also enables hardware-accelerated remote direct memory access, which bypasses the central processing unit during data transfers. This feature allows graphics processing units to communicate directly with one another, significantly reducing latency. The direct memory access capability ensures that computational workloads receive data without waiting for system-level processing. The combination of direct memory access and multipath routing creates a highly resilient network environment optimized for continuous artificial intelligence training.

The shift toward parallel networking planes also simplifies troubleshooting and maintenance procedures. Network administrators can isolate problematic segments without shutting down entire training workloads. This modularity reduces downtime during hardware upgrades or firmware updates. The industry has long sought a scalable solution that balances performance with operational simplicity. The multipath architecture delivers both by distributing complexity across the network fabric rather than concentrating it in central routing devices.

What are the deployment implications for future supercomputers?

The introduction of this open networking standard carries significant implications for the future of artificial intelligence infrastructure. OpenAI has already integrated the protocol into multiple training workloads across both NVIDIA and Broadcom hardware. The successful deployment validates the technical approach and provides a blueprint for other organizations seeking to scale their computational capabilities. The open nature of the standard encourages cross-industry collaboration and accelerates technological progress.

The protocol will serve as a foundational element for OpenAI's upcoming Stargate supercomputer project. This facility, constructed by Oracle Cloud Infrastructure in Abilene, Texas, aims to deploy ten gigawatts of artificial intelligence compute capacity by twenty twenty-nine. The project has already deployed over three gigawatts in the past three months, demonstrating rapid scaling capabilities. The availability of the open networking standard accelerates the deployment timeline and reduces technical uncertainty for large-scale construction projects.

Industry stakeholders recognize that solving the hardest problems within artificial intelligence requires shared infrastructure standards. By releasing the protocol through the Open Compute Project, the participating firms have removed barriers to adoption. Other technology companies can now implement the standard without negotiating complex licensing agreements. This approach fosters a more competitive and innovative market environment where hardware manufacturers focus on performance improvements rather than proprietary compatibility. Removing technical friction from large-scale deployment also supports businesses that want to grow with artificial intelligence.

The long-term impact extends beyond immediate computational gains. As artificial intelligence models continue to grow in complexity, the demand for efficient data movement will only increase. Organizations that adopt standardized networking protocols will gain a structural advantage in training speed and reliability. The collaboration between OpenAI, AMD, NVIDIA, Intel, Microsoft, and Broadcom establishes a precedent for future infrastructure development. The industry now has a proven framework for scaling computational networks to meet growing demands.

The integration of this standard into existing cloud environments requires careful planning and phased migration strategies. Data center operators must upgrade switch firmware and replace legacy network interface cards to fully realize the performance benefits. However, the long-term return on investment justifies the initial capital expenditure. Companies that modernize their networking infrastructure now will be better positioned to handle the next generation of machine learning workloads.

Conclusion

The artificial intelligence sector continues to evolve at a pace that outstrips traditional hardware development cycles. Infrastructure constraints have historically limited the speed at which models could be trained and deployed. The introduction of a standardized, open networking protocol addresses these limitations by optimizing data flow across massive computational clusters. The collaborative effort among major technology firms demonstrates a clear commitment to overcoming architectural bottlenecks. As supercomputing facilities continue to expand, the adoption of efficient networking standards will determine which organizations can sustain long-term innovation. The focus now shifts from raw computational power to intelligent data management, marking a new phase in artificial intelligence infrastructure development.

Christopher Holloway

Christopher Holloway is the founder and director of Progressive Robot, a UK-based technology company. A full-stack engineer with more than two decades of experience, he works across PHP development, ecommerce, Linux infrastructure, technical SEO and AI automation, and writes here on technology, AI, hardware and software.