AMD, Meta & NVIDIA Launch OCI MSA for Open AI Infrastructure Optical Scale-Up

by priyanka.patel tech editor

The relentless demand for processing power in artificial intelligence is driving a fundamental shift in how data centers are built. A new consortium, the Optical Compute Interconnect (OCI) Multi-Source Agreement (MSA), has formed with founding members AMD, Broadcom, Meta, Microsoft, NVIDIA, and OpenAI, aiming to accelerate the adoption of optical interconnects – a technology seen as crucial for scaling AI infrastructure beyond the limitations of traditional copper-based systems. This collaboration signals a move towards a more open and multi-vendor ecosystem, a departure from the historically siloed approach to high-performance computing.

As large language models (LLMs) continue to grow in complexity, the bandwidth requirements for communication between processors are skyrocketing. Traditional copper interconnects, while effective for many applications, are hitting physical limits in terms of distance and data transfer rates. This bottleneck constrains the architecture of “scale-up” domains within AI clusters – essentially, the groups of tightly coupled processors that are combined to tackle increasingly complex tasks. The OCI MSA intends to address this challenge by establishing standardized specifications for optical interconnects, paving the way for faster, more efficient, and scalable AI systems.

The core of the initiative lies in defining open specifications for optical interfaces. Currently available at www.oci-msa.org, these specifications focus on optimizing power consumption, latency, and cost. They leverage technologies like non-return-to-zero (NRZ) modulation and wavelength division multiplexing (WDM) to maximize data throughput. A key aspect of this approach is a shift from a module-centric to a silicon-centric model, integrating optics more closely with the computing and networking silicon itself.

Breaking Down the Barriers to Optical Interconnects

For years, optical interconnects have been seen as a promising solution for high-bandwidth, low-latency communication. However, a lack of standardization has hindered widespread adoption. Different vendors have developed proprietary solutions, creating compatibility issues and increasing costs. The OCI MSA aims to solve this problem by creating a “plug-and-play” ecosystem, allowing hyperscalers to seamlessly integrate processors (XPUs) and switches from various manufacturers using a common optical physical layer (PHY). This interoperability is expected to significantly reduce integration risks and accelerate development cycles.

“AMD is a founding member and strong supporter of the MSA, as this initiative establishes open specifications for the industry to build a robust and multi-vendor optical interconnect ecosystem,” said Brian Amick, Senior Vice President of Technology & Engineering at AMD, in a statement. This sentiment is echoed by other founding members, who see the OCI MSA as a critical step towards unlocking the full potential of AI.

Near Margalit, Vice President & General Manager, Optical Systems Division at Broadcom, highlighted the importance of leveraging existing technologies. “Broadcom is proud to leverage our multi-generational CPO platforms and industry partnerships to drive the OCI specifications forward,” Margalit stated. “OCI-MSA enables seamless integration with existing electrical SerDes-based ASICs, while also providing a clear path towards direct integration with ASICs, ensuring the ecosystem remains flexible and high-performing.”

A Roadmap for Scalability and Efficiency

The OCI MSA isn’t just about defining a standard; it’s about creating a roadmap for future innovation. The specifications outline a path towards increasing interface density, scalability, and interoperability. Key features include:

  • High-Density Interfaces: Promoting OCI GEN1 (4λ × 50Gbps NRZ, for 200Gbps per direction) and OCI GEN2 (400Gbps per direction, bidirectional (BiDi), reaching up to 800Gbps per fiber).
  • Massive Scalability: A roadmap to increase the number of wavelengths and data rates, potentially reaching 3.2Tbps per fiber and beyond, enabling larger scale-up domains with more GPUs and higher bandwidth per GPU.
  • Interoperable Form Factors: Support for pluggable optics, on-board optics, and co-packaged optics (CPO).
  • Large-Scale Efficiency: Enabling optical solutions to meet aggressive performance, power, and cost targets previously associated with copper connectivity, while offering significantly greater reach.
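
The bandwidth figures in the list above follow from simple arithmetic, which can be sketched in a few lines of Python. This is only an illustration of the numbers quoted in this article, not part of the OCI specifications: the function names are hypothetical, and the assumption that a BiDi fiber carries the full per-direction rate both ways is inferred from the "400Gbps per direction, up to 800Gbps per fiber" figure.

```python
def per_direction_gbps(wavelengths: int, gbps_per_wavelength: int) -> int:
    """Aggregate WDM rate in one direction: wavelengths x rate per wavelength."""
    return wavelengths * gbps_per_wavelength


def per_fiber_gbps(direction_gbps: int, bidi: bool) -> int:
    """Total rate on one fiber.

    Assumption (not from the spec): a bidirectional (BiDi) fiber carries
    the per-direction rate in both directions at once.
    """
    return direction_gbps * 2 if bidi else direction_gbps


# OCI GEN1 as quoted above: 4 wavelengths at 50 Gbps NRZ each.
gen1_direction = per_direction_gbps(wavelengths=4, gbps_per_wavelength=50)

# OCI GEN2 as quoted above: 400 Gbps per direction, BiDi on a single fiber.
gen2_fiber = per_fiber_gbps(direction_gbps=400, bidi=True)

print(f"GEN1: {gen1_direction} Gbps per direction")
print(f"GEN2: {gen2_fiber} Gbps per fiber")
```

Under these assumptions the sketch reproduces the 200Gbps and 800Gbps figures above. The roadmap target of 3.2Tbps per fiber would need some combination of more wavelengths and higher per-wavelength rates, a mix the article does not specify.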

Meta, a key player in the development of AI models, sees this as a crucial step in overcoming the limitations of current infrastructure. “The need for technology to address the power and cost constraints impacting AI cluster design is real and urgent,” said Dan Rabinovitsj, Vice President of Hardware Systems at Meta. “We are driving the adoption of these OCI protocols to decouple the requirements of larger scale-up domains from the limitations of electrical backplanes in high-performance AI clusters.”

The Future of AI Infrastructure

The implications of the OCI MSA extend beyond simply faster data transfer rates. By fostering a more open and competitive ecosystem, the consortium aims to drive down costs and accelerate innovation in AI hardware. This could lead to more powerful and accessible AI technologies, benefiting a wider range of industries and applications. Saurabh Dighe, Corporate Vice President, Azure Systems and Architecture at Microsoft, emphasized that optical technologies, protocols, and switch architectures are foundational for building scalable, high-performance AI compute domains. “OCI MSA advances this vision with innovative physical layer specifications, paving the way for open standards, differentiated implementations, and system architecture innovation,” Dighe stated.

NVIDIA and OpenAI also voiced strong support for the initiative. Gilad Shainer, Senior Vice President of Networking at NVIDIA, stated that the company joined the OCI MSA to help build a common optical standard across the global AI infrastructure. Richard Ho, Head of Hardware at OpenAI, added that continued advancements in AI depend on scaling AI supercomputers with more petaflops, larger memory bandwidth, and higher network bandwidth, all of which require the capabilities that OCI MSA aims to deliver.

The OCI MSA represents a significant step towards a more sustainable and scalable future for AI. The consortium is expected to continue refining its specifications and working with industry partners to accelerate the adoption of optical interconnects. The next key milestone will be the release of updated specifications and the demonstration of interoperability between different vendors’ solutions, expected in the latter half of 2024. This collaborative effort promises to reshape the landscape of AI infrastructure, enabling the next generation of intelligent systems.

