NVIDIA Details CPO as Optical Chips Shine at Hot Chips
Yesterday, we covered the major presentations from the first day of Hot Chips 2025; for details, see the article "A Comprehensive Look at Popular Chips". Today, established giants and emerging players alike, including NVIDIA, Ayar Labs, Lightmatter, and Celestial, gave detailed presentations on optical chips. We have compiled the highlights below.
NVIDIA's CPO Optical Devices
This was an exciting part of the Hot Chips 2025 conference: Gilad's talk on NVIDIA's co-packaged silicon photonics switches.
During the talk, NVIDIA first made the case for co-packaged photonics and how it can significantly scale up AI factories. The company noted that, compared with traditional cloud data centers, AI factories spend about 17 times more power on optics. This is mainly because growing GPU clusters require dozens of optical transceivers to let each GPU communicate with the others. As a result, network optics alone account for roughly 10% of an AI factory's total power budget that could otherwise go to compute. NVIDIA plans to cut this substantial cost with its Spectrum-X Ethernet photonics technology.
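To see how optics can reach the roughly 10% power share quoted above, here is a back-of-envelope sketch. The per-part wattages and transceiver counts below are hypothetical assumptions chosen only to match the article's ballpark; the article gives no concrete component figures.

```python
# Illustrative power budget showing how pluggable transceivers can
# approach ~10% of an AI factory's per-GPU power. All figures are
# assumptions for illustration, not measured or NVIDIA-stated values.

GPU_POWER_W = 1000          # assumed power per GPU (compute + memory)
TRANSCEIVERS_PER_GPU = 6    # assumed pluggable optics per GPU
TRANSCEIVER_POWER_W = 18    # assumed power per pluggable module

optics_w = TRANSCEIVERS_PER_GPU * TRANSCEIVER_POWER_W   # 108 W per GPU
share = optics_w / (GPU_POWER_W + optics_w)
print(f"{share:.0%}")       # ~10% of the per-GPU power budget
```

With these assumed numbers, optics consume about a tenth of the power, which is why removing the pluggable module's electronics is such an attractive target.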
NVIDIA views data centers as computers rather than individual GPUs.
The BlueField-3 DPU is designed as a NIC to access the network.
Artificial intelligence requires zero-jitter communication, because AI workloads are large-scale, complex, and span long distances.
There are many types of Ethernet architectures. Although they are all Ethernet, their requirements and goals vary.
NVIDIA Spectrum-X Ethernet is designed to allow large GPU clusters to use Ethernet.
It is reported that Spectrum-X Ethernet photonics is a unique implementation, said to be the first to adopt 200 G/channel SerDes, the cutting edge of electrical signaling. Compared with pluggable transceivers, Spectrum-X photonics delivers better signal integrity and needs less DSP, because the photonic integrated circuit (PIC) sits directly adjacent to the switch ASIC. That eliminates long PCB traces and sharply reduces the number of lasers: a 1.6 Tb/s link drops from 8 lasers to 2, which means lower power consumption and higher transmission reliability.
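The lane and laser counts above can be sanity-checked with simple arithmetic. The lane math follows directly from the figures in the article; the exact wavelength arrangement that lets two lasers drive a whole link is an assumption for illustration, not NVIDIA's documented design.

```python
# Back-of-envelope lane/laser math for a 1.6 Tb/s link, using the
# figures quoted in the article.

LINK_RATE_GBPS = 1600      # 1.6 Tb/s link
LANE_RATE_GBPS = 200       # 200 G/channel SerDes

lanes = LINK_RATE_GBPS // LANE_RATE_GBPS
print(lanes)               # 8 electrical lanes per link

# Pluggable design (per the article): roughly one laser per lane.
lasers_pluggable = 8
# CPO design: micro-ring modulators let multiple wavelengths share a
# laser source, so the article's figure drops to 2 lasers per link.
lasers_cpo = 2

print(lasers_pluggable / lasers_cpo)   # 4x fewer lasers
```

The 4x reduction falls out of this arithmetic, which matches the "4 times fewer lasers" bullet NVIDIA shows for Spectrum-6 later in the piece.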
Spectrum-X aims to provide low-jitter communication for AI workloads. Jitter in an AI network can leave large numbers of GPUs idle, which is not only inefficient but costly, since idle GPUs still burn money. NVIDIA is designing the stack end to end, so that functionality is no longer concentrated in the switch alone.
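The cost of jitter-induced idling can be made concrete with a tiny model. All figures here (cluster size, GPU hourly cost, utilization loss) are hypothetical assumptions for illustration; the article gives no concrete numbers.

```python
# Rough cost model for jitter-induced GPU idle time. The inputs are
# hypothetical assumptions, not figures from NVIDIA's talk.

def idle_cost_per_hour(num_gpus: int, gpu_hourly_cost: float,
                       idle_fraction: float) -> float:
    """Dollars burned per hour by GPUs stalled waiting on the network."""
    return num_gpus * gpu_hourly_cost * idle_fraction

# Example: 10,000 GPUs at an assumed $3/GPU-hour, with an assumed 5%
# of wall-clock time lost to network jitter and tail latency.
cost = idle_cost_per_hour(10_000, 3.0, 0.05)
print(f"${cost:,.0f} per hour")   # $1,500 per hour
```

Even a few percent of stall time compounds quickly at cluster scale, which is why NVIDIA frames jitter as a cost problem rather than just a performance one.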
Spectrum-X provides higher NCCL performance. NVIDIA wants to ensure that when multiple jobs run simultaneously on a large shared infrastructure, they do not interfere with one another: if several jobs share a switch, one job should not degrade the performance of the others.
This is a new result this year, showing that Spectrum-X schedules mixture-of-experts traffic better than standard Ethernet.
The following is the impact of Spectrum-X on multi-tenant data centers.
NVIDIA's silicon photonics solution uses a CPO chip with a transmission rate of up to 1.6T. It integrates micro-ring modulators (MRMs), which provide higher bandwidth while reducing power consumption and footprint. More importantly, NVIDIA Photonics is the first to use 3D stacking between the photonic and electronic layers, reducing wiring complexity and increasing bandwidth density. Team Green is partnering with TSMC on silicon photonics, as the Taiwanese foundry giant is its preferred choice for meeting the demands of photonics manufacturing.
It is reported that NVIDIA's data-center photonics delivers 3.5 times higher energy efficiency, 10 times higher resiliency, and 1.3 times more uptime than standard pluggable optics. This suggests that once photonics becomes the mainstream interconnect, AI computing will see significant gains. The company also demonstrated its first full-size switch with integrated photonics, the Spectrum-6 102T, which will be Team Green's flagship product. Its main features:
- 2 times throughput
- 63 times better signal integrity
- 4 times fewer lasers
- 1.6 times bandwidth density
- 13 times higher laser reliability
- Replaces 64 independent transceivers
The following is a summary slide of the differences between Spectrum-X Ethernet and off-the-shelf (Broadcom) Ethernet.
Since optical network components consume a large amount of power, scaling up becomes a challenge.
This is the next-generation Spectrum-X Ethernet photonics technology. Because it eliminates pluggable optical engines and the power they draw, it saves a large amount of power.
NVIDIA Photonics is a 1.6T CPO chip equipped with a new type of micro-ring modulator. NVIDIA is also focusing on detachable fiber connectors. As the picture shows, the CPO connection methods of Spectrum-X and Quantum-X differ; this reflects the evolution of the solution.
Achieving this requires many components working together. Note that this design uses a pluggable laser.
NVIDIA demonstrated its functions in the field.
NVIDIA's 102T switch is the Spectrum-6 102T with integrated silicon photonics.
In this way, reliability increases while power consumption decreases.
NVIDIA has Quantum-X and Spectrum-X switches and is about to launch CPO versions. We will try to explore these switches in more detail in the future.
Scale up, scale out, and now scale across. To scale beyond a single data center, you need not only a high-quality network but also extremely high speed.
Spectrum-XGS is the company's approach to extending the scale-out network into scale-across territory, which requires not just hardware but also distance-aware algorithms.
NVIDIA says this technology improves scale-across performance by 1.9 times, with room for further improvement.
This is a large-scale training operation.