
AI space race? NVIDIA's H100 has just gone into space, and Google's Project Suncatcher is also planning to send TPUs into space.

机器之心 (Machine Intelligence) · 2025-11-05 10:17
The sun emits more than 100 trillion times as much energy as humanity's total electricity production.

On November 2, NVIDIA sent its H100 GPUs into space for the first time (see the earlier report "NVIDIA Launches First Space AI Server, H100 in Orbit"). Just days later, Google announced that it, too, will send its TPUs into space.

The project is named Project Suncatcher and is described as a "design for a space-based scalable AI infrastructure system". Google CEO Sundar Pichai said the project aims to better harness the sun's energy to power AI, since the sun emits more than 100 trillion times as much energy as total global electricity production.

He said, "Like any moonshot, it will require us to solve many complex engineering challenges. Early research shows promise: our Trillium-generation TPUs (our tensor processing units, custom-built for AI) survived particle-accelerator tests that simulated the radiation levels of low Earth orbit. However, significant challenges remain, such as thermal management and on-orbit system reliability."

He also announced the timing of the first launch: early 2027, when Google will launch two prototype satellites in partnership with Planet.

Naturally, the move attracted wide attention and discussion online. Some users even had Google's Veo video model generate exaggerated visualizations of the idea.

Project Suncatcher

Design for a Space-Based Scalable AI Infrastructure System

Project Suncatcher is an ambitious exploration that aims to equip a constellation of solar-powered satellites (a group of spacecraft operating together as one system) with TPUs and free-space optical communication links, with the goal of eventually scaling up machine learning computation in space.

Google said this may further "unleash the full potential" of machine learning.

After all, the sun is the ultimate energy source in the solar system, and the energy it radiates is more than 100 trillion times that of the total global electricity generation. In a suitable orbit, the efficiency of solar panels can be eight times higher than on Earth, and they can generate electricity almost continuously, thus reducing the need for batteries. Therefore, space may be the best place to expand AI computing in the future.

Based on this concept, Google launched Project Suncatcher. They envisioned a compact constellation of solar satellites equipped with Google TPUs and connected by free-space optical communication links.

Google said, "This approach not only has great potential for scaling but also minimizes the impact on Earth's resources."

Google also published a preprint paper, "Towards a future space-based, highly scalable AI infrastructure system design", sharing early research results. The paper describes foundational progress toward this goal, covering high-bandwidth inter-satellite communication, orbital dynamics, and the effects of radiation on computation.

Paper title: Towards a future space-based, highly scalable AI infrastructure system design

Paper link: https://goo.gle/project-suncatcher-paper

Paper abstract: If AI is regarded as a fundamental general-purpose technology, we should anticipate a continuous increase in the demand for AI computing power and energy. The sun is by far the largest energy source in the solar system, so it is worth exploring how future AI infrastructure can most effectively utilize this energy. This paper explores a scalable space-based machine learning computing system that utilizes a constellation of satellites equipped with solar arrays, inter-satellite links based on free-space optical communication, and Google's tensor processing unit (TPU) accelerator chips. To achieve high-bandwidth, low-latency inter-satellite communication, these satellites will fly in close formation. We present a basic formation flight scheme for an 81-satellite cluster with a radius of 1 kilometer and describe a method for controlling large-scale constellations using high-precision machine learning models. The Trillium TPU has been radiation-tested and can withstand the total ionizing dose equivalent to a five-year mission cycle without permanent damage, and its bit-flip errors have been characterized. Launch cost is a key component of the overall system cost; learning curve analysis shows that by the mid-2030s, the cost of launching satellites to low Earth orbit (LEO) may drop to about $200 per kilogram or less.

It states, "By focusing on a modular design consisting of smaller, interconnected satellites, we are laying the foundation for a highly scalable space-based AI infrastructure in the future."

Google also said, "Project Suncatcher continues Google's tradition of taking moonshots at hard scientific and engineering problems. Like all moonshots, there will inevitably be unknowns. But it was in this spirit that we began building large-scale quantum computers a decade ago, before that was considered a realistic engineering goal, and conceived of self-driving cars 15 years ago, which ultimately led to Waymo, now providing millions of rides around the world."

System Design and Key Challenges

The system consists of a satellite network constellation, most likely operating in a "dawn–dusk sun-synchronous low Earth orbit" where they can receive sunlight almost continuously. This orbit choice maximizes solar energy collection efficiency and reduces the need for bulky on-board batteries. To make the system feasible, several technical obstacles must be overcome:

1. Achieving Data Center-Scale Inter-Satellite Links

Large-scale ML workloads require tasks to be distributed across numerous accelerators through high-bandwidth, low-latency connections. To provide performance comparable to that of terrestrial data centers, the links between satellites need to support rates of tens of terabits per second.

Google's analysis shows that this should be achievable using multi-channel dense wavelength division multiplexing (DWDM) transceivers and spatial multiplexing technology.
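As a back-of-the-envelope illustration of how DWDM and spatial multiplexing multiply up to tens of terabits per second, consider the sketch below. The channel counts and per-channel rates are hypothetical, chosen only to show the arithmetic; they are not Google's design parameters.

```python
# Illustrative only (not Google's actual link design): aggregate
# inter-satellite bandwidth from DWDM wavelength counts and parallel
# spatial lanes.
def aggregate_bandwidth_tbps(wavelengths, gbps_per_wavelength, spatial_lanes):
    """Total one-way link rate in Tbps."""
    return wavelengths * gbps_per_wavelength * spatial_lanes / 1000

# e.g. 40 DWDM wavelengths at 100 Gbps each, over 5 parallel spatial lanes
print(aggregate_bandwidth_tbps(40, 100, 5))  # -> 20.0 (Tbps)
```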

However, the received power level required to achieve this bandwidth is thousands of times higher than that of traditional long-range deployments. Since the received power is inversely proportional to the square of the distance, this challenge can be overcome by having the satellites fly in very close formation (kilometers or less), thus "closing" the link budget (the accounting of end-to-end signal power loss in a communication system).
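The inverse-square relationship above can be sketched numerically. This is an illustrative toy, not Google's link-budget model: for a fixed beam divergence, the beam footprint area grows with the square of the distance, so received power falls as 1/d².

```python
# Free-space optical link: received power scales as 1/d^2, so shrinking
# the link distance buys enormous power margin. Illustrative distances.
def received_power_ratio(d_near_km, d_far_km):
    """How much more power a receiver collects at d_near than at d_far."""
    return (d_far_km / d_near_km) ** 2

# Moving from a 1000 km conventional crosslink to a 1 km close formation
# boosts received power by a factor of a million.
print(received_power_ratio(1, 1000))  # -> 1000000.0
```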

The Google team has begun validating this approach with a bench-scale demonstrator, which achieved 800 Gbps in each direction (1.6 Tbps total) using a single pair of transceivers.

2. Controlling a Large, Tightly Clustered Satellite Formation

High-bandwidth inter-satellite links require satellites to fly in a much more compact formation than any existing system.

Google has developed numerical and analytical physical models to analyze the orbital dynamics of such a constellation. They used an approximation starting from the Hill-Clohessy-Wiltshire equations (which describe the orbital motion of a satellite relative to a circular reference orbit in the Keplerian approximation) and a differentiable model based on JAX for numerical refinement to account for further perturbations.
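As a hedged sketch of the Hill-Clohessy-Wiltshire dynamics mentioned above, the toy integrator below propagates the planar HCW equations with classical RK4. The Earth constants and the 500 m amplitude are illustrative assumptions; this is not Google's differentiable JAX model.

```python
import math

# Minimal sketch (not Google's model): planar Hill-Clohessy-Wiltshire
# equations for motion relative to a circular reference orbit,
#   x'' = 3 n^2 x + 2 n y',   y'' = -2 n x',
# where n is the orbital mean motion, integrated with classical RK4.

MU = 3.986004418e14              # Earth's gravitational parameter, m^3/s^2
ALT = 650e3                      # cluster altitude from the article, m
R_E = 6371e3                     # assumed mean Earth radius, m
N = math.sqrt(MU / (R_E + ALT) ** 3)  # mean motion, rad/s

def hcw_rhs(state):
    x, y, vx, vy = state
    return (vx, vy, 3 * N * N * x + 2 * N * vy, -2 * N * vx)

def rk4_step(state, dt):
    def add(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = hcw_rhs(state)
    k2 = hcw_rhs(add(state, k1, dt / 2))
    k3 = hcw_rhs(add(state, k2, dt / 2))
    k4 = hcw_rhs(add(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# A bounded "2x1 ellipse" relative orbit: choosing vy0 = -2*N*x0 cancels
# the secular along-track drift, so the satellite loops around the
# reference point instead of drifting away -- the kind of naturally
# stable geometry that keeps station-keeping needs modest.
A = 500.0                        # 500 m radial amplitude (illustrative)
state = (A, 0.0, 0.0, -2 * N * A)
period = 2 * math.pi / N
steps = 10000
for _ in range(steps):
    state = rk4_step(state, period / steps)

# After one full orbit the satellite returns (nearly) to its start.
print(abs(state[0] - A) < 1.0, abs(state[1]) < 1.0)  # -> True True
```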

At the planned constellation altitude, the non-spherical characteristics of the Earth's gravitational field and potential atmospheric drag are the main non-Keplerian effects affecting the orbital dynamics of the satellites.

The following figure shows an example of the trajectory of an 81-satellite constellation configuration (over a complete orbital period) within the orbital plane at an average cluster altitude of 650 kilometers. The cluster radius R = 1 kilometer, and under the influence of the Earth's gravity, the distance between second-nearest neighbor satellites oscillates between approximately 100–200 meters.

The evolution of a "free-falling" (thrustless) constellation under the influence of the Earth's gravity, modeled in sufficient detail to obtain a sun-synchronous orbit. The coordinate system is a non-rotating system relative to the central reference satellite S0. The arrow points to the Earth's center. Magenta: The nearest neighbors of satellite S0. Orange: An example "peripheral" satellite S1. Orange dashed line: The position of S1 relative to the cluster center (in the non-rotating coordinate system).

The model shows that when the satellites are positioned only a few hundred meters apart, only moderate "station-keeping" maneuvers may be required to maintain the stability of the constellation within the desired sun-synchronous orbit.

3. Radiation Tolerance of TPUs

For ML accelerators to work effectively in space, they must be able to withstand the environment of low Earth orbit.

Google tested its Trillium (v6e) Cloud TPU in a 67 MeV proton beam to examine its susceptibility to total ionizing dose (TID) effects and single-event effects (SEEs).

The results are promising. Although the high-bandwidth memory (HBM) subsystem was the most sensitive component, it began exhibiting anomalies only after an accumulated dose of 2 krad(Si), nearly three times the expected (shielded) five-year mission dose of 750 rad(Si). No hard failures attributable to TID were observed up to the maximum tested dose of 15 krad(Si) on a single chip, indicating that the Trillium TPU is remarkably radiation-tolerant for space applications.

4. Economic Feasibility and Launch Costs

Historically, high launch costs have been a major obstacle to large-scale space-based systems. However, Google's analysis of historical and projected launch pricing data suggests that, if current trends continue, prices may fall below $200 per kilogram by the mid-2030s.
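A learning-curve (Wright's law) projection of this kind can be sketched as follows. The starting price, learning rate, and cumulative-mass growth factors here are illustrative assumptions, not figures from Google's analysis.

```python
import math

# Hedged sketch of a launch-cost learning curve (Wright's law):
# cost per kg falls by a fixed fraction each time cumulative launched
# mass doubles:  cost(m) = c0 * (m / m0) ** log2(1 - learning_rate).
# All parameters below are illustrative, not from Google's paper.

def projected_cost_per_kg(c0, m0, m, learning_rate):
    """c0: today's $/kg at cumulative mass m0; m: future cumulative mass."""
    b = math.log2(1 - learning_rate)
    return c0 * (m / m0) ** b

# Assume $1500/kg today and a 20% cost drop per doubling.
# Six doublings (64x cumulative mass) brings the price to ~$393/kg...
print(round(projected_cost_per_kg(1500, 1, 64, 0.20)))    # -> 393
# ...and ten doublings (1024x) brings it under $200/kg, to ~$161/kg.
print(round(projected_cost_per_kg(1500, 1, 1024, 0.20)))  # -> 161
```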

At that price point, the launch and operating costs of a space-based data center, measured per kilowatt-year, could become roughly comparable to the reported energy costs of an equivalent terrestrial data center.

SpaceX launch prices per kilogram of payload, in inflation-adjusted "lowest achieved price" terms since the first successful Falcon 1 flight, shown by rocket family. Note the significant price drops that accompanied the introduction of the Falcon 9 and Falcon Heavy.

Future Directions

Google's preliminary analysis shows that the core concept of space-based ML computing is not hindered by fundamental physics or insurmountable economic barriers.

However, significant engineering challenges still remain, such as thermal management, high-bandwidth ground communication, and on-orbit system reliability.

To begin addressing these challenges, Google's next milestone is to conduct a "learning mission" in collaboration with Planet, planning to launch two prototype satellites in early 2027. This experiment will test the operation of Google's models and TPU hardware in space and validate the feasibility of performing distributed ML tasks using optical communication inter-satellite links.

Ultimately, as research continues, a gigawatt-scale satellite constellation may become possible; this could in turn give rise to new computing architectures that are more naturally suited to the space environment.

Just as complex system-on-chip (SoC) technology was driven by, and in turn enabled, the modern smartphone, scaling and integration could unlock vast possibilities in space.

Reference Link

https://research.google/blog/exploring-a-space-based-scalable-ai-infrastructure-system-design/

This article is from the WeChat official account "Machine Intelligence". Editor: Panda. Republished with permission from 36Kr.