
Groundbreaking Paper from Google's Jeff Dean: Elastic Large-Scale Distributed Pretraining Is Finally Feasible

MachineHeart 2026-04-25 11:04
A goal left unfinished in a paper that Jeff Dean first-authored 14 years ago has now been achieved by Dean himself.

Elastic AI pre-training has advanced to the next frontier! No surprise: it's from Google.

Their proposed Decoupled DiLoCo is a distributed training technique that can harness heterogeneous hardware around the world, and even when that hardware fails, the system keeps running!

The result has drawn wide attention: a post on X by Arthur Douillard, one of the paper's lead authors, has received over 2.6 million views.

Notably, Jeff Dean, the well-known researcher and Chief Scientist of Google DeepMind and Google Research, is also among the authors, and he posted several tweets introducing the result.

In those tweets, he also recalled his first-author paper from 14 years ago, "Large Scale Distributed Deep Networks". In that NeurIPS 2012 paper, they had already shown that large-scale, asynchronous techniques could train very large neural networks, distributing the work across thousands of machines in a fault-tolerant manner.

Now, Decoupled DiLoCo is poised to turn that concept into practical large-scale engineering.

Background: The larger the scale, the more frequent the failures

To understand the significance of this work, we first need to understand a fundamental dilemma in modern AI training.

Today's large language models are commonly trained with a parallel scheme called SPMD (Single Program Multiple Data). Simply put, it is like a factory where every worker operates the production line in lockstep: each does their own step, but everyone must finish at the same time before the line moves on. If any workstation has a problem, the entire line stops and waits.
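That lockstep constraint is easy to see in code. Below is a minimal Python sketch of the synchronous semantics (the function and its failure handling are illustrative assumptions, not any framework's actual API):

```python
import numpy as np

def spmd_step(worker_grads):
    """One synchronous data-parallel step: the global update exists
    only once EVERY worker's gradient has arrived (an all-reduce).
    A single missing worker stalls the whole 'production line'."""
    if any(g is None for g in worker_grads):  # a worker crashed or lags
        raise RuntimeError("step blocked: waiting on a failed worker")
    return np.mean(worker_grads, axis=0)      # average across all workers
```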

This is not a problem on a small scale. But when the cluster scale expands to hundreds of thousands or even millions of chips, probability starts to play a role.

The paper does the arithmetic directly. Suppose each chip fails only once a year on average, which sounds quite reliable. With 2.4 million chips in the cluster, however, the mean time between failures for the whole cluster shrinks to under a minute. At this scale, hardware failures can no longer be treated as accidents; they are part of the daily routine of training.
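A back-of-the-envelope check of that claim (the 2.4 million chips and once-a-year failure rate are the paper's assumptions):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # ~31.5 million seconds
num_chips = 2_400_000                # cluster size assumed in the paper
failures_per_chip_per_year = 1       # "fails only once a year"

cluster_failures_per_year = num_chips * failures_per_chip_per_year
cluster_mtbf = SECONDS_PER_YEAR / cluster_failures_per_year
print(f"Cluster-wide mean time between failures: {cluster_mtbf:.1f} s")
# -> roughly 13 seconds, i.e. well under a minute
```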

The existing answer is so-called "elastic training": when a machine crash is detected, the cluster configuration is rebuilt and the remaining healthy machines carry on. But this reconfiguration itself takes considerable time, during which the entire cluster performs no useful computation.

Simulation data in the paper show that at the 2.4-million-chip scale, even with an elastic mechanism, the effective computing time (the "Goodput", or effective throughput) is only 40%. In other words, for 60% of the time the cluster is waiting or reconfiguring, and its computing power is simply wasted.

Breaking the shackles of "being in sync"

The core idea of Decoupled DiLoCo is to completely abandon the premise of keeping all machines in sync.

The framework splits the training cluster into several independent "Learners". Each learner trains on its assigned data without waiting for the others; when one learner fails, the rest are entirely unaware of it and keep to their own training rhythm. It is like splitting one large exam hall into several independent rooms: if a fire forces one room to evacuate, the students in the other rooms keep writing.

So how do the learners collaborate to train the same model in the end?

Here a lightweight "Syncer" comes in. It runs on relatively stable CPU resources and is responsible for periodically collecting each learner's parameter updates, merging them, and pushing the merged result back to the learners.

The key point is that the syncer does not wait for all learners to be ready before merging. As soon as a sufficient number of learners (the paper's "Minimum Quorum") report their progress, the syncer can start working; a failed learner is simply skipped and catches up after it recovers.

In addition, learners may run at different speeds (especially when mixing old and new chip generations), so a fast learner processes more data than a slow one within a synchronization interval. To keep the merge from effectively giving the fast learner multiple votes, the syncer introduces a dynamic weighting mechanism based on the number of processed tokens, so the merged result more fairly reflects each learner's actual contribution.
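In formula terms (the notation is mine, not the paper's): if learner $i$ contributes update $\Delta_i$ after processing $t_i$ tokens in a round, the merged update would be a token-weighted average,

$$\Delta = \sum_i w_i \, \Delta_i, \qquad w_i = \frac{t_i}{\sum_j t_j},$$

so each learner's weight is proportional to the data it actually processed.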

There is also a detail called the "Adaptive Grace Window": after the minimum quorum is reached, the syncer does not merge immediately but waits a little longer so that more learners can catch up to this round of synchronization, improving the quality of each merge. The extra wait is carefully bounded so it does not slow overall training.
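The mechanics described above fit in a few lines of Python. The sketch below combines the minimum quorum, the grace window, and the token-weighted merge; all names and the polling interface are assumptions for illustration, not the paper's implementation:

```python
import time

def syncer_round(reports, min_quorum, grace_window_s, poll_fn):
    """One synchronization round.

    reports:        dict of learner_id -> (param_update_array, tokens_processed)
    min_quorum:     merging may begin once this many learners have reported
    grace_window_s: extra wait after quorum so stragglers can join the round
    poll_fn:        callable returning newly arrived reports (same dict format)
    """
    # Wait until the minimum quorum of learners has reported.
    while len(reports) < min_quorum:
        reports.update(poll_fn())
        time.sleep(0.1)

    # Adaptive grace window: quorum is met, but wait a bounded extra
    # period so more learners make it into this round.
    deadline = time.monotonic() + grace_window_s
    while time.monotonic() < deadline:
        reports.update(poll_fn())
        time.sleep(0.1)

    # Token-weighted merge: a learner that processed more data
    # contributes proportionally more to the merged update.
    total_tokens = sum(t for _, t in reports.values())
    merged = sum(update * (t / total_tokens)
                 for update, t in reports.values())
    return merged  # pushed back to all learners, including late ones
```

Learners that miss the round are not punished: they pick up the merged result whenever they next report in.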

Another technical detail is "Balanced Tensor Fragmentation". Instead of transmitting the model parameters as a whole, they are cut into fragments of similar size, and only one fragment is transmitted at each step. This spreads the communication load evenly and avoids "pulsed" transmission with spiky bandwidth usage.
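A minimal sketch of that idea (names are mine): split the flattened parameters into near-equal fragments and send one per step in round-robin order.

```python
import numpy as np

def make_fragments(flat_params: np.ndarray, num_fragments: int):
    """Split a flat parameter vector into near-equal-size fragments."""
    return np.array_split(flat_params, num_fragments)

def fragment_for_step(fragments, step: int):
    """Round-robin schedule: each step transmits exactly one fragment,
    so bandwidth use stays flat instead of arriving in large pulses."""
    return fragments[step % len(fragments)]

# Example: 10M parameters in 100 fragments -> ~100k values per step.
fragments = make_fragments(np.zeros(10_000_000, dtype=np.float32), 100)
chunk = fragment_for_step(fragments, step=42)
```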

Experimental results: Performance hardly drops even with a very high failure rate

The paper validates this design with extensive experiments.

On Goodput (effective throughput): in a simulated scenario with 2.4 million chips, each failing once a year on average (so the whole cluster fails more often than once a minute), Decoupled DiLoCo with 8 learners maintains 88% Goodput. Under the same conditions, the traditional elastic data-parallel scheme reaches only 58%.

On model quality: the paper compares a 5B-parameter dense model trained on 1 trillion tokens. On both text benchmarks (ARC, BoolQ, HellaSwag, etc.) and visual benchmarks (DocVQA, TextVQA, etc.), the downstream results of Decoupled DiLoCo are almost identical to those of traditional data-parallel training. In other words, fault tolerance improves dramatically without sacrificing model quality.

The paper also tests the scheme with mixed chip generations (TPUv5e and TPUv5p). Even with the slowest learner running nearly 20% slower than the fastest, the combination of the minimum quorum and the adaptive grace window still yields model quality comparable to fully synchronous training, while keeping compute utilization at 100%.

The bandwidth numbers are particularly striking. To reach 90% compute utilization in a scenario with a 1-second compute step across 2 data centers, the traditional data-parallel scheme needs about 104 Gbit/s of bandwidth; Decoupled DiLoCo needs only 1.7 Gbit/s, and with int4 compression this drops further to 0.43 Gbit/s. The bandwidth requirement falls by roughly two orders of magnitude.
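A quick sanity check of those ratios (the figures come from the article; the 16-bit uncompressed baseline behind the int4 factor is my assumption):

```python
dp_gbps = 104.0          # traditional data parallelism
diloco_gbps = 1.7        # Decoupled DiLoCo, uncompressed
diloco_int4_gbps = 0.43  # Decoupled DiLoCo with int4 compression

print(f"vs. data parallel: {dp_gbps / diloco_gbps:.0f}x less bandwidth")  # ~61x
print(f"with int4:         {dp_gbps / diloco_int4_gbps:.0f}x less")       # ~242x
# 1.7 / 0.43 ~= 4, consistent with int4 packing 4x more values
# per transmitted bit than a 16-bit format.
```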

The bigger opportunity: "scavenging" computing power

The low bandwidth requirement brings an unexpected bonus: temporarily available computing resources can be "scavenged" whenever they appear.

In traditional data-parallel training, adding a new machine requires first transferring the full current model parameters to it. This can hold up the entire cluster, and training efficiency drops sharply at the moment a machine joins.

Decoupled DiLoCo is different. A new learner can asynchronously pull a copy of the current model state from a neighboring learner, and throughout this period the other learners are completely unaffected and continue training normally.
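The join procedure could look something like the sketch below; the neighbor and syncer interfaces are hypothetical, named here only to make the flow concrete:

```python
def join_as_new_learner(neighbor, syncer, train_fn):
    """Hypothetical join flow for a freshly scavenged learner."""
    # 1. Asynchronously copy the current model state from a neighbor;
    #    the neighbor keeps training while serving the snapshot.
    state = neighbor.snapshot_model_state()   # assumed API
    # 2. Tell the syncer to expect this learner in future rounds;
    #    no global pause or cluster reconfiguration is needed.
    syncer.register_learner()                 # assumed API
    # 3. Start training; the newcomer simply shows up in a later
    #    synchronization round, token-weighted like any other learner.
    train_fn(initial_state=state)
```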

The paper runs an experiment in which extra temporary learners are dynamically added during training (simulating the extra compute that becomes available during the day). The results show that the more temporary compute is added, the sooner training finishes, with no loss in model quality. In the data-parallel baseline under the same settings, the added compute has to be more than doubled before any benefit appears.

This means that scattered computing power distributed in different regions, different time zones, and different generations of hardware can also be incorporated into the same training task, even if the network bandwidth between them is only a fraction of that inside an ordinary data center.

An old idea finally meets the engineering conditions

Recalling the 2012 paper, Jeff Dean said the idea was already on their minds back then: if a certain degree of inconsistency can be tolerated, can training be made more elastic? But the scale and engineering conditions of the time kept the idea from being fully realized.

Fourteen years later, with models at billions of parameters and training clusters routinely spanning hundreds of thousands or even millions of chips, this is no longer merely a research question but an engineering problem that must be solved.

Decoupled DiLoCo's answer: abandon global strong consistency, trading it for availability through asynchrony and decentralization, while careful algorithm design shrinks the loss in model quality to nearly negligible.

The paper closes by noting that as pre-training expands to cross-regional clusters, environments with limited bandwidth and unreliable hardware will become increasingly common, and the "availability-first" training paradigm will shift from an advantage to a necessity.

It seems this paper is redefining the infrastructure for next-generation ultra-large-scale model training.

This article is from the WeChat official account "MachineHeart" (ID: almosthuman2014) and is published by 36Kr with authorization.