
Stanford: A Battle Among the "Gods" of Optimizers? AdamW Wins with "Stability"

机器之心 · 2025-09-08 07:30
AdamW remains a robust choice for pre-training, but matrix-based methods demonstrate significant advantages under specific data-to-model ratios.

Since Adam was proposed in 2014, it and its improved variant AdamW have dominated the pre-training of open-weight language models, helping models remain stable and converge quickly on massive data.

As model scale has grown rapidly, pre-training has become a quintessential compute-intensive task and is often the single largest computational cost in large-model research and development. Against this backdrop, optimizer design directly affects convergence speed and computational cost.

Researchers have explored many directions for improvement. Among them, the fastest optimizers typically use matrix-based preconditioners (such as Muon, Soap, and Kron), which can deliver roughly a 30–40% per-iteration speedup over a well-tuned AdamW.

A study from Percy Liang's team at Stanford University points out that although many alternatives claim substantial speedups (1.4 to 2 times), AdamW remains a robust first choice for pre-training, and matrix-based methods show clear advantages under specific data-to-model ratios.

  • Paper title: Fantastic Pretraining Optimizers and Where to Find Them
  • Paper link: https://www.arxiv.org/pdf/2509.02046v1
  • GitHub: https://github.com/marin-community/marin/issues/1290
  • Blog: https://wandb.ai/marin-community/marin/reports/Fantastic-Optimizers-and-Where-to-Find-Them--VmlldzoxMjgzMzQ2NQ

The researchers believe this gap between claimed and realized speedups may stem from two key methodological flaws:

  • Problem 1: Unfair hyperparameter tuning.

The baseline is usually under-tuned: in the commonly used AdamW baseline, simply tuning the learning rate can yield up to a 2x speedup on a 130-million-parameter model.

Fixing shared hyperparameters does not guarantee a fair comparison: for example, relative to the standard weight-decay value of 0.1, the Lion optimizer prefers a much higher value (around 0.6).

Left: The commonly used AdamW baseline is under-tuned. In the GPT-3 training recipe proposed by Brown et al. [2020] and adopted by subsequent studies, simply adjusting the learning rate for a 100-million-parameter model can yield up to a 2x speedup, highlighting the importance of proper hyperparameter optimization. Right: Fixing hyperparameters across optimizers does not guarantee a fair comparison. In previous studies, shared hyperparameters such as the learning rate and weight decay are usually held constant, yet even for conceptually similar optimizers, the optimal values can differ substantially.
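To make the weight-decay point concrete, here is a minimal sketch based on the published update rules of AdamW and Lion; the hyperparameter values are illustrative defaults, not the tuned settings from the study. One way to see why the optimal decay differs: Lion's update is a sign vector, so it is typically run with a much smaller learning rate than AdamW, and because decoupled weight decay is multiplied by that same learning rate, a larger decay coefficient is needed to exert a comparable regularizing pull.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """One AdamW step: the update adapts to gradient statistics, and weight
    decay is applied in decoupled form, scaled only by the learning rate."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.6):
    """One Lion step: the update is a sign vector, so every coordinate moves by
    exactly lr; with a smaller typical lr, a larger decoupled weight decay is
    needed for a comparable effective decay (lr * wd per step)."""
    update = np.sign(beta1 * m + (1 - beta1) * g)
    w = w - lr * (update + wd * w)
    m = beta2 * m + (1 - beta2) * g
    return w, m
```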

  • Problem 2: Insufficient test scale

Most evaluations use only small models (well under 1 billion parameters) or only the 1x data ratio proposed in the Chinchilla paper. What happens at larger model scales or higher data ratios?

In addition, early-training checkpoints can be misleading: during the learning-rate decay phase, the loss curves of different methods may cross, reversing the final rankings. Final comparisons must therefore be made at the end of training, across the relevant settings.

Left: The speedup decays as the model scale increases. Although some optimizers show a relatively high speedup (1.3–1.4x) over AdamW on models with fewer than 1 billion parameters, the speedup decays to only 1.1x once the model reaches 1.2 billion parameters. Right: Matrix-based optimizers generally outperform scalar-based ones. The figure shows the training loss curves of three scalar-based optimizers (AdamW, Nesterov AdamW, Mars) and three matrix-based optimizers (Kron, Soap, Muon) under different Chinchilla data ratios. The matrix-based optimizers deliver a consistent speedup over the scalar-based ones, and under over-training the three matrix-based optimizers eventually converge to similar loss values.

To test this hypothesis, the researchers conducted a systematic comparative study covering eleven deep-learning optimizers, performing rigorous, independent hyperparameter tuning for each one across multiple model scales (from 100 million to 1.2 billion parameters) and data-to-model ratios (from 1x to 8x the Chinchilla-optimal ratio).

The optimizers used in this study.

The research findings are as follows:

  • Independent tuning is crucial: the optimal hyperparameter configuration of one optimizer often cannot be transferred directly to another. Without independent tuning, the comparison is not only unfair, but the actual speedup of new optimizers over a well-tuned AdamW is far smaller than claimed.
  • Short-term evaluation is misleading: judging optimizers within a short training window is unreliable. As training progresses and the learning rate decays, the rankings of different optimizers may reverse, and their loss curves may even cross multiple times.
  • Matrix methods lead in performance: all of the fastest optimizers use matrix-based preconditioners rather than traditional element-wise scalar scaling. Methods such as Muon, Soap, and Kron achieve a 30–40% per-step speedup over a strictly tuned AdamW.

Interestingly, the best choice also depends on the scenario: under the standard Chinchilla data ratio, Muon performs best; once the data-to-model ratio exceeds 8x, Soap becomes the better choice.
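As an illustration of what a matrix-based preconditioner does, the sketch below shows a simplified, Muon-style update: heavy-ball momentum on a 2D weight matrix, followed by approximate orthogonalization of the update via a cubic Newton-Schulz iteration. This is a minimal sketch of the idea, not the authors' code; Muon's reference implementation uses a tuned higher-order iteration with additional scaling, and Soap and Kron precondition in different ways.

```python
import numpy as np

def orthogonalize(M, steps=10, eps=1e-7):
    """Approximately map M to the nearest (semi-)orthogonal matrix using a
    cubic Newton-Schulz iteration; Frobenius normalization keeps the singular
    values inside the iteration's convergence region."""
    X = M / (np.linalg.norm(M) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_like_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon-style update: accumulate momentum, orthogonalize
    the resulting update matrix, then apply it to the weights."""
    momentum = beta * momentum + grad
    W = W - lr * orthogonalize(momentum)
    return W, momentum

# Hypothetical usage on a random hidden-layer weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 2048)) * 0.02
momentum = np.zeros_like(W)
grad = rng.normal(size=W.shape) * 0.01
W, momentum = muon_like_step(W, grad, momentum)
```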

Method

The study designed a rigorous three-stage methodology to evaluate these optimizers. The first is the general setup stage, which fixes the experimental environment: four Transformer models of different scales, ranging from 130M to 1.2B parameters, all with a sequence length of 4096, along with detailed configurations for layer count, hidden dimension, and other architectural choices.

The detailed architectural hyperparameters of each model scale studied.

In terms of data, the study mixed the DCLM-baseline, StarCoder V2, and ProofPile 2 datasets and tokenized them with the LLaMA-3 tokenizer, ensuring the richness of the training data. The evaluated optimizers are AdamW, NAdamW, Mars, Cautious, Lion, Adam-mini, Muon, Scion, Kron (PSGD), Soap, and Sophia, representing both mainstream and cutting-edge methods in deep-learning optimization.

Stage I: Comprehensive parameter scanning

This stage addresses the problem that the baseline optimizer's performance is underestimated because of improper hyperparameter tuning. Using coordinate descent, the study exhaustively searched the hyperparameters of every optimizer (learning rate, weight decay, warmup steps, β₁, β₂, ε, maximum gradient norm, and batch size) over a preset grid, as sketched below.
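A rough sketch of what such a coordinate-descent sweep looks like follows. The grid values and the train_and_eval function are hypothetical placeholders; the paper's actual grids, ranges, and budgets differ per optimizer and per scale.

```python
# Sketch of a coordinate-descent hyperparameter sweep: tune one hyperparameter
# at a time while holding the others fixed, and repeat until nothing improves.
# `train_and_eval(config) -> validation loss` is an assumed, user-supplied function.

GRID = {  # hypothetical grid values, for illustration only
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "weight_decay": [0.0, 0.1, 0.3, 0.6],
    "warmup_steps": [500, 1000, 2000],
    "beta1": [0.9, 0.95, 0.98],
}

def coordinate_descent_sweep(train_and_eval, grid, init, max_rounds=3):
    best, best_loss = dict(init), train_and_eval(init)
    for _ in range(max_rounds):
        improved = False
        for name, values in grid.items():
            for value in values:
                if value == best[name]:
                    continue
                candidate = {**best, name: value}
                loss = train_and_eval(candidate)
                if loss < best_loss:
                    best, best_loss, improved = candidate, loss, True
        if not improved:  # converged: no single-coordinate change helps
            break
    return best, best_loss
```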

The settings in this stage cover training the 130M, 300M, and 500M models with a 1x Chinchilla data budget, as well as training the 130M model with 2x, 4x, and 8x Chinchilla data budgets.

The study found that rigorous hyperparameter tuning for each optimizer is crucial, because the optimal configurations differ significantly across optimizers, and blindly transferring hyperparameters leads to unfair comparisons.

In addition, the study observed that, relative to the well-tuned AdamW baseline, the actual speedups are generally lower than the values claimed in some previous studies.

Stage II: Sensitive hyperparameter identification

Building on the first-stage results, the study identified sensitive hyperparameters whose optimal values change with model scale, such as the learning rate and warmup length. These sensitive hyperparameters were then searched over a grid for the 300M and 500M models and for the 2x, 4x, and 8x Chinchilla data budgets.

The main results of the first and second stages. Top: validation losses on the C4/EN dataset for the first- and second-stage experiments; each point corresponds to the best loss each optimizer achieves at the corresponding Chinchilla data ratio. Bottom: performance on the HellaSwag benchmark for a subset of optimizers, namely the AdamW baseline, the two best scalar-based optimizers, and the three best matrix-based optimizers; the data come from their respective best runs.

Combining the results of the first two stages, the study obtained near-optimal hyperparameter sets and their corresponding losses under 12 different settings. To quantify each optimizer's speedup over AdamW, the study fitted a scaling law for AdamW's loss as a function of the data budget and defined the speedup as the ratio of the amount of data AdamW would need to reach the same loss to the data the optimizer actually used.
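The speedup calculation can be sketched as follows. The functional form of the fitted law (a saturating power law in the data budget) is an assumption for illustration; the paper's exact parameterization may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def adamw_loss_law(D, E, A, alpha):
    # Assumed form: irreducible loss E plus a power-law term in the data budget D.
    return E + A * np.power(D, -alpha)

def speedup_vs_adamw(adamw_D, adamw_loss, optimizer_D, optimizer_loss):
    """Fit AdamW's loss-vs-data curve, then report how much AdamW data would be
    needed to match the other optimizer's loss, divided by the data it used."""
    (E, A, alpha), _ = curve_fit(
        adamw_loss_law, np.asarray(adamw_D, float), np.asarray(adamw_loss, float),
        p0=(min(adamw_loss) * 0.9, 1.0, 0.3), maxfev=20000,
    )
    # Invert E + A * D^(-alpha) = loss; requires optimizer_loss > E.
    D_equiv = ((optimizer_loss - E) / A) ** (-1.0 / alpha)
    return D_equiv / optimizer_D
```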

The study found that although matrix-based optimizers generally outperform scalar-based ones, their measured speedups never exceed 1.4x. Many alternative optimizers appear advantageous for small models or limited data ratios, but as the model scale grows these advantages shrink or even reverse, and AdamW remains the most robust choice for pre-training.

Stage III: Case study

This stage explores larger-scale experiments in depth. The study first checked how well the tuned hyperparameters extrapolate by fitting a smooth law that predicts the optimal settings as a function of the model scale N and the data scale D.
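As a purely illustrative assumption (the paper's exact parameterization may differ), such fits often take a power-law form in both scales:

```latex
h^{*}(N, D) \;\approx\; h_0 \, N^{a} \, D^{b}
```

where h* stands for the predicted optimal value of a sensitive hyperparameter (for example, the learning rate), N is the parameter count, D is the token count, and h_0, a, b are fitted constants.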

To validate these scaling laws, the study ran a comprehensive sweep on the 1.2B model with a 1x Chinchilla data budget. The results show that the gap between the predicted configuration and the actual optimum is minimal, confirming the effectiveness of the prediction.

The study then conducted two case studies: training the 1.2B model with 1x to 8x Chinchilla data budgets to examine how optimizer speedups change as the model scale grows, and training the 130M and 300M models with a 16x Chinchilla data budget to observe optimizer behavior under an extreme data-to-model ratio.

Case analysis. Left: scaling of validation losses for four optimizers (AdamW, NAdamW, Muon, and Soap) on a 1.2-billion-parameter model. Muon and Soap still show a clear speedup over AdamW, but no obvious advantage over NAdamW. Middle: the speedup was estimated with the same method as in Figure 3; the speedups of Muon and Soap decay as the model scale increases and eventually drop to only 1.1x. Right: with a 300-million-parameter model and a 16x Chinchilla data ratio, Soap outperforms Muon as the data-to-model ratio increases further.

The results of this stage further reveal a potential limitation of Muon: although it still accelerates training on a model as large as 1.2B parameters, the speedup drops below 1.2x. Under a high data-to-model ratio (such as 16x Chinchilla), NAdamW and Soap outperform Muon on the 130M model, and Soap also outperforms Muon on the 300M model. The study speculates that at high data-to-model ratios, the second-order moment information maintained by Soap and Kron becomes more effective.
