
HuggingFace has released a practical guide of more than 200 pages, offering step-by-step guidance on training large models, from decision-making to implementation.

机器之心 · 2025-11-10 07:57
A practical journey through the challenges, decisions, and messy realities of training state-of-the-art language models.

Recently, HuggingFace published a technical blog post of more than 200 pages, systematically sharing its end-to-end experience in training state-of-the-art LLMs.

The blog focuses on the "messy reality" of LLM development. It candidly records which methods work, which fail, and how to handle the pitfalls encountered in real engineering. The content is based on the team's hands-on project experience, in particular its recent training of the 3B-parameter model SmolLM3 on 384 H100 GPUs.

The blog offers in-depth technical details, code snippets, and debugging techniques, making it highly instructive for readers who want to build LLMs themselves.

Below is an overview of the blog content. It is highly recommended that interested readers read the original article.

  • Blog address: https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#positional-encodings--long-context

Training Compass: Why → What → How

This part poses a key question before delving into the technical details (how to train): "Do you really need to train this model?"

Given the emergence of world-class open-source models (such as Qwen, Gemma, and Llama), most people do not need to train their own models from scratch.

Why

The article lists some wrong reasons for training a model, such as: "We have idle computing power", "Everyone else is doing it", or "AI is the future".

Then it provides a flowchart to help you think about whether you really need to train your own model.

Only when existing models are unusable -> prompt engineering cannot solve the problem -> fine-tuning cannot solve the problem should you consider training from scratch.

Customized pre-training is usually applicable to three main areas:

  • Research: You have a clear scientific question to answer. For example, testing a new optimizer, exploring model capabilities (e.g., using only reinforcement learning), or testing a new dataset (e.g., pure synthetic data).
  • Production: Your business has specific needs that existing models cannot meet, such as highly specialized vocabulary or logic in DNA, law, or finance; a requirement to run on specific hardware (e.g., drones, local FPGAs) or meet strict latency targets; or operation in a regulated industry that demands 100% control and traceability of training data and model behavior.
  • Strategic Open Source: You find and have the ability to fill a specific gap in the current open-source ecosystem.

What

Once you have clarified the "Why", you can deduce the "What to train". This includes the model type (dense, MoE, hybrid, a new type), model size, architectural details, and data mix.

At the same time, the previous domain goals determine your training decisions: for example, for device-side operation -> train a small and efficient model; need multilingual capabilities -> use a larger tokenizer vocabulary; ultra-long context -> use a hybrid architecture.

This decision-making process has two stages. Planning: map your constraints (from the "Why") to concrete model specifications. Verification: test your choices through systematic ablation experiments.

The article points out two key characteristics of successful LLM training teams:

  • Iteration Speed: Training an LLM is a "learn while training" process. Teams that can iterate and train new models quickly and frequently (e.g., quarterly instead of annually) will progress faster.
  • Data Management: The best teams are those that are "obsessed with high-quality data". The impact of data quality far exceeds the choice of architecture.

The article also suggests that the pre-training team does not need to be large at the start (2–3 people are enough); the key is to have sufficient computing power and maintain rapid iteration.

Every Large Model Starts with a Small Ablation

Before starting to train an LLM, a series of key decisions must be made (architecture, optimizer, data mix, etc.). People often assume these decisions come from careful deliberation, but reasoning alone is not enough, because LLM behavior is often counterintuitive.

A typical example is that using the seemingly "highest quality" arXiv scientific paper data may actually damage the performance of the model (especially small models) because it is too specialized and lacks the diversity of general text.

Since pure thinking doesn't work, the answer is to "run a large number of experiments" like an empiricist (i.e., ablation experiments).

The complete process of setting up an ablation experiment:

Choose Your Baseline

Don't start from scratch. Instead, choose a verified, mature architecture (such as Llama 3.1, Qwen3, or Gemma3) as a starting point, so you inherit all of its known optimizations and stability lessons.

Although the baseline is good, it is not tailored to you, so it needs to be modified. However, "any change in the architecture comes with risks". Therefore, you must abide by the "risk mitigation" discipline, that is: "Don't change anything unless you have tested that it really helps."

The difficulty of modification lies in the large number of components and their interactions. You can't test all combinations. The correct method is to test only one potential change at a time. If it works, integrate it to make it the new baseline, and then test the next change.
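
To make this discipline concrete, here is a minimal sketch of the one-change-at-a-time loop; the `train_and_evaluate` helper and the candidate changes are hypothetical placeholders, not code from the playbook:

```python
def train_and_evaluate(config: dict) -> float:
    """Placeholder: launch an ablation run and return an aggregate benchmark score."""
    return 0.0   # in practice: train on a fixed token budget, then run the downstream evals

baseline = {"attention": "MHA", "tied_embeddings": True, "z_loss": False}
candidates = [{"attention": "GQA"}, {"z_loss": True}]   # one modification per run

best_score = train_and_evaluate(baseline)
for change in candidates:
    trial = {**baseline, **change}            # apply exactly one change
    score = train_and_evaluate(trial)
    if score > best_score:                    # adopt it only if it measurably helps
        baseline, best_score = trial, score   # the winner becomes the new baseline
```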

Choose a Training Framework

This is a key technical decision that requires a trade-off between functionality, stability, and throughput.

The article compares several mainstream frameworks:

  • Megatron-LM / DeepSpeed: Powerful and battle-tested, but with a large and complex codebase.
  • TorchTitan: More lightweight, easy to get started and experiment with, but relatively new.
  • nanotron (the authors' in-house framework): Provides complete flexibility, but requires a large investment in development and testing.

Design the Ablation Experiment

The experiment must be fast enough (for rapid iteration) and reliable enough (the results can be extrapolated to the final model). There are two main methods:

  • Full-size model with a small amount of data: Use the size of the final model (e.g., the 3B model for SmolLM3), but train on fewer tokens (e.g., 100B instead of 11T).
  • Small proxy model: If the target model is too large (e.g., 1T parameters), use a scaled-down proxy model (e.g., a 3B model) for experiments.

Next, the article introduces its baseline ablation setup (a 1B Llama model trained on 45B tokens) and shows the key parts of the configuration file (data, model, optimizer, etc.).
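
The blog shows its actual nanotron YAML; as a rough illustration of what such a setup covers, here is a hypothetical sketch in Python (all field names and values are ours, not copied from the blog):

```python
ablation_config = {
    "model": {
        "architecture": "llama",      # start from a proven baseline
        "hidden_size": 2048,          # roughly 1B-parameter ablation model
        "num_layers": 16,
        "num_attention_heads": 32,
    },
    "data": {
        "mixture": {"web": 0.85, "code": 0.10, "math": 0.05},  # example weights
        "total_tokens": 45_000_000_000,   # 45B tokens per ablation run
        "sequence_length": 4096,
    },
    "optimizer": {
        "name": "adamw",
        "learning_rate": 3e-4,
        "weight_decay": 0.1,
        "lr_schedule": "cosine",
    },
}
```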

Understand What Works: Evaluation

The article points out that when evaluating experimental results, looking only at the training loss is unreliable. For example, a lower loss when training on Wikipedia does not mean the model is more capable, and changing the tokenizer makes loss values incomparable across runs. Therefore, more fine-grained downstream evaluations are needed.

A reliable evaluation task should meet four criteria: monotonicity, low noise, performance above random, and ranking consistency.

Especially in early experiments, the cloze format (CF) is superior to the multiple-choice format (MCF), because the latter (e.g., MMLU) performs near chance level early in training and therefore provides no useful early signal.
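
As an illustration of why CF gives earlier signal, here is a minimal sketch of cloze-format scoring, where each candidate answer is ranked by its length-normalized log-likelihood under the model (the checkpoint, prompt, and choices are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, answer: str) -> float:
    """Average log-probability of the answer tokens given the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)    # predictions for tokens 1..L-1
    token_lps = [
        logprobs[i, full_ids[0, i + 1]]                     # log p(token_{i+1} | tokens_<=i)
        for i in range(prompt_len - 1, full_ids.shape[1] - 1)
    ]
    return float(torch.stack(token_lps).mean())             # length-normalized

# CF picks the continuation the model finds most likely; no instruction-following needed.
choices = ["Paris", "London", "Berlin"]
scores = {c: continuation_logprob("The capital of France is", " " + c) for c in choices}
print(max(scores, key=scores.get))
```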

The real value of ablation experiments lies not only in building good models but also in providing confidence for future debugging: when the main training inevitably goes wrong, the systematic experimental results can help the team quickly locate problems.

However, this value comes at a steep cost. For SmolLM3, the GPU time spent on ablations and debugging exceeded half that of the main training run.

Model Architecture Design

This part details the complete decision-making process of designing and determining the LLM architecture, from high-level goals to specific component selection and hyperparameter settings.

The article takes a 3B (3 billion parameter) model called SmolLM3 as an example, systematically showing how to build a "blueprint" for a model from scratch.

The article delves into the core architectural choices that make up modern Transformers and points out that today's models (such as Qwen3, Gemma3) share the Transformer foundation but solve specific problems (such as memory, stability) through component improvements (such as GQA, positional encoding).

  • Attention Mechanism: This is the main bottleneck during inference, and the key lies in the KV cache. The article compares MHA (standard, memory-heavy), MQA (extreme compression, may lose performance), and GQA (Grouped Query Attention). Ablation experiments confirm that GQA matches MHA in performance while greatly shrinking the KV cache, making it the final choice for SmolLM3 (see the sketch after this list).
  • Long Context: The article explores two strategies. First is document masking, which prevents the model from paying attention to other irrelevant documents in the sequence when training "packed" data. This has been proven crucial for long-context extension. Second is positional encoding. The standard RoPE has limited extrapolation ability on long sequences. SmolLM3 adopts a hybrid strategy of NoPE (actually RNoPE), that is, alternating between RoPE layers (for short contexts) and NoPE layers (for long-distance retrieval). Ablation experiments show that this method lays the foundation for long contexts without sacrificing short-context performance.
  • Embedding Sharing: For a small model like SmolLM3, the embedding layers account for a large share of the parameters. The article shows through ablations that spending those parameters on model depth (more layers) is more effective than untying the input and output embedding layers. Therefore, SmolLM3 adopts embedding sharing.
  • Stability: To prevent large-scale training runs from diverging, the article tests techniques such as Z-loss and QK-norm. In the end, SmolLM3 adopts OLMo2's technique of removing weight decay on the embedding layer to improve stability.
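
A minimal sketch of how GQA shrinks the KV cache (dimensions are illustrative, not SmolLM3's actual configuration):

```python
import torch
import torch.nn.functional as F

batch, seq, d_head = 1, 128, 128
n_q_heads, n_kv_heads = 16, 4                    # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq, d_head)
k = torch.randn(batch, n_kv_heads, seq, d_head)  # the KV cache stores only 4 heads,
v = torch.randn(batch, n_kv_heads, seq, d_head)  # i.e. 4x smaller than MHA's 16

# Broadcast each KV head across its group of query heads, then attend as usual.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v)    # (batch, n_q_heads, seq, d_head)
```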

The article compares three architectures: dense, MoE (Mixture of Experts), and hybrid. MoE achieves greater capacity for less computation through sparse activation (only some "experts" fire per token), but at the cost of very high memory usage. Hybrid models (e.g., those incorporating Mamba) address Transformers' computational bottleneck on long contexts through linear attention or SSMs. SmolLM3 sticks with a dense architecture because of its memory-constrained edge-deployment goal.
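
For intuition on the MoE side of this comparison, here is a toy sketch of top-k sparse routing (sizes and the routing scheme are illustrative only, not any particular model's implementation):

```python
import torch

n_experts, top_k, d = 8, 2, 512
tokens = torch.randn(64, d)                          # a batch of 64 token activations
router = torch.nn.Linear(d, n_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))

gates = torch.softmax(router(tokens), dim=-1)        # routing probabilities
weights, chosen = gates.topk(top_k, dim=-1)          # top-k experts per token

# Each token only runs through its k chosen experts: capacity scales with
# n_experts, while per-token compute scales only with top_k.
out = torch.zeros_like(tokens)
for slot in range(top_k):
    for e in range(n_experts):
        mask = chosen[:, slot] == e                  # tokens routed to expert e
        if mask.any():
            out[mask] += weights[mask, slot, None] * experts[e](tokens[mask])
```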

Subsequently, the article turns to the often-underestimated tokenizer. Choosing one involves the vocabulary size (which affects the compression rate and the size of the embedding matrix) and the algorithm (BPE is the most common).

The article introduces "fertility" (the average number of tokens per word) and the proportion of continued words (words split into multiple tokens) as evaluation metrics. After comparing Llama3, Gemma3, Qwen3, and others, SmolLM3 settles on Llama3's 128k vocabulary, which strikes the best balance between its target languages and model size.
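
A quick sketch of how fertility can be measured (the sample text is a stand-in for a real evaluation corpus, and the Llama repository is gated on the Hub):

```python
from transformers import AutoTokenizer

sample = "The quick brown fox jumps over the lazy dog."    # stand-in corpus

for name in ["Qwen/Qwen3-8B", "meta-llama/Llama-3.1-8B"]:  # Llama repo requires access
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(sample, add_special_tokens=False).input_ids)
    fertility = n_tokens / len(sample.split())             # tokens per word; lower = better compression
    print(f"{name}: vocab={tok.vocab_size}, fertility={fertility:.2f}")
```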

Next, the article explores the core elements that determine the training process: optimizer, learning rate, and batch size. The article points out that directly borrowing the hyperparameters of other models is simple but may not be optimal because these values are optimized for specific architectures, data, and constraints.

Finally, it reviews the classic trade-off between model scale (the number of parameters N) and data volume (the number of training tokens D).
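
As a back-of-the-envelope illustration of this trade-off, the common approximation C ≈ 6·N·D puts SmolLM3's training compute at roughly 2×10²³ FLOPs (our estimate, not a figure from the blog):

```python
# The common approximation: training compute C ≈ 6 * N * D FLOPs.
N = 3e9     # SmolLM3: ~3 billion parameters
D = 11e12   # ~11 trillion training tokens
C = 6 * N * D
print(f"~{C:.1e} FLOPs")   # ~2.0e+23 FLOPs
```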

The Art of Data Curation

This part details the "art of data curation", emphasizing that in LLM training, data is the key factor determining what the model "learns", and its importance even exceeds the model architecture.

The model architecture determines how the model learns, while the data determines what the model learns. If the data quality is poor or the "mixing ratio" is inappropriate, no good architecture or hyperparameters can save the situation.

The article points out that building an excellent dataset is not just about collecting good data but designing a training mix.

For example, excessively increasing the proportion of code data ("upsampling") implicitly reduces the proportion of everything else, which may hurt the model's general abilities.

In addition, for an extremely long training run like SmolLM3's (11T tokens), using only "the highest-quality" data would force heavy duplication of that data, which harms model performance.

To solve these balancing problems, modern LLM training has evolved from static mixing (as in GPT-3) to multi-stage training (as in Llama3 and SmolLM2), where the data mixture changes dynamically over the course of training.

The core insight is that the model's final behavior is deeply shaped by the data it sees at the end of training. Therefore, the strategy is as follows (see the sketch after this list):

  • In the early stage of training, use rich, diverse but slightly lower-quality data (such as web text).
  • At the end of training (especially in the "annealing stage" of learning rate decay), introduce scarce, high-quality data (such as professional mathematics and code datasets) to maximize its influence.
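
A minimal sketch of such a stage-based mixture schedule (stage boundaries and weights are invented for illustration, not SmolLM3's actual recipe):

```python
# Each stage is (token budget, mixture weights); weights within a stage sum to 1.
stages = [
    (8e12, {"web": 0.85, "code": 0.10, "math": 0.05}),  # early: broad, diverse data
    (2e12, {"web": 0.70, "code": 0.20, "math": 0.10}),  # mid: upsample code and math
    (1e12, {"web": 0.40, "code": 0.30, "math": 0.30}),  # annealing: scarce, high-quality data
]

def mixture_at(tokens_seen: float) -> dict:
    """Return the sampling weights in effect after `tokens_seen` tokens."""
    consumed = 0.0
    for budget, weights in stages:
        consumed += budget
        if tokens_seen < consumed:
            return weights
    return stages[-1][1]                                 # past the end: keep the final mix

print(mixture_at(9.5e12))   # mid-training mix: {'web': 0.7, 'code': 0.2, 'math': 0.1}
```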