
Peking University collaborates with LLaMA-Factory to launch DataFlex: an industrial-grade dynamic data training system

Machine Heart (机器之心) · 2026-04-15 17:39
Build a data infrastructure to support next-generation AI applications.

As large-model training enters the deep-water zone, the key battleground is no longer just "how to tune model parameters." It is shifting to a more fundamental problem that is hard to solve systematically: what data does the model actually see during training, in what proportions, and which samples should it learn from more often?

These factors increasingly and directly determine training efficiency, generalization ability, and the model's final performance.

The research community has proposed many methods for data selection, data mixing, and sample re-weighting. For a long time, however, most of them have been scattered across independent code repositories, with inconsistent interfaces and training pipelines, high reproduction costs, and no easy way to compare them side by side.

More importantly, many methods rely on embeddings, model scores, gradients, or intermediate inference signals. The real difficulty has never been proposing a method, but integrating these methods into mainstream training pipelines stably and reproducibly, as part of a unified training loop.

Recently, the team of Professor Zhang Wentao and Academician E Weinan at Peking University, together with the LLaMA-Factory team, OpenDataLab, and Shanghai AI Lab, released DataFlex, a data-centric dynamic training framework for the large-model training process.

It is not a single algorithm or a pile of scripts, but unified training infrastructure built on LLaMA-Factory: it integrates the three core capabilities of dynamic sample selection, dynamic data mixing, and dynamic sample weighting directly into the training process, upgrading "how data participates in training" from empirical configuration to a controllable, optimizable, and reproducible system capability.

In other words, DataFlex does not merely ask whether a particular training technique works; it tackles a more fundamental system problem: how to make data, like model parameters, a first-class object that can be continuously scheduled and optimized during training.

This lets it serve both as a research platform for systematically comparing data-centric training algorithms and as a practical system for scenarios such as large-model pre-training, post-training, and domain adaptation.

After its release, DataFlex quickly drew wide attention on the Hugging Face Daily Papers list and ranked first on the monthly list. In essence, this attention reflects the community's recognition of data-centric dynamic training moving from theory to an engineering closed loop.

DataFlex is not just an algorithm repository, but data-centric training infrastructure

  • Reproducible research platform: systematically compare data-centric training methods such as dynamic data mixing, sample selection, and sample weighting under one training framework, covering both online and offline scenarios, and sharply reducing the cost of reproducing and comparing methods;
  • Optimization system for real-world training: integrate data selection, data-ratio adjustment, and sample-weight adjustment into the training loop itself, turning data from static input into an optimization object that can be continuously scheduled, thereby improving training efficiency and final model quality.

Technical report: https://arxiv.org/abs/2603.26164

Official documentation: https://opendcai.github.io/DataFlex-Doc/

GitHub repository: https://github.com/OpenDCAI/DataFlex

DataFlex: the last piece of the puzzle for industrial-grade data scheduling in large models

Design philosophy: say goodbye to static feeding and turn "data scheduling" into an out-of-the-box system capability

1. Core concept: a data-centric dynamic training system

The core of DataFlex is not to repeat the old adage that "data matters," but to attack the industry's sorest point directly: how to turn the tacit, empirical choices of "what data the model sees, in what proportions, and which samples to prioritize" into a standardized system capability that is configurable, schedulable, and reproducible. It tracks not only gradient updates to the parameters, but also how data actually participates in every step of training.

1.1 From "force-feeding static input" to "active data scheduling"

In traditional large-model training, data is treated as static input prepared in advance: the dataset is fixed first, the sampling scheme is fixed up front, and during training it is mainly the model parameters that keep being optimized. But as training data grows in scale and its sources grow more diverse, what really determines the outcome is no longer just "whether there is more data," but "whether data can be used more intelligently during training."

The core idea of data-centric dynamic training is to elevate data from passive input to an actively scheduled object. The system not only decides which data the model sees, but also dynamically decides how to mix different data sources, which samples to learn first, and which samples to down-weight.

The value of DataFlex lies in consolidating capabilities that used to be scattered across different methods and codebases into a unified, standardized training mechanism.

1.2 A unified framework with zero-cost migration

A good system should not burden developers. Beyond dynamic scheduling, DataFlex solves a further system-level problem: unifying the previously scattered methods for data selection, data-ratio adjustment, and data re-weighting onto the same training infrastructure.

On one hand, DataFlex is built on LLaMA-Factory and reuses its existing model management, data processing, and training components wherever possible; on the other, it introduces unified data-centric control at the training layer, so that different data strategies can be implemented, compared, and extended within the same training loop.

DataFlex is therefore not a loose collection of data algorithms, but a unified data-centric dynamic training system for the large-model training process.

2. Three design principles

Unification: the system unifies three representative paradigms of data-centric training within a single training framework;

Compatibility: the system can be integrated into existing large-scale model training infrastructure without introducing an extra workflow;

Extensibility: researchers can implement and compare new data-centric algorithms at relatively low engineering cost.

Overall architecture

DataFlex retains LLaMA-Factory's clear, easy-to-use design while making key upgrades to the overall architecture. Without disrupting the existing training ecosystem, it turns data-centric training into a unified, extensible, reproducible, and deployable system capability. The system can be roughly divided into three layers:

  • Base Layer: inherited from LLaMA-Factory, this layer provides general training capabilities such as model management, data processing, and optimizers. By preserving the original training workflow and usage habits as far as possible while focusing its extensions on data-centric training itself, the system lowers the barrier for users migrating from an existing training setup to DataFlex.
  • Trainer Layer: instead of a single monolithic trainer, the training process is abstracted into three data-centric training modes, corresponding to data selection, data mixing, and sample weighting. This layer expands the trainer's job from parameter updates alone to both data decision-making and parameter optimization.
  • Component Layer: concrete algorithm components, such as the various selectors, mixers, and weighters, plug in here. They encapsulate the strategy logic of different methods and expose a unified interface to the trainer.
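As a rough illustration of what such a Component Layer contract could look like, here is a minimal Python sketch. All class and method names (Selector, Mixer, Weighter, select/mix/weight) are assumptions for illustration, not DataFlex's actual API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class Selector(ABC):
    """Decides which sample indices enter the next training phase."""
    @abstractmethod
    def select(self, model_state: Dict, candidate_ids: List[int]) -> List[int]: ...

class Mixer(ABC):
    """Decides the sampling ratio over data domains."""
    @abstractmethod
    def mix(self, model_state: Dict, domains: List[str]) -> Dict[str, float]: ...

class Weighter(ABC):
    """Assigns a per-sample training weight."""
    @abstractmethod
    def weight(self, model_state: Dict, sample_ids: List[int]) -> Dict[int, float]: ...

class LossTopKSelector(Selector):
    """Toy selector: keep the k samples with the highest current loss."""
    def __init__(self, k: int):
        self.k = k

    def select(self, model_state, candidate_ids):
        losses = model_state["per_sample_loss"]
        return sorted(candidate_ids, key=lambda i: losses[i], reverse=True)[: self.k]
```

The benefit of such a contract is that the trainer only ever talks to the abstract interface, so swapping one algorithm for another never touches the training loop.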

This architecture is a lightweight replacement rather than a wholesale rebuild. DataFlex does not wrap a complex orchestration system around LLaMA-Factory; it focuses on replacing the training layer, with only minimal extensions to modules such as data loading where necessary.

For users, this is close to a plug-and-play upgrade: existing models, datasets, and training configurations can all be kept, and adding a few DataFlex-specific settings is enough to switch to the data-centric dynamic training mode.

In addition, DataFlex uniformly encapsulates the intermediate model signals that data-centric methods typically rely on, such as embedding extraction, model inference, and gradient computation. The real difficulty in implementing many data selection and weighting methods is not conceptual complexity, but the high cost and heavy engineering coupling of obtaining these intermediate signals. By abstracting these shared capabilities, DataFlex lowers the bar for implementation and extension, and lays the groundwork for subsequent large-scale training.

Core functions

1. Three core trainers

Corresponding to the three typical optimization directions in current data-centric training, DataFlex ships three core trainers:

  • Dynamic Select Trainer: dynamically screens for more valuable training samples during training, so that low-value or redundant samples consume less of the training budget, improving training efficiency.
  • Dynamic Mix Trainer: for multi-source, multi-domain training data, dynamically adjusts the sampling ratio of different data sources during training, letting the model allocate its training attention according to its current learning state.
  • Dynamic Weight Trainer: assigns different training weights to different samples, so the model learns more effectively from the most critical, difficult, or representative samples, improving performance and generalization.
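To make the Dynamic Weight Trainer idea concrete, here is a minimal sketch in which per-sample weights are derived from current losses via a softmax, so that harder samples contribute more to the objective. The softmax scheme and its temperature are illustrative assumptions, not DataFlex's actual Loss Reweighting rule.

```python
import math

def loss_based_weights(losses, temperature=1.0):
    """Softmax over per-sample losses -> weights that sum to 1."""
    scaled = [l / temperature for l in losses]
    m = max(scaled)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_loss(losses, weights):
    """The weighted objective the optimizer would actually minimize."""
    return sum(w * l for w, l in zip(weights, losses))
```

Because the weights correlate positively with the losses, the weighted objective emphasizes hard samples more than a plain average would; the temperature controls how aggressive that emphasis is.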

2. Algorithm integration and scalability

Across the three trainer types, DataFlex integrates representative methods such as LESS, DoReMi, ODM, and Loss Reweighting. All of them are implemented as plug-and-play components behind a unified interface, enabling fair comparison under controlled conditions.
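Of the integrated methods, DoReMi is a good example of what such a component has to compute. The sketch below follows the published DoReMi recipe at a high level, an exponentiated-gradient update on per-domain excess loss; all constants and names are chosen for illustration and are not taken from DataFlex's code.

```python
import math

def doremi_update(weights, proxy_loss, ref_loss, lr=1.0, smoothing=1e-3):
    """One DoReMi-style update of the domain mixture.

    weights:    current mixture, dict domain -> probability
    proxy_loss: current proxy-model loss per domain
    ref_loss:   fixed reference-model loss per domain
    """
    # Upweight domains where the proxy model lags the reference most.
    excess = {d: max(proxy_loss[d] - ref_loss[d], 0.0) for d in weights}
    unnorm = {d: weights[d] * math.exp(lr * excess[d]) for d in weights}
    z = sum(unnorm.values())
    u = 1.0 / len(weights)
    # Renormalize and smooth toward uniform so no domain collapses to zero.
    return {d: (1 - smoothing) * unnorm[d] / z + smoothing * u for d in weights}
```

Called once per decision step, this shifts sampling probability toward domains the model is currently worst at relative to the reference, which is exactly the kind of "data decision" the Dynamic Mix Trainer is responsible for.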

Many representative dynamic-training methods from the research literature either lack an official repository or ship implementations that are hard to reproduce. Through systematic re-implementation, DataFlex brings these "disconnected" or "semi-stagnant" algorithms back to industrial-grade usability.

The three trainer types divide the work differently but follow the same data-model interaction logic: first observe the current model state, then make new data decisions, then feed those decisions back into subsequent training.
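That observe-decide-feed-back pattern can be shown as a runnable schematic. Every class below is a toy stand-in for illustration; none of the names come from DataFlex itself.

```python
class ToyModel:
    """Stand-in model that just records its training steps."""
    def __init__(self):
        self.steps_seen = []
    def observe(self):
        # Expose whatever state a strategy might inspect (loss, step, ...).
        return {"step": len(self.steps_seen)}
    def train_step(self, batch):
        self.steps_seen.append(batch)

class ToyPool:
    """Stand-in data pool that records the decisions applied to it."""
    def __init__(self, items):
        self.items, self.cursor, self.plans = list(items), 0, []
    def apply(self, plan):
        self.plans.append(plan)            # e.g. a new mixture or sample subset
    def next_batch(self):
        batch = self.items[self.cursor % len(self.items)]
        self.cursor += 1
        return batch

def dynamic_training_loop(model, pool, decide, steps, decide_every=2):
    for step in range(steps):
        if step % decide_every == 0:
            state = model.observe()        # 1. observe the current model state
            plan = decide(state, pool)     # 2. make a new data decision
            pool.apply(plan)               # 3. feed it back into future batches
        model.train_step(pool.next_batch())
```

Any selector, mixer, or weighter then reduces to a particular `decide` strategy plugged into the same loop.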

DataFlex abstracts this common interaction pattern into a unified interface, letting different algorithms share the training pipeline, base capabilities, and extension mechanism. DataFlex's configuration files keep LLaMA-Factory's YAML-based format for specifying the model, dataset, and training hyperparameters.

The only addition is a short dataflex configuration section that tells the framework which data-centric strategy to use and how to schedule it.
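Concretely, a run configuration might look like the sketch below. The upper fields are standard LLaMA-Factory YAML; on the DataFlex side, only `train_type` is named in this article, so the exact placement and any further keys should be taken from the official documentation rather than from this sketch.

```yaml
# Standard LLaMA-Factory fields (unchanged)
model_name_or_path: meta-llama/Llama-3.2-3B
stage: sft
do_train: true
dataset: my_dataset
template: llama3
output_dir: saves/dataflex-demo

# DataFlex addition: `train_type` is the setting mentioned in this
# article (`static` falls back to ordinary training); any selector or
# mixer options beyond it follow the schema in the official docs.
train_type: select
```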

Usage method

DataFlex is fully compatible with LLaMA-Factory's configuration and usage:

  • Configuration compatibility: add DataFlex parameters on top of an existing LLaMA-Factory configuration;
  • Consistent commands: use dataflex-cli in place of llamafactory-cli;
  • Feature parity: all original LLaMA-Factory features remain supported;
  • Seamless switching: fall back to the original static training mode by setting train_type: static.

For environment setup, parameter descriptions, and how to plug in custom components, see the official documentation. Two video tutorials additionally walk through dynamic data mixing and dynamic data selection respectively, so newcomers can get up to speed quickly.

  • Official documentation: https://opendcai.github.io/DataFlex-Doc/
  • GitHub repository: https://github.com/OpenDCAI/DataFlex
  • Video tutorials:

- Automatic data selection and dynamic training: https://b23.tv/BV1pHrKBoE6s

- Automatic optimization of data ratio: https://b23.tv/LYYx1hG

Experimental results

To verify DataFlex's effectiveness, the team ran systematic experiments along three axes: sample selection, data mixing, and system efficiency, covering 7 data selection methods, 2 data mixing methods, and 1 data re-weighting method. Overall, the results show that DataFlex not only reproduces different data-centric methods in a unified way, but also delivers consistent gains in model quality and training efficiency.

3.1 Data selection and sample weighting: dynamic methods generally beat static training

Experiments on an OpenHermes-2.5 subset show that, on both Mistral-7B and Llama-3.2-3B, most dynamic data-centric methods outperform the static full-data training baseline. This suggests that when model capacity is limited, a dynamic selection strategy that tracks the model's state in real time is crucial for reaching the performance ceiling.

3.2 Data mixing: dynamic ratios beat the default ratio

Under the 6B and 30B settings of SlimPajama, both the DoReMi and ODM data mixing algorithms show obvious advantages. At the 6B token scale, the dynamic data mixing method has shown obvious advantages: ODM has a higher accuracy rate in the general ability evaluation than the default static ratio, while DoReMi has achieved better results in the overall perplexity, indicating that dynamically adjusting the ratio of different data domains can indeed bring better training benefits.