Training time slashed by 80%: The University of Hong Kong and Kuaishou have jointly created an AI "alchemist" that specifically selects "nutritious" data, matching the results of a 50% random sample with just 20% of the data.
Imagine asking a master chef to cook with moldy ingredients and expired seasonings: even with the highest-level cooking skills, they couldn't create a delicious dish. The same principle applies to AI training.
1. Data is like ingredients, and quality determines the finished product
Today's AI image-generation models, such as Stable Diffusion and FLUX, need to scrape millions of images from the internet to learn. However, the quality of these images varies widely: some are blurry, some contain duplicated content, and some are just advertising backgrounds. AI trained on such "ingredients" naturally won't perform well.
A research project led by Ding Kaixin from the University of Hong Kong, in collaboration with Zhou Yang from South China University of Technology and the Kling Team from Kuaishou Technology, has developed an AI system called "Alchemist". It's like a picky chef, capable of precisely selecting the most valuable half from a vast amount of image data.
What's even more surprising is:
- The model trained with this half of the carefully selected data actually performs better than the one trained with all the data.
- Training is 5 times faster.
- Using only 20% of the selected data can achieve the same effect as 50% of the randomly selected data.
2. Teach AI to "self-evaluate"
2.1 Limitations of traditional methods
Traditional data-screening methods are like sifting rice through a sieve, filtering on a single criterion:
- Only considering image clarity.
- Only looking at how well the image matches its text.
- Only evaluating aesthetic scores.
The problem with these methods is that they don't know which data truly helps AI learn.
2.2 The wisdom of Alchemist
Alchemist is more like an experienced food judge, capable of considering multiple dimensions simultaneously:
- Not only looking at the appearance of the "dish".
- Also tasting the texture.
- Even considering the nutritional balance.
Core idea: Teach AI to observe its own learning process.
Alchemist trains a dedicated scorer model. This scorer is like a senior art teacher, able to judge how much each image contributes to the overall learning process.
Judgment criteria:
✅ If an image enables the AI model to learn new knowledge and improve rapidly → Good data
❌ If an image makes the model learn for a long time without much progress → Useless data
This is like observing students' expressions and progress speed when doing exercises to judge whether the exercises are suitable for them.
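To make this concrete, here is a minimal, brute-force sketch of that judgment in PyTorch (our illustration, not the paper's code). It assumes a `loss_fn(model, batch)` that returns a scalar loss; Alchemist replaces this expensive per-sample probe with a learned scorer (see Section 7), but the signal it approximates is the same:

```python
import copy
import torch

def sample_value_score(model, loss_fn, sample, val_batch, lr=1e-4):
    """Score a candidate sample by how much one gradient step on it
    improves held-out validation loss (brute-force probe for illustration;
    Alchemist learns a scorer to predict this signal instead)."""
    # Validation loss before the probe update
    with torch.no_grad():
        val_before = loss_fn(model, val_batch).item()

    # One SGD step on the candidate sample, applied to a throwaway copy
    probe = copy.deepcopy(model)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    opt.zero_grad()
    loss_fn(probe, sample).backward()
    opt.step()

    # Validation loss after the probe update
    with torch.no_grad():
        val_after = loss_fn(probe, val_batch).item()

    # Positive score: the sample taught the model something that generalizes
    return val_before - val_after
```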
3. The simplest option isn't always the best
3.1 The unexpected truth
The research team discovered a counterintuitive phenomenon.
Seemingly "simple" images, such as product shots on a pure-white background:
- Let the AI converge quickly.
- But do little to improve the model's capabilities.
- It's like always doing the easiest addition problems: you won't make mistakes, but your math skills won't improve.
On the contrary, images with rich content and a bit of challenge are the real "nutrients".
3.2 Scientific verification
The research team verified this by tracking the training dynamics of images in different score ranges; the findings motivated the sampling strategy described next.
4. Technical highlight: Shifted Gaussian sampling strategy
Based on the above findings, the team proposed the "Shift-Gsample" (shifted Gaussian sampling) strategy.
4.1 Traditional methods vs. Alchemist
Traditional Top-K method:
- Simply selects the data with the highest scores.
- ❌ But such data is often too simple and lacking in nutrients.
Alchemist strategy (see the sketch after the analogy below):
- ✅ Avoids the "simple" data with overly high scores.
- ✅ Focuses on selecting "nutritious" data with above-average scores.
- ✅ Retains a small number of easy and hard samples to maintain data diversity.
This is like formulating a fitness plan:
- ❌ Not choosing overly easy exercises (no training effect).
- ❌ Not choosing overly difficult exercises (prone to injury).
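Here is an illustrative sketch of such a selection rule in Python (the paper's exact formula is not reproduced here; the `shift` and `sigma` values are placeholders): weight each sample by a Gaussian centered slightly above the mean score, so the "too easy" top scorers and the noisiest bottom scorers are both down-weighted without being fully excluded.

```python
import numpy as np

def shifted_gaussian_select(scores, keep_ratio=0.5, shift=0.5, sigma=1.0, seed=0):
    """Select a subset by sampling under a Gaussian weight centered
    slightly ABOVE the mean score (illustrative sketch, not the
    paper's exact Shift-Gsample formula)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / (scores.std() + 1e-8)  # standardized scores
    # Peak at z = shift (above average), not at the maximum: very high
    # scorers ("too easy") and very low scorers both get small weights
    weights = np.exp(-0.5 * ((z - shift) / sigma) ** 2)
    probs = weights / weights.sum()
    k = int(len(scores) * keep_ratio)
    # Sampling (vs. hard thresholding) keeps a few easy/hard items for diversity
    return rng.choice(len(scores), size=k, replace=False, p=probs)
```

For example, `keep_idx = shifted_gaussian_select(scores, keep_ratio=0.2)` would keep the most "nutritious" 20% of a dataset.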
4.2 Multi-granularity perception mechanism
To better evaluate data quality, Alchemist also designed a "multi-granularity perception" mechanism:
- Individual level: Evaluating the quality of a single image.
- Group level: Considering the combination of an entire batch of data.
It's like a nutritionist not only paying attention to the nutritional value of individual ingredients but also considering the nutritional balance of the whole meal.
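A minimal sketch of how the two levels could be combined, assuming per-image feature embeddings are available; the redundancy penalty below is our assumption for illustration, not the paper's formula:

```python
import numpy as np

def batch_score(ind_scores, embeddings, alpha=0.7):
    """Combine individual quality with group-level diversity (our
    illustrative assumption; the paper's exact mechanism may differ).
    ind_scores: per-image quality scores, shape (n,)
    embeddings: per-image feature vectors, shape (n, d)"""
    ind_scores = np.asarray(ind_scores, dtype=float)
    # Individual level: average quality of the images in the batch
    quality = ind_scores.mean()
    # Group level: penalize redundancy via mean pairwise cosine similarity
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T
    n = len(ind_scores)
    redundancy = (sim.sum() - n) / (n * (n - 1))  # mean off-diagonal entry
    return alpha * quality + (1 - alpha) * (1.0 - redundancy)
```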
5. Experimental results: Let the data speak
5.1 Comparison of main results
On the LAION-30M dataset, the key findings were:
- The model trained on 50% selected data outperforms the one trained on 100% of the full data.
- Using only 20% of the selected data achieves the same effect as 50% of randomly selected data.
- Training is 5 times faster.
5.2 Cross-model generality
Alchemist remains effective across models of different scales and architectures.
5.3 Cross-dataset adaptability
Performance on different types of datasets:
HPDv3-2M dataset (mixed real and synthetic data; lower FID is better):
- 20% retention rate: FID 35.55 → 32.27 ✅
- 50% retention rate: FID 20.21 → 18.15 ✅
Flux-reason-6M dataset (purely synthetic reasoning data):
- 20% retention rate: FID 23.66 → 22.78 ✅
- 50% retention rate: FID 19.35 → 18.59 ✅
6. Visual analysis: Seeing is believing
6.1 Data distribution characteristics
The research team conducted a visual analysis of the screened data:
0-20% high-score area (simple but nutritionally insufficient):
- White or solid-color backgrounds.
- Simple product images.
- Visually clean but limited in information.
30-80% medium-score area (the most valuable "golden middle"):
- Rich content.
- Clear subjects.
- Distinct actions.
- The area Alchemist focuses on selecting. ⭐
80-100% low-score area (too chaotic):
- Noisy images.
- Cluttered scenes with many objects.
- Visually dense.
- Unclear content.
6.2 Comparison of training dynamics
Comparison of training stability:
The data selected by Alchemist shows:
✅ Stable and continuous performance improvement.
✅ Faster convergence speed.
✅ Less training fluctuation.
The randomly selected data shows:
❌ Large fluctuations in the early stage of training.
❌ Slow performance improvement.
❌ More epochs needed to converge.
7. Technical depth: Meta-gradient optimization framework
7.1 Bilevel optimization problem
The core of Alchemist is a bilevel (two-level) optimization framework.
Outer-level optimization: learning how to score.
- Goal: find the optimal scoring strategy.
- Judgment criterion: performance on the validation set.
Inner-level optimization: training the proxy model.
- Goal: train the model with weighted data.
- The weights are determined by the scorer.
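For readers who want the math: using notation we introduce here for exposition (not taken verbatim from the paper), let $\phi$ be the scorer's parameters, $w_\phi(x_i)$ the weight it assigns to training sample $x_i$, and $\theta$ the proxy model's parameters. The bilevel problem can then be written as:

$$
\min_{\phi}\;\mathcal{L}_{\text{val}}\big(\theta^{*}(\phi)\big)
\qquad\text{s.t.}\qquad
\theta^{*}(\phi)=\arg\min_{\theta}\sum_{i} w_{\phi}(x_i)\,\mathcal{L}_{\text{train}}(x_i;\theta)
$$

The inner problem trains the proxy model on score-weighted data; the outer problem tunes the scorer so that the resulting proxy model performs best on the validation set.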
7.2 Meta-gradient update mechanism
The system updates the scores by observing how the proxy model's validation performance changes:
- Score update ∝ each sample's effect on the proxy model's validation loss.
Core idea:
If a sample improves the validation performance → Increase its score.
If a sample only reduces the training loss but doesn't improve the validation performance → Decrease its score.
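A simplified one-step-unroll sketch of this update in PyTorch (requires `torch.func`, PyTorch ≥ 2.0). It assumes `proxy(batch)` returns per-sample training losses and `scorer(batch)` returns per-sample score logits; the paper's actual procedure may unroll more steps and differ in detail:

```python
import torch
from torch.func import functional_call

def meta_step(proxy, scorer, scorer_opt, train_batch, val_batch, inner_lr=1e-3):
    """One simplified meta-gradient step (illustrative one-step unroll).
    Assumes proxy(batch) -> per-sample training losses and
    scorer(batch) -> per-sample score logits."""
    params = dict(proxy.named_parameters())

    # Inner step: proxy update under scorer-weighted training loss,
    # keeping the graph so the meta-gradient can flow back to the scorer
    weights = torch.softmax(scorer(train_batch), dim=0)
    inner_loss = (weights * functional_call(proxy, params, (train_batch,))).sum()
    grads = torch.autograd.grad(inner_loss, tuple(params.values()), create_graph=True)
    updated = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}

    # Outer step: the updated proxy's validation loss drives the scorer update
    val_loss = functional_call(proxy, updated, (val_batch,)).mean()
    scorer_opt.zero_grad()
    val_loss.backward()  # meta-gradient flows through the inner step into the scorer
    scorer_opt.step()
    return val_loss.item()
```

Because the inner update keeps the autograd graph, samples whose extra weight reduces the validation loss receive a meta-gradient that pushes their scores up, which is exactly the rule stated above.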
8. Q&A session
Q1: How does Alchemist determine which image data is more valuable?
A: Alchemist determines data value by observing the "reaction" of the AI model during the learning process:
✅ Good data: Can enable the model to learn new knowledge and improve rapidly.
❌ Bad data: Makes the model learn for a long time without progress.
This is like observing students' expressions and progress speed when doing exercises to judge whether the exercises are suitable.
Technical details:
- Monitoring changes in training loss.
- Tracking gradient dynamics.
- Comparing performance improvement on the validation set.
Q2: Why does the model trained with half of the data perform better than the one trained with all the data?
A: Because not all data is valuable. The key lies in quality rather than quantity.
Analogy:
- Showing a child learning to draw 5,000 high-quality works is more effective than showing them 10,000 messy scribbles.
Scientific principle:
1. Redundant data consumes resources but doesn't improve performance: Such as repeated simple samples and blurry noisy images.
2. Nutritious data promotes real learning: Such as samples with rich content and medium difficulty, and diverse scenarios and objects.
3. Avoiding overfitting: If only simple data is used, the model will "memorize by rote"; appropriately difficult data is also needed to develop generalization ability.
Q3: Can Alchemist's data - screening method be used on other AI models?
A: Yes! Research shows that this method has good generality and cross-model applicability.