
Google DeepMind's new method: "panning for gold in the muck pit". Even toxic data from the darknet can be used to train a friendly model.

New Intelligence Yuan (新智元) · 2025-09-17 11:07
Research carried out a year ago by the Google DeepMind research team was only made public last night. It proposes a new method called GDR, which turns the traditional approach of removing contaminated data before training on its head: data containing malicious content is converted into a usable resource. Models trained on the processed datasets even outperform those trained on data from which the contaminated samples were merely removed, so to speak "rising from the mire unstained" and "choosing the good and following it".

Data is the fuel of AI. Just as humans get hungry if they skip a meal, models can only perform at their best with a sufficient supply of data.

The powerful models we use today are trained with a vast amount of data from the internet.

Due to hardware and cost limitations, researchers have gradually realized that simply amassing more data is no longer sustainable. The key to future performance lies in how well we can utilize the data.

However, there are three tricky problems that have been difficult to solve:

First, the stock of usable data on the public internet is gradually drying up and is expected to be exhausted within a decade.

Second, although a large amount of user-generated content exists, much of it contains private information, offensive language, or copyrighted material and cannot be used directly.

Third, while synthetic data generation is a potential solution, it often suffers from issues such as lack of diversity and a large gap from real data.

To address these problems, Google DeepMind's research team published a research paper yesterday: "Generative Data Refinement: Just Ask for Better Data".

Paper link: https://arxiv.org/pdf/2509.08653

The first author of this paper is Minqi Jiang, a Chinese researcher who recently moved from DeepMind to the controversial Meta Superintelligence Labs.

Back to the paper. This paper proposes a new method: Generative Data Refinement (GDR).

Its core idea is not to generate brand-new data from scratch, but to use large models to "purify" and rewrite the original data, retaining the useful information while removing the private or harmful parts.

In other words, GDR is like a "data cleaner" that can clean up dirty data while maintaining its original knowledge value.

The basic idea of GDR

Traditional synthetic data generation relies on repeated sampling by large models, which tends to produce homogeneous outputs and lacks diversity.

GDR, however, takes an approach that turns this thinking on its head:

It uses real-world data (such as code, conversations, and web content) as input, a large model as the generator, and rewrites the data according to preset rules (such as removing private information and reducing toxicity). The final output is a refined dataset that is both safe and retains the original diversity.
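To make the contrast concrete, here is a minimal sketch (not from the paper) of the two generation modes, assuming a generic `generate(prompt)` LLM call: pure synthesis samples repeatedly from one task prompt, while GDR-style refinement rewrites each real sample.

```python
# Minimal sketch contrasting pure synthesis with GDR-style refinement.
# `generate(prompt) -> str` stands in for any LLM call and is assumed here.

def synthesize(generate, task_description: str, n: int) -> list[str]:
    """Pure synthetic data: sample n outputs from the same task prompt.
    Repeated sampling from a single prompt tends toward homogeneous outputs."""
    return [generate(task_description) for _ in range(n)]

def refine(generate, rewrite_instruction: str, real_samples: list[str]) -> list[str]:
    """GDR-style refinement: each output is a rewrite of one real sample,
    so the diversity of the real data carries over to the refined set."""
    return [
        generate(f"{rewrite_instruction}\n\nInput:\n{sample}\n\nRewritten:")
        for sample in real_samples
    ]
```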

The paper details the specific workflow of GDR:

Step 1: Input data

This includes original text, code, conversations, or web data.

The data may contain PII, toxic language, or other content that cannot be used for training.

Step 2: Prompt construction

Design a prompt for the large model to tell it what to do:

For anonymization tasks, the prompt requires "identifying and replacing sensitive information with safe placeholders".

For detoxification tasks, the prompt requires "deleting offensive expressions while retaining factual content".

The prompt can be zero-shot, few-shot examples can be added, or the model's ability can even be enhanced through fine-tuning.
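As an illustration, zero-shot prompts for the two tasks might look like the following; the wording is hypothetical, not the paper's exact prompts.

```python
# Hypothetical zero-shot prompt templates; {sample} is filled with one real data sample.

ANONYMIZE_PROMPT = """You are a data-cleaning assistant.
Rewrite the following text so that all useful content is kept, but replace every piece
of personally identifiable information (names, email addresses, passwords, API tokens,
private URLs) with a neutral placeholder such as <EMAIL> or <API_KEY>.

Text:
{sample}

Rewritten text:"""

DETOXIFY_PROMPT = """You are a data-cleaning assistant.
Rewrite the following conversation so that the factual content and conversational flow
are preserved, but remove insults, slurs, and other offensive language.

Conversation:
{sample}

Rewritten conversation:"""
```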

Step 3: Generation and rewriting

The model generates a new version for each input sample based on the prompt. The goal of the output is to be safe, reasonable, and retain context information.

Step 4: Verification and screening

Verify the generated results (for example, by running a separate PII detector or a toxicity classifier) and filter out any that fail, to ensure the safety of the dataset.

The final step: Obtain a refined dataset D′, which can be used repeatedly as training data.

The diversity of the data is preserved and is even higher than that of directly synthesized data.
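Putting the four steps together, a minimal sketch of the refinement loop might look like this, assuming the placeholder helpers `generate(prompt)` and `detect_pii(text)` and a prompt template like the one sketched above.

```python
# Minimal end-to-end sketch of the GDR workflow described above.
# `generate`, `detect_pii`, and the prompt template are placeholders, not the paper's code.

def gdr_refine(dataset: list[str], generate, detect_pii, prompt_template: str) -> list[str]:
    refined = []
    for sample in dataset:
        # Steps 2-3: wrap the real sample in the prompt and let the model rewrite it.
        rewritten = generate(prompt_template.format(sample=sample))
        # Step 4: verify with an independent checker; drop the rewrite if it still fails.
        if not detect_pii(rewritten):
            refined.append(rewritten)
    return refined  # D': safe, reusable training data anchored to the original samples
```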

This method has three major advantages:

It inherits the diversity of real data, because each refined sample is "anchored" to a real one.

It avoids mode collapse, unlike pure synthetic data, which tends to converge to a few stereotyped expressions.

It adapts to different tasks: simply change the prompt or fine-tune the model to target scenarios such as anonymization or detoxification.

Of course, GDR comes at the cost of additional computation; in the worst case, roughly an extra one-third of a full training run.

However, once clean data is obtained, it can be reused indefinitely, which is very cost-effective in the long run.

To verify the effectiveness of GDR, the paper conducted experiments from three different perspectives.

Experiment 1: Code anonymization

Code repositories often hide sensitive information such as email addresses, passwords, API tokens, and private URLs.

If this information enters the training data, it not only poses a risk of leakage but may also cause the model to "recite" private information in its output.

The traditional approach is the DIRS service, which discards the entire file as soon as it detects possible PII. However, this "better safe than sorry" approach may waste millions of lines of valuable code.

The researchers compared GDR with DIRS on 1.2 million lines of code from 479 open-source libraries:

Line-level annotation results show that GDR finds PII more accurately and replaces it with placeholders.

DIRS has a high false-positive rate, and a large amount of harmless data is mistakenly discarded.

Although GDR has a small number of false alarms (such as replacing safe variable names), most of these can be detected and fixed through static analysis.

The experimental results show that GDR far outperforms traditional methods such as the DIRS service in preserving data usability and is a viable solution for large-scale code anonymization.
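For intuition, here is a hypothetical before-and-after (not taken from the paper's dataset): a DIRS-style filter would discard this entire file because of the secrets it contains, while GDR keeps the logic and only swaps in placeholders.

```python
# Before refinement: hard-coded secrets that make the whole file unusable for training.
SMTP_USER = "alice.smith@example.com"
SMTP_PASSWORD = "hunter2"
API_ENDPOINT = "https://internal.corp.example/api/v2"

# After refinement: the same configuration logic with PII replaced by placeholders.
SMTP_USER = "<EMAIL>"
SMTP_PASSWORD = "<PASSWORD>"
API_ENDPOINT = "<PRIVATE_URL>"
```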

Experiment 2: Conversation detoxification

Harmful content such as hate speech, gender discrimination, and vulgarity is ubiquitous on the internet.

Training directly on such data may cause the model to learn wrong values and even output dangerous content.

The research team took the notorious 4chan /pol/ board (an internet community full of malicious content, somewhat similar to the Sun Xiaochuan fan forum in China), extracted 100,000 conversation pairs (pol100k), and then used Gemini 1.5 Pro with zero-shot prompting for GDR detoxification.

Perspective API toxicity score: pol100k scores 0.19, which drops to 0.13 after GDR refinement, even lower than SyntheticChat (0.14), which was generated from scratch by the same model.
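Such scores come from the Perspective API, which returns a toxicity probability between 0 and 1 for each text; below is a minimal scoring sketch with a placeholder API key, not the paper's exact evaluation setup.

```python
# Minimal sketch of scoring texts with the Perspective API (toxicity in [0, 1]).
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity(text: str, api_key: str) -> float:
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def mean_toxicity(texts: list[str], api_key: str) -> float:
    # Average toxicity over a dataset, as reported for pol100k and its refined version.
    return sum(toxicity(t, api_key) for t in texts) / len(texts)
```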

UMAP visualization shows that the distribution of the refined data is still close to that of real data, while pure synthetic data shows obvious mode collapse.

After fine-tuning a model on the detoxified data, the researchers found that it still retains world knowledge and that its generation style is closer to human writing; a detection system fails to distinguish its output from human conversation 31% of the time.

The experimental results show that GDR can clean up harmful data while retaining the knowledge it contains, "rising above the mire without getting tainted" and "choosing the good and following it".

Experiment 3: Diversity comparison

The researchers used ROUGE-2 and embedding cosine distance metrics to compare pol100k, its refined version, and SyntheticChat.

The data refined by GDR is more diverse than SyntheticChat and even slightly more diverse than the original data.

The experimental results show that GDR not only plays a role in safety filtering but also enhances data diversity, achieving multiple goals at once.
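As a reference point, the two diversity measures can be computed roughly as follows: lower average pairwise ROUGE-2 overlap and higher average pairwise embedding cosine distance both indicate a more diverse dataset. The embedding model named here is an assumption, not necessarily the one used in the paper.

```python
# Sketch of the two diversity metrics; the embedding model is an arbitrary choice.
from itertools import combinations

import numpy as np
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances

def mean_pairwise_rouge2(texts: list[str]) -> float:
    # Lower n-gram overlap between pairs of samples = higher lexical diversity.
    scorer = rouge_scorer.RougeScorer(["rouge2"])
    pairs = list(combinations(texts, 2))
    return sum(scorer.score(a, b)["rouge2"].fmeasure for a, b in pairs) / len(pairs)

def mean_pairwise_cosine_distance(texts: list[str]) -> float:
    # Higher average distance between sample embeddings = higher semantic diversity.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(texts)
    dists = cosine_distances(embeddings)
    return dists[np.triu_indices(len(texts), k=1)].mean()
```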

GDR: The "alchemy" to turn waste into treasure

GDR is like a "water purifier" in the data world, filtering out impurities while keeping the nutrients intact.

It turns dirty data into "usable fuel", providing a continuous supply of clean energy for the development of large models.

It is the "golden touch" that can turn waste into treasure in the AI era.


Facing the dual challenges of data depletion and privacy risks, GDR provides a way out.

The continuous evolution of future large models depends on the ingenuity and hard work of humans.

References:

https://arxiv.org/abs/2509.08653 

https://x.com/MinqiJiang/status/1967685550422598067 

https://www.linkedin.com/in/minqi-jiang-585a6536 

This article is from the WeChat official account "New Intelligence Yuan". Author: New Intelligence Yuan. Republished by 36Kr with permission.