
Synthetic Data ≠ Generative Models: One Article to Understand the New Paradigm of Synthetic Data

新智元2026-04-16 14:43
A new survey proposes a classification framework for synthetic data that moves beyond generative models, also covering inversion, simulation, and augmentation methods. It organizes applications into four scenarios: data-centric AI, model-centric AI, trustworthy AI, and embodied AI.

With the continuous expansion of foundation models, the limitations of real data in terms of cost, privacy, quality, and controllability are gradually becoming the decisive bottleneck for the further development of AI.

In particular, in high-stakes domains such as medicine, real data can be difficult to obtain at all, and the paradigm of relying on a "natural data supply" breaks down.

Against this backdrop, synthetic data is transforming from a "supplement to real data" into a "central mechanism for actively creating high-quality training and evaluation data".

Based on a systematic analysis of over 300 representative publications, researchers from Nanyang Technological University, Tsinghua University, Sichuan University, and Sun Yat-sen University have proposed a unified How/Why/Where framework, which redefines the methodological boundaries of synthetic data and charts a more comprehensive development path at the application level.

Link to the study: https://www.techrxiv.org/users/1016218/articles/1378802-synthetic-data-beyond-generative-models-a-comprehensive-survey-of-how-why-and-where

Study resources library: https://github.com/Egg-Hu/Awesome-Synthetic-Data-Generation

First: How should synthetic data methods be classified?

Many works implicitly assume that "synthetic data = generative models". This review redefines the methodological boundary of data synthesis and rejects that one-sided view: synthetic data is not limited to data produced by generative models. Methods such as inversion, simulation, and augmentation also fall within its scope.
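As one minimal illustration of non-generative synthesis (our own sketch, not taken from the survey), mixup-style augmentation creates new training samples by interpolating between real ones, with no generative model involved:

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.4):
    """Synthesize one sample by interpolating two real (features, label) pairs.

    The mixing coefficient is drawn from Beta(alpha, alpha), as in the
    original mixup method; labels must be one-hot or soft-label vectors.
    """
    lam = random.betavariate(alpha, alpha)
    x_new = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y_new = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x_new, y_new

# Two real feature vectors with one-hot labels.
x1, y1 = [1.0, 0.0], [1.0, 0.0]
x2, y2 = [0.0, 1.0], [0.0, 1.0]
x_syn, y_syn = mixup(x1, y1, x2, y2)
```

The synthetic sample lies on the line segment between the two real samples, and its soft label still sums to one, so it can be fed to a standard training loop unchanged.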

The following table shows the entire classification framework:

Second: In which core scenarios are synthetic data applied?

In contrast to previous surveys that classify by specific tasks or fields, this article organizes the applications of synthetic data at a higher level, as a progressively layered capability path.

In this framework, the most fundamental level is Data-centric AI. Its core goal is to address real-data scarcity, high acquisition costs, and privacy constraints: synthetic data expands the training set and improves data quality, providing a stable data basis for model training.

In addition, as data accessibility improves, the focus of research gradually shifts to Model-centric AI. At this stage, synthetic data is used not only to supplement data but also for capability injection, such as improving a model's reasoning, coding, and alignment capabilities, and for building controllable evaluation benchmarks.

Furthermore, as model capabilities grow and reliability requirements rise, Trustworthy AI has emerged. In this phase, synthetic data is widely used for privacy protection, safety, improving fairness, and analyzing model interpretability.

Finally, the application of synthetic data moves from the digital space to the real world, which corresponds to Embodied AI. Its goal is to support perception, interaction, and generalization so that agents can make decisions and act in complex physical environments. The following table shows the overall structure (further details can be found in the original study):
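A common pattern behind simulation-based embodied AI is to roll out a policy in a simulator and harvest the transitions as synthetic training data. A minimal sketch (our own toy grid world, not from the survey) looks like this:

```python
import random

def generate_trajectories(num_episodes, steps, size=5, seed=0):
    """Roll out a random policy in a toy grid world and collect
    (state, action, next_state) transitions as synthetic training data."""
    rng = random.Random(seed)
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    transitions = []
    for _ in range(num_episodes):
        # Random start position for each episode.
        x, y = rng.randrange(size), rng.randrange(size)
        for _ in range(steps):
            action = rng.choice(sorted(moves))
            dx, dy = moves[action]
            # Clamp to the grid: walking into a wall leaves the agent in place.
            nx = min(max(x + dx, 0), size - 1)
            ny = min(max(y + dy, 0), size - 1)
            transitions.append(((x, y), action, (nx, ny)))
            x, y = nx, ny
    return transitions

data = generate_trajectories(num_episodes=10, steps=20)
```

Real embodied pipelines replace the grid world with a physics simulator and the random policy with scripted or learned behaviors, but the data-generation loop has the same shape.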

Furthermore, the article divides the above four application areas into over 30 specific machine-learning tasks to build a systematic mapping from macroscopic classification to concrete problems.

As shown in the following figure, each application area is divided into several typical problems. For example, Data-centric AI includes tasks such as zero/few-shot learning, federated learning, data-free learning, and dataset distillation. In Model-centric AI, the tasks cover improving general model capabilities and enhancing specific capabilities such as reasoning, coding, and instruction alignment, as well as model evaluation based on synthetic data. In Trustworthy AI, the focus is mainly on tasks such as privacy protection, model attacks, safety, long-tail learning, and interpretability. In Embodied AI, the tasks extend to perception, interaction, and generalization across different real-world scenarios.

Finally: What challenges and opportunities does synthetic data present?

Although considerable progress has been made in both methods and applications, synthetic data is still developing rapidly, and a number of key challenges remain to be solved.

  • As models increasingly depend on self-generated data for training, a core risk emerges: model collapse. If a model is repeatedly trained on its own generated data, the distribution can gradually contract and data diversity can be lost, degrading model performance and generalization.
  • In practical applications, balancing data utility against privacy remains a long-standing problem, known as the utility–privacy tradeoff. Overly strict privacy constraints reduce the usefulness of the data, while overly high fidelity introduces potential privacy risks.
  • When synthetic data is used for model evaluation, it can introduce new sources of bias. Generation–evaluation bias, for example, means a model scores better on test data produced by a similar generation mechanism, which distorts evaluation results and obscures the model's true capabilities.
  • At the methodological level, many frontier directions remain to be explored. For example, active data synthesis emphasizes dynamically generating the most valuable data according to the model's current needs, improving data efficiency. Multi-modal data synthesis, in turn, focuses on generating high-quality data with consistent semantics and cross-modal alignment, which is particularly important for multi-modal models.
  • Finally, a fundamental question remains unresolved: how can the quality of synthetic data be systematically evaluated? This covers not only utility and diversity but also privacy and safety. A unified, standardized evaluation system is still lacking.
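The model-collapse risk in the first bullet can be demonstrated with a toy experiment (our own illustration, not from the survey): a one-dimensional Gaussian "model" that is repeatedly refit on its own samples loses variance generation by generation, the simplest form of distributional contraction.

```python
import random
import statistics

def refit_on_own_samples(mu, sigma, n, rng):
    """Sample n points from the current model, then refit it on them."""
    data = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(data), statistics.pstdev(data)

# Toy "model": a 1-D Gaussian, retrained for 100 generations purely on
# its own output. The deliberately small sample size (n=20) makes the
# accumulated estimation error, and hence the collapse, clearly visible.
rng = random.Random(42)
mu, sigma = 0.0, 1.0
history = [sigma]
for _ in range(100):
    mu, sigma = refit_on_own_samples(mu, sigma, 20, rng)
    history.append(sigma)

# history now traces the fitted spread drifting toward zero: diversity
# is lost even though each individual refit looks reasonable in isolation.
```

Mixing fresh real data into each generation, or curating the synthetic samples before reuse, are the standard mitigations discussed in the model-collapse literature.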

The following figure shows the overall framework of this review; further details can be found in the original study.

The most valuable contribution of this review is not merely its summary of existing methods but the shift in perspective it proposes: synthetic data is no longer just an application direction of generative models, but a new piece of infrastructure connecting data, models, evaluation, and interaction with the real world.

If past AI competition hinged on "who has more real data", future competition will likely hinge on "who can generate high-quality data more efficiently, safely, and controllably".

Source: https://www.techrxiv.org/users/1016218/articles/1378802-synthetic-data-beyond-generative-models-a-comprehensive-survey-of-how-why-and-where

This article is from the WeChat account "New Intelligence Yuan", edited by LRST, published by 36Kr with permission.