The super hurricane of physical AI lets "fake" data pull off a real overtake.
In 2026, the boom in AI-generated derivative content has made "create whatever you want" the norm. From making static objects in photos speak to generating New Year greeting videos from a few prompts, and from content and images to whole personas, AI has shown remarkable entertainment potential in the consumer market.
While the public's attention was still fixed on these curious "digital toys," Jensen Huang recently proposed that physical AI will be the next wave of artificial intelligence. This means that the training data AI needs in the future must strictly obey physical laws and come as close as possible to the real world.
From robots folding clothes to autonomous driving, low-altitude economy aircraft, and surgical robots, real-world industries with trillion-scale markets all need physical AI to accelerate them. AI synthetic data is the last piece of the puzzle that lets physical AI empower every industry. The logic of "virtual is real" is reshaping the entire chain of AI training, manufacturing, risk control, and R&D.
This is not just an academic concept confined to the laboratory, but a super hurricane that has already formed and is expected to trigger a new industrial revolution.
01
Synthetic Data: The "Infinite Fuel" for AI
To understand the trillion-dollar value of synthetic data, we first need to understand the "food crisis" faced by the AI industry. For many vertical industries today, obtaining real data is extremely difficult.
The development of autonomous driving technology in the automotive industry has always gone hand in hand with extensive testing. In essence, it is the evolution of the AI system's ability to perceive the real world. In the past, car companies had to field large test fleets that collected road data around the world, day after day, in order to iterate.
Beyond that, what really sets the safety ceiling of autonomous driving is the "long-tail scenarios": events with extremely low probability but extremely serious consequences. Examples include a chain of rear-end collisions ahead, skids caused by bad weather, or a "ghost peek," where a jaywalking pedestrian suddenly steps out from behind an obstruction. To test how intelligent driving reacts in such extreme scenarios in the real world, car companies must spend enormous sums to reproduce rare and dangerous road conditions.
Take the emergency braking test of autonomous driving as an example. To capture real data in a scenario such as "rainy night + water reflection + oncoming high beams + a pedestrian in black crossing the road," car companies not only have to spend huge sums at a closed test track but can also collect only a few dozen valid samples a day, which drives testing and depreciation costs extremely high.
At the 2025 World Intelligent Connected Vehicle Conference, Lei Jun said that Xiaomi's total investment in the first phase of combined assisted driving had reached 5.79 billion yuan and that its intelligent driving team had grown to more than 1,800 people, a scale that is already pushing the limits of what is economical and efficient.
In highly sensitive, closed industries such as healthcare, the dilemma is different: the data-gathering tools of the Internet era simply do not work.
In the early days, training a high-precision cancer recognition AI required large volumes of high-quality electronic medical records and multi-modal patient images. However, feeding patient information into large models carries a risk of privacy leakage. The US AI healthcare company Confidant Health once exposed 5.3TB of private data, including personal information and medical records of mental health patients, because of a misconfigured server.
Faced with significant risks such as patient privacy leaks, hospitals have gradually tightened their control over data.
The global healthcare system generates an enormous amount of data every year. However, because of privacy red lines and institutional barriers, most of it is locked behind hospital data walls, leaving top AI companies with powerful algorithms but, as the saying goes, "unable to cook without rice." Without core clinical and pathological data for training, AI has struggled to deliver value in medicine.
In finance, evaluating customers' personal information, investment data, and loan risk is a lengthy process. Take the risk control AI of a single bank: most of the transactions it sees are "normal local transactions," which makes it hard to form a rapid, big-picture assessment of a customer. Anti-fraud work and the fight against underground criminal networks therefore depend heavily on cross-institutional transaction data.
However, financial regulations and trade secrets prevent banks from sharing real customer information. As a result, a risk control AI model can only work with partial data and struggles to detect financial crime as a whole.
For vertical industries stuck in these binds, synthetic data arrives like timely rain. It is neither randomly generated "meaningless noise" nor fake data crudely spliced together, but a "statistical mirror image" generated by deep learning models after they analyze the underlying distribution of real data.
On the one hand, synthetic data has all the statistical characteristics and business logic of real data. The effect of training models with it is highly consistent with that of real data, and it can even eliminate the noise in the original data. On the other hand, it cuts off the connection with real natural persons at the source, perfectly bypassing strict data privacy regulations and making the "forbidden data" in industries such as healthcare and finance easily accessible.
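As a rough illustration of the "statistical mirror image" idea, the Python sketch below fits a two-variable Gaussian to a handful of records and then samples brand-new records from the fitted distribution. Everything in it is an assumption made for illustration: the toy `real` records, the choice of a simple Gaussian model, and the function names `fit_gaussian_2d` and `sample_synthetic`. Production systems would use far richer deep generative models, and this toy makes no privacy guarantee on its own.

```python
import math
import random
import statistics


def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of (x, y) pairs."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(params, n, seed=0):
    """Draw n new (x, y) pairs from the fitted bivariate Gaussian."""
    rng = random.Random(seed)
    rho = params["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        y = params["my"] + params["sy"] * (rho * z1 + resid * z2)
        out.append((x, y))
    return out


# Hypothetical "real" records, e.g. (monthly income, monthly spending).
real = [(5.0, 3.1), (6.2, 3.9), (4.1, 2.5), (7.5, 4.2),
        (5.8, 3.0), (6.9, 4.6), (4.7, 2.9), (8.1, 5.0)]

params = fit_gaussian_2d(real)
synthetic = sample_synthetic(params, 5000, seed=1)
refit = fit_gaussian_2d(synthetic)
print("real mean/corr:     ", round(params["mx"], 2), round(params["rho"], 2))
print("synthetic mean/corr:", round(refit["mx"], 2), round(refit["rho"], 2))
```

The synthetic records reproduce the mean and correlation of the originals without copying any individual row, which is the property the article describes, shown here in its simplest possible form.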
In a virtual engine, batch-generating specific data costs far less than collecting it physically in the real world. Palmyra X 004, a model from the AI startup Writer, relies almost entirely on high-quality synthetic data for pre-training and fine-tuning. It went on to rank near the top of multiple enterprise-level logic benchmarks, yet its R&D and training costs were only a fraction of those of traditional methods.
It can be said that synthetic data has far exceeded the scope of "data substitution." It gives enterprises the privilege of unlimited trial and error in the digital space. When the AI models of all industries are no longer restricted by real data but have access to an inexhaustible "customized data granary," the evolution logic of the industry will also be rewritten.
02
Hardcore Implementation: "Fake" Data, Real Overtaking
The application of synthetic data is no longer just a proof of concept; it has turned into real commercial value. Enterprises that were first to stockpile data in the "virtual world" are now using that head start to overwhelm traditional models in real competition, in what Chinese tech circles call a "dimensionality reduction attack."
In 2024, Siemens announced a $10.6 billion acquisition of Altair Engineering, a leading industrial simulation software company, as a major move toward building a synthetic data generation engine. The technology is also being put to hard, practical use in four core tracks: autonomous driving, high-end manufacturing, financial risk control, and pharmaceutical R&D.
Not long ago, XPeng Motors released its second-generation VLA large model. Most of the nearly 100 million video clips used to train it were generated through simulation in the virtual world. That volume of data is equivalent to all the extreme scenarios a human driver could encounter in 65,000 consecutive years of driving, and it raised the model's target recognition accuracy in rainy night scenarios to 98.7%.
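To make the idea of generating extreme scenarios in simulation more concrete, here is a minimal Python sketch that enumerates combinations of scenario parameters and oversamples the rare "rainy night + water reflection + oncoming high beams + pedestrian in black" case mentioned earlier. The axes in `AXES`, the rule in `is_long_tail`, and the `boost` weighting are all hypothetical; this is not XPeng's pipeline or any real simulator's API, only an illustration of why a virtual engine can produce long-tail cases on demand.

```python
import itertools
import random

# Hypothetical scenario axes; a real simulator would expose many more.
AXES = {
    "weather": ["clear", "rain", "fog", "snow"],
    "time_of_day": ["day", "dusk", "night"],
    "road_surface": ["dry", "wet_reflective", "icy"],
    "oncoming_lights": ["none", "low_beam", "high_beam"],
    "pedestrian": ["none", "light_clothing", "dark_clothing"],
}


def all_scenarios(axes):
    """Yield every combination of axis values as a dict."""
    names = list(axes)
    for combo in itertools.product(*(axes[n] for n in names)):
        yield dict(zip(names, combo))


def is_long_tail(s):
    """Toy rule flagging the rare, dangerous case described in the article."""
    return (
        s["weather"] == "rain"
        and s["time_of_day"] == "night"
        and s["road_surface"] == "wet_reflective"
        and s["oncoming_lights"] == "high_beam"
        and s["pedestrian"] == "dark_clothing"
    )


def sample_batch(scenarios, k, boost=50.0, seed=0):
    """Sample k scenarios, weighting long-tail ones `boost` times higher."""
    rng = random.Random(seed)
    weights = [boost if is_long_tail(s) else 1.0 for s in scenarios]
    return rng.choices(scenarios, weights=weights, k=k)


scenarios = list(all_scenarios(AXES))
long_tail = [s for s in scenarios if is_long_tail(s)]
print(len(scenarios), len(long_tail))  # 324 combinations, 1 of them long-tail

batch = sample_batch(scenarios, k=1000)
print(sum(is_long_tail(s) for s in batch), "long-tail samples in the batch")
```

On a real road the dangerous combination might show up a handful of times a year; in a simulated parameter space it is just one more row that can be requested as often as training requires.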
In high-end manufacturing, the adoption of AI has long been held back by dependence on human experience. At leading enterprises such as Baoshan Iron & Steel Co., Ltd., control of blast furnace temperature and maintenance of process parameters for special steel used to rely heavily on veteran "old masters." When workers' skill levels were uneven, problems such as furnace temperature fluctuations, high energy consumption, and poor product stability were likely to occur.
In 2024, Baoshan Iron & Steel Co., Ltd. cooperated with Huawei to develop the world's first dedicated large model for blast furnaces based on the Pangu large model and conducted a large amount of training with synthetic data. By 2025, Baoshan Iron & Steel Co., Ltd. had launched nearly 300 AI application scenarios, enabling high-precision and high-timeliness perception of internal states, and the prediction accuracy of key indicators such as furnace temperature also reached 90%.
Recently, Suochen Technology demonstrated key technologies such as anti-positioning systems and integrated low-altitude wind field and electromagnetic systems at the World Physical AI Model Conference. In the demonstration, it took only a few hours to complete the design, simulation, and finalization of a fluid fan, and the result measured up to the products of leading companies in the industry on core technical indicators such as noise control, operating efficiency, and energy consumption.
By digesting this synthetic data, the high-end manufacturing industry has quickly crossed the long period of experience accumulation. It can not only predict unplanned equipment shutdowns in advance, significantly reducing maintenance losses, but also automate the optimization of complex process parameters. For large manufacturing enterprises, even a 0.1% increase in the yield rate represents an incremental net profit of tens of millions of RMB.
The financial industry has a strong desire for data and also has concerns about compliance. The cooperation between Huaxing Bank and Tencent's Hunyuan large model provides a classic example for the industry.
In corporate due diligence, credit approval, and insurance underwriting, financial institutions must process complex and highly sensitive data on customers' assets and operations. Synthetic data technology lets them generate a large "virtual customer group" whose credit characteristics, transaction habits, and default probabilities closely resemble those of real customers but which contains no real sensitive information.
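The "virtual customer group" can be sketched in a few lines of Python. The example below learns only aggregate statistics from a small table of records (how common each credit band is and how often each band defaults), discards the customer IDs, and then generates new virtual customers from those statistics. The `REAL` records, the credit bands, and the function names are hypothetical, and this is not the Huaxing Bank or Hunyuan implementation. Real deployments would also need formal privacy checks that this toy does not perform.

```python
import random
from collections import Counter, defaultdict

# Hypothetical "real" records: (customer_id, credit_band, defaulted).
REAL = [
    ("C001", "A", False), ("C002", "A", False), ("C003", "A", False),
    ("C004", "B", False), ("C005", "B", True), ("C006", "B", False),
    ("C007", "C", True), ("C008", "C", True), ("C009", "C", False),
    ("C010", "A", False),
]


def fit_profile(records):
    """Learn band frequencies and per-band default rates; IDs are discarded."""
    band_counts = Counter(band for _, band, _ in records)
    defaults = defaultdict(int)
    for _, band, defaulted in records:
        defaults[band] += int(defaulted)
    rates = {b: defaults[b] / n for b, n in band_counts.items()}
    return band_counts, rates


def make_virtual_customers(records, n, seed=0):
    """Generate n virtual customers that follow the learned statistics."""
    rng = random.Random(seed)
    band_counts, rates = fit_profile(records)
    bands = list(band_counts)
    weights = [band_counts[b] for b in bands]
    out = []
    for i in range(n):
        band = rng.choices(bands, weights=weights, k=1)[0]
        defaulted = rng.random() < rates[band]
        out.append((f"V{i:06d}", band, defaulted))  # fresh, synthetic ID
    return out


virtual = make_virtual_customers(REAL, 20000, seed=42)
real_rate = sum(d for _, _, d in REAL) / len(REAL)
virtual_rate = sum(d for _, _, d in virtual) / len(virtual)
print(f"real default rate: {real_rate:.3f}, virtual: {virtual_rate:.3f}")
```

A risk model trained on `virtual` sees the same band mix and default behaviour as the original table, while no original customer ID ever leaves the fitting step.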
After training, the Hunyuan large model helped Huaxing Bank cut the time needed to generate a loan due diligence report from 10 days to 1 hour. Without crossing the red line of real customer data, the efficiency of automated review in its risk control business has doubled, and the combined cost of compliance testing and external data procurement has fallen by nearly 70%.
The "double ten" rule, under which an innovative drug takes ten years and one billion US dollars to develop, has been a cost bottleneck the industry has failed to break for decades. According to Evaluate Pharma statistics from 2020, the R&D cost of an oncology drug typically reaches 2.6 billion US dollars, and the cycle lasts up to 13 years.
Last year, Eli Lilly and NVIDIA became pioneers on this difficult track. In the most time-consuming and labor-intensive stages, target discovery and molecular screening, they used AI to synthesize hundreds of millions of "virtual molecular structures" in the computer, greatly improving the efficiency of target recognition, shortening the traditional drug R&D cycle to 2-3 years, and reducing the failure rate by 50%.
In this model, AI not only sidesteps medical ethics dilemmas and wins precious time for terminally ill patients, but also saves pharmaceutical companies hundreds of millions in R&D funds. On January 12, Eli Lilly and NVIDIA jointly announced an investment of one billion US dollars to establish an artificial intelligence drug laboratory because they saw the enormous potential of synthetic data in pharmaceutical R&D.
As synthetic data continues to take off, a new kind of business, the data bank, may soon take shape. Enterprises will no longer need to bear high costs and risks to obtain real data. They will only need to purchase "customized synthetic data sets" from the data bank, certified by authoritative institutions and carrying implicit compliance watermarks, to complete most of their training cost-effectively.
By then, "turning fiction into reality" will no longer be just an entertainment joke for the public but the absolute productivity for the accelerated evolution of all industries.
This article is from the WeChat official account "Mingxi Yewang". Author: Mingxi Yewang. Republished by 36Kr with permission.