The change in models triggers a qualitative change in data. The 2025 Bund Summit explores new paradigms for data processing.
The amount of human data available for large model training is decreasing, and the Scaling Law is gradually losing its effectiveness. How can we break through the ceiling of intelligence again?
On September 12, at the "Data meets AI: The Dual Engines in the Intelligent Era" Insights Forum of the 2025 Inclusion·Bund Summit, authoritative experts from academia and industry presented new answers: data has driven the development of AI, and AI is in turn driving a new round of evolution in data. The two engines, integrated and driving each other, mark the direction of evolution.
The forum was jointly hosted by the Chinese Association for Artificial Intelligence, Shanghai Jiao Tong University, and Ant Group.
01. Building high-quality data has become a new breakthrough for the development of large models
As the primary engine in the intelligent era, data is transforming from an auxiliary role to a core driving force.
Professor Xiao Yanghua from Fudan University pointed out that large-model development is running into a severe "data wall": the marginal contribution of unlabeled corpora to model performance is diminishing, and the gains from ever-larger datasets no longer justify the training overhead. He believes the data science of large models must progress from expert experience to quantitative science and on to a self-evolution stage. "The data practice of large models requires research like Tu Youyou's: extracting, from a vast amount of messy data, the key components that determine model capabilities."
Xiao Yanghua shared a practice of screening high-quality corpora with grammatical-complexity indicators and cumulative-distribution sampling. In experiments, continuing pre-training on only the top 20% of a 10-billion-token financial corpus raised accuracy on domain question-answering tasks by 1.7% over continued pre-training on the full corpus.
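The specific indicators were not detailed, but the screening idea can be sketched in Python: score each document with a stand-in grammatical-complexity measure, then keep only the top fraction of the score distribution. The scoring heuristic, function names, and markers below are illustrative assumptions, not the method from the talk.

```python
def complexity_score(text: str) -> float:
    # Toy proxy for grammatical complexity: average sentence length plus a
    # count of subordinate-clause markers. The real indicators used in the
    # study are not public; this heuristic is illustrative only.
    sentences = [s for s in text.split(".") if s.strip()]
    if not sentences:
        return 0.0
    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
    markers = sum(text.count(m) for m in ("which", "that", "because", "although"))
    return avg_len + markers

def select_top_fraction(corpus: list[str], fraction: float = 0.2) -> list[str]:
    # Rank documents by score and keep the top `fraction` of the empirical
    # (cumulative) score distribution -- a simple stand-in for the
    # cumulative-distribution sampling mentioned in the talk.
    scored = sorted(corpus, key=complexity_score, reverse=True)
    k = max(1, int(len(scored) * fraction))
    return scored[:k]
```

With `fraction=0.2`, continued pre-training would then consume only the selected fifth of the corpus rather than the full dataset.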
Professor Zhai Guangtao, a specially appointed professor at Shanghai Jiao Tong University, emphasized that quality should come first for both refined and synthetic data. Data quality analysis should start from the "quality of experience", accounting for both human and machine experience, to further improve large-model performance under the data-centric paradigm.
Li Ke, the CEO of Haitian Ruisheng, shared global trends in the AI data industry from the perspective of industrial practice. He believes the data industry is undergoing a major shift from labor-intensive to technology-intensive and knowledge-intensive work. Drawing on practical cases such as motion-capture data, autonomous-driving annotation, and chain-of-thought datasets, Li Ke demonstrated how high-quality data can serve a wide range of industries.
Shan Dongming, the chairman of Shanghai Kupasi Technology Co., Ltd., said that changes in models lead to "qualitative changes in data". High-quality datasets, he argued, should meet the VALID² requirements (Vitality, Authenticity, Large Samples, Integrity, Diversity, High Knowledge Density), and he detailed a systematic reconstruction of corpus data along three dimensions: methodology, infrastructure, and industry ecosystem.
02. Technological innovation promotes the release of data value
As the second engine, AI technology is profoundly changing the way data is processed and utilized.
Yang Haibo, the president of Guanglun Intelligence, said that embodied intelligence demands thousands of times more data than large language models or autonomous driving, and that synthetic data is an important foundation for realizing a Scaling Law for embodied intelligence. He emphasized that synthetic data must meet four essential conditions: real physical interaction, human-in-the-loop demonstration, sufficiently rich scenarios, and closed-loop data verification. "One can't learn to swim by standing on the shore," Yang said: robots need to enter a physically interactive environment and obtain feedback from the physical world to optimize the model.
Zhao Junbo, the head of the Data Intelligence Laboratory at Ant Technology Research Institute, believes that the next-generation RL training signal should shift from "right or wrong" to "good or better". The "Rubric is Reward" mechanism he is exploring can build an efficient RL loop with only 5k samples and 10,000 scoring criteria, removing the dependence on large volumes of SFT data and achieving "taste alignment". He said the method enables stylized generation in humanities, creative, and emotional domains, eliminating the "machine flavor" of model outputs.
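The talk did not detail how "Rubric is Reward" is implemented, but the shift from a binary right/wrong signal to graded "good or better" scoring can be sketched as a weighted rubric whose criteria each return a degree of quality. The `Criterion` structure, judge functions, and weights below are hypothetical stand-ins, not Ant's actual design.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    # One rubric item: a name, a judge returning a graded score in [0, 1],
    # and a weight. In practice the judges could be model-based scorers.
    name: str
    judge: Callable[[str], float]
    weight: float = 1.0

def rubric_reward(response: str, rubric: list[Criterion]) -> float:
    # Aggregate graded criteria into a scalar reward. Unlike a binary
    # correctness check, every criterion contributes a degree of quality,
    # turning the RL signal from "is it right?" into "how good is it?".
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * c.judge(response) for c in rubric) / total_weight
```

A policy trained against such a reward is pushed toward responses that score higher across the rubric, rather than merely passing a correctness gate.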
Xu Lei, the CTO of LanceDB, shared the team's practice with an open-source multimodal data lake. Unlike traditional formats such as Parquet and ORC, he explained, the newly designed Lance format is both a file format and a table format, with two core features: zero-copy data evolution and efficient point queries. Xu Lei cited Runway ML as an example: after importing PB-scale video data into Lance, the company can manage it as easily as with SQL, enabling more than 30 AI engineers to iterate on feature engineering in parallel against the same main table.
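Lance's actual on-disk design is more involved, but the two features named here can be illustrated with a toy in-memory model: adding a column leaves existing column data untouched (zero-copy evolution), and fetching a row is a direct index rather than a scan (point query). `MiniColumnarTable` and its methods are invented for illustration; this is not the Lance API.

```python
class MiniColumnarTable:
    """Toy model of two ideas attributed to the Lance format: zero-copy
    data evolution and efficient point queries. Conceptual sketch only."""

    def __init__(self) -> None:
        self._columns: dict[str, list] = {}

    def add_column(self, name: str, values: list) -> None:
        # "Zero-copy" evolution: existing columns are never rewritten;
        # the new column is simply stored alongside them.
        if self._columns:
            n = len(next(iter(self._columns.values())))
            assert len(values) == n, "column length mismatch"
        self._columns[name] = values

    def take(self, row_id: int) -> dict:
        # Point query: one direct index per column, no table scan.
        return {name: col[row_id] for name, col in self._columns.items()}
```

This is why many engineers can add derived feature columns to the same table in parallel: each addition is append-only with respect to the existing data.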
Chen Chuan, senior director of Internet Solutions Architecture at NVIDIA, introduced efficient data-processing innovations for powering generative AI and shared GPU-accelerated pipelines spanning text to multimodal data.
During the round-table session, the experts discussed the reconstruction of, and opportunities in, data infrastructure in depth. They agreed that as computing paradigms change, data-processing technologies must be reconstructed and redefined, whether proactively or under pressure: reconstruction solves existing problems, while redefinition looks to the future and addresses problems yet to come.
The forum showcased the latest achievements in the coordinated development of the data and AI dual engines, offering references and practical paths for building data infrastructure in the intelligent era. The participating experts agreed that only by deeply integrating data and AI, and by establishing a complete system of data standards and a quality-evaluation framework, can the full potential of intelligent technologies be unleashed and the intelligent era advanced to a higher level.