Xiongan Attracts Talent | Tsinghua Team Open-Sources the First General Large Model for Structured Data
As a key carrier of artificial intelligence industry innovation in the Xiongan New Area, the Xiongan Artificial Intelligence Industrial Park focuses on core AI technologies and has built an innovation ecosystem integrating technology R&D, industry incubation, and enterprise cultivation. It is rapidly attracting cutting-edge technology companies such as Wenzhun Intelligence to form an industrial innovation cluster, injecting strong momentum into the region's high-quality development.
On August 29, 2025, "LimiX", a general large model for structured data jointly developed by Professor Cui Peng's team at the Department of Computer Science, Tsinghua University, and Wenzhun Intelligence, was officially open-sourced. The release marks a crucial step in China's technological breakthrough and ecosystem opening in the intelligent processing of structured data, and will significantly lower the barrier for industries to adopt AI for structured data. Especially in general industry, where structured data dominates, LimiX will help AI integrate deeply into the entire production process, unlock the value of industrial data, provide key support for intelligent manufacturing and new-type industrialization, and drive industrial transformation and upgrading.
In general industry, structured data is the core asset: production parameters, equipment operation records, quality inspection results, and scientific experiment data are all structured. The ability to process such data intelligently directly affects industrial efficiency and research breakthroughs, and is the key entry point for AI to empower manufacturing. Although large language models (LLMs) have been widely applied to content creation, dialogue, and other fields thanks to their powerful text understanding and generation capabilities, they fall short on structured data such as tables and time series: they are error-prone on basic tasks like numerical comparison and calculation, and even weaker on complex tasks such as classification, prediction, and attribution, with accuracy rarely meeting real-world industry requirements. Industrial structured-data processing therefore still relies on the traditional "private data + dedicated model" paradigm. Because dedicated models generalize poorly and are not universal, separate models must be trained for each scenario, which is costly and often ineffective, makes it hard to realize the multiplier effect of aggregating data elements, and seriously constrains the adoption of AI in industrial settings.
The general large model for structured data (Large Data Model, LDM) addresses exactly this pain point. Unlike LLMs, which focus on text, an LDM combines structural causal inference with large-scale pre-training. It captures the internal relationships within structured data, generalizes strongly, and adapts to many task types across industries. LimiX supports up to ten task types, including classification, regression, high-dimensional representation extraction, and causal inference. In scenarios such as industrial time-series forecasting, anomaly detection, and material property prediction, its performance matches or even surpasses the best dedicated models, achieving a breakthrough in universality, with a single model adapting to multiple scenarios and tasks, and providing a one-for-all solution for AI-empowered industry.
From technical performance to industrial deployment, the core advantages of LimiX have been thoroughly validated. Results from more than a dozen tests on over 600 datasets show that, without any additional training, LimiX matches or exceeds dedicated SOTA models on key indicators such as accuracy and generalization. At the application level, LimiX has already been deployed in multiple real industrial scenarios. Its training-free use, low deployment cost, high accuracy, and strong universality have been highly praised by partner enterprises, making it a practical solution for converting industrial data into value and laying a real intelligent foundation for core business scenarios across general industrial verticals.
1. R&D Team
The core R&D effort behind the LimiX model was led by Professor Cui Peng of the Department of Computer Science at Tsinghua University. The team combines strengths in academic research and industrial deployment; behind its technological breakthroughs lie deep research accumulation and forward-looking strategic planning.
As the core of the team, Professor Cui Peng is a leading Chinese scholar in data intelligence. He is a recipient of the National Science Fund for Distinguished Young Scholars, has twice won the Second Prize of the National Natural Science Award, and has been named an ACM Distinguished Scientist; his academic influence is widely recognized internationally. In basic research, he pioneered the paradigm of "causality-inspired stable learning", breaking through the performance limits of traditional machine learning under data distribution shift and laying an important theoretical foundation for research on the reliability and generalization of AI models.
After OpenAI's launch of ChatGPT in 2022 set off a wave of large-model technology, Professor Cui Peng perceived its potential for structured data and quickly extended his research from causal stable learning to general large models for structured data (LDMs). Building on this theoretical groundwork, the team overcame core challenges such as structural causal data synthesis, model architecture design, and cross-scenario generalization, ultimately achieving LimiX's performance breakthroughs across multi-domain tasks and laying the key technical foundation for this open-source release.
2. Introduction to the LimiX Large Model
The "LimiX" large model integrates multiple capabilities into a single base model, including classification, regression, missing-value imputation, data density estimation, high-dimensional representation extraction, data generation, causal inference, causal discovery, and out-of-distribution generalization prediction. It delivers excellent structured-data modeling performance while greatly improving model universality.
During pre-training, LimiX learns causal relationships from massive amounts of causally synthesized data. Unlike dedicated models, which memorize data feature patterns during training, LimiX directly captures causal variables in different contexts and learns the joint distribution of the data through conditional mask modeling, adapting to downstream tasks such as classification, regression, missing-value prediction, data generation, and causal inference. At inference time, LimiX reasons directly from the provided context and can be applied to new scenarios without any training.
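To make the idea of training-free, context-based inference concrete, here is a minimal sketch. This is our own illustration, not the LimiX implementation: a fixed distance-based attention over labeled context rows stands in for the pre-trained network, and all function names are assumptions.

```python
import numpy as np

def in_context_predict(ctx_X, ctx_y, query_X, temperature=1.0):
    """Toy stand-in for training-free, context-based inference:
    each query attends to the context rows and aggregates their labels.
    No gradient training happens; the 'model' just reads its context."""
    # pairwise squared distances between query rows and context rows
    d2 = ((query_X[:, None, :] - ctx_X[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / temperature)          # closer context rows weigh more
    w /= w.sum(axis=1, keepdims=True)
    return w @ ctx_y                       # attention-weighted label aggregation

ctx_X = np.array([[0.0], [1.0], [2.0]])    # "context": labeled rows, no training
ctx_y = np.array([0.0, 1.0, 2.0])
pred = in_context_predict(ctx_X, ctx_y, np.array([[1.5]]))
```

The real model replaces the hand-written distance kernel with attention weights learned during pre-training, but the interface idea is the same: context in, predictions out, no fitting step.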
Model Technical Architecture
The LimiX model adopts the Transformer architecture, optimized for structured-data modeling and task generalization. It first embeds the features X and targets Y from the prior knowledge base separately. In the main module, attention is applied along both the sample dimension and the feature dimension, so that the model focuses on the key features of key samples. Finally, the extracted high-dimensional features are fed into a regression head and a classification head to support the different functions.
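The dual-axis attention idea can be sketched as follows. This is a deliberate simplification with random weights, a single head, and no learned projections; the actual architecture details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(h, axis):
    """Toy single-head self-attention along one axis of a
    (samples, features, dim) tensor, with no learned projections."""
    h = np.moveaxis(h, axis, 0)            # bring the attended axis to the front
    scores = np.einsum('i...d,j...d->ij...', h, h) / np.sqrt(h.shape[-1])
    w = softmax(scores, axis=1)
    out = np.einsum('ij...,j...d->i...d', w, h)
    return np.moveaxis(out, 0, axis)

# A toy embedded table: 8 samples x 5 features, each cell a 16-dim embedding.
h = rng.normal(size=(8, 5, 16))
h = attend(h, axis=0)        # attention across samples (rows)
h = attend(h, axis=1)        # attention across features (columns)
pooled = h.mean(axis=1)      # per-sample representation
reg_out = pooled @ rng.normal(size=(16, 1))            # regression head
cls_out = softmax(pooled @ rng.normal(size=(16, 3)))   # classification head
```

Alternating attention over the two axes is what lets the model relate samples to each other and features to each other before the task heads produce outputs.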
Training Data Construction
Unlike traditional tree models and Transformer-based LLMs, the LimiX model is trained entirely on synthetic data and does not rely on any real-world data source. To make data generation efficient and controllable, the team uses a method based on structural causal graphs: sampled initial data propagates over a directed acyclic graph, and different real-world causal dependencies are simulated through complex edge mappings and node interactions. Sampling the generated data on the causal graph yields the features X and targets Y of the training data. Data generated this way achieves diversity of causal structure while remaining controllable.
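The causal-graph generation procedure can be sketched like this. It is a minimal illustration under our own assumptions (an upper-triangular random DAG and a few toy edge functions), not the team's actual generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n_samples, n_nodes, edge_prob=0.4):
    """Sample a table from a random structural causal DAG:
    each node is a nonlinear function of its parents plus noise."""
    # An upper-triangular adjacency matrix guarantees acyclicity.
    adj = np.triu(rng.random((n_nodes, n_nodes)) < edge_prob, k=1)
    funcs = [np.tanh, np.sin, lambda v: v ** 2]   # toy edge mappings
    data = np.zeros((n_samples, n_nodes))
    for j in range(n_nodes):                      # topological order by construction
        noise = rng.normal(size=n_samples)
        parents = np.flatnonzero(adj[:, j])
        if parents.size == 0:
            data[:, j] = noise                    # root nodes are pure noise
        else:
            f = funcs[j % len(funcs)]
            data[:, j] = f(data[:, parents].sum(axis=1)) + 0.1 * noise
    return data

table = sample_scm(1000, 6)
X, y = table[:, :-1], table[:, -1]   # e.g. treat the last node as the target
```

Varying the graph topology and the edge functions across sampled graphs is what gives the synthetic corpus its diversity of causal structures.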
Model Optimization Objectives
A general large model for structured data (LDM) must be universal across tasks and application scenarios and must be able to model data without training. It therefore needs to model the joint distribution of the data, which improves universality and strengthens its ability to capture feature interaction patterns. To this end, the LimiX model adds a mask-reconstruction mechanism to its optimization objective: during training, random feature values are masked, and the model reconstructs the missing features from the observed ones according to the causal dependencies among features. Mask prediction lets the model learn the joint distribution of the features, acquire clearer and more robust decision boundaries, and better represent feature dependencies. To better match the missingness patterns of real scenarios, LimiX applies masking along three dimensions:
Sample-dimension masking: for each sample, some of its features are randomly masked.
Feature-dimension masking: for all samples, one feature is randomly masked.
Semantic-dimension masking: focusing on high-dimensional correlations, a group of semantically related features is randomly masked.
In addition, the LimiX model accounts for the feature-missingness ratio. By designing training objectives for missing values in each row or each subset, it stabilizes inference performance under different degrees of missingness and improves robustness to diverse missing patterns.
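The three masking dimensions can be sketched as follows. This is our own approximation; in particular, feature correlation stands in here for "semantic relevance", and the exact procedures in LimiX may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dim_mask(X, ratio=0.3):
    """Sample-dimension masking: random cells within each row."""
    return rng.random(X.shape) < ratio

def feature_dim_mask(X):
    """Feature-dimension masking: one whole column across all samples."""
    m = np.zeros(X.shape, dtype=bool)
    m[:, rng.integers(X.shape[1])] = True
    return m

def semantic_dim_mask(X, k=2):
    """Semantic-dimension masking, approximated by correlation:
    mask the k features most correlated with a random anchor feature."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    anchor = rng.integers(X.shape[1])
    group = np.argsort(corr[anchor])[-k:]
    m = np.zeros(X.shape, dtype=bool)
    m[:, group] = True
    return m

X = rng.normal(size=(100, 8))
masked = np.where(sample_dim_mask(X), np.nan, X)  # masked cells become reconstruction targets
```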
Model Inference
At inference time, the LimiX model offers strong scenario adaptability and task flexibility. The model directly accepts structured inputs in multiple forms, such as tables, time series, and graphs, without additional training for specific scenarios or tasks. Users only need to specify the task type, such as classification, regression, missing-value imputation, data generation, causal inference, or causal discovery, and the model automatically completes data parsing, modeling, and result output, truly realizing plug-and-play use and efficiently covering diverse structured-data processing needs.
In addition, the LimiX model supports efficient fine-tuning on a given dataset, allowing it to learn more comprehensive causal relationships in the data and further improving its predictive performance.
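A hypothetical usage sketch of the plug-and-play workflow described above. Every name here (`PlugAndPlayModel`, `fit_context`, `predict`, the task strings) is our own invention, not the released LimiX API, and trivial statistics stand in for real inference.

```python
import numpy as np

class PlugAndPlayModel:
    """Toy dispatcher illustrating the plug-and-play idea: one model object,
    several task types selected by name, no task-specific training.
    Placeholder statistics stand in for the real model's inference."""

    def fit_context(self, X, y):
        # "fitting" only stores the context; nothing is trained
        self.X, self.y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        return self

    def predict(self, query_X, task):
        query_X = np.asarray(query_X, dtype=float)
        if task == "regression":
            return np.full(len(query_X), self.y.mean())       # mean predictor
        if task == "classification":
            vals, counts = np.unique(self.y, return_counts=True)
            return np.full(len(query_X), vals[counts.argmax()])  # majority class
        if task == "imputation":
            col_means = np.nanmean(query_X, axis=0)
            return np.where(np.isnan(query_X), col_means, query_X)
        raise ValueError(f"unsupported task: {task}")

model = PlugAndPlayModel().fit_context([[0.0], [1.0], [2.0]], [0, 0, 1])
cls = model.predict(np.zeros((4, 1)), task="classification")
```

The point of the sketch is the shape of the interface: one object, one context, many task types behind a single `predict` call.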
3. Model Effect
The LimiX model achieves excellent performance on core structured-data tasks such as classification and regression without any dataset-specific training.
For evaluation, authoritative datasets from multiple fields were selected as benchmarks. For example, the open-source Talent benchmark, which contains hundreds of real datasets, is one of the largest and most representative in the field. On classification tasks, compared against 21 commonly used baseline methods, LimiX significantly outperforms the other models, achieving the best results in AUC, ACC, F1 score, and ECE.
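Of these metrics, ECE (expected calibration error) is the least standard. One common formulation, shown here as a self-contained sketch, bins predictions by confidence and averages the per-bin gap between accuracy and mean confidence:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    per-bin accuracy and per-bin mean confidence, weighted by bin size."""
    conf = probs.max(axis=1)                                  # top predicted probability
    correct = (probs.argmax(axis=1) == labels).astype(float)  # 1 if prediction right
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

probs = np.array([[0.95, 0.05], [0.85, 0.15], [0.65, 0.35]])
labels = np.array([0, 1, 0])
ece = expected_calibration_error(probs, labels)
```

Lower ECE means the model's stated confidence better matches its actual accuracy, which matters for industrial decisions made on top of predictions.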
On regression tasks, the LimiX model achieves the best average R² and RMSE, showing clear advantages over the other baseline methods; when a dataset contains distracting or uninformative features, its advantage is even more pronounced.
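For reference, the two regression metrics have standard definitions, which a short sketch makes explicit:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 is perfect;
    0 means no better than predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
```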
4. Model Implementation and Application
With its superior general modeling capability, the LimiX model has effectively resolved the bottleneck that traditional dedicated models face in industrial scenarios with scarce data, uneven quality, and heterogeneous environments, and has been successfully deployed in multiple key industrial settings.
In industrial operations and maintenance, the LimiX model has been applied in industries such as steel, energy, and power, acting as an "equipment health steward" and providing core support for equipment monitoring, fault early warning, and health assessment. At one steel enterprise, complex production lines had long suffered from ineffective early warnings because atypical abnormal signals were hard to capture accurately in massive sensor data, posing a major risk to production safety. After LimiX was deployed, equipment fault prediction accuracy improved by 15% over the original dedicated model, meeting application-level requirements and shifting the maintenance mode from reactive repair to predictive maintenance, significantly improving production safety and operational efficiency.
In process optimization, the LimiX model has become a "production think tank" in industries such as chemical engineering, manufacturing, and biology. At one materials R&D enterprise, accurately identifying the key factors among massive physicochemical features was the core bottleneck limiting material design efficiency. LimiX screened out a small set of core optimization factors and, while preserving essentially all the information (R² above 0.95), increased regulation efficiency fivefold, providing a scientific basis for the enterprise's cost reduction, efficiency improvement, and green production.
Industry experts note that the successful deployment of the LimiX model not only validates the applicability of general modeling technology in industrial scenarios, but also provides a standardized solution to the pain points of industrial data application, and is expected to drive intelligent upgrading across more industrial fields.
5. Open-Source Address
Project Homepage