A team from Tsinghua University has open-sourced the first general large model for structured data.
On August 29, 2025, "LimiX", a general large model for structured data jointly developed by the team of Professor Cui Peng from the Department of Computer Science at Tsinghua University and Wenzhun Intelligence, was officially open-sourced. The release marks a crucial step for China in both technological breakthrough and ecosystem opening in the intelligent processing of structured data, and significantly lowers the threshold for industries to apply AI to structured data. In general industry in particular, where structured data dominates, the "LimiX" large model can help AI integrate deeply into the entire production process, unlock the value of industrial data, provide key support for intelligent manufacturing and new-type industrialization, and drive industrial transformation and upgrading.
In general industry, structured data is the core asset: production parameters, equipment operation data, quality inspection data, and experimental data are all structured. The ability to process such data intelligently directly affects industrial efficiency and research progress, and is a key entry point for AI to empower manufacturing. Although general large language models (LLMs) are widely used for content creation and dialogue thanks to their powerful text understanding and generation, they fall short on structured data such as tables and time series: they are error-prone even on basic tasks like numerical comparison and arithmetic, and weaker still on complex tasks such as classification, prediction, and attribution, with accuracy that rarely meets real-world industry requirements. As a result, industrial structured data is still handled under the traditional "private data + dedicated model" paradigm. Because dedicated models generalize poorly, a separate model must be trained for each scenario, leading to high costs, mediocre results, and an inability to realize the multiplier effect of aggregating data elements, which severely constrains AI adoption in industrial settings.
The general large model for structured data (Large Data Model, LDM) targets exactly this pain point. Unlike text-focused LLMs, an LDM combines structural causal inference with pretrained large-model technology; it captures the internal relationships within structured data and generalizes strongly, adapting to many task types across industries. The "LimiX" large model supports up to 10 task types, including classification, regression, high-dimensional representation extraction, and causal inference. In scenarios such as industrial time-series prediction, anomaly monitoring, and material property prediction, its performance matches or even surpasses the best dedicated models, achieving one-model-for-many-scenarios universality and providing a One-For-All solution for AI-empowered industry.
From technical performance to industrial deployment, the core advantages of the "LimiX" large model have been thoroughly validated. Across more than a dozen tests on over 600 datasets, "LimiX" matches or exceeds state-of-the-art (SOTA) dedicated models on key indicators such as accuracy and generalization without any secondary training. At the application level, it has already been deployed in multiple real industrial scenarios; its training-free use, low deployment cost, high accuracy, and strong universality have been well received by partner enterprises. It has become a practical technical path for converting industrial data into value and is accelerating an intelligent foundation for the core business scenarios of general industrial verticals.
1. R&D Team
The core R&D of the "LimiX" model was led by Professor Cui Peng of the Department of Computer Science at Tsinghua University. The team combines the strengths of academic research and industrial deployment; behind its technological breakthroughs lie deep scientific accumulation and forward-looking planning.
As the core of the team, Professor Cui Peng is a leading Chinese scholar in data intelligence. He is a recipient of the National Science Fund for Distinguished Young Scholars, a two-time winner of the Second Prize of the National Natural Science Award, and an ACM Distinguished Scientist, with influence widely recognized in the international academic community. In basic research, he pioneered the paradigm of "causality-inspired stable learning", breaking through the performance limits of traditional machine learning under data distribution shift and laying an important theoretical foundation for research on the reliability and generalization of AI models.
After OpenAI's launch of ChatGPT in 2022 set off the wave of large-model technology, Professor Cui Peng saw the potential of large models for structured data and quickly extended his research from causal stable learning to general large models for structured data (LDMs). Building on this theoretical groundwork, the team solved core problems such as structural causal data synthesis, model architecture design, and cross-scenario generalization, ultimately achieving the "LimiX" model's performance breakthroughs across multi-domain tasks and laying the technical foundation for this open-source release.
2. Introduction to the LimiX Large Model
The "LimiX" large model integrates multiple capabilities into one base model: classification, regression, missing-value imputation, data density estimation, high-dimensional representation extraction, data generation, causal inference, causal discovery, and out-of-distribution generalization prediction. It delivers excellent structured-data modeling performance while greatly improving universality.
During pretraining, the "LimiX" large model learns causal relationships from massive amounts of causally synthesized data. Unlike dedicated models, which memorize data feature patterns during training, "LimiX" directly captures causal variables from varying context information and learns the joint distribution of the data via conditional mask modeling, adapting to downstream tasks such as classification, regression, missing-value prediction, data generation, and causal inference. At inference time, LimiX reasons directly from the provided context and can be applied to various scenarios without any training.
Model Technical Architecture
The "LimiX" large model follows the Transformer architecture, optimized for structured-data modeling and task generalization. It first embeds the features X and the target Y from the prior knowledge base separately. In the main module, attention is applied along both the sample dimension and the feature dimension, so the model attends to the key features of the key samples. Finally, the extracted high-dimensional features are fed into a regression head and a classification head to support the different capabilities.
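The dual attention over samples and features described above can be pictured with a plain NumPy toy (an illustrative sketch, not the actual LimiX implementation; learned projections and multi-head structure are omitted for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h):
    # minimal single-head self-attention over the first axis of h: (n, d);
    # identity projections stand in for learned Q/K/V weights
    scores = h @ h.T / np.sqrt(h.shape[-1])
    return softmax(scores, axis=-1) @ h

# toy embedded table: n samples x m features, each cell a d-dim embedding
n, m, d = 8, 5, 16
rng = np.random.default_rng(0)
H = rng.normal(size=(n, m, d))

# sample-dimension attention: for each feature column, mix across samples
H_samples = np.stack([self_attention(H[:, j, :]) for j in range(m)], axis=1)
# feature-dimension attention: for each sample row, mix across features
H_features = np.stack([self_attention(H_samples[i]) for i in range(n)], axis=0)

print(H_features.shape)  # (8, 5, 16)
```

A regression or classification head would then read off these high-dimensional features, as the paragraph above describes.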
Training Data Construction
Unlike traditional tree models and Transformer-based LLMs, the "LimiX" large model is trained exclusively on generated data and does not rely on any real-world data source. To keep data generation efficient and controllable, the team uses a method based on structural causal graphs: sampled initial data propagates over a directed acyclic graph, and diverse real-world causal dependencies are simulated through complex edge mappings and node interactions. Sampling the generated data on the causal graph yields the features X and the target Y of the training data. Data generated this way achieves diversity in causal structure while remaining controllable.
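The generation scheme above can be sketched minimally: sample a random DAG, propagate values from root nodes to their descendants through nonlinear edge mappings plus noise, then split the result into features X and target Y. All structural choices below (edge probability, tanh mapping, noise scale) are illustrative assumptions, not the team's actual generator:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_scm_dataset(n_samples=200, n_nodes=6):
    """Sample a toy dataset from a random structural causal graph.

    Nodes are ordered 0..n_nodes-1 and edges only go from lower to
    higher index, which guarantees the graph is acyclic. Each node is
    a nonlinear function of its parents plus exogenous noise.
    """
    # random upper-triangular adjacency -> directed acyclic graph
    adj = np.triu(rng.random((n_nodes, n_nodes)) < 0.5, k=1)
    values = np.zeros((n_samples, n_nodes))
    for j in range(n_nodes):
        parents = np.where(adj[:, j])[0]
        if len(parents) == 0:
            values[:, j] = rng.normal(size=n_samples)  # exogenous root node
        else:
            w = rng.normal(size=len(parents))
            # nonlinear edge mapping (tanh) mimics varied causal dependencies
            values[:, j] = np.tanh(values[:, parents] @ w) \
                           + rng.normal(scale=0.1, size=n_samples)
    # treat the last node as the target Y, the rest as features X
    return values[:, :-1], values[:, -1]

X, y = sample_scm_dataset()
print(X.shape, y.shape)  # (200, 5) (200,)
```

Resampling the graph, edge functions, and noise yields training tables with diverse but controllable causal structure, which is the property the paragraph above emphasizes.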
Model Optimization Objective
A general large model for structured data (LDM) must be universal across tasks and application scenarios and able to model data without training. This requires modeling the joint distribution of the data, which improves universality and strengthens the modeling of feature-interaction patterns. To this end, the "LimiX" large model builds a mask-reconstruction mechanism into its optimization objective: during training, random feature values are masked and the model reconstructs them from the observed features, exploiting the causal dependencies among features. Mask prediction lets the model learn the joint distribution of the features, learn clearer and more robust decision boundaries, and better represent feature dependencies. To match the missingness patterns of real scenarios, "LimiX" applies masking along three dimensions:
Sample-dimension masking: for each sample, a random subset of its features is masked.
Feature-dimension masking: for all samples, one randomly chosen feature is masked.
Semantic-dimension masking: targeting high-dimensional correlations, a subset of features with high semantic relevance is masked.
In addition, the "LimiX" large model accounts for the feature-missing ratio: by designing training objectives for the missing values of each row or each subset, it stabilizes inference under different degrees of missingness and improves robustness to varied missing patterns.
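The three masking dimensions can be sketched on a plain NumPy table as follows (column correlation is used here as a stand-in for "semantic relevance"; the details are illustrative assumptions, not the exact training procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask(X, p=0.3):
    """Sample-dimension: mask random feature values within each sample."""
    return rng.random(X.shape) < p

def feature_mask(X):
    """Feature-dimension: mask one randomly chosen column for all samples."""
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, rng.integers(X.shape[1])] = True
    return mask

def semantic_mask(X, k=2):
    """Semantic-dimension: mask a group of highly correlated columns."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    anchor = rng.integers(X.shape[1])
    group = np.argsort(-corr[anchor])[:k]  # anchor plus its nearest column(s)
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, group] = True
    return mask

X = rng.normal(size=(100, 6))
for fn in (sample_mask, feature_mask, semantic_mask):
    m = fn(X)
    X_masked = np.where(m, np.nan, X)  # the model reconstructs the NaN cells
    print(fn.__name__, int(m.sum()))
```

During training, the reconstruction loss is computed only on the masked cells, which is what pushes the model toward the joint distribution of the features.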
Model Inference
At inference time, the "LimiX" large model offers strong scenario adaptability and task flexibility. It can directly accept structured inputs in multiple forms, including tables, time series, and graphs, without additional training for a specific scenario or task. Users only specify the task type, such as classification, regression, missing-value completion, data generation, causal inference, or causal discovery, and the model automatically handles data parsing, logical modeling, and result output, realizing true plug-and-play operation and efficiently covering diverse structured-data processing needs.
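The training-free, context-conditioned usage pattern can be pictured with a toy stand-in: the "model" receives labeled context rows plus query rows and predicts in a single call, with no fitting step. The class and method names below are hypothetical and do not reflect LimiX's actual API; a 1-nearest-neighbor lookup plays the role of the pretrained network:

```python
import numpy as np

class ContextPredictor:
    """Toy stand-in for training-free, in-context tabular prediction.

    A real LDM conditions a pretrained network on the context set; here a
    1-nearest-neighbor lookup plays that role (hypothetical interface).
    """
    def predict(self, X_context, y_context, X_query):
        X_context = np.asarray(X_context, dtype=float)
        X_query = np.asarray(X_query, dtype=float)
        # distance from each query row to each context row
        d = np.linalg.norm(X_query[:, None, :] - X_context[None, :, :], axis=-1)
        nearest = d.argmin(axis=1)
        return np.asarray(y_context)[nearest]

model = ContextPredictor()
pred = model.predict(
    X_context=[[0, 0], [10, 10]],   # labeled examples supplied as context
    y_context=["low", "high"],
    X_query=[[1, 1], [9, 9]],       # rows to predict, no training step
)
print(pred)  # ['low' 'high']
```

The point of the sketch is the interface, not the algorithm: context in, predictions out, nothing fitted or saved in between.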
In addition, the "LimiX" large model supports efficient fine-tuning on a given dataset, allowing it to learn the dataset's causal relationships more fully and further improve predictive performance.
3. Model Effect
Without any dataset-specific training, the "LimiX" large model achieves excellent performance on multiple core structured-data tasks such as classification and regression.
For evaluation, authoritative datasets from various fields were selected as benchmarks. For example, the open-source Talent benchmark, which contains hundreds of real datasets, is among the largest and most representative in the field. On the classification task, compared with 21 commonly used baseline methods, the "LimiX" large model significantly surpasses the other models, achieving the best results on AUC, ACC, F1 score, and ECE.
On the regression task, the "LimiX" large model achieves the best average R^2 and RMSE, with clear advantages over the other baselines; when datasets contain distracting or uninformative features, its advantage is even more pronounced.
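For reference, the two regression indicators can be computed as follows (standard definitions, not tied to the reported experiments):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.3, 8.9]
print(round(r2_score(y_true, y_pred), 4), round(rmse(y_true, y_pred), 4))
# 0.9925 0.1936
```

Higher R^2 (closer to 1) and lower RMSE are better, which is the direction of the comparisons reported above.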
4. Model Implementation and Application
Currently, with its superior general modeling ability, the "LimiX" large model has effectively solved the bottleneck of traditional dedicated models in industrial scenarios with "scarce data, uneven quality, and heterogeneous environments" and has been successfully implemented in multiple key industrial scenarios.
In industrial operations and maintenance, the "LimiX" large model has been applied in industries such as steel, energy, and power, acting as an "equipment health manager" that provides core support for equipment monitoring, fault early warning, and health assessment. Take one steel enterprise: its complex production line had long suffered ineffective early warnings because atypical abnormal signals were hard to capture in massive sensor data, posing a serious hazard to production safety. After the "LimiX" large model was deployed, equipment-fault prediction accuracy improved by 15% over the original dedicated model, meeting application-grade requirements and shifting maintenance from reactive repair to predictive maintenance, with significant gains in production safety and operating efficiency.
In process optimization, the "LimiX" large model acts as a "production think tank" in industries such as chemicals, manufacturing, and biology. At a materials R&D enterprise, the core bottleneck in design efficiency was accurately identifying the key factors among massive physical and chemical features. "LimiX" screened out a small set of core optimization factors and, while preserving essentially all the information (R^2 > 0.95), increased regulation efficiency fivefold, giving the enterprise a scientific basis for cutting costs, improving efficiency, and greener production.
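The screening idea, keeping a small subset of factors while holding R^2 above a threshold, can be sketched with greedy forward selection on synthetic data (an illustrative stand-in, not the method actually used in the case study):

```python
import numpy as np

rng = np.random.default_rng(7)

def fit_r2(X, y):
    """R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def greedy_screen(X, y, target_r2=0.95):
    """Add, one at a time, the feature that most improves R^2,
    stopping once the fit on the selected subset reaches target_r2."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        best = max(remaining, key=lambda j: fit_r2(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
        if fit_r2(X[:, selected], y) >= target_r2:
            break
    return selected

# synthetic table: 20 candidate factors, only 3 actually drive the target
X = rng.normal(size=(300, 20))
y = 2.0 * X[:, 0] - X[:, 5] + X[:, 12] + 0.05 * rng.normal(size=300)

core = greedy_screen(X, y)
print(sorted(core))  # indices of the few driving factors
```

On this synthetic table the procedure recovers the three informative columns and discards the other seventeen, which mirrors the "few core factors, no information loss" outcome described above.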
Industry experts said that the successful implementation of the "LimiX" large model not only verifies the applicability of general modeling technology in industrial scenarios but also provides a standardized solution for solving the pain points of industrial data applications, and is expected to promote intelligent upgrading in more industrial fields.
5. Open-Source Addresses
Project homepage: https://limix-ldm.github.io
Technical report: https://github.com/limix-ldm/LimiX/blob/main/LimiX_Technical_Report.pdf
GitHub: https://github.com/limix-ldm/LimiX
Hugging Face: https://huggingface.co/stableai-org
ModelScope: https://modelscope.cn/organization/stable-ai
6. Conclusion
In the current wave of artificial intelligence development, large language models (LLMs) have, through large-scale pretraining, achieved a "general world model in the semantic space". How to build a "general world model in the data space" suited to the distinctive attributes of industrial data has become a key question for AI to penetrate deeper into industry. Toward this goal, developing general large models for structured data (LDMs) that span scenarios, tasks, and environments is imperative. With its rich industrial data resources and diverse application scenarios, China is well positioned to build unique "asymmetric competitiveness" in the field of LDMs. The "LimiX" large model open-sourced by the Tsinghua University team is an important breakthrough in this direction.