
70-minute in-depth conversation with Huang Tiejun: AI already has consciousness-like behaviors, and humans and AI will coexist rationally in the future

Zhidx (智东西), 2026-06-15 14:27
Huang Tiejun: The world model is not VLA, and wearable devices and brain-computer interfaces will become data sources.

On June 13, at the 2026 Zhipu AI Conference, Huang Tiejun, director of the Zhipu AI Research Institute, spoke with media outlets including Zhidx. In just over 70 minutes, he answered 24 questions on topics such as embodied intelligence, world models, data collection, and AI self-awareness.

He believes it is reasonable for enterprises today to use technologies such as VLA (Vision-Language-Action) models to solve specific scenarios. Zhipu AI, however, is pursuing general embodied intelligence: robots should be able to handle any situation autonomously, just as humans do. VLA is a combination of three models (vision, language, and action), while the world model accomplishes perception, cognition, and action prediction within a single model. There is a fundamental difference between the two.

In terms of the timeline, he gave a relatively clear expectation: In the next two to three years, robots are expected to reach the human level in daily work, but difficulties in understanding physical common sense and controlling energy consumption need to be overcome.

On the shift in data, Huang Tiejun proposed that future data collection will move from offline to real-time online collection. Wearable sensors and brain-machine data will become the core data sources for training world models and embodied intelligence.

In the field of medical AI, the cardiac AI system jointly developed by Zhipu AI and Anzhen Hospital has reached cell-level accuracy and has been applied in actual surgeries. Over the next one to three years, it will be gradually commercialized and extended to all departments.

When talking about AI consciousness and safety, Huang Tiejun said that human-like consciousness in the narrow sense has not yet emerged. From a behavioral perspective, however, AI has shown feedback similar to that of a conscious being. Regarding the risk of self-evolution, he frankly called it "feasible but uncontrollable", but he does not advocate exaggerating the danger. AI needs electricity, and humans need food; in the future, rational coexistence may be achieved.

Zhidx revised the interview content without changing its original meaning. The details are as follows:

01. VLA is a combination of three models, while the world model is integrated

Q1: Currently, many embodied intelligence enterprises are using VLA or VLM models for rapid implementation. Zhipu AI has repeatedly mentioned that the world model is the core direction. What is the basis for this judgment?

Huang Tiejun: These two things are not contradictory. Enterprises must use relatively mature technologies to solve relatively well-defined problems. So I believe using a relatively mature large-model technology like VLA is completely feasible, at least in specific scenarios such as manufacturing or handling and grasping.

However, from the perspective of a research institution, we hope that embodied intelligence will be general, able to solve any problem in any scenario, just like humans. Large language models already have a certain degree of generality, but embodied intelligence needs to see, hear, touch, and exert force in the physical environment. Robots must have their own model of the world. We can call it the world model or a subjective internal model.

The human brain is like a small universe, and each of us carries a model of the world. A robot's world model aims to build a similar understanding of the laws that govern everything around it, and this work is still at an early stage.

Q2: What position does vision occupy in the world model?

Huang Tiejun: Vision accounts for more than 80% of perceptual input. This is what textbooks say. People in computer vision generally say 70%, while those in biological vision and neuroscience, who have more scientific estimation methods, say 80%. So the vision model is definitely the major part.

Q3: From the perspective of commercialization, in which scenario is it easier to implement the world model?

Huang Tiejun: In principle, the world model exists to serve embodiment. A purely digital model application does not strictly require physical interaction, so we generally do not call it a world model. Digital models are typically used through prompts and language. A world model, however, cannot be generated from a paragraph of text alone, because that does not meet the requirements of embodiment.

A real world model for embodiment should have sensors such as eyes, ears, and tactile sensors. With as much physical input as possible, it should make accurate predictions about the future for a certain period.

So there is a fundamental difference between the two. There are many opportunities for the development of digital models without the limitation of physical costs. Embodied intelligence is limited by physical conditions and the body, so it will progress more slowly.

Q4: Foreign media believe that the world model is a must-win battleground in artificial intelligence. What are the commonalities and differences between Chinese institutions and international ones?

Huang Tiejun: Although all parties in the industry are researching the world model, their understandings of the world vary. However, the common understanding is to model the world. The mainstream technical ideas are generally similar, but each has its own focus.

Enterprises pay more attention to the actual effects and comprehensive capabilities of the model, while research institutions pursue the originality of technical methods. This kind of innovation may not be immediately reflected in performance, but it is the direction we adhere to.

Currently, we are advancing the relevant work along our own self-developed route. We cannot disclose the details for now. We look forward to eventually creating a world model with differentiated advantages and innovative highlights.

Q5: So, do you insist on an original technical route?

Huang Tiejun: We will not abandon the parts that have been proven to be feasible, but we will also use them critically. Zhipu AI will definitely have something that others do not have.

Q6: Do VLA and the world model share the same underlying architecture? Some people say that as long as the data is good enough, the model is not important. What is your opinion?

Huang Tiejun: In fact, both of these technical routes have their own reasons, but we need to dig deeper: what is the ultimate goal of each route?

Whether using VLA, the world model, or a brand-new technology in the future, data collection and modeling are inevitable steps. Raw data cannot directly drive a robot's actions, and there are many detailed aspects that need to be refined.

VLA is an architecture composed of three major modules: vision, language, and action. Simply put, VLA is a combination of three independent models working together.

The idea of the world model is completely different. It is an integrated model. All aspects, such as a robot's visual perception, auditory reception, and behavioral decision-making, are trained within the same model. It is as if the robot builds a complete awareness of its environment in its "mind" and then acts on that awareness, rather than relying on a simple combination of multiple modules. This is the core difference between the two.

02. In the next 2-3 years, robots are expected to reach the human level in daily work

Q7: Many enterprises are adopting the technical route of self-developed embodied brains. What's your opinion?

Huang Tiejun: It depends on how you define the brain. If this brain is designed to solve logistics quality inspection problems, and it does a good job, you can say it is a brain. However, it is difficult to generalize it to more scenarios. It can complete specific tasks for specific scenarios.

What we are pursuing is a general brain that serves as a foundation in the future, just as large models serve as a base today on which vertical models are built to solve problems in various fields. The general world model will play such a role, but we are not there yet.

Q8: How far are we from a general and generalized brain? What difficulties need to be overcome?

Huang Tiejun: In fact, there is no end because the brain has endless needs. For example, mastering physical laws, such as an object breaking when it falls, can be learned through videos and data queries.

But what is the world? It is not just these simple changes and actions. The world is very complex. If we trace it back to the most basic level, from atomic interactions to molecular interactions, protein interactions, and then human-to-human interactions, all kinds of situations can occur. In that sense, I think it will take a long time to develop a world model, because humans themselves are still exploring the world.

In the near term, I think the most direct reference point is to be like a human. I don't mean scientists, but the common-sense abilities of a person doing physical work in the real world. This is also very difficult, but it is still possible to develop something comparable to the human level in daily work in the next two to three years.

In addition, we hope that the sensitivity and accuracy can be comparable to those of a human. Humans are actually organisms with low power consumption. They can do a lot of work by eating three meals a day. When we look at the world, we don't process all the things we see in our brains. We must be selective. Now AI emphasizes the attention mechanism, focusing on important and relevant things.

Of course, I'm talking about extreme situations. For example, when it is completely dark at night and a single photon flashes, the human eye can perceive it because it may mean danger. At that moment, your brain shouldn't behave like a current camera, taking in and computing over a one-million-pixel image all at once. It should only fire a single neuron, which then triggers a series of reactions in the brain.

Robots in the next two to three years should also have such abilities, rather than wastefully processing thirty images per second, each with one million pixels. On the one hand, the computational cost is too high, and on the other hand, the sensitivity is insufficient. From the perspective of the world model, there is a lot of room for optimization.
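The cost argument above can be made concrete with a small calculation. The sketch below is only an illustration and does not come from the interview: the function names `frame_based_load` and `changed_pixels` are hypothetical, and the "dark scene" is a toy list of brightness values. It uses the figures quoted above (thirty frames per second, one million pixels each) and contrasts them with the single change that a flash in a dark scene actually produces.

```python
def frame_based_load(fps: int, pixels_per_frame: int) -> int:
    """Pixel values processed per second when every frame is read in full."""
    return fps * pixels_per_frame


def changed_pixels(prev: list[int], curr: list[int], threshold: int = 0) -> list[int]:
    """Indices of pixels whose brightness changed by more than `threshold`.

    Note: this software version still scans every pixel. It only illustrates
    how little information changes between frames; an event-driven sensor
    would report such changes without a full scan.
    """
    return [i for i, (a, b) in enumerate(zip(prev, curr)) if abs(b - a) > threshold]


# Figures quoted in the interview: 30 frames per second, one million pixels each.
full_load = frame_based_load(fps=30, pixels_per_frame=1_000_000)

# Toy dark scene (an assumption for illustration): one pixel lights up.
dark = [0] * 1_000_000
flash = dark.copy()
flash[123_456] = 255
events = changed_pixels(dark, flash)

print(full_load)  # 30000000
print(events)     # [123456]
```

Under these assumptions, a frame-based pipeline handles thirty million pixel values per second, while the only informative signal in the toy scene is a single changed pixel.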

Q9: What is the main reason why this optimization has not achieved the desired effect?

Huang Tiejun: Although artificial intelligence is developing rapidly, a lot of optimization work has not been carried out. People are seizing the opportunity to do what they can. For example, if they can collect pictures and videos, they use them for training. They haven't reached the stage of carefully considering how to express visual signals and how to calculate more effectively. This work has just started.

Q10: What proportion of a robot's judgment comes from autonomous thinking? Once embodied intelligence uses the world model as its base, how should it handle hard-to-predict and unpredictable situations?

Huang Tiejun: People generally pay attention to the risks brought by robots and agents acting in the physical world, and this attention is very necessary. Our core idea is clear: We will never let machines act autonomously. Their actions must be restricted within a framework of rules.

The entire process of a machine's perception, actions, and state transitions can be monitored and controlled. Its predictions and behavior iterations are carried out by chips and software. The behavior chain is clear and controllable, and the machine won't harbor deep-seated thoughts such as launching an autonomous attack. There is room for intervention and correction in every calculation and state update, just as a person can be stopped in time before taking action.

Of course, machines do not have human rationality or legal awareness, so accompanying safety safeguards are essential. We can monitor the entire process. Its perception information and action intentions are completely transparent.

03. Smart wearables and brain-machine interfaces are future data sources. We can't rely only on static data sets

Q11: What will be the important data sources for the world model in the future?

Huang Tiejun: Living organisms evolve through interaction with the environment, while traditional AI relies on offline data for modeling. However, data by itself can only describe the environment from one side, and the static offline data collection mode is no longer suited to current technological development.

The core logic of developing embodied intelligence and the world model will completely change: We can't rely only on static data sets. We need a large amount of real-time, online interaction data. This is the same as human learning. Books are static knowledge. To grow, we need to perceive and interact with the outside world in real time and iterate our cognitive models based on feedback. Therefore, real-time and interactive data will become the key to future embodied models.

At the same time, the data collection mode must also be innovated. The core is to balance cost and practicality. At present, the mode of remotely controlling robots to collect data is too costly and unrealistic. The optimal solution is to collect data synchronously during people's normal work and life.

The simplest way is to rely on wearable devices such as smart earphones and smart glasses to record audio-visual data from the user's first-person perspective. In this mode, users voluntarily contribute data in exchange for high-quality services from intelligent agents. It is low-cost and efficient, and works on the same principle as autonomous vehicles collecting data while driving.

In addition, brain-machine interfaces are also an important path. Currently, the data generated by disabled people using brain-machine devices to complete actions is of very high quality.

Q12: Is there a sequence in the development of data collection and data processing technologies?

Huang Tiejun: Take Newton and Einstein as examples. They didn't conduct research without data. Before Newton proposed the law of universal gravitation, telescopes had already been invented, and humans had accumulated a large amount of astronomical observation data. What was lacking at that time was someone to summarize these phenomena into a complete theory. The same is true for Einstein's theory of relativity. Physics already had a large body of research results and experimental data, but many phenomena could not be explained. It was Einstein who redefined the concept of time, making all the contradictory data self-consistent.

So, these two great theories were not created out of thin air. Nowadays, the purpose of collecting data for embodied intelligence is different from that in the past. It is mainly to model the objective world. Whether we can extract more abstract and advanced theories from a large amount of data is something to be explored in the future. I think there is every chance of achieving this eventually, but it is not our goal at present.

It is just like how many people know, without studying physics, that an object will break when it falls, yet do not understand the law of universal gravitation behind it. The current world model is learning these objective physical regularities, but it has not yet been able to condense them into a concise expression like the laws of classical physics.

Q13: In terms of data collection and feedback, different enterprises take different routes. What data approach does Zhipu AI adopt, and how does it form a closed loop?

Huang Tiejun: Technical implementation strategies in the industry vary with the deployment scenario. Currently, Zhipu AI and Galaxy General are jointly building a laboratory. The main research direction is very practical, and all R&D is closely tied to actual products.

The implementation idea is clear: Rely on the main device to collect sufficient data in a specific scenario. The process will definitely require time and cost, but as long as the robot's capabilities are refined to a commercial level and a complete business closed loop is established, the goal is achieved. This is also the mainstream choice for most embodied intelligence enterprises at present. Low-cost or zero-cost data collection solutions are more a direction for our future exploration.

Take the table tennis robot as an example. There are two ideas for data collection. In the early stage, preliminary experiments can be carried out using materials such as animated images, and the core data mainly comes from two channels. First, let two small robots play against each other autonomously without human intervention, and data can be continuously accumulated only by consuming the device's power.

The second is also the direction of our subsequent planning: Once the robot's level far exceeds that of ordinary enthusiasts, it can be deployed in venues, campuses, and similar places to serve as a training partner. Users can play against it directly. This process can complete data collection at zero cost, or even generate revenue.

From this, we can see that when embodied intelligence truly enters real-world application scenarios, there is every opportunity to explore a low-cost data collection model.

04. The cardiac AI with cell-level accuracy has been used in surgeries. Papers are a product of the old era

Q14: The cooperation between Zhipu AI and hospitals in cardiac medical treatment has achieved very mature results. How long will it take to promote it nationwide?

Huang Tiejun: This technology covers the entire process from consultation and diagnosis through surgery to postoperative rehabilitation. It is not just an ordinary intelligent information system but a high-precision simulated digital twin that can faithfully reproduce the entire process of cardiac diagnosis and treatment, with accuracy refined down to the interactions between myocardial cells.

The cardiac AI jointly developed with Anzhen