
New work by Turing Award winner Sutton: The next step for AI is to move toward "enactive cognition"

MachineHeart (机器之心), 2026-06-02 17:41
The world is its own best model.

From the ultra-long-context processing of LLMs and increasingly realistic video generation models, to the growing maturity of agents' autonomous planning and execution and the entry of vision-language-action (VLA) models and world models into the physical world, AI keeps expanding its capabilities.

Model release cycles keep shortening, and industry news and technical discussion remain heated. In this forward-moving atmosphere, AGI seems very close.

However, one question remains unanswered: do these AI systems running on servers really "understand" the world? Put differently, is the intelligence they exhibit fundamentally the same as the cognition that living organisms show in the real physical world?

Recently, researcher Banafsheh Rafiee and Richard S. Sutton, the father of reinforcement learning, co-wrote a paper. It systematically reflects on and criticizes the "passive representation" approach that current mainstream artificial intelligence relies on (including large language models, pure vision models, and even traditional symbolic systems), and it introduces the "enactive cognition" framework from cognitive science into AI.

The paper argues that perception, cognition, and action form an inseparable, mutually constructed whole. It explores how AI can move from being a passive information-processing system that relies on static data to being an intelligent agent that gains experience through environmental interaction, embodied action, and self-evaluation.

Paper title: Toward Enactive Artificial Intelligence

The world is its own best model

Much of current mainstream AI development still follows a classic idea called "representationalism".

In the traditional artificial intelligence paradigm, whether in early symbolic systems or in today's deep learning models, perception is usually understood as a linear process of "input first, then processing, then action": the system receives external signals, turns them into internal representations, reasons and makes decisions based on those representations, and finally outputs actions.

From this perspective, an intelligent system is like a central processing unit. It needs to build as accurate a "copy of the world" as possible internally, and the success of perception depends on whether this internal model faithfully reproduces external reality.

However, Rafiee and Sutton pointed out that this approach has fundamental limitations. The real world is open, dynamic, and infinitely complex. No finite internal model can fully capture all its states. The world is not a set of static features waiting to be encoded, but a space of possibilities that changes constantly with the actions of the agent, the context, and the interaction history.

The paper therefore invokes a famous line from roboticist Rodney Brooks: "The world is its own best model."

This means that the most reliable, up-to-date, and richest information is not inside the agent but always in the external world. The agent should not try to replace reality entirely with internal representations. Instead, it should stay in continuous interaction with the environment, adjusting its actions, calibrating its expectations, and forming its understanding from real-time feedback.
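The contrast between acting from an internal copy and acting from continuous feedback can be illustrated with a toy sketch. This is not code from the paper: the one-dimensional reaching task, the function names, the `gain` parameter, and the constant unmodelled push are all assumptions made for illustration.

```python
def open_loop_reach(start, target, disturbances):
    """Plan every step once from an initial snapshot, then execute blindly."""
    steps = len(disturbances)
    planned_step = (target - start) / steps  # plan made from the internal copy
    position = start
    for push in disturbances:
        position += planned_step + push      # the world pushes back; the plan never notices
    return abs(target - position)


def closed_loop_reach(start, target, disturbances, gain=0.5):
    """Re-read the world before every step and correct the remaining error."""
    position = start
    for push in disturbances:
        error = target - position            # fresh reading taken from the world itself
        position += gain * error + push
    return abs(target - position)


# Hypothetical scenario: an unmodelled push of 0.1 per step, always in the same direction.
pushes = [0.1] * 20
open_error = open_loop_reach(0.0, 10.0, pushes)
closed_error = closed_loop_reach(0.0, 10.0, pushes)
print(f"open-loop error:   {open_error:.3f}")
print(f"closed-loop error: {closed_error:.3f}")
```

In this toy setting the open-loop error accumulates with every step, while the closed-loop error stays bounded because each move starts from a new reading of the world rather than from the original plan.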

AI should not only "see the world" but also "understand the world in action"

"Enactive cognition" comes from enactivism in cognitive science. Its core idea is that cognition is not an internal replication of a pre-existing objective world but arises in the interaction between an embodied subject and its environment.

It draws on phenomenology, Gestalt psychology, and ecological psychology. Phenomenology emphasizes that perception is not a reconstruction of the world in the mind but a direct encounter with the world in the subject's lived experience. Gibson's ecological psychology proposes the concept of "affordance": whether an object in the environment is "graspable", "climbable", or "passable" depends on its relationship to the specific physical abilities of the agent.

That is to say, the world is not passively presented to the agent in the form of abstract features but becomes meaningful in the actions that the agent can take.

Bringing these ideas into AI, Rafiee and Sutton extract four key pillars: experience, the inseparability of perception and action, autonomy, and embodiment. All four point to the same judgment: intelligence is not a static representation of the world but a process of acting, receiving feedback, and self-maintaining in an environment.

Experience

In the enactive cognition framework, experience is not equivalent to data. Real experience comes from continuous, real-time, mutually influencing interaction between the agent and the environment. The agent does not passively receive existing data but continuously acquires skills through action, feedback, failure, and correction.

This also reveals a limitation of current mainstream machine learning. Supervised learning relies on data collected and labeled in advance by humans, so the model learns only the traces left by experience, not the experience itself. Reinforcement learning, in contrast, is closer to what enactive cognition requires: by actively exploring the environment, receiving feedback, and adjusting its policy, the agent continuously generates new data and new abilities through interaction.

In other words, a truly autonomous system cannot always rely on the static datasets prepared by humans but must be able to expand its capabilities through its own experience.
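A minimal sketch of what "learning from its own experience" can look like in code is an epsilon-greedy agent that has no dataset at all and only sees the consequences of its own choices. The `HiddenPayoffWorld` class, the `learn_by_interaction` function, and the payoff values are hypothetical names and numbers chosen for illustration, not anything specified in the paper.

```python
import random


class HiddenPayoffWorld:
    """Toy environment: each action has a fixed payoff the agent cannot read directly."""

    def __init__(self, payoffs):
        self._payoffs = list(payoffs)
        self.n_actions = len(self._payoffs)

    def step(self, action):
        return self._payoffs[action]


def learn_by_interaction(world, steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy learning: every data point is produced by the agent's own action."""
    rng = random.Random(seed)
    estimates = [0.0] * world.n_actions
    counts = [0] * world.n_actions
    experience = []  # grows only as the agent acts; nothing is pre-collected
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(world.n_actions)  # explore
        else:
            action = max(range(world.n_actions), key=lambda a: estimates[a])  # exploit
        reward = world.step(action)
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        experience.append((action, reward))
    return estimates, experience


world = HiddenPayoffWorld([0.2, 0.5, 0.8])
estimates, experience = learn_by_interaction(world)
print("estimated payoffs:", estimates)
print("interactions recorded:", len(experience))
```

The point of the sketch is the data flow rather than the algorithm: the `experience` list does not exist before the agent starts acting, and which entries it contains depends on what the agent chose to try.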

The inseparability of perception and action

Enactive cognition opposes splitting perception and action into two independent modules. Perception is not a preparatory step before action; perception is itself a capacity for action.

Humans do not passively receive images. We constantly change the input through the movement of our eyes, head, body, and hands, and then judge the space, sound, texture, and object shape. That is to say, perception is not waiting for information to enter the brain but revealing the environmental structure through purposeful actions.

This is especially relevant to today's video generation models. A purely observational system may learn a large number of visual regularities, such as how objects move or the order in which traffic lights change, but this does not amount to a real understanding of the physical world. When the environment behaves anomalously, such systems often lack the ability to actively intervene, experiment, and correct.

Enactive cognition emphasizes exactly this point: the agent should not only predict how the world changes but also be able to change the world through actions and form understanding in the feedback.

Autonomy

Enactive cognition holds that an agent is not a machine that simply responds to external stimuli but a self-organizing and self-maintaining system. Things in the environment are meaningful because they relate to the agent's own goals, needs, and continued existence.

This means that the agent needs to have some internal standards of success and failure. Food, obstacles, and energy are important not because they are inherently important but because they affect whether the agent can continue to act, maintain its state, or achieve its goals.

From this perspective, many current AI systems still lack true autonomy. Supervised learning relies on external labels, large language models mainly imitate patterns in human data, and the goals of traditional planning systems are mostly preset by humans. Although reinforcement learning introduces behavior evaluation through the reward mechanism, most reward functions are still specified by external designers and do not emerge naturally from the agent's own self-maintaining process.

Therefore, current AI is still far from true autonomy.
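One way to picture an internal standard of success and failure is a reward computed from the agent's own state instead of being handed over by a designer. The sketch below is an assumption-laden illustration, not the paper's proposal: `InternalState`, `self_evaluated_reward`, the energy variable, and the setpoint of 1.0 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InternalState:
    energy: float          # current internal energy level
    setpoint: float = 1.0  # level the agent needs in order to keep acting

    def drive(self):
        """Distance from the viable state; larger means closer to failure."""
        return abs(self.setpoint - self.energy)


def self_evaluated_reward(state, energy_change):
    """Apply an energy change and score it by how much it reduces the drive."""
    before = state.drive()
    state.energy += energy_change
    return before - state.drive()


# The same event (gaining 0.3 energy) is valued differently depending on the agent's state.
hungry = InternalState(energy=0.4)
sated = InternalState(energy=1.0)
print("reward when depleted:", self_evaluated_reward(hungry, 0.3))
print("reward when at setpoint:", self_evaluated_reward(sated, 0.3))
```

Here the same amount of food is rewarding for a depleted agent and mildly costly for one already at its setpoint, which mirrors the claim that food matters only through its effect on the agent's ability to keep going. The setpoint itself is still chosen by a designer in this sketch, so it only gestures at the gap the paper describes.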

Embodiment

The last pillar of enactive cognition is embodiment. The body is not just an execution tool used after the intelligent system has finished reasoning; it is a prerequisite for perceiving and understanding the world.

The shape of the body, the position of sensors, the movement ability, and the action mode directly determine how the agent explores the environment and how the world presents meaning to it. The same chair is "sittable" for humans, a huge obstacle for ants, and for a robot, it depends on whether it has the corresponding height, joint structure, and control ability.
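The chair example can be written down as a function of both the object and the body, which is the sense in which an affordance is a relation and not a property of the object alone. Everything in this sketch is hypothetical: the `Embodiment` and `Chair` types, the `chair_affordance` function, and the 0.4 and 1.1 ratio thresholds are arbitrary illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    name: str
    leg_length_m: float  # rough measure of body scale
    can_sit: bool        # has the joints and control needed for a seated posture


@dataclass(frozen=True)
class Chair:
    seat_height_m: float


def chair_affordance(body, chair):
    """Classify what the chair offers this particular body.

    The 0.4 and 1.1 ratios are arbitrary illustrative thresholds.
    """
    ratio = chair.seat_height_m / body.leg_length_m
    if ratio > 1.1:
        return "obstacle"  # the seat is higher than the legs can reach
    if body.can_sit and ratio >= 0.4:
        return "sittable"
    return "not sittable"


chair = Chair(seat_height_m=0.45)
bodies = [
    Embodiment("human", leg_length_m=0.9, can_sit=True),
    Embodiment("ant", leg_length_m=0.003, can_sit=False),
    Embodiment("rigid-legged robot", leg_length_m=0.9, can_sit=False),
]
for body in bodies:
    print(body.name, "->", chair_affordance(body, chair))
```

The chair object never changes across the three calls; only the body does, and with it the answer.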

This explains why many mainstream AI systems are still "disembodied". They can process large amounts of text, images, and video, but they cannot change their perceptual input through their own movement, and they cannot actively explore and adapt to changes in the real environment.

Even in the field of robotics, many systems still split perception, planning, and control into independent modules. The body is just a hardware platform for executing strategies, not a core condition for shaping cognition itself.

The next step for reinforcement learning?

Across the four dimensions of experience, perception-action coupling, autonomy, and embodiment, Rafiee and Sutton reach a clear judgment on the current AI paradigm: mainstream AI, especially large language models and pure vision models, still operates mainly at the level of passive representation and pattern prediction.

They can generate extremely realistic text, images, or videos and show strong reasoning and planning abilities in complex tasks. However, as long as they lack continuous interaction with the environment, evaluation based on the consequences of their own actions, and a truly embodied exploration process, there is still a key gap between them and "understanding the world".

In contrast, there is a stronger structural resonance between reinforcement learning and enactive cognition. RL emphasizes action, feedback, exploration, adaptation, and long-term evaluation, which makes it the AI branch closest to the concept of enactive cognition.

However, closeness does not mean equivalence. Current reinforcement learning still has three deficiencies. First, most reward functions are specified externally and do not come from the agent's own self-maintenance and organizational structure. Second, perception and action are still split into relatively independent steps in many systems. Third, embodiment is often treated as an engineering constraint and not as the basis on which cognition forms.

Reinforcement learning therefore also needs to evolve further: from external rewards to more internal self-evaluation, from being task-driven to continuous survival and adaptation, and from merely optimizing policies to genuinely generating embodied experience.

This article is from the WeChat official account "MachineHeart" (ID: almosthuman2014). The author is concerned with academia. It is published by 36Kr with authorization.