
"The father of reinforcement learning" Sutton teams up with "the father of Doom" Carmack: Let robots enter the real world to play games

MachineHeart (机器之心), 2026-06-21 15:13
Robots are also coming to play games in the real world...

At the beginning of 2026, in a shopping mall in Chengdu, a humanoid robot performing on stage accidentally collided with an elderly onlooker. Both fell to the ground, and the elderly person was immediately taken to the hospital and diagnosed with a soft-tissue contusion. After the incident, Fu Sheng, the chairman of Cheetah Mobile, publicly commented that this was neither the first time a humanoid robot had injured someone, nor would it be the last. Given the current capabilities of large models, it would be difficult to properly address the safety issues of humanoid robots within the next two or three years.

In fact, robots keep running into problems in real-world use. Such incidents show that demonstrations that run smoothly in laboratories and at product launches often become unpredictable once they move into the real world.

Behind this lies a deeper rule: teaching an AI to perform a task in a simulator is completely different from getting it to perform the same task stably in the real world. The gap is often greater than expected.

Even with the same algorithm and task, slight differences between the simulated and real environments (such as lighting, ground friction, and the tolerances of the robot's own body) can cause a well-trained policy to fail instantly.

While the humanoid robot industry was repeatedly paying the price for the problem of "standing stably", Keen Technologies, led by legendary programmer John Carmack, published a paper in collaboration with researchers from the University of Alberta and the Openmind Institute. The paper approaches the issue from a more fundamental angle: can reinforcement learning algorithms learn independently in the real world for an extended period, without human supervision and without the expectation of immediate success?

To answer this question, they built a system specifically for "playing Atari games". This system is called Physical Atari.

The "Real-World" Challenge of Reinforcement Learning

Atari games are well-known in the AI community. As early as 2013, DeepMind used a deep reinforcement learning algorithm to learn to play Atari games in a simulator. This was regarded as one of the landmark moments in the rise of deep reinforcement learning. Since then, a series of classic algorithms such as Rainbow and MuZero have also used Atari games as a standard testing ground. However, all these testing grounds are simulators: the game world waits patiently for the algorithm to make a decision before proceeding.

The real world is completely different. When you are driving and something happens ahead, the car keeps moving forward even while you are still deciding whether to brake. The world doesn't wait for you.

The paper refers to this "the world doesn't wait for you" setting as "real-time reinforcement learning" and points out that this is exactly the situation robots face.
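The difference between the two settings can be illustrated with a toy sketch. This is not the paper's code: the `World` class, `decide_and_act`, and the "one unit per frame" motion are hypothetical names and numbers chosen only to show how an observation goes stale while the agent is still computing.

```python
from dataclasses import dataclass


@dataclass
class World:
    """Toy world (hypothetical): an object moves one unit per frame."""
    frame: int = 0
    position: int = 0

    def tick(self) -> None:
        self.frame += 1
        self.position += 1


def decide_and_act(world: World, think_frames: int, real_time: bool) -> int:
    """Return how far the object moved between observation and action."""
    observed = world.position  # the agent looks at the world
    if real_time:
        # Real-time setting: the world keeps running while the agent computes.
        for _ in range(think_frames):
            world.tick()
    # Turn-based setting: the world stays frozen until the action arrives.
    gap = world.position - observed
    world.tick()  # the action is applied on the next frame
    return gap


print(decide_and_act(World(), think_frames=3, real_time=False))  # 0
print(decide_and_act(World(), think_frames=3, real_time=True))   # 3
```

In the simulator-style case the action always lands on the state the agent saw. In the real-time case it lands on a state that has already moved on, so the agent has to learn to act on slightly outdated information.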

Currently, there are mainly three ways to train AI in the robotics field:

The first is training in a simulator and then transferring the result to real robots. This is the mainstream approach of most humanoid robot manufacturers today. However, the differences between the simulator and the real world are a root cause of the kind of incident described above.

The second is collecting a large amount of demonstration data by remotely controlling robots with human operators and then training offline with this data.

The third, and the least-traveled path, is letting robots learn directly in the real world while performing tasks.

The third approach eliminates the cost of building a simulator and hiring people to collect data, and fundamentally avoids the long-standing problem of "the difference between the simulator and the real world". The price is that you need a robot that is durable enough to withstand weeks of continuous, high-intensity operation and inexpensive enough for ordinary researchers to afford.

Physical Atari is designed to fill this gap.

Team Introduction

The first author of this team is Khurram Javed, a research scientist at Keen Technologies.

The names of two great figures also appear in the author list: John Carmack and Richard S. Sutton.

Carmack is a co-founder of id Software. He led the development of epoch-making games such as "Doom" and "Quake", and his 3D graphics techniques have made their way into computer graphics textbooks. After joining Oculus as CTO in 2013, he helped turn virtual reality from a concept into a mass-produced product.

In 2022, he left Oculus and founded Keen Technologies, targeting artificial general intelligence (AGI).

The following year, he invited Richard S. Sutton, one of the founders of the reinforcement learning field and a professor at the University of Alberta, to join Keen Technologies. Since then, the two have been focusing on researching agents that can continuously learn and adapt in the real world.

Sutton himself is also one of the authors of this paper. This means that the robotic hand is not only a hands-on project of the engineering team but also directly reflects the judgment of a founder of reinforcement learning theory on "how agents should learn".

Physical Atari is a specific implementation of this concept: Rather than talking about "agents should learn in the real world" in papers, it's better to build the hardware first and let the algorithm run in practice.

How to Build a "Robotic Hand" to Play Games

The entire system has only two core components. One is called Atari Devbox, which is essentially a Raspberry Pi 5 enclosed in a 3D-printed case. It is connected to a 5-inch screen and runs the classic Arcade Learning Environment simulator, rendering Atari game screens at 60 frames per second.

The other is called Robotroller, a robotic hand specifically designed to operate a real joystick. It doesn't touch any circuits or code. Instead, it holds an unmodified Atari CX40+ joystick just like a human, and uses three servo motors to control the joystick's up-down movement, its left-right movement, and the fire button respectively.

A camera captures the game screen, and a computer running the reinforcement learning algorithm makes decisions based on the screen. Then, it sends instructions to the Robotroller, which is responsible for translating these decisions into real hand movements.

The key to this design is that the AI interacts with the game in the most basic human-like way, "watching the screen and moving the joystick", without any shortcuts. It can therefore reuse the game's native mechanisms directly, with no need to build additional simulation interfaces.

Although it sounds simple, a large part of the paper actually focuses on "how to ensure that a robotic hand doesn't break down within weeks".

The first problem the researchers encountered was that the screws would loosen. The solution was to use thread-locking adhesive. Then they found that the plastic gears inside the servo motors would wear out, so they switched to servo motors with metal gears. Later, they discovered that the joystick itself was being worn out by the robotic hand. Investigation showed that the motor's movements were too "violent", putting unnecessary stress on the joystick, so the team retuned the control parameters to make the movements gentler.

The most interesting fix was a "high-current reflex" mechanism that the researchers added to the servo motors. Once the current of a motor exceeds a set threshold (usually a sign that it is stuck or has reached a hard limit), the system immediately stops the motor, releases the torque, and then re-locks it. This works much like a protective tendon reflex in the human body, and it prevents the motor from burning out under excessive load.
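The logic of such a reflex can be sketched as a small state machine. This is an illustration only, not the team's firmware: the `ServoCommand` names, the milliamp units, and the 800 mA threshold are assumptions made for the example, since the article does not give the actual values or the servo interface.

```python
from enum import Enum
from typing import List, Tuple


class ServoCommand(Enum):
    """Hypothetical command set for one servo."""
    HOLD = "hold"
    RELEASE_TORQUE = "release_torque"
    RELOCK = "relock"


def reflex_step(current_ma: float, threshold_ma: float,
                released: bool) -> Tuple[ServoCommand, bool]:
    """Decide the next command from one current reading.

    Returns the command and whether torque is released afterwards.
    """
    if released:
        # Torque was dropped on the previous step, so re-lock now.
        return ServoCommand.RELOCK, False
    if current_ma > threshold_ma:
        # Over-current: likely stuck or at a hard limit, so let go.
        return ServoCommand.RELEASE_TORQUE, True
    return ServoCommand.HOLD, False


def run_reflex(readings_ma: List[float], threshold_ma: float) -> List[ServoCommand]:
    """Apply the reflex to a sequence of current readings."""
    released = False
    commands = []
    for current in readings_ma:
        command, released = reflex_step(current, threshold_ma, released)
        commands.append(command)
    return commands


# Assumed readings in mA; the third one exceeds the assumed threshold.
print([c.value for c in run_reflex([200, 250, 900, 300], threshold_ma=800)])
# ['hold', 'hold', 'release_torque', 'relock']
```

If the current is still above the threshold after re-locking, the next reading triggers another release, so in this sketch a jammed motor never holds full torque for long.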

This mechanism may seem insignificant, but it is a crucial part of ensuring that the entire system can run continuously for weeks without failure.

As for the "reward signal" (game score), the team didn't pass it along behind the scenes via network cables or code. Instead, they made the Devbox screen display a set of AprilTag visual markers in sync with the game, and the camera directly "reads" whether the score is increasing or decreasing.
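One way to turn such readings into a reward is to take the change in score between consecutive frames. The sketch below assumes the tags have already been decoded into an integer score, or `None` when the camera could not read them in a given frame. The article does not describe how the score is encoded in the tags, so the decoding step is left out, and `rewards_from_scores` is a hypothetical helper rather than anything from the paper.

```python
from typing import Iterable, Iterator, Optional


def rewards_from_scores(readings: Iterable[Optional[int]]) -> Iterator[int]:
    """Yield one reward per frame as the change in the decoded score.

    `readings` holds the score decoded from the camera image for each
    frame, or None when no score could be read (an assumed failure mode).
    """
    last: Optional[int] = None
    for score in readings:
        if score is None:
            # Nothing readable this frame: give no reward, keep the last score.
            yield 0
            continue
        yield 0 if last is None else score - last
        last = score


print(list(rewards_from_scores([0, 0, None, 10, 10, 5])))
# [0, 0, 0, 10, 0, -5]
```

Keeping the last successfully read score means a missed frame delays a reward instead of losing it.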

In other words, the way this robot perceives the world, from the game screen to the score, is all accomplished through the single channel of the camera, which is essentially the same as how humans play games.

The total hardware cost is kept under $1,000. The parts that need to be purchased for the Robotroller itself (servo motors, bearings, screws, etc.) cost about $400, and the custom parts can be printed on an ordinary consumer-grade 3D printer in about 12 hours.

A Real Robot Played Games for 145 Hours

The researchers let the system learn to play six games: Pong, Seaquest, Ms Pacman, Assault, Asterix, and Kangaroo, for five and a half hours each. Each game was repeated 4 to 5 times.

In total, these experiments ran for nearly 145 hours without any human intervention: no one helped it up, and no one restarted it. The robotic hand pressed the joystick repeatedly and gradually learned how to increase the game score.
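The 145-hour figure can be checked against the numbers given above (six games, five and a half hours per run, 4 to 5 runs per game). The short calculation below uses only those figures; the variable and function names are illustrative.

```python
GAMES = 6            # Pong, Seaquest, Ms Pacman, Assault, Asterix, Kangaroo
HOURS_PER_RUN = 5.5  # length of one learning run
TOTAL_HOURS = 145    # total reported in the article


def total_hours(runs_per_game: int) -> float:
    """Total running time if every game gets the same number of runs."""
    return GAMES * runs_per_game * HOURS_PER_RUN


lower = total_hours(4)  # every game run 4 times
upper = total_hours(5)  # every game run 5 times
implied_runs = TOTAL_HOURS / (GAMES * HOURS_PER_RUN)

print(lower, upper, round(implied_runs, 1))  # 132.0 165.0 4.4
```

The reported total falls between the two bounds and corresponds to an average of roughly 4.4 runs per game, which is consistent with "4 to 5 times".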

Another set of experiments deserves even more attention. The researchers first let an agent learn on one Robotroller for 6 hours, then deployed the trained policy both on the original robot and on another robot "built according to the same blueprint" for testing.

The result: even when two robots are built from exactly the same design blueprint and parts, the policy always performs significantly worse on the "unfamiliar body".

In Pong, which requires precise timing, this gap is particularly obvious. The policy deployed on the new machine can recognize the direction of the ball and move the paddle in the correct direction, but always misses the ball by a small margin. Even the slight tolerances between parts of the same model are enough to throw off actions that were originally well timed.

Game screens of Pong and Kangaroo

The researchers then let the agent continue to learn on the "unfamiliar body", and the policy's performance gradually recovered, approaching the level before the body change.

This set of control experiments indirectly confirms a judgment repeatedly emphasized in the paper: Even a small difference like "switching to another robot of the same model" between training and deployment is enough to drag down the performance. Continuously learning directly on the target body is the most direct way to correct this deviation.

The end-to-end response delay of the entire system is about 165 milliseconds, which is roughly within the range of human reaction speed. This indicates that the "reaction ability" of the hardware itself is not the bottleneck, and that the problem indeed lies in the match between the policy and the body.
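To put that delay in terms of the game, it can be converted into frames at the 60 frames per second the Devbox renders. The helper name below is illustrative, and the article does not say how the 165 milliseconds split across camera, computation, and servos, so only the total is used.

```python
def latency_in_frames(latency_ms: float, fps: int) -> float:
    """Number of rendered frames that pass during the given delay."""
    return latency_ms * fps / 1000


frames = latency_in_frames(165, 60)
print(f"{frames:.1f} frames")  # 9.9 frames
```

An action therefore takes effect about ten frames after the image it was based on, so the policy has to time its moves for where the game will be rather than where it was when the image was captured.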

Conclusion

Physical Atari doesn't aim to teach robots to walk or fold clothes. It addresses a more fundamental need: anyone who wants to verify whether robots can learn independently in the real world now has a set of experimental equipment that is inexpensive, durable, and reproducible. Compared with demonstrating a well-tuned set of actions on a product launch stage, running continuously for 145 hours in the real world without human intervention may be a more straightforward standard for testing the reliability of a reinforcement learning algorithm.

This article is from the WeChat official account “MachineHeart” (ID: almosthuman2014). The author is "Gamer". It is published by 36Kr with authorization.