
Real-robot RL: the most powerful VLA model, π*0.6, is here, and robots have opened a coffee shop in the office.

机器之心 · 2025-11-18 12:03
Trained on the VLA's own behavior, the new method significantly improves the success rate and efficiency of embodied intelligence.

What level of capability does embodied intelligence trained entirely on real-world data possess?

This week, the US-based embodied intelligence startup Physical Intelligence (abbreviated as PI or π) released its latest robot foundation model, π*0.6.

PI is a robotics and AI startup headquartered in San Francisco. Its mission is to bring general artificial intelligence from the digital world to the physical world. Their first general foundation model for robots is named π₀, which enables a single set of software to control multiple physical platforms to perform various tasks.

In 2024, PI received over $400 million in financing, and its valuation exceeded $2 billion, making it one of the most prominent players in the embodied intelligence field.

PI's technical approach centers on Vision-Language-Action (VLA) models: by training on large-scale robot perception and action data, it develops policies that generalize, allowing robots to execute tasks flexibly in unfamiliar environments rather than being limited to pre-programmed actions.

Sergey Levine, a well-known expert in machine learning, decision making, and control, an associate professor at UC Berkeley, and a co-founder of Physical Intelligence, said that robots equipped with this model can already make lattes, Americanos, and espressos in the company's office.

Levine said that, with fine-tuning, the π*0.6 model performs well across a variety of tasks: aside from clothes-handling tasks, it achieves a success rate of around 90%, and its task throughput has also improved substantially.

In a blog post, Physical Intelligence's engineers detailed the mechanism and performance of π*0.6.

Think about it. What steps are needed to assemble a cardboard box?

For a human to do this quickly and efficiently, the first step would probably be to have someone teach you the basics: which methods work, what the common mistakes are, and what the correct techniques look like. Second, a good teacher not only demonstrates the task but also coaches you, correcting the mistakes you make when you try it yourself. Guidance alone is still not enough, though. The third step to mastering cardboard-box assembly is practice: repeating the task until it becomes second nature.

In the past year, many of the remarkable results we have seen in robot learning used only the first step: training robots with human-provided demonstrations. With that step alone, it is not hard to get a robot to complete a task successfully about half the time, but it is very difficult to make it succeed every time, let alone reach human-level efficiency on complex real-world tasks. This is a big problem, because real-world robot tasks require a system that operates reliably and quickly.

Based on this thinking, Physical Intelligence developed a method called Recap (Reinforcement Learning with Experience and Corrections via Advantage-Conditioned Policies), which implements all three steps: training robots with demonstrations, coaching them with corrections, and letting them improve from autonomous experience. The authors used Recap to improve the latest version of their Vision-Language-Action (VLA) model, π0.6, enabling it to perform complex tasks robustly and efficiently, such as making espresso, assembling cardboard boxes, and folding various types of clothes.
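
To make the three ingredients concrete, here is a minimal sketch of how the three kinds of data could be gathered into one training set. The `Transition` structure and function names are illustrative assumptions, not Physical Intelligence's actual code or data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    observation: object    # e.g. camera images, proprioception, language prompt
    action: object         # commanded robot action
    source: str            # "demo" | "correction" | "autonomous"
    task_succeeded: bool   # episode-level outcome, used as the reward signal

def merge_recap_data(demos: List[Transition],
                     corrections: List[Transition],
                     rollouts: List[Transition]) -> List[Transition]:
    """Keep all three kinds of experience in one training set;
    what differs is how each transition is labeled, not whether it is kept."""
    return demos + corrections + rollouts

# Toy usage: one stand-in transition from each source.
data = merge_recap_data(
    demos=[Transition("obs_demo", "act_demo", "demo", True)],
    corrections=[Transition("obs_fix", "act_fix", "correction", True)],
    rollouts=[Transition("obs_auto", "act_auto", "autonomous", False)],
)
print(len(data))  # 3
```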

The model trained with reinforcement learning is called π*0.6. Training π*0.6 on autonomous experience with Recap more than doubles throughput on some of the hardest tasks and cuts the failure rate by a factor of two or more. This brings π*0.6 to the level of robustness required for practical applications: it can make espresso continuously for a whole day, fold various types of clothes for hours in a new home, and assemble the cardboard boxes needed for actual packaging in a factory.

Imitation Is Far from Enough

We might wonder why VLAs struggle to succeed consistently when trained only with supervised learning (i.e., imitation), while supervised learning works well for LLMs and other machine-learning systems. The reason is actually well understood; what has been lacking are practical solutions.

When a VLA trained by imitation controls a robot, it makes small mistakes like any model: it may place the gripper slightly off, miss a grasp, or knock over an object.

Because the robot interacts with a real physical environment, these errors put it in situations slightly different from its training data, and in those situations it is more likely to make another, bigger mistake. Errors compound: any single small error can be fixed, but accumulated errors lead to failure.
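
A toy calculation shows why this compounding matters for long-horizon tasks: even a small chance of an unrecoverable error per step makes an error-free run unlikely over hundreds of steps. The 1% per-step error rate below is an arbitrary assumption for illustration, not a figure from the post.

```python
# Probability of completing a task without a single unrecoverable error,
# assuming independent errors with a fixed (hypothetical) per-step rate.
per_step_error = 0.01
for horizon in (10, 100, 500):
    p_no_error = (1 - per_step_error) ** horizon
    print(f"{horizon:4d} steps -> {p_no_error:.0%} chance of an error-free run")
# 10 steps -> 90%, 100 steps -> 37%, 500 steps -> 1%
```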

For AI systems that produce static, one-shot outputs (such as LLMs), this is not a big issue. But when the model acts as a control policy that continuously interacts with the outside world (such as a real-world robot), it becomes a serious problem. In practice, this means that while it is relatively easy to get a VLA to complete a task occasionally, it is very hard to make it succeed reliably and consistently.

The problem of error accumulation can be addressed with additional data from the VLA's own behavior: essentially, letting it correct the actual mistakes it makes in the real world, just as humans improve at a task by practicing it repeatedly.

But what should serve as the ground-truth label for this kind of experience? If we train the policy to simply copy what it did before, we are just teaching it to keep making the same mistakes. The key to learning from experience is extracting a good training signal from imperfect experience data.

Corrective Guidance and Reinforcement Learning

Recap extracts good training signals from imperfect experience data in two ways:

Corrective Guidance (coaching with corrections): Experts show the robot how to fix errors or do better.

Reinforcement learning: based on the outcome of the whole task, the robot works out for itself which behaviors were better or worse, then reinforces the good ones and avoids the bad ones through iterative learning.

For corrective guidance to work, expert remote operators need to provide correction signals to show how to recover from the errors the robot actually makes in the real world.

In practice, this means running the current best policy and having a human take over via teleoperation when the robot makes an error. These interventions serve as a supervision signal, but unlike the demonstrations used to train the original policy, they target the states the policy actually drives the robot into, which addresses the error-accumulation problem.
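
A minimal sketch of such an intervention loop, in the spirit of DAgger-style correction collection: the policy, expert, and environment below are toy stand-ins, not PI's actual system, and in this simplified version only the expert's corrective actions are kept as training labels.

```python
import random

def collect_corrections(policy_action, expert_action, expert_wants_control,
                        initial_state, step, horizon=50):
    """Run the current policy; a human takes over when they judge the robot is
    going wrong, and only the corrective actions become training labels here."""
    dataset, state = [], initial_state
    for _ in range(horizon):
        if expert_wants_control(state):
            action = expert_action(state)
            dataset.append((state, action))  # supervision on states the policy actually reached
        else:
            action = policy_action(state)    # autonomous step, no imitation label in this sketch
        state = step(state, action)
    return dataset

# Toy 1-D example with stand-in functions (all hypothetical):
random.seed(0)
corrections = collect_corrections(
    policy_action=lambda s: s + random.choice([-1, 1]),  # imperfect policy drifts randomly
    expert_action=lambda s: s - 1 if s > 0 else s + 1,   # expert steers back toward 0
    expert_wants_control=lambda s: abs(s) > 3,           # intervene once the drift gets large
    initial_state=0,
    step=lambda s, a: a,                                 # the "action" is the next state here
)
print(len(corrections))
```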

However, corrective guidance alone has its limits: the quality of this supervision depends on whether humans can judge in time when to intervene and whether they can provide high-quality corrections. For obvious or serious errors this works well, but to reach optimal performance, that is, to complete tasks quickly, reliably, and consistently, the robot must be able to learn autonomously.

The core challenge of reinforcement learning from task outcomes is credit assignment: working out which of the robot's actions led to good results and which led to bad ones.

If the robot grasps the espresso machine's portafilter the wrong way, it may then struggle to insert it. The error happens not at the insertion stage but at the initial grasp. A correct credit-assignment method should attribute the failure to the faulty grasp, even though the failure only shows up in a later step.

The foundation model trained only through imitation learning has difficulty inserting the portafilter into the espresso machine. The error leading to failure may occur at an earlier stage.

Credit assignment is a key challenge in reinforcement learning. Recap solves this problem by training a value function.

For example, in a game like chess, the agent only receives a reward when it wins. The value function then predicts, from the current board position, the probability that the agent will win. Actions that increase the value are good and should be encouraged; actions that decrease it should be discouraged.

The following figure shows the predictions made by the value function during task execution.

Value-function predictions at different points in an episode. This value function predicts the (negative) number of steps remaining to complete the task. Note that the prediction rises as the robot makes progress and stays flat when little progress is made.
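
Based on that caption, here is a minimal sketch of how such per-step regression targets could be constructed from a recorded episode: the target at each step is the negative number of steps left. The extra penalty for failed episodes is an assumption for illustration, not a detail confirmed in the post.

```python
def value_targets(episode_length: int, succeeded: bool, failure_penalty: int = 200):
    """Per-timestep regression targets: the (negative) number of steps remaining
    until the task is finished. The failure penalty pushes the value of a doomed
    state below that of any state on a successful trajectory (assumed, not stated)."""
    targets = []
    for t in range(episode_length):
        steps_remaining = episode_length - t
        if not succeeded:
            steps_remaining += failure_penalty
        targets.append(-steps_remaining)
    return targets

print(value_targets(5, succeeded=True))    # [-5, -4, -3, -2, -1]
print(value_targets(3, succeeded=False))   # [-203, -202, -201]
```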

Once the value function is trained, we need to use it to obtain a better policy. There are many ways to do this, but what is needed is a scalable method that works with large VLA models.

In Recap, Physical Intelligence conditions the VLA on value changes: all of the training data (good and bad actions alike) is used, while the VLA is told which actions were good or bad. Since models generalize best when trained on lots of data, keeping all the data and adding only value-change annotations as extra input is a very attractive option.

In reinforcement learning, this "value change" is called the advantage. At execution time, we simply ask the advantage-conditioned VLA for high-advantage actions, obtaining a policy that is better than the data it was trained on.
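
The following sketch illustrates the advantage-conditioning idea in its simplest form: tag each transition by whether the predicted value rose after its action, train on everything with the tag as extra input, and always request the "good" tag at execution time. The binary tag and threshold are illustrative simplifications, not the actual π*0.6 implementation.

```python
def advantage_labels(values, threshold=0.0):
    """Tag each transition by whether the predicted value increased after its action.
    Every transition stays in the training set; the tag is just extra conditioning
    the VLA sees alongside the observation."""
    labels = []
    for v_now, v_next in zip(values[:-1], values[1:]):
        advantage = v_next - v_now           # "value change" = advantage, as described above
        labels.append("good" if advantage > threshold else "bad")
    return labels

# Toy value predictions along one episode (negative steps-to-completion scale):
values = [-6.0, -5.0, -5.5, -3.0, -1.0]
print(advantage_labels(values))  # ['good', 'bad', 'good', 'good']
# Training: the VLA learns (observation, tag) -> action on all data.
# Execution: always feed the "good" tag so the policy reproduces its better behavior.
```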

For Real-World Tasks

Physical Intelligence used Recap to train the π*0.6 model for several real-world applications. π*0.6 is built on the π0.6 model, which is an improved version of the earlier π0.5 model.

It uses a slightly larger backbone network and can handle more heterogeneous prompts and conditioning information, as shown in the figure below. For a more detailed description of the π0.6 model architecture, see the model card:

https://website.pi-asset.com/pi06star/PI06_model_card.pdf

Physical Intelligence studied three applications: making espresso drinks, folding various types of clothes, and assembling cardboard boxes for packaging. In the first stage of Recap, the π*0.6 model is pre-trained with offline reinforcement learning (offline RL), in contrast to the standard supervised learning used for the base π0.6 and π0.5 VLAs. π*0.6 is then fine-tuned per task on demonstration data, and finally trained further with reinforcement learning on additional data the robot collects in the real environment, including corrections provided by experts (to fix major errors) and reward feedback (to improve from autonomous experience).

The following chart compares model performance at different stages: the base π0.6 model trained with supervised learning; the base π*0.6 model pre-trained with offline reinforcement learning (the first stage of Recap); the π*0.6 model fine-tuned per task on demonstrations; and the final π*0.6 model fine-tuned with the robot's real execution experience added. For each task, Physical Intelligence measured throughput (the number of successfully completed tasks per hour) and success rate.
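
For clarity on how the two reported metrics relate, here is a small sketch with toy numbers (not PI's measurements): throughput counts successful completions per hour of robot time, so it rewards both reliability and speed, while success rate only reflects reliability.

```python
def throughput_per_hour(n_successes: int, total_minutes: float) -> float:
    """Successful task completions per hour of robot time (captures speed and reliability)."""
    return n_successes / (total_minutes / 60.0)

def success_rate(n_successes: int, n_attempts: int) -> float:
    return n_successes / n_attempts

# Toy evaluation session, illustrative numbers only: 18 of 20 attempts succeed
# over 90 minutes of autonomous operation.
print(throughput_per_hour(18, 90))  # 12.0 tasks per hour
print(success_rate(18, 20))         # 0.9
```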

Notably, for some of the most difficult tasks (such as making espresso), incorporating the robot's real execution experience more than doubled throughput and cut the failure rate by more than half.

Recap significantly improves the throughput in all tasks and usually also brings a substantial increase in the success rate.

Qualitatively, after learning from a combination of demonstration data and the robot's own experience, the final π*0.6 model handles each application proficiently. The following video shows highlights from the evaluations of these tasks.

Qualitative examples of π*0.6 on each real-world task. π*0.6 can handle varied conditions and recover from errors.

Each task poses many challenges that make high-throughput autonomous execution difficult.

The cardboard-box assembly task requires intricate physical manipulation: folding the box lids while keeping the box's structure intact. The task also has to be performed repeatedly while handling various edge cases, as shown in the video above: sometimes flat boxes stick together so the robot picks up several at once and must put the extras back; sometimes it has to refold a box after an error.

The clothes-folding task must handle a high degree of diversity and generalize across different initial conditions and garment types. This is very difficult because different garments require different folding strategies, and different fabrics also behave differently dynamically.

Finally, the espresso-making task requires handling a very long sequence of operations. The new model uses a high-level language policy similar to the earlier π0.5 model. The task also involves pouring liquids, judging when the grinder and espresso machine have finished, and wiping the machine down with a cloth after the coffee is made.

These steps are extremely challenging for current state-of-the-art VLA models, yet π*0.6 completes these tasks with a success rate of over 90%.

What's Next?

Today, robot foundation models rely mainly on demonstration data collected manually (for example, through teleoperation). This keeps training simple and straightforward, but it also imposes serious limitations: the data requires a great deal of human labor, the model's speed and reliability are capped by the skill of human operators, and the robot itself cannot keep improving from experience. Methods like Recap can in principle remove these limitations because they also learn directly from the robot's own experience.

As robots are more widely deployed in the real world, learning from experience may become an important data source and an indispensable part of achieving high - performance models.

Just as humans grow through a combination of guidance, coaching, and practice, robots will also learn from multiple different data sources. However, these data sources will play different roles: expert demonstrations are