
The 60-Year Evolution of the Robot “Brain”: Five Generations of Foundation Models and Three Closed-Source Schools

硅谷101 | 2026-01-15 11:42
How far are we from having robots that can truly do work?

In 2025, the demos released by robotics companies were somewhat surreal:

First, Figure AI released its third-generation robot in October. It can perform various household chores and its demo was quite impressive, but there were many doubts about its task success rate, and its face design fell into a rather severe uncanny valley.

Another star company, 1X, released its demo at the end of October. Its facial design is much cuter, the kind of robot people might be more willing to have at home. But the robot, named Neo, relies on remote teleoperation and has been criticized as “fake intelligence,” on top of various privacy concerns.

Meanwhile, Tesla's robot also saw various demo updates, including a very smooth running demo in December, but its mass-production plan clearly ran into serious trouble in 2025: the company had to suspend production and redesign the hardware.

In our robot series, we've already covered dexterous hands and the 2025 annual review of the embodied-intelligence industry. In this article, we'll take an in-depth look at a core technology of the industry, the robot foundation model, and try to answer one question: why did 2025 suddenly become the “starting year” of the robot foundation model?

We also visited cutting-edge robotics companies and laboratories in Silicon Valley. The foundation-model story is divided into two parts, “closed source” and “open source,” systematically dissecting how the “brains” of today's mainstream robots are trained, how they connect to the real world, and the technical and business logic behind the different approaches, so you can understand how robot brains are built in the era of large models. In this article, we start with the current favorite of the capital market: the closed-source school.

01 Robot Foundation Model: The Paradigm Revolution from the 1960s to 2025

If we want to explain the robot foundation model in one sentence, the simplest analogy is: If GPT is the “talking brain,” then the robot foundation model is the “hands-on brain.”

However, it took humans a full 60 years to develop this “hands-on brain.” Let's first review the four major robot paradigms before the emergence of large models.

Chapter 1.1 The First Generation: Programmable Robots (1960s-1990s)

In 1961, the world's first industrial robot, Unimate, “started working” in a General Motors factory. Its job was simple: pick up hot metal parts from one production line and place them on another.

From today's perspective, it was quite “stupid” because it relied entirely on programming. Engineers used code to tell it:

Step 1: Move the arm 30 centimeters to the left.

Step 2: Close the gripper.

Step 3: Move the arm 50 centimeters upward.

Step 4: Rotate the arm 90 degrees to the right.

Step 5: Open the gripper.

It sounds stupid, right? But at that time, it was a revolutionary breakthrough. The problems with this approach are obvious: Zero fault tolerance and zero flexibility.

If the position of a part was off by 1 centimeter, the robot couldn't pick it up. If a part of a different size came along, the code had to be rewritten. Not to mention unexpected situations: if a part fell on the ground, the robot had no idea what to do at all.
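To make that brittleness concrete, here is a minimal sketch of what such a hardcoded routine looks like in modern terms. The `arm` and `gripper` objects, their methods, and every number are illustrative placeholders, not any real controller API.

```python
# Hypothetical controller API: `arm` and `gripper` stand in for a
# vendor-specific interface; every number is fixed at programming time.

def pick_and_place(arm, gripper):
    """Blindly replay a fixed motion sequence: no sensing, no recovery."""
    arm.move_relative(dx=-0.30, dy=0.0, dz=0.0)   # Step 1: 30 cm to the left
    gripper.close()                                # Step 2: grasp
    arm.move_relative(dx=0.0, dy=0.0, dz=0.50)     # Step 3: 50 cm up
    arm.rotate(yaw_degrees=90)                     # Step 4: rotate 90 degrees right
    gripper.open()                                 # Step 5: release

# If the part is 1 cm off, the gripper closes on air; if the part size
# changes, every number above has to be rewritten by hand.
```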

However, in a highly controllable environment like a factory, this method worked for decades. Even today, many welding robots in automobile factories still use this “programmable” logic.

Chapter 1.2 The Second Generation: SLAM-Based Methods (1990s-2010s)

In the 1990s, roboticists realized that programming alone was not enough, and robots needed to be able to “perceive” the environment. So, technologies such as SLAM (Simultaneous Localization and Mapping) and motion planning emerged.

The core idea here is: First, use sensors to “see” the surrounding environment and create a 3D map. Then, plan a path on the map and finally execute the action. The most successful application of this method is the vacuum-cleaning robot.

A robot vacuum like the popular Roomba works roughly like this: it uses sensors such as lidar or a camera to scan the room and build a map, plans a path that covers the whole floor, then follows that path while avoiding obstacles.
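Sketched in code, that perceive-plan-execute loop might look like the following. The `lidar`, `slam`, `planner`, and `base` objects are hypothetical placeholders rather than a real SLAM stack.

```python
# Illustrative perceive-plan-execute loop for a cleaning robot.
# `lidar`, `slam`, `planner`, and `base` are hypothetical placeholders.

def clean_room(lidar, slam, planner, base):
    # 1. Perceive: build an occupancy map from laser scans.
    occupancy_map = slam.update(lidar.get_scan())

    # 2. Plan: a path that covers every free cell of the map.
    waypoints = planner.coverage_path(occupancy_map)

    # 3. Execute: follow the path, replanning around new obstacles.
    while waypoints:
        target = waypoints.pop(0)
        if base.obstacle_ahead():
            occupancy_map = slam.update(lidar.get_scan())
            waypoints = planner.replan(occupancy_map, start=base.pose())
            continue
        base.drive_to(target)
```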

This method is very successful in “navigation” tasks. Early self-driving cars, drones, and logistics robots basically followed this pattern. However, it doesn't work well for “manipulation” tasks, which are far more complex. For example, to make a robot fold a towel, the traditional method takes four steps:

1. Use vision to identify the four corners of the towel.

2. Calculate the 3D coordinates of each corner.

3. Plan the movement trajectory of the arm.

4. Execute grasping, folding, and putting down.

It sounds reasonable, but there are many pitfalls in actual operation. The towel may be wrinkled, making it impossible to identify the “four corners.” The towel is flexible, and once grasped, it deforms, causing the 3D coordinates to become invalid immediately. Each step may go wrong, and if one step fails, the whole process collapses.
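To see why one bad step sinks everything, here is a hedged sketch of that towel-folding pipeline. The `camera`, `vision`, `planner`, and `arm` objects are hypothetical placeholders; the point is the failure cascade.

```python
# Sketch of the "perceive-plan-execute" pipeline for towel folding.
# All objects passed in are hypothetical placeholders.

class StepFailed(Exception):
    pass

def fold_towel(camera, vision, planner, arm):
    corners_2d = vision.detect_corners(camera.image())
    if len(corners_2d) != 4:                        # wrinkled towel: corners hidden
        raise StepFailed("could not find 4 corners")
    corners_3d = vision.estimate_3d(corners_2d)     # stale the moment the cloth deforms
    trajectory = planner.fold_trajectory(arm.pose(), corners_3d)
    for command in trajectory:                      # grasp, fold, put down
        arm.execute(command)                        # any slip here ends the attempt
```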

In 2010, a research team at the University of California, Berkeley, conducted an experiment: They used this “perceive-plan-execute” method to make a robot fold towels. On average, it took 24 minutes to fold one towel.

Even in today's AI era, towel folding remains a core benchmark task that robots driven by foundation models are still working to crack.

Chapter 1.3 The Third Generation: Behavior Cloning (Mid-2010s)

Since manually designing rules doesn't work, can robots directly “learn” from humans? This is the idea behind behavior cloning (also known as imitation learning).

Taking towel folding as an example again, a robot using imitation learning would do the following: let humans demonstrate folding a towel many times; record the visual input and action output of every frame; train a neural network to learn the mapping from input to output. Then, when the robot sees a towel, the network directly outputs the action to take.
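In code, behavior cloning is essentially supervised learning on demonstration data. Below is a minimal PyTorch-style sketch assuming the demos are stored as (image, action) pairs; the network size and the seven-dimensional action are illustrative choices, not any specific published system.

```python
# Minimal behavior-cloning sketch: supervised learning from human
# demonstrations stored as (image, action) pairs. Sizes are illustrative.
import torch
import torch.nn as nn

policy = nn.Sequential(            # maps a flattened 64x64 camera image to an action
    nn.Flatten(),
    nn.Linear(64 * 64 * 3, 256),
    nn.ReLU(),
    nn.Linear(256, 7),             # e.g. 6-DoF end-effector delta + gripper
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def train(demo_loader, epochs=10):
    for _ in range(epochs):
        for images, expert_actions in demo_loader:   # recorded human demos
            predicted = policy(images)
            loss = nn.functional.mse_loss(predicted, expert_actions)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# At run time the robot simply does: action = policy(current_image)
```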

In 2015, a team at Google Brain used this method to teach a robot to grasp various objects. They collected hundreds of thousands of grasp attempts and trained a neural network on them, which pushed forward “vision-action” learning for robotic grasping.

This was a huge improvement! For the first time, robots didn't need manually written rules and could learn from data.

However, this method has a fatal flaw: low data efficiency. It takes hundreds of thousands of grasp examples to train, and that is just for the single action of “grasping.” To learn to “fold towels,” perhaps even a million demonstrations wouldn't be enough.

What's even more problematic is that this method has poor generalization ability: a model trained with data collected on a Type-A robot can hardly be used on a Type-B robot.

Chapter 1.4 The Fourth Generation: Reinforcement Learning (Late 2010s)

In 2016, AlphaGo defeated Lee Sedol, proving the power of reinforcement learning. Robot scientists wondered: Can robots also use reinforcement learning to figure out how to complete tasks on their own?

The core idea of reinforcement learning is: Without human demonstrations, let the robot try on its own. Reward it for correct actions and punish it for wrong actions. Gradually, the robot will learn how to obtain the most rewards.
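Stripped to its skeleton, that trial-and-error loop looks like the sketch below. The `env` and `agent` objects are hypothetical stand-ins; a real system would run an algorithm such as PPO or SAC on top of this structure.

```python
# Generic reinforcement-learning loop: the robot acts, the environment
# scores it, and the policy is nudged toward actions that earned reward.
# `env` and `agent` are hypothetical placeholders.

def train(env, agent, episodes=1_000_000):                # note the episode count
    for _ in range(episodes):
        observation = env.reset()
        done = False
        while not done:
            action = agent.act(observation)               # try something
            observation, reward, done = env.step(action)  # e.g. +1 folded, -1 dropped
            agent.update(observation, action, reward)     # reinforce rewarded actions
```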

At that time, legged robots such as Boston Dynamics' could already walk, jump, and do backflips on various complex terrains, and reinforcement learning began to be introduced into robot locomotion control.

But reinforcement learning also has a big problem: It's too slow. To learn to play Go, AlphaGo played tens of millions of games against itself in a simulated environment. However, it's difficult for robots to practice manipulation tasks in a simulated environment because the environment is too complex to set up, and it differs greatly from the real physical world, resulting in inaccurate simulations.

What about testing on a real robot? It's too slow, too expensive, and too dangerous. Imagine making a robot learn to fold towels. It may need to try millions of times, and most of the time, it will encounter situations like missing the grasp, throwing the towel on the ground, tearing the towel, or getting the arm stuck. How long will it take to learn?

Moreover, reinforcement learning has a more fundamental problem: it doesn't know any “common sense.” Humans know that a towel is soft, can be folded, and has a certain amount of friction. A reinforcement-learning robot has to “discover” this common sense through countless trials and errors, which is extremely inefficient.

Chapter 1.5 The Fifth Generation: VLA Model (Mid-2020s to Present)

The emergence of large language models changed everything. In 2022, ChatGPT came out of nowhere, and people discovered that large language models contain a huge amount of “common sense” about the human world: they know what a towel is, what folding means, and what should be done first and what later. They have reasoning, planning, and generalization abilities.

The industry's first reaction was: can large language models be combined with robots? Thus the VLA (Vision-Language-Action) model was born. The revolutionary aspect of the VLA model is that it unifies three things in one neural network:

Vision: See the current scene; Language: Understand the task goal and common sense; Action: Output specific control instructions.

For example, suppose you tell a robot, “Help me put the apple on the table into the basket.” The traditional method requires four steps:

1. Use vision to identify the “apple” and the “basket.”

2. Plan the trajectory for “grasping the apple.”

3. Plan the trajectory for “moving to the basket.”

4. Plan the “putting down” action.

What about the VLA model? It's an end - to - end neural network that directly outputs “what action to take next” from “language instruction + visual input.”
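The difference in interfaces can be sketched as follows. The pipeline's `detector` and `planner` objects and the VLA model's `predict_action` method are hypothetical placeholders used only to show the contrast.

```python
# Traditional pipeline: several hand-wired stages, each its own module.
def pipeline_step(image, detector, planner):
    apple, basket = detector.find(image, ["apple", "basket"])
    return planner.grasp(apple) + planner.move_to(basket) + planner.release()

# VLA: one neural network straight from (instruction, image) to actions.
def vla_step(vla_model, image):
    return vla_model.predict_action(
        image=image,
        instruction="put the apple on the table into the basket",
    )
```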

What's even more amazing is that it can perform “common-sense reasoning.” For example, if you say, “Help me prepare breakfast,” in a home environment it knows to take eggs from the refrigerator, handle the eggs carefully to avoid breaking them, and put bread in the toaster.

You don't need to program these common-sense rules one by one, and the robot doesn't need to “discover” them through millions of trials and errors, because they are already in the large language model.

York Yang

Co-founder of Dyna Robotics:

We use VLA at the architectural level. Simply put, we use the VLM in the large-model field as the so-called backbone, but when outputting the final result, we convert it into actions that can be used in the robotics field. Actions can be intuitively understood as commands like moving the arm to a certain coordinate point.
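A schematic of the architecture described here, with made-up module names: a pretrained VLM backbone fuses the image and the instruction into an embedding, and a small action head decodes that embedding into an arm command. This is a sketch of the general idea, not Dyna Robotics' actual implementation.

```python
# Schematic VLA policy: pretrained VLM backbone + small action head.
# Module names and dimensions are illustrative.
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, vlm_backbone, hidden_dim=1024, action_dim=7):
        super().__init__()
        self.backbone = vlm_backbone          # frozen or fine-tuned VLM
        self.action_head = nn.Sequential(     # decodes the embedding into an action
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),       # e.g. target x, y, z, rotation, gripper
        )

    def forward(self, image, instruction):
        # Assumes the backbone returns a hidden_dim-sized fused embedding.
        embedding = self.backbone(image=image, text=instruction)
        return self.action_head(embedding)    # "move the arm to this coordinate"
```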

The most criticized aspect of VLA is: why do we need the L (Language)? Many traditional robot algorithms are purely vision-based. But if you think about it carefully, in a long-horizon task your brain actually generates something like language to tell you what to do first and what to do next.

The role of L is that, for very complex tasks, the model can draw on the logical knowledge trained into large language models. For example, if you want to drink water, it knows you first need to find a cup or a bottle, and that knowledge comes directly from the large language model.