
OpenClaw can't fry rice, and Ropedia is releasing human experience data. Here comes the "textbook" for robots.

新智元 2026-03-17 15:24
LeCun and Fei-Fei Li are betting on world models. Ropedia has released the human experience dataset Xperience-10M.

[Introduction] When LeCun and Fei-Fei Li each bet $1 billion on world models, a more fundamental question emerged: Who will provide truly usable data for Physical AI? The answer Ropedia gives is not more videos, but a structured "Encyclopedia of Experiences" from the real world.

Recently, the hottest thing in AI circles has been playing with OpenClaw.

OpenClaw seems omnipotent: it writes code, generates reports, revises plans, and searches for information.

But there is one thing OpenClaw can't do - it can't go into the kitchen and make you a plate of egg fried rice.

However, when AI tries to enter the physical world, a long-hidden problem begins to surface:

Where is the data for robots to learn from, and how can real-world human experience be turned into "high-quality data"?

In 2026, the winds of the AI world began to blow in the same direction: how to make intelligence truly enter the real, physical world.

Giving Physical AI a real "experience foundation"

In the debate over AI's development path, the persistence of Turing Award winner Yann LeCun represents one stance, and capital has given its answer.

AMI Labs, which he founded after leaving Meta, raised $1.03 billion in a seed round at a $3.5 billion valuation - the largest seed round in the history of European AI startups.

Jeff Bezos, NVIDIA, Samsung, Eric Schmidt - half of Silicon Valley is on his investor list.

LeCun put it bluntly: "World models will become the next buzzword. Six months from now, every company will claim to be building world models in order to raise money."

Just two weeks before AMI Labs' official announcement, "AI godmother" Fei-Fei Li's World Labs closed a $1 billion round, with its valuation soaring to $5 billion.

Chip giants AMD and NVIDIA both joined the round, and Autodesk put in a $200 million strategic investment in one go.

In an interview at the start of the year, Fei-Fei Li repeatedly stressed one judgment: spatial intelligence is the next frontier of AI.

Two of the most influential figures in academia have placed the same bet: Let AI understand the real physical world.

This is a sign of the times.

From language intelligence to physical intelligence: the "data bridge" in between

Over the past decade, AI's leap has been built on Internet-scale text, images, and videos.

Large models have learned to understand language, recognize scenes, and generate content. For the first time, intelligence has entered people's lives on a large scale.

However, when AI tries to further step into the physical world, the problems become completely different.

Robots not only need to "see" but also "act"; they not only need to recognize what a kitchen looks like but also understand how people move, operate, and interact with objects in it, as well as what physical consequences each action will bring.

This means that the next generation of intelligent systems - spatial intelligence, embodied intelligence, world models, Physical AI - needs not just more videos, but experience data (Experience) that is closer to the real process of human action.

The problem is: Such data hardly exists.

There are vast numbers of videos on the Internet today, but most of them are "passive viewing" material - lacking depth information, spatial structure, hand interaction trajectories, and the causal links between actions and their consequences.

For AI that wants to perform tasks in the physical world, a thousand hours of YouTube videos are far less useful than one hour of structured real human interaction experience.

The EgoScale research released by NVIDIA in February this year trained VLA models on more than 20,000 hours of first-person human video and found an almost perfect log-linear scaling law - every time the amount of human data doubles, model performance improves steadily.

For the first time, hard numbers showed that large-scale human experience data is a predictable source of supervision for robots learning dexterous manipulation.
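As a hedged illustration of what such a log-linear relationship means in practice, the sketch below fits performance as a linear function of log2(data hours). Every number in it is an invented placeholder, not a figure from the EgoScale work; only the functional form follows the claim above.

```python
import numpy as np

# Illustrative log-linear scaling fit: performance ≈ a + b * log2(hours).
# All numbers below are invented placeholders, NOT EgoScale results.
hours   = np.array([1250, 2500, 5000, 10000, 20000])   # hypothetical data scales
success = np.array([0.42, 0.48, 0.55, 0.61, 0.67])     # hypothetical task success rates

b, a = np.polyfit(np.log2(hours), success, deg=1)      # straight line in log2(hours)
print(f"fit: success ≈ {a:.3f} + {b:.3f} * log2(hours)")

# Under this fit, each doubling of human data adds roughly `b` to performance.
print(f"extrapolated success at 40,000 h: {a + b * np.log2(40000):.3f}")
```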

Whoever can continuously produce high-quality, structured human experience data will hold the fuel for the next era of intelligence.

However, what feeds the machines should not be limited to dry "factory operation manuals"; it should be a comprehensive "Encyclopedia of Human Experiences".

Just like the many versions of Neo, the savior in "The Matrix", interactions in the real world are full of vivid complexity and diversity.

Ten million Human Experience records: an "Encyclopedia of Human Experiences"

Against this backdrop, Ropedia has officially released Xperience-10M, a dataset of ten million Human Experience records totaling more than 10,000 hours, and is opening it up to researchers.

Now, Xperience-10M has been open-sourced on Hugging Face.

Hugging Face link: https://huggingface.co/datasets/ropedia-ai/xperience-10m
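For researchers who just want a first look, here is a minimal sketch of streaming the dataset with the standard Hugging Face `datasets` library. The repo ID comes from the link above, while the split name and per-record fields are assumptions to check against the dataset card.

```python
from datasets import load_dataset

# Stream a few records instead of downloading the full 10M-sample dataset.
# The "train" split and the field layout are assumptions - see the dataset card.
ds = load_dataset("ropedia-ai/xperience-10m", split="train", streaming=True)

for sample in ds.take(3):
    print(sample.keys())   # inspect which modalities each record carries
```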

The name "Ropedia" itself carries both ambition and romance -

to write a panoramic Encyclopedia for robots, turning the flow of human life experience into a universal foundation that carries AI across the era.

This is not a traditional set of raw, as-collected data.

What they want to do is not "upload a batch of videos" but build an "Encyclopedia of Experiences" for machines entering the physical world.

Why is it called an "encyclopedia"?

Because for embodied intelligence and world models, what is really missing is not any single type of signal but an encyclopedia-like, multi-dimensional, multi-modal set of data - physical information, three-dimensional spatial information, interaction intentions, behavior trajectories. Only when all of these dimensions come together do they form a vivid, realistic picture, rather than a mere collection of videos.

Along the same trajectory, Ropedia provides data across five core dimensions:

  • Visual stream information (continuous first-person RGB observation, 360° first-person capture)
  • Spatial information (depth, spatial structure, environmental topology)
  • Action information (whole-body movement, dexterous hand manipulation, interaction trajectories)
  • Interaction information (interaction relationships between people and objects, people and scenes, people and tasks)
  • Semantic information (task descriptions, state changes, atomic actions, behavior intentions)

More importantly, these dimensions are not stitched together after the fact; they are unified and aligned on the same time axis and within the same structural framework.

Vision and actions are naturally synchronized, semantics can correspond to physical changes, behavior paths and spatial structures can be traced, and the entire task execution process can be replayed, modeled, and learned.

This is the most fundamental difference between Ropedia and the many datasets on the market: what it delivers is not a pile of raw material but structured, training-ready data that can be fed directly into model training.
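To make "aligned on one time axis" concrete, here is a hypothetical sketch of what a single time-aligned record could look like. The class and field names are illustrative assumptions, not Ropedia's published schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of a time-aligned record spanning the five dimensions above.
# Names are illustrative assumptions, not Ropedia's actual schema.
@dataclass
class XperienceFrame:
    timestamp: float                 # shared time axis for all modalities
    rgb: bytes                       # visual stream: 360° first-person RGB
    depth: bytes                     # spatial: depth map / environment structure
    body_pose: List[float]           # action: whole-body motion
    hand_pose: List[float]           # action: dexterous hand keypoints
    contacts: Dict[str, str]         # interaction: which object each hand touches
    semantics: Dict[str, str] = field(default_factory=dict)  # atomic action, state change, intent

@dataclass
class XperienceEpisode:
    task_description: str            # semantic: what the person is trying to do
    frames: List[XperienceFrame]     # replayable, modelable, learnable trajectory
```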

What is the 4D Physical World?

In Ropedia's definition, 4D is not just "3D plus time".

It really points to a more complete framework: 3D + Time + Interaction + Consequence.

Breaking it down, it is a closed loop of four dimensions -

Space (Where): Where is the intelligent agent, and what is the structure of the surrounding environment?

Action (How): How does it move, contact, and manipulate?

Interaction (With What): What objects does it interact with?

Consequence (What Changes): What real and observable physical changes do these actions bring?

These four dimensions form the minimum closed loop for machines to understand the physical world.

The reason is simple: without interaction, time is just a video; without consequences, actions are just a trajectory.

Only when the information of "how actions change the world" is written into the data itself, can Physical AI truly have the basis to learn about reality.
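Put in data terms, one step of that closed loop might be sketched as follows; again, the names are illustrative assumptions rather than a Ropedia specification.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical sketch of one step of the 4D closed loop described above.
@dataclass
class ClosedLoopStep:
    space: Dict[str, List[float]]        # Where: agent pose and scene layout at time t
    action: List[float]                  # How: motion / manipulation performed over [t, t+1]
    interaction: str                     # With What: the object the action makes contact with
    consequence: Dict[str, List[float]]  # What Changes: observable object/scene state at t+1

# Without `interaction`, a sequence of these steps is just video over time;
# without `consequence`, `action` is just a trajectory that says nothing
# about how it changed the world.
```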

In this sense, what Ropedia releases is not just a dataset, but more like defining a new data standard for embodied intelligence.

HOMIE: Make human experience collection as natural as wearing glasses

Having a data standard is not enough. The more fundamental question is: where does such data come from?

There is no shortage of efforts in data collection in the industry today.

Tesla's Optimus project recruited a large number of people to repeatedly perform actions such as washing dishes and folding clothes in a dedicated facility while wearing motion capture suits. The number of cameras increased from four to six and then to eight.

Figure had operators wear Vision Pro to collect teleoperation data.

These solutions each have their own advantages, but they all face a common limitation: They rely heavily on professional equipment and controlled environments.

Data collection can only happen in Tesla's data factory, Figure's model home, or controlled scenarios in a lab.

Once the environment changes, the capability of the entire data loop drops off sharply.

True generalization requires closing the data loop across thousands of real-world end scenarios.

That is why, as early as 2025, Ropedia released its end-to-end collection platform, HOMIE.