A "world model" startup incubated by Pengcheng Laboratory has raised hundreds of millions of yuan in financing.
In today's artificial intelligence race, Mark Zuckerberg and his Meta are, without a doubt, among the most "radical" players.
Over the past year, Zuckerberg has been spending lavishly and recruiting talent from all corners, attempting to assemble the world's most powerful AI product team. He has frequently offered signing bonuses of $100 million to poach people from top-tier companies like OpenAI and Anthropic. The biggest outlay was for Alexandr Wang: to get this young prodigy to join Meta and lead its AI team, Zuckerberg spent a staggering $14.8 billion on Scale AI, the company Wang founded, effectively bringing it into Meta's orbit.
In addition, Zuckerberg targeted NFDG, the venture capital fund run by Daniel Gross, the CEO of SSI and a former Y Combinator partner, together with Nat Friedman, the former CEO of GitHub and co-host of the tech podcast "Hacker Medley". He subsequently invited both partners to join Meta, with plans to establish Meta's first corporate venture capital (CVC) fund since the company's founding.
More importantly, Zuckerberg has an ace up his sleeve: Meta's Chief Scientist Yann LeCun.
Who is Yann LeCun? He is a recipient of the Turing Award, the highest honor in computer science. He trained directly under Geoffrey Hinton, the "father of AI", which in academic lineage makes him a senior of Ilya Sutskever, OpenAI's former chief scientist, and he was an early proposer of the back-propagation learning algorithm for neural networks. One could say that if AI had a "martial arts world", Yann LeCun would be the grandmaster capable of founding a new school for Meta.
However, just when everything seemed set for a big push, this ace surprised everyone: Yann LeCun announced that he would leave Meta by the end of the year to start his own company. In his view, "current large language models are dumber than cats and have a poor understanding of the physical world", and the current path is a "dead end". To create true Artificial General Intelligence (AGI), one needs to focus on an alternative technical route: "world models".
As soon as the news broke, it set off a storm in global tech circles, and "world models" instantly became a hot topic. Countless people are asking: what exactly are world models, and how do they differ from the well-known large language models?
In China, a group of scientists is pondering the same questions. They are not confined to Yann LeCun's theoretical framework and are trying to come up with their own solutions. According to Touzhongwang, Tuoyuan Intelligence, a company incubated by Pengcheng Laboratory and focused on the research and development of "physical space intelligent models", recently announced the completion of a Pre-A round of financing worth hundreds of millions of yuan. The round attracted multiple strategic and industrial investors, including listed company Dongfang Seiko, Xingchen Technology, Detao Capital (an affiliated fund of Jinpai Home), and Shixi Capital, as well as heavyweight state-owned investment platforms such as Yueke Venture Capital and research-institution funds such as Pengcheng Vision and the Redbird Sailing Fund. Shenlan Capital served as the long-term exclusive financial advisor.
It is reported that the funds from this round will mainly be used for R&D of physical space intelligent models, enhancing the model's physical reasoning and cross-scenario transfer capabilities, building an embodied-intelligence ecosystem, and accelerating the commercialization of related products.
What are "world models"?
Why are large language models a dead end? Yann LeCun, who has spent a lifetime studying how the human brain works, believes that humans can reason and plan because they can remember things, have intuition, and possess common sense. Large language models, by contrast, work by inferring the most plausible next token, while image/video models infer the most plausible next pixel.
In other words, although these models have shown remarkable reasoning abilities, they operate only in the dimensions of "tokens" and "pixels" and do not truly understand the three-dimensional world. Take a simple real-world scenario: given the description "the door is 80 cm wide, the table is 50 cm wide, and a person's shoulders are 55 cm wide", current language models often compare the numbers one by one and conclude that "each is narrower than the door, so the person and the table can pass through together", completely ignoring basic physical facts such as the combined width when the two are side by side, the change in projected width caused by rotation, the constraints of posture adjustment, and the impenetrability of solid objects. Such mistakes reflect not a lack of knowledge but a lack of genuine understanding of physical space, which is the fundamental reason current AI cannot be a reliable participant in the physical world.
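To make the arithmetic concrete, here is a minimal illustrative check in Python (a hypothetical sketch, not taken from any actual model) contrasting the naive per-item comparison with the physical constraint that matters when a person carries a table through the door:

```python
# Toy example only: naive number comparison vs. the physical constraint.

door_width = 80       # cm
table_width = 50      # cm
shoulder_width = 55   # cm

# Naive token-level reasoning: compare each width with the door in isolation.
naive_pass = table_width < door_width and shoulder_width < door_width
print("naive check:", "pass" if naive_pass else "fail")        # pass

# Physical reasoning: side by side, the widths add up.
combined = table_width + shoulder_width                        # 105 cm
print("side-by-side check:",
      "pass" if combined < door_width else "fail")             # fail -> rotate or go one at a time
```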
To put it more practically, although large language models have made breakthroughs in text reasoning and knowledge processing, they still have fundamental flaws in understanding real physical space, planning continuous actions, and interacting with the environment in real time. These flaws not only push the realization of AGI further away but also directly limit the expansion of AI into more practical application scenarios such as embodied intelligence.
For example, because the model cannot accurately understand spatial structure and geometric relationships, robots often fail at simple tasks: misaligning, failing to grasp, being unable to avoid obstacles, not moving straight. In a grasping task, a robotic arm may miss the target repeatedly because it misjudges the target position, or clip table corners and walls while moving, indicating a misjudgment of distance, reachability, and obstacle-avoidance conditions. In more complex scenarios, the model may even generate action plans that violate physical laws, such as asking the robotic arm to pass through obstacles, driving a mobile platform into an impassable gap, or outputting an unstable trajectory on an inclined surface. Moreover, these systems are highly dependent on their training scenarios: when the lighting changes, an object's position shifts slightly, or the viewpoint deviates, performance drops significantly, and the same instruction can produce very different results in different scenes.
In short, to give AI truly human-level learning ability, we need large models to genuinely understand our "physical world", and this route is called "world models". Yann LeCun put it this way: "A world model is your mental model of how the world works. You can imagine a series of actions you might take, and your world model will allow you to predict what impact this series of actions will have on the world."
Renowned Chinese-American scientist Fei-Fei Li shares this view. She believes the main direction for AI over the next decade should be "world models" with spatial intelligence. The test of whether a model has "spatial intelligence" is whether it can generate worlds that obey physical laws and remain spatially consistent, process multi-modal inputs ranging from images to actions, and predict how these worlds will evolve or respond to interaction.
Of course, the so-called "route dispute" has not arisen because other large-model developers fail to recognize the value of "world models". The problem is that the route is genuinely hard.
To move from the digital world to the real world, a basic requirement is the ability to judge and carry out actions in the real world and to interact with it. The current mainstream embodied large-model architecture, the Vision-Language-Action (VLA) model, has two unavoidable flaws that are difficult to resolve completely even when world models are introduced:
First, VLA usually compresses visual input into the language token space. This process inevitably discards crucial geometric, topological, and physical-quantity information about continuous space, making it difficult for the model to understand precise positional relationships; the result is deviation in action control and, at times, operation sequences that violate physical constraints (see the toy sketch after this list).
Second, the generalization ability of VLA is severely limited. The real world is highly complex and diverse, and embodied intelligence is extremely sensitive to changes in viewpoint, environmental layout, object occlusion, and dynamics. When these factors combine, a VLA model may perform well in its training scenarios yet fail to transfer to new environments. Once the background changes, the lighting differs, or an object's position shifts slightly, the model's perception-reasoning-action chain can break down completely.
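As a toy illustration of the first flaw (hypothetical code, not any VLA system's actual tokenizer), round-tripping a continuous 3D coordinate through a small discrete token vocabulary already introduces geometric error:

```python
import numpy as np

# Toy example: quantize a continuous grasp point into a discrete token
# vocabulary and decode it back, as language-token interfaces implicitly do.

N_BINS = 256  # size of the hypothetical coordinate vocabulary

def to_token(x, low=0.0, high=1.0):
    """Quantize a coordinate in [low, high] to one of N_BINS discrete tokens."""
    return int(np.clip((x - low) / (high - low) * N_BINS, 0, N_BINS - 1))

def from_token(t, low=0.0, high=1.0):
    """Decode a token back to the center of its bin."""
    return low + (t + 0.5) / N_BINS * (high - low)

grasp_point = np.array([0.4312, 0.7786, 0.1021])   # metres, continuous
decoded = np.array([from_token(to_token(c)) for c in grasp_point])

print("rounding error (mm):", np.abs(decoded - grasp_point) * 1000)
# Even this mild quantization loses precision; coarser vocabularies, larger
# workspaces, or long action sequences compound the error into misaligned grasps.
```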
One could say that these two bottlenecks directly lead to a serious lack of AI capability in physical space, making the current "road to AGI" look like a bottomless pit. In October 2025, just one month before Yann LeCun announced his departure to start a company, Zuckerberg publicly stated that to stay competitive, Meta's spending next year would exceed $100 billion. The statement stoked anxiety about the cost of large-model development and severely tested investors' patience: in late October, Meta's stock price plunged 12.6%, wiping out nearly $240 billion in market value.
"VWA", Making "World Models" More Feasible?
So how can "world models" be developed in a practical and efficient way? Fei-Fei Li and Yann LeCun are thinking about it, and so are Chinese scientists. Tuoyuan Intelligence is one of them.
Tuoyuan Intelligence is one of the first companies in Pengcheng Laboratory's intelligent-computing ecosystem. Its core founding team consists of top AI scholars from China and abroad, including Dr. Wang Guangrun, a young leading scientist in AI (a top-tier recipient of Huawei's "Genius Youth" program), Dr. Wang Keze, a national-level young talent (winner of the Wu Wenjun Artificial Intelligence Science Award), and Dr. Liang Xiaodan, head of the Sun Yat-sen University-Tuoyuan Joint Laboratory (winner of Alibaba's Qingcheng Award).
The answer they came up with is "VWA", the Vision-World-Action model, a brand-new architecture distinct from VLA.
The Tuoyuan team believes the key bottleneck holding back current large models is their general lack of generalization. To break it, overall capability needs to be decoupled into two modules: "physical modeling" and "spatial modeling". With this split, the model gains highly general physical modeling capability that stays stable across environments, while the part that truly affects generalization is confined to the spatial modeling of specific scenes. The mechanism mirrors how humans operate robots in an unfamiliar environment: people do not possess innate "generalization ability"; they complete tasks by quickly adapting to the spatial layout of the new environment.
VWA is designed around this idea. Unlike VLA, which must compress visual information into the language token space, the VWA model can reason and make decisions directly in physical space, perform multi-step rollouts in continuous physical space, and predict future state changes, taking a crucial step forward in planning, safety assessment, and stable control.
The core of the VWA architecture is the Physical Autoregressive Model (PAR) developed by Tuoyuan. PAR encodes video frames and robot actions together as "physical tokens", enabling the model to predict the next video frame and action step by step in an autoregressive manner, forming a closed loop of "predict, execute, re-predict". More importantly, even without pre-training on actions, PAR can effectively learn the dynamics of the physical world, achieving a 100% success rate on the PushCube task of the ManiSkill robotic manipulation benchmark and performing comparably to strong baselines that require action pre-training across multiple tasks. This result significantly advances the technical path of transferring large-scale video-pretrained models to real-world robotic manipulation and lays an important foundation for building embodied intelligence with general physical common sense.
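The article does not describe PAR's interface, so the following is only a rough conceptual sketch of what such a "predict, execute, re-predict" closed loop could look like; every name here (world_model, robot, predict_next, and so on) is a hypothetical placeholder, not Tuoyuan's actual code.

```python
# Conceptual sketch of an autoregressive closed-loop controller over
# "physical tokens" (video frames + actions). All names are hypothetical.

def closed_loop_control(world_model, robot, goal, horizon=50):
    history = [robot.observe()]          # past frames and actions as physical tokens
    for _ in range(horizon):
        # Predict: propose the next action and the frames it is expected to produce.
        action, predicted_frames = world_model.predict_next(history, goal)

        # Execute the action on the real robot and observe the actual outcome.
        observed_frame = robot.execute(action)

        # Re-predict: condition the next step on reality rather than on the
        # model's own imagination, so prediction errors do not accumulate.
        history.append((action, observed_frame))

        if robot.goal_reached(goal):
            break
```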
Second, at the level of the underlying reasoning mechanism, Tuoyuan has developed a new Tweedie framework, which significantly improves the accuracy of action control, and introduced an efficient Eon computing mechanism, which greatly improves operating efficiency and long-sequence modeling ability. Together, the two lay a solid foundation for more reliable, intelligent, and generalizable physical space intelligence.
In terms of data, Tuoyuan Intelligence has brought in multi-source, high-quality physical data, mainly including: (1) real human grasping and natural-scene data with spatial information: billion-scale binocular and multi-camera vision data collected from real business scenarios, covering a wide variety of real environments and tasks, with highly consistent spatial-structure information and natural, continuous human action trajectories. Compared with existing data built mainly from simulation or staged shooting, this real-task data has clear advantages in scale, diversity, and authenticity, and its rich 3D spatial cues support fine-grained spatial understanding and semantic analysis of large numbers of objects. (2) Training-ground simulation data: relying on an embodied-intelligence training ground built with virtual-real twin technology, high-fidelity 3D reconstruction of physical environments and realistic object assets are used to generate large-scale physical simulation data and simulated teleoperation data, providing controllable, scalable, and repeatable training conditions for the model.
Relying on the new architecture and a large amount of real pre-training data, model development becomes far more efficient. Adaptation requires very little data (even a single example), and the number of parameters involved is tiny (in a model with tens of billions of parameters, only about 4,000 parameters need to be updated). More importantly, the model can adapt online quickly in new environments. Take a household robot as an example: it no longer needs a long learning and adaptation period, but can be put to use immediately after quickly modeling the layout of the new space.
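The article gives no implementation details, but updating only a few thousand parameters in a multi-billion-parameter model is characteristic of parameter-efficient adaptation, in which the pretrained backbone is frozen and only a tiny task-specific module is trained. Below is a minimal PyTorch sketch under that assumption; the backbone, dimensions, and data are placeholders, not Tuoyuan's model.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: freeze a pretrained backbone, train only a tiny adapter.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
for p in backbone.parameters():
    p.requires_grad = False                    # frozen: contributes no trainable parameters

adapter = nn.Linear(64, 64)                    # 64*64 + 64 = 4160 trainable parameters
print("trainable parameters:",
      sum(p.numel() for p in adapter.parameters()))   # ~4k, regardless of backbone size

# A single demonstration from the new environment is enough to fit the adapter,
# while the general knowledge stays in the frozen backbone.
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)
example_frames = torch.randn(1, 10, 64)        # one adaptation example (placeholder data)
target_actions = torch.randn(1, 10, 64)
for _ in range(20):
    features = backbone(example_frames)
    loss = nn.functional.mse_loss(adapter(features), target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```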
With such expectations, Tuoyuan Intelligence has attracted considerable attention from the capital market since its inception. Since its establishment in 2022, it has completed multiple rounds of market-based financing. Investors include market-based institutions such as Zhuoyuan Capital, Yuanshu Capital, the Redbird Sailing Fund, and Ginkgo Valley Capital, as well as state-owned platforms such as Yueke Financial Group and the Pengcheng Vision Fund.
The investors in this round all bring deep resources and strategic positioning in their respective fields, further confirming the capital market's recognition of Tuoyuan Intelligence's technology and prospects. For example, investor Dongfang Seiko is a leading enterprise in high-end intelligent equipment manufacturing. Its core strategy is "building a full-industry-chain ecosystem for embodied intelligent robots and empowering the intelligent upgrade of traditional industries"; it is proactively positioning itself in the AI-plus-embodied-robotics track and has built a layout spanning robot body manufacturing, R&D of multi-modal large-model "brains", and the expansion of application scenarios.
Xingchen Technology is a globally leading designer of visual AI SoC chips, ranking first worldwide in visual AI SoC market share (by shipment volume) and second in robotic visual AI SoCs. Built on a "vision + AI" framework and core capabilities in "perception + computation + connectivity", it provides AI SoC solutions for edge devices in smart vision, smart transportation, intelligent robots, smart homes, smart offices, and smart industry.
Detao Capital is the industrial investment platform of Jinpai Home and Jianpan Group. It makes strategic investments around a "pan-home-furnishing industry Internet ecosystem platform", focusing on the pan-home-furnishing industry chain, artificial intelligence, robotics, smart homes, and industrial Internet. It is committed to deep industrial value creation: incubating industrial technology, empowering the industry chain, cultivating leaders in industry segments, and building a pan-home-furnishing industrial ecosystem and industry Internet. It currently manages six funds and drives industrial development through a "capital + industry + technology + platform" model.
Shixi Capital was established by a leading integrated-circuit memory enterprise together with an investment team. It has long focused on hard-technology investments, with broad coverage of semiconductors and artificial intelligence, and helps portfolio companies grow through industrial resource matchmaking and technology empowerment. Shixi Capital manages more than a dozen funds and has invested in nearly 60 projects, several of which have gone public.
Fei-Fei Li once quoted the philosopher Ludwig Wittgenstein, "The limits of my language mean the limits of my world", adding that "at least for AI, the world is far more than just words." One can imagine that, with more support from industrial partners and direct connections to real production scenarios through this round of financing, Tuoyuan can further verify the applicability of VWA and accelerate the commercialization of physical space intelligent models.