
Yingshuo Completes Financing of Tens of Millions of US Dollars: "The Physical World is 3D, and the World Model Should Also Be 3D"

晓曦 · 2026-06-18 12:14
World models are shifting from generating videos that merely "look real" to constructing "computable and interactive 3D world states".

Exclusive news from "AnYong Waves": InSpatio, a spatial intelligence company, has announced the completion of a Pre-A financing round worth tens of millions of US dollars. Photon Venture Capital, Shunwei Capital, and Sequoia China participated in the round.

The underlying logic behind this financing is becoming a consensus: The physical world is 3D, and the world model should also be 3D. For AI to enter real scenarios such as robotics, autonomous driving, gaming, film and television, XR, and industrial simulation, it must first learn to understand the dynamic 3D physical world.

In the past two years, video generation has pushed "looking real" to new heights. For physical AI, however, generating images from the appearance of the world alone is far from enough. If 2D video is a projection of the world, like a shadow, then a 3D world model needs to capture the dynamic states behind the shadow. As the saying goes, "seeing the shadows of flying birds cannot piece together the spatial skeleton and flight mechanics of the birds."

Robots need to move autonomously in space, autonomous driving needs to predict complex environmental changes, and gaming and XR need to build interactive scenarios. What these require is not "resemblance" but "reality": a 3D native world model that can understand spatial structure, object relationships, motion changes, and physical constraints.

InSpatio has chosen the 3D native route.

InSpatio previously released and open-sourced real-time 3D/4D world models such as InSpatio-WorldFM and InSpatio-World. Recently, it announced a new world simulator, Topos 1.0, which will soon be open for beta testing.

Betting on the Dynamic 3D World Model

Large models have reached a watershed, and the world model is becoming a new direction for global AI competition.

Language models teach AI to understand and generate text, and video models teach AI to generate images. However, if AI wants to enter the physical world, images alone are not enough. It needs to understand space, remember objects, predict changes, and simulate consequences before taking action.

This is precisely the gap that embodied intelligence is currently struggling to cross.

A service robot may learn to grab a cup by watching a video, but it may fail if the angle or the table changes. The reason is that the video records "what seems to have happened" from a specific perspective, but it cannot accurately tell where the object is, how the space is connected, and whether a failed grab can be stably reproduced. For physical AI, the truly valuable data is the dynamic 3D world state that changes after an action and can be repeatedly verified.

Therefore, the competition among world models is shifting from "whose video looks more real" to "who can generate a world that is computable, interactive, and simulatable."

A model that produces beautiful video is more like a renderer, which depicts an existing state as an image. What physical AI really needs is a simulator, which maintains the world state, simulates the consequences of actions, and lets agents carry out trial and error at scale. The renderer answers "what does it look like?", while the simulator must answer "will it fall if pushed?"
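The renderer/simulator distinction above can be made concrete with a toy example. The sketch below is purely illustrative and is not InSpatio's method or API: `WorldState`, `render`, and `step` are hypothetical names, and the "physics" is a one-dimensional assumption in which a block falls once its centre leaves the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorldState:
    """Toy 1D world (assumption): a block on a table spanning [0, table_length]."""
    block_center: float  # metres, position of the block's centre
    table_length: float  # metres
    fallen: bool = False


def render(state: WorldState) -> str:
    """Renderer: describes the current state and predicts nothing."""
    if state.fallen:
        return "block on the floor"
    return f"block on table at x={state.block_center:.2f} m"


def step(state: WorldState, push: float) -> WorldState:
    """Simulator: applies an action (a displacement in metres) and returns the next state."""
    if state.fallen:
        return state  # a fallen block stays on the floor in this toy world
    new_center = state.block_center + push
    off_table = new_center < 0.0 or new_center > state.table_length
    return replace(state, block_center=new_center, fallen=off_table)


start = WorldState(block_center=0.8, table_length=1.0)
gentle = step(start, 0.1)  # block stays on the table
hard = step(start, 0.5)    # block is pushed past the edge
print(render(start), "|", render(gentle), "|", render(hard))
```

Because `step` is a pure function of state and action, the same push from the same state always yields the same outcome, which is the kind of repeatable trial and error the paragraph describes.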

To become a simulator, the world model must solve two problems simultaneously: efficiency and consistency.

In terms of efficiency, the 3D-native route does not require the model to learn spatio-temporal rules from redundant pixels, which significantly reduces training and inference costs and makes large-scale parallel simulation possible.

In terms of consistency, a 2D model's object positions may break down once the viewpoint changes substantially, the time span grows longer, or occlusion occurs. A 3D model, in contrast, maintains a unified spatial state, so the same object remains consistent across viewpoints and moments in time.
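To illustrate why a single 3D state keeps views consistent, here is a minimal sketch using standard pinhole projection. It is not taken from InSpatio's models: the camera setup (no rotation, cameras offset only along x, looking down +z), the intrinsics, and the names `project` and `depth_from_disparity` are all assumptions chosen for the example.

```python
from typing import Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project(point: Point3, cam_x: float,
            focal: float = 500.0, cx: float = 320.0, cy: float = 240.0) -> Pixel:
    """Pinhole projection for a camera at (cam_x, 0, 0) looking down +z."""
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point must be in front of the camera")
    u = focal * (x - cam_x) / z + cx
    v = focal * y / z + cy
    return (u, v)


def depth_from_disparity(u_left: float, u_right: float,
                         baseline: float, focal: float = 500.0) -> float:
    """Recover depth from two horizontally offset views (standard stereo relation)."""
    return focal * baseline / (u_left - u_right)


# One shared 3D state; every view is derived from it.
cup = (0.5, 0.25, 2.0)             # metres
left = project(cup, cam_x=0.0)     # (445.0, 302.5)
right = project(cup, cam_x=1.0)    # (195.0, 302.5)

# Both views agree on where the cup is in space.
print(depth_from_disparity(left[0], right[0], baseline=1.0))  # 2.0
```

The two images differ, but both are projections of the same stored point, so they can never disagree about the cup's position. A model that generates each view's pixels independently has no such guarantee.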

This is where InSpatio's judgment comes from: "The physical world is 3D, and the world model must also be 3D."

They do not convert videos into 3D. Instead, they start from the spatial skeleton, build the geometric structure first, and then derive perspectives, represent motion, and simulate changes. While tech giants try to approximate the real world in the pixel fog with larger models and more computing power, InSpatio significantly compresses the search space with spatial geometry.

InSpatio-WorldFM achieves real-time 3D generation and multi-perspective consistency. InSpatio-World moves towards a navigable dynamic 4D world, and Topos directly generates a high-fidelity, editable, and interactive simulation environment.

These productized capabilities point to a judgment that is not yet consensus: 3D-native representation may become the efficiency lever for world models.

Continuously Producing Dynamic 3D Data

The Internet is not short of videos, but it lacks dynamic 3D data with geometry, scale, material, motion, and interaction relationships.

This type of data determines whether physical AI can move from "seeing" to "learning." For robots to adapt to different rooms, warehouses, and city blocks and to generalize their capabilities, they need a large number of reproducible, editable, and verifiable spatial states. For autonomous driving to handle long-tail scenarios with ease, it likewise needs to generate and recombine dynamic environments in a controllable way.

Therefore, whoever can build dynamic 3D data with a time dimension at a low cost is closer to the upstream of physical AI.

The InSpatio team has more than 20 years of experience in the field of spatial computing, with long-term research on SLAM, 3D reconstruction, NeRF, 3DGS, graphics, and real-time systems. They can extract spatial structures from real-world observations such as pictures, videos, and depth sensor data, and then convert them into 3D representations that can be learned, edited, and reused.

This system can also generate data with physical labels through simulation, forming a data flywheel of "real collection, 3D reconstruction, generative enhancement, and model training iteration."

This is also an important reason why investors are optimistic about InSpatio. The world model is becoming the infrastructure of physical AI, and InSpatio has a scarce closed-loop 3D technology stack along with the ability to build dynamic data. These capabilities could be applied in fields such as robotics, autonomous driving, gaming, film and television, XR, and industrial simulation, giving the company the value of an underlying platform.

InSpatio believes that the world model is not purely an algorithm problem. It requires generative models, but also the close integration of graphics, 3D vision, physical simulation, and real-time spatial computing systems. What it processes is not text or images but space that changes over time.

This also means that a team working on this problem must understand geometry and reconstruction, understand deep learning, and be able to deliver commercial-grade systems.

InSpatio's team sits at exactly this intersection. The founder, Dr. Zhang Guofeng, is a professor at the National Key Laboratory of CAD&CG at Zhejiang University and a National Outstanding Young Scholar. He has worked in 3D vision and spatial computing/intelligence for more than 20 years and previously served as chief scientist of the Digital Space Business Group at SenseTime, combining top-level academic research with experience in large-scale industrial deployment. The co-founder, Dr. Liu Haomin, previously served as a research director at SenseTime and has long been responsible for technical research and engineering implementation in 3D vision and spatial computing/intelligence. Professor Bao Hujun, a Changjiang Scholar and National Outstanding Young Scholar at Zhejiang University, serves as chief scientist, and the cutting-edge research team he leads provides solid theoretical and technical support for InSpatio's world model and spatial intelligence work.

InSpatio did not switch to 3D after seeing the trend. Instead, it is a team that has long been in the fields of spatial computing/intelligence and 3D vision, waiting for the moment when AI knocks on the door of the physical world.

In the past, 3D technology was mainly used for 3D modeling, digital twins, spatial positioning, and AR/VR. Now that robotics, autonomous driving, and spatial intelligence all call for an interactive world model, 3D has become the foundation on which the world model attaches physical attributes and simulates physical changes.

According to InSpatio, this round of financing will be mainly used for model research and development, dynamic 3D data infrastructure, computing power infrastructure, industry cooperation, and talent recruitment. Subsequently, it will accelerate the launch of the official version of Topos 1.0 and initiate benchmark cooperation in industries such as embodied intelligence, gaming, and film and television. For InSpatio, this is more like the first signal of productizing the technical route.

While video generation is still competing for visual effects, the next level of competition in the world model has already begun: whoever can enable AI to learn the dynamic 3D physical world has the opportunity to stand at the data upstream of physical AI.

This is the entry point that InSpatio is betting on.

This article is from the WeChat official account "AnYong Waves", and is published by 36Kr with authorization.