
"Inverse Matrix" completes over USD 100 million financing, founder says the window for general world foundational models has shrunk to 18 months

Wang Yuchan · 2026-06-17 09:00
Inverse Matrix plans to release its flagship model by the end of 2026.

Text by | Wang Yuchan

Editor | Zhang Yuxin

Since the start of 2026, the scramble for world models in the primary market has become white-hot. Instead of casting a wide net as in the early days, funds are now concentrating heavily on top players. Among them, Physis Technology (Physis) has completed several rounds of financing in succession.

36Kr Intelligence Emergence has exclusively learned that Physis, a world model company, has completed a Seed++ round of more than 100 million US dollars; in March of this year, it had just completed a first round of more than 10 million US dollars. Institutions including Matrix Partners China, WuYuan Capital, and Photosynthetic Ventures jointly participated in this round, which also received strategic investment from Ant Group. Existing shareholders Hillhouse Capital and Yanyuan Ventures increased their investments.

Around the time this round closed, Physis released its general world base model, Physis-v0.1, which it summarizes as a "One For All" model for general physical-world applications. The model features four capabilities: physical correctness, long-range consistency, action causality, and broad generalization. A single pre-training run can serve multiple scenarios such as embodied intelligence, industrial simulation, game physics, and scientific prediction.

Physis plans to release its flagship model by the end of 2026 and will release open-source slices and technical reports along the way. The funds from this round will mainly go to pre-training R&D for the general world base model and to building a large-scale training system.

The team was co-founded by Chen Boyuan and Ji Jiaming, two young scholars from Peking University. Half of the team members are young scholars (including Olympiad gold medalists, provincial and municipal top scorers, and many authors of top-conference papers), and the other half are senior engineers from first-tier technology companies. They form a flat, AI-native team without hierarchical reporting lines or quarterly targets. They align on direction through technical judgment rather than administrative orders, and believe in free exploration, first-principles thinking, and long-termism.

As the new round of financing was being finalized, Intelligence Emergence conducted an exclusive interview with Chen Boyuan. He answered questions about organizational structure, financing rhythm, technical route, industry judgment, and scenario implementation.

"The current consensus in the industry is that within 18 to 24 months, there will be a landmark leap in the capabilities of the world base model, and within 36 months, it will be applied in multiple real-world scenarios," Chen Boyuan said. "This is highly consistent with the path of language models from GPT-3 to ChatGPT."

The following is the transcript of the dialogue between Intelligence Emergence and Chen Boyuan:

The window period for the general world base model is being compressed from three years to eighteen months

Q1: Congratulations on Physis completing a new round of financing exceeding 100 million US dollars. It's been less than two months since the last round of financing. Why can you maintain such a fast financing rhythm?

Chen Boyuan: This reflects investors' bets on the third paradigm shift in AI development.

In the past decade, AI has gone through two paradigm shifts: language models (predicting the next word) and visual generative models (predicting the next frame), each of which gave rise to platform-level companies. The third shift, now under way, comes from AI moving from the virtual world into the physical world, and its core is "predicting the next physical state" in physical space.

This paradigm of "given the current state and action, predicting how the world evolves" has appeared in sub-problems such as AlphaGo and robot control, and is now converging into a unified solution framework. However, the fundamental difference between the physical world and the virtual world is that the physical world is "partially observable". The model cannot just "do what it sees" but must understand the underlying physical constraints.

Investors are willing to quickly follow up and increase their investments mainly based on two judgments:

First, the path of the base model, which is "unified modeling of physical laws at the bottom layer and adaptation to different scenarios as needed at the upper layer", is becoming the industry consensus.

Second, the window period for the general world base model is being compressed from three years to eighteen months. Teams doing general pre-training will have more room. A leading general base model that combines data scale with effective algorithms will form a barrier that is hard for others to overcome.

Q2: What are the most frequently asked questions by investors during the financing process? What is the consensus on the time period for technology to be applied in the real world?

Chen Boyuan: The most frequently asked questions are: "Why should we believe that the general world base model can be successful?" and "Is the team firmly committed to building the base model?"

In our view, whether something can be called a base model depends on whether it is truly built around the goal of physical prediction. So we started from scratch on the physical prediction objective, developed the underlying architecture ourselves, and have seen the first signs of reasonable physical deduction outside the training distribution.

Internal experiments at Physis show that as the data and parameter scale increase, the state prediction error continues to decrease, showing exponential scaling potential similar to large language models, without the saturation inflection point of vertical models.

Regarding the implementation period, the consensus is that within 18 to 24 months, there will be a landmark leap in the capabilities of the base model, and it will score highly on real-world needs; within 36 months, it will be applied in multiple real-world scenarios. This is highly consistent with the path of language models from GPT-3 to ChatGPT. By then, each vertical scenario will simply call the base model's API, forming a relationship similar to that between AWS and SaaS.

Q3: Why have you not brought in industrial fund investment at this stage?

Chen Boyuan: At this stage, what we need most is to concentrate our "ammunition" in one direction to overcome the R&D and computing power thresholds of the general world base model. This is something that requires long-term and dedicated investment.

We are not in a hurry to commercialize at present. This is our value judgment at this stage. For a company building a general base model, binding the model to a vertical scenario too early for monetization may look like picking low-hanging fruit, but in fact it sets a boundary for the company. Once you collect data, tune the model, and make deliveries around a single scenario, you gradually degenerate into "one scenario, one model".

We believe that physical laws have a general solution. Gravity, collision, friction, and so on follow the same laws in any scenario. The value of the base model lies in cross-scenario reuse. Not being in a hurry to monetize does not mean that we don't value commercialization. We do, but at this stage we would rather first strengthen the base model's physical understanding. The rhythm of commercialization will unfold naturally as the technology matures and real industrial needs emerge.

Capabilities come before commercial actions, and the organizational style remains restrained. Investors are ultimately willing to pay for repeatable and expandable capabilities, and strengthening these capabilities is the only thing we should do at present.

Q4: You are also the head of the Zhiyuan Behavior World Model Innovation Center. Is there any connection between Physis and the Zhiyuan Research Institute?

Chen Boyuan: The Zhiyuan Research Institute has always positioned itself around original, 0-to-1 innovation in AI, while Physis focuses more on the underlying exploration and commercial technology development of the general world base model. Both are working towards the same goal: making artificial intelligence truly understand physical laws.

The most critical watershed: truly having the potential of a base model

Q5: Will the world model have its own Scaling Law?

Chen Boyuan: The physical world must have its own Scaling Law, but it must not follow the Scaling Law of language models or video generation. There are three reasons for the failure of direct replication:

Data limitation: Physical interaction data cannot be crawled infinitely like Internet text, and the collection and screening costs are extremely high.

Pixels do not equal physics: 90% of the information in videos, such as texture, lighting, and motion blur, is visual redundancy unrelated to physical laws.

Correlation does not equal causality: Pure observation can only learn statistical correlations, while the core of physics is causality. "Actions" must be involved to distinguish laws from coincidences.

Therefore, we must scale up in the "physical latent space" rather than the pixel space. This involves four key technical judgments:

Compression: Encode the world into an efficient physical latent space containing abstract representations such as force and velocity, and strip visual redundancy.

Causality: Introduce action intervention natively in the latent space to let the model understand the physical state transfer caused by actions.

Verification: A purely generative path can generate but cannot verify, and is prone to "physical hallucinations" such as penetration and weightlessness. Therefore, we introduce reinforcement learning, such as reinforcement learning with verifiable rewards (RLVR), to build a closed-loop alignment signal from clear physical constraints.

Generalization: The final latent space must be able to serve different scenarios (One for All) because physical laws are the same in different scenarios.

Q6: In model training, how is the mechanism designed to let the model learn from "active intervention"? What specific reward and punishment mechanisms are introduced to prevent the model from experiencing a breakdown in physical deduction when facing unseen environments?

Chen Boyuan: Physical world laws arise from interactions, not passive perception.

Therefore, we designed the model architecture from scratch and introduced actions natively in the underlying physical latent space. This is unlike traditional video generation models, which respond to control through a module grafted on afterwards. You cannot weld a steering wheel onto a car that was built without one and claim it is controllable.

We inject actions, whether it is joint motion or movement residual vectors, as conditional signals to modulate the prediction process of the next physical state. In this way, the density of each piece of data is doubled, and the model learns not "what the world looks like" but "what action is taken and what transfer it causes", thus achieving a leap from correlation to causality.
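The idea of injecting actions as a conditioning signal can be illustrated with a toy sketch. It is not Physis's architecture, which the interview does not describe in detail. It assumes a FiLM-style modulation in which the action produces a per-dimension scale and shift applied to a proposed change in the latent state; the function names, the linear maps, and all weights are hypothetical.

```python
from typing import List, Sequence


def matvec(matrix: Sequence[Sequence[float]], vec: Sequence[float]) -> List[float]:
    """Multiply a small dense matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def predict_next_state(
    state: Sequence[float],
    action: Sequence[float],
    dynamics: Sequence[Sequence[float]],
    scale_w: Sequence[Sequence[float]],
    shift_w: Sequence[Sequence[float]],
) -> List[float]:
    """One action-conditioned step in a toy latent space (illustrative only)."""
    # Unconditioned proposal for how the latent state would change.
    delta = matvec(dynamics, state)
    # The action yields a per-dimension scale and shift (FiLM-style modulation).
    scale = matvec(scale_w, action)
    shift = matvec(shift_w, action)
    # The same state leads to different next states under different actions.
    return [s + (1.0 + g) * d + b for s, d, g, b in zip(state, delta, scale, shift)]
```

With a zero action the step reduces to the unconditioned dynamics, while a non-zero action changes the transition, so each training sample of the form (state, action, next state) carries information about what the action caused rather than only what the scene looked like.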

Physics is naturally verifiable. For example, objects do not disappear out of thin air, racing cars cannot pass through walls, and fluids cannot be poured out like ice cubes. Therefore, we built an automated physical verification sandbox in reinforcement learning.
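A minimal sketch of what a verifiable physics reward could look like follows. It covers only two of the constraints mentioned above (objects do not vanish, objects do not pass through a wall) in a one-dimensional toy world. The `Frame` type, the function names, and the binary reward are assumptions for illustration, not a description of Physis's sandbox.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Frame:
    # Hypothetical rollout frame: object id -> x coordinate in a 1-D toy world.
    positions: Dict[str, float]


def physics_violations(frames: List[Frame], wall_x: float) -> List[str]:
    """Return human-readable violations found in a predicted rollout."""
    violations: List[str] = []
    for t in range(1, len(frames)):
        prev, curr = frames[t - 1], frames[t]
        # Persistence: objects must not vanish or appear between steps.
        if set(prev.positions) != set(curr.positions):
            violations.append(f"step {t}: object set changed")
        # Non-penetration: nothing may switch sides of the wall in one step.
        for obj_id, x in curr.positions.items():
            x_prev = prev.positions.get(obj_id)
            if x_prev is not None and (x_prev - wall_x) * (x - wall_x) < 0:
                violations.append(f"step {t}: {obj_id} passed through the wall")
    return violations


def verifiable_reward(frames: List[Frame], wall_x: float) -> float:
    """Binary reward: 1.0 only if the rollout breaks no checked constraint."""
    return 1.0 if not physics_violations(frames, wall_x) else 0.0
```

Because the checks are rule-based, the reward can be computed automatically for any number of rollouts, which is what makes it usable as a reinforcement learning signal.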

W0–W5 world model ability classification of Physis. Drawn by Physis.

Q7: In the "W0–W5 world model ability classification" mentioned by Physis, which level is the model currently at? When "a robot can successfully crack an egg", which level does the model belong to?

Chen Boyuan: This classification is comparable to L0–L5 in autonomous driving. Currently, most models are at the W0–W1 level, able to respond to actions and generate smooth videos.

Physis is working on the leap from W1 to W2, which is the most critical watershed. W2 represents that the model truly has the potential of a base model, solving the problem of "physical authenticity" and understanding causal relationships. If the goal is just to let a robot "crack an egg", vertical training can also achieve excellent control, but it may only understand the local scenario of cracking an egg and not understand general physics.

The core measure of whether a base model is good enough is "action following" and generalization ability. A good base model can not only crack an egg but also, for example, play with a yo-yo in a scenario involving flexible materials. Large models have achieved a leap in general mathematical and code reasoning through reinforcement learning, and the world model likewise needs to learn under clear physical verification signals to achieve a general, exponential leap.

Q8: In the process of model development, what is the most core bottleneck: computing power, data, or algorithms?

Chen Boyuan: I think they are all very important. But if I have to choose one, I think it is the "paradigm" reflected behind data and algorithms. Because the three of them are actually unified in the change of the underlying paradigm.

Data level: We built a data pyramid. The first layer is real-world videos with strong physical interactions (learning the world state); the second layer is first-person (ego-centric) videos and game engine data (learning the transitions caused by actions); the third layer is extremely scarce data on abrupt physical changes (such as glass breaking and fluid fracture), and we produce this high-value data through a self-built closed-loop data production pipeline.

Computing power level: The key lies in computing power efficiency. Scaling in the physical latent space ensures that what is learned with the same computing power is all effective physical signals, not visual noise.

Algorithm level: Reinforcement learning provides an infinitely available physical teacher and introduces automated verification into the model.

Q9: GPUs are expensive and scarce, and real-world physical interaction data is even more so. How does Physis solve these problems?

Chen Boyuan: We mainly solve these problems through data cooperation and reconstructing the data acquisition paradigm.

First, in terms of data cooperation, we have established upstream and downstream partnerships with some companies, which provide a large amount of data from real machines for model training and form a good foundation.

Second, compared with the sheer amount of data, the more critical question is "what kind of data we need to learn". The Internet generates a huge amount of videos every day. YouTube alone can generate hundreds of thousands of hours of content every day, but only about 5% of it contains real physical interactions. For learning physics, we need the scarce data with strong physical dynamic attributes, not the 95% of visual redundancy. Therefore, we built a data pyramid:

L1 layer: Learn the physical state by screening high-quality real-world videos.

L2 layer: Learn the state transitions caused by actions through first-person (ego-centric) videos and simulation engine data.

L3 layer: Through a self-built closed-loop data production pipeline, construct extreme edge states in the simulation environment, such as a cup on the verge of collapse, and actively screen for data on abrupt physical changes, such as glass breaking and car explosions. This sparse, abrupt-change data is highly cost-effective for helping the model truly master real physical laws and is the most critical step towards physical correctness.
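The three layers can be read as a routing rule applied to candidate clips. The sketch below is one hypothetical way to express that rule; the `Clip` fields, the source labels, and the precedence of the checks are assumptions made for illustration and are not taken from Physis's pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clip:
    # All fields are hypothetical metadata a screening step might produce.
    source: str                     # "real_video", "egocentric", or "simulation"
    has_physical_interaction: bool  # False for purely visual content
    has_action_labels: bool         # an action is known for each transition
    is_abrupt_change: bool          # e.g. glass breaking


def assign_layer(clip: Clip) -> Optional[str]:
    """Route a clip to a pyramid layer, or None if it should be filtered out."""
    if not clip.has_physical_interaction:
        return None  # visual redundancy: dropped before training
    if clip.is_abrupt_change:
        return "L3"  # scarce abrupt-change data takes priority
    if clip.has_action_labels and clip.source in {"egocentric", "simulation"}:
        return "L2"  # transitions caused by known actions
    if clip.source == "real_video":
        return "L1"  # physical state from real-world video
    return None
```

Checking the abrupt-change flag first reflects the point made above that this data is the scarcest and most valuable, so a clip that qualifies for several layers is kept in the highest one.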

Scenario implementation: general first, then adaptation

Q10: Which vertical scenarios, such as embodied intelligence, industrial simulation, or game physics, will the flagship model you plan to release at the end of this year first target?

Chen Boyuan: Our positioning is "general first, then adaptation". The same underlying base model can serve different scenarios by attaching different plug-and-play decoders. For example, a video decoder can be used for game rendering, a motion decoder for industrial digital twins, and an action decoder for embodied intelligence control.
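The "one base model, many decoders" arrangement can be sketched as a shared latent predictor with named output heads. This is a toy illustration under stated assumptions: the `BaseWorldModel` class, its methods, and the stand-in lambdas are hypothetical and only show the structure, not any real model.

```python
from typing import Callable, Dict, List

Latent = List[float]


class BaseWorldModel:
    """Shared latent predictor with swappable, scenario-specific decoders."""

    def __init__(self, predict_next: Callable[[Latent], Latent]) -> None:
        self._predict_next = predict_next
        self._decoders: Dict[str, Callable[[Latent], object]] = {}

    def attach_decoder(self, name: str, decoder: Callable[[Latent], object]) -> None:
        # Adding a scenario means adding a decoder; the base model is untouched.
        self._decoders[name] = decoder

    def step(self, latent: Latent, head: str) -> object:
        # Every scenario shares the same physical prediction in latent space.
        next_latent = self._predict_next(latent)
        return self._decoders[head](next_latent)


# Stand-in dynamics and decoders, purely for illustration.
model = BaseWorldModel(predict_next=lambda z: [2.0 * x for x in z])
model.attach_decoder("motion", lambda z: {"velocity": z[0]})
model.attach_decoder("action", lambda z: [round(x, 2) for x in z])
```

Here `model.step(latent, "motion")` and `model.step(latent, "action")` run the same latent prediction and differ only in how the result is decoded, which is the reuse the answer describes.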

Within one year, we are not in a hurry to develop a world model specifically for embodied intelligence, industrial scenarios, or games because in the real physical world, they are actually interconnected. Focusing on vertical scenarios too early is prone to overfitting.

After the flagship model matures, we will first conduct verification and implementation in scenarios such as embodied intelligence and industrial simulation. The model to be released at the end of the year will focus on demonstrating its prediction ability in unseen physical scenarios to global developers and become a provider of physical world infrastructure.

Q11: At the W2 and W3 stages, how much improvement can the world model bring compared with traditional engines such as Unity and Unreal? Is it a disruptor or a complement?

Chen Boyuan: In the short term, it is a complement; in the long term, it is a disruptor. Traditional engines rely on hand-written rules and are more accurate in predicting rigid bodies, but they have blind spots in complex interactions such as those involving flexible objects (fluid fracture, deformation). The world model learns real physical causality through interactions and has three major advantages:

It naturally supports complex physical interactions without relying on hand-written rules;

Strong generalization ability. Traditional engines need to be re-parameterized when changing scenarios, while the base model can generate millions of scenarios with real physical attributes from a single command;

High efficiency. State prediction takes seconds.

When the model reaches W3, machines will change from "executing rules" to "understanding laws and making autonomous deductions".

Q12: Do you play games yourself? Which