Exclusive | Secured over USD 100 million in financing, Cao Yue from Sand.ai: Why video is the most critical path to the world model
“With each generation of models, we're betting on a non-consensus idea.”
Text | Deng Yongyi
Editor | Zhang Yuxin
Cao Yue, the founder of Sand.ai, doesn't really care which side of the consensus he stands on.
Sand.ai is a company focusing on video generation models and products, founded in January 2024. Cao Yue's story of founding Sand.ai has been told many times: after his previous startup "Light Years Beyond" came to an abrupt end, Cao Yue quickly started Sand.ai to develop video generation models.
At that time, the mainstream narrative in the market was the Diffusion route, and hardly anyone thought that the autoregressive route chosen by Cao Yue was the right solution.
In early 2025, after releasing the Magi-1 model, which was trained on an autoregressive architecture, Cao Yue quickly realized that "pictures alone are not enough," so the team began to explore synchronous audio-video generation. Later, Sand.ai became one of the earliest teams to release a synchronous audio-video generation model, second only to Google Veo 3. Magi-1 also held the top position for a long time on Google DeepMind's Physics-IQ benchmark.
In November 2025, Cao Yue made another bet: he decided to lead the team in shifting the model architecture from Dense to MoE. "At that time, there were probably hardly any video companies in China going all in on this."
"After releasing the synchronous audio-video generation model Gaga-1, we found that continuing to scale up under the Dense architecture would lead to a sharp increase in costs. There is an impossible trinity for video models: cost, speed, and effectiveness. The only way to break through it is through research, and MoE is the answer," Cao Yue said.
In Q3 2026, Sand.ai will release a new-generation video generation model built on the MoE architecture, combining efficient inference with the largest parameter scale in the current open-source field. Cao Yue said he is confident that it will reach a top-tier standard, and that it will be open-sourced to everyone.
△Caption: The image captured by the camera is consistent with the actions of the girl in the frame
△Caption: The video generated by Sand.ai's new model
This company has just completed two rounds of financing totaling over $100 million. The investors include Look Capital, Lollapalooza Capital (Wang Huiwen's family office), Jiukun Venture Capital, Matrix Partners China, MSA Capital, Sinovation Ventures, Xianghe Capital, Source Code Capital, Zhongke Chuangxing, Hongtai Fund, Capital Today, Huaye Tiancheng, Yunhui Capital, IDG, Baidu Ventures and other first-tier institutions. Xinghan Capital served as the financial advisor for this round of financing.
After nearly three years of entrepreneurship, whether it's betting on the autoregressive route, developing synchronous audio-video generation, or moving to the MoE architecture, Cao Yue's underlying thinking is the same: "In the endgame, everyone should be able to consume highly personalized content. On this premise, your content production cost must be reduced to a very low level," Cao Yue said.
Another thing that remains unchanged is that Cao Yue doesn't care whether he stands on the side of market consensus. "Once you care too much about others' perceptions, it's highly likely that you're not thinking from first principles."
The same answer emerged when we asked him "What is a world model?"
"It's very noisy now," Cao Yue said. "When everyone talks about the world model, they probably don't know what they're talking about. It has become a buzzword."
The world model is one of the most nebulous AI concepts of 2026. Academic giants like Yann LeCun and Fei-Fei Li have bet on completely different directions; meanwhile, Sora, which once shocked the industry as the "world simulator," temporarily shut down in March. In China, many star startups have emerged in this field, and many companies that used to focus on 3D generation and video generation are also making high-profile shifts toward world models.
On the one hand, the world model represents people's imagination for the future model route - a unified model that integrates language, images, videos, and audio; on the other hand, in the increasingly narrow model competition landscape, this term has also become an outlet for FOMO (fear of missing out) emotions.
Cao Yue's judgment is that the world model is still in the "pre-GPT era" - the era before the emergence of GPT-1, with insufficient data, unclear definitions, and technical routes that are far from converged.
But what he's certain of is that the video model is the most important path to that endgame. "You need to see what kind of data is closest to observation of the world and has a large enough volume. In fact, it's only video."
While continuously advancing the training of the basic model, Sand.ai has already taken steps on the application side, exploring products such as digital humans and video agents. The music agent product VidMuse, launched in January this year, has achieved an ARR of tens of millions of dollars in three months.
"If a startup doesn't have the ability to train a SOTA model, it's easy to be integrated by model manufacturers." Cao Yue isn't troubled by the trendy discussion of "whether a model company should do applications." He said that Sand.ai will continue to develop models and applications simultaneously.
Upon the completion of this round of financing, "Intelligent Emergence" had a conversation with Cao Yue about his technological judgments and application explorations over the past three years.
The following is a summary of Cao Yue's views by "Intelligent Emergence":
With each generation of models, we're betting on a non-consensus idea
From day one, we believed that autoregression is the most fundamental way to model video data.
When everyone in the market was working on pure Diffusion models, we believed that there must be a causal relationship in the time sequence of videos. Many physical laws are essentially functions that change over time - Predict Next Frame, Predict Next Second. This is the most fundamental training paradigm for video data.
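The "predict next frame" paradigm can be illustrated with a minimal sketch. It is a toy, not Sand.ai's method: a frame is assumed to be a flat list of pixel values, and `linear_extrapolator` is a hypothetical stand-in for a learned model. The point is only the causal structure, in which each new frame depends solely on frames that came before it.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # toy assumption: a frame is a flat list of pixel intensities


def linear_extrapolator(history: Sequence[Frame]) -> Frame:
    """Hypothetical stand-in for a learned model.

    Predicts the next frame by linearly extrapolating each pixel
    from the last two frames in the history.
    """
    prev, last = history[-2], history[-1]
    return [2 * b - a for a, b in zip(prev, last)]


def rollout(
    predict_next: Callable[[Sequence[Frame]], Frame],
    context: Sequence[Frame],
    steps: int,
) -> List[Frame]:
    """Autoregressive generation: every predicted frame is appended to the
    history and becomes input for the next prediction, so frame t only
    ever sees frames before t."""
    frames = [list(f) for f in context]
    for _ in range(steps):
        frames.append(predict_next(frames))
    return frames[len(context):]


if __name__ == "__main__":
    # Two context frames of a pattern that brightens at a constant rate.
    context = [[0.0, 0.0], [1.0, 2.0]]
    for frame in rollout(linear_extrapolator, context, steps=3):
        print(frame)
```

A real video model would replace `linear_extrapolator` with a trained network; the rollout loop is the part that reflects the causal, time-ordered assumption described above.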
We were the earliest team to explore autoregressive video generation. The Magi-1 model we released last year ranked first on Physics-IQ, the physical-realism benchmark proposed by Google DeepMind, and maintained the lead for a long time, surpassing Nvidia's latest flagship world model Cosmos3-Super and far outperforming other pure Diffusion models like Sora 2.
Synchronous audio-video generation is not just a functional upgrade; it's a more complete compression of the world state.
After releasing Magi-1, we found that pictures alone were not enough. Sound and pictures are naturally aligned, and generating them together helps both: with synchronous audio-video generation, the sense of realism improves significantly even if you only look at the pictures. Essentially, having both pictures and sound is a higher-dimensional signal that comes closer to expressing the state of the world. So we started exploring synchronous audio-video generation in May last year and were one of the earliest teams to release a synchronous audio-video generation model, second only to Google Veo 3.
There is an impossible trinity for video models: cost, speed, and effectiveness. Last year, we believed that the only way to break through was through research, and MoE is the answer.
In 2025, we decided to shift to MoE. At that time, there were hardly any video model manufacturers in the market pushing this forward with all their might.
This is because after releasing the synchronous audio-video generation model Gaga-1, we found that continuing to scale up the Dense model would lead to a sharp increase in costs - to achieve the same effect with the Dense architecture, the inference cost would be at least 3 to 5 times higher, and the same goes for the training cost. At that time, we didn't see any company working on video MoE, but we thought it was very important: first, if you want to continue scaling up, you must figure out MoE; second, if you want more ordinary people to afford video models, you must reduce the cost at the same level of quality.
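To show where a cost gap of this kind can come from, here is a back-of-the-envelope sketch. It rests on two strong simplifications that are assumptions, not claims from the interview: per-token compute is taken to be proportional to the parameters actually used, and a Dense model is assumed to need the MoE model's full parameter count to match its quality. The function names and the numbers are hypothetical and are not Sand.ai's configuration.

```python
def params_used(shared: float, per_expert: float, experts_used: int) -> float:
    """Parameters touched per token: shared layers plus the experts in use."""
    return shared + per_expert * experts_used


def dense_to_moe_cost_ratio(
    shared: float, per_expert: float, num_experts: int, top_k: int
) -> float:
    """Ratio of per-token compute: Dense (all parameters) vs. MoE (top_k experts).

    Assumes compute is proportional to parameters used, which ignores
    attention cost, routing overhead, and communication.
    """
    dense = params_used(shared, per_expert, num_experts)
    moe = params_used(shared, per_expert, top_k)
    return dense / moe


if __name__ == "__main__":
    # Hypothetical sizes in billions of parameters, chosen only for illustration.
    ratio = dense_to_moe_cost_ratio(shared=2.0, per_expert=1.0, num_experts=16, top_k=2)
    print(f"Dense / MoE per-token compute: {ratio:.1f}x")
```

Under these assumptions the ratio depends only on how many experts exist versus how many are activated per token, which is why sparse activation is attractive when scaling up.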
We've explored a new video MoE architecture and training scheme, and solved the core problems of applying MoE to video models.
The challenges faced by video MoE are different from those of language-model MoE - the token sequence of a video is much longer than that of text, and the redundancy among tokens is also higher. As a result, problems such as communication overhead, load balancing, and training stability are magnified. We've made multiple innovations in the model architecture to achieve stable training of ultra-large-scale video MoE models for the first time.
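The load-balancing problem mentioned here can be made concrete with a small sketch of generic top-k expert routing. This is the textbook mechanism, not Sand.ai's architecture, which the interview does not describe; the function names and the `imbalance` metric are illustrative assumptions. When many tokens are similar, as redundant video tokens tend to be, their gate scores are similar too, and a few experts end up with most of the work.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts for one token and renormalize
    their gate weights so they sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def expert_load(all_logits: Sequence[Sequence[float]], num_experts: int, k: int) -> List[int]:
    """Count how many tokens each expert receives."""
    counts: Counter = Counter()
    for logits in all_logits:
        for expert, _ in route_top_k(logits, k):
            counts[expert] += 1
    return [counts[e] for e in range(num_experts)]


def imbalance(load: Sequence[int]) -> float:
    """Illustrative metric: busiest expert's load divided by the mean load.
    1.0 means perfectly balanced; larger values mean hot spots."""
    mean = sum(load) / len(load)
    return max(load) / mean


if __name__ == "__main__":
    # Three near-duplicate tokens that all prefer expert 1.
    tokens = [[1.0, 3.0, 2.0], [0.0, 5.0, 1.0], [2.0, 2.5, 0.0]]
    load = expert_load(tokens, num_experts=3, k=1)
    print(load, imbalance(load))
```

In practice, MoE training usually adds a balancing term or capacity limit so that routing does not collapse onto a few experts; the sketch only measures the problem and does not solve it.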
We have a bet for each generation of models. Magi-1 bet on autoregression, Gaga bet on synchronous audio-video generation, and the new-generation model bets on MoE.
The new model we're going to release in July is the convergence point of the capabilities accumulated by these three generations of models - using the MoE architecture to integrate general-scenario generation, synchronous audio-video generation, multi-shot narrative, and multi-reference generation into the same model, with the goal of achieving SOTA in every dimension.
Why integrate? For example, Seedance 2.0 has proven that multi-shot narrative is a necessity, which is a point we didn't think was that important before. So, similar capabilities that have been proven important in the market should ultimately be merged into the same model - they're not independent features, and they'll also help the model achieve better results together.
Video is the most important path to the world model, but it's just an intermediate stop
The term "world model" has been completely misused. When everyone talks about the world model, they may have different concepts in mind.
Each concept represents a structure. You need to understand what's behind it to discuss it with others. But now many people only have a general idea of what it is through various channels, and it has simply become a buzzword.
First, there are still very large differences in people's understanding of the world model; second, people's expectations for when it will generate real value are not aligned.
If we have to define the world model, I think it's still in the pre-GPT era (the era before the emergence of GPT-1).
First of all, we don't have the data. We live in a world with 3D space and a time axis, but data such as pictures, sounds, temperature, and pressure have very high dimensions, and we don't have complete, large-scale observational data of the world.
There is also no convergence on the training path of the world model. Some people think it should be achieved through "predicting the next state," but we believe that what should really be predicted is not any human-defined hidden state, but the original observations given by the world itself.
We believe that video data is the most important data type for the world model.
First of all, video data is the largest-scale data type among the world's observational data. It encodes time, space, vision, and hearing simultaneously - it's a structured slice of the 4D physical world projected through a camera. Among all the available world observational data, it has the highest information density, the richest dimensions, and the largest volume.
Video is far more than just pictures. The information retained in videos is far more than what we intuitively think. Tactile sense, temperature, material properties, and even intentions and emotions - a large amount of information that belongs to other modalities in human perception is also encoded in the temporal changes of vision and hearing.
Some people say to "predict the next state," but no one can help the model define what "state" exactly is.
Many people think that directly predicting observations may have a lot of redundancy and be inefficient, so they hope to artificially define states to improve training efficiency.
This lesson has already been demonstrated by LLMs: many people tried to explicitly model the representation of words, sentences, and paragraph structures, and it was indeed proven "efficient" at a certain stage, but in the end, at large scale, all of it was killed by "predict next token." We shouldn't repeat the same mistake in multi-modal modeling.
History has repeatedly proven that every time we try to disassemble the world with human prior knowledge, we're essentially underestimating its complexity. It's recommended to recite "The Bitter Lesson" in full.
We believe that what should really be predicted is not any human-defined hidden state, but the original observations given by the world itself - modeling raw data (in the case of video, pixels, frames, and video) may not be the most efficient solution at a certain stage, but it's probably the most scalable and has the highest upper limit.
If we want to define several elements for the world model, first, its core is prediction - but we need to be vigilant about using human prior knowledge to define "what to predict"; second, it needs sufficiently complete and multi-dimensional data to compress the information of the real world; that is, it should be able to directly deduce the next moment's observation from the current observation, rather than from an artificially defined hidden state.
From this perspective, many "world models" that people talk about today are actually still very early-stage things. A real world model is not just about generating a seemingly reasonable video, but about understanding a world in a 3D space with a time axis and being able to continuously predict the real observations of the next moment.
The evolution of video generation models is also a process of approaching the world model step by step.
You can imagine the evolution of video models as a child's process of getting to know the world. At first, he can only look at photos, and the world is static - this is image generation.
Then the pictures start moving, and he can watch animations - this is early-stage video generation. Then there is sound in the pictures, such as the sound of the wind, footsteps, and collisions - this is synchronous audio-video generation.
Then he finds that when he looks at the same room from a different angle, the tables and chairs are still in the same place - this is 3D spatial consistency.
Slowly, he learns that if he pushes a cup to the edge of the table, it will fall - this is a causal relationship. Finally, he can reach out and push the door, and the door really opens - this is real-time interaction.
The key point is: No one gives this child a physics textbook and tells him "the gravitational acceleration is 9.8 and the speed of sound is 340." He just figures out how the world works from the more and more complete observations he sees and hears.
The evolution of video models follows the same path - instead of artificially defining "state variables" for the model, we let it develop an understanding of the world from more and more complete observations.
As a startup, you need to figure out where your "stop" is at a certain stage.
For startups, after training a SOTA video generation model, they can engage in content production, sell tokens, and develop agents. Content production is naturally a huge direction, and its feedback loop closes much faster than in fields like embodied AI. You can gradually reach the endgame (AGI).
We need to develop models and products
After vertical integration, a company that develops models will have better cost and user experience.
Why do we need to develop both models and products?