
A war with no unified name: the world model landscape of domestic tech giants

IT桔子 (IT Juzi), 2026-06-25 17:15
World Models: Tech Giants Place Their Bets at the "World-Building" Gambling Table

The term "world model" has no unified definition in the industry. Some call it a world model, some a world base model, some physical AI, and others fold it into large self-driving models, VLA, or embodied intelligence systems without naming it separately.

Alibaba's Qwen-AgentWorld, HappyOyster, and Qwen-RobotWorld target the language world, the virtual world, and the physical world respectively; Tencent's HY-World leans towards editable 3D worlds; car manufacturers prefer to talk about self-driving world models or world behavior models; Huawei and Baidu simply don't use the term "world model" on its own.

Behind the naming chaos, everyone is actually doing the same thing:

Before a machine takes real action, it first builds an internal dynamic environment that can be rolled forward and replayed. This reduces the otherwise unbounded reliance on real-world data and compresses the real world into a data engine that can generate, fail, and start over without limit.

IT Juzi recently released a report on 33 domestic startups working on "world models", which has attracted industry attention. Today, let's look at the strategies of large companies in this area.

While startups are struggling with data collection rights and computing power budgets, Alibaba, Tencent, Huawei, NIO, XPeng, and Li Auto have quietly made world models a new competitive arena.

The world model represents an ambition: to enable AI to go beyond simply recognizing the world and simulate it mentally first.

Self-driving manufacturers want to use it to generate "test papers" for rainy days, snowy days, and unusual obstacles; embodied intelligence teams want robots to fall 100,000 times in simulation before they go out into the world; game and social companies want to use it to create a parallel universe that people can immerse themselves in.

Large companies have different focuses when entering this field, but their core goal is the same: to compress the real world into a data engine that can be deduced and reviewed infinitely.

I. Internet Giants: From the Digital World to the Physical World

Alibaba's world model layout is most like "laying out the items on the shelf one by one".

In June 2026, it made three major announcements within roughly ten days: the Qwen-Robot series on June 16, HappyOyster 1.0 on June 17, and Qwen-AgentWorld on June 24.

Qwen-AgentWorld is a native language world model. Instead of generating images, it generates environments: across seven environments (MCP tools, search, terminals, code engineering, the Web, operating systems, and Android), the model can simulate real interactions, learn autonomously, and refine itself through reinforcement learning. It comes in two sizes, both MoE architectures: 35B total parameters with 3B activated, and 397B total parameters with 17B activated. The training data comes from over 10 million real-world interaction trajectories, and both the model and the evaluation benchmark AgentWorldBench have been open-sourced. In effect, this treats the world model as a "training ground" for intelligent agents rather than a "decoration".
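To put the two MoE sizes in perspective, the short calculation below works out what share of each model's parameters is activated. It uses only the four parameter counts quoted above; the labels "(small)" and "(large)" and the function names are illustrative assumptions, not official names.

```python
# Parameter counts quoted in the article, in billions.
# The "(small)" / "(large)" labels are hypothetical, used here only to tell the two apart.
MODELS = {
    "Qwen-AgentWorld (small)": {"total_b": 35, "active_b": 3},
    "Qwen-AgentWorld (large)": {"total_b": 397, "active_b": 17},
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters that are activated."""
    return active_b / total_b


def summarize(models):
    """Map each model name to its activated share, as a percentage with one decimal."""
    return {name: round(100 * active_fraction(**p), 1) for name, p in models.items()}


SUMMARY = summarize(MODELS)
for name, pct in SUMMARY.items():
    print(f"{name}: {pct}% of parameters activated")
```

On these figures, the larger model activates a smaller share of its weights (about 4.3%) than the smaller one (about 8.6%), which is the usual way MoE designs grow total capacity without a proportional rise in per-step compute.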

HappyOyster 1.0 presents a different face. It is more like a "playable movie set": users input a sentence or a picture, and it generates an open-world environment in which they can intervene freely in two modes, "world exploration" and "real-time directing". The exploration mode supports continuous real-time movement and camera control for up to 1 minute, and the directing mode can generate real-time 480p/720p video for over 3 minutes. Alibaba positions it as an entry point for industries such as interactive games, virtual companionship, interactive short dramas, and cultural and tourism experiences.

Qwen-RobotWorld takes another direction. It is the "thinking brain" in Alibaba's embodied intelligence trio, working in tandem with the VLA manipulation model Qwen-RobotManip and the VLN navigation model Qwen-RobotNav, and aims to give robots an internal world for pre-simulation.

Combined, Alibaba is vying for the right to define the language world, virtual world, and physical world simultaneously.

Tencent's Hunyuan takes a different path. Its HY-World series is more like an "automated factory for 3D games".

In July 2025, Tencent released and open-sourced the Hunyuan 3D world model 1.0 at WAIC; it was upgraded to 1.5 in December; and HY-World 2.0 was released and open-sourced in April 2026. The input can be text, single images, multiple images, videos, or even white models (untextured 3D models), and the output can be 3DGS, meshes, or point clouds.

Version 2.0 introduced modules such as HY-Pano 2.0, WorldNav, WorldStereo 2.0, and WorldMirror 2.0, closing the loop across world generation, world reconstruction, panoramic images, and real-time world generation.

Tencent's strength lies in game and social scenarios. The real users of HY-World are not those training self-driving systems, but those building game levels, running virtual shoots, and creating digital twins.

ByteDance's world model project is like a "stealth mission" with short-video data in its DNA.

In August 2025, The Information reported that ByteDance's Seed team, led by Zhou Chang, a former core member of Tongyi Qianwen, was developing a world model. The project's biggest assets are the more than 1 billion daily video streams on Douyin and TikTok and the EX-4D framework, which can convert monocular videos into 4D multi-view scenes. It is benchmarked against Google's Genie 3 and Meta's V-JEPA 2. The goal is not to create a beautiful video generator, but to build a "digital twin" that can simulate physical laws.

At the Volcengine FORCE Conference on June 23, 2026, ByteDance didn't directly release this world model, but introduced the Doubao Seed 2.1 series, the Seedance 2.5 video generation model, the Seedream 5.0 Pro image generation model, and a new audio generation model.

An exclusive report from 36Kr summarized ByteDance's 2026 AI strategy into four propositions: the world model should reach the global SOTA by the end of the year, Seedance should explore dynamic generation, Coding should strengthen the foundation, and Doubao should accelerate commercialization.

This means that the world model is the top priority within ByteDance. It just chooses to let Seedance and Doubao take the spotlight first while continuing to develop more advanced technology.

Huawei's Pangu world model is "low-key but powerful".

At its Developer Conference in June 2025, Huawei released the Pangu large model. Built on the Pangu multimodal large model, its core capability is generating a high-precision digital physical space from a single image. It can predict collisions, train robotic arms to grasp, and generate driving videos and LiDAR point clouds, helping Huawei's ADS end-to-end model ship "a new version every two days".

Huawei doesn't use the term "world model", but treats it as a "training base" for smart cars and embodied intelligence. Its cooperation with GAC is a typical example: it achieves pixel-level correspondence between 2D videos and 3D point clouds and reconstructs complex corner cases in just a few minutes.

At HDC 2026 in June 2026, Huawei upgraded the Pangu large model to 7.0 and released the Ascend 910C. Yu Chengdong took charge of Pangu again, but there was no news of a new standalone version of the world model.

This approach, in which the world model does not exist independently but serves the industrial closed loop, is Huawei's consistent style.

Baidu entered the self-driving field earlier. Apollo ADFM, released in May 2024, was positioned as "the world's first large self-driving model supporting L4 autonomous driving".

Although Baidu didn't name it a world model, it essentially has the functions of one: understanding the physical world through an end-to-end neural network and predicting the behavior of traffic participants. In November 2025, the Wenxin large model 5.0 was unveiled in its native full-modal form, with a parameter scale of 2.4 trillion; the official version was launched in January 2026.

Baidu's world model capabilities are integrated into a larger strategy. Baidu's approach is not to talk about the world model separately, but to make Apollo and Wenxin complement each other.

Xiaomi and SenseTime represent two "technical approaches".

Xiaomi OneVL, open-sourced on May 13, 2026, unifies VLA, the world model, and latent-space reasoning within a single framework, emphasizing the interpretability of the visual reasoning process. It serves as a foundational component applicable to both self-driving and embodied intelligence.

SenseTime's Jueying Kaiwu is more like an experienced "driver" on the job. A Frost & Sullivan report in September 2025 described it as the industry's first mass-produced, interactive world model. It can generate 150-second, 1080p, 11-view driving videos and has accumulated the industry's largest generative driving dataset, WorldSim-Drive, along with a library of tens of millions of generated scenarios.

In June 2026, Daxiao Robotics, founded by SenseTime co-founder Wang Xiaogang, announced the completion of a financing round of hundreds of millions of dollars. Its Kaiwu 3.0 world model ranks first on four major generative prediction leaderboards in areas such as embodied video generation and task instruction following.

SenseTime's world model is spreading from smart cars to robots.

II. Car Manufacturers: Treating the World Model as a Driving School and Exam Venue

If the world models of internet giants are about "creating the world", then those of car manufacturers are about "using the world".

NIO is the first Chinese car manufacturer to wave the flag of the world model.

At NIO IN in July 2024, Ren Shaoqing released NWM (NIO World Model), positioning it as China's first smart-driving world model.

It uses a multivariate autoregressive generative architecture and does two things: "imaginative reconstruction" in space and "imaginative deduction" in time.

Given a real-world scene, it can reconstruct a 3D world; given a three-second prompt, it can generate a future video of over two minutes. It deduces 216 candidate trajectories every 0.1 seconds and selects the optimal one.
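The "deduce many trajectories, keep the best" step can be pictured with a toy planner. The sketch below is not NIO's method: the unicycle motion model, the cost terms, the single obstacle, and the 12 x 18 split that yields 216 candidates are all assumptions made for illustration. Only the candidate count and the 0.1-second cadence come from the article.

```python
import itertools
import math

N_STEER, N_ACCEL = 12, 18      # 12 * 18 = 216 candidates (hypothetical split)
DT, HORIZON_STEPS = 0.1, 20    # 0.1 s steps, 2 s rollout (assumed horizon)


def rollout(yaw_rate, accel, speed=10.0):
    """Integrate a simple unicycle model; returns a list of (x, y) points."""
    x = y = heading = 0.0
    points = []
    for _ in range(HORIZON_STEPS):
        speed = max(0.0, speed + accel * DT)
        heading += yaw_rate * DT
        x += speed * math.cos(heading) * DT
        y += speed * math.sin(heading) * DT
        points.append((x, y))
    return points


def cost(points, yaw_rate, accel, obstacle=(15.0, 0.0)):
    """Toy cost: stay near the lane centre (y = 0), keep clear of one obstacle, stay smooth."""
    lane = sum(y * y for _, y in points) / len(points)
    min_gap = min(math.dist(p, obstacle) for p in points)
    collision = 100.0 if min_gap < 1.5 else 0.0
    comfort = 0.1 * (yaw_rate ** 2 + accel ** 2)
    return lane + collision + comfort


def candidates():
    """Grid of (yaw_rate, accel) pairs; each pair defines one candidate trajectory."""
    steers = [-0.3 + 0.6 * i / (N_STEER - 1) for i in range(N_STEER)]
    accels = [-4.0 + 6.0 * j / (N_ACCEL - 1) for j in range(N_ACCEL)]
    return list(itertools.product(steers, accels))


def plan_once():
    """Score every candidate and return (cost, yaw_rate, accel) of the cheapest one."""
    scored = [(cost(rollout(s, a), s, a), s, a) for s, a in candidates()]
    return min(scored)


best_cost, best_steer, best_accel = plan_once()
print(f"{len(candidates())} candidates, best cost {best_cost:.3f}")
```

In a real system `plan_once` would be called every 0.1 seconds with fresh perception, and the rollouts would come from the learned world model rather than a hand-written motion model.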

NIO's logic is clear: end-to-end models are not enough. A truly intelligent self-driving system needs to be able to "imagine the road conditions with eyes closed" like a human.

On June 18, 2026, NIO officially rolled out the new NWM 2.0 to over 700,000 users across all models. Even users who bought their cars four years ago can upgrade for free, and the four major vehicle systems, Banyan, Cedar, and Coconut +, were updated simultaneously. The new version is the first in China to have the self-driving model directly output raw control signals for the steering wheel, accelerator, and brake pedal, and it upgrades the training system from "world model + closed-loop reinforcement learning" to a three-layer system of "world model + supervised fine-tuning + closed-loop reinforcement learning". Its AEB covers 6.7 times more scenarios than standard AEB, and the false braking rate is reduced to once every 100,000 kilometers.

The Shenji NX9031 chip is even described as "designed for the world model from the start".

Li Auto proposed the concept of a world model based on "reconstruction + generation" in the second half of 2024 and published DrivingSphere at CVPR 2025.

It consists of the OccDreamer diffusion model and the VideoDreamer ST-DiT, which together create a high-fidelity 4D closed-loop simulation environment.

Traditional open-loop simulation can only evaluate what a model "sees", while closed-loop simulation can evaluate what a model "does". Li Auto's world model is like an exam venue that can generate an unlimited number of difficult scenarios, letting the self-driving system become familiar with tricky situations inside the computer rather than on the road.
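The open-loop versus closed-loop distinction can be made concrete with a one-dimensional lane-keeping toy. Everything in the sketch below is an assumption for illustration (the controllers, the drift term, the numbers); it is not DrivingSphere or any Li Auto code. Open-loop evaluation replays logged states and scores only the gap between the policy's action and the reference action; closed-loop evaluation lets the policy's own actions drive the simulator and scores where the vehicle ends up.

```python
def expert(offset):
    """Reference controller: steer back toward the lane centre."""
    return -0.5 * offset


def policy(offset):
    """Policy under test: close to the expert, with a small constant bias (hypothetical)."""
    return -0.5 * offset + 0.05


def step(offset, action, drift=0.02):
    """Toy simulator: lateral offset after one time step."""
    return offset + action + drift


def open_loop_error(steps=50):
    """Replay the expert's logged states; score only the per-step action gap."""
    offset, total = 1.0, 0.0
    for _ in range(steps):
        total += abs(policy(offset) - expert(offset))
        offset = step(offset, expert(offset))  # the state always follows the log
    return total / steps


def closed_loop_offset(controller, steps=50):
    """Let the controller's own actions drive the simulator; return the final offset."""
    offset = 1.0
    for _ in range(steps):
        offset = step(offset, controller(offset))
    return offset


print(f"open-loop action gap:      {open_loop_error():.3f}")
print(f"closed-loop offset, expert: {closed_loop_offset(expert):.3f}")
print(f"closed-loop offset, policy: {closed_loop_offset(policy):.3f}")
```

In this toy, the open-loop score reports a small action gap of 0.05, while the closed-loop run shows the consequence: the policy settles about 0.14 from the lane centre, against 0.04 for the reference controller. The behavioural difference only appears once the policy's outputs feed back into the environment.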

At Livis Day in June 2026, Li Auto further upgraded this capability to "Mach VLA", a native multimodal MoE architecture that unifies perception, prediction, and planning. The dual M100 chips on the vehicle deliver 2560 TOPS of computing power, and the reaction time is 0.28 seconds.

According to Li Auto's announced roadmap, it will push the new Mach VLA to AD Max users in the third quarter and aim to match Tesla's FSD V14 in the fourth quarter. Li Auto is no longer just a car company; it is shaping itself into a provider of an embodied intelligence system, Livis.

XPeng Motors' approach shows a hierarchical pattern of "first expand, then refine".

In April 2025, at an AI technology sharing event in Hong Kong, XPeng first revealed that it was developing an ultra-large-scale self-driving "world base model" with 72 billion parameters.

One year later, on April 1, 2026, XPeng officially released the X-World world model technical report.

Built on video diffusion, it adapts the latent-space video generation paradigm of WAN 2.2, using a 3D causal VAE and a DiT with view-temporal self-attention, and supports consistent generation across seven surround-view cameras.

X-World is not a video generation tool, but a "real-world simulator" for XPeng's second-generation VLA: the number of simulation scenarios has increased from 30,000 a year ago to over 500,000, and the daily simulated test mileage is equivalent to 30 million kilometers of real-vehicle testing. It also supports online reinforcement learning and overseas data generation.

At CVPR in June 2026, XPeng also presented a complete technical map of its world model for the first time. XPeng's ambition shows in its application scope: AI cars, AI robots, and flying cars. Its target training data scale is 200 million clips, and a ten-thousand-GPU cluster provides 10 EFLOPS of computing power, with an iteration every five days.
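As a rough sanity check on these cluster figures, the arithmetic below derives the average compute per GPU and the number of iterations a five-day cycle allows in a year. It takes "ten thousand GPUs" literally and makes no assumption about numeric precision, which the article does not state; the function names are illustrative.

```python
# Figures quoted in the article; the numeric precision (FP8, FP16, ...) is not stated.
CLUSTER_EFLOPS = 10
GPU_COUNT = 10_000        # "ten thousand GPUs", taken literally
ITERATION_DAYS = 5


def pflops_per_gpu(cluster_eflops, gpu_count):
    """Average compute per GPU in PFLOPS (1 EFLOPS = 1000 PFLOPS)."""
    return cluster_eflops * 1000 / gpu_count


def iterations_per_year(iteration_days, days_per_year=365):
    """Whole iterations that fit in a year at the given cadence."""
    return days_per_year // iteration_days


PER_GPU = pflops_per_gpu(CLUSTER_EFLOPS, GPU_COUNT)
PER_YEAR = iterations_per_year(ITERATION_DAYS)
print(f"{PER_GPU} PFLOPS per GPU on average, about {PER_YEAR} iterations per year")
```

That works out to roughly 1 PFLOPS per GPU and about 73 model iterations per year.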

Geely Automobile unveiled the WAM (World Action Model) at CES 2026 and integrated it into the Global AI 2.0 system.

The hierarchical architecture of WAM is interesting: the upper layer is a multimodal large language model (MLLM) responsible for understanding, the lower layer is an Action Expert responsible for actions, and the middle layer is a world model responsible for deduction.
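The division of labor among the three layers can be sketched as a small pipeline. This is a toy under stated assumptions, not Geely's implementation: the `Scene` fields, the rule-based stand-ins for each layer, the thresholds, and the candidate accelerations are all hypothetical. The point is only the data flow: the upper layer produces an intent, the middle layer predicts the outcome of each candidate action, and the lower layer picks an action using those predictions.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    lead_gap_m: float       # distance to the vehicle ahead
    ego_speed_mps: float
    lead_speed_mps: float


def understand(scene):
    """Upper layer (stand-in for the MLLM): turn the scene into a high-level intent."""
    closing = scene.ego_speed_mps > scene.lead_speed_mps
    return "keep_distance" if closing and scene.lead_gap_m < 40 else "cruise"


def deduce(scene, accel, horizon_s=3.0):
    """Middle layer (stand-in for the world model): predict the gap after horizon_s."""
    ego_travel = scene.ego_speed_mps * horizon_s + 0.5 * accel * horizon_s ** 2
    lead_travel = scene.lead_speed_mps * horizon_s
    return scene.lead_gap_m + lead_travel - ego_travel


def act(scene, intent, options=(-3.0, -1.0, 0.0, 1.0)):
    """Lower layer (stand-in for the Action Expert): choose an acceleration in m/s^2."""
    min_gap = 20.0 if intent == "keep_distance" else 10.0
    safe = [a for a in options if deduce(scene, a) >= min_gap]
    return max(safe) if safe else min(options)


def drive(scene):
    """Run the three layers in order and return (intent, chosen acceleration)."""
    intent = understand(scene)
    return intent, act(scene, intent)


print(drive(Scene(lead_gap_m=30, ego_speed_mps=25, lead_speed_mps=20)))
print(drive(Scene(lead_gap_m=80, ego_speed_mps=20, lead_speed_mps=25)))
```

In the first scene the ego vehicle is closing on a slower lead car, so the pipeline returns `("keep_distance", -3.0)`; in the second the road ahead is opening up, so it returns `("cruise", 1.0)`.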

Geely's goal is not simply to improve the self-driving model, but to turn the entire vehicle into "a single brain" that schedules self-driving, the cockpit, the chassis, and the power system in a unified way. In April 2026, the Zeekr 8X was launched and delivered, becoming the first mass-produced vehicle in China with a cockpit-driving integrated super agent. Its G-ASD 4.0 is based on WAM. The goal for 2026 is L3 on highways and L4 at low speeds.

BYD's world model is still at an early research stage. Information disclosed in January 2025 shows that it has internally taken Tesla's approach as a reference, formed a small team for rapid trial and error, and focused on generating corner-case data for end-to-end self-driving.

Great Wall has also proposed VLA + world model as its next-generation self-driving direction and has moved from "strategy" to "mass production": in June 2026, Great Wall shared its VLA practice at the Smart Driving and Overseas Expansion Conference. The Jiuzhou Supercomputing Center in Baoding has reached 5 EFLOPS of computing power, with over 10,000 GPUs. The Tank 700 will be the first vehicle equipped with the Coffee Pilot 4.0 VLA system and will enter mass production in 2026. Its more than 2 million in-service vehicles generate a large amount of data every day, which is Great Wall's most substantial advantage over the new car-making forces.

III. Self-Driving Suppliers: The Hidden World Engine under the Vehicle

Beyond car manufacturers, there is a group of suppliers that have turned the world model into an "invisible engine".

Momenta officially released the R7 reinforcement learning world model at the Beijing Auto Show in April 2026 and put it into mass production for the first time.

It has a three-layer architecture: world model pre-training, world model simulation,