Musk's xAI Enters the "World Model" Race: Will the "Vision Model" Be the Next "Large Language Model"?

The next battleground in the AI competition is already clear: the move from the world of text to the physical world. In this race over "world models," Elon Musk's xAI has quietly entered the arena with experts hired from NVIDIA, competing with tech giants like Google and Meta. xAI plans to apply the technology first to AI game generation and to explore its use in robotic systems. Google researchers predict that future video models will become as intelligent as language models.

The fire in the field of artificial intelligence is spreading from large language models to a more cutting-edge area: "world models" that can understand and simulate the real physical world. xAI has quietly joined this race, taking on tech giants like Google and Meta.

According to a report by the Financial Times on October 12, Musk's startup xAI hired artificial intelligence experts from chip giant NVIDIA this summer to work specifically on the research and development of world models. Unlike large language models, which rely on text, world models are trained on massive amounts of video and robotic data, with the aim of mastering the physical laws of the real world.

"Future video models will become as intelligent as language models," Google researchers said in a paper. NVIDIA also said last month that the potential market for world models could approach the size of the entire global economy.

01 Preparing for Battle:

xAI's Surprise in Games and Ambitions in Robotics

To gain a foothold in this competition, xAI is actively recruiting talent.

The company has hired two AI researchers from NVIDIA, Zeeshan Patel and Ethan He, both of whom have extensive experience in the field of world models. NVIDIA has long been a leader in this technology thanks to its Omniverse platform for creating and running simulations.

People familiar with the matter revealed that xAI's first commercial application of world models will be in gaming, for generating interactive 3D environments. The move quickly caught the market's attention: it is a clear signal of xAI's commercialization path and also highlights the potential of world models as a next-generation AI technology.

Elon Musk himself also confirmed on social platform X that xAI will "release an excellent AI-generated game by the end of next year." In the long run, these technologies may ultimately be applied to the artificial intelligence systems of robots.

xAI's job postings also confirm its development direction. The company is recruiting technical staff in image and video generation for its "omni team," with salaries ranging from $180,000 to $440,000. This team is committed to "creating amazing AI experiences beyond text."

The company is also hiring "video game tutors" at an hourly rate of $45 to $100 to train its AI model Grok to make video games.

02 Paradigm Shift:

The "GPT Moment" of Visual Models

xAI's high-profile entry coincides with the emergence of a key industry prediction: future video models will become as intelligent as language models. A recent Google paper points out that its video model Veo 3 is showing "emergent abilities" similar to those of large language models (LLMs).

Just as LLMs learned additional skills such as mathematics and creative writing through the simple task of "next-token prediction," video models are also starting to unlock a series of surprising zero-shot abilities through "next-frame prediction," such as object segmentation, edge detection, and simulating tool use, all without specialized training.

The Google researchers wrote in the paper: "We believe that just as natural language processing (NLP) shifted from task-specific models to general models, the field of machine vision may also undergo the same transformation through video models: a 'GPT-3 moment in the visual field'."

They compared the frame-by-frame video generation process to the "chain-of-thought" in language models and called it the "chain-of-frames," believing that this enables video models to reason across time and space.

This discovery is profound, suggesting that by developing more intelligent video models, people may be able to obtain highly capable robotic "agents."

03 Prospects and Reality:

High Costs and Lack of "Vision"

Despite the attractive prospects, the road to world models is not smooth. Currently, this technology still faces huge technical challenges, the most significant of which is the extremely high cost of finding and processing enough training data to simulate the real world.

Meanwhile, the industry is also taking a sober view of AI's role. Michael Douse, the publishing director of Larian Studios, the developer of the popular game Baldur's Gate 3, said on X this week that AI cannot solve the "big problem" in the game industry, which is "leadership and vision."

He added that the industry does not need "more mathematically produced and psychologically trained game loops," but more diverse expressions of the world. This represents a common view: pure technological breakthroughs alone do not guarantee the creation of commercial products that can truly touch people's hearts.

Despite the numerous challenges, xAI's entry into the arena undoubtedly adds fuel to the fire in the world models competition.

The focus of AI is irreversibly shifting from pure digital information processing to the simulation of, and interaction with, complex physical realities. Whether visual models can replicate the glory of large language models and have their own "GPT moment" will not only determine who dominates the next generation of AI but also reshape our fundamental relationship with the digital and physical worlds.

This article is from the WeChat public account "Hard AI," author: Long Yue. Republished by 36Kr with permission.

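The views expressed in this article are solely those of the author; the 36Kr platform only provides information storage services.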
