Upset! In the first round of the large-model battle, did Grok 4 play a "divine move"? DeepSeek and Kimi were unfortunately eliminated.
AI chess competition? This time, it's for real! Google's Kaggle has launched the first global AI chess championship, with eight top language models going head-to-head, where victory or defeat can hinge on a single move.
The Global AI Chess Championship is here!
Right from the start, the bar is set high: the world's eight most powerful language models are pitted against each other in a direct chess showdown:
Closed-source large models: Gemini 2.5 Pro, OpenAI o4-mini, Grok 4, OpenAI o3, Claude 4 Opus, and Gemini 2.5 Flash;
Open-source large models: DeepSeek R1 and Kimi K2 Instruct.
The First Round Ends
At 1 a.m. today, the 8-to-4 elimination round of this championship officially kicked off:
Gemini 2.5 Pro, o4-mini, Grok 4, and o3 swept their opponents with a crushing 4-0 record and advanced to the semi-finals.
Meanwhile, Claude 4 Opus, DeepSeek R1, Gemini 2.5 Flash, and Kimi K2 couldn't even make it through the middle game and crashed out one by one.
In tomorrow's semi-finals, OpenAI's o4-mini and o3 will face off against each other, while Gemini 2.5 Pro and Grok 4 will go head-to-head.
The entire event is hosted by Kaggle, a Google subsidiary, which has built a competitive platform specifically for general-purpose large models: the "Game Arena".
Google says that games are an ideal platform for evaluating models and agents, and a reliable measure of general intelligence. The value of games as a benchmark test is further reflected in:
Unlimited Scalability: The stronger the opponent, the steeper the difficulty curve;
Visualization of Thinking: You can fully track the model's "decision chain" and peek into its strategic thinking process.
For AI, playing a good game of chess is harder than you think.
The tournament runs over three rounds. In the first round, DeepSeek R1 was paired against o4-mini, and Kimi K2 against o3.
The semi-finals will be held at 10:30 a.m. Pacific Time tomorrow.
Now, let's look back at how the first round played out.
Kimi K2 Forfeits on Illegal Moves, o3 Advances to the Semi-Finals Without a Fight
In the four-game series, Kimi K2 was ruled to have lost every game on illegal moves; the shortest game lasted fewer than eight moves.
At the start of each game, Kimi K2 could follow opening theory for a few moves. But once it left familiar territory, it seemed to go blind, misreading the position and playing moves that simply weren't legal.
Facing such an opponent, o3 advanced to the semi-finals without breaking a sweat.
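As an aside, the forfeit-on-illegal-move rule is straightforward to picture: an arena harness can replay each submitted move through a rules engine and end the game the moment a move fails to parse as legal. The sketch below shows one way to do this with the python-chess library; the function name and the sample moves are illustrative assumptions, since Kaggle has not published the Game Arena's actual adjudication code.

```python
# Minimal sketch of forfeit-on-illegal-move adjudication (assumed approach,
# not Kaggle's published implementation), using the python-chess library.
import chess

def replay_and_adjudicate(moves_san: list[str]) -> str:
    """Replay a game given in SAN; the side that plays an illegal move forfeits."""
    board = chess.Board()
    for ply, san in enumerate(moves_san, start=1):
        side = "White" if board.turn == chess.WHITE else "Black"
        try:
            board.push_san(san)  # raises a ValueError subclass on illegal/unparsable SAN
        except ValueError:
            return f"{side} forfeits: illegal move '{san}' at ply {ply}"
        if board.is_game_over():
            return board.result()  # "1-0", "0-1", or "1/2-1/2"
    return "game still in progress"

# Hypothetical example: after 1.e4 e5 2.Nf3, Black tries "Ke2", which its king cannot play.
print(replay_and_adjudicate(["e4", "e5", "Nf3", "Ke2"]))
# -> Black forfeits: illegal move 'Ke2' at ply 4
```

Under a rule like this, a model that hallucinates a piece or a square loses on the spot, which matches how Kimi K2's games were scored.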
DeepSeek R1 Loses the Thread in the Middlegame, o4-mini Cruises to Two Checkmates
This was a match that "started strong but fell apart in the middle".
If you only looked at the first few moves of each game, you'd think both models played a flawless opening, like two masters trading blows.
But at a certain point, the quality of the game took a nosedive.
Once it left the "opening template", DeepSeek started blundering repeatedly: attacking pieces that weren't there, defending squares that faced no threat, and even making self-destructive moves that boxed it in.
In contrast, o4-mini wasn't particularly impressive, but it played steadily, made no major mistakes, and delivered checkmate in two of the games, so the win was well deserved.
Claude 4 Opus Fights to the End but Still Can't Beat Gemini 2.5 Pro
If Kimi K2's match amounted to an automatic forfeit, then Claude 4 Opus went down after a valiant but ultimately unsuccessful fight.
In the first game, both sides played well for the first nine moves. But then Claude 4 Opus rashly played 10...g5, tearing open its own defense and handing Gemini a breakthrough.
The fourth game produced a comical scene: Gemini 2.5 Pro had two queens and a material advantage of 32 points. It should have finished Claude off in one push, yet it still dropped several key pieces during the attack.
Despite that, victory still belonged to Gemini.
Of the four first-round matches, this one came closest to a real chess fight.
Grok 4 Goes on a Rampage, Precision Strikes and Targets Weaknesses
The first three matches were like training sessions. Once Grok 4 showed up, the game finally felt like a "battlefield".
Facing Gemini 2.5 Flash's frequent mistakes and undefended pieces, Grok 4 accurately identified the weaknesses and launched a decisive attack.
It wasn't just "imitating chess"; it genuinely saw the weaknesses and eliminated the threats, closing out the match 4-0.
Grok's "four consecutive outstanding performances" not only produced the most "chess-savvy" games so far but also was rated as the best performance of the tournament by many industry insiders.
Elon Musk reposted Grok's results on X and left only a brief comment:
Chess is just a side effect. xAI has spent almost no effort on it.
There was no boasting and no grand assessment, just a casual repost, as if the victory were nothing more than a routine function call by the system.
But in this chaotic battle where models made frequent mistakes and cognitive errors, Grok 4 was one of the few that could "see the whole game clearly and play steadily".
From the Chessboard to an Intelligence Test
The competition is just the surface; the real challenge is just beginning.
The significance of this competition has never been just about who wins or who makes the more beautiful move.
What it tests is not chess skill but AI's overall ability to understand.
Games provide an excellent basis for evaluating powerful artificial intelligence, helping us understand which methods really work in complex reasoning tasks.
Games can provide an unambiguous signal of success: it's either a win, a loss, or a draw.
They have a clear structure and measurable results, making them an ideal testing ground for evaluating models. Games force models to demonstrate a variety of skills, including strategic reasoning, long-term planning, and dynamic adaptation when facing intelligent opponents, thus providing a reliable basis for measuring their general problem-solving intelligence.
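To make this concrete, here is a minimal sketch of how an unambiguous win/loss/draw signal can be folded into a single comparable rating, using the classic Elo update; the formula choice, starting ratings, and K-factor are generic assumptions for illustration, not the Game Arena's published scoring method.

```python
# Illustration only: the classic Elo update, showing how win/loss/draw outcomes
# become a single comparable rating. Kaggle's actual scoring method is not public.
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1500; model A wins one game.
print(elo_update(1500.0, 1500.0, 1.0))  # -> (1516.0, 1484.0)
```

Run over hundreds of games, updates like this are what turn raw results into the kind of leaderboard Kaggle alludes to below.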
Just last month, world champion Magnus Carlsen beat ChatGPT without losing a single piece in a game played while traveling. Afterwards, he remarked casually, "I sometimes get bored during my travels."
The AI never even realized who it was up against, and that is more alarming than the loss itself.
Kaggle officials have also revealed that the real scoring criteria are hidden in a leaderboard built from "hundreds of unpublicized matches" played behind the scenes.
This chess tournament is just a small opening test for general intelligence.
References:
https://www.chess.com/news/view/kaggle-game-arena-chess-2025-day-1
https://x.com/dotey/status/1952883220149657849
https://blog.google/technology/ai/kaggle-game-arena/
https://www.kaggle.com/blog/introducing-game-arena
This article is from the WeChat official account "New Intelligence Yuan". Author: New Intelligence Yuan. Republished by 36Kr with permission.