Battle report: Elon Musk's Grok 4 dominates the AI chess competition; DeepSeek fails to outperform o4-mini, and Kimi K2 is said to have been wronged.
The latest from the arena: in the first AI chess competition, Grok 4 from Musk's camp is "far ahead."
Yes, Google organized a chess competition for large models: the Kaggle AI Chess Competition.
After the first day of match-ups, the eight participants (OpenAI's o3 and o4-mini, DeepSeek R1, Kimi K2 Instruct, Gemini 2.5 Pro and 2.5 Flash, Claude Opus 4, and Grok 4) played their first-round games. The result:
Grok 4 performed best. DeepSeek R1 put up a strong showing but lost to o4-mini. Kimi K2 fared worst; netizens even cried foul on its behalf.
With Grok 4 performing so well, Musk naturally wouldn't miss the PR opportunity. His response, however, was a bit of a humblebrag:
We didn't deliberately train for it. This is just a side effect.
To be honest, who would deliberately train for such a "nonsensical" competition?
Of course, in an AI chess battle the process matters far more than the outcome: Google's original intent in launching the competition was to probe models' "emergent" abilities.
The First Kaggle AI Chess Competition
The competition was launched by Google as part of the debut of the Kaggle Game Arena; the first event is chess.
The participating "players" are OpenAI's o3 and o4-mini, DeepSeek R1, Kimi K2 Instruct, Gemini 2.5 Pro and 2.5 Flash, Claude Opus 4, and Grok 4.
Matches are live-streamed daily at 10:30 AM Pacific Time from August 5 to August 7.
Beyond the games between top models, the live stream also features Grandmaster Hikaru Nakamura as commentator.
Nakamura started learning chess at age 7, earned the GM title at 15, and went on to become US champion. He also finished third in this year's EWC chess event (the largest chess tournament to date).
After a day of competition, the semi-finalists are Gemini 2.5 Pro, Grok 4, and OpenAI's o4-mini and o3.
Onlookers are now waiting for the "civil war" between OpenAI's o4-mini and o3, as well as the match between Gemini 2.5 Pro and Grok 4.
Moreover, every first-round match ended in a clean 4-0 sweep; the gaps in strength were plain to see.
Analyzing the results, netizens said Grok 4 "surpassed all other models in tactical strategy and speed" in this benchmark.
Hold on, though: this was only the quarter-final round. Isn't it too early to draw conclusions?
Let's look at each model's performance and see what earned such high praise.
Grok 4 vs Gemini 2.5 Flash
Grok 4 played like a beast: it moved as easily as a "real GM" and was the standout of the day.
Gemini Flash, by contrast, was on the back foot from the start, even attempting to capture the king in the opening.
OpenAI o4 - mini vs DeepSeek R1
In the match between OpenAI o4-mini and DeepSeek R1, R1 started strong but was ultimately beaten by o4-mini.
Both sides made plenty of mistakes, but o4-mini was the first to pounce on one of R1's.
R1's stated reasoning was confident but wrong; its poor grasp of the board position left pieces hanging for o4-mini to snap up.
Gemini 2.5 Pro vs Claude Opus 4
The match between Gemini 2.5 Pro and Claude Opus 4 was the best game of the day, with both models showing a high level of chess skill.
Claude made a few mistakes, while Gemini Pro showed strong tactical vision, though its analysis was sometimes long-winded.
Kimi K2 vs o3
This was the fastest quarter-final. Kimi K2 was "crushed" mainly because it repeatedly insisted on making illegal moves; o3 won by forfeit, leaving little of its own play to analyze.
Still, some felt sorry for Kimi: K2 is not a reasoning model, yet the later a game runs, the more extended deliberation each move demands.
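The forfeit mechanic described above is easy to picture in code. Below is a minimal sketch of how an arena might adjudicate repeated illegal moves; the retry budget, the `play_turn` helper, and the move strings are all illustrative assumptions, not the actual Kaggle Game Arena rules.

```python
# Hypothetical sketch of illegal-move adjudication in a game arena.
# MAX_RETRIES and all names here are assumptions for illustration,
# not the actual Kaggle Game Arena rules.

MAX_RETRIES = 3  # assumed: illegal attempts allowed before forfeit

def play_turn(model_move, legal_moves, max_retries=MAX_RETRIES):
    """Ask the model for a move; return None (forfeit) after repeated illegal tries."""
    for attempt in range(max_retries):
        move = model_move(attempt)
        if move in legal_moves:
            return move  # legal move accepted
    return None  # model kept insisting on illegal moves: forfeit

# A model that keeps insisting on the same illegal move loses by forfeit.
stubborn = lambda attempt: "Ke1e8"
print(play_turn(stubborn, {"e2e4", "d2d4"}))  # None

# A model that corrects itself on a later try keeps playing.
corrects = lambda attempt: "e2e4" if attempt > 0 else "Ke1e8"
print(play_turn(corrects, {"e2e4", "d2d4"}))  # e2e4
```

Under a scheme like this, a non-reasoning model that repeats the same illegal move burns through its retry budget quickly, which matches how fast Kimi K2's game ended.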
Why Chess?
So why choose chess for AI battles?
Well, chess has clear rules but enormous complexity (roughly 10^120 possible games, an estimate known as the Shannon number), making it an ideal scenario for testing AI decision-making.
Some netizens misread this as simply "bigger is better"; the real point is that this number is far beyond the reach of brute-force methods.
A while back, Terence Tao noted in his interview with Lex Fridman that some mathematical problems cannot be cracked by direct brute-force computation; for instance, computers still cannot exhaustively enumerate chess. Today's AI models don't explore every position in the game tree either; instead, they learn approximate evaluations.
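The scale argument is easy to check with back-of-the-envelope arithmetic. The sketch below uses the 10^120 figure plus an assumed, absurdly generous evaluation speed of 10^18 positions per second (far beyond any real machine):

```python
import math

# The Shannon number: an estimate of ~10^120 possible chess games.
GAME_TREE_SIZE = 10 ** 120

# Assumption for illustration: a machine evaluating 10^18 positions
# per second, well beyond today's fastest supercomputers.
EVALS_PER_SECOND = 10 ** 18
SECONDS_PER_YEAR = 365 * 24 * 3600  # ~3.15e7

years = GAME_TREE_SIZE / (EVALS_PER_SECOND * SECONDS_PER_YEAR)
print(f"Exhaustive search would take ~10^{math.floor(math.log10(years))} years")
# prints: Exhaustive search would take ~10^94 years
```

Even under that fantasy hardware assumption, exhaustive search needs on the order of 10^94 years, while the universe is only about 1.4 x 10^10 years old; approximation is the only viable route.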
In other words, having AI play chess actually tests the AI's emergent ability.
A netizen also noticed this and summarized Grok 4's performance in this competition:
In their words: in traditional AI, a model's strength comes from domain-specific training (models tailored to one task); in frontier AI, strength comes from consistent generalization (evolving an internal world structure that can be mapped onto everything). Chess is just one of its projections.
Netizens generally believe that chess is a very reliable way to evaluate AI's ability.
Some netizens even predicted the next competitive game for AI: maybe UNO? (Just joking, of course.)
Which AI is the Most Promising?
Before the Kaggle AI Chess Competition officially began, a netizen opened a market on Manifold: who will be the ultimate winner of this AI chess competition?
Initially, Gemini 2.5 Pro was the favorite, followed closely by o4.
After the quarter-final round, however, the vote shifted sharply, with Grok 4 taking an overwhelming lead.
All of which only raises the anticipation: will the remaining rounds bring any dramatic upsets?