AI benchmarking is becoming increasingly meaningless. Google says it's better to let AIs play games together.
After an eight-year hiatus, and now in the era of generative artificial intelligence, Google has organized another "AI Chess King Championship." Top-tier AI models from China and the United States, including OpenAI's o4-mini, DeepSeek-R1, Google's Gemini 2.5 Pro, Anthropic's Claude Opus 4, xAI's Grok 4, and Kimi K2 Instruct, will go head-to-head.
According to Google, the competition is meant to evaluate and advance AI models' complex reasoning and decision-making abilities through direct confrontation in strategy games, addressing the problem that existing benchmarks struggle to keep pace with how quickly models are improving. The event also promotes Google's newly launched, publicly available benchmarking platform, Kaggle Game Arena.
Unlike conventional AI benchmarks, the "test questions" on Kaggle Game Arena are strategy games. Google built a platform for AIs to play games because traditional benchmarks have hit a bottleneck and can hardly reflect the real capabilities of flagship models. Put simply, AI vendors, chasing fame or profit, have been gaming the various benchmarks, so Google, as an industry giant, has stepped forward to set things right.
In fact, one peculiar feature of this AI boom is that money seems to have lost its meaning. A unicorn used to mean a young, unlisted technology startup valued at more than $1 billion; nowadays, as long as the founder has some technical pedigree, an AI startup can reach a $1 billion valuation with almost no effort.
There have even been outright frauds like Builder.ai, which claimed to use artificial intelligence for programming but in reality relied on Indian programmers writing code by hand. The financial industry attributes this to the fear of missing out (FOMO) on whatever the AI revolution might bring, a fear that has pushed investors to pour money into any halfway-plausible AI company and created an irrational boom around AI.
Against this backdrop, it is only natural for entrepreneurs to ride the market's AI FOMO to inflate their companies' valuations. So how do you make an AI startup look more valuable? Because the underlying technology is too sophisticated for outsiders to judge directly, investors fall back on a simple proxy for an AI company's strength: whoever posts the higher score is the better target.
"If you're not convinced, run a score" has thus become the core means for many AI companies to promote their products. If you often follow AI-related news, you're probably familiar with lists such as the LMArena benchmark test and the Chatbot Arena for large models. When score results are tied to financing, an operation familiar to digital enthusiasts and mobile game players has emerged: "score manipulation."
There is now a wide variety of benchmarks for evaluating large models, covering areas such as knowledge and reasoning, mathematics, and programming. Take the leaderboard run by the well-known AI open-source community HuggingFace as an example: it mainly evaluates how well large models follow instructions and perform multi-step reasoning over long texts.
Much like 3DMark for PCs and AnTuTu for phones, AI benchmarks measure a model's capabilities in different domains through a series of objective, reproducible scenarios. But the very need for reproducibility and consistency makes them inflexible, and that leaves room for score manipulation: a model can effectively memorize the questions in a benchmark's dataset through repeated exposure, and its developers can then train against them specifically to post high scores.
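To see why a fixed question set is so easy to game, here is a minimal sketch in Python, with a made-up two-question dataset and a hypothetical model interface (not any real benchmark's harness): a model that has simply memorized the answer key scores a perfect 100% without any actual reasoning ability.

```python
# Minimal sketch of a static benchmark harness (hypothetical data and model API).
# A tiny stand-in for a fixed, public benchmark dataset; the questions are illustrative.
BENCHMARK = [
    {"question": "What is 17 + 25?", "answer": "42"},
    {"question": "A book costs $8. How much do 5 books cost?", "answer": "40"},
]

def evaluate(model_fn) -> float:
    """Score a model by exact-match accuracy over the fixed question set."""
    correct = sum(
        model_fn(item["question"]).strip() == item["answer"]
        for item in BENCHMARK
    )
    return correct / len(BENCHMARK)

# A "contaminated" model: it has seen the test set during training and simply
# looks the answers up, so it posts a perfect score with zero reasoning ability.
ANSWER_KEY = {item["question"]: item["answer"] for item in BENCHMARK}

def memorizing_model(question: str) -> str:
    return ANSWER_KEY.get(question, "I don't know")

print(evaluate(memorizing_model))  # 1.0 -- a perfect score that means nothing
```

Because the question set never changes, the score cannot distinguish genuine capability from memorization.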
For example, on test sets such as GSM8K and MATH, which assess mathematical ability, models like GPT-4o and Gemini 1.5 Pro can easily exceed 80% accuracy. There have even been cases where the benchmark operators effectively cooperated with AI vendors in score manipulation. Earlier this spring, Meta's new-generation open-source model Llama 4 failed spectacularly: it scored far ahead of the pack on benchmarks but performed poorly in practice. AI researchers found that 27 different versions of Llama 4 had been tested on Chatbot Arena before release, with only the best result made public.
It is not hard to see that benchmarks are finding it increasingly difficult to measure AI models, especially today's state-of-the-art flagships. That is why Google built Kaggle Game Arena and organized the "AI Chess King Championship" as a stage where vendors' flagship models can show their upper limits.
So why did Google choose games as the scenario for testing large models? In its view, the randomness that plays out within fixed rules makes games well suited to measuring AI intelligence: clear rules keep the AI from going off the rails, while sufficient randomness lets it show the upper limit of its ability. Games also offer measurable results, visualizable processes, verifiable reasoning, and zero-sum competition.
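To make the game-as-benchmark idea concrete, here is a minimal sketch (not Kaggle Game Arena's actual harness) of a head-to-head match loop using the open-source python-chess library. The two agents here just pick random legal moves; in a real arena, each agent would wrap a model's move choice.

```python
import random
import chess  # pip install python-chess

def random_agent(board: chess.Board) -> chess.Move:
    """Placeholder agent: a real arena would ask an LLM for its move here."""
    return random.choice(list(board.legal_moves))

def play_match(white_agent, black_agent, max_moves: int = 200) -> str:
    """Play one game under fixed rules and return a measurable result string."""
    board = chess.Board()
    agents = (white_agent, black_agent)
    while not board.is_game_over() and board.fullmove_number <= max_moves:
        agent = agents[0] if board.turn == chess.WHITE else agents[1]
        board.push(agent(board))
    # "1-0", "0-1", "1/2-1/2", or "*" if unfinished: an unambiguous, zero-sum outcome.
    return board.result(claim_draw=True)

print(play_match(random_agent, random_agent))
```

The rules fully constrain what counts as a legal move, and the result string is unambiguous, which is exactly the measurable, zero-sum property Google is pointing to.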
In fact, the game industry and the AI industry have always been closely intertwined. Take OpenAI: the general public first noticed the name because of the epoch-making ChatGPT, but for "DOTA2" players, OpenAI left an indelible impression back in 2019, when its OpenAI Five program handily defeated the champion team OG, offering early proof that AI could dominate not only board games but also far more complex e-sports titles.
According to a conversation between former OpenAI Chief Scientist Ilya Sutskever and NVIDIA CEO Jensen Huang, it was while building OpenAI Five for "DOTA2" that OpenAI's training approach shifted from plain reinforcement learning toward reinforcement learning from human feedback (RLHF), which is key to why ChatGPT feels so much smarter than earlier AI products.
An AI that plays games well not only demonstrates its level of intelligence but also has real commercial prospects. After all, game makers are eager for smarter NPCs that can improve the player experience.
This article is from the WeChat official account "Three Easy Lives" (ID: IT-3eLife). The author is San Yi Jun. It is published by 36Kr with authorization.