Ultimately, the AI arena is just a business.
"XX releases the most powerful open - source large - language model, comprehensively surpassing closed - source models such as XX in multiple benchmark tests!"
"The open - source model XX with trillions of parameters has strongly topped the global open - source model list!"
"The pride of domestic technology! Model XX has taken the first place in the Chinese evaluation list!"
With the AI era upon us, are your feeds on Moments and Weibo constantly flooded with news like this?
One model takes the crown today, another is hailed as king tomorrow; in the comment sections, some people are excited while others are simply confused.
A series of very practical questions sits in front of us:
What exactly are these 'chart-topping' models being compared against? Who scores them, and by what criteria? Why do the rankings differ from platform to platform, and which one is more authoritative?
If you have similar doubts, you have already started to move from 'watching the show' to 'understanding the substance'.
In this article, we will break down the 'rules of the game' behind the different types of 'AI arenas', that is, the large language model leaderboards.
01 Type One: Objective Benchmarks, the 'College Entrance Examination' for AI
In human society, the college entrance examination score is the most important criterion for deciding which university a student can attend.
Similarly, the AI field has many highly standardized test sets designed to measure a model's performance on specific abilities as objectively as possible.
So in an era when new large-model products appear one after another, the first thing a vendor does after launching a new model is to send it into the 'exam hall' for a score, to see what it can really do.
The Artificial Analysis platform has proposed a comprehensive evaluation benchmark called the Artificial Analysis Intelligence Index (AAII), which aggregates the results of 7 extremely difficult individual evaluations focused on frontier capabilities.
Much like a stock index, the AAII distills everything into a single composite score for measuring a model's intelligence, with particular emphasis on tasks that require deep reasoning, specialist knowledge, and complex problem solving.
These 7 evaluations cover three areas generally regarded as the core of advanced intelligence: knowledge and reasoning, mathematics, and programming.
(1) Knowledge and Reasoning
MMLU-Pro:
Full name: Massive Multitask Language Understanding - Professional Level
An enhanced version of MMLU. The original MMLU covers knowledge-based Q&A across 57 disciplines; MMLU-Pro builds on it by raising the difficulty with more complex question formats and heavier reasoning requirements, testing both the breadth of a model's knowledge and the depth of its reasoning in professional fields.
GPQA Diamond:
Full name: Graduate-Level Google-Proof Q&A, Diamond Set
This test set contains professional questions in biology, physics, and chemistry. As the name implies, its design goal is blunt: even graduate students in the relevant fields would struggle to find the answers quickly, even if allowed to use Google search. Diamond is the hardest subset, demanding strong reasoning and problem decomposition from the AI rather than simple information retrieval.
Humanity’s Last Exam:
A highly difficult benchmark jointly released by Scale AI and the Center for AI Safety (CAIS), covering science, technology, engineering, and mathematics, as well as the humanities and arts. Most of the questions are open-ended, requiring the AI not only to carry out complex multi-step reasoning but also to show a degree of creativity. It is an effective test of whether an AI has genuine interdisciplinary problem-solving ability.
(2) Programming
LiveCodeBench:
This is a programming test designed to stay close to real-world conditions. Unlike traditional coding benchmarks that only check whether the code is correct, it places the AI in a 'live' programming setting: the model writes code from a problem description and a set of public test cases, and the code is then run and scored against a larger set of hidden test cases. The test mainly examines whether AI-written code is robust and handles edge cases well (a minimal sketch of this hidden-test scoring appears at the end of this subsection).
SciCode:
This benchmark is more academic, focusing on scientific computing: the AI must understand complex scientific problems and implement the corresponding algorithms or simulations in code. Beyond programming skill, it also demands a real understanding of the underlying scientific principles.
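As promised above, here is a minimal sketch of the hidden-test-case scoring that code benchmarks of this kind rely on. The `solve` entry point, the candidate code, and the test cases are all hypothetical, and a real harness would additionally sandbox execution, enforce time limits, and aggregate results over many problems.

```python
# Minimal sketch of hidden-test-case scoring, under the assumptions stated above.

def passes_all(candidate_src: str, test_cases: list) -> bool:
    """Return True only if the candidate code passes every (args, expected) pair."""
    namespace = {}
    exec(candidate_src, namespace)       # load the model-written function
    solve = namespace["solve"]           # the benchmark fixes the entry point name
    return all(solve(*args) == expected for args, expected in test_cases)

# The problem statement shows the model only these public examples...
public_tests = [((2, 3), 5), ((0, 0), 0)]
# ...while the score is computed on a larger hidden set that includes edge cases.
hidden_tests = [((10**9, 1), 10**9 + 1), ((-5, 5), 0), ((7, -9), -2)]

model_code = "def solve(a, b):\n    return a + b\n"   # model-generated solution

print("PASS" if passes_all(model_code, public_tests)
      and passes_all(model_code, hidden_tests) else "FAIL")
```

The point of the hidden set is exactly what the prose describes: a solution that merely echoes the public examples will still fail on the unseen edge cases.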
(3) Mathematics
AIME:
Full name: American Invitational Mathematics Examination
It is part of the American high-school mathematics competition system, with a difficulty level between the AMC (American Mathematics Competitions) and the USAMO (United States of America Mathematical Olympiad). Its problems are highly challenging, requiring creative problem-solving ideas and a solid mathematical foundation, and they are a good measure of an AI's reasoning ability in advanced mathematics.
MATH-500:
A test of 500 questions randomly sampled from the large-scale mathematics dataset MATH, covering problems from junior-high level up to high-school competition level, including algebra, geometry, and number theory. The questions are presented in LaTeX, and the model must provide not only the final answers but also detailed solution steps, making it an important yardstick for formal mathematical reasoning and problem solving.
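How do seven separate results become one AAII number? The exact normalization and weighting Artificial Analysis uses is not spelled out here, so the equal-weight average below is purely an illustrative assumption, and the scores are made up.

```python
# Illustrative composite "intelligence index": average seven sub-scores that
# are already on a comparable 0-100 scale. Equal weights and the score values
# are assumptions for illustration, not the actual AAII methodology.

sub_scores = {
    "MMLU-Pro": 78.4,
    "GPQA Diamond": 52.1,
    "Humanity's Last Exam": 11.3,
    "LiveCodeBench": 60.7,
    "SciCode": 35.2,
    "AIME": 66.7,
    "MATH-500": 92.0,
}

composite = sum(sub_scores.values()) / len(sub_scores)
print(f"Composite index: {composite:.1f}")   # one headline number per model
```

Like a stock index, the single headline number hides the spread: a model can rank well overall while remaining mediocre on any individual sub-test.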
Figure: The AI model intelligence ranking list of Artificial Analysis
However, because different models serve different purposes, the major platforms do not adopt the same evaluation criteria.
For example, the large language model leaderboard of Sinan (OpenCompass) is scored on its own closed-source evaluation dataset, CompassBench. The exact test items are not public, but the team provides a public validation set for the community and refreshes the test questions every 3 months.
Figure: The OpenCompass large language model leaderboard
The site also selects evaluation sets from its partners, evaluates AI models in mainstream application areas, and publishes the corresponding rankings.
HuggingFace also hosts a similar open-source large language model leaderboard, whose evaluation criteria include the MATH, GPQA, and MMLU-Pro benchmarks mentioned above:
Figure: The open-source large language model leaderboard on HuggingFace
Some of the evaluation criteria on this leaderboard deserve extra explanation:
IFEval:
Full name: Instruction-Following Evaluation
It evaluates how well a large language model follows instructions, with an emphasis on formatting: the model must not only answer correctly, but also output the answer strictly in the specific format the user requested (a minimal sketch of such a format check appears after this list).
BBH:
Full name: BIG-Bench Hard
A collection of particularly difficult tasks selected from the BIG-Bench benchmark, forming a high-difficulty question set aimed squarely at large language models. As a 'comprehensive test paper' it mixes many kinds of hard questions, covering language understanding, mathematical reasoning, common sense, and world knowledge. All of its questions are multiple choice, and the scoring metric is accuracy.
MuSR:
Full name: Multistep Soft Reasoning
An evaluation set that tests a model's ability to carry out complex, multi-step reasoning over long texts. The process resembles human 'reading comprehension': after reading a passage, you must connect clues scattered across different places to reach a final conclusion, hence 'multistep' and 'soft reasoning'. This evaluation also uses multiple-choice questions, scored by accuracy.
CO2 Cost:
This is the most unusual metric, since most LLM leaderboards do not report carbon dioxide emissions at all. It reflects only a model's energy efficiency and environmental footprint, not its intelligence or performance.
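As mentioned under IFEval, here is a minimal sketch of what an automatic format check can look like; the instruction ("answer in exactly three bullet points") and the checker are hypothetical examples, not code from the actual IFEval suite.

```python
# Hypothetical IFEval-style check: the instruction demands a specific format,
# and the checker verifies that format mechanically, regardless of whether the
# content of the answer is correct.

def follows_format(response: str) -> bool:
    """Instruction: answer in exactly three bullet points starting with '- '."""
    lines = [line for line in response.strip().splitlines() if line.strip()]
    return len(lines) == 3 and all(line.startswith("- ") for line in lines)

good = "- Point one\n- Point two\n- Point three"
bad = "Sure! Here are three points:\n1. One\n2. Two\n3. Three"

print(follows_format(good))   # True  -> instruction followed
print(follows_format(bad))    # False -> plausible content, wrong format
```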
Similarly, searching for 'LLM Leaderboard' on HuggingFace turns up rankings in many other areas.
Figure: Other large language model leaderboards on HuggingFace
Using objective benchmarks as the 'college entrance examination' for AI has clear advantages: objectivity, efficiency, and reproducibility.
It is also a quick way to measure a model's 'hard power' in a given field or on a given skill.
However, the 'college entrance examination' brings with it the familiar drawbacks of exam-oriented education.
A model may benefit from data contamination, where test questions leak into its training data, producing inflated scores while floundering in real-world use.
After all, in our own earlier large-model evaluations, even simple financial indicator calculations sometimes went wrong.
At the same time, objective benchmarks can hardly measure a model's 'soft power'.
Creativity in writing, the emotional intelligence and humor of the answers, the elegance of the language: these hard-to-quantify qualities, which rarely appear as explicit metrics, are precisely what shape our experience of using a model.
So when a model loudly advertises that it has 'topped' a particular benchmark, it has become a 'single-subject champion', a remarkable achievement, but still a long way from being an 'all-round top student'.
02 Type Two: Human Preference Arena, an Anonymous Talent Competition
As mentioned before, objective benchmark tests focus more on the 'hard power' of models, but they cannot answer the most practical question:
Is a model really 'useful' in practice?
A model may know everything on the MMLU test yet be helpless in the face of a simple text-editing task;
A model may solve algebraic and geometric problems instantly in the MATH test but fail to understand the humor and irony in the user's words.
Faced with this dilemma, the LMSys.org team, made up of researchers from universities including the University of California, Berkeley, came up with an idea:
"Since models ultimately serve humans, why not let humans directly judge?"
This time the evaluation no longer relies on test papers and question banks; the scoring is in the hands of users.
The result is LMSys Chatbot Arena (now known as LMArena), a large-scale crowdsourcing platform that ranks large language models through 'blind-test battles'.
In each battle, two models answer the same question side by side, and users decide which one wins.
Users do not know the 'true identities' of the two 'contestants' before voting, which effectively removes brand bias.
For ordinary users, using LMArena is very simple:
After logging in to https://lmarena.ai/, the user first asks a question; the system randomly selects two different large language models and sends the question to both simultaneously.
The two answers, anonymously labeled Assistant A and Assistant B, are displayed side by side, and the user votes for whichever they judge better.
After voting, the system reveals which models Assistant A and Assistant B actually were, and the vote is added to the global pool of voting data.
Figure: The text ability ranking list of LMArena
LMArena offers seven categories of leaderboards: Text (text/language ability), WebDev (web development), Vision (visual/image understanding), Text-to-Image (text-to-image generation), Image Edit (image editing), Search (search/internet access), and Copilot (intelligent assistance/agent ability).
Each leaderboard is generated from user votes, and the core mechanism LMArena uses to turn those votes into rankings is the Elo rating system.
This system was originally devised for two-player games such as chess, where it measures the relative strength of players.
On the leaderboard, every model starts with an initial score, its Elo rating.
When model A defeats model B in a battle, A wins some points from B.
How many points depends on the strength of the opponent: beating a model rated far above you earns a large number of points, while beating a model rated far below you earns only a few.
Conversely, losing to a much weaker model costs a large number of points.
The system is well suited to large volumes of 1v1 pairwise comparison data: it judges relative rather than absolute strength and keeps the leaderboard updating dynamically, which makes the ranking more credible.
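A minimal sketch of the Elo update behind a single battle is shown below. The K-factor of 32 and the starting ratings are illustrative defaults, not LMArena's actual parameters.

```python
# Minimal Elo update for one blind-test battle, under the assumptions above.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B according to the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return the new ratings of A and B after one battle."""
    s_a = 1.0 if a_wins else 0.0                  # actual outcome for A
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta               # A gains exactly what B loses

# Underdog (1000) beats favorite (1200): a large swing in ratings.
print(update(1000, 1200, a_wins=True))   # approx. (1024.3, 1175.7)
# Favorite (1200) beats underdog (1000): only a small gain.
print(update(1200, 1000, a_wins=True))   # approx. (1207.7, 992.3)
```

Note that Chatbot Arena's published methodology has since shifted toward fitting a Bradley-Terry model over all votes rather than applying sequential Elo updates, but the intuition stays the same: beating a stronger opponent is worth more.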
Although some researchers have pointed out problems with LMArena's leaderboard, such as private-testing privileges for certain vendors and unfair sampling, it remains one of the more authoritative rankings of the overall strength of large language models today.
Amid the constant flood of AI news, its advantage lies in stripping away users' preconceived biases.
At the same time, the hard-to-quantify qualities mentioned earlier, such as creativity, humor, tone, and writing style, are reflected in the votes, which helps capture subjective quality.