
29 people, a 12-billion-RMB valuation.

Friends of 36Kr, 2026-01-19 15:28
A large-model "benchmarking" tool is the first to make a fortune.

Recently, US AI startup LMArena announced the completion of its Series A financing: $150 million at a post-money valuation of $1.7 billion (approximately 12 billion RMB). The round was led by Felicis Ventures and UC Investments, the investment arm of the University of California, with well-known US VCs such as A16Z, Lightspeed Venture Partners, The House Fund, LDVP, and Kleiner Perkins participating.

This financing is interesting for three main reasons:

Firstly, LMArena's valuation has skyrocketed. Its previous round was a seed round in May 2025, led by A16Z, at a valuation of $600 million. In other words, its valuation nearly tripled in seven months, vaulting the company into the ranks of unicorns.

Secondly, the LMArena team is tiny. According to data platforms such as PitchBook, as of early 2026 the company had only 29 employees in total, which works out to roughly 400 million RMB (about $59 million) of valuation per person.

Finally, LMArena's product seems to involve little technological sophistication, and many people think, "I could build that." Strictly speaking, LMArena is not an AI company. It is just a website that scores and ranks large models, something like an AnTuTu for large models.

With various large models locked in fierce competition, no one expected that a large-model "benchmarking" tool would be the first to make a fortune.

The Unintentionally Born Unicorn

LMArena's emergence as a unicorn was, in fact, unintentional.

LMArena originated from LMSYS Org, an open-source academic organization initiated by students and professors from prestigious universities including the University of California, Berkeley; the University of California, San Diego; and Carnegie Mellon University. Its core mission is to democratize the use and evaluation of large models by developing open-source models, systems, and datasets.

It is worth mentioning that LMSYS Org has a high proportion of Chinese members. Lianmin Zheng, a doctoral student at Berkeley; Hao Zhang, an assistant professor at UCSD; and Wei-Lin Chiang, a researcher at Berkeley, are among the core members.

In March 2023, LMSYS Org released an open-source model called Vicuna, with performance claimed to approach ChatGPT's. In the process, the team found that the market had no reliable testing method that could truly distinguish model quality. So, in April 2023, the researchers launched an open testing platform called Chatbot Arena. Unexpectedly, the platform grew steadily more popular in AI circles. In September 2024, Chatbot Arena was officially renamed LMArena, and it is now one of the most authoritative large-model evaluation platforms in the world.

LMArena's core concept is very simple: "anonymous battles".

After entering the LMArena website, you are asked to type in any prompt you like. The system then randomly selects two AI models to generate responses to it. Without knowing the models' identities, you compare the quality of the two responses and vote for the winner. The winning model gains points and the losing model loses points; after hundreds of thousands or millions of such battles, each model's final score emerges.
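LMArena's public leaderboard methodology has been based on Elo-style ratings computed from these pairwise votes (later refined with a Bradley-Terry model). As a rough illustration of how per-battle updates work, here is a minimal Elo sketch; the model names, starting ratings, vote stream, and K-factor are all illustrative, not LMArena's actual parameters.

```python
# A minimal Elo-style pairwise rating sketch. Model names, starting ratings,
# and the K-factor are illustrative, not LMArena's actual parameters.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Win probability of A over B implied by the current ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_battle(ratings: dict, a: str, b: str, a_won: bool, k: float = 32.0) -> None:
    """Update both models' ratings after one anonymous head-to-head vote."""
    e_a = expected_score(ratings[a], ratings[b])
    s_a = 1.0 if a_won else 0.0
    # The winner gains and the loser drops, scaled by how surprising
    # the result was given the current ratings.
    ratings[a] += k * (s_a - e_a)
    ratings[b] += k * ((1.0 - s_a) - (1.0 - e_a))


ratings = {"model-x": 1000.0, "model-y": 1000.0, "model-z": 1000.0}

# Each vote: (model A, model B, did A win?) from an anonymous comparison.
votes = [
    ("model-x", "model-y", True),
    ("model-x", "model-z", True),
    ("model-y", "model-z", False),
    ("model-x", "model-y", True),
]
for a, b, a_won in votes:
    record_battle(ratings, a, b, a_won)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Run over enough votes, these small per-battle adjustments converge toward a stable ordering, which is what the public leaderboard ultimately reflects.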

Although this scoring mechanism is simple, it directly addresses the core pain points of large-model evaluation.

Traditional large-model evaluations are generally "question-answering tests", such as MMLU (Massive Multitask Language Understanding), GSM8K (grade-school math), and HumanEval (code generation). But as large models have advanced, these evaluations have run into three fatal problems: saturation, contamination, and disconnection.

Firstly, saturation. As large models get better at "answering questions" and approach the human ceiling, such tests lose their discriminatory power. If every model scores 90 or even 95 points and above, the test loses its meaning.

Secondly, contamination. Because the test questions are usually publicly available on the Internet, they can leak into models' training data, and models can be trained specifically on them, contaminating the results.

The hardest problem to solve is disconnection. Test questions differ from users' real-world scenarios, so a model that answers exam questions well may still fail to solve real problems. "High scores, low ability" exists among large models too.

LMArena addresses all three problems by collecting real human preferences, turning large-model evaluation from a "classroom test" into an "arena duel".

Today, LMArena's rankings are widely accepted across the AI industry as the most authoritative "human preference" indicator. More than 400 large models have been scored and ranked on LMArena, and millions of independent users participate in evaluations every month. Whether it is OpenAI, Google, or the major Chinese AI companies, all send their newly released models to LMArena for ranking, and a model that scores well is sure to be touted at its launch event.

Will Commercialization Get the "Scoring" Tool "Adopted" by Big Tech?

In early 2025, LMArena was officially registered as a company and began to shift from an academic project to commercial development.

"Benchmarking" naturally brings to mind the Android benchmarking tools once popular in China. Such tools typically had users, popularity, and traffic, but struggled to find a way to monetize; their usual fate was to be "adopted" by big tech companies, gradually lose credibility, and finally be abandoned by users. Will LMArena face the same problems?

The answer is definitely yes. Although LMArena has not taken direct investment from big AI companies, VC firms including A16Z have invested heavily in many AI companies, and these indirect conflicts of interest cannot be ignored.

The biggest challenge to LMArena's credibility so far was the Meta "cheating" incident that roiled the AI community in early 2025.

In April 2025, researchers from AI companies and universities including Cohere, Stanford University, and the Massachusetts Institute of Technology jointly published a paper accusing LMArena of helping certain AI companies manipulate the rankings.

The paper pointed out that before releasing Llama 4, Meta privately tested 27 model variants on the LMArena platform but publicly disclosed only the score of the best-performing one in order to top the leaderboard. It also argued that LMArena disproportionately increased the number of "battles" given to models from big tech companies including Meta, OpenAI, and Google, handing those models an unfair advantage in the rankings.

In response, LMArena stated that "some claims do not match the facts" and that publishing scores for unreleased model variants would be meaningless.

To maintain transparency, LMArena open-sources some of its code and regularly releases battle datasets for researchers to analyze. Even so, the controversy over its fairness will likely accompany its commercialization.

Becoming the "Certification Officer" for AI-Era Products

So what better paths to commercialization does LMArena have that do not sacrifice fairness?

In September 2025, LMArena officially launched its first commercial product, AI Evaluations, which targets enterprises and research institutions developing large AI models and provides them with model-evaluation services. In December 2025, AI Evaluations' ARR (annual recurring revenue, calculated by multiplying the most recent month's revenue by 12) reached $30 million, implying roughly $2.5 million of revenue in December alone.

Considering that AI Evaluations had been on the market for less than four months, that is a solid result. But it is obviously not enough to support a $1.7 billion valuation. What potential did the Silicon Valley VCs see in it?

After leading LMArena's seed round, A16Z published an article explaining its investment logic, which rests on three core points:

Firstly, A16Z believes LMArena's scores have become the de facto standard for evaluating large-model performance and constitute "key infrastructure" for the development of the AI industry.

Secondly, LMArena has created a simple but effective flywheel: more models attract more users, more users generate more preference data, and more data in turn attracts more models. Once this flywheel spins up, it becomes a moat that is hard to replicate.

Thirdly, A16Z believes that neutral and continuous evaluation will be a necessity for the supervision of large AI models in the future.

A16Z also sketched several possible future business scenarios for LMArena. The most important is providing compliance support for regulated industries, such as hospitals and other critical infrastructure, where the reliability of AI cannot rest solely on AI companies' promises and must instead be guaranteed through transparent, continuous evaluation. A16Z envisions "LMArena certification" becoming the "green certification" of AI products, with user evaluations on the platform eventually counted not in the millions but in the billions.

In early 2025, LMArena launched the Inclusion Arena product, which embeds tests directly into real-world AI applications through APIs and SDKs to collect feedback data from production environments. As of July 2025, the product had collected more than 500,000 real battle records. Its value lies not only in greatly enhancing the rankings' reference value but also, in effect, in building a continuous-integration/continuous-deployment pipeline for AI.
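The article does not describe Inclusion Arena's actual interface, but conceptually such an integration boils down to logging anonymized battle records from live traffic. Below is a purely hypothetical sketch of that idea; none of these names belong to a real LMArena or Inclusion Arena API, and a real SDK would send records to an evaluation service rather than a local file.

```python
# Hypothetical sketch of collecting pairwise preference feedback inside a
# production app. Every name here is invented for illustration; this is not
# a real LMArena or Inclusion Arena API.
import hashlib
import json
import time
import uuid


def log_battle(model_a: str, model_b: str, prompt: str, winner: str,
               path: str = "battles.jsonl") -> None:
    """Append one anonymized battle record to a local JSONL log.

    A real integration would POST the record to an evaluation service.
    """
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "model_a": model_a,
        "model_b": model_b,
        # Hash the prompt so raw user text never leaves the application.
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "winner": winner,  # "model_a", "model_b", or "tie"
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


# In the app: show the user two candidate answers without revealing which
# model produced each, record the user's pick, then log the outcome.
log_battle("model-x", "model-y", "Summarize this contract", winner="model_a")
```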

A16Z acknowledges that LMArena faces the huge challenge of "maintaining neutrality under commercial pressure". Even so, it believes the companies that make AI "reliable, predictable, and trustworthy" will create the greatest value in the future.

This article is from the WeChat official account "China Venture Capital", written by Tao Huidong, and is published by 36Kr with authorization.