GPT-5 Controversy, Open-Source Catch-Up, and Capability Leap: Epoch AI's Year-End Report Reveals Accelerated AI Capabilities
On December 25th, Epoch AI, a non-profit organization focused on AI benchmarking, released its year-end report, which shows that AI model capabilities are improving rapidly across the board.
Top international models such as GPT and Gemini performed excellently on FrontierMath, an expert-level mathematical problem set, but still fell short of full marks on the truly difficult problems, indicating that their reasoning abilities have room to improve. Meanwhile, progress in reasoning and reinforcement learning has nearly doubled the rate of capability growth while sharply reducing costs; many models can now run on consumer-grade hardware.
Against this backdrop, Chinese open-source large models have also made progress, but an obvious gap remains relative to the international top tier. In the FrontierMath test, the vast majority of Chinese models scored almost no points; the highest score, about 2%, went to DeepSeek-V3.2. Chinese models are catching up, but they still struggle with truly complex problems.
01 The "Seven - Month Catch - Up" of Chinese Models: The Power of Open Source is Remodeling the Landscape
The best Chinese models still lag the global frontier by about seven months.
In Epoch AI's latest FrontierMath evaluation, Chinese open-source models delivered an impressive performance. FrontierMath is a highly challenging mathematical benchmark designed by expert mathematicians, covering major branches of modern mathematics such as number theory, real analysis, algebraic geometry, and category theory. The complete dataset contains 350 problems: 300 in the basic set (Levels 1-3) and 50 extremely difficult problems (Level 4). Human researchers typically need hours or even days to solve them.
The FrontierMath question set
The FrontierMath problem set is divided into public and private subsets. Ten problems from the first three levels are public, and the remaining 290 form the private set. Of the extremely difficult Level 4 problems, two are public and the remaining 48 are private.
The evaluation results show that on the Level 1-3 problems, the best Chinese models still lag the global frontier by about seven months. That figure may sound large, but by the standards of AI development it means Chinese models are closing the gap with top-tier laboratories such as OpenAI and Anthropic at astonishing speed. Just two years ago, the gap between open-source models and closed-source frontier models was measured in years; now, the gap between the best open-source models running on consumer-grade GPUs and the absolute frontier is under a year.
Even more noteworthy is the Level 4 problem bank: 50 extremely difficult problems that take days to solve. DeepSeek V3.2 (Thinking) was the only Chinese model to achieve a non-zero score at this level, correctly answering one problem (about 2%). Though seemingly insignificant, this carries real symbolic weight: it shows that Chinese models can begin to challenge top-tier mathematical problems. Even OpenAI's o3 and o3-mini manage only single-digit accuracy on such problems.
Technically, through multi-head latent attention (MLA), innovations in the mixture-of-experts (MoE) architecture, and multi-token prediction, DeepSeek reached a pre-training level comparable to Meta's Llama 3 with only about a tenth of the compute. The subsequently released reasoning model R1 performs on par with OpenAI's o1 at a fraction of the development cost. This supports Epoch AI's view that the main driver of falling AI training costs is not cheap hardware but algorithmic optimization and better data.
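For readers unfamiliar with MoE, the sketch below shows the core idea in a few lines of NumPy: a learned router sends each token to only a small subset of "experts", so most parameters stay idle for any given token, which is where the compute savings come from. The shapes, routing scheme, and all names here are illustrative assumptions, not DeepSeek's actual implementation.

```python
# Minimal top-k mixture-of-experts (MoE) routing sketch in NumPy.
# Illustrative only: real MoE layers add load balancing, batching, etc.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))            # router weights
experts_w = rng.normal(size=(n_experts, d_model, d_model))  # one matrix per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token (row of x) to its top_k experts and mix their outputs."""
    logits = x @ router_w                                  # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]          # chosen experts per token
    sel = np.take_along_axis(logits, top, axis=-1)         # logits of chosen experts
    gates = np.exp(sel - sel.max(axis=-1, keepdims=True))  # softmax over chosen only
    gates /= gates.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                            # per-token loop for clarity
        for k in range(top_k):
            out[t] += gates[t, k] * (x[t] @ experts_w[top[t, k]])
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)  # (4, 16): only 2 of 8 experts ran per token
```

The payoff is that the layer's parameter count grows with n_experts while per-token compute grows only with top_k, which is how MoE models add capacity without adding cost proportionally.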
Epoch AI's evaluation was run through third-party APIs (Fireworks for DeepSeek and Together for the other models) to protect the security of the FrontierMath problem bank. Epoch AI's analysis notes that some third-party APIs may slightly depress model scores, with newly released models affected most. This means the actual capabilities of Chinese models may be stronger than the public evaluation suggests.
The FrontierMath answer format is also worth understanding: the model must submit a Python function that returns the answer, usually an integer or a sympy object. The model can think, run Python code, and submit once it is confident. Each problem has a strict token budget (a hard cap of 1,000,000 tokens), and the evaluation harness records and scores submissions. Python tool runs are limited to 30 seconds so that the evaluation can be repeated and verified on commercial hardware.
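As a concrete illustration of that contract, here is a hedged sketch of what a submitted answer might look like. The actual harness interface is not public, so everything around the function is an assumption; only the core contract (a Python function returning an integer or a sympy object, checked under a time budget) comes from the description above.

```python
# Hypothetical FrontierMath-style submission: a function returning the answer.
# The grading behavior described in comments is assumed, not Epoch AI's code.
import sympy

def answer():
    # Toy stand-in problem: the sum of the first 100 primes.
    return sum(sympy.prime(i) for i in range(1, 101))

# A grader would call answer() under a time budget (the report cites a
# 30-second limit for Python tool runs) and compare it to the reference value.
print(answer())  # 24133
```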
The data also reveals a broader trend: the window from the emergence of any frontier AI capability to its wide availability is now under a year. This gives Chinese models an opening to reach the frontier, but it also poses a challenge: since the frontier itself keeps advancing rapidly, the pursuit has no finish line.
02 The "Arms Race" of Global Frontier Models: From GPT - 5 to Gemini 3
When GPT-5 was released in 2025, it caused some "disappointment" in the market: compared with intermediate releases such as Claude 3.7 and Gemini 2.5, its improvement seemed limited. Yet Epoch AI's data shows that GPT-5's leap over GPT-4 is almost as large as GPT-4's over GPT-3:
·MMLU: +43%
·MATH: +37%
·TruthfulQA: +40%
·HumanEval: +67%
·GPQA Diamond: +55%
·MATH Level 5: +75%
·Mock AIME 24-25: +84%
The reason the "shock" felt weaker is the accelerated release rhythm: GPT-3 to GPT-4 took about two years, while GPT-4 to GPT-5 took only one. The market had already been "fed" by intermediate models such as Claude 3.7, Gemini 2.5, and o1, so expectations for GPT-5 naturally rose.
Gemini 3 Pro also ran into trouble in the FrontierMath evaluation, mainly due to API stability issues. On the Level 1-3 problems its accuracy was 38%, but it lost points on 10 questions because of API errors; on the extremely difficult Level 4 problems its accuracy was 19%, with 3 questions affected by API errors. Epoch AI retried at least 10 times to keep the evaluation rigorous. API stability, in other words, has become a real constraint on frontier-model performance.
xAI's Grok 4 hit more serious network and timeout problems: 8 of the 48 private Level 4 questions (16%) could not be scored normally. Epoch AI applied explicit handling rules while remaining fully independent, to keep the evaluation transparent.
In addition, OpenAI's R&D spending reveals the real cost structure: of its roughly $5 billion compute budget in 2024, 90% went to experimental training runs and basic research rather than to the final training of GPT-4.5 or other released models. The core cost of building a top-tier model, in other words, is not "making the model" but "figuring out how to make it". That is why DeepSeek can achieve similar performance at far lower cost: it stands on the shoulders of the frontier laboratories.
03 AI Model Capabilities Accelerate: The Pace of Frontier Progress Doubles
The capabilities of AI models are improving at an unprecedented speed.
The latest data backs this up: according to Epoch AI's analysis of the Epoch Capabilities Index (ECI), top-tier models have progressed across benchmarks since April 2024 at nearly twice the pace of the previous two years. Specifically, capability grew by about 8 points per year before the breakpoint and by about 15 points per year after it, a marked acceleration.
This acceleration coincides with several important shifts: the rapid rise of reasoning models (such as OpenAI's o1 and DeepSeek R1) and frontier laboratories' increased investment in reinforcement learning. It suggests the AI development paradigm is changing: rather than relying solely on large-scale pre-training, labs now combine pre-training, inference-time compute, and reinforcement learning to improve model capabilities.
The ECI rankings of major global models
Epoch AI's report tracked 149 frontier models from the end of 2021 to the end of 2025, covering all core frontier models. The analysis fit a piecewise linear model to the capability trend of top-tier models over time and identified the best-fitting breakpoint as April 2024. The capability growth rates before and after the breakpoint were 8.2 and 15.3 points per year respectively, an acceleration of about 1.86x. Statistical analysis shows the acceleration signal is robust and significant, and the piecewise fit reflects the actual pace of development better than a single linear trend.
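To make the method concrete, the sketch below fits such a piecewise linear trend on synthetic data. The breakpoint timing and the two slopes (8.2 and 15.3 points per year) are taken from the report; the data points are fabricated solely to demonstrate the fitting procedure, and the code is not Epoch AI's analysis.

```python
# Fit a continuous two-segment (piecewise linear) trend with a free breakpoint.
# Synthetic demo only: slopes match the report, the data is made up.
import numpy as np
from scipy.optimize import curve_fit

def piecewise(t, b0, slope1, slope2, t_break):
    """Two linear segments joined continuously at t_break (t in years)."""
    return np.where(
        t < t_break,
        b0 + slope1 * t,
        b0 + slope1 * t_break + slope2 * (t - t_break),
    )

t = np.linspace(0, 4, 50)  # years since late 2021
rng = np.random.default_rng(1)
eci = piecewise(t, 100, 8.2, 15.3, 2.33) + rng.normal(0, 0.5, t.size)

params, _ = curve_fit(piecewise, t, eci, p0=[100, 10, 10, 2.0])
print(f"slopes: {params[1]:.1f} -> {params[2]:.1f} pts/yr, "
      f"breakpoint at t = {params[3]:.2f} yr")
```

Comparing this fit's residuals against a single-line fit is the standard way to check, as the report does, that the acceleration is statistically meaningful rather than noise.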
This means that after 2024, frontier models' performance gains are not only larger in absolute terms but also arrive faster. Leading labs' investment in compute, algorithms, and training data will directly determine whether they can hold their lead. It also raises the bar for open-source teams: catching up with closed-source models within a shorter window demands continuous optimization of algorithms and training strategies.
In short, AI capability growth is accelerating, the rhythm of global AI competition is compressing, and any lead is hard to hold for long.
04 The Top Ten AI Trends in 2025: Technological, Economic, and Social Impacts
Over the course of 2025, Epoch AI published 36 data insights and 37 newsletters, more than 70 short AI surveys in total. Which drew the most reader attention? The year-end review used readership and interaction data on these insights and newsletters to distill the core directions behind the top ten trends.
Among the most popular surveys, the top five are the data insights readers cared about most, revealing core industry trends such as AI capability progress, compute distribution, and cost changes. The next five reflect trends in policy, social applications, and industry practice.
In other words, this year's top ten trends were not simply chosen by researchers; they combine reader attention with the weight of the data insights, presenting a panorama of AI that is both professional and attuned to the market and the public.
Trend 1: Inference Costs Drop Dramatically, but Vary Sharply by Task
From April 2023 to March 2025, inference cost at a fixed performance level fell exponentially (see the quick arithmetic sketch after the list below):
·The slowest tasks: a 9-fold cost decrease per year
·Medium-speed tasks: a 40-fold decrease per year
·The fastest tasks: a 900-fold decrease per year
The cost reduction is driven mainly by two factors: intensified market competition (more API providers and more transparent pricing) and efficiency gains (better inference algorithms and higher hardware utilization). But different tasks enjoy these cost benefits at very different speeds: simple tasks (such as text classification) are now almost free, while complex tasks (such as PhD-level scientific reasoning) see costs fall more slowly. The economic gains from the spread of AI capability are therefore not equal across tasks, and enterprises and developers still need to optimize for their specific applications.
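A quick arithmetic sketch shows what these annual fold-decreases imply; the three rates come from the figures above, and the helper function is purely illustrative.

```python
# Exponential cost decay: cost falls by `fold_per_year` each year.
def cost_after(initial_cost: float, fold_per_year: float, years: float) -> float:
    return initial_cost / fold_per_year ** years

# A $1.00-per-query task after two years, at each reported rate of decline.
for fold in (9, 40, 900):
    print(f"{fold:>3}x/yr: $1.00 -> ${cost_after(1.0, fold, 2):.8f}")
# Output:   9x/yr: $1.00 -> $0.01234568
#          40x/yr: $1.00 -> $0.00062500
#         900x/yr: $1.00 -> $0.00000123
```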
Trend 2: The Gap between Consumer-Grade Hardware and Frontier Models Shrinks to 7 Months
Epoch AI found that the gap between the best open models running on a single consumer-grade GPU (such as an RTX 4090 or RTX 5090) and the absolute frontier models has been compressed to about 7 months.
This means billions of users can run near-frontier AI on their personal computers; an enterprise that relies only on a fixed set of model capabilities will struggle to sustain a competitive advantage; and, on the policy side, a "technology blockade" can hardly stop capabilities from spreading.
This trend highlights the disruptive impact of open-source AI: frontier capabilities spread rapidly, the window of competitive advantage shrinks, and innovation must rest on continuous iteration and overall service capability rather than the performance of any single model.
Trend 3: Most of OpenAI's Compute Goes to Experiments, and R&D Costs Far Exceed Training