Altman has just released GPT-5, giving everyone free access to "PhD-level" AI. However, an incorrect benchmark chart has drawn criticism across the internet.
After waiting for years, GPT-5 was finally released early this morning.
We were full of anticipation, and the nervousness of several core members at OpenAI during the live stream was clearly visible.
During the live stream, Altman also posted more than a dozen tweets to introduce the highlights of GPT-5.
Since there is a lot of information, we will go through it point by point, following Altman's tweets.
First of all, this is an integrated model. That means when you use it, you don't need to switch between different models. It will decide when to think deeply on its own.
Although Altman emphasized that benchmarks are not what matters most, OpenAI still showed many benchmark results in areas such as mathematics, programming, visual perception, and health. The specific scores are as follows:
Mathematics: Achieved 94.6% in the 2025 AIME test without tool assistance.
Actual programming applications: Reached 74.9% in SWE-bench Verified and 88% in Aider Polyglot.
Multimodal understanding: Achieved 84.2% in MMMU.
Health: Reached 46.2% in HealthBench Hard.
With the extended reasoning ability of GPT-5 Pro, the model also set a new SOTA in the GPQA test, scoring 88.4% without tool assistance.
In terms of pricing, GPT-5 is offered across Free, Plus, and Pro plans. According to Altman, free users also get "PhD-level intelligence" (the regular version of GPT-5 with reasoning capabilities). Plus users get higher usage limits, while Pro users can access GPT-5 Pro.
For developers, GPT-5 is available through the API in three versions. The standard GPT-5 costs $1.25 per million input tokens and $10 per million output tokens. The GPT-5 mini and nano versions are cheaper.
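To make the pricing concrete, here is a minimal sketch of how the cost of a single request could be estimated from the two rates quoted above. The function name and the example token counts are hypothetical, and the sketch covers only the standard GPT-5 rates, since the article does not give the mini and nano prices.

```python
# Rates quoted in the article for the standard GPT-5 API (USD per million tokens).
INPUT_PRICE_PER_MILLION = 1.25
OUTPUT_PRICE_PER_MILLION = 10.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 20,000 input tokens and 2,000 output tokens.
    print(f"${estimate_cost(20_000, 2_000):.4f}")
```

At these rates, output tokens cost eight times as much as input tokens, so long generated answers dominate the bill even when the prompt is large.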
Although the live stream lasted for more than an hour, OpenAI spent most of the time introducing how "useful" GPT-5 is.
For example, in education, it can generate hundreds of lines of code in a few minutes and create interactive content to explain complex concepts, such as the Bernoulli effect.
In writing, GPT-5 has better writing skills than GPT-4o.
In programming, it can build a French-learning website in a few minutes, complete with pronunciation practice, exercises, and games.
The voice mode has also been upgraded. The voice intonation is more natural, you can chat for as long as you want, and you can adjust the speaking speed at will. It is very suitable for learning foreign languages.
They also optimized the "AI medical diagnosis" function we reported on before. They invited a cancer patient to share her experience on-site, as well as the help ChatGPT provided in explaining her condition. Altman said that GPT-5 is the best health model to date.
However, there were also some small slip-ups during the event. For example, a benchmark chart was incorrect. Altman also admitted the mistake:
There was more than one such mistake:
What's more embarrassing is that Elon Musk also came to spoil the party. He retweeted the news that GPT-5 didn't beat Grok 4 in the ARC-AGI-2 test:
Even the demo about reducing hallucinations was criticized:
However, some people said that this is not a "hallucination" problem, but a problem with the data source.
Overall, in the eyes of many people, GPT-5's performance did not meet expectations.
So, how did GPT-5 actually perform? Let's take a look at the details in OpenAI's technical blog.
Integrated Intelligent System
GPT-5 is a unified system consisting of three models: an efficient response model for answering most routine questions, a deep reasoning model "GPT-5 Thinking mode" for solving complex problems, and a real-time router that automatically assigns the optimal processing model based on the conversation type, problem complexity, tool requirements, and explicit user instructions (e.g., inputting "Think deeply about this problem").
The router system is continuously optimized through real-time signals such as user model-switching behavior, answer preference data, and accuracy feedback. When the usage limit is reached, a streamlined version of each model will take over subsequent queries.
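The routing idea described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the model labels, the trigger phrases, the complexity score, and the threshold are all hypothetical, and OpenAI's actual router is described as being trained on real usage signals rather than hand-written rules.

```python
from dataclasses import dataclass

# All names, phrases, and thresholds below are hypothetical illustrations,
# not OpenAI's actual router logic.
FAST_MODEL = "fast"
THINKING_MODEL = "thinking"
MINI_SUFFIX = "-mini"
THINK_PHRASES = ("think deeply", "think hard")


@dataclass
class Request:
    text: str
    needs_tools: bool = False       # assumed signal: the task requires tool use
    complexity: float = 0.0         # assumed score in [0, 1] from an upstream classifier
    over_usage_limit: bool = False  # the user has exhausted the usage limit


def route(request: Request, complexity_threshold: float = 0.6) -> str:
    """Pick a model label for a request using simple hand-written rules."""
    lowered = request.text.lower()
    explicit_instruction = any(phrase in lowered for phrase in THINK_PHRASES)

    if (
        explicit_instruction
        or request.needs_tools
        or request.complexity >= complexity_threshold
    ):
        model = THINKING_MODEL
    else:
        model = FAST_MODEL

    # Once the usage limit is reached, a streamlined version takes over.
    if request.over_usage_limit:
        model += MINI_SUFFIX
    return model
```

In this sketch, a plain factual question is routed to the fast model, a prompt containing "Think deeply about this problem" goes to the thinking model, and any request made after the usage limit is reached is sent to the streamlined variant of whichever model was chosen.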
OpenAI plans to integrate these capabilities into a single ultimate model in the near future.
GPT-5 not only outperforms its predecessors on benchmarks and responds faster but, more importantly, handles a wide range of real-world scenarios more effectively.
OpenAI said that GPT-5 has made significant progress in three key areas: reducing hallucinations, improving instruction following, and reducing sycophantic answers. At the same time, GPT-5 shows across-the-board improvements in ChatGPT's three most common use cases: writing, coding, and health.
Evaluation
Next, let's take a look at GPT-5's performance on various benchmarks.
According to the blog, GPT-5 improves significantly across the board, especially in mathematics, programming, visual understanding, and health. In mathematics, GPT-5 achieved 94.6% on AIME 2025 without tools. In real-world programming, it scored 74.9% on SWE-bench Verified and 88% on Aider Polyglot. It also reached 84.2% on multimodal understanding (MMMU) and 46.2% on health (HealthBench Hard). GPT-5 Pro, with extended reasoning, set a new record on the GPQA benchmark, scoring 88.4% without tools.
AIME results obtained with tools should not be compared directly with the scores of models running without tools; they are included as an example of how effectively GPT-5 can use available tools.
Coding Benchmark
Instruction-following and intelligent tool invocation capabilities: GPT-5 shows significant improvements in instruction-following and intelligent tool invocation benchmark tests. These capabilities enable it to reliably execute multi-step requests, collaborate across tools, and adapt to context changes. In practical applications, this means that GPT-5 is better at handling complex and dynamically changing tasks: it can follow user instructions more accurately and make full use of existing tools to complete more work steps end-to-end.