After GPT-5, are we getting closer to or farther from AGI?
On March 15, 2023, GPT-4 was released. At that time, most people were still figuring out how to use ChatGPT, or even where its official website was. I had only briefly tried GPT-3.5, played around with ChatBox, asked a few silly questions, and then run out of things to ask.
So, I still clearly remember the feeling after my first serious conversation with GPT-4. The only thought swirling in my mind was: The world has changed.
Back then, the internet was filled with a gold-rush mix of excitement and panic. Everyone was frantically forwarding mind-boggling screenshots and debating which occupations would disappear first. We really thought that was the miracle itself.
Nobody expected that it was just the prologue of a long night.
After a 939-day wait, in the early hours of August 8, 2025, Beijing time, OpenAI finally lifted the veil on GPT-5. The world held its breath, waiting for another "the world has changed" miracle.
However, when the curtain was drawn, what we saw was a performance far more complex and contradictory than we had imagined, and genuinely hard to put into words. It fits the pattern OpenAI has settled into this year: ordinary users praise it highly and DAU climbs rapidly, while hardcore users are full of complaints. I myself have switched my daily-driver model from GPT to Claude and Gemini several times, and I haven't used ChatGPT in a long while. Ever since the stunning GPT-4o spring event last year, every OpenAI launch has left a mixed aftertaste, with more hype than surprise.
· · ·
At the beginning of the press conference, Sam Altman set the tone with a strong sense of pragmatism: "GPT-3 is like a high school student, GPT-4o is like a college student, and GPT-5 is like a team of on-demand doctoral-level experts." The keyword is no longer "chatting" but "doing things."
The core of achieving this is not simply piling up parameters but a philosophical revolution in architecture.
In the past, users had to make a painful choice between GPT-4o's speed and o3's deep reasoning, like dithering in an armory full of weapons. GPT-5 tries to end this agony of choice.
It is a unified intelligent system. Inside it sit a fast model (gpt-5-main) for handling most questions, a deep reasoning model (gpt-5-thinking) built for hard problems, and the most crucial piece: a real-time router. The router works like an experienced project manager, dynamically deciding which "expert" to dispatch based on the type and complexity of your question, or even an explicit cue in your prompt like "think hard about this."
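OpenAI has not published how the router actually works, so the following Python sketch is purely illustrative: the model names match the system card, but the dispatch heuristics (keyword cues plus a difficulty score) are invented for the sake of the analogy.

```python
# Purely illustrative: OpenAI has not disclosed the router's internals.
# Model names follow the system card; the heuristics below are invented.

HARD_CUES = ("think hard", "think seriously", "step by step", "prove")

def route(prompt: str, estimated_difficulty: float) -> str:
    """Pick a model variant for one request, like a project manager
    assigning work to the right 'expert'."""
    # Explicit user cues override everything else.
    if any(cue in prompt.lower() for cue in HARD_CUES):
        return "gpt-5-thinking"
    # Otherwise fall back to a difficulty estimate (e.g. from a classifier).
    if estimated_difficulty > 0.7:
        return "gpt-5-thinking"
    return "gpt-5-main"

print(route("Think hard about this proof.", 0.3))   # -> gpt-5-thinking
print(route("What's the capital of France?", 0.1))  # -> gpt-5-main
```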
Using GPT-5 through the API is even simpler: it comes in three models (regular, mini, and nano), and each can run at any of four reasoning levels: minimal (a new level not available in previous OpenAI reasoning models), low, medium, or high.
The input limit for these models is 272,000 tokens; the output limit, including invisible reasoning tokens, is 128,000. They accept text and images as input and produce only text as output.
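As a sketch of what that looks like in code, the snippet below uses OpenAI's official Python SDK with the Responses API. The model name and the reasoning parameter follow the shapes OpenAI documents, but treat the exact details as my assumption rather than gospel.

```python
# Requires: pip install openai, plus an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Same endpoint, three sizes: gpt-5, gpt-5-mini, gpt-5-nano.
# reasoning.effort selects one of the four levels described above.
response = client.responses.create(
    model="gpt-5-mini",
    reasoning={"effort": "minimal"},  # minimal / low / medium / high
    max_output_tokens=1024,           # output (incl. reasoning) caps at 128k
    input="Summarize the difference between GPT-5's main and thinking paths.",
)

print(response.output_text)
```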
Tina Kim, a researcher at OpenAI, also said at the launch: "With GPT-5, we will phase out all of the old models." It sounded less like confidence than a declaration. The era of the "model zoo" that dazzled users is over, replaced by a highly coordinated intelligent organism with a unified will.
The GPT-5 System Card shows the inheritance relationship between the old and new models.
The ascension of any new king calls for a grand show of strength, and GPT-5 posted near-top results across the benchmark suites.
But even this routine number-flexing segment had a mishap.
Keen-eyed netizens noticed that barely five minutes into the event, the bar charts on the slides were drawn "rather arbitrarily": in one chart, the bar representing 69.1% was actually shorter than the one representing 52.8%.
This small incident, together with Musk's instant repost on X of the "congratulations" that "Grok 4 defeated GPT-5 on ARC-AGI-2," made for an amusing footnote.
Benchmark scores are cold numbers in the end; the real differences live in the vivid, flesh-and-blood experience of actually using it.
And this is exactly where GPT-5 is most central, most fascinating, and most unsettling: it does not benefit all creators equally, but makes clear choices.
First, multimodality. Audio input/output and image generation are currently not in GPT-5's skill set. Those capabilities are still covered by models such as GPT-4o Audio, GPT-4o Realtime and their mini versions, GPT Image 1, and the DALL·E image generation models.
But maybe there will be a GPT-5o soon. Who knows.
Then there is AI programming, the thing developers care about most. This has been a happy year for developers: Cursor CLI shipped the same day GPT-5 was released, and coding agents of every stripe have mushroomed all year.
The on-stage demo was already impressive enough: in just two minutes, from the single sentence "Build a web application for my partner to learn French," GPT-5 generated a complete interactive site with flashcards, quizzes, and even a "mouse-eats-cheese" version of Snake.
The more crucial test is precise modification of production-grade code. In another test, developers asked the model to modify specific props in a .ts file inside a complex production project and update, in sync, every file that referenced the component: a tedious, error-prone task whose effects ripple widely.
The result: Gemini 2.5 Pro and Claude 4 Opus "failed completely," while GPT-5 completed the task perfectly. It is no longer just a tool that "writes" code; it begins to "understand" the project and think like a real senior colleague.
Michael Truell, CEO of the AI programming startup Cursor, was invited to give a demo at the event. He asked GPT-5 to resolve an issue that had sat open for three weeks on the OpenAI Python SDK's GitHub. GPT-5 quickly drew up a plan, searched the codebase, located the problem, and made the fix, all in one smooth pass. Truell's verdict: "This is the first time I trust a model to complete my most important work."
To make this "trustworthy" capability truly widespread and turn it into the cornerstone of the developer ecosystem, a disruptive business strategy is essential. Which brings us to GPT-5's API pricing, a market-killing move: just $1.25 per million input tokens, half the price of GPT-4o and more aggressive than the comparable tiers from Google and Anthropic. The strategic intent is plain: trade profit for market share, and low prices for an ecosystem.
Cited from Simon Willison's latest article
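To make the economics concrete, here is a back-of-the-envelope cost calculation. The $1.25 input rate is from the announcement; the $10-per-million output rate is my recollection of the published price list, and the workload numbers are made up for illustration.

```python
# Back-of-the-envelope GPT-5 API cost for a hypothetical workload.
INPUT_PER_M = 1.25    # USD per million input tokens (announced price)
OUTPUT_PER_M = 10.00  # USD per million output tokens (assumed from price list)

# Hypothetical daily workload: 10k requests, 2k tokens in / 500 tokens out each.
requests = 10_000
input_tokens = requests * 2_000
output_tokens = requests * 500

cost = (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M
print(f"~${cost:.2f} per day")  # ~$75.00 per day
```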
The price comparison reminds me of GPT-4.5, widely regarded as OpenAI's failed product this year and later distilled into GPT-4.1 (the backwards version numbering still strikes me as absurd).
Back then, its output price was not the $8 per million tokens the chart shows for GPT-4.1 but $180, which was rightly called astronomical. GPT-4.5 was in fact the product of a failed GPT-5 pre-training run, internally code-named "Orion." As luck would have it, DeepSeek-R1 cut its prices right as it launched, so it naturally became a target of ridicule.
Yet this astronomically priced model was once, in many users' minds, the strongest writing model available. OpenAI's own marketing at the time also emphasized GPT-4.5's emotional reasoning and human-like feel.
Text writing is precisely where GPT-5 has drawn controversy. GPT-5, the intelligent hybrid that no longer lets users pick a model themselves, doesn't seem to contain anything that matches the writing ability of the emotion-tuned GPT-4.5:
Sam Altman himself posted a tweet, using the black humor of a "GPT-4o obituary" to show how much GPT-5's writing has improved:
But commenters under his tweet pushed back, saying GPT-5's writing still falls short.
The reason for dwelling on programming and writing is that GPT-5's system card officially names programming, writing, and health as ChatGPT's three most common use cases.
We have made significant progress in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have raised GPT-5's performance in ChatGPT's three most common uses: writing, programming, and health. All GPT-5 models are also trained with our latest safety approach, Safe Completions, to prevent the generation of disallowed content.
Not long ago, OpenAI also put real effort into medical and health questions in its two new open-source models, gpt-oss-120b and gpt-oss-20b.
Whether it's programming, writing, or life-and-death health consultations, the inescapable Sword of Damocles is model reliability. In practice, what people worry about most is hallucination. Like almost every presenter at this year's launch events, Sam Altman claimed that GPT-5's hallucinations are significantly reduced. (It reminds me of Pichai and Musk, especially Pichai, who loves to dwell on the hallucination problem of Google's models.)
Today I read an interesting take in Simon Willison's article: many models have broadly reduced hallucinations this year, and Gemini 2.5 Pro and Claude 4 hardly ever hallucinate. But part of the reason is simply that people have gotten better at using AI.
Frequent AI users instinctively avoid the prompts most likely to trigger hallucinations, such as asking a model without search for URLs or paper citations, or asking it to write a ten-thousand-word article without providing enough material. Those were common mistakes two years ago.
Besides outright wrong answers, there is another kind of hallucination: the model believing it has finished a task it hasn't. This was a common failure across many models last year, so OpenAI wrote in the GPT-5 system card:
We had gpt-5-thinking attempt a range of tasks that are partially or wholly impossible to complete, and rewarded the model for honestly admitting it cannot finish the task.
In tasks that require