Large language models are really starting to "steal people's jobs".
In the past month, competition among large AI models has intensified sharply. Google and OpenAI, the two long-standing industry leaders, have compressed their product iteration and release cycles to nearly weekly. Before one generation of models can gain a firm foothold, the next round of updates has already arrived, producing a string of head-on confrontations.
The latest move comes from Google.
Early on December 18th, Beijing time, Google officially announced Gemini 3 Flash, the fastest and most cost-effective model in the Gemini 3 series. It is also Google's fourth substantial update to its large-model product line within a month, widely read as a "precision strike" against OpenAI.
1. OpenAI Sounds the "Red Alert"
Looking back at November, Google and OpenAI, the two most influential AI companies globally, released their flagship models almost simultaneously: Gemini 3 and GPT-5.1.
Subsequently, Gemini 3 Pro significantly outperformed existing flagships such as Gemini 2.5 Pro, GPT-5.1, and Claude Sonnet 4.5 across multiple benchmarks, quickly earning a strong reputation.
OpenAI, for its part, was unwilling to be outdone. After its new-generation product, GPT-5.1, came off worse in the head-on confrontation with Google's Gemini 3, OpenAI quickly moved to an emergency footing. On December 2nd, according to foreign media reports, OpenAI CEO Sam Altman stated plainly in an internal memo to employees that the company had entered a "Code Red" emergency state.
In this state, OpenAI's resources and attention were redirected to its core product, ChatGPT. Fidji Simo, OpenAI's CEO of Applications, later confirmed that this alert directly accelerated the release of GPT-5.2.
So, just one week later, on the occasion of OpenAI's tenth anniversary, GPT-5.2 launched with three versions at once: Instant, Thinking, and Pro.
Judging from the officially announced core benchmarks, GPT-5.2 performed extremely strongly. In multiple head-to-head comparisons against GPT-5.1, Gemini 3 Pro, and others, GPT-5.2 Thinking took first place almost across the board. This means the lead Gemini 3 Pro had held for less than a month was broken again.
2. ChatGPT: Is It Really Going to "Replace" Office Workers?
Compared with the dazzling array of benchmark scores, the most notable change in GPT-5.2 comes from a completely different evaluation system: GDPval.
GDPval does not test whether the model can "solve problems"; it directly measures its ability to complete real, specific knowledge-work tasks. The evaluation covers 44 occupations across the 9 core industries that contribute most to US GDP. The test content is not multiple-choice questions or Q&A: the model must produce real, deliverable work artifacts such as sales decks, accounting and financial spreadsheets, emergency department schedules, manufacturing data charts, and even short-video content.
In other words, this evaluation system does not simulate work but directly "puts the model into the workplace".
According to blind evaluations by human experts, on high-difficulty knowledge-work tasks, GPT-5.2 Thinking outperformed or at least matched top industry experts on 70.7% of tasks.
In terms of efficiency, the gap is even more obvious: GPT-5.2 Thinking completes similar tasks roughly three times faster than human experts, at a total cost of only about 1% of the human cost.
This improvement has also been verified in the more representative financial scenario. In the spreadsheet-modeling test for "junior investment banking analysts", GPT-5.2 Thinking scored 68.4% overall, a significant improvement over GPT-5.1 Thinking's 59.1%, making it OpenAI's best-performing model on this type of task so far.
Overall, across the knowledge-work tasks covered by GDPval, GPT-5.2 Thinking "beat or tied industry experts" on 70.9% of tasks. For the previous generation, GPT-5 Thinking, that figure was only 38.8%.
The product segmentation of GPT-5.2 is now very clear: the Thinking version offers more stable long-context reasoning and markedly better spreadsheets, decks, and complex solutions, suiting genuinely heavy professional work; the Instant version converses more naturally, explains problems more clearly, and is more efficient for writing tutorials, drafting instructions, and everyday workplace use; the Pro version has the strongest reasoning and coding abilities and is the first choice for scientific research and complex system design.
In a nutshell, the Thinking version handles heavy - duty tasks, the Instant version takes care of daily work, and the Pro version reaches the ceiling of performance.
For this reason, GPT-5.2 Thinking has been jokingly described as the first generation of models to truly start "competing with ordinary office workers for jobs".
3. Office "Experts" or "Workhorses": Which to Choose?
The two giants' release cadence, which clearly shows signs of rushing, has triggered a wave of more direct market feedback: a surge of negative user reviews. Some netizens posted GPT-5.2's SimpleBench "report card", showing that it scored lower than Claude 3.7 Sonnet, a model released about a year ago; GPT-5.2 Pro fared little better, barely surpassing GPT-5.
Source: SimpleBench
SimpleBench was originally designed to test the performance of large models in logical reasoning tasks that "seem simple to ordinary people but are extremely challenging for machines".
The doubts don't stop there. Bindu Reddy, a former AWS and Google executive, posted on social media that GPT-5.2 scored lower than Opus 4.5 and Gemini 3.0 on LiveBench, while consuming significantly more tokens at a higher token cost than 5.1, so it may not be worth upgrading from 5.1 for now.
GPT-5.2 has now clashed directly with Google's new offering, Gemini 3 Flash. If the keyword for GPT-5.2 is "professionalism", Google's is "cost-effectiveness".
This is not simply about being "cheaper"; it is a systematic reworking of the relationship among performance, cost, and scale.
Google CEO Sundar Pichai said in an official blog post that Gemini 3 Flash pushes past the "Pareto frontier" of performance versus efficiency: its overall performance exceeds the previous-generation flagship, Gemini 2.5 Pro, with a nearly threefold increase in inference speed and a significant price reduction.
Pichai said, "Gemini 3 Flash proves that speed and scale do not have to come at the expense of intelligence."
Judging from the evaluation results, this is not just a simple marketing slogan.
According to data from LMArena (lmarena.ai), Gemini 3 Flash currently ranks in the top 5 for text, image, and programming, and second in mathematics and creative writing. It is the most cost-effective frontier model, with an input price of just $0.50 per million tokens and an output price of $3 per million tokens.
By comparison, Claude Sonnet 4.5's output price is $15 per million tokens and GPT-5.2's is $14, nearly five times Gemini 3 Flash's.
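The price gap is easy to verify with back-of-the-envelope arithmetic. The sketch below uses only the per-million-token output prices the article quotes (they are the article's claims, not official price sheets), applied to a hypothetical workload of 50M output tokens per month:

```python
# Output prices quoted in the article, in USD per 1M output tokens.
# These are the article's figures, not verified vendor price lists.
PRICE_PER_M_OUTPUT = {
    "Gemini 3 Flash": 3.0,
    "GPT-5.2": 14.0,
    "Claude Sonnet 4.5": 15.0,
}

def output_cost(model: str, tokens: int) -> float:
    """USD cost of generating `tokens` output tokens with `model`."""
    return PRICE_PER_M_OUTPUT[model] * tokens / 1_000_000

# Hypothetical workload: 50M output tokens per month.
for model, price in PRICE_PER_M_OUTPUT.items():
    ratio = price / PRICE_PER_M_OUTPUT["Gemini 3 Flash"]
    print(f"{model}: ${output_cost(model, 50_000_000):,.0f}/month "
          f"({ratio:.1f}x Flash)")
```

At these quoted prices the ratios come out to about 4.7x for GPT-5.2 and exactly 5x for Claude Sonnet 4.5, matching the article's "nearly five times" claim.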
Tulsee Doshi, senior director of product management for Gemini, said Google positions Gemini 3 Flash as a "workhorse" model: it retains reasoning ability close to Gemini 3 Pro's while running three times faster than Gemini 2.5 Pro at only a quarter of Gemini 3 Pro's cost.
4. Agents Are the Next Competitive Front
Looking at the recent frequent updates from OpenAI and Google, it is still difficult to determine who will win in the short term. However, from the perspectives of product design, publicity focus, and implementation paths, the next trend in the evolution of large models is becoming increasingly clear.
Whether it is GPT-5.2 repeatedly emphasizing that it "specializes in agents" on its promotional page, or Gemini 3 Flash pitching "high performance" at large-scale application scenarios, these two seemingly different paths lead to the same destination: agents.
The competition among large-scale AI foundation models has fully shifted from "cloud-model capabilities" to the "terminal and system layers".
From recent actions, the competition between Google and OpenAI is no longer limited to parameter scale, reasoning ability, and benchmark test results.
On the terminal side, Gemini 3 has fully replaced the traditional Google Assistant and become the core of the Android ecosystem. This change is particularly obvious in the latest Android Auto update: while driving, users can complete complex cross-application, multi-step operations with a single natural-language command, such as querying email, starting navigation, and notifying relevant contacts.
In the office scenario, Google is extending this "system ability" to Workspace. Relying on an ultra-long context window of 1M to 2M tokens, Drive, Docs, and Gmail are merged into a unified knowledge space that users can query directly. Instead of switching back and forth between files and emails, users can ask analytical questions against all their historical data and get structured results. This workflow-level change has significantly increased enterprise-user stickiness.
The feedback from the enterprise market is changing accordingly.
Marc Benioff, the founder of Salesforce, recently publicly stated that based on the performance of Gemini 3 in inference speed and accuracy, he and his company have shifted their preference for AI from ChatGPT to Gemini. Subsequently, Salesforce announced that it would integrate Gemini into the Agentforce 360 platform. This move is regarded as an important breakthrough for Google in the enterprise SaaS field originally dominated by Microsoft and OpenAI.
In response to Google's vertical integration, OpenAI has chosen to expand through alliances with technology giants. In the consumer market, the most important variable is Apple. iOS 26, expected between the end of 2025 and the beginning of 2026, will deeply integrate GPT-5.1. This goes beyond upgrading Siri's backend: it also involves system-level visual intelligence, letting users call the GPT model through a hardware-level camera interface to recognize and understand the real environment.
For OpenAI, this "hardware-direct-to-model" path is a key means of countering the Android ecosystem's advantages on mobile devices. In the enterprise and office fields, Microsoft remains OpenAI's most stable backer. Through Windows 11 and Microsoft 365, Microsoft's AI assistant Copilot continues to push GPT-5.1 into core enterprise processes. Microsoft's long accumulation at the operating-system and enterprise-cloud layers still forms an important moat for OpenAI.
Looking back over the past three years, since ChatGPT emerged in 2022, the core of industry competition has always revolved around two points: natural conversation and extensive knowledge. However, in 2025, as enterprises' expectations for AI have shifted from "content generation" to solving complex problems, cross - tool collaboration, and autonomous task execution, the competition dimension has fundamentally changed.
Although the paths seem different, the destination is the same. The real dividing line is not who chats better but who can complete tasks well, continuously, and reliably. Gemini 3 and GPT-5.2 stand on either side of this fork in the road.
This article is from the WeChat official account "IT Times" (ID: vittimes). Author: Jia Tianrong, Editor: Wang Xin. Republished by 36Kr with permission.