
Google's new Gemini 3.1 model made a late-night splash, delivering 363 tokens per second and undercutting Claude at roughly a quarter of the price.

新智元 2026-03-04 08:12
Google pulled another major move late at night: Gemini 3.1 Flash-Lite has officially debuted. With an output speed of 363 tokens per second and an input price of only $0.25 per million tokens, it outperforms GPT-5 mini and Gemini 2.5 Flash in benchmark tests, earning it the title of most capable "budget flagship."

After Gemini 3.1 Pro swept the leaderboards, Google dropped another bombshell late at night.

Just now, Gemini 3.1 Flash-Lite officially launched!

At 363 tokens/s and an output price of $1.50 per million tokens, it outperforms GPT-5 mini and Claude 4.5 Haiku in benchmarks.

On the same task, 3.1 Flash-Lite takes only 4 minutes versus 33 minutes for 2.5 Flash, while consuming the fewest tokens and achieving the highest accuracy.

It's no exaggeration to say that 3.1 Flash-Lite can almost achieve "instantaneous" output.

Upload any PDF, text, image, video, or audio file, and it can quickly convert it into Markdown.

Or try the 3.1 Flash-Lite "Particle Forger" demo, which can quickly generate a variety of dynamic particle effects; it is genuinely impressive.

Developers can currently try the preview version through the Gemini API in Google AI Studio, while enterprise users can access it via Vertex AI.
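For developers who want to try it, a minimal sketch of a `generateContent` call against the public Gemini REST endpoint looks like the following. The endpoint shape follows the documented `generativelanguage.googleapis.com` API, but the model id `gemini-3.1-flash-lite-preview` is inferred from the article and may differ from the actual preview id.

```python
import json
import urllib.request

API_BASE = "https://generativelanguage.googleapis.com/v1beta/models"

def build_generate_request(model: str, prompt: str, api_key: str):
    """Build the (url, payload) pair for a generateContent call.

    The payload shape follows the public Gemini REST API; the model id
    used below is taken from the article and is not confirmed.
    """
    url = f"{API_BASE}/{model}:generateContent?key={api_key}"
    payload = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, payload

def generate(model: str, prompt: str, api_key: str) -> str:
    """Send the request and return the first candidate's text."""
    url, payload = build_generate_request(model, prompt, api_key)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["candidates"][0]["content"]["parts"][0]["text"]

# Assemble (but do not send) a request for the hypothetical preview id:
url, payload = build_generate_request(
    "gemini-3.1-flash-lite-preview", "Convert this PDF text to Markdown.", "YOUR_KEY"
)
```

The same payload works unchanged on Vertex AI's `generateContent` route; only the base URL and authentication differ.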

Run 5 times faster at 1/4 of the price

Let's first look at the most intuitive numbers.

The output speed of 3.1 Flash-Lite reaches 363 tokens/s, nearly matching its sibling 2.5 Flash-Lite (366 tokens/s) and far outpacing the previous-generation Gemini 2.5 Flash (249 tokens/s).

What about the premium-tier players?

GPT-5 mini manages only 71 tokens/s, Claude 4.5 Haiku just 108 tokens/s, and Grok 4.1 Fast does somewhat better at 145 tokens/s.

In other words, Flash-Lite is about 5 times faster than GPT-5 mini and 3.4 times faster than Claude 4.5 Haiku, at roughly a quarter of the latter's price.
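The speed multiples quoted above follow directly from the tokens/s figures. A quick check, using only the numbers in the article:

```python
# Throughput figures quoted in the article (tokens per second).
speeds = {
    "Gemini 3.1 Flash-Lite": 363,
    "GPT-5 mini": 71,
    "Claude 4.5 Haiku": 108,
    "Grok 4.1 Fast": 145,
}

flash_lite = speeds["Gemini 3.1 Flash-Lite"]

# How many times faster Flash-Lite is than each rival, rounded to 1 decimal.
ratios = {
    name: round(flash_lite / tps, 1)
    for name, tps in speeds.items()
    if name != "Gemini 3.1 Flash-Lite"
}
# 363/71 ≈ 5.1x, 363/108 ≈ 3.4x, 363/145 ≈ 2.5x
```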

Now let's look at the specific pricing.

The input price of 3.1 Flash-Lite is $0.25 per million tokens, and the output price is $1.50 per million tokens.

3.1 Flash-Lite is 8 times cheaper than 3.1 Pro

By contrast, GPT-5 mini's output price is $2.00, Gemini 2.5 Flash's is $2.50, and Claude 4.5 Haiku's is as high as $5.00, more than 3 times Flash-Lite's.
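To make the pricing concrete, here is a small cost estimator using only the Flash-Lite prices quoted above ($0.25/M input, $1.50/M output); the 10k-in / 2k-out example call is illustrative, not from the article:

```python
def flash_lite_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one Gemini 3.1 Flash-Lite call,
    at the article's quoted prices: $0.25/M input, $1.50/M output."""
    return input_tokens / 1e6 * 0.25 + output_tokens / 1e6 * 1.50

# A hypothetical 10,000-token prompt with a 2,000-token answer:
cost = flash_lite_cost(10_000, 2_000)  # 0.0025 + 0.0030 = 0.0055 USD

# Claude 4.5 Haiku's output price ($5.00) relative to Flash-Lite's ($1.50):
haiku_multiple = round(5.00 / 1.50, 1)  # ≈ 3.3x, i.e. "more than 3 times"
```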

In a nutshell: it runs faster, costs less, and scores higher on benchmarks than its rivals.

Benchmark dominance: a small model punching above its weight

On GPQA Diamond, the benchmark that most directly tests scientific knowledge and reasoning, 3.1 Flash-Lite scores a high 86.9%.

This not only beats GPT-5 mini's 82.3% and Claude 4.5 Haiku's 73.0%, but also surpasses the larger, more expensive Gemini 2.5 Flash (82.8%).

Its multimodal understanding is equally strong.

In the MMMU-Pro test, Flash-Lite scores 76.8%, outperforming GPT-5 mini (74.1%), Gemini 2.5 Flash (66.7%), Grok 4.1 Fast (63.0%), and Claude 4.5 Haiku (58.0%).

On the factual-accuracy benchmark SimpleQA Verified, the gap is even starker.

Flash-Lite leads with 43.3% accuracy, versus 28.1% for Gemini 2.5 Flash, just 9.5% for GPT-5 mini (less than a quarter of Flash-Lite's score), and a mere 5.5% for Claude 4.5 Haiku (nearly an eighth).

On multilingual ability, Flash-Lite tops the MMMLU test at 88.9%, ahead of Gemini 2.5 Flash's 86.6% and GPT-5 mini's 84.9%, with no rival in this price range.

In video understanding, its Video-MMMU score of 84.8% is also the highest in its class, ahead of both GPT-5 mini (82.5%) and Gemini 2.5 Flash (79.2%).

Of course, 3.1 Flash-Lite is not without its shortcomings.

On the LiveCodeBench code-generation test, Flash-Lite scores a respectable 72.0%, but GPT-5 mini is clearly stronger at 80.4%, and Grok 4.1 Fast also reaches 76.5%.

On Humanity's Last Exam, Flash-Lite's 16.0% is essentially level with GPT-5 mini's 16.7%, while Grok 4.1 Fast takes the top score in this class at 17.6%.

But don't forget the core fact: Flash-Lite costs a fraction of what these rivals do.

Top 40 globally in the Arena

Lab benchmarks are only part of the picture; real blind tests are the true measure.

In the text arena of Chatbot Arena, 3.1 Flash-Lite ranks 36th with an Elo score of 1432.

Its neighbors are o3 (1432 points) and GPT-5 High (1434 points), and Grok 4.1 Fast Reasoning (1430 points) follows closely behind.

A lightweight model priced from $0.25 ties OpenAI's flagship reasoning model o3 on Elo. That cost-performance ratio is striking.

In the code arena, 3.1 Flash-Lite scores 1261 and ranks 35th (tied).

Its nearby opponents include Claude Haiku 4.5 (1308 points, 31st) and DeepSeek V3.2 (1321 points, 34th). The gap is not large, but there is still room to improve.

In Artificial Analysis's evaluation, 3.1 Flash-Lite currently offers the industry's best combination of output speed and cost-effectiveness.

Adjustable "depth of thinking"

Beyond raw performance, 3.1 Flash-Lite also ships with a thinking-levels feature that lets developers set how much reasoning effort the model invests in each task.

  • For high-frequency, low-complexity tasks such as batch translation, content review, and data classification, use the shallow thinking mode for minimum latency and cost.
  • For generating UI interfaces, building simulation environments, or executing multi-step complex instructions, switch to the deep reasoning mode for results comparable to larger models.
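As a sketch, a per-request thinking setting might be passed in the `generateContent` payload like this. The `thinkingConfig`/`thinkingLevel` field names follow Gemini's documented thinking controls, but the exact values accepted by 3.1 Flash-Lite are an assumption, not confirmed by the article:

```python
def build_payload(prompt: str, thinking_level: str) -> dict:
    """Assemble a generateContent payload with a per-request thinking
    setting. Field names follow Gemini's documented thinking controls;
    the accepted level values for 3.1 Flash-Lite are an assumption.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingLevel": thinking_level}
        },
    }

# Shallow thinking for a batch-translation call, deep for UI generation:
fast = build_payload("Translate these 50 product names to French.", "low")
deep = build_payload("Generate a responsive dashboard UI in HTML.", "high")
```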

Actual test: The heavyweight performance of a lightweight model

In actual tests, 3.1 Flash-Lite demonstrates capabilities far beyond its positioning.

E-commerce scenario: instantly populating a prototype mockup.

Given a single instruction, Flash-Lite can populate an entire e-commerce interface prototype, spanning dozens of categories and hundreds of products, with names, prices, classifications, and image placeholders in just a few seconds.

In the past, this would have taken a designer half a day of manual work; now a single prompt does it.

Real-time data dashboard: Weather forecast + historical analysis.

Flash-Lite can combine a live weather-forecast API with historical data to generate a dynamic weather-visualization dashboard in real time.

For developers who need to stand up a data display layer quickly, this capability effectively saves them a "front-end engineer."

SaaS AI agent: Multi-step task automation.

Flash-Lite can power a SaaS agent that handles flexible multi-step tasks, helping enterprises automate processes such as customer-ticket handling and order tracking.

With its low latency and low cost, such high-frequency call scenarios are exactly Flash-Lite's home turf.

Mass content processing: Quick analysis and categorization.

Faced with large volumes of unstructured content such as images, documents, and user comments, Flash-Lite can quickly analyze, tag, and categorize it all.