Gemini 3 "Eye-opening" Pixel-level Manipulation, Google Responds to DeepSeek-OCR2
[Introduction] Google DeepMind has just introduced a new capability that gives Gemini 3 Flash a kind of "clairvoyance" through code.
Google DeepMind has just launched a significant new capability for Gemini 3 Flash: Agentic Vision. (Could it have been prompted by DeepSeek-OCR2?)
This technology fundamentally changes how large language models understand the visual world:
from "guessing" in the past to "in-depth investigation" today.
The capability comes from the Google DeepMind team. Product manager Rohan Doshi explained that traditional AI models usually take only a single, static look when processing an image.
If a detail in the image is too small, such as the serial number on a microprocessor chip or a distant, blurry road sign, the model often has to resort to guessing.
Agentic Vision introduces a "Think, Act, Observe" closed loop:
The model no longer passively receives pixels but actively writes Python code to manipulate images according to the user's needs.
This capability directly delivers a 5% to 10% performance gain for Gemini 3 Flash across a range of visual benchmarks.
Agentic Vision: The New Frontier of Agent-Based Vision
The approach DeepMind is exploring can be summarized as: using code execution as a tool for visual reasoning, turning passive visual understanding into an active, agentic process.
What does this mean? Today's SOTA models usually process an image in a single pass.
But Agentic Vision introduces a cycle:
1. Think: The model analyzes the user's query and the initial image to formulate a multi-step plan.
2. Act: The model generates and executes Python code to actively manipulate the image (such as cropping, rotating, annotating) or analyze the image (such as running calculations, counting bounding boxes, etc.).
3. Observe: The transformed image is appended to the model's context window. This allows the model to check new data with better context before generating the final response.
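To make the cycle concrete, here is a minimal, purely illustrative Python sketch of such a loop. Every function below is a hypothetical stub; this is not Google's implementation, only the general shape of the Think, Act, Observe pattern.

# Illustrative sketch of a Think-Act-Observe loop. All functions are stubs.
def plan_step(query, context):
    # Think: decide which image operation would help answer the query.
    return "crop and enlarge the region containing the serial number"

def run_python(action, context):
    # Act: execute model-written Python against the image (stubbed out here).
    return f"<new image produced by: {action}>"

def final_answer(query, context):
    # Generate the response once enough visual evidence has been gathered.
    return f"Answer to {query!r}, grounded in {len(context)} observations."

def think_act_observe(query, image, max_steps=3):
    context = [image]                                # start from the raw pixels
    for _ in range(max_steps):
        action = plan_step(query, context)           # Think
        observation = run_python(action, context)    # Act
        context.append(observation)                  # Observe: append to context
    return final_answer(query, context)

print(think_act_observe("What is the chip's serial number?", "<input image>"))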
Agentic Vision in Practice
By enabling code execution in the API, developers can unlock a range of new behaviors.
The demo applications in Google AI Studio already showcase this.
1. Zooming and inspecting
Gemini 3 Flash is trained to perform implicit zooming when detecting fine-grained details.
PlanCheckSolver.com, an AI-driven building-plan verification platform, improved its accuracy by 5% by enabling Gemini 3 Flash's code execution to iteratively inspect high-resolution inputs.
A video of the background logs shows the agentic process: Gemini 3 Flash generates Python code to crop specific patches (such as roof edges or structural components) and analyze them as new images.
By appending these cropped patches back to its context window, the model visually grounds its reasoning and confirms compliance with complex building codes.
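The kind of cropping step the model emits is, in spirit, ordinary image manipulation. Here is a minimal sketch of such a step using Pillow; the file path and crop box are assumptions chosen for illustration, not values from the demo.

from PIL import Image

# Assumed input file; replace with your own high-resolution plan or photo.
img = Image.open("floor_plan.jpg")

# Crop a patch of interest (left, upper, right, lower, in pixels) and
# upscale it so fine details become legible on the next pass.
patch = img.crop((1200, 800, 1600, 1100))
patch = patch.resize((patch.width * 3, patch.height * 3), Image.LANCZOS)

# The enlarged patch would then be appended to the model's context as a new image.
patch.save("patch_zoomed.jpg")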
2. Image annotation
Agentic Vision allows the model to interact with the environment by annotating images.
Gemini 3 Flash not only describes what it sees; it can also execute code to draw directly on the canvas to ground its reasoning.
In the example below, the model is asked in the Gemini app to count the fingers on a hand.
To avoid miscounting, it uses Python to draw a bounding box and a number label on each finger it identifies.
This "visual scratch paper" ensures that its final answer is grounded in precise, pixel-level understanding.
3. Visual math and plotting
Agentic Vision can parse high-density tables and execute Python code to visualize the findings.
Standard LLMs often produce hallucinations in multi-step visual arithmetic.
Gemini 3 Flash bypasses this problem by performing calculations in a deterministic Python environment.
In the demonstration application example in Google AI Studio, the model identifies the original data, writes code to normalize the previous SOTA to 1.0, and generates a professional Matplotlib bar chart. This replaces probabilistic guessing with verifiable execution.
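As a rough illustration of that normalize-then-plot step, here is a self-contained Matplotlib sketch. The model names and scores are invented placeholders, not numbers from the demo.

import matplotlib.pyplot as plt

# Invented benchmark numbers, for illustration only.
scores = {"Previous SOTA": 71.3, "Model B": 68.9, "Gemini 3 Flash": 76.8}

# Normalize so that the previous SOTA is exactly 1.0, as described above.
baseline = scores["Previous SOTA"]
normalized = {name: value / baseline for name, value in scores.items()}

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(normalized.keys()), list(normalized.values()), color="steelblue")
ax.axhline(1.0, linestyle="--", color="gray", linewidth=1)  # SOTA reference line
ax.set_ylabel("Score (previous SOTA = 1.0)")
ax.set_title("Normalized benchmark scores")
fig.tight_layout()
fig.savefig("normalized_scores.png")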
How to Get Started
Agentic Vision is now available through the Gemini API in Google AI Studio and Vertex AI.
It is also rolling out in the Gemini app (accessed by selecting "Thinking" from the model drop-down menu).
Here is a simple Python code example showing how to call this capability:
from google import genai
from google.genai import types

client = genai.Client()

image = types.Part.from_uri(
    file_uri="https://goo.gle/instrument-img",
    mime_type="image/jpeg",
)

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[image, "Zoom into the expression pedals and tell me how many pedals are there?"],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())]
    ),
)

print(response.text)
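If you want to see what the model actually did, the response parts should also contain the Python it wrote and the execution output it observed. Roughly, and assuming the google-genai SDK's standard part fields (check your installed SDK version for the exact attribute names):

# Inspect the agentic trace alongside the final text (field names assumed
# from the google-genai SDK; verify against your installed version).
for part in response.candidates[0].content.parts:
    if part.executable_code is not None:
        print("--- model-written code ---")
        print(part.executable_code.code)
    if part.code_execution_result is not None:
        print("--- execution output ---")
        print(part.code_execution_result.output)
    if part.text is not None:
        print("--- text ---")
        print(part.text)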
Future Outlook
Google says that Agentic Vision is just getting started.
Currently, Gemini 3 Flash is good at implicitly deciding when to zoom in on tiny details. Other behaviors (such as rotating images or performing visual math) still require explicit prompting to trigger, but Google is working to make them fully implicit in future updates.
In addition, Google is exploring how to give Gemini models more tools (including web search and reverse image search) to further ground their understanding of the world, and plans to extend this capability to model sizes beyond Flash.
Easter Egg: Is it Because of DeepSeek?
This is very interesting.
Just after DeepSeek open-sourced DeepSeek-OCR, which could be called "OCR 2.0," Google released Agentic Vision for Gemini 3.
Is this really a coincidence?
Let's boldly speculate: Google's late-night release this time was very likely forced out by DeepSeek.
There are three reasons:
1. The Striking Coincidence of Timing
On January 27th, DeepSeek released DeepSeek-OCR2, built around its core breakthrough, DeepEncoder V2. It abandons traditional mechanical scanning and lets AI learn to "read in logical order" the way humans do, achieving an essentially complete understanding of complex layouts and charts with just a few hundred tokens.
On the very same day, Google came out with Agentic Vision, as if shouting back across the divide in this "visual arms race": "You make AI understand logic; we let AI operate directly."
2. The Peak Duel of Technical Routes
DeepSeek-OCR2 takes the "internal cultivation" route: it simulates the human visual attention mechanism through DeepEncoder V2, dynamically reorganizes image information, and achieves extremely lightweight, logically ordered "seeing."
Google's Agentic Vision takes the "external tooling" route: not only see clearly, but also act. DeepSeek is teaching AI how to "look attentively," while Google is teaching AI how to "compute with its hands."
3. The Final Battle for Defining Visual AI
DeepSeek-OCR2 shows that even a 3B small model can outperform large models as long as the "visual logic" is right. Google tries a dimensionality-reduction strike with "code execution": no matter how well you see, it is still just "seeing," whereas I can write code to verify, which is "true understanding."
This battle is essentially about who gets to redefine "machine vision": extreme perception, or all-around interaction?
Regardless of whether this is a "stress response," we programmers are the ones who will benefit from this epic battle.
Reference:
https://blog.google/innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash/?linkId=43682412
This article is from the WeChat public account "New Intelligence Yuan", edited by Dinghui, published by 36Kr with authorization.