
A 7B model beats o3 and GPT-5: a medical AI agent teaches the model "where to look and how to look"

QbitAI 2026-05-27 18:09
Medical AI agents have reached a critical turning point

Medical AI can write explanations, but that does not mean it really "sees" the key evidence.

In the past, most medical multimodal models encoded an image or a video into visual features and then let the large model generate answers and explanations.

However, the problem is that a tiny lesion, a boundary change, or a few-second surgical action often determines whether the answer is valid.

When the model "passively receives" visual context, it is easy to misidentify regions and miss lesions.

To address this issue, the LeapQuest team from Shanghai Chuangzhi College collaborated with Zhejiang University, Shanghai Jiao Tong University, and Fudan University and published two papers accepted by ICML 2026. They applied the Think with Images/Think with Videos paradigm to the medical AI field for the first time:

The model no longer just generates explanations after viewing images or videos. Instead, it actively invokes visual tools during the reasoning chain, re-observes key regions or moments, and revises judgments with new evidence.

This means that vision is no longer just an input; visual evidence itself has become part of the model's thinking process.

The two works are not isolated model upgrades. Instead, they jointly propose a new paradigm for medical AI:

Let visual evidence enter the model's intermediate thinking process, and advance "explanation" from post-hoc language generation to evidence verification during the reasoning process.

It's not about being better at "writing explanations", but about starting to "think with visual evidence"

The most common way for medical AI to work in the past was to encode an image or a video into visual features and then let the large model generate answers and explanations.

The problem is that a seemingly complete explanation doesn't mean the model really sees the key evidence. Especially in medical scenarios, a tiny lesion, a boundary change, or a few-second surgical action often determines whether the answer is valid.

Ophiuchus and MedScope have taken this problem a step further: The multimodal model no longer just "passively receives visual context". Instead, it actively decides whether more evidence is needed, where to look, and which segment to review during the reasoning process, and incorporates the observation results returned by the tools into subsequent reasoning.

This is the "think with images/think with videos" paradigm systematically proposed in the medical AI field for the first time: Vision is no longer just an input; visual evidence itself has become part of the model's thinking process.

Think with Images: Let the model "take a second look" in image diagnosis

Ophiuchus starts from a simple observation: although existing medical multimodal large models can write step-by-step reasoning, they are still prone to "misidentifying regions, missing lesions, and mistaking normal structures for abnormalities" on tasks that require fine-grained visual evidence.

The cause is not simply weak language ability but an inadequate mechanism for visual interaction.

Therefore, Ophiuchus transforms the large model into a visual agent that can cooperate with medical image tools.

It can decide whether to invoke external visual tools based on the current reasoning state: use SAM2 for fine segmentation, use BiomedParse to locate medical structures according to text prompts, and use Zoom-in to magnify key regions.

The output of a tool invocation is not an isolated result. It returns to the reasoning chain as an observation and drives the next judgment.

More importantly, Ophiuchus doesn't "plug in" the tools outside the model. Instead, it makes the tools part of the reasoning chain.

The model needs to learn when to invoke tools, which tool to choose, how to interpret tool outputs, and how to revise strategies when tool results are unreliable.

This enables the model to move from "being able to invoke tools" to "being able to think with tools".
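The loop described above, in which the model reasons, optionally calls a tool, and folds the observation back into its reasoning, can be sketched roughly as follows. This is a minimal illustration, not the Ophiuchus implementation: the `Step` record, the `run_agent` function, the scripted `toy_policy`, and the two placeholder tools are all hypothetical, and the real tools (SAM2, BiomedParse, Zoom-in) return images and masks rather than strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins for real visual tools; they return text placeholders.
def zoom_in(region: str) -> str:
    return f"magnified view of {region}"

def segment(region: str) -> str:
    return f"segmentation mask for {region}"

TOOLS: Dict[str, Callable[[str], str]] = {"zoom_in": zoom_in, "segment": segment}

@dataclass
class Step:
    thought: str
    tool: Optional[str] = None  # None means "answer now"
    argument: str = ""
    answer: str = ""

def run_agent(policy: Callable[[List[str]], Step], question: str,
              max_steps: int = 4) -> Tuple[str, List[str]]:
    """Alternate between reasoning and tool calls until the policy answers."""
    trace = [f"question: {question}"]
    for _ in range(max_steps):
        step = policy(trace)
        trace.append(f"thought: {step.thought}")
        if step.tool is None:
            return step.answer, trace
        if step.tool not in TOOLS:
            # An unusable call is also fed back so the policy can change strategy.
            trace.append(f"observation: unknown tool {step.tool}")
            continue
        trace.append(f"observation: {TOOLS[step.tool](step.argument)}")
    return "no answer within step budget", trace

def toy_policy(trace: List[str]) -> Step:
    """Scripted policy: look once, then answer. A trained model would go here."""
    if not any(line.startswith("observation:") for line in trace):
        return Step("the lesion boundary is unclear", tool="zoom_in",
                    argument="upper-left quadrant")
    return Step("the magnified view settles it", answer="B")

answer, trace = run_agent(toy_policy, "Which option matches the lesion?")
for line in trace:
    print(line)
print("answer:", answer)
```

The point of the sketch is that every observation is appended to the same trace the policy reads on its next turn, so a tool result can change the following thought instead of being attached after the answer is fixed.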

The value of Ophiuchus is not just to provide medical large models with a few more visual tools. It enables the model to learn to actively "decide where to look, how to look, and how to revise after looking" during the diagnosis process.

From closed-source SOTA to medical Agent: Ophiuchus proves with results that "seeing more details" is the key

With the same external tool configuration, Ophiuchus-7B achieved an average score of 68.0 on 8 VQA benchmarks, higher than the 62.2 of OpenAI o3, the 61.8 of Gemini 2.5 Pro, and the 59.9 of GPT-5.

In the tool-use accuracy assessment, Ophiuchus achieved an average tool-invocation accuracy of 97.9%.

The implications behind these results are more important than "being first on a certain list":

When a problem truly depends on local structures, lesion boundaries, and cell-level evidence, model size or language reasoning is not the only bottleneck.

Medical AI needs a mechanism that allows visual evidence to continuously enter the reasoning process.

Think with Videos: Moving from "thinking with images" to "reviewing key moments"

If Ophiuchus solves the problem of local evidence in medical images, MedScope extends this paradigm to the more challenging long-video scenario.

The challenge of long clinical videos is that key evidence is both fine-grained and sparse: the model has to observe the right content at the right time.

A surgical action, a change in the endoscopic field of view, or the moment a device enters or leaves may only last a few seconds, but it determines whether the model really understands the clinical process.

The "think with videos" approach proposed by MedScope doesn't require the model to compress the entire video into context at once. Instead, it mimics the observation method of clinical doctors:

First, quickly establish a global understanding, then return to the suspicious time window, use crop_video to intercept segments, use get_frame to obtain key frames, and finally integrate these local observation results into the answer.

This makes MedScope's reasoning process inherently reviewable: to understand why the model gives a certain result, we can look not only at what it "says" but also at which video segment it reviewed, which frames it found, and whether this evidence supports the conclusion.
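The coarse-to-fine viewing pattern can be illustrated with a small sketch. The tool names `crop_video` and `get_frame` come from the article, but the signatures, the list-of-strings video representation, and the fixed frame rate below are assumptions made for illustration only; they are not MedScope's actual interface.

```python
from typing import List, Sequence

def crop_video(frames: Sequence[str], fps: float,
               start_s: float, end_s: float) -> List[str]:
    """Return the frames whose timestamps fall in [start_s, end_s)."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    first = max(0, int(start_s * fps))
    last = min(len(frames), int(end_s * fps))
    return list(frames[first:last])

def get_frame(frames: Sequence[str], fps: float, time_s: float) -> str:
    """Return the frame at time_s, clamped to the video bounds."""
    index = min(len(frames) - 1, max(0, int(time_s * fps)))
    return frames[index]

# A toy 60-second video at 1 frame per second; strings stand in for images.
FPS = 1.0
video = [f"frame_{i:03d}" for i in range(60)]

overview = video[::10]                      # 1. sparse pass for global context
clip = crop_video(video, FPS, 42.0, 47.0)   # 2. revisit a suspicious window
key = get_frame(video, FPS, 44.5)           # 3. pull one key frame from it

print(len(overview), clip[0], clip[-1], key)
```

Because each call records a time window or timestamp, the sequence of calls doubles as an audit trail of what the model looked at before answering.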

ClinVideoSuite and GA-GRPO: Let the video model learn to "find evidence" rather than just "guess answers"

To enable the model to truly learn this behavior, MedScope built ClinVideoSuite: It includes 635K densely timestamped captions, 254K evidence-related QAs, 34K visual CoT trajectories, and an interactive training environment for reinforcement learning.

The data is not simple Q&As but emphasizes that the questions must rely on visual evidence in local time windows.

In terms of training, MedScope adopts a three-stage approach:

Stage 1: Conduct clinical reasoning warm-up to learn medical semantics and long-range video understanding;

Stage 2: Use visual-CoT cold-start SFT to teach the model when more evidence is needed and how to invoke tools;

Stage 3: Use GA-GRPO to strengthen tool use for temporal alignment. Through a grounding-aware reward and an evidence-modulated advantage, make the model more inclined to retrieve visual segments that truly support the conclusion.
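The article does not give the formulas behind GA-GRPO, so the following is only one plausible reading of "grounding-aware reward" and "evidence-modulated advantage", built on the standard GRPO idea of normalizing rewards within a group of rollouts. The reward weight, the choice of temporal IoU as the grounding score, and the rule that scales positive advantages by IoU are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Window = Tuple[float, float]  # (start_s, end_s)

def temporal_iou(pred: Window, gold: Window) -> float:
    """Intersection over union of two time windows."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_aware_reward(correct: bool, iou: float, weight: float = 0.5) -> float:
    """Assumed form: answer correctness plus a weighted grounding bonus."""
    return float(correct) + weight * iou

def group_advantages(rewards: List[float], ious: List[float],
                     eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages, with positive ones scaled by evidence quality."""
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = []
    for reward, iou in zip(rewards, ious):
        base = (reward - mu) / (sigma + eps)
        # Assumed modulation: a good outcome earns credit only in proportion
        # to how well the retrieved segment overlaps the gold evidence.
        advantages.append(base * iou if base > 0 else base)
    return advantages

gold = (40.0, 50.0)
rollouts = [(True, (41.0, 49.0)),   # right answer, right segment
            (True, (0.0, 10.0)),    # right answer, wrong segment
            (False, (40.0, 50.0)),  # wrong answer, right segment
            (False, (0.0, 5.0))]    # wrong answer, wrong segment

ious = [temporal_iou(window, gold) for _, window in rollouts]
rewards = [grounding_aware_reward(ok, iou) for (ok, _), iou in zip(rollouts, ious)]
adv = group_advantages(rewards, ious)
print([round(a, 3) for a in adv])
```

Under these assumptions, the rollout that guesses the right answer from an unrelated segment receives zero advantage, which is the behavior the article attributes to the method: credit flows to answers backed by the supporting segment.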

In evaluations such as SVU-31K and ClinVideo-Eval, MedScope achieved SOTA among open-source models in multi-granularity video understanding, fine-grained temporal reasoning, and grounded VQA.

The paper also shows that removing the evidence reward significantly degrades localization quality: for example, R@0.5 drops from 40.1 to 33.2 and mIoU falls to 38.8, indicating that answer-level supervision alone is insufficient to teach the model to reliably select evidence.
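For readers unfamiliar with the two localization metrics, here is a short sketch assuming their usual definitions in temporal grounding: R@0.5 is the share of predictions whose temporal IoU with the ground-truth window is at least 0.5, and mIoU is the mean IoU over all predictions. The sample windows are made up, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start_s, end_s)

def temporal_iou(pred: Window, gold: Window) -> float:
    """Intersection over union of two time windows."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(ious: List[float], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU meets the threshold."""
    return sum(iou >= threshold for iou in ious) / len(ious)

def mean_iou(ious: List[float]) -> float:
    return sum(ious) / len(ious)

# Made-up (prediction, ground truth) pairs for illustration.
pairs = [((10.0, 20.0), (10.0, 20.0)),   # exact match, IoU 1.0
         ((0.0, 10.0), (5.0, 15.0)),     # partial overlap, IoU 1/3
         ((30.0, 40.0), (50.0, 60.0)),   # no overlap, IoU 0.0
         ((12.0, 20.0), (10.0, 20.0))]   # near match, IoU 0.8

ious = [temporal_iou(pred, gold) for pred, gold in pairs]
print(f"R@0.5 = {100 * recall_at(ious):.1f}, mIoU = {100 * mean_iou(ious):.1f}")
```

Both numbers are usually reported as percentages, which matches the scale of the figures quoted from the paper.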

The real paradigm shift: Vision changes from "input" to "thinking process"

Looking at the two works together, the most important thing is not that Ophiuchus deals with images and MedScope deals with videos. Instead, they jointly define a new medical multimodal intelligence paradigm:

The model's reasoning process is no longer just the expansion of language tokens. Instead, it is a closed-loop interaction between language, tools, image regions, video segments, and evidence feedback.

The next key ability of medical AI is not to generate longer explanations but to actively search for, verify, and cite visual evidence before giving explanations.

Ophiuchus and MedScope have turned this from a methodology into a trainable, evaluable, and scalable technical route.

Why this may be the key inflection point for medical AI Agents

The biggest difference between medical tasks and general visual Q&A is that every conclusion requires an evidence chain.

Radiologists will magnify the edges of lesions, pathologists will look for cell morphologies, surgeons will review key operations, and endoscopists will track the appearance and disappearance of lesions over time.

That is to say, clinical visual reasoning is inherently interactive, evidence-driven, and reviewable.

The significance of "Think with Images/Videos" is to make medical AI closer to this real-world clinical cognitive approach.

It is no longer satisfied with one-time prediction. Instead, it establishes a cycle of "hypothesis, verification, revision, answer" within the model.

This provides three important capabilities for clinically trustworthy AI: fewer hallucinations, stronger interpretability, and better suitability for complex processes.

Medical AI starts to truly "think while looking"

From Ophiuchus to MedScope, we can see a fundamental paradigm shift in medical multimodal large models:

From looking at images and videos to continuously looking during the reasoning process; from outputting answers to actively searching for evidence; from language chains to multimodal thinking chains involving visual evidence.

This also explains why "think with images/videos" deserves to be singled out as a paradigm of its own.