
Peking University and ByteDance jointly open-source the first spatio-temporal reasoning video model with fully transparent thinking process and performance surpassing GPT-4o

量子位 (QbitAI) 2025-11-05 16:32
It can watch videos and accurately mark locations.

AI can now highlight key points when watching videos!

It can not only answer "what" and "what happened", but also point out "when and where" it happened.

A joint team from Peking University and ByteDance has launched Open-o3 Video, the first open-source model that embeds explicit spatio-temporal evidence throughout the video reasoning process. This enables the AI not only to answer questions correctly but also to intuitively mark the specific locations during its thinking process, truly achieving traceable video reasoning.

Meanwhile, the model adopts a non-agent architecture, avoiding complex tool calls and multi-round reasoning. It directly completes the closed-loop of "watch - think - prove - answer" in a single response.

Across multiple video reasoning benchmarks, its key metrics improve by as much as 24.2%, and its performance surpasses that of closed-source models such as GPT-4o and Gemini-2-Flash.

More details below.

Research Background

Video understanding is one of the most complex tasks for multi-modal large language models (MLLMs).

Different from static images, videos simultaneously carry dynamic changes in the time dimension and scene interactions in the space dimension.

This means that the model not only has to identify the objects and actions (What) in the frame but also must determine when (When) they appear and where (Where) they occur.

Recently, models such as Video-R1 and VideoRFT have significantly improved the logical consistency of video understanding through reinforcement learning. However, their thinking chains are still purely textual. The models may answer questions correctly but cannot point out the specific frames that support the answers.

This "black-box reasoning" makes the model's judgments difficult to explain and verify.

In addition, OpenAI's o3 model first proposed the concept of "Thinking with Images". By embedding images (such as boxed areas, local magnifications, and zoom views) in the reasoning process, the model can naturally reference visual clues in the reasoning chain, thus achieving "reasoning with evidence".

However, extending this concept to the video field, that is, enabling the model to provide both temporal and spatial evidence in reasoning, is more challenging:

1. It is difficult to maintain the consistency of text, timestamps, and object bounding boxes during reasoning.

The model needs to accurately align the time points at which events occur across dozens or hundreds of frames; any drift leads to logical errors in the reasoning and makes training difficult.

Moreover, the position of the same object changes significantly across frames, so its spatial location must be tracked continuously through the temporal dynamics.

2. There is a serious lack of spatio-temporal coupling supervision.

Existing data either only provides temporal annotations (Temporal Grounding) or only has single-frame spatial boxes (Spatial Grounding), lacking unified spatio-temporal annotations and corresponding thinking chains.

Model Training Process

Compensating for Data Shortcomings

Therefore, the most fundamental bottleneck in video reasoning based on spatio-temporal positioning clues lies in the data.

Existing video understanding datasets typically carry annotations in only the temporal or only the spatial dimension; there is a gap between the two modalities, with no reasoning-chain data that couples time and space.

So, the team constructed the first unified corpus system for explicit spatio-temporal reasoning, STGR (Spatio-Temporal Grounded Reasoning), which includes two parts: STGR-CoT-30k and STGR-RL-36k.

The former is used for supervised fine-tuning (SFT) to help the model learn the reasoning format and output structure with spatio-temporal annotations; the latter is used in the reinforcement learning stage (RL) to provide high-quality reward signals to continuously optimize the model's spatio-temporal alignment and evidence generation capabilities.

Both datasets cover four types of tasks: temporal grounding, spatial grounding, spatio-temporal grounding, and video question answering.

Among them, 5.9k high-quality spatio-temporal samples were annotated by the team through a dedicated data pipeline. The specific process is as follows:

1. For two data sources (temporal grounding and plm-rdcap), Gemini 2.5 Pro was used for initial annotation to generate question-answer pairs, initial key frames, object detection boxes, and reasoning processes. The format of the explicit spatio-temporal positioning is as follows:

"<obj>object_name</obj><box>[x min, y min, x max, y max]</box>at<t>timestamp</t>s"

2. Since the quality of the detection boxes annotated by the large model is limited, the team filtered them in two ways (a code sketch follows this list):

Eliminate invalid boxes with an overly large coverage area (more than 80% of the frame);

Verify with Qwen2.5-VL-7B whether the object category matches, for example, using the query "Is this a dog?" to confirm the content of the detection box.

3. Consistency check: Rewrite the reasoning chain to ensure a one-to-one correspondence between questions-answers, timestamps, object names, bounding boxes, and the reasoning chain, and delete redundant or inconsistent samples.
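To make the pipeline concrete, below is a minimal Python sketch of how such evidence strings might be parsed and how the two box filters could be applied. The function names, the placeholder Qwen2.5-VL-7B call, and the cropping helper are illustrative assumptions; only the tag format, the 80% area threshold, and the yes/no verification query come from the description above.

```python
import re

# Evidence format from the pipeline above:
# "<obj>object_name</obj><box>[x_min, y_min, x_max, y_max]</box> at <t>timestamp</t>s"
EVIDENCE_PATTERN = re.compile(
    r"<obj>(?P<name>.+?)</obj>\s*"
    r"<box>\[(?P<box>[^\]]+)\]</box>\s*at\s*"
    r"<t>(?P<t>[\d.]+)</t>s"
)

def parse_evidence(text):
    """Extract (object name, bounding box, timestamp) triples from a reasoning chain."""
    triples = []
    for m in EVIDENCE_PATTERN.finditer(text):
        box = [float(v) for v in m.group("box").split(",")]
        triples.append({
            "object": m.group("name").strip(),
            "box": box,                        # [x_min, y_min, x_max, y_max]
            "timestamp": float(m.group("t")),  # seconds
        })
    return triples

def box_area_ratio(box, frame_w, frame_h):
    """Fraction of the frame covered by a pixel-coordinate box."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min) / (frame_w * frame_h)

def keep_box(box, frame_w, frame_h, object_name, crop_frame, vlm_yes_no, max_ratio=0.8):
    """Apply the two filtering rules from step 2.

    `crop_frame(box)` and `vlm_yes_no(image, question)` are placeholders for
    cropping the key frame and querying Qwen2.5-VL-7B with a yes/no question.
    """
    # Rule 1: drop boxes covering more than 80% of the frame.
    if box_area_ratio(box, frame_w, frame_h) > max_ratio:
        return False
    # Rule 2: keep the box only if the VLM confirms the object category.
    return vlm_yes_no(crop_frame(box), f"Is this a {object_name}?")
```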

Two-Stage Training Method

After laying the foundation with high-quality spatio-temporal corpus, the key question becomes how to enable the model to truly learn to "think in videos".

The team found that supervised fine-tuning alone cannot achieve satisfactory results, because during the supervised stage the model mostly imitates the language patterns of human annotators rather than truly understanding the logical relationship between visual clues and the reasoning structure.

Therefore, to enable the model to actively discover and reference key evidence, a self-correcting reinforcement learning mechanism must be adopted, allowing the reward signal to directly constrain "which frame to watch, which area to focus on, and what to think about".

This concept forms the core of Open-o3 Video's training: a two-stage learning mechanism - cold-start pre-training and reinforcement learning based on GSPO.

During the cold-start stage, the model first undergoes supervised fine-tuning using the STGR-CoT-30k data.

The goal of this stage is to enable the model to master the reasoning format and output specification, that is, how to simultaneously generate structured tags such as <obj>, <box>, and <t> in the answer, and to learn to tie the reasoning chain to the video content.

This stage is equivalent to "teaching the model to speak": it learns how to describe visual evidence in language but has not yet formed a spontaneous evidence selection strategy.

In other words, the cold-start stage enables the model to have the "ability to generate traceable answers", and the next stage is to make this ability accurate, stable, and generalizable.

In the second stage, the team introduced the reinforcement learning framework GSPO.

Compared with the widely used GRPO, GSPO optimizes based on sequences, which is more conducive to the stability of long-term training and avoids the collapse of the thinking chain.

At this stage, the model is required to generate a complete spatio-temporal reasoning sequence in an open video scenario and then self-correct through the reward function. The reward function consists of three parts:

r_acc measures the correctness of the answer; r_thk reflects the rationality and completeness of the reasoning chain, encouraging the model to make full use of visual evidence in its thinking text, scored with metrics such as temporal IoU and spatial IoU; r_fmt evaluates whether the reasoning format meets the specification.
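The article does not spell out how the three terms are combined; a simple weighted sum, sketched below with illustrative coefficients, is one common way to do it.

```python
def total_reward(r_acc, r_thk, r_fmt, w_thk=0.5, w_fmt=0.1):
    """Combine the three reward terms described above.

    The weights here are illustrative placeholders, not values reported by the team.
    """
    return r_acc + w_thk * r_thk + w_fmt * r_fmt
```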

The team particularly emphasized that a single accuracy reward cannot support multi-modal interpretable reasoning because the model may "luckily guess" the answer but ignore the key frames. Only when the reasoning process itself is included in the optimization goal will the model truly learn how to think in the visual world.

However, using reinforcement learning to simultaneously optimize the positioning capabilities in both the temporal and spatial dimensions is very challenging. In particular, it should be noted that the spatial reward (IoU) must rely on the accuracy of time prediction.

Specifically, if the time prediction is incorrect, even if the spatial box position is correct, it cannot correspond to the ground truth. That is, time prediction is the premise for training stability.

However, if strict time constraints are imposed on the temporal reward from the start, the model rarely receives any reward early in training and learning stalls. If loose constraints are kept throughout, the model does receive rewards, but the temporal reward saturates easily, predictions never converge to the exact position, and the spatial reward computed from them remains inaccurate.

Therefore, the team proposed an adaptive temporal proximity mechanism: the tolerance range of the temporal reward is gradually tightened over the course of training.

As the training progresses, the standard deviation is gradually adjusted from large to small to achieve this convergence from "coarse positioning" to "fine positioning".
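As a hedged sketch of what such a reward can look like, assuming a Gaussian proximity term and a linear annealing schedule (the paper's exact formula may differ):

$$
r_{\text{time}} = \exp\!\left(-\frac{(t_{\text{pred}} - t_{\text{gt}})^2}{2\,\sigma_k^{2}}\right),
\qquad
\sigma_k = \sigma_{\max} - \left(\sigma_{\max} - \sigma_{\min}\right)\frac{k}{K},
$$

where $k$ is the current training step, $K$ the total number of steps, and $\sigma_{\max}$, $\sigma_{\min}$ the initial and final tolerance.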

At the same time, the team proposed a temporal gating mechanism. That is, before calculating the spatial reward, first check whether the predicted timestamp falls near the real timestamp. Only when the time prediction is close to the ground truth (less than the set threshold) will the IoU between the predicted box and the ground truth box on the corresponding frame be calculated; otherwise, the spatial reward is 0.
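A minimal sketch of the gating logic, assuming the temporal proximity reward above and a fixed gating threshold (the actual threshold value is not stated in the article):

```python
def iou(a, b):
    """IoU of two [x_min, y_min, x_max, y_max] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def gated_spatial_reward(pred_t, gt_t, pred_box, gt_box, gate_seconds=1.0):
    """Only give spatial credit when the predicted timestamp is near the ground truth."""
    if abs(pred_t - gt_t) > gate_seconds:
        return 0.0                   # wrong moment: the box cannot match the GT frame
    return iou(pred_box, gt_box)     # IoU on the corresponding frame
```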

Through such a training method and reward design, the model can be trained in a more stable and efficient manner.

Reasoning Enhancement

The spatio-temporal evidence proposed by the team can also serve as a verifiable signal for test-time scaling.

Specifically, in the reasoning stage, the model generates multiple independent reasoning chains, each containing spatio-temporal evidence.

The corresponding key-frame regions are cropped according to each reasoning chain and fed back into the model to be scored for relevance to the question (0, 1, or 2 points, meaning irrelevant, possibly helpful, and very helpful for answering the question, respectively).

Each answer is weighted and voted according to its score, and the answer with the highest confidence is finally output.

This mechanism effectively avoids being misled by low-quality thinking chains during voting and improves the accuracy and robustness of reasoning.
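A minimal sketch of this confidence-weighted voting, assuming each sampled chain carries its final answer and the 0/1/2 relevance score assigned to its cropped evidence (the data structure here is illustrative):

```python
from collections import defaultdict

def weighted_vote(chains):
    """Aggregate sampled reasoning chains by their evidence relevance scores."""
    votes = defaultdict(float)
    for chain in chains:
        votes[chain["answer"]] += chain["score"]   # 0, 1, or 2
    # The answer with the highest evidence-weighted score is returned.
    return max(votes, key=votes.get)

# Example:
# weighted_vote([{"answer": "A", "score": 2},
#                {"answer": "B", "score": 1},
#                {"answer": "A", "score": 0}])   # -> "A"
```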

Experimental Results

Open-o3 Video has achieved significant performance on multiple video reasoning and understanding benchmarks.

First, the team tested on the spatio-temporal reasoning benchmark V-STAR, which comprehensively examines the model's performance along the three dimensions of "what", "when", and "where".

It can be seen that Open-o3 Video has achieved significant improvements in both Temporal IoU (temporal alignment) and Visual IoU (spatial alignment). The overall mAM has increased by +14.4%, and mLGM has increased by +24.2%, surpassing large closed-source models such as GPT-4o and Gemini-2-Flash, fully demonstrating its significant advantages in spatio-temporal joint positioning and reasoning consistency!

Moreover, on the four benchmarks VideoMME, WorldSense, VideoMMMU, and TVGBench, Open-o3 Video consistently surpasses the baseline model and many video reasoning models.