
Accuracy cut in half: the moment large models' visual capabilities step outside daily life, they "fall apart".

QbitAI, 2025-12-09 14:57
No longer limited to household chores, EgoCross evaluates the visual understanding ability of large models across domains.

We are used to AI chatting eloquently on screen and generating beautiful pictures, as if it knows everything. But drop it into a real operating room and ask it to judge, from the chief surgeon's first-person perspective, which pair of forceps to use next, and this "top student" is likely to be completely at a loss.

To address this, the EgoCross project team focuses on cross-domain first-person video question-answering evaluation. The new work systematically reveals the generalization bottlenecks of existing MLLMs in scenarios such as surgery, industry, extreme sports, and animal perspectives.

Currently, most first-person video benchmarks focus on daily life activities, while ignoring the huge domain differences in real-world applications.

A research team from East China Normal University and INSAIT has proposed EgoCross, the first cross-domain first-person video question-answering benchmark. It covers four high-value professional fields, contains nearly a thousand high-quality QA pairs, and provides two evaluation formats, multiple-choice (CloseQA) and open-ended (OpenQA), filling the evaluation gap in this area.

Meanwhile, through comprehensive tests of eight mainstream MLLMs, the team has revealed the cross-domain shortcomings of existing models and verified the improvement potential of methods such as supervised fine-tuning (SFT) and reinforcement learning (RL).

This research has been accepted to AAAI 2026, and the full dataset and code have been open-sourced.

Breaking out of the daily "comfort zone"

The goal of Egocentric Video Question Answering (EgocentricQA) is for a model to give correct natural-language answers given a first-person video and a question as input.

A large body of work has made progress in this direction, but almost all of it evaluates models only in daily-life scenarios: cooking, cutting vegetables, tidying the room...

In reality, more challenging scenarios often come from:

Surgery: the model must not only identify "cutting tools" but also distinguish fine instruments such as grasping forceps, scalpels, and bipolar forceps; surgical procedures are long and high-stakes, so the cost of recognition or prediction errors is extremely high.

Industry: complex circuit-board repair procedures and fine-grained object recognition.

Extreme sports: the first-person camera shakes violently, the viewpoint switches frequently, and frames are severely blurred.

Animal perspective: the camera moves irregularly with the animal, and the viewpoint height and focus areas are completely different from a human's.

These scenarios are very different from "daily housework" in terms of visual style and semantic content, constituting a natural domain shift.

This leads to the core questions of this research: Can existing MLLMs that perform well in daily scenarios still be reliable in these unfamiliar fields? If not, where does the problem lie? And how can it be improved?

One benchmark, three major contributions

1. The first cross-domain EgocentricQA benchmark

Four professional fields with practical application value are carefully selected: surgery, industry, extreme sports, and animal perspective.

A dataset containing 957 question-answer pairs is constructed, covering 15 fine-grained task types.

Each question-answer pair is provided in two formats: open-ended (OpenQA) and multiple-choice (CloseQA).
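To make the two formats concrete, a single record might look roughly like the sketch below. The field names and the example question are illustrative assumptions, not the released EgoCross schema.

```python
# Hypothetical QA record; field names and values are illustrative, not the actual EgoCross format.
qa_pair = {
    "video_id": "surgery_0001",
    "domain": "surgery",            # surgery / industry / extreme sports / animal perspective
    "task": "identification",       # one of the 15 fine-grained task types
    "question": "Which instrument does the surgeon pick up next?",
    # CloseQA: the same question posed as a four-option multiple choice (random guessing = 25%)
    "options": ["grasping forceps", "scalpel", "bipolar forceps", "suction tube"],
    "answer": "bipolar forceps",    # also serves as the free-form target for OpenQA
}
```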

2. Comprehensive model evaluation and analysis

Eight state-of-the-art multimodal large language models are evaluated, including closed-source models such as GPT-4.1 and Gemini 2.5 Pro, as well as open-source models such as Qwen2.5-VL and VideoLLaMA3.

The experiments reveal that in cross-domain scenarios, even the best-performing model has a CloseQA accuracy of less than 55% (random guessing yields 25%) and an OpenQA accuracy of less than 35%.

In-depth analysis is conducted across multiple dimensions, such as task type, domain difference, and model architecture.

3. Forward-looking improvement research

Techniques such as prompt learning, supervised fine-tuning (SFT), and reinforcement learning (RL) are explored.

It is found that the RL method brings the most significant performance improvement (an average increase of about 22 percentage points).

It provides a direction for building more generalizable models in the future.

Detailed explanation of EgoCross: How to construct "professional exam questions" for the four major fields?

EgoCross selects videos from five high-quality open-source datasets, covering four professional fields. Four core tasks are designed for each field: Identification, Localization, Prediction, and Counting, with a total of 15 subtasks to comprehensively evaluate the model's capabilities.

Identification: such as identifying action sequences and the object held in the dominant hand. For example, "What kind of animal is in the video?" "Which instrument does not appear in the surgery?"

Localization: Including temporal and spatial localization. For example, "When did the operator first touch the oscilloscope?" "In which area of the picture is the screwdriver?"

Prediction: Such as predicting the next action, direction, or stage. For example, "What is the next step after the surgical preparation stage?" "What is the next movement direction in extreme sports?"

Counting: The ability to count dynamic objects. For example, "How many different components are visible in the video?"

Experiments reveal the models' "failure to acclimatize"

The experiments of the research team have revealed several key findings:

Significant domain gap: The model's accuracy on daily activities (EgoSchema) is 73.58%, but it drops sharply to 43.14% in EgoCross cross-domain scenarios.

Greater challenges in professional fields: The industrial and extreme sports fields are the most challenging for the model, while the animal perspective is relatively easier.

Influence of task type: Prediction-type tasks (such as predicting the next operation) decline more severely than basic identification tasks.

Differences in model performance: General-purpose large models (e.g., Gemini 2.5 Pro) outperform models specifically trained on first-person videos, indicating the limitations of current domain-adaptation methods.

Forward-looking improvement attempts

"*" represents the Baseline without vLLM acceleration. Since vLLM acceleration causes a slight performance decline, it is marked in gray.

The research team explored three improvement methods:

Prompt learning: without changing the model parameters, domain-specific prompts and examples are added only at inference time. For example, prepending "This is a surgery/industry/extreme-sports/animal-perspective video; please answer with the characteristics of this field in mind" to the question, trying to "prompt" the model's existing cross-domain capability out of it.
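As a rough illustration of this idea, the helper below prepends a per-domain hint to the question before it is sent to a frozen model. The hint wording and the function itself are hypothetical, not the team's released prompts.

```python
# Hypothetical domain-prompting helper; hint texts and function name are illustrative.
DOMAIN_HINTS = {
    "surgery": "This is a first-person surgical video; pay attention to fine instruments and procedure stages.",
    "industry": "This is a first-person industrial-repair video; pay attention to components and tools.",
    "extreme sports": "This is a first-person extreme-sports video with heavy camera shake.",
    "animal perspective": "This video is shot from an animal's point of view, so viewpoint height and focus differ from a human's.",
}

def build_prompt(domain: str, question: str) -> str:
    # Prepend the domain hint so the frozen model is told which field it is looking at.
    return f"{DOMAIN_HINTS[domain]}\n\nQuestion: {question}\nAnswer:"
```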

Supervised fine-tuning (SFT): using Qwen2.5-VL-7B as the base model, full-parameter fine-tuning is performed on a small amount of labeled video question-answer data from the target domain so that the model parameters adapt to the new domain's distribution; in the industrial domain, fine-tuning improves performance by nearly 20% over the baseline.
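A minimal conceptual sketch of such fine-tuning is shown below; the model loader and dataset are placeholders and the loop is heavily simplified, so this is not the authors' actual training script.

```python
# Conceptual SFT sketch: full-parameter fine-tuning on target-domain video QA pairs.
# load_video_llm() and target_domain_qa are placeholders for a real model loader and dataset.
import torch
from torch.utils.data import DataLoader

model = load_video_llm("Qwen2.5-VL-7B")                  # placeholder loader for the base MLLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in DataLoader(target_domain_qa, batch_size=2, shuffle=True):
    # Inputs: sampled video frames + tokenized question; labels: answer tokens,
    # with non-answer positions masked (label = -100) so the loss covers only the answer.
    loss = model(
        pixel_values=batch["frames"],
        input_ids=batch["input_ids"],
        labels=batch["labels"],
    ).loss                                               # standard next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```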

Reinforcement learning (RL): building an RL framework based on GRPO (Group Relative Policy Optimization). The specific approach: for each question, multiple candidate answers are sampled (about 8 per sample); a reward model then judges whether each answer is correct and scores it, and this score is used as the reward signal to optimize the policy of Qwen2.5-VL-7B. RL brings an average gain of about 22 percentage points in CloseQA accuracy across the four domains, the largest improvement among the three methods.
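The core idea of GRPO is that each sampled answer is scored relative to the other answers in its own group, so no separate value network is needed. The sketch below shows only that group-relative step; the generation and policy-update calls are placeholders, not the team's implementation.

```python
# Sketch of GRPO-style group-relative advantages; generation/update calls are placeholders.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Each candidate's advantage is its reward normalized against its own group:
    # (r_i - mean(group)) / (std(group) + eps)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_step(policy, video, question, gold_answer, group_size: int = 8):
    # 1) Sample a group of candidate answers from the current policy (placeholder call).
    candidates = [policy.generate(video, question) for _ in range(group_size)]
    # 2) Score each candidate; a simple exact-match check stands in for the reward model here.
    rewards = torch.tensor([1.0 if c.strip() == gold_answer.strip() else 0.0 for c in candidates])
    # 3) Weight the policy-gradient update by the group-relative advantages (placeholder call).
    policy.update(candidates, group_relative_advantages(rewards))
```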

These studies have preliminarily revealed the ability boundaries of current large models and provided valuable insights for building more generalizable multimodal systems in the future.

It seems that to train an AI assistant that can not only do housework but also handle tasks in professional scenarios, more efforts are needed. After all, the real world is much larger than just the kitchen.

Paper link: https://arxiv.org/abs/2508.10729

Project homepage: https://github.com/MyUniverse0726/EgoCross

Challenge homepage: https://egocross-benchmark.github.io/

This article is from the WeChat official account "QbitAI", author: EgoCross team, published by 36Kr with authorization.