
Good news for novice photo editors: an intelligent photo-editing agent, jointly developed by Tencent Hunyuan and Xiamen University, can accurately call more than 200 professional tools from a single sentence.

QbitAI, 2025-12-26 15:06
"Edit photos" through iterative editing, visual perception, self-assessment, and self-reflection

Turn a photo into a masterpiece with a single sentence: simpler than professional software, more controllable than typical AI photo editing.

Tencent Hunyuan, in collaboration with Xiamen University, presents JarvisEvo, a unified image-editing agent that mimics expert human designers, "retouching photos" through iterative editing, visual perception, self-evaluation, and self-reflection.

"Think like an expert and refine like a craftsman." JarvisEvo not only retouches photos with Lightroom but also "sees" the changes after each edit and judges their quality by itself, achieving self-evolution without external rewards.

Let's take a detailed look below.

Self-evaluation and correction

Research background and motivation

In recent years, instruction-based image editing models have made significant progress, but they still face two core challenges in pursuing a "professional-level" photo-retouching experience:

  1. Instruction Hallucination:

The existing text-only Chain of Thought (Text-only CoT) suffers from an information bottleneck: the model cannot "see" the intermediate retouching results during reasoning and relies on text alone to "imagine" the visual outcome of the next operation. This easily leads to factual errors and cannot guarantee that each step matches the user's intent.

  2. Reward Hacking:

During preference alignment via reinforcement learning, the policy model is updated dynamically while the reward model is usually static. The policy can therefore "exploit loopholes," tricking the reward function into giving high scores instead of genuinely improving retouching quality and self-evaluation ability.

To solve the above problems, the team launched JarvisEvo.

iMCoT: Interleaved Multimodal Chain-of-Thought

JarvisEvo introduces the iMCoT (Interleaved Multimodal Chain-of-Thought) mechanism, which breaks the limitation of traditional "blind retouching": unlike pure text reasoning, JarvisEvo generates a new image after every editing step and conducts the next step of reasoning based on that visual feedback.

The model works in a cycle of "generate a text hypothesis -> execute a tool -> observe the visual result -> reflect on the decision," ensuring each operation is precisely implemented.

SEPO: Synergistic Editor-Evaluator Policy Optimization

This is the engine behind JarvisEvo's "self-evolution". The team proposed the SEPO (Synergistic Editor-Evaluator Policy Optimization) framework, which consists of two co-evolving optimization loops:

Editor optimization loop (Loop 1): the model uses its self-evaluation score as an internal reward, no longer relying on an easily hacked external reward model.

Evaluator optimization loop (Loop 2): the model's evaluation ability is continuously calibrated with human-annotated data, preventing the model from "deceiving itself" when scoring its own work.

Online reflection and self-correction

JarvisEvo can learn from mistakes. During training, the system automatically compares low-scoring trajectories with high-scoring ones to generate Reflection Data. By analyzing "why the retouching went wrong" and "how to correct it," the model acquires a powerful self-correction ability.

"Retouch while observing" like a human

JarvisEvo system architecture

The traditional text-only Chain of Thought (Text-only CoT) usually amounts to "blind retouching," that is, generating all editing steps in one pass.

JarvisEvo instead adopts the Interleaved Multimodal Chain of Thought (iMCoT), simulating the closed-loop "observe -> operate -> check" workflow of human designers.

The entire reasoning process is divided into four core steps:

1. Visual perception and planning (Perception & Planning): the model first analyzes the original image (I) and the user's instruction (Q) to generate an initial retouching plan.

2. Step-by-step tool execution (Step-by-Step Execution):

The model generates interleaved text reasoning content (C) and tool-call instructions (T).

Tool sandbox (Sandbox): the instructions are sent to the external Adobe Lightroom environment for execution, producing an intermediate edited image (O).

Visual feedback (Visual Feedback): this is crucial. The model "sees" the freshly retouched photo and, based on the latest visual state, decides whether to continue adjusting or to correct errors.

3. Self-evaluation (Self-Evaluation): after retouching, the model scores itself (S) on the aesthetic quality and instruction compliance of the final result (Ot).

4. Self-reflection (Self-Reflection): if the result is unsatisfactory, the model triggers the reflection mechanism, analyzes the cause of the deviation, and attempts to correct it.
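The four steps above can be sketched as a simple loop. Everything here is a toy stand-in, not the actual JarvisEvo model or the Lightroom interface: images are modeled as dicts of slider values, the "policy" greedily fixes the slider furthest from target, and the self-score is a made-up closeness measure.

```python
def propose_edit(image, instruction):
    """Toy policy: pick the slider furthest from the instruction's target."""
    name, target = max(instruction.items(),
                       key=lambda kv: abs(image.get(kv[0], 0.0) - kv[1]))
    return {"tool": name, "value": target}

def apply_tool(image, plan):
    """Sandbox stand-in: apply one slider adjustment, return the new image state."""
    new_image = dict(image)
    new_image[plan["tool"]] = plan["value"]
    return new_image

def self_evaluate(image, instruction):
    """Toy self-score in [0, 1]: average closeness of every slider to its target."""
    errs = [abs(image.get(k, 0.0) - v) for k, v in instruction.items()]
    return 1.0 - min(1.0, sum(errs) / len(errs))

def run_imcot(image, instruction, max_steps=4, target_score=0.99):
    """Perceive -> act -> observe -> self-evaluate, stopping once satisfied."""
    trajectory = []
    current = dict(image)
    for _ in range(max_steps):
        plan = propose_edit(current, instruction)    # text hypothesis
        current = apply_tool(current, plan)          # tool execution in sandbox
        score = self_evaluate(current, instruction)  # look at the new image
        trajectory.append((plan, score))
        if score >= target_score:                    # reflection/stop check
            break
    return current, trajectory
```

The key structural point mirrors iMCoT: each iteration's reasoning is conditioned on the *latest* image state, not on an imagined one.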

Three-stage training framework

To create such an all-around agent, the team designed a rigorous three-stage training pipeline:

Stage 1: Cold-Start Supervised Fine-Tuning (Cold-Start SFT)

Data volume: 150K annotated samples (110K editing data + 40K evaluation data).

Goal: Teach the model the "basic skills". This includes mastering the syntax of multimodal reasoning, being able to alternately generate text and image content, learning to select the correct tools based on visual cues, and initially establishing aesthetic evaluation ability.

Stage 2: SEPO Reinforcement Learning (The Evolution)

Data volume: 20K standard instruction data (10K editing + 10K evaluation).

Core mechanism: introduce Synergistic Editor-Evaluator Policy Optimization (SEPO). At this stage, the model moves beyond imitating reference answers and starts to explore on its own.

Driven by dual optimization, this stage lets the model evolve from "being able to use tools" to "mastering photo retouching":

Editor optimization: optimize the retouching policy through self-scoring (Self-Reward), using Selective Loss Masking (SLM) to prevent reward cheating.

Evaluator optimization: calibrate the model's aesthetic judgment with human-scored data to ensure it can act as a fair judge.

Stage 3: Reflection Fine-Tuning

Data volume: a small set of 5K online-generated reflection samples.

Goal: this is the key to JarvisEvo's "self-correction" ability. By learning how to reflect on and correct wrong paths, the model's robustness on complex instructions improves greatly.

SEPO: Synergistic Editor-Evaluator Policy Optimization

In traditional Reinforcement Learning from Human Feedback (RLHF), the model usually relies on a static "reward model" to score.

However, this has a fatal flaw: as the policy model grows stronger, it learns to "exploit loopholes" (Reward Hacking), generating specific, odd patterns that trick the reward model into high scores instead of truly improving its editing ability.

To solve this problem, JarvisEvo proposes the SEPO framework. Its core idea: let the model be both the "athlete" and the "referee," and through two parallel optimization loops, let the two abilities improve in sync while constraining each other.

The Editor Policy Optimization loop (Loop 1) teaches the model to use tools better and produce better retouches.

Self-Reward mechanism: JarvisEvo no longer relies on an external black-box model for scoring; it uses its own self-evaluation ability. After generating a retouching trajectory, the model scores itself on the aesthetic quality and instruction compliance of the final image.

GRPO optimization objective: adopt Group Relative Policy Optimization (GRPO). For the same input, the model generates multiple retouching trajectories and updates by comparing their pairwise "win rates" (Pairwise Preference Reward) rather than relying on absolute scores, which makes training more stable.
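A minimal numeric sketch of the two ingredients named above. The exact functional forms are assumptions based on the description: each trajectory's pairwise-preference reward is taken as the fraction of group members it outscores, and GRPO's advantage is the group-normalized reward.

```python
def pairwise_win_rates(scores):
    """Reward_i = fraction of the other trajectories that trajectory i beats.

    This is an assumed reading of "Pairwise Preference Reward": relative
    standing within the group, not the absolute self-score.
    """
    n = len(scores)
    return [sum(s_i > s_j for j, s_j in enumerate(scores) if j != i) / (n - 1)
            for i, s_i in enumerate(scores)]

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize rewards within the sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```

With self-scores `[4.5, 3.0, 4.0, 2.0]` for four sampled trajectories, the win rates are `[1.0, 1/3, 2/3, 0.0]`; the advantages then sum to zero, so the best trajectory is pushed up and the worst pushed down regardless of the absolute scale of the scores.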

Selective Loss Masking (SLM) is the key anti-cheating technology. Without SLM, the model may discover that "as long as the self-score text I generate at the end is a perfect score, the loss shrinks."

To prevent this "information leakage," the tokens of the self-scoring segment are forcibly masked when computing the editor's gradient. The model can then earn high scores only by genuinely improving the preceding reasoning quality (Chain-of-Thought) and tool-use accuracy (Tool Use), not by directly generating high-score text.
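The masking idea can be shown in a few lines. This is a hypothetical simplification: real SLM would zero gradient contributions inside a training framework, whereas here per-token losses are just a list and the self-score segment is a known span.

```python
def masked_token_loss(token_losses, score_span):
    """Average the per-token losses, excluding the self-scoring segment.

    token_losses: per-token loss values for one trajectory (hypothetical).
    score_span:   (start, end) token indices of the self-score text; these
                  tokens are masked so writing "5/5" cannot reduce the loss.
    """
    start, end = score_span
    kept = [l for i, l in enumerate(token_losses)
            if not (start <= i < end)]
    return sum(kept) / max(1, len(kept))
```

In a real setup this would typically be a 0/1 mask multiplied into the loss tensor before reduction; the list version just makes the leak-prevention logic explicit.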

The Evaluator Policy Optimization loop (Loop 2) ensures that the "referee" stays fair, objective, and aligned with human aesthetics.

Verifiable Reinforcement Learning (Verifiable RL): Loop 1 relies on self-scoring, but what if the referee's aesthetic sense drifts? Loop 2 is designed for exactly this problem: a dataset with human-expert annotations (Human-Annotated) is used to train the model's evaluation ability.

Score Alignment Reward: in this loop, the reward depends on how close the model's score is to the human expert's score.

Function: this loop continuously calibrates the model's aesthetic standard, preventing it from drifting into self-indulgence in Loop 1 and preserving the value of the self-reward signal.
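One plausible shape for the alignment reward, under the stated assumption that it only needs to grow as the model's score approaches the human's. The linear form and the 1-5 scale mapping are illustrative; the article does not give the exact formula.

```python
def score_alignment_reward(model_score, human_score, max_gap=4.0):
    """Map |model - human| on a 1-5 rating scale to a reward in [0, 1].

    Agreement (gap 0) earns 1.0; the largest possible disagreement on the
    scale (gap 4, e.g. model says 1 while the expert says 5) earns 0.0.
    """
    gap = abs(model_score - human_score)
    return max(0.0, 1.0 - gap / max_gap)
```

Any monotone-decreasing function of the gap would serve the same calibration purpose; the linear version keeps the reward bounded and easy to interpret.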

The two loops alternate, producing a self-sparring evolutionary effect that breaks the shackles of a static reward model and achieves closed-loop, sustainable self-improvement.

On-Policy Reflection Data Generation Mechanism

How does JarvisEvo learn to "learn from mistakes"? The team embedded an automated data-generation process in the Stage 2 training:

Capture the opportunity: when the model generates a better retouching trajectory Trajectory0 (score s0) whose score is significantly higher than a previous attempt Trajectory3 (score s3), reflection generation is triggered.

Attribution analysis: call a large commercial model (such as Gemini-2.5-Pro) as a "tutor," feeding it the original photo, the wrongly retouched result O3, the correctly retouched result O0, and the user's instruction.

Generate a reflection chain: the "tutor" produces a detailed analysis text (R) explaining why O3 failed (for example, "the white-balance parameter was set too high, causing a color cast") and pointing out the correct approach.

Build a sample: the complete trajectory of "wrong attempt -> in-depth reflection -> correct fix" is stored in the dataset Dataset_reft for third-stage fine-tuning.
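Assembling one such sample might look like the sketch below. The field names and the score-gap threshold are hypothetical; the article only specifies that a clearly better trajectory, the failed one, and the tutor's reflection text are stored together.

```python
def build_reflection_sample(instruction, bad_traj, good_traj, reflection,
                            min_gap=1.0):
    """Pack a "wrong attempt -> reflection -> correction" training sample.

    bad_traj / good_traj: dicts with a self-evaluation "score" and the
    edit "steps" taken (illustrative structure). Returns None when the
    score gap is too small to be a clear teaching signal.
    """
    if good_traj["score"] - bad_traj["score"] < min_gap:
        return None
    return {
        "instruction": instruction,
        "wrong_attempt": bad_traj["steps"],
        "reflection": reflection,        # tutor model's analysis text (R)
        "correction": good_traj["steps"],
    }
```

Filtering on a minimum score gap keeps only pairs where the contrast between failure and success is unambiguous, which is what makes the reflection text a usable supervision signal.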

ArtEdit Dataset

To support the above training, the team built ArtEdit, a bilingual (Chinese/English) professional photo-retouching dataset of 170K samples. It covers 10 major categories and 37 sub-categories of professional photography scenes, including portraits, landscapes, architecture, still lifes, and night views. Through the A2L (Agent-to-Lightroom) protocol, it seamlessly integrates more than 200 retouching tools in Adobe Lightroom.

ArtEdit-Lr (120K): focuses on retouching tasks, with complete iMCoT trajectories (reasoning, tool parameters, intermediate images).

ArtEdit-Eval (50K): focuses on aesthetic evaluation, with human-expert scores (1-5 points) for image quality and instruction compliance.
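The article does not publish the A2L wire format, so the following is purely an illustrative guess at what a serialized agent-to-Lightroom tool call could look like: a tool name plus parameters, tagged with the protocol and step index. None of these field names come from the paper.

```python
import json

def make_a2l_call(tool, params, step):
    """Serialize one hypothetical A2L tool call as JSON.

    tool:   name of a Lightroom adjustment (illustrative).
    params: parameter dict for that tool.
    step:   position of this call in the iMCoT trajectory.
    """
    return json.dumps({
        "protocol": "A2L",   # assumed tag, not the real protocol header
        "step": step,
        "tool": tool,
        "params": params,
    })

call = make_a2l_call("Exposure", {"value": 0.35}, step=1)
```

A JSON envelope like this would let the sandbox dispatch any of the 200+ tools uniformly while keeping the trajectory replayable step by step.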

Experimental results

On the ArtEdit-Bench evaluation, JarvisEvo outperformed the commercial model Nano-Banana by 44.96% on the L1 and L2 metrics, maximally preserving the details of the original photo.

It led across the board on the SC (Semantic Consistency) and PQ (Perceptual Quality) metrics, with an average improvement of 18.95%.

Moreover, the correlation of its scores with human subjective preferences (SRCC 0.7243) exceeded that of GPT-4o, Gemini-2.5-Flash, and specialized IQA models.
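SRCC here is the Spearman rank correlation coefficient: it measures how well the model's scores preserve the *ordering* of human ratings, ignoring absolute scale. A minimal tie-free implementation on made-up toy scores (not the paper's evaluation data):

```python
def spearman_rcc(xs, ys):
    """Spearman rank correlation for two tie-free score lists."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Classic closed form for rank data without ties.
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [4.5, 3.0, 2.0, 5.0, 3.5]   # toy expert ratings
model = [4.2, 3.1, 2.5, 4.8, 3.0]   # toy model self-scores
srcc = spearman_rcc(human, model)   # 0.9: one adjacent pair swapped
```

Because only ranks matter, a model that is systematically harsh or lenient but orders images the same way as humans still scores a perfect 1.0; that is why SRCC is the standard yardstick for aesthetic evaluators.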

Visually, images processed by JarvisEvo match user instructions more closely than those of other models, excelling in style rendering and detail presentation.