Even a throwaway shot can be turned into a stunning masterpiece: Peking University open-sources the first "Aesthetic Photo Reconstruction" model
[Introduction] The team led by Peng Yuxin at Peking University has proposed a new task, "Aesthetic Photo Reconstruction". They automatically constructed a dataset named AesRecon from photography teaching videos and developed a two-stage model called AesFormer, which improves the aesthetic quality and artistic expression of photos by optimizing composition, perspective, and human poses.
When you take a photo, AI might help you brighten it, beautify your face, or apply a filter, but it's difficult to truly transform it into what you have in mind.
A photo often looks unappealing because it was not taken correctly from the start: the composition is off, the perspective is skewed, and the poses are stiff. Existing image beautification tools can adjust brightness, add beauty effects, and apply filters, but they cannot reorganize the composition, correct the shooting angle, or adjust human poses. Therefore, it's hard to fix the structural flaws left in the shooting stage.
To address this challenge, the team led by Professor Peng Yuxin at Peking University defined the task of Aesthetic Photo Reconstruction in its latest work on aesthetic understanding. By automatically mining an aesthetic corpus from online photography teaching videos, they constructed AesRecon, the first dataset and evaluation benchmark for aesthetic photo reconstruction. The dataset contains 9071 pairs of portrait photos, each recording the real optimization process from an ordinary original photo to an outstanding final photo of the same person in the same scene.
On this basis, the team proposed the aesthetic photo reconstruction model AesFormer. With its two-stage method of "Aesthetic Planning + Aesthetic Editing", AI no longer stops at surface-level modifications such as color adjustment and beauty enhancement. Instead, it can adjust composition, perspective, and human poses, improving the aesthetics of a photo at the level of picture structure. The paper has been accepted by ICML 2026, and the code is open source.
Paper link: https://arxiv.org/abs/2605.22126
Open-source code: https://github.com/PKU-ICST-MIPL/AesFormer_ICML2026
Lab website: https://www.wict.pku.edu.cn/mipl
From "Surface-level Modification" to "Picture Reconstruction"
Taking photos is an important way to record daily scenes, emotions, and memories. However, truly touching moments often pass by in a flash. To capture such moments, the photographer needs to quickly judge key factors such as composition, perspective, and human poses at the moment of pressing the shutter.
For professional photographers, this judgment comes from systematic training and long-term practice. Ordinary users lack that experience, so their photos often suffer from off-balance composition, awkward perspective, and stiff poses, leaving a significant gap between the actual photo and the ideal picture in their minds.
To bridge this gap, users usually use image beautification tools to enhance the aesthetic of photos. Existing tools can be roughly divided into two categories:
(1) Photo color-adjustment tools, such as automatic photo-retouching methods, Photoshop, and Lightroom. They mainly optimize the color style by adjusting basic visual parameters such as exposure, brightness, and contrast.
(2) Portrait beauty tools, such as skin smoothing, whitening, and face slimming. These functions have been widely popularized in applications like Meitu Xiuxiu and Xingtu.
However, these methods mainly improve color, light, and human appearance, and they struggle to fix the "structural flaws" left in the shooting stage. When the problem is off-balance composition, poor perspective, or a stiff pose, adjusting the color or applying beauty effects is often ineffective.
In other words, existing image beautification tools still lack a key ability: keeping the identity of the person and the content of the scene essentially unchanged while reasonably adjusting the composition, perspective, and human poses of the photo, so that its aesthetics improve at the level of picture structure. The researchers define this task as Aesthetic Photo Reconstruction, as shown in Figure 1.
Figure 1. Schematic diagram of the Aesthetic Photo Reconstruction task
However, aesthetic photo reconstruction is not easy to achieve. It faces two main difficulties:
(1) Scarcity of high-quality aesthetic corpus: Existing data lacks paired portrait photo samples of "the same person, the same scene, from poor to excellent", so a model has little to learn the real photo reconstruction process from.
(2) Insufficient aesthetic ability of the model: Existing image editing models lack systematic knowledge of photographic aesthetics and the ability to make aesthetic judgments, so they struggle to identify the problems in a photo accurately and to carry out a reasonable picture reconstruction.
To address these problems, the team led by Professor Peng Yuxin at Peking University proposed a new solution. The team first proposed VCMP, an aesthetic corpus mining method based on photography teaching videos. By automatically mining an aesthetic corpus from online photography teaching videos, they constructed AesRecon, a new dataset and evaluation benchmark for aesthetic photo reconstruction that contains 9071 portrait photo pairs of the form "ordinary original photo, outstanding final photo".
On this basis, the team further proposed the aesthetic photo reconstruction model AesFormer, adopting a two-stage approach of "Aesthetic Planning + Aesthetic Editing":
(1) Aesthetic Planning: Through cold-start supervised fine-tuning and aesthetic-guided group relative policy optimization, the aesthetic planning model is trained to analyze photo problems and generate executable aesthetic optimization plans.
(2) Aesthetic Editing: Through flow-matching training conditioned on the aesthetic optimization plan, the image editing model is trained to turn the optimization plan into pixel-level edits, strengthening its photo reconstruction ability.
The experimental results show that AesFormer achieved better results than existing methods on the aesthetic photo reconstruction evaluation benchmark.
The researchers thus move image beautification from surface-level modification, which mainly adjusts color, light, and human appearance, to picture reconstruction, which can optimize composition, perspective, and human poses. This offers a new research perspective and technical path for AI to understand and generate high-quality photographic works.
Technical Solution
In existing image resources, paired portrait photo samples that show "the same person, the same scene, from poor to excellent" are very scarce, so a model has little to learn the real photo reconstruction process from. Online photography teaching videos provide a feasible data source for this problem.
Such videos usually record the complete shooting optimization process of the same person in the same scene: the photographer and the model continuously adjust the camera position, composition, and human poses, gradually optimizing the picture from an ordinary original photo to a more aesthetically expressive final photo.
Figure 2. Framework diagram of the aesthetic corpus mining method (VCMP) based on photography teaching videos
Based on this observation, the researchers proposed VCMP, an aesthetic corpus mining method based on photography teaching videos. By automatically mining an aesthetic corpus from online photography teaching videos, they constructed AesRecon, a new dataset and evaluation benchmark for aesthetic photo reconstruction, as shown in Figure 2. Specifically, they first searched video platforms for content such as photography tutorials, pose guidance, and composition techniques to form a candidate set of photography teaching videos.
On this basis, VCMP completes corpus mining through four stages:
(1) Outstanding final photo localization: Locate the high-quality outstanding final photo that the video presents as its final result.
(2) Ordinary original photo matching: Match the located outstanding final photo with an ordinary original photo that has consistent semantics but a poorer effect.
(3) Photo interference removal: Remove occluding elements such as subtitles, icons, auxiliary composition lines, and operation pop-ups from the video frames.
(4) Shooting event alignment: Check whether each photo pair comes from the same shooting event and filter out samples that do not meet the conditions.
Finally, AesRecon contains 9071 pairs of strictly aligned "ordinary original photos - outstanding final photos" portrait photo samples.
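The four stages above can be read as a filter-style pipeline over candidate videos. The following is a minimal sketch of that control flow only, not the authors' implementation: the `PhotoPair` type, the `mine_pairs` function, and all four stage callables are hypothetical stand-ins, and frames are represented by plain string identifiers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class PhotoPair:
    video_id: str
    original: str  # identifier of the ordinary original photo
    final: str     # identifier of the outstanding final photo


def mine_pairs(
    video_ids: Iterable[str],
    locate_finals: Callable[[str], List[str]],
    match_original: Callable[[str, str], Optional[str]],
    remove_overlays: Callable[[str], str],
    same_event: Callable[[str, str], bool],
) -> List[PhotoPair]:
    """Run the four stages in order; every callable is a hypothetical stand-in."""
    pairs: List[PhotoPair] = []
    for video_id in video_ids:
        for final in locate_finals(video_id):            # (1) locate final photos
            original = match_original(video_id, final)   # (2) match an original photo
            if original is None:
                continue                                 # no usable original found
            original = remove_overlays(original)         # (3) strip subtitles, icons, ...
            final_clean = remove_overlays(final)
            if same_event(original, final_clean):        # (4) keep same-event pairs only
                pairs.append(PhotoPair(video_id, original, final_clean))
    return pairs


# Toy run: "+..." marks an overlay; video v2 has a final photo but no matching original.
finals = {"v1": ["v1/final+subtitle"], "v2": ["v2/final"]}
originals = {"v1/final+subtitle": "v1/original+icon"}

demo_pairs = mine_pairs(
    video_ids=["v1", "v2"],
    locate_finals=lambda vid: finals[vid],
    match_original=lambda vid, final: originals.get(final),
    remove_overlays=lambda frame: frame.split("+")[0],
    same_event=lambda a, b: a.split("/")[0] == b.split("/")[0],
)
print(demo_pairs)
```

In this toy run only the pair from `v1` survives, since `v2` fails at the matching stage. In a real system each callable would presumably wrap a vision model or detector.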
Figure 3. Framework diagram of the aesthetic photo reconstruction model (AesFormer)
To solve the problem of insufficient aesthetic ability of existing image editing models, researchers proposed the aesthetic photo reconstruction model AesFormer.
As shown in Figure 3, AesFormer adopts a two-stage method of "Aesthetic Planning + Aesthetic Editing":
(1) Aesthetic Planning: Through cold-start supervised fine-tuning and aesthetic-guided group relative policy optimization, the aesthetic planning model is trained to analyze photo problems and generate executable aesthetic optimization plans.
(2) Aesthetic Editing: Through flow-matching training conditioned on the aesthetic optimization plan, the image editing model is trained to stably turn the optimization plan into pixel-level edits, thus completing photo reconstruction.
Stage I: Aesthetic Planning
For each pair of photos in AesRecon, the researchers first extracted the adjustments to composition, perspective, poses, and similar factors that the photographer and the human model (the photographed subject) made during the shoot. These adjustments form an aesthetic optimization plan that leads from the ordinary original photo to the outstanding final photo.
On this basis, they performed cold-start supervised fine-tuning on a multi-modal large model. The aesthetic optimization plan is modeled as an ordered decision sequence that follows photographic logic, and the model is guided to analyze photo problems along seven progressive photography dimensions, giving it basic abilities in aesthetic understanding, problem diagnosis, and plan making.
The training samples are uniformly represented as triples (p, q, a), where p is the ordinary original photo, q is the task instruction, and a is the aesthetic optimization plan. The model learns to generate a given p and q, that is, it maximizes the conditional log-likelihood of the target optimization plan; the loss function is the corresponding negative log-likelihood.
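Written in the standard autoregressive form with the symbols above, such an objective would plausibly read as follows. This is a sketch of the usual supervised fine-tuning loss, not a formula quoted from the paper: the parameters θ, the token index t, the model distribution π_θ, and the training set symbol D are notation assumed here.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(p,\,q,\,a)\sim\mathcal{D}}
    \left[ \sum_{t=1}^{|a|} \log \pi_\theta\!\left(a_t \mid p,\, q,\, a_{<t}\right) \right]
```

Here a_t is the t-th token of the plan a and a_{<t} denotes the tokens before it, so minimizing the loss is the same as maximizing the conditional log-likelihood of a given p and q.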
Although supervised fine-tuning lays a foundation for the model, relying solely on SFT can easily make the model "memorize mechanically". This is especially problematic for aesthetic photo reconstruction, because there is no single "correct way to modify" a given photo: different adjustments to composition, perspective, poses, and depth of field may all result in natural and beautiful photos.
Therefore, the researchers further proposed aesthetic-guided group relative policy optimization. Format rewards, semantic alignment rewards, and aesthetic creativity rewards encourage the model to explore more diverse and reasonable optimization paths, further improving its aesthetic planning ability and the quality of the generated plans. In the experiments, Qwen3-VL-8B was used as the base model to train the aesthetic planning model AesThinker.
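The core of group relative policy optimization is that each sampled plan is scored against the other plans sampled for the same photo, rather than against a learned value function. The sketch below illustrates that step under stated assumptions: the article names the three reward types but not how they are combined, so the weighted sum and the values in `REWARD_WEIGHTS` are hypothetical, as are the example scores.

```python
import statistics
from typing import Dict, List

# Hypothetical weights: the article does not say how the three rewards are combined.
REWARD_WEIGHTS = {"format": 0.2, "semantic_alignment": 0.4, "aesthetic_creativity": 0.4}


def combined_reward(scores: Dict[str, float]) -> float:
    """Weighted sum of the three reward terms for one sampled plan."""
    return sum(weight * scores[name] for name, weight in REWARD_WEIGHTS.items())


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four candidate plans sampled for the same photo (made-up scores in [0, 1]).
group_scores = [
    {"format": 1.0, "semantic_alignment": 0.9, "aesthetic_creativity": 0.3},
    {"format": 1.0, "semantic_alignment": 0.7, "aesthetic_creativity": 0.8},
    {"format": 0.0, "semantic_alignment": 0.8, "aesthetic_creativity": 0.9},
    {"format": 1.0, "semantic_alignment": 0.4, "aesthetic_creativity": 0.2},
]
rewards = [combined_reward(scores) for scores in group_scores]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Plans with above-average reward in their group get a positive advantage and are reinforced; below-average plans get a negative one. Because the comparison is within the group, several different but good plans for the same photo can all be rewarded, which matches the stated goal of exploring diverse optimization paths.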
Stage II: Aesthetic Editing
The researchers used flow-matching training conditioned on the aesthetic optimization plan to enable the image editing model to turn the abstract aesthetic optimization plan into precise pixel-level modifications, thus improving the photo reconstruction effect.
The training samples are uniformly represented as triples (p, a, g), where p is the ordinary original photo, a is the aesthetic optimization plan, and g is the outstanding final photo. The model takes p and a as inputs and uses g as the supervision target. In the experiments, the aesthetic editing model AesEditor was trained on the basis of Qwen-Image-Edit-2511.
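To make the training signal concrete, here is a minimal sketch of a conditional flow-matching loss for one (p, a, g) sample. It assumes the common straight-line (rectified-flow style) path between noise and the target, which the article does not specify; the function name, the toy three-number "latent" standing in for g, and the two example models are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence

VelocityModel = Callable[[List[float], float, Dict[str, str]], List[float]]


def flow_matching_loss(
    velocity_model: VelocityModel,
    target: Sequence[float],
    condition: Dict[str, str],
    rng: random.Random,
) -> float:
    """Mean squared error between predicted and true velocity at one random time."""
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    t = rng.random()  # time in [0, 1)
    # Assumed straight-line path from noise (t = 0) to the target (t = 1).
    x_t = [(1.0 - t) * n + t * g for n, g in zip(noise, target)]
    v_true = [g - n for n, g in zip(noise, target)]
    v_pred = velocity_model(x_t, t, condition)
    return sum((p - v) ** 2 for p, v in zip(v_pred, v_true)) / len(target)


target_latent = [0.5, -1.0, 2.0]                       # stands in for an encoding of g
condition = {"photo": "p", "plan": "move camera lower"}  # stands in for (p, a)


def oracle_model(x_t: List[float], t: float, cond: Dict[str, str]) -> List[float]:
    # Knows the target, so it can recover the true velocity exactly.
    return [(g - x) / (1.0 - t) for g, x in zip(target_latent, x_t)]


def zero_model(x_t: List[float], t: float, cond: Dict[str, str]) -> List[float]:
    return [0.0 for _ in x_t]


print(flow_matching_loss(oracle_model, target_latent, condition, random.Random(0)))
print(flow_matching_loss(zero_model, target_latent, condition, random.Random(0)))
```

The oracle reaches a loss of essentially zero while the zero model does not. A real editing model would replace both and would have to infer the velocity from the conditioning photo and plan alone.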
During inference, AesFormer runs the two stages of "Aesthetic Planning + Aesthetic Editing" in series: first, AesThinker generates an aesthetic optimization plan, and then AesEditor combines the input photo and the optimization plan to generate the final reconstructed photo.
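The serial inference flow can be sketched as a simple composition of two callables. This is only an illustration of the data flow described above: `reconstruct`, `toy_planner`, and `toy_editor` are hypothetical names, and the stubs return strings rather than calling AesThinker or AesEditor.

```python
from typing import Callable, Tuple

Planner = Callable[[str, str], str]  # (photo, instruction) -> optimization plan
Editor = Callable[[str, str], str]   # (photo, plan) -> reconstructed photo


def reconstruct(photo: str, instruction: str, planner: Planner, editor: Editor) -> Tuple[str, str]:
    """Stage I produces a plan; Stage II edits the same input photo under that plan."""
    plan = planner(photo, instruction)
    reconstructed = editor(photo, plan)
    return plan, reconstructed


# Stub stages standing in for the trained planning and editing models.
def toy_planner(photo: str, instruction: str) -> str:
    return f"plan for {photo}: lower the camera, relax the shoulders"


def toy_editor(photo: str, plan: str) -> str:
    return f"edited({photo} | {plan})"


plan, result = reconstruct("input.jpg", "Improve this portrait.", toy_planner, toy_editor)
print(plan)
print(result)
```

Note that the editor receives the original photo as well as the plan, so the plan guides the edit rather than replacing the input.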
Experimental Results
Table 1 shows the results of aesthetic photo reconstruction on the AesRecon evaluation benchmark. AesFormer outperformed the open-source models on all metrics and was competitive with Google's closed-source commercial model Nano Banana Pro, achieving better results on most metrics, which supports the reliability of its aesthetic photo reconstruction ability.
Furthermore, researchers explored a key question: Is the poor performance of existing image editing models due to the lack of clear editing instructions? In other words, can aesthetic photo reconstruction be achieved by simply combining existing Thinkers and Editors?
To this end, the researchers used Qwen3-VL-8B and GPT-4o to generate optimization plans and then had different editing models carry out photo reconstruction according to those plans. The results showed that such combinations did not bring stable improvement and in some cases even degraded performance. This indicates that aesthetic photo reconstruction cannot be achieved by simply "generating instructions + executing editing", mainly because:
(1) General Thinkers lack aesthetic understanding ability and have difficulty accurately planning structural adjustments such as composition, perspective, and poses.
(2) General Editors lack aesthetic execution ability and have difficulty stably completing complex editing under the semantics of photography.
Table 1. Aesthetic photo reconstruction results of AesFormer on the AesRecon evaluation benchmark
The case study in Figure 4 shows that open-source image editing models usually have difficulty completing the structural edits required for aesthetic photo reconstruction, so they cannot effectively fix the composition, perspective, and pose problems left in the shooting stage. In contrast, AesFormer decouples aesthetic planning from image editing, which lets it improve the aesthetics of photos more stably through picture reconstruction.