
Generate a complete short drama with a single sentence. A team from Nanyang Technological University proposed a hierarchical Agent framework, leading the production of AI short dramas towards standardization.

Account deactivated · 2026-05-27 15:13
Narrative, visuals, and post-production are all taken care of by the Agent.

In recent years, the rapid development of video foundation models has significantly improved automated short-video generation. Models such as Sora, Kling, Seedance, and Veo have demonstrated strong capabilities in one-shot video generation.

However, short drama production today usually relies on one-shot generation by large language models (LLMs) and loosely coupled workflows, which leaves three main deficiencies:

  • Weak narrative rhythm: The opening is not captivating enough, and the plot lacks conflict and tension.
  • Insufficient spatial consistency: It is difficult to maintain coherence in the scene layout and character positions after camera transitions.
  • Immature quality control: The generation process still requires a large amount of manual review and correction.

To address these issues, a research team from Nanyang Technological University and its collaborators have released a hierarchical Agent framework called "One Sentence, One Drama". Users only need to provide a creative idea, and the framework generates a personalized, fully produced short drama with polished visuals.

Paper link: https://arxiv.org/abs/2605.22144

To evaluate short drama generation, the research team added short-drama-specific criteria to the standard video quality indicators. The experimental results show that One Sentence, One Drama significantly outperforms existing pipelines in narrative quality, cross-shot consistency, and overall viewing experience.

This also indicates that with the continuous improvement of the Agent-driven structured process, short dramas and even longer video content are moving towards a production stage with controllable quality.

Figure | From one sentence to a complete short drama.

How to generate a short drama from one sentence?

According to the paper, the automated video production process is divided into four steps: story generation, visual material and prompt generation, consistent first-frame generation through 3D scene anchoring, and post-production. A review process runs through all four steps and is responsible for video quality control.

Figure | The personalized short drama generation pipeline is divided into four stages.
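The four-stage flow with a review gate at every step can be sketched roughly as follows. This is only an illustrative skeleton, not the paper's implementation: the `Draft` container, the stage names, the `review` callback, and the retry limit are all assumptions introduced here, and each stage is a placeholder that merely records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Draft:
    """Hypothetical container for the work-in-progress drama."""
    idea: str
    completed: List[str] = field(default_factory=list)


def make_stage(name: str) -> Callable[[Draft], Draft]:
    """Build a placeholder stage that only records that it ran."""
    def stage(draft: Draft) -> Draft:
        return Draft(draft.idea, draft.completed + [name])
    return stage


# Stage names mirror the four steps in the article; the identifiers are made up.
STAGES: List[Tuple[str, Callable[[Draft], Draft]]] = [
    (name, make_stage(name))
    for name in ("story", "assets_and_prompts", "first_frames_3d", "post_production")
]


def run_pipeline(
    idea: str,
    review: Callable[[str, Draft], bool],
    max_attempts: int = 3,  # assumed retry budget per stage
) -> Draft:
    """Run the stages in order; redo a stage until the reviewer accepts it."""
    draft = Draft(idea)
    for name, stage in STAGES:
        for _ in range(max_attempts):
            candidate = stage(draft)
            if review(name, candidate):
                draft = candidate
                break
        else:
            raise RuntimeError(f"stage {name!r} failed review {max_attempts} times")
    return draft


result = run_pipeline("a chef inherits a haunted restaurant", review=lambda name, d: True)
print(result.completed)
```

Keeping the reviewer as a callback that sees the stage name lets one quality-control component apply different checks at different stages, which matches the article's description of review running through the whole process.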

Story generation: The Agent first generates a structured story and a storyboard through retrieval and multi-Agent debate. It then draws on a rhythm pattern library and a causal logic library, both distilled from about 300 high-quality short dramas, to combine narrative units along three dimensions (fact, logic, and rhythm) and build a controllable short drama framework.

Figure | A story generation framework based on multi-Agent debate.

Visual material and prompt generation: The Agent first generates a panoramic view of the scene and reference images of the characters, and then generates a first-frame prompt and a video prompt for each segment. The first-frame prompt defines the composition and perspective of the first frame, and the video prompt describes the subsequent actions, character interactions, and camera movements. Before generation, the review module checks the prompts for coherent spatial relationships and props, and rewrites them if it finds problems.

Consistent first-frame generation through 3D scene anchoring: The Agent first reconstructs the scene space from the panoramic view, and then places character movements, camera positions, and scene relationships in that shared space. Based on this, it selects a suitable camera position for the next shot to maintain spatial consistency across shots as much as possible. In multi-character scenes, the Agent also fine-tunes the camera position so that every character stays fully in frame and their relative positions are preserved.

Figure | Consistent first-frame generation through 3D scene anchoring.
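To make the camera fine-tuning idea concrete, here is a deliberately simplified top-down (2D) sketch. It is not the paper's method: the geometry, the `all_in_view` and `pull_back` helpers, the 60-degree field of view, and the step size are all assumptions chosen for illustration. The camera looks at the centroid of the characters and is moved straight back until everyone falls inside the horizontal field of view.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def all_in_view(camera: Point, target: Point, characters: List[Point], fov_deg: float) -> bool:
    """Return True if every character lies inside the horizontal field of view."""
    look = math.atan2(target[1] - camera[1], target[0] - camera[0])
    half_fov = math.radians(fov_deg) / 2
    for x, y in characters:
        angle = math.atan2(y - camera[1], x - camera[0])
        # Wrap the angular difference into [-pi, pi) before comparing.
        diff = (angle - look + math.pi) % (2 * math.pi) - math.pi
        if abs(diff) > half_fov:
            return False
    return True


def pull_back(
    camera: Point,
    characters: List[Point],
    fov_deg: float = 60.0,  # assumed horizontal field of view
    step: float = 0.25,     # assumed adjustment step, in scene units
    max_steps: int = 40,
) -> Point:
    """Move the camera away from the group centroid until all characters fit."""
    cx = sum(x for x, _ in characters) / len(characters)
    cy = sum(y for _, y in characters) / len(characters)
    dx, dy = camera[0] - cx, camera[1] - cy
    dist = math.hypot(dx, dy)
    if dist == 0:
        raise ValueError("camera sits on the centroid; no pull-back direction")
    ux, uy = dx / dist, dy / dist
    for _ in range(max_steps + 1):
        if all_in_view(camera, (cx, cy), characters, fov_deg):
            return camera
        camera = (camera[0] + ux * step, camera[1] + uy * step)
    raise ValueError("could not fit all characters within max_steps")


cast = [(-2.0, 0.0), (2.0, 0.0)]
adjusted = pull_back((0.0, -1.0), cast)
print(adjusted)
```

A real system would work in 3D and also account for occlusion and vertical framing; the sketch only shows the basic loop of testing a candidate camera against the character layout and nudging it until the constraint holds.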

Post-production: The Agent handles transitions, background music, and dialogue audio continuity according to the plot progression, and integrates the video segments into a short drama with a coherent rhythm and a complete emotional arc.

Figure | Generation of diverse transition segments, background music planning, and mixing.

How well does it work?

In the evaluation, the research team constructed a short drama evaluation benchmark Short-Drama-Bench, covering 7 major types and 17 sub-genres, including revenge, real-life themes, ancient palace intrigue, suspense reasoning, time-travel and rebirth, sweet romance, and workplace business battles. A total of about 239 minutes of videos were generated, covering long, medium, and short dramas. Compared with general video benchmarks, this benchmark pays more attention to the narrative rhythm and final effect of short dramas.

To evaluate short drama generation more comprehensively, the research team split the evaluation into three parts: VBench measures general video quality, ViStoryBench evaluates story visualization, and 8 short-drama-specific indicators examine opening and ending hooks, plot escalation, narrative coherence, spatial coherence between characters and the environment, and the naturalness of BGM and transitions.

In the qualitative results, the advantages of this Agent framework show up not only in the indicator scores but also directly in the generated examples. Compared with the baseline methods, it keeps visual continuity across segments more stable, and character positions, scene layouts, and camera relationships connect more naturally. Its plot rhythm and transition handling are also closer to the viewing habits of short drama audiences, so the output feels more like a finished production.

Figure | Qualitative examples.

Figure | Examples of generated videos.

In the quantitative results, compared with methods such as MovieAgent, ScriptAgent, and StoryMem, as well as commercial short drama generation products such as Toonflow, this Agent framework leads overall on the short-drama-specific indicators, VBench, and ViStoryBench.

In addition, the ablation results show that each production stage contributes differently. Story generation affects the opening appeal and plot progression, 3D first-frame generation mainly improves cross-shot spatial coherence, multi-stage review improves overall quality, and the transition and BGM modules make emotional shifts and scene changes more natural.

Figure | Quantitative evaluation. Top left: Comparison results on standard video generation and story visualization benchmarks. Bottom left: Comparison results on Short-Drama-Bench indicators, covering narrative hooks, narrative fluency, cross-segment continuity, and audio transition quality. Right: Human scoring results based on the same short drama evaluation dimensions, summarizing the average scores of 20 annotators in the benchmark test.

Limitations and future directions

The research team pointed out that although this Agent framework has shown strong advantages in the automated generation of short dramas, there are still some practical limitations to large-scale deployment.

For example, stronger controllability and higher production quality also mean higher generation costs. The average API cost of One Sentence, One Drama is about $25 to $27 per minute of video, while that of Toonflow is about $21.53 per minute. In terms of time, the framework takes about 74 to 90 minutes to generate a complete short drama of about 10 minutes. Cost reduction therefore remains a problem that must be solved before large-scale deployment.
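The arithmetic behind these figures can be checked with a short calculation. The per-minute costs and generation times come from the article; treating "about 10 minutes" as exactly 10 minutes is an assumption made here, and the helper names are hypothetical. Under that assumption, a drama costs roughly $250 to $270 in API fees, which is about 16% to 25% more per minute than Toonflow.

```python
# Figures quoted in the article (US dollars per minute of finished video).
OURS_COST_PER_MIN = (25.0, 27.0)  # One Sentence, One Drama: low and high estimate
TOONFLOW_COST_PER_MIN = 21.53
GEN_MINUTES = (74.0, 90.0)        # wall-clock minutes to generate one drama

DRAMA_MINUTES = 10.0  # assumption: "about 10 minutes" treated as exactly 10


def total_cost(cost_per_min: float, minutes: float = DRAMA_MINUTES) -> float:
    """API cost of one drama of the given length."""
    return cost_per_min * minutes


def premium_over(baseline: float, cost: float) -> float:
    """Relative extra cost compared with a baseline (0.10 means 10% more)."""
    return (cost - baseline) / baseline


low, high = (total_cost(c) for c in OURS_COST_PER_MIN)
premiums = [premium_over(TOONFLOW_COST_PER_MIN, c) for c in OURS_COST_PER_MIN]
# Minutes of compute spent per minute of finished video.
slowdown = [g / DRAMA_MINUTES for g in GEN_MINUTES]

print(f"One Sentence, One Drama: ${low:.2f} to ${high:.2f} per drama")
print(f"Toonflow: ${total_cost(TOONFLOW_COST_PER_MIN):.2f} per drama")
print(f"Premium over Toonflow: {premiums[0]:.1%} to {premiums[1]:.1%}")
print(f"Generation time per output minute: {slowdown[0]:.1f} to {slowdown[1]:.1f} min")
```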

In terms of human-computer collaboration, the current Agent framework still focuses mainly on automatic generation. The research team pointed out that a future interactive interface could show users review scores and diagnostic feedback: low-scoring segments can be regenerated, high-scoring segments usually need no further modification, and for segments in the middle range the creator decides whether to adjust them.
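The three-way handling of segments described above amounts to a simple threshold rule. The sketch below assumes review scores normalized to the range 0 to 1 and uses cutoffs of 0.4 and 0.8; the article does not give a score scale or thresholds, so both are illustrative assumptions, as are the function and label names.

```python
from typing import Dict, List


def triage(score: float, low: float = 0.4, high: float = 0.8) -> str:
    """Map a review score to an action. Thresholds are assumed, not from the paper."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < low:
        return "regenerate"
    if score >= high:
        return "keep"
    return "creator_decides"


def plan_revisions(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Group segment ids by the action their score calls for."""
    plan: Dict[str, List[str]] = {"regenerate": [], "keep": [], "creator_decides": []}
    for segment_id, score in scores.items():
        plan[triage(score)].append(segment_id)
    return plan


# Hypothetical review scores for three segments.
example = plan_revisions({"seg01": 0.92, "seg02": 0.35, "seg03": 0.61})
print(example)
```

Only the middle band needs a human decision, so the creator's attention goes to the segments where automatic review is least conclusive.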

In addition, the framework faces audio licensing constraints. To reduce copyright risks, the current BGM library mainly uses royalty-free or commercially available music, which limits the diversity of style and emotional expression. If it can later access a larger licensed music library and offer clear purchase or licensing options when matching specific tracks, the framework will have broader commercial application scenarios.

For more technical details, please refer to the original paper.

This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao), author: Xia Qiansi. It is published by 36Kr with authorization.