AI video generation: How to break the boundaries of creation?
01. When New Technologies Meet Old Challenges
If we had to pick the most spotlighted area of the AI industry in the second half of 2025, video generation would almost certainly be the answer. After OpenAI released Sora 2 and launched its companion app, the popularity of AI-generated video spread like wildfire across the globe.
However, looking at the industry's development, this is not just a sudden product boom. Behind it lie two years of continuous improvement in video-generation technology: image quality, temporal modeling, and usability. Whether from big companies or startups, models such as Sora, Veo, and Tongyi Wanxiang have made cumulative technological contributions that significantly accelerated the iteration of AI video capabilities worldwide.
Deeper impacts are gradually emerging within the industry.
Model progress is no longer limited to image quality; it increasingly covers the elements closest to industrial production, such as narrative ability, consistency of characters and styles, audio-video synchronization, and logical continuity across shots. When generated results cross the threshold of "watchable" and begin to approach "usable" and "easy to use", AI-generated video truly enters the public eye and becomes one of the most promising sectors.
Meanwhile, the video industry itself is facing a structural challenge.
Over the past decade, the video industry has been one of the fastest-growing, most capital-intensive, and most innovative fields globally. From film and television entertainment, advertising, and marketing to e-commerce content, social platforms, and the creator economy, video has become the core form of information, entertainment, and commercial expression. However, as the industry matures and competition intensifies, content production has been pushed to its limits. Short dramas, e-commerce, and advertising have entered a stage of "faster, finer, larger-scale" production. Content update cycles have been compressed to hours or even minutes, and the manpower and lead times required by traditional production chains are clearly out of sync with this pace.
This pressure takes different forms in different fields: traditional film, television, and advertising still rely heavily on experience-intensive labor, with high costs for pitches and trial and error; the demand for high-frequency, fragmented material from MCNs and e-commerce far exceeds the capacity of traditional shoot-and-edit pipelines; short dramas and AI-generated comic dramas, having moved past their early rough stage, now demand greater consistency of characters, scenes, and shots; and content bound for overseas markets faces the double challenge of speed and cross-cultural adaptation.
As demand for content continues to grow and AI video-generation capabilities mature rapidly, the structure of the content ecosystem is starting to change.
On the one hand, the threshold for creation has been significantly lowered. Video is no longer a format that only a few professional teams can produce reliably; individual creators and small teams are now capable of near-industrial production.
On the other hand, a new intermediate layer is emerging around video generation: creative tools, workflow platforms, and vertical solutions for advertising, e-commerce, and short dramas. More and more companies are redesigning their products with AI-generated video as the underlying capability.
This is triggering further chain reactions. For example, the relationship between platforms and creators is being reshaped. When content becomes a process-based asset that can be repeatedly generated, quickly verified, and continuously optimized, video production gradually shifts from one-time creation to a scalable, systematic undertaking.
As a result, over the past year a large number of startups have emerged, in China and abroad, up and down the AI video-generation industry chain: some start from video-generation capability itself to rebuild the starting point of video production; some integrate AI into scripts, storyboards, and editing around the creator's workflow; others focus on enterprise and industry scenarios, emphasizing stability and scalable delivery; and in overseas markets, cross-language and localized generation have become important points of breakthrough.
With technological breakthroughs and large-scale domestic demand converging at the same time, the content industry has gradually formed a clear judgment: AI video generation has become an important part of next-generation content infrastructure. More stable technology and faster tools are not enough; creators may need a more fundamental, scalable productivity solution.
02. The Boundaries of Creation Are Being Broken by Technology
Every company is responding to this trend in its own way.
OpenAI's Sora represents a strategy inclined toward showcasing general capability: by generating extremely high-quality, visually striking videos, it quickly raises public awareness and pushes AI-generated video into popular culture and social communication. Google's Veo, meanwhile, builds on the company's research strengths in multimodality and generative models, emphasizing long-sequence understanding and expression in complex scenes; it is more a frontier exploration of technical capability.
Domestically, more companies start from the platform ecosystem: some combine video-generation capability with content distribution, creator systems, and recommendation mechanisms, trying to integrate AI-generated video into the existing creation-and-distribution loop; others apply generation capability across the entire video-production process to improve the efficiency of content supply.
These paths have different priorities: some focus on "whether it can generate, and how good the results look", while others care more about "how it is used and how it spreads". A third, emerging path treats video generation as a productivity capability.
The differences between these paths ultimately come down to different readings of usability versus entertainment, and of the B-end (business) versus the C-end (consumer).
In C-end scenarios, AI-generated video mainly serves entertainment and self-expression. "Fun", "novelty", and "personalization" often take precedence over stability, and users tolerate occasional inconsistencies and glitches. In B-end scenarios such as advertising, e-commerce, and short dramas, what creators and enterprises really care about is whether shots, characters, and styles stay consistent over time, whether the content is controllable and reusable, and whether it can be output stably at a high-frequency, high-concurrency production rhythm.
This is an easily overlooked divide in the current market: many video models can satisfy the C-end's appetite for novelty and creation but struggle to support the B-end's requirements for certainty and scale. Unless they can enter B-end production processes, AI video capabilities cannot truly translate into productivity gains.
Alibaba has chosen a path that is harder but more valuable for the whole industry: turning AI video generation into industry-level infrastructure. On December 17th, at the Alibaba Cloud Apsara Release Event, Tongyi Wanxiang 2.6 (Wan2.6) was officially launched commercially. As Alibaba's core model in the video-generation field, Wanxiang aims to answer the content industry's shift from "able to generate" to "able to produce", and from trial use to large-scale deployment.
Jin Luyao, product manager at Tongyi Lab, walked us through the capabilities creators care about most in actual production, such as multi-shot narrative, video-reference generation, and more stable long-sequence output, and how these demands shape the model's direction of evolution.
For AI-generated video to truly enter the production process, the primary prerequisite is multi-shot narrative ability.
In real-world video creation, the quality of a single frame has never been the hardest problem. The real challenge lies in continuity across shots: whether characters remain stable, scenes stay coherent, and time and narrative follow logically. Early video-generation models were better at producing isolated high-quality clips; once they entered multi-camera, multi-shot creation, problems such as drifting character details, broken action logic, and inconsistent information emerged. This is an important reason AI-generated video long remained at the stage of concept demos or single-shot material.
In Wan2.6, multi-shot ability is elevated to a core, model-level capability. In contrast to the "generate segment by segment, splice later" approach, Tongyi Wanxiang emphasizes overall modeling of the timeline and shot language during generation: the model must establish from the outset who the subject is, how the space changes, and how the narrative progresses, so that shot transitions become a controllable variable. To this end, Wanxiang continuously strengthens subject consistency and temporal modeling during training and inference, and supports natural-language storyboard instructions, letting creators orchestrate multi-shot narratives directly through prompts.
This provides the continuity basis required for video generation to approach industrial production.
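To make this concrete, here is a minimal sketch in Python of how a multi-shot storyboard might be flattened into a single natural-language prompt of the kind described above. The shot schema and the generate_video stub are hypothetical illustrations, not the official Wanxiang interface.

```python
# Hypothetical sketch: expressing a multi-shot storyboard as one prompt.
# The shot schema and generate_video() stub are illustrative only; they
# are not the official Tongyi Wanxiang API.

shots = [
    {"no": 1, "camera": "wide establishing shot",
     "action": "a tea master enters a sunlit courtyard"},
    {"no": 2, "camera": "close-up",
     "action": "the tea master's hands warm a clay teapot"},
    {"no": 3, "camera": "slow push-in",
     "action": "the tea master pours tea for a guest as steam rises"},
]

# Naming the same subject ("the tea master") in every shot gives the model
# the "who the subject is" anchor the text mentions, so transitions stay a
# controllable variable rather than a splice point between clips.
prompt = " ".join(
    f"Shot {s['no']} ({s['camera']}): {s['action']}." for s in shots
)

def generate_video(prompt: str) -> None:
    """Stand-in for a call to a video-generation endpoint."""
    print("Prompt that would be sent:\n" + prompt)

generate_video(prompt)
```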
Generated by Tongyi Wanxiang
Jin Luyao told us that another important real-world demand is that creators often want to preserve the appearance, movements, and even voices of real people or objects while placing them in new virtual scenes. In the past, meeting such demands relied heavily on shooting, modeling, and complex post-production, with extremely high costs and technical barriers.
Wan2.6 upgrades the reference object from images to videos and further integrates overall modeling of appearance, motion, and voice. The model accepts reference videos of about 5 seconds and uses the people, animals, or objects in them as the subjects of subsequent generation. It can not only replicate appearance but also learn motion patterns, facial expressions, and vocal characteristics, achieving audio-visually consistent results.
Compared with a single reference image, a reference video provides more complete 3D and temporal information, letting the model understand the subject more faithfully. This ability is particularly crucial in real-world scenarios. Whether a brand is generating a complete advertisement from rough footage or a creator is blending real people into virtual environments, video-reference generation significantly lowers the production threshold and expands the commercial boundaries of AI-generated video.
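As a rough illustration of what a video-reference request might look like, the sketch below packages a roughly 5-second reference clip together with a prompt for the new scene. The model name and every payload field are assumptions made for illustration, not the real Wanxiang interface.

```python
# Hypothetical sketch of a video-reference generation request. The model
# identifier and payload fields are illustrative assumptions, not the real
# Wanxiang interface.
import base64
import json

def build_reference_request(clip_bytes: bytes, prompt: str) -> str:
    """Package a ~5-second reference clip plus a scene prompt as JSON."""
    clip_b64 = base64.b64encode(clip_bytes).decode("ascii")
    payload = {
        "model": "wan2.6-reference",   # hypothetical model name
        "prompt": prompt,              # the new virtual scene
        "reference_video": clip_b64,   # subject source: looks, motion, voice
        "preserve": ["appearance", "motion", "voice"],
    }
    return json.dumps(payload)

# Dummy bytes keep the sketch self-contained; in practice this would be
# the raw bytes of a ~5-second clip of the real person or object.
body = build_reference_request(
    b"\x00fake-mp4-bytes",
    "the same host presents a product on a futuristic stage",
)
print(body[:120] + "...")
```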
Generated by Tongyi Wanxiang
Generated by Tongyi Wanxiang
"In video generation, duration is always a variable that needs to be carefully balanced," Jin Luyao added.
A video that is too short cannot convey a complete message, yet as duration grows, the difficulty of maintaining consistency and temporal stability rises steeply. Most video models in the industry can only reliably generate clips of about 4 seconds; each additional second compounds the technical challenge.
Wan2.6 can stably generate controllable videos of about 15 seconds, with 1080P output and audio-video synchronization. For commercial scenarios such as advertising, e-commerce displays, and short-drama storyboards, 15 seconds can carry a complete narrative without significantly increasing the cost of revision and control, making it an ideal content length.
This release of Wan2.6 also upgrades text-to-image generation. Beyond basic generation, the model incorporates an understanding of narrative structure, supports mixed text-and-image input, and can automatically break a simple prompt into a story and generate storyboard frames, greatly improving the efficiency of story-driven content creation. Combined with multi-image reference and commercial-grade consistency control, text-to-image generation is moving from "inspiration sketches" to a production tool directly usable in advertising and content creation.
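The "simple prompt to storyboard frames" idea can be sketched as follows. In Wan2.6 the decomposition is performed by the model itself; here a naive sentence split stands in for it, and generate_frame is an illustrative stub rather than a real API call.

```python
# Hypothetical sketch of story-to-storyboard decomposition. Wan2.6 performs
# this breakdown itself; a naive sentence split stands in for it here, and
# generate_frame() is an illustrative stub, not a real API.

story = (
    "A courier robot wakes at dawn. It weaves through a rainy market. "
    "It delivers a warm meal to an old painter."
)

# Naive beat extraction: one storyboard frame per sentence.
beats = [s.strip() for s in story.split(".") if s.strip()]

def generate_frame(beat: str, index: int) -> str:
    """Stand-in for a text-to-image call; returns a fake asset name."""
    print(f"frame {index}: {beat}")
    return f"storyboard_{index:02d}.png"

frames = [generate_frame(beat, i + 1) for i, beat in enumerate(beats)]
print("storyboard:", frames)
```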
Beyond meeting creators' most basic production needs, Tongyi Wanxiang is also trying to go a step further: exploring how the continuous evolution of model capability can expand the boundaries of creation, letting AI play a more active role in expression, aesthetics, and narrative.
Generated by Tongyi Wanxiang
"Chinese aesthetics is an idea that Wanxiang has always adhered to," Jin Luyao told us. The continuous investment in the Chinese context and Chinese aesthetics is an important feature that distinguishes Wanxiang from many overseas models. Through cooperation with art academies and other institutions and the introduction of a large amount of Chinese aesthetic materials during pre - training and evaluation, the model's performance in terms of character temperament, style expression, and cultural details is more in line with local creation needs. This optimization is not a one - time process but is continuously iterated through evaluation systems, customer feedback, and reinforcement learning.
Generated by Tongyi Wanxiang
Real-world demands keep raising the bar for technical capability, and breakthroughs in technical detail, in turn, release new production efficiency. Tongyi Wanxiang is evolving within this feedback loop. As Jin Luyao put it, "We've always believed that good results outweigh everything else."
03. When Efficiency Increases, Cycles Shorten, and Redundant Manpower Disappears
Just a year, or even half a year, ago, most practitioners in the video and film-production industries could hardly have imagined their work efficiency doubling.
The gain in efficiency is a direct result of the reconstruction of content-production methods. In the traditional production system, creativity, execution, and post-production are divided into multiple linear steps, each handled by a specific role. In this highly specialized model, the process can only move sequentially, often requiring the previous step to be fully delivered before the next begins. This lengthens the overall production cycle and creates a large amount of redundant manpower.
When AI video generation begins to intervene at the front end of creation, many tasks that once required cross-role collaboration are compressed into a single creative interface. The boundaries between traditional roles such as screenwriter, director, editor, and graphic designer are blurring. Scripts can be converted directly into storyboards, storyboards can quickly generate visual material, and editing and art adjustments no longer depend on a long post-production chain. Handover costs between roles drop sharply, and creators begin to make holistic judgments based on the final result rather than sticking to fixed processes.
Generated by Tongyi Wanxiang
As content production shifts from a linear process to parallel, on-demand generation centered on models, the efficiency gains are not evenly distributed. The first changes usually appear in scenarios under high-frequency output pressure that are highly sensitive to cost and turnaround time.
These scenarios have one thing in common: on the one hand, they must sustain continuous, large-scale content production; on the other, their creative ideas need to be verified and iterated quickly. Therefore, the efficiency improvement brought by AI-generated