At the end of the year, text-to-image generation faces its "major examination", and ByteDance submits its "answer sheet" with Jimeng AI (Dream AI).
At the end of 2024, the field of text-to-image/video is highly competitive.
In early December, OpenAI finally launched Sora. With a maximum resolution of 1080p, clips up to 20 seconds long, and features like storyboarding and Remix, it undoubtedly stirred up the already intense competition in the text-to-image/video arena. The landscape was soon refreshed again when Google released its text-to-video model Veo 2, which impressed with its instruction following, camera control, and image quality. Domestic players are not to be outdone either. Dream AI, backed by ByteDance's strong short-video DNA, has gone through several iterations in the four months since its launch and has finally cracked a long-standing pain point: rendering Chinese text in AI-generated images.
In an era dominated by visual storytelling, text-to-image/video has undoubtedly become a must-win area for AI applications, and the competition is far from over. AI-generated videos that violate basic physics may be amusing, but they reflect the many pain points of the current field: low generation quality, slow response, cumbersome operation, and frequent artifacts.
Compared with other major players in the domestic and international text-to-image/video field, Dream AI entered the game relatively late, yet the industry can no longer afford to underestimate it, and users have high expectations for it. Dream AI itself is ambitious, pitching itself as an "Imagination Camera".
So, since its launch in May 2024, what are the unique capabilities of Dream AI? How does it compare to similar applications at home and abroad, including Sora? Through Dream AI, we may be able to get a glimpse of ByteDance's achievements in the text-to-image/video field in the first year of AI application.
One-Sentence Image Editing: Simple and Precise
In November, Dream AI launched the "Intelligent Reference" feature, claiming that users can edit images with no barrier to entry: one sentence is enough to get exactly the result they expect.
For example, take the currently popular trick of re-posing cultural relics. Select a photo of a Terracotta Warrior, open the "Intelligent Reference" feature, and enter a simple prompt: The Terracotta Army is drinking milk tea. In a few seconds, the model turns the original photo into a Terracotta Warrior holding a cup of milk tea in its left hand. The rest of the image stays essentially unchanged, with no deformation and no extra steps such as brushing over regions or outlining a mask.
Prompt: The Terracotta Army is drinking milk tea
Now try a more complex edit: removing the broken glass in front of the girl in the original image. The result shows that Dream AI understands the prompt precisely. The glass is completely removed, while the other details of the original image are largely preserved.
Prompt: Remove the broken glass in the picture
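To make the interaction concrete, here is a minimal sketch of how such a one-sentence edit could be expressed programmatically. Jimeng AI is used through its app and web interface, and no public API is described in this article, so the endpoint, field names, and authentication below are purely hypothetical placeholders.

```python
# Hypothetical sketch of an instruction-based image edit request.
# The endpoint, JSON fields, and auth scheme are illustrative only;
# they do not describe an actual Jimeng AI API.
import base64
import requests


def edit_image(image_path: str, instruction: str, api_key: str) -> bytes:
    """Send a source image plus a natural-language instruction, return the edited image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    resp = requests.post(
        "https://example.com/v1/image-edit",        # hypothetical endpoint
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "image": image_b64,                      # original photo
            "instruction": instruction,              # e.g. "Remove the broken glass in the picture"
            "preserve_unedited_regions": True,       # keep the rest of the image unchanged
        },
        timeout=60,
    )
    resp.raise_for_status()
    return base64.b64decode(resp.json()["image"])


if __name__ == "__main__":
    edited = edit_image("terracotta.jpg", "The Terracotta Army is drinking milk tea", api_key="...")
    with open("terracotta_edited.jpg", "wb") as f:
        f.write(edited)
```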
Across multiple tests, Dream AI's image model proves able to distinguish terms describing expressions, emotions, styles, and even idioms, and to carry out the task exactly as directed.
In terms of results, Dream AI's image editing currently covers style transfer, pose changes, expression changes, 2D-to-3D conversion, swapping clothes or people, adding or removing subjects, and scene changes. Compared with similar models, its coverage is fairly comprehensive.
Simple, precise, and versatile: for ordinary consumer users, this covers most of the image-generation needs that come up on social media, such as the recently popular anthropomorphized versions of classic cartoon avatars, or the AI snow scenes flooding WeChat Moments. For creators, this kind of simple and precise image editing significantly lowers the cost of use and improves creative efficiency.
At the beginning of this month, Dream AI launched the "Text Poster" feature: users can generate Chinese or English posters from a single sentence, and a brush-to-fix function for correcting typos was added later.
In hands-on testing, beyond the basics of simplicity, speed, and layout design, Dream AI's standout achievement is handling what AI image generation has long struggled with: rendering Chinese text. Quickly brushing over and correcting typos within the same platform is likewise something most current text-to-image models, domestic or foreign, cannot do. In addition, Dream's model can automatically polish the copy and fill in visual details based on the prompt. When it comes to controlling text generation inside an image, Dream is a pioneer in the industry.
This feature can largely meet business-side needs in scenarios such as e-commerce promotions, new product launches, year-end campaigns, and video covers. For small merchants, individual marketers, semi-professional designers, and text-media creators who need posters but cannot design them, Dream will be a useful assistive tool.
Video Generation: Complex and Diversified
Video generation is currently the most hotly contested area of AI applications and an important touchstone for each product's capabilities. In mid-November, Dream AI launched its dual S and P models. According to the official introduction, built on the DiT architecture, the S2.0 Pro model excels at first-frame consistency and image quality, while the P2.0 Pro model has stronger "prompt-following ability", meaning it can understand and accurately render even complex prompts involving shot transitions, continuous character actions, emotional performance, and camera control.
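For context, DiT (Diffusion Transformer) is a publicly documented architecture in which a transformer denoises latent patches under timestep and text conditioning. The sketch below shows one generic DiT-style block with adaptive LayerNorm conditioning, following the public DiT formulation; it only illustrates the architecture family named here, not ByteDance's actual S2.0/P2.0 Pro implementation, whose internals are not public.

```python
# A minimal, generic DiT-style block (after Peebles & Xie, 2023).
# Conditioning (timestep + prompt embedding) modulates each sub-layer
# via learned scale/shift/gate parameters (adaptive LayerNorm).
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # Map the conditioning vector to per-block shift, scale, and gate terms.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, tokens, dim) -- noised latent patches of an image/video
        # cond: (batch, dim)         -- fused timestep + text-prompt embedding
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaLN(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x
```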
Pushing instruction understanding, shot transitions, and camera control as far as possible has become the new battleground among today's leading video-generation models. The late-arriving OpenAI Sora offers a storyboard feature that lets users freely add storyboard frames, while Google Veo 2, widely regarded as having comprehensively surpassed Sora, takes complex instruction understanding and camera control to the extreme: professional cinematography terms such as depth of field can be entered directly and understood accurately to produce the intended effect.
Dream's P2.0 Pro model has put real effort into these areas as well. For shot transitions, feeding in an image plus a prompt produces a multi-shot video that cuts between wide, medium, and close-up shots while keeping the overall style, scene, and characters highly consistent with the original image. In the test below, the video stays essentially consistent with the source image, and the generated facial expressions and body details are fairly accurate and lifelike.
Prompt: Transform the video presentation into an animation style, highlight the girl's facial expressions, and show the joy after shopping
As for character actions, the current P2.0 Pro model can generate complete action sequences for a single subject, multiple subjects, and continuous, complex motions. For example, in one test we fed in a static single-person image with the prompt: A man walks into the picture, a woman turns to look at him, and they hug each other, and the people around in the background are walking. In the result, aside from the unfocused eyes that plague most models, the characters' movements are quite coherent, the interaction obeys real-world physics, and there is no limb dislocation or deformation.
For camera control, beyond the basic push-in, pull-out, pan, and tracking moves, the Dream P2.0 Pro model can also execute camera movements such as zooming, orbiting the subject, craning, rotating, handheld shake, and fisheye, with "zooming" performing particularly well. In the test below with the original image plus the prompt (The camera shoots around the woman wearing sunglasses, moves from her side to the front, and finally focuses on a close-up of her sunglasses), the prompt is largely realized apart from some slight camera shake.
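As an illustration of how this kind of shot and camera language can be organized before being handed to a video model, here is a small helper that assembles per-shot descriptions into one multi-shot prompt. The Shot fields and the phrasing are the author's own illustration, not an official Jimeng AI prompt format.

```python
# Illustrative helper for composing "camera language" prompts.
# The structure and vocabulary are hypothetical, not a documented format.
from dataclasses import dataclass


@dataclass
class Shot:
    framing: str      # e.g. "wide shot", "medium shot", "close-up"
    camera_move: str  # e.g. "orbits the subject", "zooms in", "cranes up"
    action: str       # what the subject does in this shot


def build_prompt(shots: list[Shot]) -> str:
    """Join per-shot descriptions into a single multi-shot prompt."""
    return " Then ".join(
        f"{s.framing}: the camera {s.camera_move} while {s.action}"
        for s in shots
    )


prompt = build_prompt([
    Shot("medium shot", "shoots around the woman wearing sunglasses", "she stands still"),
    Shot("close-up", "moves from her side to the front and focuses on her sunglasses", "she smiles slightly"),
])
print(prompt)
```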
Beyond its precise grasp of camera language and action language, Dream's P2.0 Pro model is also remarkably precise at portraying emotion. It can render not only single, simple emotions such as crying, laughing, sadness, and anger, but also understand and generate compound emotions such as "crying with a smile".
There are many scenarios for video generation