Alibaba's HappyHorse has launched out of nowhere. In our hands-on tests, it staged a courtroom showdown between Elon Musk and Sam Altman and even "created" GTA 6 on its own.
According to an April 27 report by Zhidx, HappyHorse 1.0 (official Chinese name rendered as "Happy Pony"), the latest video generation and editing model from Alibaba's ATH Innovation Division, began gray-scale (staged-rollout) testing today. Creators can register and use it on the Alibaba Cloud Bailian platform and the HappyHorse official website, while the general public can try it in the Qianwen app.
On the blind-testing platform Arena.ai, HappyHorse 1.0 ranks second on all three leaderboards (text-to-video, image-to-video, and video editing), behind only ByteDance's recently popular Seedance 2.0. To see how HappyHorse 1.0 really performs, we ran hands-on tests across multiple dimensions.
With HappyHorse 1.0, users can generate a 3s-15s video from a simple text description, complete with elements such as multi-shot transitions and a coherent plot.
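To give a feel for the workflow, here is a minimal sketch of what a text-to-video call could look like. HappyHorse is actually accessed through the Bailian platform and the official website; the endpoint, field names, and response shape below are purely illustrative assumptions, not a documented API.

```python
import time
import requests

# Hypothetical endpoint and credentials: illustrative only. This is NOT
# HappyHorse's actual API, which is accessed via the Bailian platform.
API_URL = "https://api.example.com/v1/video/generations"
HEADERS = {"Authorization": "Bearer YOUR_KEY"}

def text_to_video(prompt: str, duration_s: int = 5, resolution: str = "1080p") -> dict:
    """Submit a text-to-video job and poll until it finishes."""
    assert 3 <= duration_s <= 15, "HappyHorse 1.0 generates 3-15 s clips"
    resp = requests.post(
        API_URL,
        headers=HEADERS,
        json={"prompt": prompt, "duration": duration_s, "resolution": resolution},
        timeout=30,
    )
    resp.raise_for_status()
    job_id = resp.json()["id"]
    # Generation took roughly 2-5 minutes in our tests, so poll rather than block.
    while True:
        status = requests.get(f"{API_URL}/{job_id}", headers=HEADERS, timeout=30).json()
        if status["state"] in ("succeeded", "failed"):
            return status
        time.sleep(15)
```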
The official website states that HappyHorse 1.0 supports resolutions up to 1080p and can generate up to four videos at once. List prices are 0.9 yuan/s for 720p and 1.6 yuan/s for 1080p; with the limited-time discount on the Pro monthly subscription, those drop to 0.44 yuan/s and 0.78 yuan/s.
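Since billing is per second, a clip's cost is simply duration times rate. A quick calculation with the listed prices:

```python
RATES_YUAN_PER_S = {
    "720p_list": 0.9, "1080p_list": 1.6,   # list prices
    "720p_pro": 0.44, "1080p_pro": 0.78,   # discounted Pro-subscription prices
}

def clip_cost(duration_s: float, tier: str) -> float:
    """Cost in yuan of one generated clip at the given pricing tier."""
    return duration_s * RATES_YUAN_PER_S[tier]

print(clip_cost(15, "1080p_list"))  # 24.0 yuan for a maximum-length 1080p clip
print(clip_cost(15, "1080p_pro"))   # ~11.7 yuan with the Pro discount
```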
Meanwhile, HappyHorse 1.0 is also appearing in multiple Alibaba products, including agent platforms such as Alibaba Wukong, MuleRun, and JVS Claw. The Qianwen app has announced that a "test" video game will launch soon: once the user's character in its short-drama universe is determined, HappyHorse 1.0 will generate short-drama clips "starring" the user.
In this hands-on test, we found HappyHorse 1.0 clearly strong in instruction following and generation speed, but also saw room for improvement in areas such as physical accuracy and audio-video synchronization. Here are the core takeaways:
(1) Fast generation: each video took only about 2-5 minutes to generate, which is competitive among video-generation models.
(2) Strong instruction following: the model understands and executes complex prompt requirements, including camera movement, composition, style, and atmosphere.
(3) High fidelity to multi-element references: in image-to-video generation, every uploaded reference element, including characters, scenes, and props, was accurately reproduced.
(4) Audio-video synchronization still needs work: voices and sound effects sync reasonably well, but in complex scenarios such as instrument performances, the generated hand movements are visibly misaligned with the audio rhythm.
(5) Continuity errors in longer narratives: videos longer than 10 s are prone to physics bugs, such as objects moving without any external force.
(6) Text rendering errors: on-screen text often comes out garbled or wrong.
What follows is our complete hands-on test.
HappyHorse official website: www.happyhorse.cn
01. It can follow ultra-long 800-word prompts, but physical realism still has room to improve
Our first batch of test tasks focused on text-to-video. In this scenario, the model's instruction following, the physical accuracy of its output, and its audio-video synchronization are all worth watching.
Case 1: Complex actions and audio-video synchronization
Prompt:
A street music performance. The drummer keeps the rhythm while the guitarist plays the melody. The audience forms a semicircle, clapping and swaying gently to the beat. The atmosphere is warm, with a Latin American flavor and warm-colored evening light. The camera slowly pushes in.
Generation speed is a major highlight of HappyHorse 1.0: the video below took about 2 minutes. In the output, the human figures show no abnormal or distorted limbs, and the camera movement, lighting, and other elements match the prompt.
Audio-video synchronization is this video's weak point. The guitarist's playing drifts out of sync with the music; especially on chord changes and heavy downbeats, the on-screen hand movements are misaligned with the notes in the audio, undercutting the performance's realism and immersion.
Case 2: Physical realism
Prompt:
On a seaside cliff, waves crash violently against the rocks, sending up spray. The sky is overcast, and the wind whips the characters' clothes and hair. Cinematic realism, in slow motion.
Here HappyHorse 1.0 has to simulate the physical world; the difficulty lies in rendering elements such as water and wind.
The result reproduces the turbulence well: the waves' impact on the rocks and the foam on the sea surface largely obey physical laws.
When the shot cuts to the characters, the protagonist's hair blows in essentially the same direction as the clothes. The drawback is that in the close-up, the water droplets fall at a speed that doesn't quite obey physics, seeming half a beat too slow.
Case 3: Ultra-long prompt
This case tests HappyHorse 1.0's ability to understand complex prompts. Our prompt ran to 800 words and described a gameplay-demo scene in the style of the well-known game "GTA".
The prompt pinned down almost every element of the scene, including characters, weather, environment, and buildings, and HappyHorse 1.0 rendered all of them accurately.
However, a physics bug appears in the opening shot: a car door closes by itself with no external force. And in the final shot the protagonist changes, showing that HappyHorse's consistency still has room to improve here.
Case 4: Camera language and narrative feel
Prompt:
A city street at night. A detective walks in the rain, neon lights reflecting off the wet ground. The camera slowly zooms from a long shot to a close-up. Film noir style.
This prompt spells out the camera movement and visual style in detail, testing HappyHorse 1.0's instruction following.
HappyHorse 1.0 delivered the requested slow zoom from long shot to close-up. The noir styling is right, and the neon light and its reflections look fairly natural, but the Chinese text in the frame is rendered with obvious errors.
For this case we also used 1080p resolution and the maximum 15 s duration; even when the frame is magnified, the details remain fairly crisp.
Taken together, these cases show that when the prompt is sufficiently detailed, HappyHorse 1.0 understands and executes complex requirements for composition, camera movement, style, and atmosphere, and its human figures and basic physical interactions are fairly stable. High-precision audio-video synchronization, fine-grained physical detail, and in-frame text rendering still have room to improve.
02. Supports up to 9 image references: our test makes Altman and Musk "face off in court"
Beyond text-to-video, HappyHorse 1.0 also supports image-to-video and video editing, scenarios that place high demands on consistency and stability. During today's test, however, we were unable to try the video-editing feature successfully.
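Mechanically, image-to-video just adds reference images to the request. Below is a sketch of how a multi-image call might be shaped; as with the earlier example, the endpoint and field names are illustrative assumptions rather than HappyHorse's documented API. Only the 9-image cap and the two modes reflect the product as tested.

```python
import requests

API_URL = "https://api.example.com/v1/video/generations"  # hypothetical, as before
HEADERS = {"Authorization": "Bearer YOUR_KEY"}
MAX_REFERENCE_IMAGES = 9  # the cap reported for HappyHorse 1.0

def image_to_video(prompt: str, reference_urls: list[str],
                   mode: str = "multi_reference") -> dict:
    """Submit an image-to-video job.

    mode: "first_frame" animates a single image as the opening frame;
          "multi_reference" conditions on up to 9 characters/scenes/props.
    (Both mode identifiers are our own naming; only the two behaviors
    come from the product modes described in this test.)
    """
    if len(reference_urls) > MAX_REFERENCE_IMAGES:
        raise ValueError(f"at most {MAX_REFERENCE_IMAGES} reference images")
    resp = requests.post(
        API_URL,
        headers=HEADERS,
        json={"prompt": prompt, "mode": mode, "reference_images": reference_urls},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```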
Case 1: First-frame mode
We first tried the first-frame mode of the image-to-video feature, uploading a group photo, taken some time ago, of OpenAI co-founder and CEO Sam Altman with Anthropic co-founder and CEO Dario Amodei.
However, perhaps because the image contained multiple real people, the model refused the generation request.
We then uploaded a solo photo of Altman and asked the model to show him drinking coffee. This attempt succeeded: the person in the video resembles the real photo at roughly 80% similarity, and his appearance stays stable as the lighting and background change.
Case 2: Multi-character reference
For multi-image reference, we uploaded photos of Musk and Altman and asked HappyHorse 1.0 to imagine the two facing off in court in a heated argument.
This time HappyHorse 1.0 did not refuse the request. In its first output, though, the two never actually argue; mostly "Musk" speaks unilaterally. The model also seemed unaware that both men's native language is English: the on-screen "Musk" spoke fluent Chinese.
After we refined the requirements, HappyHorse 1.0 managed to produce the two arguing in English. The characters' expressions are rich, but they deviate noticeably from the reference photos.
Case 3: Multi-element reference
Beyond multiple characters, multi-image reference also lets users supply materials for the background and specific elements of the generated scene. So we uploaded images of Bill Peebles (a core figure behind Sora), the OpenAI office, and cardboard boxes, and asked HappyHorse 1.0 to generate a scene of someone quitting their job.
On the plus side, HappyHorse 1.0 accurately reproduced every reference element we uploaded; the characters, environment, and so on are essentially faithful.
But the scene contains plenty of physics bugs, such as a cardboard box closing by itself and a door opening on its own.
03. 1080p and fast generation are HappyHorse's highlights
As HappyHorse 1.0 entered testing, media outlets such as Zhidx and many industry insiders taking part in the test discussed the model's current performance and its competitive position in the industry.
Li Ming, technical partner at Max International, an all-in-one AI e-commerce marketing platform for overseas markets, sees the 3s-15s clip length, fast generation, and 1080p support as HappyHorse 1.0's highlight features. When prompts are sufficiently clear, he finds the model's output "acceptable".
In practice, though, HappyHorse 1.0 also shows problems, such as inconsistency across generated videos and a mechanical feel to generated voices. Compared with models like ByteDance's Seedance 2.0 and OpenAI's Sora 2, Li Ming believes HappyHorse 1.0 "still has some room for improvement".
We raised with Li Ming the audio-video synchronization and text-rendering problems from our test. He said that rendering text such as subtitles is in fact a common weakness of today's AI video-generation models: the industry rarely relies on the model to generate text directly, instead adding it with post-production tools, an approach that also leaves room for later adjustment.
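As a concrete illustration of that post-production route, a caption can be burned onto a finished clip with a standard tool such as ffmpeg's drawtext filter rather than trusting the model's in-frame text. A minimal sketch (file names and styling values are placeholders):

```python
import subprocess

def burn_caption(src: str, dst: str, text: str) -> None:
    """Overlay a caption in post instead of relying on the model's
    often-garbled in-frame text rendering. Requires an ffmpeg build
    with libfreetype for the drawtext filter."""
    drawtext = (
        f"drawtext=text='{text}'"
        ":fontcolor=white:fontsize=36:box=1:boxcolor=black@0.5"
        ":x=(w-text_w)/2:y=h-80"  # centered horizontally, near the bottom
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", drawtext, "-c:a", "copy", dst],
        check=True,
    )

burn_caption("happyhorse_clip.mp4", "captioned.mp4", "Generated with HappyHorse 1.0")
```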
On audio-video synchronization, Li Ming has observed that better prompt engineering can improve the sync of models such as HappyHorse 1.0 and Seedance 2.0, though all of them still have problems in this dimension for now.
Li Ming's judgment is that for enterprises, generation quality remains a common pain point of current video-generation models: the "lucky-draw" rate of some creative teams, that is, the share of generations that must be discarded and re-rolled, runs as high as 50%-60%. Generation turnaround time also has room to improve.
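Reading that figure as the fraction of generations that get thrown away, the expected number of attempts per usable clip follows a geometric distribution with mean 1/p, which translates directly into effective cost. A rough sketch using the 1080p list price quoted earlier:

```python
def effective_cost_per_usable_second(rate_yuan_per_s: float,
                                     discard_rate: float) -> float:
    """Expected spend per second of usable video when a fraction of
    generations is discarded (geometric expectation: 1/p attempts)."""
    usable_rate = 1.0 - discard_rate
    return rate_yuan_per_s / usable_rate

print(effective_cost_per_usable_second(1.6, 0.5))  # 3.2 yuan/s if half are discarded
print(effective_cost_per_usable_second(1.6, 0.6))  # ~4.0 yuan/s at a 60% discard rate
```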
By contrast, price is a dimension users are more willing to accept: if generation is both fast and good, tolerance for the price naturally rises.
The team at Flova, an AI video-creation platform that took part in the internal test, believes HappyHorse 1.0 performs well on realism and narrative ability, making it especially suited to narrative content and documentary-style subjects.
They also note that HappyHorse 1.0's handling of focal length is close to real-world cinematography, which reduces the video's "AI feel" and makes it more believable to watch; its camera movement is likewise fairly natural.