Meituan officially releases and open-sources LongCat-Video, supporting efficient long-video generation.
On October 27, Meituan's LongCat team released and open-sourced the video generation model LongCat-Video. According to the team, the model supports the basic tasks of text-to-video, image-to-video, and video continuation under a single unified architecture, and has achieved leading results among open-source models on internal and public benchmarks, including VBench.
▲ The LongCat-Video generation model reaches open-source SOTA on the basic text-to-video and image-to-video tasks (file photo)
The technical report states that LongCat-Video is built on the Diffusion Transformer (DiT) architecture and distinguishes tasks by the number of conditioning frames: text-to-video receives no conditioning frames, image-to-video receives one reference frame, and video continuation conditions on multiple preceding frames. All three tasks are thus covered without any additional model modifications.
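To make the "task by conditioning-frame count" idea concrete, here is a minimal, hypothetical sketch: one shared transformer backbone handles all three tasks, and only the number of clean conditioning frames prepended to the noisy latent sequence changes. All names, shapes, and the toy backbone are illustrative assumptions, not LongCat-Video's actual code.

```python
import torch
import torch.nn as nn

class TinyDiTBackbone(nn.Module):
    """Stand-in for the shared Diffusion Transformer over video latent tokens."""
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.to_noise = nn.Linear(dim, dim)

    def forward(self, tokens):                      # tokens: (B, T*N, dim)
        return self.to_noise(self.encoder(tokens))  # predicted noise per token


def denoise_step(backbone, noisy_latents, cond_frames=None):
    """One denoising call; the task is implied by how many clean frames condition it.

    cond_frames is None      -> text-to-video      (0 conditioning frames)
    cond_frames has 1 frame  -> image-to-video     (1 reference frame)
    cond_frames has k frames -> video continuation (k preceding frames)
    """
    b, t, n, d = noisy_latents.shape                # (batch, frames, tokens/frame, dim)
    parts = [noisy_latents.reshape(b, t * n, d)]
    n_cond = 0
    if cond_frames is not None:
        n_cond = cond_frames.shape[1]
        parts.insert(0, cond_frames.reshape(b, n_cond * n, d))
    seq = torch.cat(parts, dim=1)
    pred = backbone(seq)
    # Only the noisy (generated) portion receives a denoising update.
    return pred[:, n_cond * n:].reshape(b, t, n, d)


if __name__ == "__main__":
    backbone = TinyDiTBackbone()
    noisy = torch.randn(1, 8, 16, 64)               # 8 frames of 16 latent tokens each
    print(denoise_step(backbone, noisy).shape)                             # text-to-video
    print(denoise_step(backbone, noisy, torch.randn(1, 1, 16, 64)).shape)  # image-to-video
    print(denoise_step(backbone, noisy, torch.randn(1, 4, 16, 64)).shape)  # continuation
```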
To improve long-sequence generation, the model includes native video continuation as a pre-training task. The team says the model can stably generate minute-long videos, with targeted optimizations for cross-frame temporal consistency and physically plausible motion to reduce issues such as color drift, image quality degradation, and motion discontinuities.
For efficiency, the model combines block sparse attention (BSA) with a conditioning-token caching mechanism to cut redundancy in long-sequence inference; on sequences of 93 frames and longer, it is said to maintain a stable balance between efficiency and generation quality. For high-resolution, high-frame-rate output, it adopts a combined strategy of two-stage coarse-to-fine (C2F) generation, BSA, and distillation; the report puts the resulting inference speedup at roughly 10.1x over the baseline.
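The report does not detail the BSA implementation, but the general technique can be sketched as follows: queries and keys are grouped into fixed-size blocks, each query block attends only to a small top-k set of key blocks (chosen here by pooled block similarity), and all other positions are masked out, so attention cost no longer grows with the full sequence. Block size, top-k, and all shapes below are illustrative assumptions, not LongCat-Video's actual implementation.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, topk=4):
    """q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by block_size."""
    b, h, n, d = q.shape
    nb = n // block_size

    # Block-level summaries: mean-pool tokens within each block.
    q_blk = q.reshape(b, h, nb, block_size, d).mean(dim=3)    # (b, h, nb, d)
    k_blk = k.reshape(b, h, nb, block_size, d).mean(dim=3)

    # Score blocks against blocks and keep only the top-k key blocks per query block.
    blk_scores = q_blk @ k_blk.transpose(-1, -2)              # (b, h, nb, nb)
    keep = blk_scores.topk(min(topk, nb), dim=-1).indices
    blk_mask = torch.zeros_like(blk_scores, dtype=torch.bool)
    blk_mask.scatter_(-1, keep, True)

    # Expand the block mask to token resolution and run masked attention.
    tok_mask = blk_mask.repeat_interleave(block_size, dim=-2)
    tok_mask = tok_mask.repeat_interleave(block_size, dim=-1)  # (b, h, n, n)
    scores = (q @ k.transpose(-1, -2)) / d ** 0.5
    scores = scores.masked_fill(~tok_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    q = k = v = torch.randn(1, 2, 512, 32)    # 512 tokens, block_size 64 -> 8 blocks
    print(block_sparse_attention(q, k, v).shape)   # torch.Size([1, 2, 512, 32])
```

In a real kernel the masked-out blocks are simply never computed, which is where the savings on 93-plus-frame sequences would come from; the dense mask here is only for readability.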
The LongCat-Video base model has about 13.6 billion parameters. Evaluation covers text alignment, image alignment, visual quality, motion quality, and overall quality; the team says the model stands out on text alignment and motion coherence, and posts strong results on public benchmarks such as VBench.
The LongCat team positions this release as a step in its exploration of the "World Model" direction. The code and model have been open-sourced. The conclusions and performance figures above are drawn from the team's technical report and release materials.
This article is from "Tencent Technology", compiled by Xiaojing, and published by 36Kr with authorization.