
Exclusive Interview with Luma AI's Chief Scientist: The Game in the Video-Generation Model Industry Has Changed

Fu Chong | 2025-12-05 09:37
Making memes for consumers is Sora's idea of fun, but serving professional customers with a unified multimodal model is the real business of AI video generation.

Text | Fu Chong, Zhou Xinyu

Editor | Su Jianxun

"If we only focus on iterating video generation itself by 2026, it won't be enough." Song Jiaming, the chief scientist of Luma AI, made this prediction to Intelligent Emergence.

Founded in 2021, Luma AI is a star video-generation startup in the United States. Intelligent Emergence has learned that Luma AI recently closed a $900 million Series C at a $4 billion valuation. The round was led by HUMAIN, an institution under Saudi Arabia's Public Investment Fund (PIF), with existing shareholders such as AMD Ventures, Andreessen Horowitz, Amplify Partners, and Matrix Partners all significantly increasing their stakes.

While most video-generation AI companies are still competing on longer durations and better image quality, Song Jiaming laid out a different view: what really needs to improve in the next stage is not the picture itself, but the model's ability to understand and reason about the real world.

He gave an example from a film set: if the director needs to reshoot a missed aerial shot, a traditional video-generation model will generate content from the prompt alone, and the details often fail to match the surrounding shots.

A reasoning model, by contrast, can understand the scene geometry, character positions, and camera logic of the existing clips, and thus produce a video that is more physically plausible and transitions more smoothly.

This is what makes a reasoning video-generation model usable in professional film, television, and advertising work, and that is the basis of its monetization.

"The popular trend of making memes with Sora 2 does not mean that the To - C era of video models has arrived. After the novelty wears off, ordinary users are unlikely to continue paying." Song Jiaming explained the current commercial situation of video - generation models.

The key to stronger reasoning in video-generation models is training a 'unified multimodal model' on language, image, and video data: multimodal fusion gives the model richer and more diverse data, which pushes its capability from 'generation' toward 'understanding'.

This path has already been validated in image generation this year. In 2024, the industry still disagreed about multimodal architecture; by 2025, image-generation models had largely folded tasks such as text-to-image and image editing into a single unified model, and the focus of competition shifted from architecture design to collecting high-quality data.

He believes video-generation models will go through the same convergence next year.

Continuously anticipating the next technological and commercial direction, and pushing past its own boundaries, is what Luma AI has always done.

The company initially started with 3D generation and, at the end of 2023, shifted to video-generation models, a much larger market.

In June 2024, Luma AI launched Dream Machine, a video-generation model aimed at AI and design beginners, beginning its exploration of the To-C market. Dream Machine attracted one million users in four days with zero promotion spend; with its film-grade camera movement and generation quality, it became known in the industry as 'a video-generation model that can rival Sora'.

But Luma did not linger on To-C popularity. Since this year, Luma AI has gradually shifted its focus to B-end professional users with stronger willingness to pay and more essential needs, such as film, television, advertising, and content-production organizations.

In September this year, Luma AI launched Ray 3, the world's first large-scale video-reasoning model.

But in a recent exclusive interview, Song Jiaming gave Intelligent Emergence a new judgment: Ray 3 is likely to be Luma's last generation of traditional video-generation models. The company has established the 'unified multimodal model' as the core direction for the next stage.

This goal also requires greater computing power and financial support.

HUMAIN, one of the investors in Luma AI's current round, is building 'Project Halo', a 2GW artificial-intelligence supercomputing cluster in Saudi Arabia and one of the world's largest computing-infrastructure projects. As a core customer, Luma AI will use this computing power to train its next-generation multimodal world model, further strengthening its video-reasoning and unified-model capabilities.

From starting out in 3D generation, to winning To-C attention with Dream Machine, to now serving B-end professional customers through its bets on reasoning and unified multimodal models, every key decision Luma AI has made has been an expansion built on its existing business.

In the exclusive interview, Song Jiaming detailed his observations on the video-generation industry and his predictions for its future. The following is drawn from the conversation, edited by the authors:

△ Song Jiaming. Photo: provided by the interviewee

The Future of Video-Generation Models: Reasoning Ability and Unified Multimodality

Intelligent Emergence: You once said that 'Ray 3 might be Luma AI's last generation of traditional text-to-video models'. How should this statement be understood?

Song Jiaming: My judgment is that future large models will no longer treat images, video, audio, and text as isolated modalities, but will process them within a unified framework. This is what we call a 'unified multimodal' model.

The larger data volume that unified multimodality brings will give video-generation models better reasoning ability, which helps the model process video more sensibly and helps users automatically spot problems in a video.

Language models are useful because they have strong in-context learning, zero-shot learning, and other such abilities, along with strong reasoning. I think these abilities will eventually appear in the visual and video modalities too, rather than the competition staying on longer durations and better-looking image quality.
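
To make the 'unified framework' idea concrete, here is a minimal sketch, assuming (as many published unified models do; this is not a description of Luma's architecture) that every modality is tokenized and interleaved into a single sequence that one model both continues and conditions on. The delimiter tokens and function name are hypothetical.

```python
from typing import List

# Hypothetical delimiter tokens marking where one modality ends and the
# next begins inside a single training sequence.
BOI, EOI = "<img>", "</img>"   # image span
BOV, EOV = "<vid>", "</vid>"   # video span

def build_unified_sequence(text: List[str],
                           image: List[str],
                           video: List[str]) -> List[str]:
    """Interleave text, image, and video tokens into one sequence, so a
    single model can learn generation and understanding across modalities."""
    return list(text) + [BOI] + list(image) + [EOI] + [BOV] + list(video) + [EOV]

# A captioned clip becomes one sequence: the same weights can continue it
# (generation) or attend over it (understanding).
seq = build_unified_sequence(["a", "drone", "shot"], ["i0", "i1"], ["v0", "v1"])
print(seq)
```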

Intelligent Emergence: Can you use a specific example to explain the difference between a video-reasoning model and a traditional video model?

Song Jiaming: Take film shooting. On a real set, the crew runs several cameras at once to capture multiple actors from different angles. Suppose that after the day wraps, the director suddenly realizes a missed aerial overview shot needs to be 'reshot' by AI.

In this case, a traditional video-generation model alone will probably 'use its imagination' to produce an aerial shot that looks fine at first glance. But look closer and the positions of the characters and the layout of background objects may not match the earlier shots.

In the video-reasoning task we define, the model must first 'understand and reason' rather than 'generate': it has to find correspondences between the same background objects across the different cameras' footage, infer the position of each actor and prop in a unified three-dimensional space, and only then generate, from the new aerial perspective, a video that is physically plausible, moves the camera naturally, and connects seamlessly with the earlier shots.
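
To make that three-stage decomposition explicit, here is a minimal sketch of the 'understand and reason, then generate' pipeline as described above. Every name is hypothetical and every stage is stubbed; this illustrates the task structure, not Luma's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class SceneEstimate:
    """The unified 3D picture of the set, inferred before any pixels are generated."""
    actor_positions: Dict[str, Vec3] = field(default_factory=dict)
    prop_positions: Dict[str, Vec3] = field(default_factory=dict)
    camera_poses: Dict[str, Vec3] = field(default_factory=dict)

def match_across_views(clips: List[str]) -> List[Tuple[str, str]]:
    """Stage 1: find correspondences between the same background objects
    in footage from different cameras (stubbed)."""
    ...

def infer_scene(correspondences: List[Tuple[str, str]]) -> SceneEstimate:
    """Stage 2: place every actor and prop in one shared 3D space (stubbed)."""
    ...

def generate_aerial_shot(scene: SceneEstimate, camera_path: List[Vec3]) -> bytes:
    """Stage 3: only now synthesize the missing aerial shot, constrained by
    the inferred scene rather than by the model's 'imagination' (stubbed)."""
    ...
```

The point of the ordering is that generation is the last step: a traditional model starts at stage 3 with nothing but the prompt, while the reasoning task forces stages 1 and 2 to constrain it.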

Intelligent Emergence: Many video-generation companies have done well this year, and their technical paths are diverse. But you seem to think this is 'the last year of diverse development for video models', and that next year video generation will converge to a unified model. Why?

Song Jiaming: Look at the pattern image-generation models followed. Last year the industry was still unsure whether to build a unified image model; the prevailing tendency was to build a different workflow for each task and adjust or fine-tune the model accordingly. This year, the trend is to fold all the tasks into the same multimodal model.

Now, few people would claim they will build an architecture completely different from GPT-4o or Nano Banana. Once the architecture converges, the core of competition shifts from model design to data. The real question is whether you can collect enough high-quality data.

I think what happened in the image field this year will also happen in the video field next year.

Intelligent Emergence: On the technological path toward the unified model, what role does Ray 3 play for Luma?

Song Jiaming: Ray 3 is a milestone along the way.

The more important accumulation is infrastructure, whether training infrastructure, inference infrastructure, or the underlying data infrastructure. These may well matter more than the algorithmic accumulation itself.

Because after years of development, the core algorithms have not changed much. We are basically still using the autoregressive route (GPT-3) and the diffusion-model route (DDPM) from five years ago, with only minor changes since. So I think the most significant progress over this period has actually come from scaling, that is, from expanding model and data scale.
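
As a concrete reference for the two routes he names, here are toy sampling loops in the spirit of GPT-3-style autoregression and DDPM-style diffusion. The callables `next_token_probs` and `denoiser` are placeholders I've introduced for illustration, and the diffusion update is the standard simplified DDPM reverse step with the variance set to beta_t; none of this is Luma's code.

```python
import numpy as np

def autoregressive_sample(next_token_probs, prompt, n_new):
    """Autoregressive route: emit one token at a time, each conditioned on
    everything generated so far (greedy decoding for brevity)."""
    tokens = list(prompt)
    for _ in range(n_new):
        probs = next_token_probs(tokens)      # distribution over next token
        tokens.append(int(np.argmax(probs)))
    return tokens

def ddpm_sample(denoiser, shape, alphas, alpha_bars):
    """Diffusion route: start from pure noise and iteratively denoise,
    using the standard DDPM reverse update with sigma_t^2 = beta_t."""
    x = np.random.randn(*shape)
    for t in reversed(range(len(alphas))):
        eps = denoiser(x, t)                  # predicted noise at step t
        a, a_bar = alphas[t], alpha_bars[t]
        x = (x - (1 - a) / np.sqrt(1 - a_bar) * eps) / np.sqrt(a)
        if t > 0:                             # add stochasticity except at t=0
            x += np.sqrt(1 - a) * np.random.randn(*shape)
    return x
```

Both loops fit in a dozen lines each, which is the point of the observation: the recent gains came less from new loop structure than from scaling what runs inside them.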

Intelligent Emergence: In your mind, what is the relationship between these directions, unified multimodality and video-reasoning models, and AGI?

Song Jiaming: I have relatively strict standards for AGI.

Many people now say that 'certain coding models have surpassed most programmers'. I agree that in that respect they can be called 'superhuman'. But if that is the whole standard, then calculators surpassed human mental arithmetic long ago, and we don't call calculators AGI. For me, if there is anything humans can do that AI cannot do at all, it can't be called AGI.

Currently, AI still lags far behind humans in many areas, such as autonomous driving, robotics, embodied intelligence, and long-term planning and execution in the real physical world.

The significance of the unified multimodal video model for AGI lies in ultimately extending the ability to understand and act on the real world from pure language into the dimensions of vision, action, and time.

 

△ Luma AI's model can generate high-definition, imaginative HDR video clips from prompts alone. Photo: provided by the interviewee

The To-C Era Hasn't Arrived Yet

Intelligent Emergence: From a product perspective, what inspiration can the popularity of Sora 2 and Nano Banana bring to model companies?

Song Jiaming: I think one important lesson is to design usage scenarios from a product perspective and find the hooks that drive people to use the product, so that technical features become something people spread and talk about.

Intelligent Emergence: When Luma AI launched Dream Machine, we discussed in an interview how it could largely serve To-C users with little design or AI experience. But the company later shifted its focus to B-end professional users. Why?

Song Jiaming: I prefer to see it as a gradual process rather than a sudden change.

We can first draw an analogy with language models: chatbots were hugely popular with consumers last year, but this year people talk more about clear To-B and To-Pro scenarios such as coding and AI agents.

For ordinary users, there isn't much difference between different chatbots, and they aren't willing to pay high subscription fees. But for programmers, if a tool can double their output, the company is willing to pay for this tool on their behalf.

The same logic applies to video models. To-C users get bored with video generation quickly and may not pay reliably. B-end customers, by contrast, such as film and television companies, advertisers, and content producers, show far higher willingness to pay and far more stickiness once they find that AI saves substantial manpower, time, and hardware in their core workflows.

Intelligent Emergence: Sora 2 from OpenAI has been widely used for making memes on social platforms. Do you think this means video-generation models are starting to target the To-C market?

Song Jiaming: I think OpenAI's To-C strategy and a To-C strategy for video-generation models are not the same thing. OpenAI focuses on To-C mainly because its valuation has reached $500 billion; if it went To-B, no B-end market would be large enough to accommodate it.

OpenAI is a business and needs to find bigger growth engines, the same logic as Meta and ByteDance: once a company reaches a certain scale, it will inevitably push into To-C to maximize that scale. But that doesn't mean the whole video-generation field should, or can, target To-C.

Olivia Moore, a partner at the well-known US investment firm A16z, posted data on social media showing that Sora 2's 30-day retention rate was only 1% and its 60-day retention under 1%, while TikTok sustains retention around 30%. That indirectly shows that Sora 2's meme moment does not mean video-generation models have cracked the To-C market.

Intelligent Emergence: What real difficulties do video-generation models face in targeting the To-C market?

Song Jiaming: From a purely technical perspective, short-video platforms already carry plenty of AI-generated content, so targeting To-C is not impossible. The difficulty is figuring out whether the business model can be profitable.

From a business-model perspective, I haven't seen the value of To-C video-generation applications as social products.

Today's Douyin, YouTube, and Instagram are essentially 'social + distribution' platforms. Most people watch the top 1% most popular videos, and shared topics form around that content. If in the future everyone watched only videos that AI customized for them, the resonance between people would shrink, and there would be no common ground of 'watching the same thing', which runs against the basic logic of social interaction.

Intelligent Emergence: Many companies are doing well in video generation now. Do you feel heavy competitive pressure on the To-B side?

Song Jiaming: If you only look at public discussion, the competition seems ferocious. But in the US To-B market, the actual pressure is not as great as it looks.

The reasons are quite practical. First, for political and compliance reasons, almost every supplier that survives a serious US enterprise's screening is US-based. That list is actually quite short: Google, us, and a few other US startups.

Second, the US To-B market is more mature, with far higher acceptance of software subscriptions, API fees, and enterprise services. To-B being 'easy to do' doesn't mean it's effortless; it means the business model is clearer.

Intelligent Emergence: Since Dream Machine launched in June 2024, commercialization seems to have gone quite well. But Luma started with its 3D-generation business. How did commercialization go back then? Where do the main differences come from?

Song Jiaming: We had commercialization attempts in the 3D field before, but I don't think they were scalable or very successful.

At that time, 3D-generation technology was weaker than video in both quality and application scenarios.

Today, the most common applications of 3D-generation models are still concentrated in games and digital humans. There aren't many game companies with deep in-house technical capability, so the potential customer base may be fairly small. And large companies like Tencent, which have both strong 3D capabilities and game businesses, would in theory rather build their own foundational capabilities than rely on external models over the long run.