Exclusive Interview with the Chief Scientist of Luma AI: The Rules of the Game for Video Generation Models Have Changed
Text by | Fu Chong, Zhou Xinyu
Edited by | Su Jianxun
"If we only focus on iterating video generation itself by 2026, it won't be enough." Song Jiaming, the Chief Scientist of Luma AI, made this prediction to Intelligent Emergence.
Founded in 2021, Luma AI is a star startup in the US video-generation field. Intelligent Emergence learned that Luma AI recently completed a $900 million Series C financing at a valuation of $4 billion. The round was led by HUMAIN, an institution under the Saudi Public Investment Fund (PIF), and existing shareholders such as AMD Ventures, Andreessen Horowitz, Amplify Partners, and Matrix Partners all significantly increased their stakes.
While most video-generation AI companies are still competing on longer durations and better image quality, Song Jiaming laid out his "different view": what really needs to improve in the next stage is not the picture itself, but the model's ability to understand and reason about the real world.
He gave an example from a film set: in production, if the director needs to reshoot a missed overhead shot, a traditional video-generation model only generates content from the prompt, and its details are prone to mismatching the shots before and after it.
A reasoning model, however, can understand the scene space, character positions, and camera logic of the existing clips, and so produce a video that is more physically plausible and transitions more smoothly.
This is why reasoning video-generation models can be used in professional film and advertising work, which forms the basis of their monetization.
"The popular trend of meme - making by the public triggered by Sora 2 does not mean that the To C era of video models has arrived. After the novelty wears off, ordinary users are unlikely to continue paying," Song Jiaming explained the current commercial situation of video - generation models.
The key to stronger reasoning in video-generation models is training a "unified multimodal" model on language, image, and video data together. Multimodal fusion gives the model richer and more diverse data, upgrading its capability from "generation" to "understanding".
This path has already been validated in image generation this year: in 2024 the industry still disagreed about multimodal architectures, but since entering 2025, image-generation models have largely folded tasks such as text-to-image and image editing into a single unified model, and the focus of competition has shifted from architecture design to collecting high-quality data.
He believes video-generation models will go through the same convergence next year.
Continuously anticipating the next technological and commercial direction and pushing beyond itself is what Luma AI has always done.
The company started with 3D generation and, at the end of 2023, shifted to video-generation models, a much larger market.
In June 2024, Luma AI launched Dream Machine, a video-generation model aimed at AI and design beginners, and began exploring the C-end market. Dream Machine attracted one million users in four days with zero promotion spend; its cinematic camera work and generation quality earned it an industry reputation as "a video-generation model that can compete with Sora".
However, Luma did not rest on its C-end popularity. Since this year, Luma AI has gradually shifted its focus to B-end professional users with a stronger willingness to pay and more rigid demand, such as film, advertising, and content-production institutions.
In September this year, Luma AI launched Ray 3, the world's first large-scale video-reasoning model.
But in a recent exclusive interview, Song Jiaming offered Intelligent Emergence a new judgment: Ray 3 is likely to be Luma's last generation of traditional video-generation models. The company has made the "unified multimodal model" the core direction of its next stage.
This goal also requires greater computing power and financial support.
HUMAIN, one of the investors in this round, is building "Project Halo", a 2GW artificial-intelligence supercomputing cluster in Saudi Arabia and one of the world's largest computing-infrastructure projects. As a core customer, Luma AI will use this computing power to train its next-generation multimodal world model, further strengthening its video-reasoning and unified-model capabilities.
From starting in 3D generation, to drawing C-end attention with Dream Machine, to now serving B-end professional customers through its bets on reasoning and unified multimodal models, each of Luma AI's key decisions has been an expansion built on its existing business.
In the interview, Song Jiaming detailed his observations on the video-generation industry today and his predictions for where it is headed. The following is drawn from the conversation, edited and organized by the authors:
△ Song Jiaming. Photo: provided by the interviewee
The Future of Video-Generation Models: Reasoning Ability and Unified Multimodality
Intelligent Emergence: You once said that "Ray 3 might be Luma AI's last generation of traditional text-to-video models". How do you understand this statement?
Song Jiaming: My judgment is that future large models will no longer treat image, video, audio, and text as isolated modalities, but will process them in a unified framework. This is what we call a "unified multimodal" model.
The increase in data volume that unified multimodality brings will give video-generation models better reasoning ability, which helps the model process video more sensibly and helps users automatically spot problems in a video.
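To make the idea concrete, here is a minimal, hypothetical sketch of what "processing all modalities in one unified framework" can mean at the data level: text, image, and video are all mapped into a single token sequence that one model trains on. Everything below (the encoder helpers, the shared 1024-entry vocabulary) is an illustrative assumption, not Luma's actual architecture:

```python
from dataclasses import dataclass
from typing import List

TEXT, IMAGE, VIDEO = 0, 1, 2  # modality tags

@dataclass
class Token:
    modality: int  # which modality the token came from
    token_id: int  # index into one shared discrete vocabulary (hypothetical)

def encode_text(s: str) -> List[Token]:
    # stand-in for a real text tokenizer (e.g. BPE)
    return [Token(TEXT, ord(c) % 1024) for c in s]

def encode_image(pixels: List[int]) -> List[Token]:
    # stand-in for a learned visual tokenizer (e.g. a VQ codebook)
    return [Token(IMAGE, p % 1024) for p in pixels]

def encode_video(frames: List[List[int]]) -> List[Token]:
    # video is just more visual tokens, with temporal order preserved
    return [Token(VIDEO, p % 1024) for frame in frames for p in frame]

def build_sequence(prompt: str, image: List[int], clip: List[List[int]]) -> List[Token]:
    # One interleaved sequence: a single model attends across modalities,
    # so "understanding" and "generation" share weights and training data.
    return encode_text(prompt) + encode_image(image) + encode_video(clip)

seq = build_sequence("overhead shot of the set", [12, 55, 200], [[3, 9], [4, 8]])
print(len(seq), "tokens across", {t.modality for t in seq}, "modalities")
```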
The reason language models are so useful is that they have strong capabilities such as in-context learning and zero-shot learning, as well as strong reasoning. I think these abilities will eventually appear in the visual and video modalities too, rather than the competition staying at longer durations and better-looking image quality.
Intelligent Emergence: Can you use a specific example to explain the difference between a video-reasoning model and a traditional video model?
Song Jiaming: Take an example from film shooting. On a real set, the crew sets up several cameras at once to capture multiple actors from different angles. Suppose that after wrapping, the director suddenly realizes he forgot to shoot an overhead establishing shot and needs the AI to "reshoot" one.
If you only use a traditional video-generation model here, it will probably "use its imagination" to generate an overhead shot that looks fine at a glance, but on closer inspection the positions of the characters and the layout of background objects may not match the earlier shots.
In the video-reasoning task we defined, the model needs to "understand and reason" before it "generates": it has to find correspondences for the same background objects across the different cameras' footage, infer the position of each actor and prop in a unified three-dimensional space, and only then generate a video from the brand-new overhead perspective that is physically plausible, has natural camera movement, and connects seamlessly with the earlier shots.
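As a rough illustration of that "understand first, generate second" ordering, here is a hedged sketch of the reshoot task as a three-step pipeline. Every function is a hypothetical stand-in (a real system would do feature matching and triangulation, not string lookups); only the ordering of the steps reflects the idea described above:

```python
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]

def match_across_views(views: Dict[str, List[str]]) -> List[str]:
    # Step 1 (understand): find the background objects that appear in
    # every camera's footage (stand-in for real feature matching).
    common = set.intersection(*(set(objs) for objs in views.values()))
    return sorted(common)

def infer_scene_layout(objects: List[str]) -> Dict[str, Point3D]:
    # Step 2 (reason): place each actor/prop in one shared 3D space
    # (placeholder coordinates; a real system would triangulate).
    return {name: (float(i), 0.0, 2.0) for i, name in enumerate(objects)}

def generate_overhead_shot(layout: Dict[str, Point3D]) -> str:
    # Step 3 (generate): only now synthesize the missing shot,
    # conditioned on the inferred layout so positions stay consistent.
    return f"overhead shot consistent with {len(layout)} tracked objects"

views = {
    "cam_A": ["actor_1", "actor_2", "table"],
    "cam_B": ["actor_1", "actor_2", "table", "lamp"],
}
print(generate_overhead_shot(infer_scene_layout(match_across_views(views))))
```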
Intelligent Emergence: Many video-generation companies have achieved good results this year, and their technical paths are diverse. But you seem to think this is the last year of diverse development for video models, and that next year video generation will converge to a unified model. Why?
Song Jiaming: Look at the pattern image-generation models followed. Last year, people were not yet sure whether to build a unified image model; or rather, they tended to build different toolchains for different tasks, then adjust or fine-tune models task by task. This year, the trend is to fold all of those tasks into one multimodal model.
Now, few people say they will build an architecture completely different from GPT-4o or Nano Banana. Once the architecture is unified, the core of competition shifts from model design to data, and the real question is whether you can collect enough high-quality data.
I think what happened to images this year will also happen in the video field next year.
Intelligent Emergence: On the technical path toward the unified model, what role does Ray 3 play for Luma?
Song Jiaming: Ray 3 is a milestone along the way.
The more important accumulation is the infrastructure, whether training infrastructure, inference infrastructure, or the underlying data infrastructure. In fact, that may matter more than the algorithmic accumulation itself.
Algorithmically, there have not been many core changes in the past few years. We are basically still running on the autoregressive route (GPT-3) and the diffusion-model route (DDPM) from five years ago, with only minor changes since. So I think the most significant progress over this period actually comes from scaling, that is, expanding the scale of models and data.
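For reference, here is a minimal sketch of the DDPM training objective he cites, following the standard formulation (noise an input, train a network to predict the noise). The eps_theta placeholder stands in for the learned network; the schedule values are the commonly used linear defaults, not Luma's settings:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # standard linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal fraction per step

def eps_theta(x_t: np.ndarray, t: int) -> np.ndarray:
    # placeholder for the learned noise-prediction network
    return np.zeros_like(x_t)

def ddpm_loss(x0: np.ndarray, rng: np.random.Generator) -> float:
    t = int(rng.integers(T))                # sample a random timestep
    eps = rng.standard_normal(x0.shape)     # the noise to be predicted
    # forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return float(np.mean((eps - eps_theta(x_t, t)) ** 2))  # MSE on the noise

rng = np.random.default_rng(0)
print(ddpm_loss(rng.standard_normal((4, 4)), rng))
```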
Intelligent Emergence: How do the two directions of unified multimodality and video-reasoning models relate to the AGI in your mind?
Song Jiaming: I have a relatively strict standard for AGI.
Many people now say that "certain coding models have surpassed most programmers". I agree that in that respect they can be called "superhuman", but if that were the only criterion, calculators surpassed human mental arithmetic long ago, and we don't call calculators AGI. For me, if humans can do a task and AI cannot do it at all, it can't be called AGI.
In many respects, AI still has a long way to go compared to humans: autonomous driving, robotics, embodied intelligence, and long-term planning and execution in the real physical world.
The significance of the unified multimodal video model for AGI is that it ultimately extends the ability to understand and act in the real world from pure language space into the dimensions of vision, action, and time.
△ Luma AI's model can generate high-definition, imaginative HDR video clips from prompts alone. Photo: provided by the interviewee
The To C Era Hasn't Arrived Yet
Intelligent Emergence: From a product perspective, what inspiration did the popularity of Sora 2 and Nano Banana bring to model companies?
Song Jiaming: I think an important lesson is to design usage scenarios from a product perspective and find the hooks that drive users to engage, so that technical features can become points of viral spread.
Intelligent Emergence: When Luma AI's Dream Machine was launched, we discussed in an interview how it could largely serve C-end users with little design or AI experience. But later, the company gradually shifted its focus to B-end professional users. Why?
Song Jiaming: I prefer to see it as a gradual process rather than a sudden turn.
We can first draw an analogy with language models: chatbots were hugely popular on the C-end last year, but this year people talk far more about clear To B and To Pro scenarios such as coding and agents.
For ordinary users, there isn't much difference between chatbots, and they aren't willing to pay high subscription fees. But for programmers, if a tool can double their output, their company is willing to pay for it on their behalf.
The same logic applies to video models. C-end users tire of video generation quickly and may not have a stable willingness to pay. B-end customers, by contrast, such as film companies, advertising companies, and content producers, show much higher willingness to pay and much higher stickiness once they find that AI can save large amounts of manpower, time, and hardware in their core workflows.
Intelligent Emergence: Earlier, OpenAI's Sora 2 was widely used for meme-making on social platforms. Do you think this means video-generation models are starting to move toward the C-end?
Song Jiaming: I think OpenAI's To C strategy and a To C turn for video-generation models are not the same thing. OpenAI pursues the To C market mainly because its valuation has reached $500 billion; if it focused on To B, there doesn't seem to be a B-end market large enough to support that.
OpenAI is a business and needs to find bigger growth drivers, on the same principle as Meta and ByteDance: once an enterprise reaches a certain scale, it will inevitably push into the To C market in search of maximum scale. But that doesn't mean the entire video-generation field should, or can, move To C.
Olivia Moore, a partner at the well-known US investment firm a16z, posted data on social media showing that Sora 2's 30-day retention was only 1% and its 60-day retention under 1%, whereas TikTok's retention holds at around 30%. That indirectly shows that Sora 2's meme-making moment does not mean video-generation models have successfully penetrated the C-end market.
Intelligent Emergence: What real difficulties do video-generation models face in moving toward the C-end?
Song Jiaming: From a purely technical perspective, there is already plenty of AI-generated video on short-video platforms, so targeting the C-end is technically achievable. The difficulty is figuring out whether the business model works.
From a business-model perspective, I haven't seen the value of To C video-generation applications as social products.
Today's Douyin, YouTube, and Instagram are essentially "social + distribution" platforms. Most people watch the most popular 1% of videos, and public conversation forms around that content. If in the future everyone watches videos AI customizes for them 100% of the time, resonance between people will fade, and the shared basis of "watching the same thing" will be lost, which runs against the basic logic of social products.
Intelligent Emergence: There are many companies doing well in video generation now. Do you think the competition pressure is high on the To B side?
Song Jiaming: If you only look at the public discourse, the competition seems extremely fierce. But in the US To B market, the actual pressure is not as great as it appears.
The reasons are quite pragmatic. First, because of political and compliance factors, almost all the vendors that survive screening for a serious US enterprise's shortlist are US-based. That list is actually very short: Google, us, and a few US startups.
Second, the US To B market is more mature, with much higher acceptance of software subscriptions, API fees, and enterprise services. The To B business is "easier" not because it is effortless, but because the business model is clearer.
Intelligent Emergence: Since Dream Machine launched in June last year, its commercialization seems to have gone quite well. But Luma started out with a 3D-generation business; how did commercialization go at that time, and where do the main differences come from?
Song Jiaming: We tried to commercialize the 3D technology earlier, but I don't think it was scalable or especially successful.
At the time, 3D-generation technology was weaker than video technology in both quality and application scenarios.
Even now, the most common applications of 3D-generation models are concentrated in fields such as games and digital humans. There aren't many game companies with deep technical capabilities, which means the potential customer base may be fairly small. Moreover, large companies like Tencent, which have both strong 3D capabilities and game businesses, would in theory prefer to build such foundational capabilities in-house and are unlikely to rely on external models long-term.
Technically, there is far less 3D data than video data, and the AR/VR ecosystem as a whole is not yet mature enough to the