
The Year of Application Explosion: Discussing the Evolution and Commercialization of Model Technology

Silicon Valley 101 | 2026-02-04 11:17
The logic, cost, and hidden challenges of "pioneering" in the commercial frontline of large models

Two years ago, when "Silicon Valley 101" discussed large models on the podcast, the general feeling was still "interesting but not very useful": hallucinations, slow responses, and high costs made it seem like there were a few more steps to go before they could become real productivity tools.

In 2026, the changes came faster than expected. Xu Dong, general manager of Alibaba Cloud's Qianwen large model business, told "Silicon Valley 101" that a small team of five or six people can now generate 6,000 advertising videos a day with AI, at a cost of under 10 yuan per video, below the market price of 20 to 50 yuan: the business loop has been closed. AI comic dramas are also booming, the domestic short drama market has already surpassed the movie market in scale, and video generation models are evolving from 5-second to 15-second clips, expected to exceed one minute by the end of the year.

The changes on the cost side are even more drastic. Xu Dong shared a set of figures: the inference cost of Qianwen is dropping by nearly 10x every six months. Inference speed has jumped from 30-50 TPS to 80-100+, and first-packet latency has fallen from 2 seconds to 500 milliseconds. He said that the small 4B models on the edge today already exceed the capabilities of the largest closed-source models of two years ago, and more than 70% of general tasks can be processed locally on mobile phones and in-vehicle systems.
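A quick back-of-the-envelope check of how those quoted rates compound; the midpoint values chosen below are illustrative assumptions, not figures from the interview:

```python
# Illustrative arithmetic only: compounding the quoted cost/latency figures.
cost_now = 1.0                 # normalized inference cost today
drop_per_half_year = 10        # "nearly 10x every six months"

cost_in_one_year = cost_now / drop_per_half_year ** 2
print(f"cost after one year: {cost_in_one_year:.0%} of today")       # ~1%

tps_speedup = 100 / 40              # 30-50 TPS -> 80-100+; midpoints assumed
first_packet_speedup = 2000 / 500   # 2 s -> 500 ms
print(f"~{tps_speedup:.1f}x throughput, {first_packet_speedup:.0f}x faster first packet")
```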

2025 was called the "Year of AI Applications" by many. If the keyword of the previous two years was "what can the models do", then this year every enterprise is asking the same question: is it really cost-effective to use AI?

In this episode of the podcast, "Silicon Valley 101" invited Xu Dong, general manager of Alibaba Cloud's Qianwen large model business, Professor Qi Lu, director of the Insta360 Research Institute, and Lv Yingjie, co-founder and CEO of Yuyi Technology, to dissect the logic, costs, and hidden challenges of "pioneering" on the commercial front line of large models.

The following is an edited selection from the conversation:

01 Technological Progress and Commercialization of Models

Hong Jun: I'm very glad you could come to our podcast. From the end of 2025 to the beginning of 2026, two topics have been widely discussed: AI Agent and AI applications, which have really started to enter people's lives. I'm very happy to invite you to talk about some trends in the commercialization direction of large models. Before that, could you briefly introduce your work at Alibaba and what you are mainly responsible for?

Xu Dong: Alibaba has been working on large models for a long time; the earliest can be traced back to around 2021. The first is a model that is particularly well known overseas: Qianwen, English name Qwen. It is already a very large-scale presence in the open-source field, and many North American companies use it as a base model. It represents our language model, and we use it to push the limits of AI's intelligence, to see whether it can use more tools and enter more production processes. This is one of our most important models.

The second is a visual generation model called Wanxiang, English name Wan. It can generate and edit both images and videos, and it also has the potential to become a paradigm for future world models. In the past three months, we have released the 2.5 preview and version 2.6 and received a lot of new feedback on video creation.

The third base model is Fun, a pure audio model released at last year's Yunqi Conference. It covers ASR and TTS and also includes voice cloning. We strive to make it more lifelike and to support multiple languages, dialects, and accents, so that it can both understand and express itself better.

Our model lineup is quite structured. On top of these base models, we have started communicating and cooperating with many customers, including manufacturers, brick-and-mortar enterprises, brand owners, and many Internet companies. But over the past year, I've noticed a large number of AI-native companies emerging. They have built good products on these models and achieved very good ROI in many fragmented markets.

Hong Jun: You just mentioned several large models. In your opinion, in which directions did the most important evolutions of model technology occur in 2025?

Xu Dong: There have been many evolutions in the past six months. Take the video generation model represented by Wanxiang (Wan) as an example: if we map it to the jump from GPT-3.5 to GPT-4, I think it has reached the GPT-4 level. In the past, video generation models were mainly used for special effects and entertainment, but now they can enter production. For example, the recently popular AI comic dramas are growing very fast, and advertising videos are being generated automatically: many 15-second sliced ads are now produced through a complete pipeline, so perhaps five people can generate 6,000 videos a day. This is a very clear trend.

From a technical perspective, there are several interesting features:

First, generation has become longer. Video generation has moved from the earlier 5- and 10-second clips into the 15-second era, and may reach one minute in the future, which will make content coherence better.

Second, the camera language has become more diverse. The model can switch between different shots and adjust lighting, approaching professional film-level capability, and users can achieve this through simple prompts.

Third, the ability to maintain character consistency. Inspired by Sora 2's Cameo feature, the model can keep characters, objects, backgrounds, and voices consistent in role-playing, that is, "preserve the ID", which provides better room for extension in subsequent creation.

To put it simply: we hope to extend generation beyond 15 seconds, and we currently offer the longest video generation in China; we hope the model can lower the threshold for shot switching and lighting changes, which used to require professional directors, cinematographers, and artists working together; and finally, I believe consistency in role-playing will become standard in all future video generation models.

AI anime generated by the Wanxiang model. Image source: Wanxiang Wan

Hong Jun: What exactly does role - playing refer to?

Xu Dong: For example, you can shoot a 5-second video of yourself on your phone, looking up or turning your head and saying a few words, similar to an audition. Once this video is fed in, the model can "preserve the ID" of your image and voice, and that image and voice can then be reproduced in subsequent creations.
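A minimal sketch of what such an "ID-preserving" call could look like. The types, function names, and parameters below are hypothetical illustrations, not the actual Wanxiang (Wan) API:

```python
# Hypothetical sketch of an "ID-preserving" generation request.
from dataclasses import dataclass

@dataclass
class IdentityRef:
    """A short reference clip that pins down a character's face and voice."""
    video_path: str          # e.g. a 5-second phone "audition" clip
    preserve_face: bool = True
    preserve_voice: bool = True

def generate_scene(identity: IdentityRef, prompt: str, duration_s: int = 15) -> str:
    """Pretend call: reuse the captured identity across new generated scenes."""
    # A real service would extract face/voice embeddings from the clip and
    # condition the video model on them so the same "ID" persists.
    return f"video conditioned on {identity.video_path}: {prompt} ({duration_s}s)"

me = IdentityRef(video_path="audition_5s.mp4")
print(generate_scene(me, "the same character walks through a rainy street"))
```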

Hong Jun: So in the future, only a 5-second real-person appearance is needed, and everything afterward can be generated by AI and operated through the model.

Xu Dong: Yes, this means that the controllability of the generated content is higher. In the past, it may have relied on random generation, but now more reference dimensions can be provided at the input end. In the field of anime creation, this is quite common. In the past, reference pictures were used, and now reference videos are starting to be used.

Hong Jun: You just mentioned that, for example, five people can generate 6,000 videos a day, and AI comic dramas are quite popular. Based on the improvement of model capabilities, what good commercial cases or applications have you seen?

Xu Dong: The domestic short drama market has surpassed the movie market. On the video production side, a noticeable share of short dramas has started shifting from live-action shooting and heavy manual editing to AI generation. The most popular format recently is the comic drama, which evolved from motion comics: it has a coherent plot and strong commercialization potential, and has become a typical example of combining with AI.

Hong Jun: Across AI-generated short dramas, real-person IP short dramas, and mass-produced AI advertisements, the question people care about most is: what do AI and human labor each cost? And what do manufacturers value when deciding whether to adopt a model?

Xu Dong: Productions are now classified into S-level, A-level, and B-level by quality. For a short drama, with reasonable costs, AI may be able to keep the budget below 20,000 yuan. Factoring in promotion and ROI, it is possible to break even or earn good revenue. If higher quality is required, more post-production resources must be invested and costs rise, but the quality of the drama improves as well.

In advertising, the AI cost of a 15-second video can be kept below 10 to 15 yuan, which leaves good commercial headroom: the market price of a qualified 15-second advertisement is generally between 25 and 50 yuan, so a healthy business loop forms.
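Putting those quoted numbers together, the unit economics can be sanity-checked with simple arithmetic; pairing the worst and best cases this way is our own illustrative assumption:

```python
# Back-of-the-envelope unit economics for the 15-second ad figures quoted above.
ai_cost_per_ad = (10, 15)      # yuan, AI generation cost range
market_price = (25, 50)        # yuan, market price range for a qualified ad
videos_per_day = 6000          # output quoted earlier for a five-person team

worst_margin = market_price[0] - ai_cost_per_ad[1]   # 25 - 15 = 10 yuan
best_margin = market_price[1] - ai_cost_per_ad[0]    # 50 - 10 = 40 yuan
print(f"Per-ad margin: {worst_margin}-{best_margin} yuan")
print(f"Daily margin at full output: "
      f"{worst_margin * videos_per_day:,}-{best_margin * videos_per_day:,} yuan")
```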

AI short video generated by the Wanxiang model. Image source: Wanxiang Wan

Hong Jun: The cost is really low. Alibaba has the Taobao e-commerce ecosystem; are these e-commerce sellers using AI to create advertisements?

Xu Dong: The structure is quite complex. Each traffic platform now gives advertisers or agencies a certain amount of editing capability matched to the platform. Advertisers also hold large volumes of creative material, produced in-house or by third-party agencies. Ad generation may be done by the agencies themselves or subcontracted to AI-native startups, of which there are more and more. They wire the Wanxiang and Qianwen models into a pipeline, reaching the five-or-six-people, 6,000-ads-a-day capability mentioned earlier, and then hand the output to the advertising teams of agencies or traffic platforms.
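A minimal sketch of such a batch pipeline, with hypothetical placeholder functions standing in for the Qianwen (Qwen) scripting step and the Wanxiang (Wan) rendering step; neither reflects the real SDKs:

```python
# Sketch: script with a language model, render with a video model, in bulk.
from concurrent.futures import ThreadPoolExecutor

def write_storyboard(product: str) -> str:
    """Placeholder for a Qwen call that turns a product brief into a script."""
    return f"15s storyboard for {product}"

def render_video(storyboard: str) -> str:
    """Placeholder for a Wan call that renders the storyboard into a clip."""
    return f"clip({storyboard})"

def make_ad(product: str) -> str:
    return render_video(write_storyboard(product))

# A handful of operators can batch thousands of briefs through such a pipeline.
briefs = [f"product_{i}" for i in range(6000)]
with ThreadPoolExecutor(max_workers=32) as pool:
    ads = list(pool.map(make_ad, briefs))
print(len(ads), "ads generated")
```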

Hong Jun: It's quite interesting. This is about the video generation model. You also have an audio generation model and the Qianwen model. What progress and breakthroughs did the other two models make in 2025?

Xu Dong: The language model is undergoing continuous and profound changes. Although it's difficult to see a huge paradigm shift, I'll briefly talk about what we're doing.

First, high-quality data is becoming scarcer, and everyone is working with it more carefully. By adjusting the ordering and framing of the data, the model learns knowledge more efficiently and performs better in corner cases.

Second, model structures are becoming sparser. Techniques such as Multi-Token Prediction (MTP) are landing in different models, and speed will rise, even double. First-packet response time may shrink from 2 seconds to 500 milliseconds, and TPS may climb from 30-50 to 80-100 or more, which matters a lot in latency-sensitive scenarios.

You can also see this from the perspective of machine throughput: as the model becomes sparser, inference cost drops too, perhaps by an order of magnitude.
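A toy model of why multi-token prediction can roughly double decoding speed: if each forward pass drafts k tokens and each drafted token survives verification with some probability, throughput scales with the expected number accepted per pass. This simplification is ours, not a description of Qianwen's implementation:

```python
# Toy throughput model for MTP-style drafting with joint verification.
def effective_tps(base_tps: float, k: int, accept_rate: float) -> float:
    """Expected throughput when each pass drafts k tokens in sequence."""
    # A draft token is only useful if all earlier drafts in the pass survived,
    # so the expected accepted count per pass is a geometric partial sum.
    expected_accepted = sum(accept_rate ** i for i in range(1, k + 1))
    return base_tps * expected_accepted

print(effective_tps(40, k=1, accept_rate=1.0))   # 40 TPS: plain one-token decoding
print(effective_tps(40, k=3, accept_rate=0.8))   # ~78 TPS: roughly the 2x quoted
```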

In addition, instruction following, Agent abilities (especially tool calling), and context length are all evolving continuously. As coding ability improves, there may be large numbers of continuously running Agents in the future. Unlike today's chatbots, which return immediate results, they can use idle compute for AI-for-science research or to produce in-depth reports, calling retrieval engines, CRM, ERP, and other tools behind the scenes. If they can wield that many tools, we believe the output quality will surely beat that of pure text models.
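A minimal sketch of the difference between a one-shot chatbot turn and a continuously running, tool-using agent loop; the stand-in model call and tool registry below are invented for illustration, not a real framework:

```python
# Sketch: an agent loop that keeps calling tools until the task is done.
def call_model(history: list[str]) -> dict:
    """Stand-in for an LLM call that either uses a tool or gives a final answer."""
    if not any(h.startswith("tool:") for h in history):
        return {"action": "tool", "name": "search", "args": "Q3 revenue"}
    return {"action": "final", "text": "Report drafted from retrieved figures."}

TOOLS = {"search": lambda q: f"results for {q!r}"}   # could be retrieval, CRM, ERP...

def run_agent(task: str, max_steps: int = 8) -> str:
    history = [f"task: {task}"]
    for _ in range(max_steps):            # runs until done, unlike a chatbot turn
        step = call_model(history)
        if step["action"] == "final":
            return step["text"]
        result = TOOLS[step["name"]](step["args"])
        history.append(f"tool: {step['name']} -> {result}")
    return "step budget exhausted"

print(run_agent("write an in-depth revenue report"))
```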

Hong Jun: You've covered many detailed improvements in base models. When we tracked large models and Agents last year, we noticed that 2025 was a key year for AI to move from models to applications, with Agents exploding at scale and especially active application innovation in China. Why that particular moment? Was it the improvement in base model capability, or other key factors?

Xu Dong: First, reasoning ability. After OpenAI launched o1 at the end of 2024, models no longer relied purely on so-called probability and began to show logical preferences, which I think is a very fundamental change.

Second, model scale grew. The original challenge was that you simply couldn't train them, but with pre-training improvements, controllability increased even as models grew, and the ability to handle complex instructions strengthened. Work used to rely on deterministic workflows; now models' instruction following and understanding are strong enough that, given accurate context, they can balance generalization and accuracy.

Third, tool calling (Tool Use). As standards such as Claude Skills and MCP have gained acceptance, more and more tools have become explicit. Today's models have begun to break out of the input-output window and enter more production processes, which may not be a single chat box but a standard SaaS workflow or a hardware interaction.
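What "tools becoming explicit" means in practice is that a tool ships with a machine-readable schema the model can plan against. The schema below is a generic, illustrative shape in the spirit of function-calling and MCP-style definitions, not a copy of any specific standard:

```python
# Illustrative tool schema a model could be given alongside a user request.
order_coffee_tool = {
    "name": "order_coffee",
    "description": "Place a coffee order at a nearby shop.",
    "parameters": {
        "type": "object",
        "properties": {
            "drink": {"type": "string", "description": "e.g. 'latte'"},
            "shop_id": {"type": "string"},
            "pickup_time": {"type": "string", "format": "date-time"},
        },
        "required": ["drink", "shop_id"],
    },
}
# A model given this schema can emit a structured call instead of free text,
# which is what lets it step out of the chat window into real workflows.
```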

Hong Jun: I noticed a trend at the recently concluded CES: every product wants to be related to AI, such as earphones, smart glasses, and video-shooting and editing tools. Many Chinese companies exhibited this year, and the Qianwen large model could be seen behind quite a few of them. Could you talk about the role of large models in AI hardware products?

Xu Dong: Combining models with hardware is not new. As early as ten years ago, ASR and CV models were paired with hardware, but the commercial value was limited. This time, I think the most important change is that models have become more human-like and can perform more tasks. In the past they could only recognize; now they can understand and deliver the result you want. Today, through the Qianwen App, you can order coffee or book a seat directly in natural language. For glasses, being able to complete these tasks through natural language is a big difference from the past, and it is inseparable from voice, visual understanding, and text models.

Hong Jun: Has ordering coffee through glasses been realized?

Xu Dong: To be precise, it was realized long ago. The Qianwen App can order coffee through natural language and close the commercial loop, all within an architecture built on the large model. I just tried it: it generates cards. If the large model's operating interface were simply the same as the original app's, there would be challenges. When recommending coffee, it considers distance, preferences, and past choices, because when you're wearing glasses you want the AI to understand you and have memory, which is more convenient. You can switch between options, and tapping one brings up the full menu.