What industry trends lie behind the shift of AI's hotspots to multimodality?
Image source: Jiemian News
The hotspots of AI in 2025 are shifting.
Since DeepSeek R1 sparked industry enthusiasm in the first half of the year, the "pure text + reasoning" paradigm has produced few significant new results. In the second half, the conversation has clearly shifted to the multimodal field.
Sora 2 has been packaged into a deliverable application, and Google has launched Nano Banana with stronger image-editing capabilities. The same is true for AI agents: after general-purpose, text-centric products like Manus, LoveArt, which focuses on visual creation scenarios, has reached a similar level of popularity.
Behind this, text-model iteration has entered a stage of a high baseline and incremental improvements, while the usability of multimodal understanding and generation has taken another step toward breaking out of the AI niche into mainstream attention.
A researcher engaged in model training told a Jiemian News reporter that to understand this phenomenon, one must first recognize that research in the text and multimodal directions is parallel rather than sequential.
After major milestones such as GPT-3, GPT-4, and OpenAI o1, the language understanding of large models is already sufficient for C-end (consumer) applications. Subsequent optimization has concentrated on steady-state engineering: alignment, cost reduction, latency, and robustness. These further improve the consumer experience and B-end (enterprise) commercial value, but users no longer feel the jolt they did when GPT-4 first appeared.
A typical example is DeepSeek-OCR: a demo not sensational enough to dominate the conversation, but one with long-term influence.
DeepSeek-OCR was launched on October 20th, positioned as an exploration of the visual compression of text ("Contexts Optical Compression"). Simply put, as context input grows, the model's compute cost climbs steeply (quadratically, for standard attention). But by converting long text into an image and recognizing it, the number of tokens to process can be sharply compressed. The idea has now been verified, and once it lands in applications it is also a promising route to cutting costs and raising efficiency.
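To make the arithmetic concrete, here is a minimal sketch of the optical-compression idea, not DeepSeek-OCR's actual pipeline: render long text onto an image, then compare a rule-of-thumb text-token count against a patch-based vision-token count. The rendering geometry, the ~4-characters-per-token heuristic, and the single `downsample` factor standing in for the encoder's compression are all illustrative assumptions.

```python
# Illustrative sketch only: render text to an image and compare token budgets.
# Parameters (font geometry, chars-per-token, downsample factor) are assumptions,
# not DeepSeek-OCR's actual tokenizer or encoder settings.
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 1024, line_height: int = 16) -> Image.Image:
    """Render text onto a white canvas, wrapping at a fixed character count."""
    chars_per_line = width // 8  # assume ~8 px per character with the default font
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    img = Image.new("RGB", (width, line_height * max(1, len(lines))), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    for row, line in enumerate(lines):
        draw.text((0, row * line_height), line, fill="black", font=font)
    return img

def estimated_text_tokens(text: str) -> int:
    return len(text) // 4  # rule of thumb: ~4 characters per BPE token

def estimated_vision_tokens(img: Image.Image, patch: int = 16, downsample: int = 16) -> int:
    # One token per ViT-style patch, then a further compression factor standing
    # in for the heavy downsampling an OCR-oriented vision encoder applies.
    w, h = img.size
    return (w // patch) * (h // patch) // downsample

if __name__ == "__main__":
    long_text = "context " * 4000  # ~32,000 characters
    page = render_text_to_image(long_text)
    print("estimated text tokens:  ", estimated_text_tokens(long_text))   # ~8,000
    print("estimated vision tokens:", estimated_vision_tokens(page))      # ~1,000
```

Under these assumptions, roughly 32,000 characters come out to about 8,000 text tokens versus about 1,000 vision tokens; the real saving depends entirely on how aggressively the vision encoder compresses patches.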
The situation on the multimodal side is completely different: its capability curve is still climbing through territory that ordinary users can perceive. However, the aforementioned interviewee pointed out that, viewed in parallel, there has been no architecture-level breakthrough in multimodal models; the gains come mostly from sufficient data accumulation and improved training techniques.
Image source: Jiemian News
In his judgment of Sora 2 and Nano Banana, apart from OpenAI's early concept of a multimodal generation product taking shape and Google's grasp of current image-editing needs (such as anchoring a point for targeted modification), neither product achieves a leap in generation quality.
Moreover, in the multimodal generation field represented by text-to-image and text-to-video, performance gains largely rest on improvements in text models. Jiang Daxin, founder and CEO of Jieyuexingchen, previously told a Jiemian News reporter that the relationship between understanding and generation is that understanding controls generation, while generation supervises understanding.
The primary market is also registering this shift in focus. An AI investor told a Jiemian News reporter that, in his view, the number of investment events in the industry has increased this year while deal sizes have shrunk, a consequence of the smaller market scale and valuations at the application layer now that investment focus has moved from the model layer to the application layer.
The most prominent such deal this year is LiblibAI, an application-layer company in visual creation. On October 23rd, LiblibAI announced the completion of a $130 million Series B round, with participation from Sequoia China, CMC Capital, and others, making it the largest financing in China's AI application track this year. It means that, compared with other tracks, the team's PMF (product-market fit) is more recognized by capital.
In the long run, the "hotspots" that the industry can expect may mostly come from the multimodal field.
Jiang Daxin has always emphasized that intelligence based solely on language is not enough, and multimodality is the inevitable path for large models. In this field, the unification of understanding and generation remains the breakthrough point at this stage.
Many interviewees have told Jiemian News reporters that, from the perspective of model training, the visual modality faces greater challenges than text. Looking at data alone, text representations are semantically self-contained, whereas visual representations must first be aligned with text; no naturally self-contained visual data exists. "It may take several major technological shifts, on the scale of ChatGPT and the reinforcement-learning paradigm, to solve this problem," one interviewee said.
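As a concrete illustration of what "aligning with text first" means in practice, below is a minimal sketch of a CLIP-style symmetric contrastive loss, the standard recipe for tying image and text representations together. The encoders, batch size, and temperature here are placeholder assumptions; this is not any particular lab's training code.

```python
# Minimal sketch of CLIP-style image-text alignment (illustrative only).
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # pairwise cosine similarities
    targets = torch.arange(logits.size(0))         # matched pairs lie on the diagonal
    loss_img_to_txt = F.cross_entropy(logits, targets)
    loss_txt_to_img = F.cross_entropy(logits.T, targets)
    return (loss_img_to_txt + loss_txt_to_img) / 2

# Placeholder embeddings; in practice these come from separate image/text encoders.
image_batch = torch.randn(8, 512)
text_batch = torch.randn(8, 512)
print(clip_style_loss(image_batch, text_batch))
```

The point of the example is the dependency itself: an image alone supplies no training signal about its meaning, and the loss only exists because each image arrives paired with a caption, which is exactly the constraint the interviewee describes.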
One view holds that, on top of better multimodal models, world models, embodied intelligence, and spatial intelligence can all develop significantly, moving the industry closer to AGI (Artificial General Intelligence).
A more practical consideration is that the model sets the ceiling on application capability. While text models grind through cost reduction, efficiency gains, and slow performance improvement, breakthroughs in multimodal models are expected to open up more PMF opportunities for the market. In the eyes of entrepreneurs and investors, this will be the key and more valuable change.
This article is from Jiemian News, Reporter: Wu Yangyu, Editor: Wen Shuqi. Republished by 36Kr with permission.