
Li Yanhong has explained clearly why Baidu is not building a Sora.

Zhou Xinyu, 2024-11-12 13:03
Li Yanhong: Multi-modal models have not yet been widely applied because the hallucination problem has not been solved.

Written by Zhou Xinyu

Edited by Su Jianxun

At the Baidu World Conference held on November 12, 2024, the central question was what makes an AI application valuable.

Li Yanhong, founder, chairman, and CEO of Baidu Group, said that the conference theme, "The Application Has Come", reflects Baidu's perception and judgment of the current era of large models and generative artificial intelligence.

△ Daily average invocation volume change of Wenxin Large Model.

The daily average invocation volume of the Wenxin Large Model now exceeds 1.5 billion. Li Yanhong believes that a 10-fold increase in invocation volume within a year would show that market demand genuinely exists. In fact, he noted, Wenxin's invocation volume has grown nearly 10-fold in just half a year.
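To put the two growth figures side by side, the short sketch below converts a growth factor observed over some number of months into monthly and annualized factors. It is only illustrative: it assumes constant compound growth and treats "nearly 10 times" as exactly 10, neither of which is stated in the speech, and the function name `implied_growth` is made up for this example.

```python
def implied_growth(factor: float, months: int) -> dict:
    """Convert a growth factor observed over `months` into monthly and
    annualized factors, assuming constant compound growth (an assumption)."""
    monthly = factor ** (1 / months)       # growth factor per month
    annualized = factor ** (12 / months)   # growth factor over 12 months
    return {"monthly": monthly, "annualized": annualized}


# "Nearly 10 times in half a year", rounded to exactly 10x over 6 months.
half_year = implied_growth(10, 6)
# Li Yanhong's own bar for proven demand: 10x over 12 months.
benchmark = implied_growth(10, 12)

print(f"observed:  {half_year['monthly']:.2f}x per month")
print(f"benchmark: {benchmark['monthly']:.2f}x per month")
```

Under these assumptions, the observed pace works out to roughly 47% growth per month, against roughly 21% per month for the 10-fold-in-a-year bar.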

At the conference, Li Yanhong highlighted several points of consensus:

First, Retrieval-Augmented Generation (RAG) has become an industry consensus, because eliminating "hallucinations" is a prerequisite for deploying models in industry. Li Yanhong believes that the biggest change in large models over the past 24 months is that "hallucinations" have basically been eliminated.

Second, intelligent agents are the most mainstream form of AI application and the new carriers of content, information, and services in the AI-native era.

"Intelligent agent" was undoubtedly the most frequently used term at the conference. Li Yanhong compares intelligent agents to websites in the PC era and self-media accounts in the mobile era. The difference is that intelligent agents are more human-like and more intelligent.

He mentioned four application directions of intelligent agents: company type (such as sales customer service), role type (such as digital human live streaming), tool type (such as intelligent generation of industry reports), and industry type.

The commercial value of technology is also a theme that Li Yanhong repeatedly mentioned in his speech.

For example, he believes that the commercial value of iRAG lies in being hallucination-free, ultra-realistic, cost-free, and immediately available.

△ Li Yanhong's speech

Specifically, regarding the release of the 0-code development tool "Miaoda", Li Yanhong believes that the product value lies in achieving the unlimited expansion of productivity. In his words, this is "an unprecedented era where one can make money just by having ideas".

At the industrial implementation level, Li Yanhong mentioned that the value increment that large models bring to the industry is reflected in two aspects: cost reduction and efficiency improvement.

Currently, Baidu Intelligent Cloud Qianfan Large Model Platform has fine-tuned 33,000 models and developed 770,000 enterprise applications. More than half of the central and state-owned enterprises are users of Qianfan.

The integration of Baidu Library and Baidu Netdisk

In the organizational restructuring of September 2024, Baidu Netdisk returned to MEG and was placed under the Baidu Library BU, which laid the groundwork for integrating the ecosystems of the two content tool applications.

In the view of Wang Ying, Vice President of Baidu and the person in charge of Baidu Library and Baidu Netdisk, users of the two products have historically faced two pain points:

On the one hand, materials of different forms, categories, and formats cannot be edited and operated on within a single platform, nor can content be generated in whatever form or format the user needs;

On the other hand, the public domain knowledge in Baidu Library and the private domain knowledge in Baidu Netdisk are stored separately and cannot be combined into a complete body of knowledge.

The "Free Canvas" function launched by Baidu Library has become a bridge to connect the content of Baidu Library and Baidu Netdisk. In Li Yanhong's view, the Free Canvas is essentially a tool-type intelligent agent.

Just like an intelligent whiteboard, users can freely select and combine the content that needs to be operated on Baidu Library and Baidu Netdisk through point selection, conversation, and box selection.

Based on the underlying Mixture of Experts (MoE) architecture and multi-modal model, the Free Canvas can support cross-modal processing of files such as text, images, and videos, and finally can generate cross-modal content such as graphics and text.

The multi-modal content generated by the Free Canvas is adapted to the content formats of WeChat Moments and Xiaohongshu, such as image-plus-text and video-plus-text posts, and the tool can also produce professional content such as research reports with charts.

△ The Free Canvas generates a novel, comic, and video of Sun Wukong's modern adventure according to the requirements.

In the current situation where AI tool-type products are struggling to find a monetization model, Wang Ying believes that the business model of Baidu Library and Baidu Netdisk is inherently very compatible with large model products.

She told "Intelligent Emergence" that the charging model of Baidu Library and Baidu Netdisk is essentially profit-sharing with users. The products raise user retention and payment rates by delivering value to users and helping them make money.

"The AI capabilities can expand the functional boundaries of products, resulting in more combined products, bringing more benefits to users, and also increasing the payment conversion rate," Wang Ying told "Intelligent Emergence".

Before developing Sora, solve the "hallucination" problem first

Although Li Yanhong said that combining text with RAG technology has already achieved some results, he also pointed out that the combination of images and RAG is still far from sufficient.

"The multi-modal model has not been widely applied at present because the hallucination problem has not been solved," Li Yanhong pointed out in his speech.

This perception also determines Baidu's attitude towards Sora. Li Yanhong mentioned that when Sora appeared, Baidu's decision was not to follow up, but to start solving the hallucination problem of the multi-modal model.

At the conference, Baidu released iRAG, a text-to-image technology based on Retrieval-Augmented Generation. In Li Yanhong's words, iRAG can remove the "machine flavor" from generated images.

△ Images generated based on iRAG.

At the conference, Wang Haifeng, Baidu's CTO, walked through the technical pipeline that iRAG uses to achieve controllable image generation:

First, the large model analyzes and understands the user's needs, and automatically plans an accurate or generalized solution, such as which entities to enhance;

Next, in the enhancement stage, the entities that need to be enhanced are retrieved, and the corresponding references are selected;

Finally, in the generation stage, Baidu applies its independently developed multi-modal controllable image generation technology. On the one hand, local attention calculation lets the large model generate highly generalized images while keeping the entities' characteristics unchanged; on the other hand, overall attention calculation enables highly accurate image generation.
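The three stages described above (planning, enhancement, generation) can be sketched as a small pipeline. This is a toy illustration of the general retrieval-augmented pattern, not Baidu's implementation: the reference store, the keyword matching that stands in for the large model's understanding, and every function name here are hypothetical, and the final stage only assembles a request instead of calling an image model.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical reference store: entity name -> path of a reference image.
REFERENCE_IMAGES = {
    "great wall": "refs/great_wall.jpg",
    "panda": "refs/panda.jpg",
}


@dataclass
class Plan:
    prompt: str
    entities: list[str] = field(default_factory=list)


def plan(prompt: str) -> Plan:
    """Stage 1: decide which entities to enhance.
    Simple keyword matching stands in for the large model's analysis."""
    lowered = prompt.lower()
    found = [name for name in REFERENCE_IMAGES if name in lowered]
    return Plan(prompt=prompt, entities=found)


def retrieve(p: Plan) -> dict[str, str]:
    """Stage 2: look up a reference image for each planned entity."""
    return {entity: REFERENCE_IMAGES[entity] for entity in p.entities}


def generate(p: Plan, refs: dict[str, str]) -> dict:
    """Stage 3: placeholder for the image generator. It only assembles the
    request that a real generator would be conditioned on."""
    return {"prompt": p.prompt, "conditioning": refs, "grounded": bool(refs)}


def run_pipeline(prompt: str) -> dict:
    p = plan(prompt)
    return generate(p, retrieve(p))


result = run_pipeline("A panda walking on the Great Wall at sunset")
print(result["conditioning"])
```

The point of the structure is that the generator never has to "imagine" a known entity from scratch: whatever the planner flags is looked up first, and the generator is conditioned on the retrieved reference.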

Xiaodu has made a pair of AI glasses

Xiaodu, which replaced its "brain" with a large model in 2023, this time launched not a speaker, but Baidu's first pair of glasses: Xiaodu AI Glasses.

△ Xiaodu AI Glasses.

At the hardware level, these glasses weigh only 45g, which is lower than the industry average weight of 49g. To improve the imaging effect, the glasses are equipped with a 16-megapixel ultra-wide-angle lens and an AI anti-shake algorithm; to improve the accuracy of sound recognition and reduce sound leakage, the glasses adopt a four-microphone array and an open anti-sound leakage speaker design.

In terms of battery life, Xiaodu AI Glasses can be fully charged in 30 minutes, achieving 56 hours of standby and over 5 hours of continuous listening. These three indicators all exceed the industry benchmark level.

What distinguishes Xiaodu AI Glasses from ordinary glasses is the "AI".

Built on the Wenxin Large Model and the DuerOS AI-native operating system, Xiaodu AI Glasses support functions such as first-person-view shooting, asking questions while walking, object recognition with encyclopedia lookup, audio-visual translation, intelligent memos, and song playlists.

According to Li Ying, Vice President of Baidu Group and CEO of Xiaodu Technology, Xiaodu AI Glasses will be launched in the first half of 2025.

A 0-code development tool is an intelligent agent team

At the conference, Baidu also officially announced "Miaoda", a 0-code application development platform that will be launched in Q1 of 2025.

Compared with other 0-code development platforms, Miaoda stands out in that application development is carried out collaboratively by multiple intelligent agents.

△ "Miaoda".

For example, when a web page is being produced, the programmer intelligent agent writes and deploys the code, the writing intelligent agent drafts the copy, the retrieval robot searches online for the latest material to include in the copy, and an intelligent agent that specializes in image generation supplies the accompanying pictures.

Finally, the intelligent agent responsible for quality inspection uses its reflection ability to run test code, find bugs, and work with the programmer intelligent agent to fix them.
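The division of labor described above, including the quality-inspection loop, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not how Miaoda works internally: each "agent" is an ordinary function with made-up names, no model is called, and the programmer agent's first draft deliberately omits the image so that the quality-inspection agent has a bug to catch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Page:
    facts: list[str] = field(default_factory=list)
    text: str = ""
    image: str = ""
    html: str = ""
    log: list[str] = field(default_factory=list)


def retrieval_agent(page: Page, topic: str) -> None:
    # Placeholder for an online search; no real lookup happens here.
    page.facts = [f"(placeholder fact about {topic})"]
    page.log.append("retrieval")


def writing_agent(page: Page, topic: str) -> None:
    page.text = f"{topic}: " + " ".join(page.facts)
    page.log.append("writing")


def image_agent(page: Page, topic: str) -> None:
    # Placeholder for image generation; only a file name is produced.
    page.image = topic.replace(" ", "_") + ".png"
    page.log.append("image")


def programmer_agent(page: Page, feedback: str | None = None) -> None:
    body = f"<p>{page.text}</p>"
    if feedback == "missing image":  # the first draft leaves the image out
        body += f'<img src="{page.image}">'
    page.html = f"<html><body>{body}</body></html>"
    page.log.append("programmer")


def qa_agent(page: Page) -> str | None:
    """Run a check on the page; return a bug description or None if it passes."""
    page.log.append("qa")
    if page.image and page.image not in page.html:
        return "missing image"
    return None


def build_page(topic: str, max_rounds: int = 3) -> Page:
    page = Page()
    retrieval_agent(page, topic)
    writing_agent(page, topic)
    image_agent(page, topic)
    feedback = None
    for _ in range(max_rounds):  # reflection loop: code, test, fix
        programmer_agent(page, feedback)
        feedback = qa_agent(page)
        if feedback is None:
            break
    return page


page = build_page("tea shop")
print(page.log)
```

In this run the quality-inspection agent rejects the first draft, and the programmer agent fixes it on the second round, so the log shows two programmer/qa passes.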

Multi-agent collaboration is applied not only in "Miaoda", which targets novice developers, but also in Wenxin Kuaima Comate, which targets professional programmers.

Wang Haifeng said that Comate has reached version 3.0. Throughout the development process, different intelligent agents in Comate 3.0 handle tasks such as automatic code quality inspection and code completion, with the aim of improving programmers' work quality and efficiency and freeing them to devote more energy to exploration and innovation.
