36Kr: Letting Some People See the Future First

On the path of benchmarking against OpenAI, Zhipu AI has taken another step forward.

Text | Tian Zhe

Editor | Su Jianxun


At the beginning of this year, OpenAI was reported to be developing its own AI agent software, which can act in place of a human user, automatically navigating to any website and performing designated tasks.

On October 25, Zhipu AI launched a similar product: the autonomous agent AutoGLM. Zhipu describes it as a phone assistant that can simulate a user tapping the screen, as well as a browser assistant that clicks through web pages.

Zhipu's demonstration video shows AutoGLM ordering food online from web pages, organizing Xiaohongshu notes into guides, and summarizing papers.

At the same time, Zhipu AI also launched the End-to-End Emotional Speech Model GLM-4-Voice.

This emotional speech model not only simulates realistic emotional expression but also vividly reproduces even the slightest pauses and breaths.

The breakthrough recalls the sci-fi romance film "Her", in which the protagonist Theodore happens to meet the AI assistant Samantha, whose emotional voice interaction makes her feel close and warm.

Now, Zhipu AI's emotional speech model seems to be bringing the sci-fi scenes of "Her" into daily life. Artificial intelligence is no longer just a cold tool; it is beginning to have "emotions" and "personalities".

"Her" has taken shape, and an AI with self-awareness and emotions may appear soon.

Autonomous Agent Launched: Can Order Takeout and Give Compliments

Like OpenAI's AI agent, Zhipu Qingyan's AutoGLM model does not require users to demonstrate operations manually. It is not limited to simple task scenarios or API calls, and it can operate electronic devices in place of a human.

Currently, AutoGLM supports eight well-known apps, including WeChat, Taobao, Meituan, and Xiaohongshu, covering everyday functions such as online chat, online shopping, social networking, maps, and hotel and train ticket booking.

Specifically, after you give AutoGLM an instruction, the exchange is carried out by voice, with real-time subtitles displayed.

For example, you can ask AutoGLM to leave a positive review for a designated store on Dianping and have it draft the review automatically. Before performing an operation that goes beyond the instruction, such as actually sending the review, AutoGLM asks whether to proceed.

You can also ask AutoGLM to find the historical orders of a certain period on Taobao and repurchase the designated goods.

Even tasks with more steps, such as liking and commenting on a designated WeChat friend's Moments, can be completed.

In addition to online shopping and drafting comments, AutoGLM can summarize multiple WeChat official account articles in a batch.

According to Zhipu's official account, AutoGLM is built on WebRL, a self-evolving online curriculum reinforcement learning framework that addresses the main research and application difficulties of web agents: scarce training tasks, scarce feedback signals, and policy distribution drift. Combined with an adaptive learning strategy, it can steadily improve its own performance over successive iterations. This means that AutoGLM has a certain ability to correct itself.

Source: Zhipu Official Account

Reportedly, to protect user privacy, AutoGLM does not proactively collect users' personal information. If a task falls outside the authorized scope, AutoGLM prompts the user for consent first.

Authorization is not permanent, either. Each time AutoGLM starts in the background, it asks the user again for accessibility permissions.
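The consent behavior described above can be pictured with a small sketch. This is not Zhipu's implementation, which has not been published in this detail; the class `AgentSession`, its `perform` method, and the action names are all hypothetical, chosen only to illustrate two ideas from the text: prompting before a step that goes beyond the instruction, and not carrying permissions over from one run to the next.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class AgentSession:
    """One run of a hypothetical assistant. Grants are never saved between runs."""

    ask_user: Callable[[str], bool]  # shows a prompt, returns True if the user agrees
    granted: Set[str] = field(default_factory=set)
    log: List[str] = field(default_factory=list)

    def perform(self, action: str, instructed: Set[str]) -> bool:
        """Run the action if it was instructed, or if the user approves it when asked."""
        if action not in instructed and action not in self.granted:
            # The step goes beyond the instruction, so prompt before proceeding.
            if not self.ask_user(f"Proceed with '{action}'?"):
                self.log.append(f"skipped: {action}")
                return False
            self.granted.add(action)
        self.log.append(f"done: {action}")
        return True


# The user asked only for a draft; sending it needs explicit approval.
instructed = {"open_dianping", "draft_review"}
cautious = AgentSession(ask_user=lambda prompt: False)
cautious.perform("draft_review", instructed)
cautious.perform("send_review", instructed)
print(cautious.log)

# A new session starts with no grants, so approval must be requested again.
fresh = AgentSession(ask_user=lambda prompt: True)
print(fresh.granted)
```

Keeping the grants inside the session object, rather than in stored settings, is what makes every new run start from zero in this sketch.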

AutoGLM is already available on desktop through the Qingyan plugin. The mobile version is in internal testing on Android phones.

Emotional Speech Model, With Emotions and Pauses

Two months ago, Zhipu Qingyan first showed off its voice call technology. At that time, the voice call function could understand the conversation and respond fairly accurately, but its tone sounded like a robot "reading a script", without much emotion. If you asked it to show some emotion, it would earnestly tell you: "As an artificial intelligence, I cannot express emotions."

However, the upgraded emotional speech model now feels different. The voice sounds more "human-like" and can add some emotions, chatting with you like a real person.

It has learned to speak in tones such as coquetry, teasing, anger, and hysteria. For example, the anthropomorphic voice can imitate a child coaxing an older sister for candied hawthorns.

Imitate a child being coquettish

If you are tired of listening to Mandarin, no problem: it can switch to the accents of Beijing, the Northeast, Guangdong, Taiwan, and Chongqing. For example, the classic phrase "Bashi de ban", used when introducing food, is enough to whet the appetite.

Imitate Sichuan dialect

Role-playing is not a problem either. You can cast it as the villain Voldemort from "Harry Potter", have it spar with you, and ask it to perform in a designated tone, such as the stock villain voice common in TV dramas.

Imitate Voldemort

If you challenge it to recite tongue twisters at high speed, it may "fail", and its pronunciation may become a bit unsteady.

Speak quickly

GLM-4-Voice also occasionally produces a brief burst of electrical noise when speaking.

Electrical noise

In addition, its pronunciation is occasionally non-standard, such as getting the tone of "wei" wrong in "wei shen me" ("why").

Occasionally non-standard pronunciation

Reportedly, GLM-4-Voice combines natural language generation (NLG) and speech synthesis technology. Compared with traditional text-to-speech (TTS) technology, the anthropomorphic voice can understand context and hold an emotional, natural conversation.

In addition, GLM-4-Voice models speech directly in the form of audio tokens, handling both the understanding and the generation of speech in a single model. Compared with the traditional cascade scheme, this means less information loss and error accumulation and, in theory, a higher modeling ceiling.
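The error-accumulation argument can be made concrete with a back-of-the-envelope sketch. The three stage names below are the usual components of a cascaded voice pipeline, and the per-stage figures are invented purely for illustration; they are not measurements of GLM-4-Voice or of any other system, and the calculation assumes that each stage loses information independently.

```python
import math


def cascade_fidelity(stage_fidelities):
    """Fraction of information surviving a chain of stages, assuming independent losses."""
    return math.prod(stage_fidelities)


# Hypothetical per-stage figures, for illustration only.
cascade = {
    "speech_recognition": 0.95,
    "language_model": 0.95,
    "speech_synthesis": 0.95,
}
overall = cascade_fidelity(cascade.values())
print(f"{overall:.3f}")  # 0.857
```

Under these assumed numbers, three stages that are each 95% faithful leave about 86% end to end, because the losses multiply at every handoff. A single model has no such handoffs, although whether it actually performs better depends on the model itself.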

GLM-4-Voice Model Architecture Diagram; Source: Zhipu AI

This is not all that Qingyan's anthropomorphic voice call can do. Benchmarked against GPT-4o, it is set to make further breakthroughs in response and interruption speed, emotional perception and resonance, controllable voice expression, and support for multiple languages and dialects. It can already express different emotions, speak in various regional dialects, and adjust its speaking speed and volume to converse like a real person.

GLM-4-Voice is now live in the Qingyan app, where users can chat with Qingyan naturally. Notably, GLM-4-Voice has also been open-sourced, making it Zhipu AI's first open-source end-to-end multimodal model.

As a next step, it will also support video calls. By then, it will not only recognize objects but also pair its tone of voice with an interactive sense of "eye contact".

According to Zhipu's official account, Zhipu has been able to release multimodal models spanning text, image, video, and emotional speech, and to teach AI to use various tools, because it has built a new base model, GLM-4-Plus. In language and text capabilities, GLM-4-Plus is comparable to GPT-4o and the 405B-parameter Llama 3.1.

Increasing Investment in AI Phones, Zhipu Finds a Major Commercial Entry Point

"The current small models are still in the stage of finding the market. The market and the technology still need to adapt to each other, efficiency needs to improve, and new application scenarios need to be found."

Two months ago, Zhang Peng, the CEO of Zhipu AI, expressed this view in an interview with "Intelligent Emergence". Cooperating with mobile phone manufacturers on AI agents may be the new scenario Zhang Peng had in mind.

Finding scenarios is the top priority for model manufacturers. With the right scenario, they can not only secure a stable source of income and become self-sustaining, but also keep collecting data in that scenario to iterate on their products.

AI agents are one of the main forms of large-model applications, able to perceive, decide, and act autonomously. AI phones and AI PCs, widely seen as the next-generation forms of smartphones and computers, come equipped with AI agents. The consulting firm IDC expects AI phones and AI PCs to exceed 50% and 80% of the Chinese market, respectively, by 2027.

Zhipu is stepping up its push to bring large models to AI phones. On the 22nd of this month, Zhipu partnered with Qualcomm to adapt and optimize the GLM-4V on-device visual large model for the Snapdragon 8 Elite, providing multimodal interaction. On the 23rd, Zhipu partnered with Samsung's phone business around the same GLM-4V on-device visual model, and the two sides will jointly build AI products.

Regarding AutoGLM, Zhipu also revealed that it is working closely with mobile phone manufacturers such as Honor. In fact, Zhipu and Honor established a joint laboratory for AI large-model technology in September this year.

Honor also intends to further improve the performance of AI agents. In a media interview at the Honor MagicOS 9.0 launch event on the 23rd, Zhao Ming, the CEO of Honor Terminal Co., Ltd., said that Honor is rebuilding its operating system around AI to create core underlying capabilities and deliver more intelligent services.

AI phones are in the limelight, and many manufacturers, including Apple, Honor, vivo, and OPPO, are racing to launch phones with AI functions such as AI object erasing and AI call summaries.

However, the number of phones equipped with AI agents is currently relatively small. The reasons include the immaturity of the industry technology and the long-term lack of relevant standards.

That said, the situation is gradually improving.

On the market side, model manufacturers such as OpenAI and Zhipu have announced partnerships with mobile phone companies, which will encourage cooperation among model manufacturers, app providers, and mobile phone manufacturers.

On the policy side, the China Academy of Information and Communications Technology and several domestic mobile phone manufacturers jointly released the "Research Report on Terminal Intelligentization Grading", which goes some way toward defining grades of terminal intelligence and should promote the development of the domestic AI phone market.

Terminal Intelligentization Grading Definition

Several large-model companies are already working with companies in the smartphone ecosystem. AI phones may well become the main engine of Zhipu's commercialization.

This article is an original work by Tian Zhe. Unauthorized reproduction is prohibited.

Zhipu AI has launched the latest "Autonomous Agent", and "Her" can finally become a reality.
