Alibaba's most powerful full-modal model makes its debut. In our hands-on tests, it understood a 50-minute episode of "Friends", and it has achieved SOTA results on 215 benchmarks worldwide.
Zhidx reported on March 31 that Alibaba had launched its latest-generation full-modal large model, Qwen3.5-Omni, the previous day. The model natively understands text, image, audio, and audio-video inputs and can output in both text and audio.
Alibaba last updated its Omni series in September of last year. The newly released Qwen3.5-Omni series comes in three sizes, Plus, Flash, and Light, and supports a 256K context window, audio inputs of over 10 hours, and audio-video inputs of over 400 seconds at 720p (1 FPS).
In its technical blog, the Qianwen team stated that Qwen3.5-Omni-Plus achieved SOTA results on 215 audio and audio-video understanding, reasoning, and interaction benchmarks. The model surpasses Gemini-3.1 Pro in general audio understanding, reasoning, speech recognition, translation, and dialogue, and its overall audio-video understanding reaches the level of Gemini-3.1 Pro, while its visual and text capabilities are on par with the Qwen3.5 model of the same size.
These capabilities unlock many interesting use cases. For example, in real-time mode, you can hold up your phone, turn on the camera, show Qwen3.5-Omni a sketch, and talk through your development idea. It then generates the corresponding code, enabling "programming by voice" and quickly producing a prototype.
In addition, Qwen3.5-Omni can understand 39 Chinese dialects and 74 languages, and can synthesize speech in 7 Chinese dialects and 29 languages, a significant expansion of multilingual support over the previous-generation Qwen3-Omni.
We tried chatting with Qwen3.5-Omni in Minnan (Hokkien). It understood the dialect accurately, and the generated speech sounded fairly authentic, though a few Mandarin words still slipped in. From sending our voice message to receiving the audio reply took about 1 to 2 seconds, and the model also used web search to return accurate current weather information.
The Qwen3.5-Omni series models are currently available via API on Alibaba Cloud Bailian, with both offline and real-time call modes. Users can also try the model on chat.qwen.ai, Hugging Face, and ModelScope.
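For orientation, an offline call through Bailian's OpenAI-compatible endpoint might look like the sketch below. The model identifier "qwen3.5-omni-plus", the voice name, and the streaming behavior are assumptions based on how earlier Qwen-Omni models are called, not confirmed details of this release.

```python
# Minimal sketch of an offline (non-real-time) call through the
# OpenAI-compatible Bailian/DashScope endpoint. The model identifier,
# voice name, and exact parameters are assumptions modeled on how earlier
# Qwen-Omni models are called, not confirmed details of this release.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Omni models return their output as a stream; request text plus speech.
stream = client.chat.completions.create(
    model="qwen3.5-omni-plus",                   # assumed identifier
    messages=[{"role": "user", "content": "用闽南语介绍一下今天厦门的天气"}],
    modalities=["text", "audio"],                # text + synthesized speech
    audio={"voice": "Cherry", "format": "wav"},  # assumed voice name
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:                            # text tokens
        print(delta.content, end="", flush=True)
    # Audio arrives as separate base64 chunks that can be buffered for playback.
```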
API calls are billed on a tiered schedule. In the common tier where input is ≤ 128K tokens, audio input costs $4.96 per million tokens and text/image/video input costs $0.8 per million tokens. Combined text-plus-audio output costs $61.322 per million tokens, while text-only output costs $9.6 per million tokens.
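As a concrete illustration of the tiers, the short sketch below estimates the cost of a single call in the ≤ 128K input tier; the token counts are invented purely for illustration.

```python
# Back-of-the-envelope cost for a single call in the <= 128K input tier,
# using the per-million-token prices quoted above. Token counts below are
# invented purely for illustration.
PRICE_PER_M_TOKENS = {
    "audio_in": 4.96,          # audio input
    "other_in": 0.80,          # text / image / video input
    "text_audio_out": 61.322,  # combined text + audio output
    "text_out": 9.60,          # text-only output
}

def call_cost(audio_in=0, other_in=0, out_tokens=0, audio_out=True):
    out_rate = PRICE_PER_M_TOKENS["text_audio_out"] if audio_out else PRICE_PER_M_TOKENS["text_out"]
    return (audio_in * PRICE_PER_M_TOKENS["audio_in"]
            + other_in * PRICE_PER_M_TOKENS["other_in"]
            + out_tokens * out_rate) / 1_000_000

# Example: 20K audio-input tokens, 5K video-frame tokens, and 2K tokens of
# spoken reply come out to roughly $0.23 for the call.
print(f"${call_cost(audio_in=20_000, other_in=5_000, out_tokens=2_000):.3f}")
```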
Right after the release, Zhidx tried Qwen3.5-Omni-Plus hands-on. The model handled long-video understanding and multimodal instruction following well, and its low-latency real-time interaction and new voice-control feature improved the overall interaction experience.
Qwen3.5-Omni-Plus-Realtime:
https://help.aliyun.com/zh/model-studio/realtime
Qwen3.5-Omni-Plus:
https://bailian.console.aliyun.com/cn-beijing?tab=model#/model-market/detail/qwen3.5-omni-plus
ModelScope Offline Demo:
https://modelscope.cn/studios/Qwen/Qwen3.5-Omni-Offline-Demo
ModelScope Real-time Demo:
https://modelscope.cn/studios/Qwen/Qwen3.5-Omni-Online-Demo
01.
Watch a 50-minute video in 1 minute
And achieve "programming by voice"
In the technical blog, the Qianwen team noted that one of Qwen3.5-Omni-Plus's capabilities is audio-video captioning: guided by the prompt, it can generate fine-grained, script-level descriptions, automatically segment the video, add timestamps, and describe in detail how the characters and the audio relate to one another.
In our test, we uploaded a roughly 50-minute episode of the American sitcom "Friends" to Qwen3.5-Omni-Plus and asked it, via a system prompt, to output an accurate description of the visual content.
Qwen3.5-Omni-Plus took about one minute to process the episode, which is quite fast. Its description covered the full video timeline without jumps or omissions, satisfying the core requirement of describing the video in chronological order.
As for the content itself, the description captured the key plot turns and identified important character relationships and emotional shifts. It read not as a mechanical list but with a light narrative touch, and was far more readable than the AI video summaries auto-generated by many cloud storage services.
In the official example, Qwen3.5-Omni-Plus was given a clip from the documentary "A Bite of China" and produced audio-video captions for it. The model automatically split the clip into appropriate time segments based on the narration and on-screen content, and its descriptions covered both the visuals and the voice-over, with a clear structure and rich detail.
Paired with more complex prompts, Qwen3.5-Omni-Plus can also be used for content moderation, for example detecting whether a game livestream contains graphic violence, dangerous behavior, verbal abuse and bullying, or other inappropriate themes.
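In practice, such a captioning or moderation task boils down to a system prompt plus a multimodal user message. The sketch below shows one plausible payload; the "video_url" content type follows the conventions of earlier Qwen-VL/Qwen-Omni APIs and, like the sample URL, is an assumption here.

```python
# One plausible shape for an audio-video captioning request. The payload can
# be sent with the same client setup as in the earlier sketch; the content
# types and URL are illustrative assumptions.
captioning_messages = [
    {"role": "system", "content": (
        "You are a video captioning assistant. Split the video into scenes, "
        "label each scene with start and end timestamps, and describe the "
        "visuals, dialogue, and narration in chronological order."
    )},
    {"role": "user", "content": [
        {"type": "video_url",  # assumed content type, as in earlier Qwen-VL/Omni APIs
         "video_url": {"url": "https://example.com/episode.mp4"}},  # placeholder URL
        {"type": "text", "text": "Produce script-level captions for this episode."},
    ]},
]
```

A moderation task would use the same structure with a system prompt listing the categories to flag.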
The Qianwen team also observed that the full-modal model has developed an emergent ability to code directly from audio-video instructions, which they call "Audio-Visual Vibe Coding".
In our test, we uploaded a screen recording and asked the model to quickly build a prototype of a social media platform based on the visuals and voice instructions it contained. After receiving the video, Qwen3.5-Omni-Plus started coding right away, and the video input added no perceptible delay.
The generated web page is shown below. It largely matches the layout of the Xiaohongshu web client, and the navigation logic between pages is correct. With images manually inserted, it reproduces roughly 80% of the original design.
In the official demo, the Qianwen team also showed Qwen3.5-Omni-Plus generating web pages from sketches. Users only need to draw a simple interface wireframe on paper, photograph and upload it, and describe the functional requirements out loud; the model understands the design intent and directly outputs runnable front-end code.
02.
Enhanced real-time interaction ability
Supports interruption at any time and voice cloning
Beyond the improvements to its base capabilities, the interaction abilities of the Qwen3.5-Omni series have also been enhanced.
Qwen3.5-Omni now supports semantic interruption: users can cut in while the model is "talking" to add information, give new instructions, and so on.
This experience relies on Qwen3.5-Omni's ability to automatically recognize turn-taking intent, which prevents false interruptions from backchannel fillers and meaningless background noise. The capability is natively supported in the API.
The official demo shows that Qwen3.5-Omni is not interrupted by "um" and other backchannel sounds; when the user actually asks a question, the model promptly stops its previous response and generates a new one.
Qwen3.5-Omni natively supports web search and complex function calling. The model decides on its own whether a web search is needed to answer the user's question; in the dialect conversation shown at the beginning of this article, it was this ability that let the model look up real-time weather information.
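Custom tools are exposed to the model in the usual way for OpenAI-compatible chat APIs. The sketch below registers a hypothetical weather-lookup function and lets the model decide whether to call it; the tool, its schema, and the model identifier are all assumptions for illustration.

```python
# Sketch of exposing a weather-lookup tool via standard OpenAI-style function
# calling. The get_weather tool is hypothetical and the model identifier is
# an assumption; only the general tools/tool_calls mechanics are standard.
import json
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # assumed identifier
    messages=[{"role": "user", "content": "厦门今天天气怎么样？"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model decided a tool call is needed
    call = msg.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:               # the model answered directly
    print(msg.content)
```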
End-to-end voice control and dialogue capabilities have also been built into Qwen3.5-Omni. On instruction, the model can adjust the volume, speed, and emotion of its voice much as a human would.
Qwen3.5-Omni also supports voice cloning: users can upload a voice sample to customize the voice. In the official demo, Qwen3.5-Omni clones the speaker's voice and then speaks in different languages, enabling consecutive interpretation.
03.
Continues the Thinker-Talker division of labor architecture
Adopts a hybrid attention mechanism
How did the Qwen3.5-Omni series models achieve the above capabilities?
Qwen3.5-Omni retains the previous generation's Thinker-Talker division of labor: the Thinker is responsible for understanding, and the Talker is responsible for expression. This time, however, both have been switched to a Hybrid-Attention MoE (hybrid-attention mixture-of-experts) design, which improves the model's efficiency and performance.
The Thinker receives visual and audio signals, encodes positional information with TMRoPE, and outputs text. Hybrid attention lets it quickly pick out the key points when processing audio up to 10 hours long and video up to 1 hour long.
The Talker takes the Thinker's multimodal output and generates speech in context. It also replaces the heavyweight DiT stage with RVQ (residual vector quantization) coding.
To address instability in streaming speech interaction caused by the mismatch in encoding efficiency between text and speech tokens, such as skipped words, misread words, or garbled numbers, the Qianwen team introduced ARIA (Adaptive Rate Interleave Alignment), which dynamically aligns text and speech units and improves the naturalness and robustness of speech synthesis while preserving real-time performance.
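To make the division of labor concrete, the toy sketch below walks through the Thinker-to-Talker handoff and the idea of interleaving text and speech units at a fixed rate. It is a conceptual illustration only; the function names, data structures, and fixed ratio are inventions, not the actual Qwen3.5-Omni implementation, which aligns learned token streams.

```python
# Toy illustration of the Thinker-Talker split and of interleaving text and
# speech tokens so audio can start playing before the full reply is written.
# Conceptual sketch only: the real model aligns learned token streams (ARIA),
# whereas here strings are interleaved at a fixed ratio.
from dataclasses import dataclass

@dataclass
class ThinkerOutput:
    text_tokens: list[str]   # what the Thinker wants the reply to say
    context: list[float]     # per-token context handed to the Talker (stub)

def thinker(transcribed_request: str) -> ThinkerOutput:
    # Stand-in for the understanding stage (audio/vision in, text + context out).
    tokens = f"Here is my answer to: {transcribed_request}".split()
    return ThinkerOutput(text_tokens=tokens, context=[0.0] * len(tokens))

def talker(out: ThinkerOutput, speech_per_text: int = 3) -> list[str]:
    # Stand-in for the speech stage: emit a few speech tokens per text token,
    # interleaved in one stream rather than after the whole text is finished.
    stream: list[str] = []
    for i, tok in enumerate(out.text_tokens):
        stream.append(f"<text:{tok}>")
        stream.extend(f"<speech:{i}.{j}>" for j in range(speech_per_text))
    return stream

if __name__ == "__main__":
    print(talker(thinker("what's the weather in Xiamen")))
```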
The detailed comparison between Qwen3.5-Omni and Qwen3-Omni is as follows:
04.
Conclusion: Full-modal capabilities may unlock more AI application scenarios
Full modality has become a major trend in model development. From Qianwen's Omni series to Google's Gemini, future models will no longer be a simple stack of separate text, image, and audio capabilities; instead they will have a unified understanding-and-generation architecture that processes streaming audio-video input as naturally as humans do.
As long-context processing, dialect and multilingual coverage, and low-latency response continue to improve, the full-modal capabilities of large models are expected to play a bigger role in content moderation, intelligent customer service, and real-time translation, delivering a more natural interaction experience.
This article is from the WeChat official account "Zhidx" (ID: zhidxcom), written by Chen Junda and edited by Li Shuiqing. It is published by 36Kr with authorization.