Alibaba dropped three open-source bombshells overnight, setting 32 new open-source state-of-the-art (SOTA) records.
On September 23, Zhidx reported that late at night, Alibaba's Tongyi large model team made three major moves: open-sourcing the natively omni-modal large model Qwen3-Omni and the speech generation model Qwen3-TTS, and updating the image editing model Qwen-Image-Edit-2509.
Qwen3-Omni can seamlessly handle text, image, audio, and video inputs and simultaneously generate text and natural speech through real-time streaming responses. Across 36 audio and audio-visual benchmarks, it achieved open-source SOTA on 32 and overall SOTA on 22, surpassing strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Its image and text performance also reaches SOTA among models of a similar size.
Qwen3-TTS supports 17 voices and 10 languages, surpassing mainstream products like SeedTTS and GPT-4o-Audio-Preview in speech stability and voice similarity evaluations.
The primary update in Qwen-Image-Edit-2509 is support for multi-image editing, which can combine elements from different images, such as person + person or person + object.
Alibaba's open-source homepage
Alibaba open-sourced Qwen3-Omni-30B-A3B-Instruct (instruction following), Qwen3-Omni-30B-A3B-Thinking (reasoning), and the general audio captioner Qwen3-Omni-30B-A3B-Captioner.
Open-source address on Hugging Face:
https://huggingface.co/Qwen
Open-source address on GitHub:
https://github.com/QwenLM/Qwen3-Omni
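For readers who want to try the open weights, here is a minimal loading sketch using the Hugging Face transformers library. The repository id follows the model names listed above; the concrete model class and the multimodal preprocessing are resolved from the repository, so treat this as an assumption-laden starting point and follow the model card for the recommended usage.

```python
# Minimal sketch (not the official quickstart) for pulling the Instruct
# checkpoint from the Hugging Face Hub. The repo id is taken from the model
# names above; the exact loading classes may differ, so check the model card.
import torch
from transformers import AutoModel, AutoProcessor

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # assumed repo id under the Qwen org

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision; a 30B MoE still needs serious GPU memory
    device_map="auto",           # shard across whatever devices are available
    trust_remote_code=True,
)
print(type(model).__name__)      # the concrete architecture class comes from the repo config
```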
01.
Supports interaction in 119 languages
Allows for arbitrary customization and modification of personas
On the international version of the Tongyi Qianwen website, you can start a video call by clicking the icon in the lower right corner of the input box; the feature is currently still in beta.
In our hands-on testing, video interaction on the web version was unstable, so we switched to the international app of Tongyi Qianwen. In the app, the video response latency of Qwen-Omni-Flash was almost imperceptible, approaching the fluency of face-to-face conversation between real people.
Qwen-Omni-Flash has solid world knowledge: when we asked it to identify images of beer brands, plants, and the like, it gave accurate answers.
The official blog says Qwen3-Omni supports interaction in 119 text languages, speech understanding in 19 languages, and speech generation in 10 languages. On latency, the model's end-to-end audio dialogue latency is as low as 211 ms and its video dialogue latency as low as 507 ms, and it can understand audio up to 30 minutes long. In actual use, however, when the model speaks foreign languages such as English and Spanish, its pronunciation still carries noticeable Mandarin intonation and does not sound fully natural or authentic.
In Cantonese interactions, Qwen-Omni-Flash still occasionally mixes in Mandarin words, which somewhat breaks the immersion of the dialogue.
Several official demos show interactions in Spanish, French, and Japanese.
In one, the model analyzes the menu of an Italian restaurant and then recommends pasta dishes to friends in French; its response names classic pastas and gives brief introductions based on the menu descriptions.
Qwen3-Omni can also read webpage content: in another demo it summarizes for the user that the page is the official website of the Picasso Museum in Barcelona, mentioning the history of its five buildings and the surrounding streets.
In a Japanese conversation scenario, the model can describe the environment the people in the video are in and what they are talking about.
Qwen3-Omni supports arbitrary customization of the system prompt, which can change its response style and persona.
In the demo, the model played a kindergarten teacher from Guangdong, explaining Qwen3-Omni to children based on the model's feature summary chart. It covered the four features shown in the picture and used metaphors children could easily understand.
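As a rough illustration of how such a persona could be set in practice, here is a minimal sketch that places the persona in the system prompt of an OpenAI-compatible chat request. The base_url, API key, and model name are placeholders for illustration, not official service details.

```python
# Hedged sketch: pinning a persona via the system prompt on an OpenAI-compatible
# chat endpoint. The base_url, api_key, and model name are placeholders
# (e.g. a local vLLM server), not the official service.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [
    {
        "role": "system",
        "content": (
            "You are a kindergarten teacher from Guangdong. Explain ideas with "
            "simple metaphors a child can follow, and keep a warm, patient tone."
        ),
    },
    {"role": "user", "content": "Please explain what Qwen3-Omni can do."},
]

reply = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-instruct",  # placeholder model name
    messages=messages,
)
print(reply.choices[0].message.content)
```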
In multi-person interaction scenarios, Qwen3-Omni can also analyze each speaker's gender, tone, and what they said.
For example, in the following conversation there is a girl speaking Sichuanese who invites friends to visit, a boy speaking Mandarin who is heartbroken, and another boy whose dog was stolen. Asked what dialect the girl was speaking and what she said, Qwen3-Omni identified the dialect as Sichuanese and explained that she introduced herself, extended an invitation, and praised her hometown.
Asked which person in the video was the happiest, Qwen3-Omni picked Xiaowang, who spoke last, basing its judgment mainly on his tone and his thumbs-up gesture.
In addition, Qwen3-Omni can analyze musical styles and elements and reason about what appears in a video; for example, when it recognizes that the user in the video is working on a math problem, it will solve the problem as well.
02.
Reached SOTA in 22 tests
No intelligence degradation from multimodal pre-training
In comprehensive evaluations, Qwen3-Omni performs on par with similarly sized single-modal Qwen models on single-modal tasks and does better on audio tasks.
Across 36 audio and audio-visual benchmarks, it achieved the best open-source performance on 32 and overall SOTA on 22. Its performance surpasses closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe, and it matches Gemini-2.5-Pro on speech recognition and instruction-following tasks.
The blog explains that Qwen3-Omni adopts a Thinker-Talker architecture: the Thinker is responsible for text generation, while the Talker focuses on streaming speech-token generation, consuming high-level semantic representations directly from the Thinker.
To achieve ultra-low-latency streaming generation, the Talker predicts multi-codebook sequences autoregressively: at each decoding step, the MTP module outputs the residual codebooks of the current frame, and Code2Wav then synthesizes the corresponding waveform, enabling frame-by-frame streaming generation.
Key points of the architecture include: the audio encoder is an AuT model trained on 20 million hours of audio data, giving it general-purpose audio representations; and both the Thinker and the Talker use MoE architectures, supporting high concurrency and fast inference.
In addition, the researchers mixed single-modal and cross-modal data in the early stage of text pre-training, so that mixed multimodal training does not degrade performance relative to pure single-modal training while significantly strengthening cross-modal ability.
The AuT encoder, Thinker, Talker, and Code2Wav form a fully streaming end-to-end pipeline, supporting direct streaming decoding of first-frame tokens into audio output.
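To make the frame-by-frame flow easier to picture, here is a purely conceptual Python sketch of the pipeline described above; the component interfaces are invented for illustration and do not correspond to the released code.

```python
# Conceptual sketch only, not the released code: how the described
# Thinker-Talker pipeline could stream audio frame by frame. All component
# interfaces below are invented to mirror the blog's terminology.

def stream_reply(user_inputs, thinker, talker, mtp, code2wav):
    """Yield waveform chunks while the textual answer is still being produced."""
    for semantic_state in thinker.stream(user_inputs):        # high-level semantic representations
        speech_token = talker.next_token(semantic_state)      # autoregressive speech token for this frame
        residual_codes = mtp.predict_residuals(speech_token)  # remaining codebooks from the MTP module
        frame = [speech_token, *residual_codes]               # full multi-codebook frame
        yield code2wav.synthesize(frame)                      # decode this single frame straight to audio
```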
In addition, Qwen3-Omni supports function calling, enabling efficient integration with external tools and services.
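A hedged example of what that looks like in practice, using the standard OpenAI-style tools schema against an OpenAI-compatible endpoint; the endpoint URL, model name, and get_weather tool are placeholders for illustration, not part of the official release.

```python
# Hedged example: exposing an external tool with the standard OpenAI-style
# "tools" schema on an OpenAI-compatible endpoint. The endpoint, model name,
# and get_weather tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Do I need an umbrella in Hangzhou today?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # the requested tool invocation, if the model chose to call one
```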
03.
Released the text-to-speech model
Reached SOTA in multiple benchmark tests
Alibaba's Tongyi team also released the text-to-speech model Qwen3-TTS-Flash.
Its main features include:
Chinese and English stability: Qwen3-TTS-Flash achieves SOTA Chinese and English stability on the seed-tts-eval test set, surpassing SeedTTS, MiniMax, and GPT-4o-Audio-Preview.
Multilingual stability and voice similarity: on the MiniMax TTS multilingual test set, Qwen3-TTS-Flash achieves SOTA word error rate (WER) for Chinese, English, Italian, and French, significantly lower than MiniMax, ElevenLabs, and GPT-4o-Audio-Preview, and its speaker similarity in English, Italian, and French significantly exceeds those same models.
High expressiveness: Qwen3-TTS-Flash has highly expressive, human-like voices and stably and reliably outputs audio that closely follows the input text.
Rich voices and languages: Qwen3-TTS-Flash offers 17 voice options, and each voice supports 10 languages.
Support for multiple dialects: Qwen3-TTS-Flash supports dialect generation, covering Mandarin, Minnan (Hokkien), Wu, Cantonese, Sichuanese, and the Beijing, Nanjing, Tianjin, and Shaanxi dialects.
Tone adaptation: After being trained on a large amount of data, Qwen3-TTS-Flash can automatically adjust the tone according to the input text.
High robustness: Qwen3-TTS-Flash can automatically process complex text, extract key information, and has strong robustness to complex and diverse text formats.
Fast generation: Qwen3-TTS-Flash has extremely low first-packet latency, as low as 97 ms at single concurrency.
The researchers introduced multiple architecture upgrades and acceleration strategies, giving the model lower first-packet latency and faster generation.
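For orientation, here is a hedged sketch of what a typical synthesis request might look like; the endpoint URL, field names, and voice id below are hypothetical placeholders rather than the actual Qwen3-TTS-Flash API, so consult the official API reference for the real interface.

```python
# Hypothetical sketch of a synthesis request: the URL, field names, and voice
# id are placeholders, not the real Qwen3-TTS-Flash API. The point is the shape
# of a call that picks a voice and language and saves the returned audio.
import requests

payload = {
    "model": "qwen3-tts-flash",   # placeholder model name
    "text": "Welcome to Tongyi speech synthesis.",
    "voice": "cherry",            # one of the 17 voices (placeholder id)
    "language": "zh",             # one of the 10 supported languages
}

resp = requests.post(
    "https://example.com/api/v1/tts",                  # hypothetical endpoint
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder credential
    timeout=60,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)         # assumes the service returns raw audio bytes
```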
04.
Update of the image editing model
Supports multi-image editing
Alibaba also launched Qwen-Image-Edit-2509, the monthly iteration of its image editing model.
Compared with Qwen-Image-Edit released in August, the main features of Qwen-Image-Edit-2509 include:
Support for multi-image editing: for multi-image input, Qwen-Image-Edit-2509 is further trained on top of the Qwen-Image-Edit architecture using image concatenation, enabling combinations such as "person + person", "person + product", and "person + scene" (see the sketch after this list).
Enhanced consistency for single-image input: Qwen-Image-Edit-2509 improves consistency for single-image editing mainly in the following ways:
Better consistency in person editing, with stronger face ID preservation, supporting various portrait styles and pose changes;
Better consistency in product editing, with stronger product ID preservation, supporting product poster editing;
Better consistency in text editing, supporting not only modification of text content but also editing of fonts, colors, and text materials.
Native ControlNet support: including depth maps, edge maps, keypoint maps, and more.
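As referenced above, here is a minimal sketch of a "person + product" edit through diffusers. The repository id and the exact way two input images are passed to the pipeline are assumptions; the model card's example code is authoritative.

```python
# Hedged sketch of a "person + product" multi-image edit via diffusers.
# DiffusionPipeline resolves the pipeline class from the repo; the repo id and
# the multi-image call signature are assumptions.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509",  # assumed repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

person = Image.open("person.png")
product = Image.open("product.png")

result = pipe(
    image=[person, product],      # assumed parameter shape for multi-image input
    prompt="The person holds the product in a bright studio product photo.",
).images[0]
result.save("edited.png")
```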
05.
Conclusion: Pushing hard on the multimodal track!
Alibaba's Tongyi model family is expanding at an accelerating pace
The progress of these three models further strengthens Tongyi's competitiveness in multimodal generation. In particular, Qwen3-TTS-Flash delivers breakthroughs in multi-voice capability, multilingual support, multi-dialect adaptation, and robustness to complex text, and together with Qwen3-Omni it upgrades the speech capabilities of the Tongyi large models.
Alibaba's Tongyi large model team said in the blog that it will continue upgrading Qwen3-Omni along several technical directions, including building core capabilities such as multi-speaker ASR, video OCR, and audio-video active learning, and strengthening support for agent-based workflows and function calling.
Alibaba keeps pushing in the field of multimodal large models, and on some benchmarks its performance comprehensively surpasses that of competitors; going forward, this may drive adoption in more practical application scenarios.