
Alibaba unveils its most powerful speech model yet, transcribing English rap into text with world-beating accuracy.

Zhidx, 2025-09-09 12:14
The model automatically identifies 11 languages and filters out noise, and it is now free to try.

According to a Zhidx report on September 9, Alibaba yesterday released its latest speech recognition model, Qwen3-ASR-Flash. The model is trained on the Qwen3 base model and supports 11 languages and a variety of accents. Users can try it for free through ModelScope, Hugging Face, and the Alibaba Cloud Bailian API.

In multiple ASR (automatic speech recognition) benchmarks, Qwen3-ASR-Flash achieves significantly lower error rates on dialects, multilingual speech, key-information recognition, and lyrics than Google's Gemini-2.5-Pro, OpenAI's GPT-4o-Transcribe, Alibaba Speech Lab's Paraformer-v1, and ByteDance's Doubao-ASR.

Specifically, the model is trained on massive multimodal data plus tens of millions of hours of ASR data and supports 11 languages, including Chinese, English, French, and German. During recognition, it automatically detects the language being spoken and filters out non-speech segments such as silence and background noise.

In addition, users can customize the ASR output: by attaching context, such as key terms or background information about the audio, when uploading it, they can bias the transcription toward that information (a minimal programmatic sketch follows the demo links below).

Below is an official demo featuring e-sports commentary audio. The researchers supplied background information for the scenario, including a keyword list and a description of the game, so even though the commentator speaks very fast, the game's technical terms are still transcribed correctly.

  • ModelScope address: https://modelscope.cn/studios/Qwen/Qwen3-ASR-Demo
  • Hugging Face address: https://huggingface.co/spaces/Qwen/Qwen3-ASR-Demo
  • Alibaba Cloud Bailian API call address: https://bailian.console.aliyun.com/?tab=doc#/doc/?type=model&url=2979031
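
For a quick programmatic test, the public Hugging Face demo Space can be driven with the open-source gradio_client library. This is only a minimal sketch: the endpoint name and parameter order below are assumptions, so inspect the Space's real signature with client.view_api() before calling it.

```python
# Minimal sketch: driving the public Qwen3-ASR-Demo Space with gradio_client.
# The endpoint name and parameter order are assumptions; run client.view_api()
# first to see the Space's actual inputs and outputs.
from gradio_client import Client, handle_file

client = Client("Qwen/Qwen3-ASR-Demo")
client.view_api()  # prints the Space's real endpoint names and arguments

result = client.predict(
    handle_file("esports_commentary.wav"),       # local audio file (hypothetical name)
    "Keywords: Baron Nashor, gank, ultimate",    # biasing context (assumed input slot)
    api_name="/predict",                         # assumed endpoint name
)
print(result)
```
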

01. Recognizes Game Commentary and English Rap, with Strong Resistance to Continuous, Overlapping Noise

Alibaba released five official demos covering recognition challenges such as mixed noise, rapid language switching, dialects, and technical terminology.

The first features a continuous mix of noises, including phone ringtones, bicycle bells, music, running water, and thunder, overlaid on conversations between several people. Qwen3-ASR-Flash transcribed the speech accurately even when multiple people spoke at once or the pauses between turns were very short, and the noise did not throw it off.

The second is English rap, which combines rapid delivery with heavy use of linked words. The model accurately transcribed the liaisons and long, complex lines in the lyrics and was unaffected by the backing track.

The third is dialect recognition. Here the speaker is driving, and the audio alternates between his dialect and the Mandarin of an intelligent voice assistant. The voice assistant in the recording misheard "correct" as "96", while Qwen3-ASR-Flash got it right.

The fourth is sentence-level language switching: a 7-second clip contains five languages, including English and Japanese, and the transcript renders each in turn.

The last is a recording of a chemistry lesson. The transcript made no mistakes on chemical terms such as ester groups, acids, aldehydes, and ammonia, nor on the speakers' interjections.

02. Lyrics Recognition Error Rate Below 8%, with Customizable Transcription Results

On performance, Qwen3-ASR-Flash posts lower error rates on Chinese, English, multilingual speech, lyrics, and key information than Gemini-2.5-Pro, GPT-4o-Transcribe, Paraformer-v1, and Doubao-ASR.

For lyrics, Qwen3-ASR-Flash can transcribe an entire song with or without backing music; in the researchers' own measurements, the error rate stays below 8%.
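
For reference, error rates like this are conventionally computed as word error rate (WER): the substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the reference word count. Below is a minimal sketch using the open-source jiwer package; the article does not say which tool or test set the researchers used, and the lyric strings are invented for illustration.

```python
# WER = (substitutions + deletions + insertions) / reference word count.
# jiwer is a common open-source choice; not necessarily what the researchers used.
import jiwer

reference = "started from the bottom now we here"      # ground-truth lyric (invented)
hypothesis = "started from the bottom now we're here"  # model transcript (invented)

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # 1 substitution over 7 reference words, about 14.29%
```
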

For Chinese, the model supports Mandarin and dialects including Sichuanese, Minnan (Hokkien), Wu, and Cantonese; for English, it handles British, American, and other regional accents; it also supports French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, and Arabic.

To customize the output, users can supply background text in any format and receive transcripts biased toward it; no preprocessing of the context is required.

The supported formats include, but are not limited to:

  • a simple list of keywords or hot words;
  • a complete paragraph or an entire document, of any length and from any source;
  • a free-form mixture of keyword lists and full-text paragraphs;
  • irrelevant or even meaningless text.

The researchers note that the model is highly robust to the negative effects of irrelevant context.
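
Purely as an illustration of those shapes (the example strings are invented, and the actual context field name depends on the API you call; see the Bailian docs linked above), the context could look like any of the following:

```python
# Three context shapes the model reportedly accepts verbatim, no preprocessing.
# Example strings invented for illustration; pass them as the context field of
# whichever interface you use (demo Space or Bailian API).
keyword_context = "Baron Nashor, gank, ultimate, pentakill"  # bare hot-word list

paragraph_context = (                                        # free-form document
    "Ranked match commentary. The casters frequently mention champion "
    "abilities, jungle objectives, and team-fight calls."
)

mixed_context = f"{keyword_context}\n\n{paragraph_context}"  # mixture also works
```
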

Qwen3-ASR-Flash then uses this context to recognize and match named entities and other key terms, producing customized transcripts.

03. Conclusion: General Recognition Accuracy to Improve Through Iteration

Complex acoustic environments, diverse speech patterns, and technical terminology have long been the hardest problems in speech recognition. To give users control over the output, Alibaba's researchers added the ability to upload background text, bringing transcripts closer to users' expectations.

Next, the researchers plan to improve Qwen3-ASR-Flash's general recognition accuracy and further lower the barrier to entry for everyday users.

This article is from the WeChat official account Zhidx (ID: zhidxcom), written by Cheng Qian and edited by Xin Yuan. It is republished by 36Kr with authorization.