
The ChatGPT moment for the translation industry: Meta unveils a new model that learns rare new languages from just a few examples

新智元 · 2025-11-11 20:10
The end of dialect barriers is here.

Among the world's more than 7,000 human languages, only a small fraction are recognized by modern speech technology. Now, that inequality may be broken. The Omnilingual ASR system released by Meta can recognize over 1,600 languages and quickly learn new ones from a small number of examples. Centered on open source and community co-creation, the technology gives every voice a chance to take the stage of AI.

It may be hard to imagine, but of the more than 7,000 living languages in the world, only a few hundred have received the "favor" of modern speech technology.

The vast majority of the world's speakers, from indigenous communities in Africa and the Amazon rainforest to elderly residents of rural towns who still speak ancient dialects, have remained on the sidelines of the digital age.

Voice assistants, automatic subtitles, real-time translation: the conveniences AI brings seem reserved for a handful of "mainstream" languages, while the remaining language communities are still kept outside the technological gate.

Now, this digital divide has a game-changer.

Meta's artificial intelligence research team recently released the Omnilingual ASR system, a family of AI models that can automatically recognize and transcribe speech in over 1,600 languages, allowing almost every human language to be "understood" by machines.

The system is shared with the world as open source and can be extended to new languages by the community, giving every voice a chance to step onto the AI stage.

1,600 languages are just the beginning

Omnilingual ASR sets a new record for language coverage in speech recognition: it supports over 1,600 languages, including 500 that had never been transcribed by any AI system before.

In contrast, OpenAI's open-source Whisper model supports only 99 languages; Omnilingual ASR increases that number by more than an order of magnitude.

For the many people worldwide who speak minority languages, this is a kind of digital redemption: for the first time, there is a real possibility that their native languages can be fluently understood by AI.

The system's recognition performance is at the leading level for many languages.

According to figures provided by Meta, 78% of the more than 1,600 tested languages achieve a character error rate (CER) below 10%. Among languages trained with more than 10 hours of speech data, that proportion rises to 95%.

Even among low-resource languages with extremely scarce training material, 36% achieve a CER below 10%.

These figures mean that Omnilingual ASR is not only broad in coverage but can also deliver practically usable, high-quality transcriptions for most languages.
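For readers unfamiliar with the metric: CER is simply the character-level edit distance between the model's output and the reference transcript, divided by the reference length. A minimal Python sketch (not part of Meta's release) shows the computation:

```python
# Minimal sketch: character error rate (CER) as character-level edit distance
# divided by the reference length. A CER below 0.10 means under 10% of
# characters are wrong relative to the reference transcript.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # Standard Levenshtein dynamic program over characters.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(cer("omnilingual", "omnilinguol"))  # 1 substitution over 11 chars ≈ 0.09, i.e. under 10%
```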

However, 1,600 languages are not the end for Omnilingual ASR.

Its greater significance lies in breaking away from the fixed, rigid language lists of previous ASR models, shifting language coverage from a fixed quantity to something scalable.

Omnilingual ASR borrows an idea from large language models (LLMs) and introduces a zero-shot, in-context learning mechanism.

This means that even if a language is not initially on the supported list, users can teach the model a new language on the fly at inference time by providing a few audio clips in that language along with their transcriptions as examples.

There is no need to spend months collecting large-scale corpora or running specialized deep-learning training; a handful of few-shot examples is enough to pick up a new language.
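To make the idea concrete, here is a minimal, hypothetical Python sketch of what few-shot conditioning could look like from a caller's perspective. The class and method names are illustrative stand-ins, not the actual Omnilingual ASR API; consult the official repository for the real interface.

```python
# Hypothetical sketch only: illustrates the call shape of few-shot, in-context ASR.
# None of these names come from Meta's release.
from dataclasses import dataclass
from typing import List

@dataclass
class FewShotExample:
    audio_path: str   # short clip in the new, previously unsupported language
    transcript: str   # its ground-truth transcription in the target script

class InContextASR:
    """Stand-in for an ASR model with an LLM-style text decoder."""

    def transcribe(self, audio_path: str, examples: List[FewShotExample]) -> str:
        # Conceptually: the (audio, text) example pairs are encoded and prepended
        # to the decoder's context, conditioning it on the new language before it
        # decodes the target utterance. Here we only illustrate the data flow.
        prompt = [(ex.audio_path, ex.transcript) for ex in examples]
        return f"<transcription conditioned on {len(prompt)} in-context examples>"

examples = [
    FewShotExample("clips/sample_01.wav", "example transcription one"),
    FewShotExample("clips/sample_02.wav", "example transcription two"),
]
print(InContextASR().transcribe("clips/new_utterance.wav", examples))
```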

With this innovative paradigm, the potential language coverage of Omnilingual ASR expands dramatically.

Officially, Meta states that the system can in theory be extended to more than 5,400 languages, covering nearly every human language with a written record.

However obscure a spoken language may be, as long as it has a writing system and a few examples, it has a chance of being captured and recorded by Omnilingual ASR.

For AI speech recognition, this is a paradigm shift from static and closed to dynamic and adaptive: the model is no longer restricted to the language list fixed at training time, but becomes a flexible, open framework that encourages local communities to add new languages themselves.

For communities that have long been absent from the technological landscape, this is like holding a key with which to unlock new languages at any time, on their own.

Open source and community: breaking the language divide

Another prominent feature of Omnilingual ASR is its open-source, community-driven nature.

Meta has chosen to fully open-source this large-scale multilingual ASR system on GitHub, releasing the model and code under the Apache 2.0 license.

Researchers, developers, and enterprises alike can use, modify, and commercialize the model free of charge, without worrying about cumbersome licensing restrictions.

Compared with the "semi-open" releases of some earlier AI models, which attach additional terms, Omnilingual ASR's openness is refreshingly straightforward and sets an example for the democratization of the technology.

To benefit all language communities, Meta has released not only the model but also a huge multilingual speech dataset: the Omnilingual ASR corpus.

The corpus contains transcribed speech in 350 under-resourced languages, covering many that were previously "silent" in the digital world.

All of the data is provided under the CC-BY license.

Developers and scholars can use these valuable resources to train and improve speech recognition models suitable for local needs.

This will help languages that lack large-scale annotated corpora cross the data threshold, giving "small" languages a chance to do big things.

The unprecedented breadth of languages Omnilingual ASR covers is underpinned by global cooperation.

During the development process, Meta collaborated with local language organizations and communities to collect a large number of speech samples.

Meta worked with organizations such as the Mozilla Foundation's Common Voice project and Lanfrica/NaijaVoices in Africa to recruit native speakers in remote areas to record speech.

To keep the data diverse and close to real life, the recordings often use open-ended prompts that let speakers talk freely about everyday topics.

All participants were fairly compensated, and data collection followed culturally sensitive guidelines.

This community co-creation model gives Omnilingual ASR deep linguistic and cultural grounding, and it also reflects the project's humanistic stance: technology should not condescend to "save" minority languages, but should work with local communities so that they become the protagonists of their own languages' digitization.

On the technical side, Meta provides a family of models at different scales to suit diverse scenarios: from lightweight models with roughly 300 million parameters (suited to low-power devices such as phones) to models with up to 7 billion parameters that push for maximum accuracy.

The architecture uses a self-supervised, pre-trained wav2vec 2.0 speech encoder (scaled up to 7 billion parameters) to extract general audio features, combined with two decoder strategies: traditional CTC decoding, and an LLM-style Transformer text decoder that gives the model its strong in-context learning ability.
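As a rough conceptual sketch (in PyTorch, and not Meta's actual implementation), the two decoder strategies can be thought of as follows: the encoder's frame features feed either a per-frame CTC head or an autoregressive Transformer decoder, and it is the latter that makes in-context prompting possible.

```python
import torch
import torch.nn as nn

# Conceptual sketch only: a shared speech encoder's frame features feeding
# either a CTC head or an autoregressive text decoder.
vocab_size, hidden = 1000, 512
frames = torch.randn(1, 200, hidden)   # stand-in for wav2vec 2.0 encoder output

# Strategy 1: CTC — one linear projection per frame, decoded greedily.
ctc_head = nn.Linear(hidden, vocab_size + 1)       # +1 for the CTC blank token
ctc_logits = ctc_head(frames)
greedy_ids = ctc_logits.argmax(dim=-1)             # then collapse repeats and drop blanks

# Strategy 2: LLM-style decoder — attends over the encoder frames and emits
# tokens autoregressively; the text prefix can include few-shot examples.
decoder_layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
prefix_embeddings = torch.randn(1, 10, hidden)     # embedded prompt / previous tokens
dec_out = decoder(tgt=prefix_embeddings, memory=frames)
next_token_logits = nn.Linear(hidden, vocab_size)(dec_out[:, -1])
```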

A model at this scale needs an enormous amount of data to support it: Omnilingual ASR was trained on more than 4.3 million hours of speech audio spanning 1,239 languages.

This is one of the largest and most diverse speech training corpora ever assembled. That scale, combined with the long-tail language data contributed by communities, lets the model learn robust speech representations across languages and gives it a solid basis for generalizing even to languages it has never seen.

As the research paper points out, "No model can cover all languages in the world in advance, but Omnilingual ASR allows the community to continuously expand this list with its own data."

This means speech AI now has the capacity to grow on its own and to co-evolve with the rich diversity of human languages.

When technology sets aside its arrogance and embraces diversity through open source, when every language has a chance to be heard and recorded, and when no language is forgotten in the digital world, we move one step closer to truly eliminating the language divide, and genuinely borderless human connection can begin.

Reference materials:

https://ai.meta.com/blog/omnilingual-asr-advancing-automatic-speech-recognition

This article is from the WeChat official account "New Intelligence Yuan". The author is New Intelligence Yuan. It is published by 36Kr with authorization.