36Kr: Letting some people see the future first

MaskGCT has a wide range of applications, including overseas distribution of short dramas, digital humans, intelligent assistants, audiobooks, and assisted education.

Text | Liu Shiwu (36Kr Games)

On October 24, TTFun Technology announced that the large speech model "MaskGCT", jointly developed with The Chinese University of Hong Kong, Shenzhen, has been officially open-sourced in the Amphion system and is now available to users worldwide. Unlike traditional TTS models, MaskGCT combines a masked generative model with decoupled encoding of speech representations, and it can be quickly applied to tasks such as voice cloning, cross-language synthesis, and voice control.

Test results (Source: MaskGCT)

According to the developers, MaskGCT improves on existing large TTS models in speech similarity, quality, and stability, and it achieves state-of-the-art (SOTA) results on three TTS benchmark datasets. Its main features are as follows:

Ultra-realistic voice cloning in seconds: Given a 3-second audio sample, it can replicate any timbre, whether human, anime-style, or a whisper in the ear, and can fully reproduce intonation, style, and emotion.
More finely controllable speech generation: It can flexibly adjust the length, speed, and emotion of the generated speech, supports editing speech by editing the text, and maintains a high degree of consistency in rhythm, timbre, and other aspects.
High-quality multilingual speech dataset: MaskGCT is trained on Emilia, a 100,000-hour dataset jointly released by The Chinese University of Hong Kong, Shenzhen, TTFun Technology, and others. Emilia is one of the largest and most diverse high-quality multilingual speech datasets in the world, and it enables cross-language synthesis in six languages: Chinese, English, Japanese, Korean, French, and German.

MaskGCT was developed by members of the Joint Laboratory of Artificial Intelligence of CUHK(SZ) and TTFun Technology. As a large-scale zero-shot TTS model, MaskGCT uses a non-autoregressive masked generative Transformer and requires neither text-speech alignment supervision nor phoneme-level duration prediction. Its technical breakthrough lies in a new paradigm that combines the masked generative model with decoupled encoding of speech representations.

The MaskGCT large model translates an animated clip from "Black Myth: Wukong" (Video source: TTFun Qianyin)

According to the official experiments, MaskGCT outperforms the vast majority of current TTS models in speech quality, similarity, and intelligibility. Its performance improves as the model size and the amount of training data increase, and it can control the total duration of the generated speech.

MaskGCT has been released in the open-source system Amphion jointly developed by The Chinese University of Hong Kong, Shenzhen, and the Shanghai Artificial Intelligence Laboratory

MaskGCT is a two-stage model. In the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model. In the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. Both stages follow the mask-and-predict learning paradigm.
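The data flow of the two stages can be sketched roughly as follows. This is only an illustration of the interfaces, not MaskGCT's actual code: the function names, the token rate, the number of acoustic codebooks, and the vocabulary size are all assumptions, and the "models" are placeholders that return dummy token ids.

```python
from typing import List

TOKEN_RATE_HZ = 50   # assumed semantic-token frame rate (illustrative only)
NUM_CODEBOOKS = 4    # assumed number of acoustic codebook layers
VOCAB_SIZE = 1024    # assumed token vocabulary size

def text_to_semantic(text: str, prompt_semantic: List[int],
                     duration_s: float) -> List[int]:
    """Stage 1 placeholder: text + speech prompt -> semantic tokens.

    The caller fixes the output length, which is how total duration
    can be controlled in a non-autoregressive model.
    """
    n_tokens = round(duration_s * TOKEN_RATE_HZ)
    # A trained model would predict these ids; here they are dummies
    # derived from the inputs so the sketch is deterministic.
    seed = sum(map(ord, text)) + len(prompt_semantic)
    return [(seed + i) % VOCAB_SIZE for i in range(n_tokens)]

def semantic_to_acoustic(semantic: List[int],
                         prompt_acoustic: List[List[int]]) -> List[List[int]]:
    """Stage 2 placeholder: semantic tokens -> one token row per codebook.

    prompt_acoustic is unused by this dummy; a real model would use it
    to carry over the speaker's timbre.
    """
    return [[(tok + layer) % VOCAB_SIZE for tok in semantic]
            for layer in range(NUM_CODEBOOKS)]

def synthesize(text: str, prompt_semantic: List[int],
               prompt_acoustic: List[List[int]],
               duration_s: float) -> List[List[int]]:
    """Chain the two stages; a codec decoder would turn the result into audio."""
    semantic = text_to_semantic(text, prompt_semantic, duration_s)
    return semantic_to_acoustic(semantic, prompt_acoustic)
```

Under these assumed settings, calling `synthesize("hello", [1, 2, 3], [[0]], 2.0)` returns 4 rows of 100 tokens each, since 2 seconds at 50 tokens per second gives 100 positions.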

During training, MaskGCT learns to predict masked semantic or acoustic tokens from the given conditions and prompts. During inference, the model generates tokens of a specified length in parallel. Experiments on 100,000 hours of natural speech show that MaskGCT outperforms other existing zero-shot TTS systems in quality, similarity, and intelligibility.
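Parallel generation of a fixed-length token sequence can be illustrated with a toy iterative unmasking loop. The sketch below is in the style of generic mask-predict decoders and is not taken from the MaskGCT code: the cosine schedule, the confidence-based selection, and the random `toy_predictor` standing in for a trained model are all assumptions made for illustration.

```python
import math
import random
from typing import List, Tuple

MASK = -1  # hypothetical id marking a position that is still masked

def toy_predictor(tokens: List[int], vocab_size: int,
                  rng: random.Random) -> List[Tuple[int, float]]:
    """Stand-in for a trained model: a (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

def mask_predict_decode(length: int, vocab_size: int,
                        steps: int = 4, seed: int = 0) -> List[int]:
    """Start fully masked and fill in the most confident positions each step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # All positions are predicted at once (in parallel).
        preds = toy_predictor(tokens, vocab_size, rng)
        # Cosine schedule: how many positions stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_fill = max(1, len(masked) - keep_masked)
        # Commit only the highest-confidence predictions among masked slots.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_fill]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens
```

Because the output length is chosen before decoding starts, the caller decides the duration of the result, and the number of model calls depends on `steps` rather than on the sequence length.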

MaskGCT currently has a wide range of applications, including overseas distribution of short dramas, digital humans, intelligent assistants, audiobooks, and assisted education. To accelerate real-world deployment while ensuring safety and compliance, TTFun Technology has developed "TTFun Qianyin", an intelligent audio-visual platform for rapid multilingual translation. With one click, it translates an uploaded video into multiple language versions, covering subtitle repair and translation, voice translation, lip-sync, and more. This greatly reduces the high cost and long production cycles of manual translation, making the platform a new option for taking films, TV series, games, short dramas, and other content overseas.

Video source: TTFun Qianyin

According to the "2024 White Paper on the Overseas Expansion of Short Dramas", the overseas market reached 65 billion US dollars in 2023, about 12 times the size of the domestic market, and overseas short dramas are becoming a new blue-ocean market. Built on MaskGCT, TTFun Qianyin has the opportunity to help domestic short dramas "go global" faster and at lower cost, making the overseas expansion of Chinese cultural content more efficient.

This article was originally written by Liu Shiwu. Unauthorized reprinting is prohibited.

The large speech model "MaskGCT" has been officially open-sourced and will serve products such as short dramas, games, and digital humans.