Meituan LongCat-Next: Turn images, sounds, and texts into tokens. Then what?
Meituan recently released a significant multimodal research result: LongCat-Next.
This is a discrete native autoregressive multimodal large model built on the LongCat-Flash-Lite MoE architecture. It has a total of 68.5B parameters and only 3B active parameters, capable of simultaneously processing three modalities: text, images, and audio within a unified framework.
The emergence of this model directly challenges a long-standing perception in the multimodal field: that discretizing visual information into tokens leads to severe loss of detail, making such models naturally inferior to continuous-feature models on fine-grained understanding tasks such as OCR and complex chart analysis.
LongCat-Next is currently the first unified multimodal model to push this fine-grained visual understanding ability, within a purely discrete framework, to a level comparable to dedicated continuous models; it is on par with the dedicated vision model Qwen3-VL-A3B at the same parameter count.
In image generation, its long-text understanding and text-rendering capabilities hold significant advantages over comparable unified models, and its overall generation quality can compete with the dedicated text-to-image model Flux-dev.
On the audio side, its speech recognition and understanding capabilities surpass models of the same scale, such as Gemini 3.1 Flash-Lite preview and MiMo-Audio.
LongCat-Next also resolves the optimization conflict between visual understanding and generation.
Experiments in the paper show that under the same token budget, joint training of understanding and generation does not drag either task down; on the contrary, training signals from the understanding task positively affect generation quality, contradicting the practical experience of most unified models.
After all modalities are jointly trained as discrete tokens in the same embedding space, a cross-modal semantic blending phenomenon emerges spontaneously within the model: visual tokens and text tokens form an intertwined distribution in the representation space.
Paper address: https://github.com/meituan-longcat/LongCat-Next/blob/main/tech_report.pdf
GitHub: https://github.com/meituan-longcat/LongCat-Next
HuggingFace: https://huggingface.co/meituan-longcat/LongCat-Next
Blog: https://longcat.chat/longcat-next/intro
Next, let's intuitively experience its capabilities through several specific cases.
Get a sneak peek: Initial experience with text, image, and audio modalities
Let's first test its visual understanding ability.
We uploaded a photo of a flower arrangement in a La La Land-style color scheme and asked LongCat-Next to identify the plants in it and describe their characteristics.
Prompt: Which plants are included in the flower bouquet in the picture, and what are their respective characteristics?
The model accurately identified yellow spray roses, purple lisianthus, sage-like herbs, and foliage plants, gave fairly detailed descriptions of their colors and shapes, and even volunteered an analysis of the bouquet's overall color scheme.
We then used three landmark buildings of different styles to test the model's ability to recognize Chinese city landmarks.
Prompt: Where are these three places?
LongCat-Next accurately identified the "Wangjing Eye" in Beijing, the Bank of China Tower in Guangzhou, and the Nanjing Youth Olympic Center, and had some knowledge of the background information of each landmark building.
For example, it mentioned the Guangzhou Bank of China Tower's online nickname ("Cockroach Tower") and its distinctive shape, as well as details such as the Nanjing Youth Olympic Center being designed by Zaha Hadid.
The following graphic reasoning question not only examines the model's image understanding ability but also involves the induction of abstract rules.
Prompt: Which option should be chosen for this question?
LongCat-Next grasped the key: each figure consists of two elements, an outer frame and internal black dots. By comparing across the examples, it uncovered the hidden rule "number of sides of the outer frame − number of black dots = 2" and settled on answer B.
Now let's look at its image generation ability.
The sunrise mountain lake generated by LongCat-Next approaches the texture of professional landscape photography in its composition and light-and-shadow transitions.
Prompt: A crystal clear mountain lake reflecting snow-capped peaks at sunrise. Still water, mirror-like reflection, pink and gold sky, pine trees along the shore.
The following case mainly examines the text rendering ability. In the generated product image of the mug, the text is neither deformed nor garbled, presenting a minimalist style overall.
Prompt: A white mug on a wooden table with "LongCat-Next" printed on it in clean font. Simple background, morning light from a window, minimalist product photography.
The Santorini scene generated by LongCat-Next has the most prominent color performance. The blue domes, white walls, bougainvillea, and the sunset form a strong and harmonious color contrast, creating a very atmospheric scene.
Prompt: Santorini white buildings with blue domes overlooking the Aegean sea at sunset. Warm golden light, bougainvillea flowers, calm ocean, iconic Greek island view.
LongCat-Next also supports output at arbitrary resolutions; even under extreme aspect-ratio composition requirements, it generates stably.
Beyond vision, LongCat-Next also incorporates audio into a unified discrete autoregressive framework.
Its audio understanding responds to sound signals accurately and coherently, just as it processes text, covering both speech recognition and semantic understanding of complex scenarios.
For example, when asked a classic logic puzzle in Sichuan dialect, LongCat-Next showed no recognition deviation or semantic loss: the Sichuanese speech was accurately converted into semantic content for reasoning and flowed smoothly into the subsequent logical analysis.
This suggests that LongCat-Next's discrete audio representation is quite robust: acoustic variations such as dialects and accents do not become breakpoints in the understanding process.
Given an environmental recording, it correctly inferred from the continuous, rhythmic clacking and steam-whistle sounds that the recording was made near a railway station, subway station, or railway track.
It can be seen that LongCat-Next can complete scene - level semantic inferences by synthesizing multiple acoustic clues.
It can also perceive the emotion behind the words. In one male-voice recording, LongCat-Next not only understood the literal content but also judged from the raised volume and rapid speech rate that the speaker was agitated and angry.
In addition to "understanding" audio, LongCat-Next also has speech synthesis and voice cloning capabilities.
Given a reference audio of Mandarin with a strong Cantonese accent, it was asked to synthesize new target content while retaining the speaker's voice characteristics.
The synthesized audio reproduced the speaker's vocal texture, and the highly recognizable Cantonese-accented Mandarin flavor was fully retained.
Switching to an English scenario with another reference audio, the model was asked to clone the voice and read out specified content.
LongCat-Next accurately captured the speaker's vocal characteristics and accent habits; the synthesized voice was highly similar to the original, and the target content was delivered clearly and accurately.
When "everything" becomes tokens, the model begins to truly unify the world
Today's large models still take "predicting the next token" as their core modeling paradigm, but that token has long belonged only to language. LongCat-Next extends the concept to the multimodal field with its Discrete Native Autoregressive (DiNA) framework.
Under this framework, continuous signals such as images and audio are converted into discrete tokens that share a representation space with text. With a unified token representation, the need for dedicated per-modality architectures is greatly reduced, and visual understanding, visual generation, and audio processing are all unified into a single autoregressive prediction process at the core modeling level.
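The core of this unification can be sketched in a few lines: per-modality token IDs are offset into one shared vocabulary, and the training objective becomes ordinary next-token prediction over the merged sequence. The vocabulary sizes, offsets, and token IDs below are invented for illustration and are not LongCat-Next's actual layout.

```python
# Hypothetical vocabulary layout (NOT LongCat-Next's real one):
TEXT_VOCAB = 50_000      # assumed text vocabulary size
IMAGE_CODEBOOK = 8_192   # assumed visual-tokenizer codebook size
AUDIO_CODEBOOK = 4_096   # assumed audio-tokenizer codebook size

IMG_OFFSET = TEXT_VOCAB                   # image IDs start after text IDs
AUD_OFFSET = TEXT_VOCAB + IMAGE_CODEBOOK  # audio IDs start after image IDs

def to_unified(text_ids, image_codes, audio_codes):
    """Map per-modality token IDs into one shared ID space and
    concatenate them into a single autoregressive sequence."""
    seq = list(text_ids)
    seq += [IMG_OFFSET + c for c in image_codes]
    seq += [AUD_OFFSET + c for c in audio_codes]
    return seq

def next_token_pairs(seq):
    """(input, target) pairs for 'predict the next token' -- identical
    for every modality, since everything lives in one ID space."""
    return list(zip(seq[:-1], seq[1:]))

seq = to_unified(text_ids=[12, 873], image_codes=[5, 6], audio_codes=[40])
print(seq)                    # [12, 873, 50005, 50006, 58232]
print(next_token_pairs(seq))
```

Because image and audio tokens are just integer IDs in the same space as text, the transformer needs no modality-specific prediction head: one softmax over the merged vocabulary serves all three.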
Through paired tokenizers, multimodal capabilities are extended into a native framework analogous to language modeling.
How can high-dimensional audio-visual signals be losslessly converted into tokens and then restored?
The first question is, can an image really be converted into tokens?
Language is naturally discrete, but vision is not. An image is a high-dimensional, continuous, information-dense signal; once compressed into a limited number of tokens, it risks both semantic loss (the model cannot understand) and detail loss (the model cannot draw).
LongCat-Next abstracts this problem into a core principle: semantic completeness. After tokenization, the judgments the model makes from the tokens should be as close as possible to those it would make directly from the original image.
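Read this way, semantic completeness is a measurable criterion: compare the output distribution a downstream model produces from the original image with the one it produces from the tokenized-then-detokenized image. A minimal sketch using KL divergence as the gap measure; the metric choice and the toy probability values are ours, not the paper's.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete probability distributions.
    Small values mean the token-based judgments stay close to the
    image-based ones."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy class distributions from a hypothetical downstream classifier:
p_original = [0.70, 0.20, 0.10]  # predictions from the raw image
p_tokens   = [0.65, 0.25, 0.10]  # predictions from the detokenized image

gap = kl_divergence(p_original, p_tokens)
print(f"semantic gap: {gap:.4f}")  # near zero => semantically complete
```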
To convert high-dimensional visual signals into discrete tokens with minimal information loss, LongCat-Next designed a visual tokenizer called dNaViT (Discrete Native Resolution Vision Transformer).
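The report details dNaViT's design; as background, the basic mechanism that discrete visual tokenizers generally share is vector quantization: each continuous patch feature is mapped to the index of its nearest codebook entry, and that index is the "visual token". A textbook sketch with toy 2-D features, not dNaViT's actual architecture:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """For each continuous feature vector, return the index of the
    nearest codebook entry -- the discrete visual token."""
    return [min(range(len(codebook)), key=lambda i: sq_dist(f, codebook[i]))
            for f in features]

def dequantize(indices, codebook):
    """Restoration step for the decoder: look the vectors back up."""
    return [codebook[i] for i in indices]

# Toy 4-entry codebook and three patch features (invented values):
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patches  = [(0.1, -0.2), (0.9, 0.95), (0.2, 0.8)]

tokens = quantize(patches, codebook)
print(tokens)                         # [0, 3, 2]
print(dequantize(tokens, codebook))
```

The gap between `patches` and `dequantize(tokens, codebook)` is exactly the detail loss the article describes; a tokenizer aiming at semantic completeness must keep that gap small for the judgments that matter.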