Cutting the Standalone Encoder: Gemma 4 12B Overthrows the Multimodal "Splicing Design"
Why does a 12B model make the 26B MoE nervous?
On June 4, 2026, Google released Gemma 4 12B. The official positioning is quite conservative: it is a mid-range model between E4B and the 26B MoE, capable of running on a 16GB laptop, and is open-sourced under the Apache 2.0 license.
A tweet from DeepMind scientist Michael Tschannen revealed another intention. "In the past few years, my research focus has been on unifying cross-modal models and training paradigms. The Gemma 4 12B released today directly processes raw text, image, and audio inputs."
The keyword is "directly". Not "supports", not "integrates". One word sums it up: directly.
The vast majority of tech self-media focus only on two selling points, the 16GB laptop and the free open-source license, and completely ignore the architectural innovation in this release, the part that actually upends the multimodal field. That innovation is also the core reason the 12B can threaten the 26B MoE.
Most reports interpret "encoder-free" as a subtraction: replacing a ViT that weighs hundreds of megabytes with a 35M lightweight embedding, which cuts video memory from 15GB to 9GB, just enough to fit on a consumer-grade laptop. This interpretation is correct, but it misses something more fundamental.
If the only goal were to reduce video memory, Google could simply have quantized and distilled the existing 26B MoE. There would have been no need to rebuild the entire multimodal architecture from scratch. Gemma 4 12B is a redesign. Its aim is not to make the model smaller but to let raw audio and visual input reach the LLM directly, without loss.
The Tower of Babel Dilemma of Traditional Multimodal Models: Encoder Translation Inevitably Loses Information
Over the past three years, mainstream multimodal models such as LLaVA, GPT-4V, and even Gemma 4 26B have essentially been patchwork monsters. Their internal structures are quite similar:
A ViT encoder (usually 12-24 layers) cuts the image into patches and extracts feature vectors; a Conformer or Whisper encoder converts the sound waves into Mel spectrograms and extracts acoustic features. Both outputs then pass through an alignment layer that projects them into the LLM's text vector space. Only then does the language model start processing this converted information.
This architecture works, but it has a structural flaw: the information has undergone at least one round of compression and conversion before reaching the LLM. The output of the ViT is a high-dimensional feature vector, and the original pixels no longer exist; the output of the Conformer is an acoustic feature representation, and the original sound waves no longer exist. The LLM receives compressed, refined high-level features that have lost much of the image's spatial detail and the audio's temporal texture.
The three components are also optimized for separate objectives. The ViT learns image classification, the Conformer learns speech recognition, and the LLM learns text prediction. Splicing them together requires additional training to bridge the differences, and catastrophic forgetting, where the model "forgets how to speak after learning to look at pictures", occurs repeatedly.
The encoder itself is not the problem. The problem is the architectural rule that every input must pass through a layer of translation. Once compression and conversion occur, the information loss is irreversible.
Gemma 4 12B doesn't intend to repair this pipeline. It directly tears it down.
On the vision side, it abandons the traditional ViT encoder and uses a 35M lightweight embedding module instead. A single matrix multiplication, a 2D coordinate embedding, and a normalization step map the image patches directly into the same vector space as text tokens, after which they enter the attention computation of the Transformer backbone. Feature extraction becomes a direct projection.
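To make "a direct projection" concrete, here is a minimal NumPy sketch of the three steps the paragraph names: one matrix multiplication, a 2D coordinate embedding, and a normalization. It is not Google's implementation. The function name `embed_patches`, the patch size, the model width, the additive row-plus-column position scheme, and the choice of RMS normalization are all assumptions made for illustration, and random weights stand in for trained parameters.

```python
import numpy as np

def embed_patches(image, patch=16, d_model=64, seed=0):
    """Map an (H, W, C) image to a sequence of d_model-sized tokens.

    Hypothetical sketch: all weights are random stand-ins for trained ones.
    """
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    gh, gw = h // patch, w // patch  # patch grid height and width

    # Cut the image into non-overlapping patches and flatten each one.
    patches = (
        image[: gh * patch, : gw * patch]
        .reshape(gh, patch, gw, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, patch * patch * c)
    )

    # Step 1: a single matrix multiplication into the token space.
    w_proj = rng.normal(0, 0.02, (patch * patch * c, d_model))
    tokens = patches @ w_proj

    # Step 2: 2D coordinate embedding (assumed: row embedding + column embedding).
    row_emb = rng.normal(0, 0.02, (gh, d_model))
    col_emb = rng.normal(0, 0.02, (gw, d_model))
    pos = (row_emb[:, None, :] + col_emb[None, :, :]).reshape(gh * gw, d_model)
    tokens = tokens + pos

    # Step 3: normalization (assumed: RMS norm without a learned scale).
    rms = np.sqrt(np.mean(tokens ** 2, axis=-1, keepdims=True) + 1e-6)
    return tokens / rms

# A 32x48 RGB image with 16x16 patches yields a 2x3 grid, i.e. 6 tokens.
tokens = embed_patches(np.ones((32, 48, 3)))
print(tokens.shape)
```

The point of the sketch is what is absent: there are no stacked encoder layers between the pixels and the token sequence, only one linear map plus position and scale bookkeeping.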
The audio processing is even more radical. The audio encoder is completely removed, and the raw audio signal is directly projected into the vector space of text tokens. There is no spectrum conversion and no acoustic feature extraction. The raw sound waves directly enter the model.
The traditional architecture is "process separately and then splice", while Gemma 4 12B is "unified processing of a mixed token sequence". Image tokens, audio tokens, and text tokens are arranged in order. After entering the unified Transformer backbone, they are processed by the same set of attention mechanisms, sharing the weights and inference logic of the backbone network.
The projection layer itself varies with the characteristics of each modality: vision requires 2D coordinate embedding, and audio requires temporal slicing. But once inside the backbone, the three modalities share exactly the same representation space and computation.
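The "mixed token sequence" idea can be sketched in a few lines of NumPy. The example below slices a raw waveform into fixed-length frames, projects each frame linearly, concatenates the result with image and text tokens, and runs the whole sequence through one shared attention layer. Everything here is an illustrative assumption rather than the model's actual design: the frame length of 160 samples, the width `D_MODEL`, the helper names, the random weights, and the single-head causal attention. The image tokens are random stand-ins for the output of a vision projection.

```python
import numpy as np

D_MODEL = 32
rng = np.random.default_rng(0)

def embed_audio(waveform, frame=160):
    """Temporal slicing: cut a raw 1D waveform into frames, then project each."""
    n = len(waveform) // frame
    frames = waveform[: n * frame].reshape(n, frame)
    w_proj = rng.normal(0, 0.02, (frame, D_MODEL))  # hypothetical weights
    return frames @ w_proj

def embed_text(token_ids, vocab=1000):
    """Ordinary embedding-table lookup for text tokens."""
    table = rng.normal(0, 0.02, (vocab, D_MODEL))  # hypothetical weights
    return table[np.asarray(token_ids)]

def self_attention(x, wq, wk, wv):
    """Single-head causal self-attention; the same weights serve every modality."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # hide future tokens
    scores = np.where(mask, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

image_tokens = rng.normal(0, 1, (6, D_MODEL))       # stand-in for projected patches
audio_tokens = embed_audio(rng.normal(0, 1, 1600))  # 1600 samples -> 10 frames
text_tokens = embed_text([5, 42, 7])

# One sequence, one backbone: image, audio, and text tokens in order.
sequence = np.concatenate([image_tokens, audio_tokens, text_tokens], axis=0)
wq, wk, wv = (rng.normal(0, 0.1, (D_MODEL, D_MODEL)) for _ in range(3))
out = self_attention(sequence, wq, wk, wv)
print(sequence.shape, out.shape)
```

Only the entry points differ per modality. After the concatenation, the attention layer has no notion of which rows came from pixels, samples, or words.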
This is what Tschannen means by "unification". "Supporting multimodality" at the functional level is too shallow. "All modalities sharing the same set of representation spaces" at the architectural level is the real thing.
Real-World Tests Approaching the 26B MoE: Architectural Efficiency Is Rewriting the Rules of the Game
The test data from atomic.chat is quite illustrative: on an RTX 4090, the 12B model generates 8.9k tokens of physical simulation code with only 9GB of video memory, and its performance approaches that of the 26B MoE in a 15GB configuration. The parameter gap between the two is 14 billion. The 12B model uses about 60% of the flagship's video memory (9GB versus 15GB) and achieves more than half of its speed, with almost no difference in code generation quality or physical reasoning ability.
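The headline ratios are easy to check from the figures quoted above. The short calculation below uses only the numbers in the paragraph (12B and 26B parameters, 9GB and 15GB of video memory); the variable names are just for illustration.

```python
# Figures as quoted in the atomic.chat comparison above.
params_12b, params_moe = 12e9, 26e9
vram_12b_gb, vram_moe_gb = 9, 15

param_gap_billion = (params_moe - params_12b) / 1e9
vram_ratio = vram_12b_gb / vram_moe_gb

print(f"parameter gap: {param_gap_billion:.0f}B")
print(f"12B video memory as a share of the 26B MoE's: {vram_ratio:.0%}")
```

In other words, 9GB is three fifths of 15GB, a saving of 6GB.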
Until now, the big vendors have competed by stacking MoE experts and raising parameter counts to improve performance. Gemma 4 12B shows that optimizing the architecture can also bring a model close to flagship results, which directly shakes the industry's habitual R&D mindset of "winning by stacking parameters". This is the root cause of the unease along the 26B-class large-model route.
The significant reduction in video memory is due in part to the encoder-free design. There is no additional memory overhead from an independent encoder, and no feature alignment loss between encoder and backbone. However, performance approaching that of the 26B is the result of multiple optimizations. The training data mix and gains in architectural efficiency both contribute, and the result cannot be attributed to a single factor.
The real signal is that Gemma 4 12B proves the mass-production feasibility of the "encoder-free unified architecture" in medium-scale models.
With that verified, the implications spread in several directions.
Lightweight fine-tuning methods such as LoRA can be applied directly to the Transformer backbone. In theory, this optimizes all modalities in a single loop: there is no need to maintain the encoder and the backbone separately, and no encoder-backbone alignment to worry about. The actual fine-tuning results still need independent verification, and Google itself has not released an official ablation experiment.
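For readers unfamiliar with LoRA, the sketch below shows the general technique on a single linear layer: the original weight stays frozen, and a trainable low-rank product is added on top. This is a generic NumPy illustration, not Gemma-specific code. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the layer size are hypothetical, and no training loop is included.

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer plus a trainable low-rank update (generic LoRA)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = weight.shape
        self.weight = weight                       # frozen backbone weight
        self.a = rng.normal(0, 0.01, (d_in, rank)) # trainable down-projection
        self.b = np.zeros((rank, d_out))           # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def __call__(self, x):
        # Output = frozen path + scaled low-rank path.
        return x @ self.weight + self.scale * (x @ self.a @ self.b)

    def trainable_params(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64))        # stand-in for one backbone projection
layer = LoRALinear(w)

# A mixed sequence of image, audio, and text tokens passes through the same layer.
mixed_tokens = rng.normal(size=(19, 64))
out = layer(mixed_tokens)
print(out.shape, layer.trainable_params(), w.size)
```

Because every modality flows through the same backbone weights, one adapter of this kind touches the path used by images, audio, and text alike. Whether that translates into balanced gains across modalities is exactly the open question the paragraph flags.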
The change in the hardware threshold is more tangible. Multimodal inference has come down from "dual-channel workstations" to "a single consumer-grade graphics card". Running natively multimodal inference in 9GB of video memory is what decides whether the model can enter ordinary developers' workflows.
There is also room for imagination at the ecosystem level. Architecturally, the unified embedding space leaves an extension point open: in theory, adding a new modality only requires a dedicated projection layer that connects to the backbone. However, "connectable" and "usable" are two different things. Supporting training data, task design, and targeted optimization are all indispensable. "Adding a new modality at zero cost" is an illusion; "architecturally possible" is the more accurate description.
Boundaries and Watersheds: Architectural Leadership Does Not Equal Omnipotence, but the Direction is Set
It must be said honestly that on complex sequential tasks of more than three steps, and in scenarios that chain multiple tools, Gemma 4 12B still suffers from problems such as planning hallucinations and drifting off course. This is no reason to dismiss it; it only shows that the model is in the transition from "able to converse" to "able to do things".
The touchscreens of early smartphones were not very responsive either, but the direction was set. The encoder-free unified architecture has been validated, and the remaining engineering optimization is just a matter of time.
The release of Gemma 4 12B can easily be drowned in the information noise of "another model has been released". But if you shift your focus from the parameter table to the architecture diagram, you will see a clear signal:
The R&D logic of multimodal AI is shifting from "designing dedicated converters for each modality and then splicing" to "all modalities sharing the same set of attention mechanisms".
The 12B parameter count is not the point. The point is the proof that the "grand unification" of multimodality does not require stacking modules. A unified representation space is enough.
Two years from now, when the industry looks back on multimodal progress in 2026, the benchmark scores of Gemma 4 26B will be forgotten, and the architectural choice of Gemma 4 12B will be cited again and again. It is the first to verify the mass-production feasibility of the "encoder-free unified architecture" on a medium-scale, commercially available, and locally deployable model.
The 26B has won the current performance battle, while the 12B has rewritten the underlying rules of future multimodality.
This article is from the WeChat official account “AI Singing the Opposite Tune”, author: Li Xiaowen. It is published by 36Kr with authorization.