
Right after Ilya made his prediction, the world's first native multimodal architecture, NEO, has arrived: vision and language are completely integrated.

QbitAI (量子位) 2025-12-05 15:04
Matching GPT-4V with 1/10 of the data

When Ilya Sutskever recently publicly declared that "the era of relying solely on the Scaling Law is over" and asserted that "the future of large models lies not in simply getting bigger in scale, but in becoming smarter in architecture," the entire AI community realized that a paradigm shift was taking place.

Over the past few years, the industry seemed obsessed with building stronger models by piling up more data, larger parameters, and more powerful computing resources. However, this approach is nearing the point of diminishing returns.

Top AI experts like Ilya and LeCun have all pointed out that true breakthroughs must come from fundamental innovations at the architectural level, rather than patching up the existing Transformer pipeline.

At this crucial juncture, a new species from a Chinese research team emerged:

The world's first open-source native multimodal architecture (native VLM) that can be deployed at scale, named NEO.

It should be noted that previous mainstream multimodal large models, such as the well-known GPT-4V and Claude 3.5, are essentially playing a game of splicing at their core.

What does this mean?

It means grafting a pre-trained visual encoder (such as ViT) onto a powerful large language model through a small projection layer.

Although this modular approach achieves multimodality, vision and language remain two parallel lines, only roughly brought together at the data level.
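To make the "splicing" concrete, here is a minimal sketch (in PyTorch) of how a typical modular VLM of this kind is wired together; the class name, dimensions, and modules are illustrative placeholders, not the actual code of GPT-4V or any other specific model.

```python
import torch
import torch.nn as nn

class ModularVLM(nn.Module):
    """Illustrative "spliced" VLM: pre-trained ViT + small projector + LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # pre-trained ViT, often frozen
        self.projector = nn.Linear(vision_dim, llm_dim)  # the thin "bridge" layer
        self.llm = llm                                   # pre-trained language model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        # 1. The ViT maps the image into its own feature space.
        vision_feats = self.vision_encoder(pixel_values)   # (B, N_img, vision_dim)
        # 2. A small projection pushes those features into the LLM's embedding space.
        vision_tokens = self.projector(vision_feats)        # (B, N_img, llm_dim)
        # 3. Visual and text tokens are simply concatenated and handed to the LLM,
        #    which is the "roughly brought together" step described above.
        return self.llm(torch.cat([vision_tokens, text_embeds], dim=1))
```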

The joint research from SenseTime and universities such as Nanyang Technological University aims to completely overturn all of this at the root.

With NEO, the large model can not only see and speak, but also inherently understand that vision and language are two sides of the same coin.

What's even more astonishing: with this native multimodal architecture, NEO uses only one-tenth of the training data yet has caught up with, or even surpassed, flagship competitors that rely on vast amounts of data and complex module stacking in multiple key evaluations!

So, how exactly does NEO achieve this? Let's continue to find out.

Why does it have to be a native architecture?

Before delving into the principle, we need to understand the current situation of multimodality.

As we mentioned earlier, the current mainstream modular architecture actually has three insurmountable technical gaps.

Firstly, there is the efficiency gap.

The training process of modular models is extremely complex, usually divided into three steps: first pre-training the visual encoder and the language model separately, then getting the two to communicate through an alignment stage, and finally, often, an instruction fine-tuning stage.

This process is not only time-consuming and labor-intensive, but also costly. Moreover, each stage may introduce new errors and inconsistencies. The knowledge of vision and language is fragmented in different "rooms" and can only barely cooperate by constantly "passing notes".

Secondly, there is the ability gap.

The visual encoder has a strong inductive bias from the very beginning of its design. For example, it usually requires the input image to have a fixed resolution (such as 224x224) or to be forcibly flattened into a one-dimensional token sequence.

This processing method may be sufficient for understanding the overall composition of a painting, but it is inadequate when dealing with scenarios that require capturing fine textures, complex spatial relationships, or arbitrary aspect ratios (such as a long-format image or an engineering drawing).

This is because what the model sees is only an oversimplified, rigidly structured outline.

Finally, there is the fusion gap.

The mapping connecting vision and language almost always stays at a simple surface level and cannot achieve deep semantic alignment. This results in the model often struggling when dealing with tasks that require fine-grained visual understanding.

For example, when asked to describe a complex chart, it may confuse the legend and the data; when asked to understand a spatially oriented instruction like "Put the second red apple on the left into the basket on the right", it may get the left-right direction or the quantity wrong.

Fundamentally, it is because within the model, visual information and language information have never been placed in the same semantic space for genuinely deep, integrated reasoning.

That's why the research team behind NEO started from first principles and directly created a unified model where vision and language are inherently connected from the very beginning:

This model no longer has a distinction between a visual module and a language module, but only has a unified brain specifically designed for multimodality.

Looking back at the history of AI, from RNN to Transformer, every real leap has come from fundamental innovations at the architectural level.

Over the past few years, the industry has been trapped in the path dependence of the "scale-only theory". Only now have a group of top researchers, represented by Ilya, collectively issued a warning: the inherent limitations of the Transformer architecture have become increasingly prominent, and relying solely on stacking computing power and data cannot lead to true general intelligence.

The birth of NEO is just in time. With its simple and unified native architecture, it strongly proves that the competitiveness of next-generation AI lies in how smart the architecture is.

The three native technologies behind NEO

The core innovation of NEO is reflected in three underlying technical dimensions, which together build the native capabilities of the model.

Firstly, Native Patch Embedding.

Traditional models typically apply a discrete tokenizer up front, or attach a vision encoder, to compress image information into semantic tokens.

NEO directly abandons this step. It designs a lightweight patch embedding layer that, through a two-layer convolutional neural network, directly builds a continuous and high-fidelity visual representation from pixels in a bottom-up manner.

This is like enabling AI to learn to directly perceive light and details with its "eyes", rather than looking at an abstract, pixelated image first.

This design allows the model to more precisely capture textures, edges, and local features in the image, fundamentally breaking through the image modeling bottleneck of mainstream models.
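As a rough illustration of what such a layer could look like (the channel sizes and patch size below are assumptions, not NEO's released configuration), a two-layer convolutional patch embedding can map raw pixels straight to continuous visual tokens without any discrete tokenizer or separate pre-trained encoder:

```python
import torch
import torch.nn as nn

class NativePatchEmbed(nn.Module):
    """Sketch of a lightweight two-layer convolutional patch embedding.

    Maps raw pixels directly to continuous visual tokens. All dimensions
    here are illustrative placeholders.
    """
    def __init__(self, in_ch: int = 3, hidden: int = 256,
                 embed_dim: int = 2048, patch: int = 14):
        super().__init__()
        # First conv extracts low-level features while downsampling within the patch.
        self.conv1 = nn.Conv2d(in_ch, hidden, kernel_size=patch // 2, stride=patch // 2)
        self.act = nn.GELU()
        # Second conv completes the patchification and projects to the model width.
        self.conv2 = nn.Conv2d(hidden, embed_dim, kernel_size=2, stride=2)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (B, 3, H, W) with H and W divisible by the patch size
        x = self.act(self.conv1(pixels))
        x = self.conv2(x)                    # (B, embed_dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (B, N_patches, embed_dim)
```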

Secondly, Native-RoPE (Native Rotary Position Embedding).

Position information is crucial for understanding any sequence. Text is one-dimensional, images are two-dimensional, and videos are three-dimensional (spatiotemporal). Traditional models either use the same one-dimensional position encoding for all modalities or simply splice them, which obviously cannot match the natural structures of the different modalities.

NEO's Native-RoPE innovatively assigns different frequencies to the three dimensions of time (T), height (H), and width (W): the visual dimensions (H, W) use high frequencies to precisely depict local details and spatial structures; the text dimension (T) takes both high and low frequencies into account to handle both locality and long-distance dependencies.

More ingeniously, for pure text input, the indices of H and W are set to zero, which does not affect the performance of the original language model at all.

This is equivalent to equipping AI with an intelligent and adaptive spatiotemporal coordinate system, which can not only precisely locate each pixel in the image but also pave the way for seamless expansion to complex scenarios such as video understanding and 3D interaction.
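The article does not give the exact parameterization, but the index bookkeeping implied by such a decomposed (T, H, W) rotary scheme might look like the minimal sketch below; how image patches share the T axis, and how the head dimension is split across axes, are assumptions made only for illustration.

```python
import torch

def thw_position_ids(num_text: int, grid_h: int, grid_w: int):
    """Sketch of decomposed (T, H, W) position indices for a mixed sequence.

    Text tokens advance along T with H = W = 0, so a text-only input reduces
    to ordinary 1-D RoPE. Image patches here share one T slot (an assumption)
    and use their (row, col) coordinates for H and W.
    """
    # Text part: T counts up, spatial axes stay at zero.
    t_text = torch.arange(num_text)
    h_text = torch.zeros(num_text, dtype=torch.long)
    w_text = torch.zeros(num_text, dtype=torch.long)

    # Image part: one "time" slot after the text, plus per-patch row/column indices.
    rows, cols = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    t_img = torch.full((grid_h * grid_w,), num_text)
    h_img = rows.flatten()
    w_img = cols.flatten()

    t = torch.cat([t_text, t_img])
    h = torch.cat([h_text, h_img])
    w = torch.cat([w_text, w_img])
    return t, h, w  # each of shape (num_text + grid_h * grid_w,)
```

Each axis would then rotate its own slice of the head dimension with its own frequency band (higher for H and W, mixed for T); that part is omitted from the sketch.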

Thirdly, Native Multi-Head Attention.

The attention mechanism is the way large models think. In traditional modular models, the attention of the language model is causal (it can only see the previous words), while the attention of the visual encoder is bidirectional (it can see all pixels).

NEO adopts a method that allows these two modes to coexist within a unified attention framework.

When processing text tokens, it follows the standard autoregressive causal attention; when processing visual tokens, it uses full bidirectional attention, allowing all image patches to freely interact and associate with each other.

This "collaborative work of the left and right brains" mode greatly enhances the model's ability to understand the internal spatial structure of the image, thereby better supporting complex interleaved reasoning between images and text, such as understanding the subtle difference between "The cat is above the box" and "The cat is in the box".

In addition to these three cores, NEO is also equipped with a two-stage fusion training strategy called Pre-Buffer & Post-LLM.

In the early stage of pre-training, the model is temporarily divided into two parts: a Pre-Buffer responsible for the in-depth fusion of vision and language, and a Post-LLM that inherits powerful language capabilities.

Under the guidance of the latter, the former efficiently learns visual knowledge from scratch and establishes a preliminary pixel-word alignment. As training progresses, this division gradually disappears, and the entire model integrates into an end-to-end, inseparable whole.

This strategy cleverly solves the problem of how to learn vision without sacrificing language ability during the training of the native architecture.
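The article does not spell out the split point or the freezing schedule, but the idea can be sketched roughly as follows; the cut index and the two-phase schedule are assumptions made purely for illustration.

```python
import torch.nn as nn

def split_stages(blocks: nn.ModuleList, split: int):
    """Sketch of the Pre-Buffer / Post-LLM split over a stack of transformer blocks.

    `split` is an illustrative cut point: the Pre-Buffer learns vision-language
    fusion from scratch, while the Post-LLM is initialized from a pre-trained
    language model and guides it early in training.
    """
    return blocks[:split], blocks[split:]

def configure_stage(pre_buffer: nn.ModuleList, post_llm: nn.ModuleList, joint: bool):
    # Early pre-training (joint=False): only the Pre-Buffer updates, while the
    # frozen Post-LLM preserves language ability and acts as the guide.
    # Later (joint=True): the division disappears and the whole model trains end to end.
    for p in pre_buffer.parameters():
        p.requires_grad = True
    for p in post_llm.parameters():
        p.requires_grad = joint
```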

One-tenth of the data, catching up with the flagships

Real-world testing data is more convincing than theoretical discussions. Let's take a look at NEO's performance in actual tests.

Looking at the results, the most intuitive manifestation is data efficiency.

NEO was trained with only 390 million image-text pairs, which is only one-tenth of the data required by top-level models of the same kind!

Without relying on a large-scale visual encoder or a vast amount of alignment data, with its simple and powerful native architecture, NEO has caught up with top-level modular flagship models such as Qwen2-VL and InternVL3 in multiple visual understanding tasks.

NEO also performed remarkably well on authoritative evaluation lists.

In multiple key benchmark tests such as MMMU (multidisciplinary comprehensive understanding), MMBench (comprehensive multimodal ability), MMStar (spatial and scientific reasoning), SEED-I (visual perception), and POPE (measuring the degree of model hallucination), NEO achieved high scores, demonstrating comprehensive performance superior to other native VLMs and truly achieving no loss in accuracy.

It is particularly worth noting that NEO currently shows a high cost-performance ratio for inference within the small-to-medium parameter range of 2B to 8B.

For large models with dozens or even hundreds of billions of parameters, these small- and medium-sized models may seem like toys. However, it is these models that are the key to future deployment on edge devices such as mobile phones, robots, and smart cars.

NEO not only achieves a double leap in accuracy and efficiency at these scales but also significantly reduces the cost of inference.

This means that powerful multimodal visual perception capabilities will no longer be exclusive to large cloud-based models but can be truly popularized to every terminal device.

How to evaluate NEO?

Finally, we need to discuss a question: What is NEO useful for?

From what we have discussed above, it is not difficult to see that the real value of NEO lies not only in the breakthrough of performance indicators but also in that it points out a new path for the evolution of multimodal AI.

Its native integrated architecture design breaks through the semantic gap between vision and language at the bottom level, naturally supports images with arbitrary resolutions and long interleaved reasoning between images and text, and reserves clear expansion interfaces for higher-order multimodal interaction scenarios such as video understanding, 3D spatial perception, and even embodied intelligence.

This design philosophy of being born for fusion makes it an ideal foundation for building the next - generation general artificial intelligence system.

More importantly, SenseTime has open-sourced two models based on the NEO architecture, at the 2B and 9B parameter scales, sending a strong signal that it intends to build this ecosystem together with the community.

This move is expected to drive the entire open-source community to migrate from the current mainstream modular splicing paradigm to a more efficient and unified native architecture, accelerating the formation of a de facto standard for next-generation multimodal technology.

Meanwhile, the cost-performance ratio of NEO in the small-to-medium parameter range is breaking the inherent perception that large models monopolize high-performance capabilities.

It significantly lowers the threshold for training and deploying multimodal models, enabling powerful visual understanding capabilities to no longer be limited to the cloud but to truly penetrate into terminal scenarios such as robots, smart cars, AR/VR glasses, and industrial edge devices that are highly sensitive to cost, power consumption, and latency.

From this perspective, NEO is not only a technical model but also a key prototype for next-generation inclusive, terminal-oriented, and embodied AI infrastructure.

More importantly, the emergence of NEO provides a clear and powerful answer to the currently confused AI community.

At a time when Ilya and others have jointly pointed out that the industry urgently needs a new paradigm, NEO, with its completely native design concept, has become the first successful example of the new trend that "architectural innovation is more important than scale stacking".

It not only redefines the way to build multimodal models but also declares to the world that the next stop of AI is to return to the exploration of the essence of intelligence and build a general brain that can truly understand and integrate multi - dimensional information through fundamental architectural innovation.

This is a crucial contribution of the Chinese team to the global AI evolution direction. As predicted, this is the inevitable path to the next - generation AI.
