ElorianAI Raises $55 Million to Pursue Physical-World AGI Through "Visual Reasoning"
In some areas, such as programming and mathematics, the capabilities of large AI models have already surpassed those of the average person. Anthropic reportedly has AI writing nearly 100% of its code internally, and Google's Gemini Deep Think solved 5 of the 6 problems at IMO 2025, reaching gold-medal level.
In visual reasoning, however, even the leading Gemini 3 Pro performs only at the level of a three-year-old child on BabyVision, a benchmark that tests basic visual reasoning ability.
Why are large models so strong at programming and mathematics yet so weak at visual reasoning? The limitation lies in how they "think": vision-language models (VLMs) must first convert visual input into language and then reason over text. Many visual tasks simply cannot be described accurately in words, which caps the models' visual reasoning ability.
Andrew Dai, a 14-year veteran of Google DeepMind, has teamed up with Yinfei Yang, a senior AI expert from Apple, to found Elorian AI. Their goal is to raise models' visual reasoning from "child level" to "adult level" and give models the ability to think natively in "visual space", aiming ultimately at AGI for the physical world.
Elorian AI has raised a $55 million early-stage round jointly led by Striker Venture Partners, Menlo Ventures, and Altimeter, with participation from 49 Palms and top AI scientists including Jeff Dean.
A pioneer of multimodal models wants to give visual models the ability to reason
Of Chinese descent, Andrew Dai earned a bachelor's degree in computer science from the University of Cambridge and a PhD in machine learning from the University of Edinburgh. He interned at Google during his doctoral studies, joined the company in 2012, and stayed for 14 years until founding his own startup.
Image source: Andrew Dai's LinkedIn
Shortly after joining Google, he co-wrote with Quoc V. Le the first paper on language model pre-training and supervised fine-tuning, "Semi-supervised Sequence Learning", which laid groundwork for the later birth of GPT. Another of his foundational papers, "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts", paved the way for today's mainstream MoE architecture.
Image source: Google
During his time at Google, he was deeply involved in almost every large-model training effort, from PaLM through Gemini 1.5 and Gemini 2.5. At Jeff Dean's arrangement, he began leading the data workstream of Gemini (including synthetic data) in 2023, a team that later grew to hundreds of people.
Image source: Yinfei Yang's LinkedIn
Yinfei Yang, his co-founder, spent four years at Google Research focusing on multimodal representation learning, then joined Apple to lead multimodal model R&D.
Image source: arxiv
His representative work, "Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision", advanced the field of multimodal representation learning.
Elorian AI's co-founders also include Seth Neel, a former assistant professor at Harvard University and an expert in data and AI.
Why dwell on the co-founders' groundbreaking papers? Because what they plan to build is not an engineering-level optimization but a paradigm shift at the architectural level: upgrading AI from text-based understanding to vision-based understanding.
The current reality is that although AI models excel at text-based tasks, even the most advanced multimodal models still stumble on the most basic visual grounding tasks.
For example, how do you fit a part precisely into a mechanical assembly so that it runs more smoothly and efficiently? Such spatial and physical tasks are trivial for primary school students but very hard for today's multimodal models.
Biology offers a clue. In the human brain, vision is the underlying substrate supporting much of our thinking; the human capacity for visual and spatial reasoning is evolutionarily far older than language-based logical reasoning.
To teach someone to navigate a maze, for instance, a verbal description will only confuse them, while a sketch makes them understand immediately.
Likewise, a migratory bird, though it has no language, can recognize and reason about geographic features through vision, enabling long-distance global migration. This is a strong signal that vision may be the right evolutionary direction for truly advancing machine reasoning.
Now imagine engraving this biological visual instinct into AI's genes from the very start of model construction, building a natively multimodal model that can simultaneously understand and process text, images, video, and audio. Andrew Dai and his team want to build a natural "synesthete": a machine that not only "sees" the world but "understands" it.
In the view of Andrew Dai and his team, a deep understanding of the real physical world is the key to the next leap in machine intelligence and, ultimately, to "Visual AGI".
VLMs that reason after the fact are not the right path to visual reasoning
It's not that no one has tried. The Gemini team Andrew Dai worked on was already among the world's leaders in multimodality. But traditional multimodal models still rely mainly on the VLM (vision-language model) paradigm, whose logic is a two-step pipeline: first convert visual input into language, then reason over text (sometimes with the help of external tools).
This reason-after-translation approach has inherent limitations, however: it is prone to hallucination, and many visual tasks simply cannot be described accurately in words.
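The two-step limitation can be made concrete with a toy sketch. All functions below are hypothetical illustrations, not Elorian AI's or any real VLM's code: the point is only that once an image is reduced to a caption, any detail the caption omits is unrecoverable for the downstream text reasoner.

```python
# Toy illustration of the two-step VLM pipeline described above.
# Everything the text reasoner "sees" must first survive a lossy
# image-to-text step. All functions here are hypothetical stubs.

def caption(image: dict) -> str:
    """Step 1: convert visual input into language (lossy by design)."""
    # A caption keeps coarse facts but drops fine-grained geometry.
    return f"a {image['color']} peg near a hole"

def reason_over_text(description: str, question: str) -> str:
    """Step 2: text-only reasoning over the caption."""
    # The peg's exact diameter never reached this step, so the
    # reasoner can only guess -- the limitation described above.
    if "fit" in question and "diameter" not in description:
        return "unknown: the caption lost the spatial detail needed"
    return "answer derived from text"

image = {"color": "red", "peg_diameter_mm": 9.8, "hole_diameter_mm": 10.0}
answer = reason_over_text(caption(image), "will the peg fit the hole?")
print(answer)  # the geometric fact (9.8 mm < 10.0 mm) was lost in step 1
```

A model reasoning natively over the visual representation would still have access to the 9.8 mm vs. 10.0 mm comparison at reasoning time, which is the gap Elorian AI claims to target.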
In addition, visual generation models such as NanoBanana are excellent at multimodal generation, but generation ability is not reasoning ability: their "thinking" before generation still essentially depends on a language model rather than native visual reasoning.
Building a model that truly understands the complexity of space, structure, and relationships in the visual world therefore requires disruptive innovation in the underlying technology.
So how do they innovate? Elorian AI's founders, veterans of the multimodal field, plan to deeply integrate multimodal training with a brand-new architecture designed specifically for multimodal reasoning. They abandon the traditional practice of treating images as static inputs and instead train the model to directly interact with and manipulate visual representations, independently analyzing the structure, relationships, and physical constraints within them.
The other core element, of course, is data, which determines these models' performance and success.
Andrew Dai said the team pays close attention to data quality, data mixing ratios, data sources, and data diversity. They have also innovated at the data level, rebuilding the reasoning chain in visual space and using synthetic data at large scale and in depth.
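One of the levers Dai mentions, the data mixing ratio, can be sketched generically: training examples are drawn from several corpora in fixed proportions. The source names and weights below are illustrative placeholders, not Elorian AI's actual recipe.

```python
import random

# Generic data-mixing sketch: sample training examples from several
# sources with fixed mixture weights. Source names and ratios are
# made-up placeholders, not Elorian AI's actual training recipe.
MIXTURE = {
    "web_images": 0.40,         # broad but noisy coverage
    "synthetic_spatial": 0.35,  # large-scale synthetic reasoning data
    "curated_diagrams": 0.25,   # high-quality engineering drawings
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# Empirical proportions approach the target mixture as draws grow.
print({k: round(v / 10_000, 2) for k, v in counts.items()})
```

Tuning these weights (and the quality of each source) is one of the main empirical knobs in large-scale pre-training, which is why the article keeps returning to data as the decisive factor.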
Combined, these efforts are meant to produce a new AI system that moves beyond mere visual "perception" toward higher-order visual "reasoning".
That system would take the form of a foundation model for visual reasoning: a highly general model that excels at one specific set of capabilities, namely visual reasoning. As a general foundation model, its applications should be broad.
First, in robotics, it could serve as the underlying nerve center of a powerful system, giving robots the ability to operate autonomously in unfamiliar environments.
Consider sending a robot to handle a sudden safety failure in a hazardous environment, a task demanding fast, accurate decisions on the spot. Without a foundation model capable of deep reasoning, no one would dare let the robot press buttons or pull levers at will. With strong reasoning ability, it might instead think: "Before operating this panel, perhaps I should first pull this lever to engage the safety mechanism."
In disaster management, a model with visual reasoning could monitor and help prevent forest fires by analyzing satellite imagery; in engineering, it could accurately interpret complex drawings and system schematics. The significance of this ability is that the physical world operates by fundamentally different laws than the pure code world: you can't design an airplane wing just by typing a few lines of code.
For now, however, Elorian AI's models and capabilities exist only on paper. The company plans to release a model reaching SOTA level in visual reasoning in 2026; only then can we judge whether the results match the claims.
When AI truly has "visual reasoning" ability, how will it change the physical world?
Enabling AI to understand and act on the real physical world has taken several technological iterations: from image recognition in the traditional CV era, to image generation and multimodal models in the generative AI era, and on to world models, each step deepening machines' understanding of the physical world.
A visual reasoning foundation model could well take this a step further: if AI can reason visually, it can understand the physical world more deeply and reach a higher level of machine intelligence.
Imagine models with deep understanding and fine-grained operational capability supercharging the embodied intelligence and AI hardware industries: robots could work in industrial production or medical care with far higher reliability, and AI hardware, especially wearables, could become smarter personal assistants.
Underneath all of these technologies, however, is still data. As Andrew Dai noted, data quality, mixing ratios, sources, and diversity together determine model performance.
In physical AI, Chinese companies are closer to the world-leading level, in both models and data, than they are in text-based large models. If they can leverage their rich application scenarios and data to iterate faster, then whether in embodied intelligence or AI hardware, and whether applied in industry, medicine, or the home, they stand a better chance of reaching the front rank and of producing world-class companies.
This article is from the WeChat official account "Alpha Commune" (ID: alphastartups), whose tagline is "discovering extraordinary entrepreneurs". Published by 36Kr with authorization.