Is Vibe Coding outdated? Google starts to focus on Vibe Searching.
AI can generate images and videos based on your text.
However, when what we humans want is a certain scene, an atmosphere, or a vague impression, machines are at a loss.
You can't enter "that feeling of extreme loneliness" in the search box and get a perfect still image, nor can you say to the surveillance system, "Help me find the fight footage."
Text is text, images are images, videos are videos, and audio is audio. They are each self-contained and not interconnected.
In the first quarter of 2026, while other large-model manufacturers were still competing in the fields of agents and content generation, Google quietly released the Gemini Embedding 2 model.
It brings text, images, videos, audio, and documents into the same semantic space.
This means that you can use a sentence to find an image, an image to find a video, and an audio clip to find a document.
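To make "one shared space" concrete, here is a minimal, hedged sketch. The vectors and file names are hand-made stand-ins (a real system would obtain embeddings from a multimodal model), but the retrieval step, cosine similarity plus nearest neighbor, is exactly what cross-modal search reduces to.

```python
# Toy cross-modal retrieval in one shared embedding space.
# All vectors below are mocked stand-ins for real model outputs.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings (real models use thousands of dims);
# note the corpus mixes an image, a video, and an audio clip.
corpus = {
    "lonely_street.jpg": np.array([0.9, 0.1, 0.0, 0.1]),   # image
    "party_clip.mp4":    np.array([0.1, 0.9, 0.2, 0.0]),   # video
    "rain_ambience.wav": np.array([0.7, 0.0, 0.3, 0.2]),   # audio
}

# A text query embedded into the *same* space as the media files.
query = np.array([0.85, 0.05, 0.1, 0.15])  # "that feeling of extreme loneliness"

best = max(corpus, key=lambda name: cosine(query, corpus[name]))
print(best)  # → lonely_street.jpg  (an image found from a sentence)
```

Because every modality lands in the same coordinate system, "text finds image" and "audio finds document" are the same nearest-neighbor operation.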
The barriers between the five modalities have been broken, and for the first time, machines have a "synesthetic" ability similar to that of humans.
It no longer sees the world as fragmented file formats. Instead, like you, it understands a melody, a scene, and a sentence as different expressions of the same thing.
Some netizens commented, "Artificial intelligence no longer sees the world as fragmented. It views the world in the same way you do."
01 Google's Strategic Significance: Setting Standards Instead of Competing at the Application Layer
Google's choice to release this model at this time is thought-provoking.
In the current OpenClaw frenzy, everyone is competing to see who has a smarter brain and more dexterous hands.
However, Google takes a step back to polish a more fundamental ability: perception.
To understand the significance of this move, we first need to recognize a fact: before Gemini Embedding 2, multimodal embedding was nothing new. In fact, it could even be considered a bit "old-fashioned."
Nomic, Jina, and various CLIP derivatives have all taken a shot at it, but they either cover only two or three modalities or lack precision. In short: usable, but not very effective.
More importantly, most embedding models on the market are still essentially "text-first."
Want to search for a video? First, transcribe the video into text and then embed the text. This intermediate step not only slows down the process but also inevitably leads to a loss of semantics.
The composition of a picture, the mood of a piece of music, the tone of a speaker's voice: these subtle signals exist only in the original modality, and they disappear the moment the content is transcribed into text.
The approach of Gemini Embedding 2 is completely different.
It inherently understands sound waves and dynamic images, directly mapping the five modalities into the same 3072-dimensional semantic space without any intermediate translation.
When the legal technology company Everlaw used Embedding 2 in its litigation-discovery workflow, retrieval recall across millions of records rose by 20%. Another company, Sparkonomy, found that compared with its previous multi-pipeline solution, latency dropped by 70% and semantic-similarity scores doubled.
A smart brain is important, but if it can't see, hear, or touch the complex multi - modal information in the real world, it's like a genius locked in a dark room, with no way to show its intelligence. So Google's strategy is: Instead of competing head - on with opponents in upper - layer applications, it's better to build the roads and set the standards directly.
Why does standard-setting matter here? Because the embedding spaces of the various large-model vendors are completely incompatible with one another.
For the same photo, its coordinates in Google's semantic space might be (1, 2), while in OpenAI's system they become (9, 8). Google's own documentation clearly states that when upgrading from the previous-generation gemini-embedding-001 to Embedding 2, all existing data must be re-embedded; vectors produced by the two generations of models cannot be directly compared.
Once an enterprise has indexed its years of accumulated images, audio, and video with Google's model, migrating to another platform means re-ingesting and re-computing all of that data. This index-rebuilding project, which consumes enormous compute and time, quietly binds the enterprise deep into Google's ecosystem.
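The lock-in described above can be illustrated with a toy model. Everything here is mocked: the "two generations" of embedding models are simulated by a random orthogonal change of basis. But it shows why vectors from different models cannot be mixed, and why switching providers forces a full re-embedding pass over the archive.

```python
# Toy illustration: two embedding models place the SAME content at
# different coordinates, so their vectors are mutually incomparable.
import numpy as np

rng = np.random.default_rng(7)
DIM = 8
# A random orthogonal matrix stands in for the (unknown) relationship
# between two models' coordinate frames.
frame_shift = np.linalg.qr(rng.normal(size=(DIM, DIM)))[0]

def embed_v1(item_id: int) -> np.ndarray:
    """Hypothetical generation-1 embedding (deterministic mock)."""
    v = np.random.default_rng(item_id).normal(size=DIM)
    return v / np.linalg.norm(v)

def embed_v2(item_id: int) -> np.ndarray:
    """Hypothetical generation-2 embedding of the *same* content."""
    return frame_shift @ embed_v1(item_id)

photo = 42
# Same photo, two generations: the raw coordinates do not line up,
# so similarity scores across generations are meaningless.
assert not np.allclose(embed_v1(photo), embed_v2(photo))

# The only safe migration path is a full re-embedding pass:
index_v2 = {item_id: embed_v2(item_id) for item_id in range(1_000)}
```

For a corpus of millions of images, recordings, and documents, that final comprehension loop is the "index reconstruction project" the article describes, repeated once per provider switch.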
Google understands this well and is accelerating this binding.
On the day of its release, Embedding 2 was already integrated with almost all mainstream AI development frameworks and vector databases, such as LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Pinecone. The official Colab sample code is open-sourced under the Apache 2.0 license, and text embedding is priced at just $0.20 per million tokens, with a 50% discount for batch calls.
The intention of this set of actions is very clear: to attract developers and enterprises to enter with low barriers. As the data accumulates to a certain scale, the migration cost will snowball.
"Our approach to developing and leveraging the potential of artificial intelligence is rooted in our founding mission: to organize the world's information and make it universally accessible and useful." This line comes from a 2023 page on Google's official website, "Why We Focus on Artificial Intelligence and What Our Purpose Is."
From AlphaFold, which helps scientists explore protein folding, to the Gemini DeepThink mode aimed at frontier problems in mathematics and physics, and now to cross-modal retrieval, Google is indeed fulfilling this promise step by step.
02 A Milestone Technological Breakthrough
Gemini Embedding 2 supports more than 100 languages and has an 8192-token context window (roughly 4000 to 5000 Chinese characters). Each request can process up to 6 images, 120 seconds of video, and 6-page PDFs.
In benchmark tests, its scores on multilingual retrieval, code retrieval, and image-text retrieval surpassed Amazon Nova 2 and Voyage 3.5 across the board.
What truly makes this a milestone is not just the scoring numbers but the uncharted territory it aims at.
According to a 2023 IDC report, unstructured data such as video, audio, and images accounts for 92.9% of all data worldwide. Even by 2028, that share is only expected to fall to 82.3%.
In other words, most of the information humans produce, including meeting recordings, product videos, design drafts, and surveillance footage, has long lain dormant across the vast Internet because of its unstructured nature, sealed like black boxes that cannot be searched on demand.
Previously, the mainstream way to semantically compare and index this black-box data was the "dual-encoder" architecture, exemplified by OpenAI's CLIP.
A visual encoder processes images and a text encoder processes text; the two operate independently and are aligned into the same space only afterward, through contrastive learning.
Google's Cloud team wrote in a technical blog: since the two encoders are separate and only meet at the final stage, they miss the opportunity to form deep cross-modal connections in the middle layers of the network.
It's like two translators translating a book into different languages and then trying to align them at the table of contents level. The literal meanings might match, but the subtle context and emotions in the original text are lost in the process.
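A toy version of that dual-encoder design makes the limitation visible: the two towers below share no weights and exchange no information until their outputs land in the final shared space. The weights are random stand-ins; a real CLIP-style model learns them with a contrastive loss over matched image-text pairs.

```python
# Toy dual-encoder in the style of CLIP: two independent towers whose
# outputs only meet in the final shared space. Weights are random mocks.
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TXT, D_SHARED = 12, 6, 4

W_img = rng.normal(size=(D_SHARED, D_IMG))   # image tower
W_txt = rng.normal(size=(D_SHARED, D_TXT))   # text tower

def encode_image(pixels: np.ndarray) -> np.ndarray:
    z = W_img @ pixels            # image features never see the text tower
    return z / np.linalg.norm(z)

def encode_text(tokens: np.ndarray) -> np.ndarray:
    z = W_txt @ tokens            # text features never see the image tower
    return z / np.linalg.norm(z)

# Contrastive training would pull matching (image, text) pairs together
# here -- but only at this last step, after both towers have finished,
# which is exactly the "meet only at the table of contents" problem.
def similarity(pixels: np.ndarray, tokens: np.ndarray) -> float:
    return float(encode_image(pixels) @ encode_text(tokens))

score = similarity(rng.normal(size=D_IMG), rng.normal(size=D_TXT))
```

Nothing in the middle layers can mix a picture's composition with a caption's nuance, because there are no shared middle layers at all.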
With Gemini Embedding 2, when the model processes a product image with a text description, it doesn't understand the image and text separately and then piece the results together. Instead, like a human, it perceives visual and language information as a whole.
This also creates a new way of retrieval: interleaved input.
Developers can input a piece of text, three images, and an audio clip in a single API call, and the model will return a unified vector that captures all cross-modal relationships.
To put it more intuitively: suppose an e-commerce platform wants an "image-to-product" search feature, but the user's need is compound: they photograph a friend's coat and add the text, "Similar to this style but with a warmer color."
In the traditional solution, the system can either understand the image or the text, always missing one or the other, and the two clues cannot be combined.
Interleaved input allows the model to generate a unified vector that encodes both the "coat style" and the "warm color tone," and then use this vector to search in the product database.
The information of the two modalities truly converges into a complete intention at the vector level.
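A hedged sketch of what an interleaved-input call might look like. The `Part` type and `embed_interleaved` function below are illustrative mocks, not the real Gemini API; the point is the shape of the interface: several modalities in, one fused vector out.

```python
# Mock of interleaved input: one request mixes modalities and returns
# a SINGLE vector. The Part/embed_interleaved names are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class Part:
    kind: str       # "text" | "image" | "audio"
    payload: str    # inline text or a file path

def embed_interleaved(parts: list[Part], dim: int = 8) -> np.ndarray:
    """Mock fusion: hash each part into the space and combine into ONE
    vector, so cross-part relationships live in a single representation."""
    acc = np.zeros(dim)
    for part in parts:
        seed = abs(hash((part.kind, part.payload))) % (2**32)
        acc += np.random.default_rng(seed).normal(size=dim)
    return acc / np.linalg.norm(acc)

query = embed_interleaved([
    Part("image", "friends_coat.jpg"),                              # coat style
    Part("text", "similar to this style but with a warmer color"),  # color intent
])
# `query` now encodes both clues together, ready for nearest-neighbor
# search over a product index.
```

In a real deployment the fusion would of course be learned by the model, not hashed; the interface contract, "many parts in, one vector out," is what enables the coat-plus-color search.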
03 The Era of Vibe Searching Is Here
If programming with natural language marked the beginning of the Vibe Coding era, then being able to find highly matching multi - modal content with a description, an image, or an audio clip marks that we are entering the Vibe Searching era.
After the new embedding model is integrated into Google Workspace, Gemini can accurately analyze financial documents that mix images and tables. In Gmail, if you can't remember the keywords of an email, you can find it with just a vague description. When integrated with YouTube, even if users forget the video title and the blogger's name, they can accurately find the corresponding video by describing the content and style of the video.
The model no longer just matches keywords but can understand aesthetics, style, and atmosphere.
The essence of search has also changed accordingly: in the past, you needed to precisely match keywords; now, you only need to vaguely express your intention.
You no longer need to know what the thing you're looking for is called. You just need to tell it how the thing makes you feel.
The impact of this shift on the content industry deserves particular attention. Today's content recommendation depends heavily on manual tagging, and good work that goes untagged often goes unseen.
The model can't understand the excellence of a work because it can only look at the picture, listen to the music, and read the text in isolation.
Current AI can't have an intuitive understanding of beauty like humans.
However, Gemini Embedding 2 can "intuitively understand" a work from a comprehensive perspective, as if it has human aesthetics.
It can sense the semantic distance between the melody of a song and the listening preferences of a certain type of user and then recommend it to the right people. Good content no longer needs to market itself; it just needs to be good.
The same applies to enterprise knowledge management.
For example, imagine a manufacturing company ten years into operation, with tens of thousands of technical manuals, product drawings, quality-inspection reports, and meeting recordings sitting on its shared drives.
One day, a newly hired engineer runs into an abnormal yield-rate problem. He vaguely remembers a senior colleague mentioning a similar case but has no idea where the record lives.
It might be mentioned in a chart in a PDF or in a discussion in a meeting recording. Previously, he could only ask people one by one and rummage through folders in the hope of finding it.
With cross-modal retrieval, he can simply describe the symptoms, and the system searches charts, recordings, and documents simultaneously, precisely retrieving the solution a former colleague mentioned in a meeting three years ago.
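That search can be sketched in a few lines. The file names and vectors below are fabricated stand-ins; the mechanism, one text query ranked against embeddings of charts, recordings, and documents in the same space, is the whole trick.

```python
# Mock of a mixed-modality archive search: one text query ranked
# against charts, recordings, and documents in a shared space.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# (name, modality, mocked embedding) -- all values are fabricated.
archive = [
    ("yield_dip_2022.pdf#chart3", "chart",     np.array([0.2, 0.8, 0.1])),
    ("weekly_sync_2021.wav",      "recording", np.array([0.9, 0.3, 0.1])),
    ("coating_manual_v4.pdf",     "document",  np.array([0.1, 0.2, 0.9])),
]

# The engineer's description, embedded as a text query.
query = np.array([0.85, 0.35, 0.05])  # "abnormal yield rate after coating step"

ranked = sorted(archive, key=lambda rec: cosine(query, rec[2]), reverse=True)
top_name, top_kind, _ = ranked[0]
# The best hit can be a meeting recording even though the query is text.
```

Nothing modality-specific happens at query time: the recording wins purely because its vector sits closest to the question.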
An enterprise's most valuable experience is no longer tied to any individual's memory. The knowledge base turns from a warehouse of miscellany into a real-time brain that responds and can be queried at any moment.
Looking further ahead, in the field of embodied intelligence, cross-modal embedding may become the infrastructure with which robots understand the physical world. When a warehouse robot hears, "Bring me that red, soft-feeling thing," it can process the language instruction, visual recognition, and tactile memory simultaneously and find the intersection of the three in semantic space.
Establishing synesthesia between vision, hearing, and logic in a unified vector space is exactly what Gemini Embedding 2 is good at, enabling robots to no longer mechanically execute preset instructions but to perceive, judge, and act in the real physical space like humans.
Google has made its move. The time window for its opponents is closing.
This article is from the WeChat official account "Alphabet AI", written by Liu Yijun, and is published by 36Kr with permission.