
Google's AGI foundation has arrived. The first native full-modal embedding model has launched, achieving state-of-the-art (SOTA) results across all modalities.

新智元2026-03-12 10:57
Google releases a native full-modal Embedding model to unify text, image, audio, video, and PDF retrieval.

Google has released its first native full-modal Embedding model, Gemini Embedding 2! It seamlessly integrates text, images, audio, video, and PDFs into a unified vector space, enabling direct retrieval across five major modalities. This significantly reduces architectural costs and endows AI with truly coherent "memory," marking a milestone in reshaping AI infrastructure.

If large generative AI models like ChatGPT are the "mouth" for AI to express itself, then Embedding models are the "memory nerves" responsible for understanding and retrieval.

For a long time, these memory nerves have been fragmented.

Yesterday, the Gemini API launched the preview version of the first multimodal Embedding model, gemini-embedding-2-preview.

As the first native full-modal Embedding model, it integrates text, images, audio, video, and even PDF documents into a unified vector space.

Unveiling the Disruptive Value of "Native Full-Modality"

To truly understand the strategic significance of this technology, we need to recognize the "Data Babel Tower" dilemma faced by past AI retrieval systems.

In the past, the visual, audio, and text processing pipelines each seemed to speak a different language. Whenever information had to move between them, cumbersome translation and alignment steps were required.

The emergence of Gemini Embedding 2 is equivalent to promoting a common language in the data world. Its core breakthroughs are reflected in the following dimensions.

Eliminating Transcription Nodes and Information Loss Black Holes

The value of the word "native" lies in rejecting any form of compromise and translation.

In the early days, for AI to "understand" a podcast, a separate speech-recognition model had to be bolted on to convert it into plain text first. In the process, "redundant" information such as a speaker's slightly ironic tone or a shrill siren in the background vanished instantly.

Today, the model directly "swallows" the waveforms of MP3 audio tracks and the original pixels of high-resolution images. Those sensory details that are only perceptible but inexpressible have finally found precise coordinates in the mathematical space.

Connecting a Unified Coordinate System and Unlocking Cross-Modal Search

When the five major data types are compressed into the same high-dimensional vector space, the boundaries of data are completely eliminated.

Developers can easily achieve extremely complex cross-modal retrieval:

Input a recording of engine noise, and the system will instantly locate the drawings of the faulty parts from a vast number of PDF maintenance manuals;

Upload a photo of a post-modern building, and the system can directly recall video clips with a very similar aesthetic.

Retrieval has completely evolved into a pure "resonance of semantics and intentions."
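The mechanics behind this "resonance" are simple once everything lives in one space: retrieval reduces to ranking by vector similarity, regardless of which modality produced each vector. A minimal sketch, using made-up toy vectors as stand-ins for real embedding-model outputs:

```python
import numpy as np

# Toy sketch: in a shared vector space, cross-modal retrieval is just
# cosine-similarity ranking. The vectors below are hypothetical stand-ins
# for real embeddings; only their geometry matters here.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query = rng.normal(size=8)                          # e.g. an audio clip's embedding
corpus = {                                          # e.g. PDF pages, video, text
    "pdf_page_3": query + 0.1 * rng.normal(size=8), # semantically close to the query
    "video_clip": rng.normal(size=8),
    "text_chunk": rng.normal(size=8),
}
ranked = sorted(corpus, key=lambda k: cosine(query, corpus[k]), reverse=True)
print(ranked)  # the near-duplicate of the query ranks first
```

The same ranking loop works whether the query embedding came from audio, an image, or text, which is the whole point of a unified space.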

Greatly Simplifying the Architecture and Drastically Reducing Engineering Complexity

In the past, piecing together a multimodal retrieval application was a nightmare for engineers.

Engineers had to maintain multiple independent models, pay for isolated vector databases, and then write extremely complex re-ranking algorithms to force disparate scores into alignment. This makeshift architecture not only suffered from high latency but was also prone to breaking down.

Now, this messy infrastructure has been condensed into a simple API call. One set of models is sufficient to penetrate the entire business process.

Agent startup founders who got an early taste of it have also been unsparing in their praise for the new full-modal model.

Completing the Memory Puzzle for Agents

Agents often appear slow, mainly because their "memory" is fragmented.

After an Agent reads a research report with a large number of data charts, it often only remembers the text, while the chart part is discarded.

The native full-modal Embedding endows AI with a coherent underlying cognitive mode, allowing machines to finally integrate the sounds they hear, the pictures they see, and the paragraphs they read into a complete memory, just like humans.

The "Five-in-One" Engine and the Magic of Cost Reduction

The new model not only covers the five major data types but also offers generous input limits:

Text: Supports over 100 languages, with a context of up to 8192 tokens.

Images: Can ingest up to 6 images (supports PNG and JPEG) in a single request.

Video: Can handle clips up to 128 seconds long.

Audio: Can directly understand audio tracks up to 80 seconds long without relying on transcription tools.

Documents: Can natively read PDFs up to 6 pages long, skipping the conventional OCR extraction.
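Given the 80-second audio ceiling, longer recordings have to be sliced client-side before embedding. A minimal sketch of one way to compute overlapping chunk boundaries; the 5-second overlap is an illustrative choice, not part of the API:

```python
# Minimal sketch: slice a long recording into spans that fit an
# 80-second audio limit, with a small overlap so no utterance is
# cut exactly at a boundary. All values are illustrative.
def chunk_spans(total_s, max_s=80.0, overlap_s=5.0):
    """Return (start, end) second offsets covering [0, total_s]."""
    spans, start, step = [], 0.0, max_s - overlap_s
    while start < total_s:
        spans.append((start, min(start + max_s, total_s)))
        start += step
    return spans

spans = chunk_spans(200.0)  # e.g. a 200-second meeting recording
print(spans)                # [(0.0, 80.0), (75.0, 155.0), (150.0, 200.0)]
```

Each span can then be cut from the source file with any audio tool and embedded separately.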

While showing off its capabilities, Google has also done the cost math for enterprises.

Gemini Embedding 2 uses the ingenious "Russian nesting doll" technique, Matryoshka Representation Learning (MRL).

This technique lets developers flexibly truncate vectors to fit their own storage budgets, just like removing layers from a nesting doll.

At the default full size of 3072 dimensions, the model delivers its strongest retrieval baseline.

https://ai.google.dev/gemini-api/docs/embeddings?hl=zh-cn

What is truly remarkable is its resilience under compression: when the dimensionality is halved to 1536, its MTEB multilingual score holds firm at 68.17 points. There is even a counterintuitive detail: this score is slightly higher than at 2048 dimensions.

Even if you push the budget to the extreme and shrink vector size by 75% to 768 dimensions, the score drops by only 0.18 points, to 67.99.

This means that development teams can significantly reduce storage and computing costs without sacrificing the core retrieval quality, leveraging top-notch multimodal capabilities with extremely high cost-effectiveness.
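The 75% figure is easy to verify with back-of-the-envelope math, assuming float32 storage (4 bytes per dimension); the corpus size of 10 million vectors below is purely illustrative:

```python
# Back-of-the-envelope storage math for the MRL dimension options,
# assuming float32 vectors (4 bytes per dimension) and an illustrative
# corpus of 10 million vectors.
BYTES_PER_DIM = 4
N_VECTORS = 10_000_000

for dim in (3072, 1536, 768):
    gb = dim * BYTES_PER_DIM * N_VECTORS / 1e9
    saving = 1 - dim / 3072
    print(f"{dim:>4} dims: {gb:6.1f} GB ({saving:.0%} smaller than full size)")
```

At this scale, truncating to 768 dimensions cuts the index from roughly 123 GB to about 31 GB, before any quantization is even considered.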

Business Position and Pitfall Avoidance Guide

Looking around, the competition in this field has never been so fierce.

OpenAI's text-embedding-3 still holds the pure-text position firmly, while its visual side still depends entirely on older models;

Veteran player Cohere's Embed v4 is missing the two key pieces, audio and video;

Jina v4, the strongest in the open-source camp, covers images, text, and PDFs, but remains powerless against audio and video.

Gemini Embedding 2 precisely fills this market gap, becoming the only commercial-grade all-rounder that currently covers all five modalities, and achieving the full-modal state of the art!

For engineering teams ready to take the plunge, there are several real "pitfalls" that must be avoided in advance:

  • Compatibility gap. The vector spaces of the old and new models are incompatible. Systems migrating from the old gemini-embedding-001 must re-encode and re-index all historical data.
  • Format and duration thresholds. Audio currently supports only MP3 and WAV, with a hard 80-second limit; longer meeting recordings must be sliced manually.
  • Manual normalization. At the API level, if a non-default lower-dimensional output (such as 768 dimensions) is selected, developers must perform L2 normalization on the returned vectors themselves.
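That last step is a one-liner with numpy. The sketch below uses a toy 2-D vector as a stand-in for a real 768-dimensional embedding returned by the API:

```python
import numpy as np

# Minimal sketch: L2-normalize an embedding returned at a non-default
# dimensionality, so that cosine similarity reduces to a plain dot product.
# `raw` is a toy stand-in for a hypothetical API response vector.
def l2_normalize(v, eps=1e-12):
    return v / max(np.linalg.norm(v), eps)

raw = np.array([3.0, 4.0])   # toy 2-D stand-in for a 768-dim vector
unit = l2_normalize(raw)
print(unit, np.linalg.norm(unit))
```

Normalizing at write time means the vector database only ever sees unit vectors, keeping similarity scores comparable across the whole index.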

When isolated data islands are completely connected, the complex real world can finally cast a clear reflection in the deep sea of code.

The most profound intelligent revolution often hides in those unassuming infrastructures, quietly reshaping all things into the same language.

References:

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/

This article is from the WeChat official account "New Intelligence Yuan", author: Allen. Republished by 36Kr with permission.