Google CTO and Chief AI Architect Reveals: How Google Achieved an AI Comeback in Two and a Half Years
- The fundamental criterion for AI progress is not the benchmark score but the ability to truly integrate into and empower real-world knowledge and creative work.
- The core improvements of Gemini 3 focus on precise intent understanding, global service capabilities, and tool-use and tool-creation capabilities with exponential effects.
- "Vibe coding" (natural language programming) is breaking down the barriers between creativity and implementation, making innovation an ability accessible to everyone.
- Realizing AGI is not closed-door laboratory research but an engineering practice that must be built jointly through continuous interaction with the real world.
- Text and visual models increasingly share an underlying architecture, and this technological convergence creates a more intuitive interaction experience for humans.
- The core difficulty in achieving a unified model architecture lies in reconciling two different standards: the structured signals of text, and the pixel-level precision and conceptual coherence required for image generation.
"We are still far from the top level." Two and a half years ago, when Google DeepMind launched the Gemini project, Chief Technology Officer and Chief AI Architect Koray Kavukcuoglu said bluntly at an internal meeting.
At that time, Google was lagging significantly in the large-model race. AI Studio had only 30,000 users and zero revenue, and the team was under great pressure from fierce competition.
From openly admitting it was behind to Gemini 3 taking the market by storm, Google has completed a remarkable comeback. Behind this turnaround are three key transformations:
First, shift from a laboratory mindset to a battlefield mindset and establish an update rhythm of "major iterations every six months".
Second, abandon the "big and comprehensive" approach and focus on three key strengths, enabling the model to truly understand human intentions, serve global users, and have the ability to use and create tools.
Third, activate Google's ultimate weapon: mobilize 2,500 experts across six continents and achieve end-to-end collaboration from the chip layer to products with hundreds of millions of users, such as Search and Android.
In this AI arms race that concerns the future, how did a tech giant admit its lag and then catch up in just two and a half years? In a conversation with Google DeepMind senior product manager Logan Kilpatrick, Kavukcuoglu revealed for the first time the real story behind the comeback.
The following is the essence of the exclusive interview with Kavukcuoglu:
Question: After the release of Gemini 3, the market feedback has been positive. How do you evaluate the breakthroughs of this generation of models?
Kavukcuoglu: After completing benchmark tests and pre-release verifications, the actual performance of Gemini 3 has indeed met our expectations. This model not only has powerful technical capabilities but, more importantly, has been recognized by users in real-world application scenarios. Although there is still room for improvement, the current feedback is encouraging, and the innovations that users care about are highly consistent with the technical directions we have set.
Question: From Gemini 2.5 to Gemini 3.0, the pace of technological progress seems to be accelerating. How do you view this development trend?
Kavukcuoglu: The AI field is indeed maintaining an amazing pace of innovation. At both the basic research and engineering practice levels, we have seen continuous breakthroughs. This progress stems from a virtuous cycle: when technology generates value in real-world scenarios, we receive more feedback, which in turn gives rise to new ideas. As the complexity and diversity of the problems to be solved continue to increase, these challenges actually drive us forward.
Question: What role does benchmark testing play in technological development?
Kavukcuoglu: Benchmark testing and model development are complementary. For example, on the HLE benchmark (Humanity's Last Exam, which measures an AI's ability to solve complex, expert-level problems), early models could only score 1% or 2%, while advanced models like Deep Think can now exceed 40%. And on challenging benchmarks like GPQA Diamond, although we are still improving performance only about one percentage point at a time, they do point to core problems that have not been fully resolved.
Question: Does the progress in benchmark tests such as GPQA mean that we need to redefine the technological frontier?
Kavukcuoglu: Benchmark testing is indeed important, but it is not entirely equivalent to real progress. In my opinion, the fundamental criterion for measuring technological progress lies in practical applications. Only when scientists use the model to advance research, students use it to complete their studies, and engineers use it to solve practical problems, when these tools are truly integrated into every aspect of human knowledge work, can we say that real progress has been achieved. The role of benchmark testing is to provide a quantifiable reference dimension for this progress.
Three Technological Pillars and the Product Flywheel: The Way for Gemini 3 to Break Through
Question: During the model iteration process, how does the team determine the key directions for technological breakthroughs? For Gemini, especially the Pro model, which aspects are you trying to improve?
Kavukcuoglu: We mainly focus on three core dimensions:
First is precise intent understanding. The model must accurately capture the deep-seated intentions behind user instructions, rather than simply performing pattern matching. This requires moving beyond traditional response logic and establishing true task-understanding and execution capabilities.
Second, global service capabilities. As a technology platform serving global users, Google needs to ensure that its technology is truly inclusive and can reach everyone around the world. The breakthrough performance of Gemini 3 Pro in multiple non-dominant language scenarios marks an important step towards technological inclusiveness.
Finally, tool-use and creative capabilities. At the technical implementation level, we are focusing on core capabilities such as function calling, tool use, agent actions, and code generation. Among them, tool calling has a unique exponential effect: it enables the model to flexibly use existing tool libraries to complete complex reasoning, and it also gives the model the native ability to create new tools. This self-evolving characteristic transforms the model from an execution tool into a creator of tools.
Code capabilities matter not only at the technical level; code is also the cornerstone of the digital world. In today's deeply digital era, code has become the core medium connecting creativity and reality, making every idea realizable through computation.
We are witnessing a fundamental change in the programming paradigm. Through natural language programming (or "vibe coding"), creators only need to describe their ideas in everyday language to generate usable programs in real time. This new paradigm of "description equals implementation" has lowered the technical threshold to an unprecedented level. When the barrier between creativity and implementation is broken, innovation is no longer the privilege of professional developers but an ability accessible to everyone with an idea.
Question: What value does Google's newly launched agentic coding platform Antigravity have for model optimization?
Kavukcuoglu: Such product platforms constitute important infrastructure for our technological evolution. From the perspective of model research and development, establishing direct product-level connections with developers has dual value:
First, the real-user feedback obtained through products such as AI Studio and Antigravity gives us the most direct direction for technological optimization. These demand signals from the front line of development reveal the dimensions the model needs to improve more accurately than any simulated test.
Second, this closed loop between product and research is reshaping our R&D paradigm. Just as the AI Overviews feature in Search is continuously optimized through massive user interactions, the in-depth feedback provided by Antigravity during the release phase has also become a key driving force for model iteration.
It should be emphasized that although benchmark testing sets the coordinates for our technological breakthroughs, the real measure of technological value is always the application effect in the real world. Only when the model continuously creates value in specific scenarios does technological evolution have real vitality.
From Research to Engineering: How the Chief AI Architect Remodels the Technology Implementation Paradigm
Question: As the Chief AI Architect, how do you view the collaborative relationship between model research and development and product implementation?
Kavukcuoglu: The value of technology ultimately needs to be realized through the product experience. My core mission is to ensure that all Google product lines are supported by the most cutting-edge AI capabilities, while also turning product feedback into an important driving force for technological evolution.
A two-way cycle of technology empowerment and demand insight. We are committed to building a complete technology-empowerment system: on the one hand, transforming the capabilities of cutting-edge models into product value; on the other hand, obtaining improvement directions from real-user scenarios. This two-way cycle is reshaping our R&D paradigm. Products are not only application scenarios for technology but also important sources of technological breakthroughs.
Redefining the user experience in the AI era. We are at a critical juncture in the transformation of the human-machine interaction paradigm. New AI technologies are redefining users' expectations of products, including interaction methods, service depth, and the forms in which information is presented. This requires us to collaborate closely with the various product teams to jointly explore the boundaries of the next-generation intelligent experience.
Building a practical path to AGI. We firmly believe that AGI must be realized through continuous interaction with the real world. Product platforms provide precisely this valuable connection, allowing us to collect feedback signals from hundreds of millions of users and continuously calibrate the direction of technological development. This is the fundamental reason why we regard product integration as a core link in the evolution towards AGI.
Question: You mentioned the concept of jointly building AGI with customers and products, which seems to go beyond the traditional research model?
Kavukcuoglu: This is exactly the core concept of our methodology. Building AGI is not closed-door laboratory research but an engineering practice of continuous interaction with the real world.
To this end, we are establishing a complete system based on engineering thinking:
A systematic security architecture: From the model pre-training stage onward, security considerations are deeply integrated into the entire development process. We not only have a professional security team but also ensure that every researcher and engineer is security-conscious. In each iteration review, security indicators are as important as performance indicators.
Globally collaborative engineering practice: The release of Gemini 3 reflects Google's unique ability to collaborate. Just as modern aerospace engineering requires global collaboration, we have brought together technical teams from six continents to achieve seamless integration from underlying research to product integration. This large-scale technical coordination ensures that the model can provide a consistent user experience across all product lines upon release.
Product-driven technological evolution: When products such as AI Overviews and the Gemini app participate in model optimization from the early development stage, we effectively establish a continuous-improvement flywheel. Product teams are not only users of technology but also strategic partners in jointly defining the technological direction. This in-depth integration enables us to quickly turn laboratory innovations into user value.
Post-Gemini 3 Era: The Next Battlefield for Agents, Creation, and Specialization
Question: After the remarkable achievements of Gemini 3, how will the team plan the development path for the next-generation model?
Kavukcuoglu: We always maintain a balance between "celebrating achievements" and "pursuing excellence". Currently, we should indeed be proud of the progress made by Gemini 3, but at the same time, we are also clearly aware that technological breakthroughs are endless.
From a technical perspective, we have identified several key areas for improvement:
Content creation quality: Although the current model already has excellent text-generation capabilities, it still needs to be strengthened in style consistency, emotional accuracy, and logical rigor.
Agent and programming capabilities: This is the most promising area for breakthroughs. We need to enable the model to reach new heights in complex task planning, autonomous decision-making, and code optimization.
Specialized scenario coverage: Although the existing model already serves the vast majority of developers, when dealing with complex requirements in specific fields, we still need to improve the model's accuracy and reliability.
Question: Looking back at the development of Gemini, why has it been able to maintain a leading position in the multimodal field, while progress in agentic tool use has been more gradual?
Kavukcuoglu: This difference stems from a fundamental change in the logic of technological evolution. The Gemini project represents a major transformation from a pure research paradigm to an engineering mindset. In the early days, the team was mainly composed of researchers, and we were good at solving well - defined problems in a closed environment.
Multimodal technology fits this model well, as its technical challenges are relatively focused and its evaluation criteria are relatively clear. In contrast, agentic tool use is essentially an open-environment problem that can only be perfected through continuous interaction with the real world.
Now we have established a completely different development rhythm: major version iterations are released every six months, with monthly updates in between. This engineering cycle enables us to quickly incorporate user feedback into technological improvements, forming a closed loop of continuous optimization.
Multimodal Convergence: The Future of Generative Media from the Perspective of Nano Banana
Question: In the process of building AGI, what role do generative media models play?
Kavukcuoglu: The development trajectory of generative media models reveals the internal logic of AI evolution. Looking back at the history of academic research, image generation was an important starting point for early work. Through visual output, we can intuitively test the model's understanding of the physical world. Starting from pioneering work such as PixelCNN, we have gradually established a systematic understanding of generative models.
However, the development of the technology shows an interesting dialectic: while text models became the main vehicle of rapid progress thanks to their structured nature, media models went through a necessary period of consolidation. But now we see that multimodal convergence has become an inevitable trend in technological development.
This convergence is not artificially driven but a natural result of architectural evolution. As the model's capabilities improve, the text and visual domains, which were originally separate, are sharing more and more of the underlying architecture. The semantic understanding brought by text models and the physical intuition contained in image models are forming a powerful complementary effect.
The recent Nano Banana model is an early example of this convergence. It demonstrates the model's ability to process visual and language signals simultaneously, allowing users to feel that the system truly understands their creative intentions. This technological convergence not only improves performance indicators but, more importantly, creates a more intuitive interaction experience for humans.
Question: Will an informal naming style like Nano Banana become a cultural feature of the team?
Kavukcuoglu: This naming style does reflect the unique cultural temperament of the technical team. Names such as RiftRunner, the development code name for Gemini 3, and vivid ones like Nano Banana often stem from the natural consensus formed by the team during development. This organic naming culture, to some extent, reflects the emotional connection between the technical team and the products they create.
Between formal names and creative code names, we place more value on names that emerge naturally. When a name accurately conveys the technical characteristics and resonates with the team, it has unique value. However, we also recognize that maintaining a consistent naming system is equally important in formal releases and technical communication.
Nano Banana Pro, built on the upgraded architecture of Gemini 3 Pro, represents important progress in our multimodal understanding. While maintaining its creative generation ability, this model has achieved significant improvements in professional dimensions such as text rendering accuracy and understanding of the physical world. Especially in complex scenarios that require in-depth integration of text and visual information, it demonstrates reasoning ability beyond that of previous-generation models.
Question: During the process of technological convergence, which breakthroughs impressed you the most?
Kavukcuoglu: We are witnessing a fundamental change brought about by the evolution of the model architecture. The model-family concept adopted by the Gemini series, which meets diverse needs through different specifications such as Pro and Flash, reflects our precise trade-off between performance and efficiency. This technical approach also applies to image generation.
The new-generation model built on the upgraded architecture of Gemini 3 Pro demonstrates amazing capabilities in understanding complex documents and generating infographics. When users input a large amount of professional material, the model can not only parse the content accurately but also transform it into intuitive visual presentations. This smooth conversion from text to image marks the maturity of multimodal interaction.
Question: Regarding the vision of a unified model architecture, what core technical challenges are currently faced?
Kavukcuoglu: We are steadily advancing the exploration of a unified model architecture, and models of different modalities do show a trend of architectural convergence. However, this is essentially an exploration process following scientific laws,