
Why has modern AI become possible? A conversation between Hinton and Jeff Dean

AI Deep Researcher · 2025-12-19 08:41
Hinton and Jeff Dean Discuss AI Breakthroughs: Collaborative Evolution of Algorithms, Hardware, and Engineering

In early December 2025, NeurIPS was held in San Diego.

The fireside chat between Geoffrey Hinton (a founding father of neural networks and winner of the 2024 Nobel Prize in Physics) and Jeff Dean (Google's Chief Scientist, co-lead of the Gemini models, and an architect of the TPU) became one of the defining moments of the conference.

The conversation focused on a key question:

Why has modern AI been able to move from the laboratory to billions of users?

From training AlexNet on two GPUs in a student's bedroom to Google working out TPU requirements on a napkin; from niche academic experiments to the infrastructure behind applications used by hundreds of millions of people worldwide.

This is a systematic review of the industrialization process of AI.

Their answer: the breakthrough of modern AI was never a single-point miracle, but a systemic emergence once algorithms, hardware, and engineering matured at the same time. Powerful algorithms must be paired with strong infrastructure before they can be deployed at scale.

Following the timeline, we outline three key stages:

  • Starting breakthrough: How hardware turns AI from an idea into reality
  • System maturity: How algorithms, organizations, and tools work together
  • Future thresholds: Three barriers to break through after deployment at scale

Trace this path, and you can see why AI is the way it is today.

Section 1 | The breakthrough of AI starts with a GPU board

Geoffrey Hinton said that the real turning point of modern AI came not from any particular paper but from the bedroom of his student Alex Krizhevsky: two NVIDIA GPU boards plugged into a computer at his parents' home, training an image recognition model, with the electricity bill paid by his family.

That was in 2012, during the ImageNet competition.

While others relied on hand-engineered features, Hinton's student team used a deep neural network. With ten times more parameters and several times more compute than the other entries, its accuracy far exceeded the competition, and AlexNet established deep learning's standing.

That victory proved one thing: without sufficient compute, any architecture is just an idea.

Jeff Dean's memory goes back even further: in 1990, while working on his undergraduate thesis, he began thinking about how to train neural networks with parallel algorithms. He explored two directions that would later be called data parallelism and model parallelism, though no one used those terms at the time. His machine was a hypercube computer with 32 processors.

The problem was that he divided the compute into 32 parts, but the network used only 10 neurons.

"I made a big mistake."

That failed experiment meant that, more than twenty years later, when designing the TPU he thought from the very beginning about how to truly match compute to model scale.
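
To make those two terms concrete, here is a minimal NumPy sketch (an illustration with toy sizes, not Dean's 1990 code) of data parallelism versus model parallelism on a one-layer network, and of why 32 processors and roughly ten neurons are a poor match:

```python
# Toy illustration of data vs. model parallelism with NumPy (not Dean's 1990 code).
import numpy as np

n_workers = 32                      # the hypercube machine had 32 processors
batch, d_in, d_out = 64, 8, 10      # a tiny one-layer network with only 10 output neurons

X = np.random.randn(batch, d_in)    # a batch of input examples
W = np.random.randn(d_in, d_out)    # the layer's weight matrix

# Data parallelism: every worker holds a full copy of W and processes a slice of the batch.
data_shards = np.array_split(X, n_workers)             # 32 slices of ~2 examples each
data_out = [shard @ W for shard in data_shards]        # all 32 workers have useful work

# Model parallelism: the batch is shared, but the output neurons are split across workers.
neuron_shards = np.array_split(np.arange(d_out), n_workers)
busy = [idx for idx in neuron_shards if idx.size > 0]  # only 10 of 32 shards are non-empty
model_out = [X @ W[:, idx] for idx in busy]

print(len(data_out), "busy workers under data parallelism")    # 32
print(len(busy), "busy workers under model parallelism")       # 10 busy, 22 idle
```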

More than twenty years later, a similar computing power problem emerged again, but this time it was on the inference side.

In 2013, Jeff Dean did a calculation on a napkin: if 100 million people around the world used a voice assistant for 3 minutes a day, launching the then-current model would require doubling Google's entire server fleet for this one application alone.

This is a real physical cost.
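
A back-of-the-envelope version of that napkin math is easy to reproduce; the per-server throughput below is an assumed, illustrative figure rather than a number from the talk:

```python
# Back-of-the-envelope reconstruction of the napkin math; the per-server
# throughput is an illustrative assumption, not a figure from the talk.
users = 100_000_000                     # people using the voice assistant daily
minutes_per_user_per_day = 3            # minutes of speech per user per day
speech_minutes_per_day = users * minutes_per_user_per_day      # 300 million minutes

# Assumption: one server of that era runs the neural speech model at roughly
# real time, so it can serve about one minute of audio per wall-clock minute.
minutes_per_server_per_day = 24 * 60    # 1,440 minutes per server per day

servers_needed = speech_minutes_per_day / minutes_per_server_per_day
print(f"{servers_needed:,.0f} additional servers")             # ~208,000 servers
# If the existing fleet is on the order of a few hundred thousand machines,
# one always-on speech feature roughly doubles it: hence custom inference chips.
```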

He didn't wait for a budget meeting. He stopped Google's then-CFO Patrick Pichette and said: we have to build our own hardware, and we need to do it now.

The TPU project was launched. The first-generation TPU, deployed in 2015, focused on inference rather than training and was 30-80 times more energy-efficient at inference than contemporary CPUs and GPUs. Not until TPU v2 in 2017 did Google begin training models at scale on its self-developed hardware.

This is a vertically integrated approach. Ten years on, the TPU has reached its seventh generation, and the Pathways system lets a single Python process schedule tens of thousands of TPU chips spread across data centers in different cities, as if operating one ultra-large computer.

Meanwhile, the NVIDIA GPU approach has also been continuously evolving.

From AlexNet's two GPU boards to the H100 in 2023, the H200 in 2024, and the B200 beginning deliveries in 2025, NVIDIA GPUs still power large-scale training at companies such as OpenAI and Meta. Notably, AI infrastructure has diversified: Anthropic splits its training workloads between AWS's Trainium chips and Google's TPUs, and each company is searching for the approach that suits it best.

The two approaches each have their own advantages:

The NVIDIA GPU ecosystem is open and highly adaptable, giving entrepreneurs and researchers broad access to AI compute;

Custom chips such as the TPU and Trainium are deeply optimized for specific workloads and offer distinct advantages in energy efficiency and cost.

From two GPU boards in a bedroom to a global AI compute network: the first step of AI's breakthrough was not understanding language or creating content, but having enough compute to finish training.

Section 2 | From AlexNet to Gemini: how three curves intersected

The large-scale application of modern AI did not come from a single stroke of genius, but from the dense intersection of three technology curves between 2017 and 2023:

1. The algorithm architecture found a scalable form

From AlexNet to the Transformer, the core change was not that models got smarter, but that they became more scalable.

Convolutional neural networks are good at processing images, but their parameter count grows with the number of layers, making them hard to scale up;

Recurrent neural networks can handle sequences, but they must process one token at a time, which is slow.

The Transformer's breakthrough is that it turns sequential processing into parallel processing: all tokens are computed at once, which is both fast and able to make full use of the parallel compute of GPUs and TPUs.

In Jeff Dean's view, to reach the same accuracy a Transformer can use 10-100 times less compute than an LSTM. That is not a minor optimization; it turns large-scale training from a theoretical possibility into something that is feasible as engineering.
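
A minimal NumPy sketch (illustrative only, not production Transformer code) shows the structural difference: the recurrent model must walk through the sequence one step at a time, while self-attention handles every token in a single round of matrix multiplications:

```python
# Toy NumPy sketch contrasting a recurrent update with self-attention.
import numpy as np

seq_len, d = 6, 4
x = np.random.randn(seq_len, d)                    # one sequence of 6 token vectors

# Recurrent style: each step needs the previous hidden state, so it is inherently serial.
Wh, Wx = np.random.randn(d, d), np.random.randn(d, d)
h = np.zeros(d)
for t in range(seq_len):                           # 6 sequential steps; hardware waits each time
    h = np.tanh(h @ Wh + x[t] @ Wx)

# Attention style: queries, keys, and values for all tokens are computed together,
# and the token-to-token interactions collapse into dense matrix multiplications.
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)                      # (6, 6) pairwise interactions in one shot
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the whole sequence at once
out = weights @ V                                  # every position updated in parallel
```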

Geoffrey Hinton was initially unconvinced. A design that keeps every state around did not strike him as much like the human brain.

But he later realized that it doesn't matter whether it resembles the brain. What matters is that it makes the scaling laws actually hold.

2. The organizational approach changed from decentralized to centralized

Before ChatGPT's release in 2022, Google already had an internal chatbot used by 80,000 employees. The technology worked, so why wasn't it brought to market?

Jeff Dean said they were constrained by the mindset of the search business: too worried about accuracy and hallucinations, and forgetting that the model could do many things besides search.

The more fundamental problem was that, at the time, three teams inside Google (Brain, Research, and DeepMind) were training models independently. None had enough compute on its own, and they worked in isolation. A week after ChatGPT launched, Dean wrote a one-page memo: we could have built this earlier, but we never pooled our resources.

The Gemini team was thus born. For the first time, compute, models, and talent were truly concentrated on a single goal.

Technological breakthroughs are often not just technological problems, but also organizational problems.

3. The engineering tool stack formed a closed loop

AI is not just about models. It also needs a complete set of infrastructure so that models can run, be debugged, and be reused:

  • JAX: lets researchers write code directly in the language of the mathematics
  • Pathways: lets a single Python process schedule 20,000 TPU chips
  • Distillation: compresses a model with hundreds of billions of parameters to run on a mobile phone

The value of these tools is not just efficiency; they also lower the barrier to entry for AI. With JAX, researchers don't need to be systems engineers; with Pathways, they don't have to manage tens of thousands of devices by hand; with distillation, not every application has to depend on cloud compute.
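
To give a flavor of what "writing code in mathematical language" means, here is a minimal sketch using the public JAX API (an illustration, not Google-internal code): the loss is written as a plain function of arrays, and jax.grad plus jax.jit turn it into compiled, differentiable code with no hand-written backpropagation:

```python
# Minimal example of the public JAX API (not Google-internal code).
import jax
import jax.numpy as jnp

def loss(w, x, y):
    pred = jnp.tanh(x @ w)            # the model, written as the equation itself
    return jnp.mean((pred - y) ** 2)  # mean squared error

grad_fn = jax.jit(jax.grad(loss))     # differentiation + compilation, no hand-written backprop

w = jnp.zeros((3, 1))
x = jnp.ones((8, 3))
y = jnp.ones((8, 1))
print(grad_fn(w, x, y).shape)         # (3, 1): a gradient the same shape as the weights
```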

Why these three? Because they form a closed loop:

The Transformer makes models scalable, but scaling them demands more compute;

More compute demands concentrated organizational resources and, in turn, gives rise to better tools;

Better tools improve training efficiency, which in turn supports training even larger models.

Remove any one of these, and AI would not have moved from the laboratory into the hands of a billion users.

Section 3 | Energy efficiency, memory, and creation: three barriers after AI reaches scale

Models now run and are used in the real world. So what has to be broken through next?

In this conversation, Jeff Dean and Hinton both pointed to three unresolved directions. These are not problems a bigger model alone can solve, but three less visible barriers:

01 | Energy efficiency: the physical limit of deployment at scale

As AI models become larger, the direct consequence is that they become more expensive and consume more electricity.

The training of Gemini involved tens of thousands of TPU chips. Each model upgrade means consuming more electricity, more time, and more budget.

Dean pointed out that although the TPU effort Google started in 2013 improved inference energy efficiency by 30-80 times, the problem is more severe today: to truly make AI ubiquitous, we cannot simply keep adding compute; the way models are trained and deployed has to change.

Google now runs inference for its most heavily used models in ultra-low-precision formats such as FP4. The logic is simple: as long as the result is correct, the arithmetic along the way can be fuzzy.
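
A minimal sketch of the low-precision idea (illustrative only; real FP4 inference relies on hardware number formats and calibrated scales, with a toy int4 scheme standing in here) shows why "fuzzy" arithmetic can still give usable results:

```python
# Toy 4-bit quantization; real FP4 inference uses hardware float formats and
# calibrated scales, but the principle of trading precision for efficiency is the same.
import numpy as np

w = np.random.randn(8).astype(np.float32)          # a handful of full-precision weights

scale = np.abs(w).max() / 7                        # map the range onto signed 4-bit ints (-7..7)
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)   # what gets stored and moved
w_hat = q * scale                                  # dequantized values used in the matmul

print(np.abs(w - w_hat).max())                     # per-weight error stays small, while memory
# traffic and energy per weight drop roughly 8x versus float32:
# "as long as the result is correct, the process can be fuzzy".
```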

But this is not enough. Dean believes that the next-generation inference hardware needs to improve energy efficiency by another order of magnitude.

02 | Memory: The depth limit of context

The context window of current models, even the most powerful ones, is only a few million tokens at most.

Dean believes that current models' understanding is still limited by how much information they can see at once. Like a person who can flip through only five pages of a book at a time, the model reads a section and then forgets it.

Hinton also emphasized that these models still cannot form long-term memories the way humans do.

For AI to truly assist scientific research and complex decision-making, it must be able to take in longer and deeper context at once: an entire textbook, a full year of financial reports, or a hundred interrelated papers.

Dean's goal is for models to cover billions or even trillions of tokens. The challenge behind this is not computing faster, but making the model remember more deeply and understand further.

Getting there takes more than algorithm-level optimization; how the chips themselves compute attention will also need to be redesigned.
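
A back-of-the-envelope calculation (assumed round numbers, not figures from the talk) shows why the context window cannot simply be stretched: naive attention compares every token with every other token, so the work grows with the square of the context length:

```python
# Why naive attention does not stretch to billions of tokens: assumed round numbers.
tokens_today = 2_000_000            # "a few million tokens" of context today
tokens_goal = 1_000_000_000         # the billion-token regime Dean describes

pairs_today = tokens_today ** 2     # entries in the attention score matrix
pairs_goal = tokens_goal ** 2

print(f"{tokens_goal / tokens_today:,.0f}x longer context")        # 500x
print(f"{pairs_goal / pairs_today:,.0f}x more attention pairs")    # 250,000x
# A 500x longer context means 250,000x more pairwise work and memory traffic,
# which is why the fix has to come from new attention schemes and new memory
# architectures on the chip, not just faster arithmetic units.
```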

03 | Creation: From imitation to association

What concerns Hinton most is another dimension: whether AI can make associations.

He said the most remarkable thing about the human brain is not memory or reasoning, but the ability to connect seemingly unrelated things.

"Training these large models is actually compressing a vast amount of knowledge into a relatively limited space. You have to find the commonalities between different things to be able to do this compression."

This means that during training, AI automatically learns many analogies that humans have never noticed.

Hinton said:

"Maybe a certain model discovers the common structure between Greek literature and quantum mechanics. Human experts may never put them together."

Many people say that AI is just imitation and has no creativity.

Hinton disagrees: connecting distant things is itself a form of creation. Dean agrees, and points out that this will be a key application direction for AI's next stage: having AI discover cross-domain connections in scientific research and accelerate breakthroughs.

These three barriers sit at different levels: energy efficiency is a physical cost problem, memory is an architectural capability problem, and creation is a cognitive boundary problem.

But they are not isolated:

  1. Without a breakthrough in energy efficiency, long-context training is unaffordable
  2. Without long-context processing, there is no foundation for deep association
  3. If the ability to associate stays weak, AI will always be just a faster search engine

Breaking through these barriers requires not only engineering optimization but also long-term technological accumulation.

Dean returned repeatedly to one fact in the conversation: most of the technologies Google relies on today, from Internet protocols to chip architectures, ultimately trace back to early academic research. Deep learning did not explode because someone had a sudden new idea one day; it exploded because many lines of research that were ignored thirty years ago began to work together.

The future of AI can't rely solely on burning money to build data centers. It also requires continuous investment in basic research.

Conclusion | Not an overnight success, but many things becoming ready at the same time

From the GPUs in a bedroom to Google's compute network of tens of thousands of TPUs; from the once-rejected distillation paper to today's standard for compressed deployment; from research-oriented laboratories to products that serve a billion users.

The success of modern AI does not rest on a single breakout moment, but on sustained focus on a few key things: algorithms that can be deployed, compute that can carry them, and a research environment that can keep its talent.

No single moment decided everything; many things working together are what truly turned AI from an idea into a usable product.

Hinton said that the essence of large models is to compress a vast amount of knowledge into a limited space during training, and that achieving this compression requires finding the common patterns among seemingly unrelated things.

Dean said that what AI needs to break through next is not better answers, but the scope of what it can understand.

What really matters is not the size of the model, but whether technological breakthroughs can be transformed into products that everyone can use.

📮 References:

https://www.youtube.com/watch?v=ue9MWfvMylE&t=1483s

https://www.youtube.com/watch?v=9u21oWjI7Xk

https://sdtechscene.org/event/jeff-dean-geoff-hinton-in-conversation-with-jordan-jacobs-of-radical-ventures/

https://www.linkedin.com/posts/radicalventures_the-next-episode-of-radical-talks-drops-this-activity-7406799924111220737-Fph0

https://x.com/JeffDean/status/1997125635626639556?referrer=grok-com

This article is from the WeChat official account "AI Deep Researcher". Author: AI Deep Researcher. Republished by 36Kr with permission.