GTC Headline Dialogue, Jeff Dean x Bill Dally: The Pre-training Paradigm Is Dead, the Latency Bottleneck Isn't in Computation, and a Thorough Discussion of the Next Five Years of AI
Once agents start running, many tools designed for humans will become new bottlenecks.
This morning, a high-profile dialogue concluded at GTC 2026. The two participants were Bill Dally, Chief Scientist at NVIDIA, and Jeff Dean, Chief Scientist of Google DeepMind and Google Research.
This is an annual tradition at GTC, where a top-notch expert is invited to chat with NVIDIA's Chief Scientist. The year before last it was Fei-Fei Li; last year it was Yann LeCun. These dialogues are usually packed with high-density information. Bill Dally represents NVIDIA's understanding of GPUs, inference, networks, and system architectures; Jeff Dean represents Google's judgment on TPUs, large-model training, Gemini, and large-scale machine-learning systems.
Unlike the question-and-answer formats we often organize, these two experts each prepared questions for the other. So this is probably the most fascinating transcript I've put together recently; it reads like two grandmasters trading their most refined techniques, with a touch of Zen.
Their questions were very specific, and their answers were direct: What exactly has changed in the past year? Why has inference suddenly become more important than training? Where exactly is the low-latency bottleneck? Will pre-training be rewritten? Can AI design the next generation of AI on its own? Can AI, in turn, help humans design chips?
Below is a summary of the core discussions between Jeff Dean and Bill Dally in the order of the dialogue.
Bill Dally:
What was the most exciting change in machine learning in the past year?
What will happen next year?
Jeff Dean: I think everyone in this field has witnessed the rapid progress of model capabilities in the past year and how people have started to truly utilize these models. So overall, it's all very interesting and exciting.
If I look back on the past year, I'd like to highlight a few things.
First, I think models have become much stronger in problems with verifiable rewards, such as mathematics and programming.
Three or four years ago, if our model could correctly answer eighth-grade math problems like "Fred has four rabbits and gets two more" with an accuracy rate of 40% or 50%, people would have been very excited and said, "This is great."
But in the past few years, especially in the past year, our ability to solve complex math problems has improved very rapidly. For example, Gemini won a gold medal in the IMO, and we also won a gold medal in the programming competition ICPC. So I think the progress in these two fields is extremely remarkable.
Another change, more recent but equally important, is that we've started to see agent-based workflows become truly effective on longer-horizon tasks.
Previously, when you asked a model to do something, it would indeed do it. But usually, after a few minutes, you'd have to come back and tell it, "Okay, this step is done. What's the next step?" Now, you can assign tasks that take an hour or even days to these models, and they'll go out and do many things on their own, correcting themselves along the way and continuing to do more.
I think this is a very exciting transformation because it means these models can now operate relatively autonomously over a longer period. In the past, although you weren't interacting with them all the time, you still had to supervise them quite closely.
This is obviously a significant change.
Moreover, I think a very important thing in the future is that we'll have more and more agents running in the background.
So, a very crucial question becomes: how can we achieve ultra-low-latency inference?
Because if these systems are to work autonomously and more quickly, inference latency will directly determine their efficiency in solving problems.
Jeff Dean:
So, I'd like to ask you at NVIDIA in return:
How do you plan to actually achieve a "significant reduction in latency" in your next-generation architecture?
How do we go from the current few hundred tokens per second to thousands or even tens of thousands of tokens per second?
How should the next-generation low-latency inference architecture be designed?
Bill Dally: Simply put, the answer actually has multiple layers.
If you look at the performance curve of an inference task, you'll find that it's essentially a trade - off curve between latency and throughput.
At one end of the curve, if you're willing to sacrifice latency, you can achieve extremely high throughput; that is, you can process more tokens per second per dollar or per watt of power consumption.
As you move to the other end of the curve by reducing the batch size, the system will be more suitable for interactive scenarios, focusing on the response speed for individual users.
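This trade-off curve can be sketched with a toy batching model: each decode step pays a fixed cost that is amortized across the batch. All constants below are illustrative assumptions, not measured figures from either company's systems.

```python
# Toy model of the latency/throughput trade-off in batched decoding.
# All constants are illustrative assumptions, not measured figures.

def step_time_ms(batch_size, fixed_ms=5.0, per_seq_ms=0.1):
    """One decode step: a fixed cost (weight loads, communication)
    plus a small per-sequence compute cost."""
    return fixed_ms + per_seq_ms * batch_size

def per_user_tokens_per_s(batch_size):
    # Each user receives one token per step.
    return 1000.0 / step_time_ms(batch_size)

def aggregate_tokens_per_s(batch_size):
    # The whole batch emits batch_size tokens per step.
    return batch_size * per_user_tokens_per_s(batch_size)

for b in (1, 8, 64, 512):
    # Larger batches raise aggregate throughput (tokens/s per machine,
    # hence per dollar or per watt) but slow each individual user.
    print(b, per_user_tokens_per_s(b), aggregate_tokens_per_s(b))
```

At batch size 1 the fixed cost dominates and the per-user rate is highest; at batch size 512 aggregate throughput is tens of times higher while each user decodes several times slower, which is exactly the curve being described.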
When you reach the limit of the curve, that is, when optimizing solely for latency reduction, you'll find a key fact: Most of the latency actually comes from communication.
A typical LLM consists of many feed-forward networks and attention modules stacked together. The entire model may have 50 or even hundreds of layers. After each computational module completes, on-chip communication is usually required to pass the results to the next step. After each layer is computed, off-chip communication is often needed. Sometimes cross-chip communication is even required between modules within the same layer, depending on how you partition the work.
So one of the important things we're doing now is redesigning the architecture to truly compress communication latency down to what we at NVIDIA often call the speed of light.
In terms of on-chip communication, we adopt a tiled design and use static scheduling to avoid the extra overhead of routing, queuing, and arbitration. This way, signals propagate through chip wires at close to the physical limit, approximately 2 millimeters per nanosecond.
The communication time from one corner of the chip to the other can then drop from the currently common hundreds of nanoseconds to about 30 nanoseconds. On the off-chip side, a large part of the latency comes from the physical interface (PHY).
For many years in the past, we optimized the physical interface for maximum bandwidth rather than low latency. To accurately recover data (bits) from a noisy high - speed link, we need to perform very complex digital signal processing and forward error correction.
But if you're willing to sacrifice a little bandwidth, for example reducing the rate of each lane pair from 400 Gbps to 200 Gbps, much of that complexity becomes unnecessary. You only need to detect the line voltage to recover the data, and the remaining latency is mainly serialization: communication between chips then takes only a few clock cycles.
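These figures are easy to sanity-check with back-of-envelope arithmetic. In the sketch below, the die dimensions, flit size, and clock rate are illustrative assumptions; only the 2 mm/ns wire speed and the 400/200 Gbps lane rates come from the discussion above.

```python
# Back-of-envelope checks on the latency figures above.
# Die size, flit size, and clock rate are illustrative assumptions.

# On-chip: signal speed in chip wires, per the discussion, is ~2 mm/ns.
SPEED_MM_PER_NS = 2.0
path_mm = 60.0  # assumed Manhattan route across a large (~26 x 33 mm) die
propagation_ns = path_mm / SPEED_MM_PER_NS  # ~30 ns, matching the talk

# Off-chip: time to serialize one flit onto a lane (Gbps = bits per ns).
def serialize_ns(flit_bits, gbps_per_lane):
    return flit_bits / gbps_per_lane

flit = 128  # assumed flit size in bits
fast = serialize_ns(flit, 400)  # 0.32 ns at 400 Gbps
slow = serialize_ns(flit, 200)  # 0.64 ns at 200 Gbps

# At an assumed 2 GHz clock (0.5 ns per cycle), even the slower lane
# serializes a flit in roughly one cycle: "a few clock cycles" once
# simple voltage detection replaces heavy DSP and error correction.
print(propagation_ns, fast, slow)
```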
Therefore, I clearly see a path to get there: build a new low-latency router, just like what I did on the BlackWidow project at Cray 20 years ago, where pin-to-pin latency through the router was under 50 nanoseconds.
I think we can definitely reach this level again.
Once we achieve this, I can imagine that even for relatively large models, we can achieve a processing speed of 10,000 to 20,000 tokens per second for each user.
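Those targets imply a tight per-token budget, which shows why router latency dominates. The layer count and hop count below are illustrative assumptions, not a description of any real deployment:

```python
# Per-token latency budget implied by the tokens-per-second targets.
def budget_us(tokens_per_second):
    return 1_000_000 / tokens_per_second

print(budget_us(10_000))  # 100.0 microseconds per token
print(budget_us(20_000))  # 50.0 microseconds per token

# With an assumed 100-layer model and two chip-to-chip hops per layer
# (an illustrative partitioning), a 50 us token budget leaves only about
# 250 ns per layer-hop for compute plus communication, which is why
# sub-50 ns routers matter at this scale.
```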
Jeff Dean: This is really exciting. I think a very important point is that not only small models but also the largest-scale models can run with such low latency.
Bill Dally: Yes, I also think this is the key.
Bill Dally:
Next question. How far are we from "letting Gemini design the next-generation Gemini"?
You mentioned these agent systems earlier and that they've started to handle longer-horizon tasks.
So how far do you think we are from the moment when we take the current version of Gemini, give it a one-month task, and let it experiment with new model structures on its own, come up with data-screening strategies, decide how to obtain more data (even writing contracts to get that data), and then train the next-generation version of itself?
In other words, how far are we from a model designing its own next-generation version?
Jeff Dean: I think the entire closed - loop process you described hasn't fully arrived yet.
But I do think we're starting to see its prototype emerging.
For example, now you can give a high - level order to the model: "Explore some ideas for performance improvement in this general direction."
Then, it will automatically conduct 50 experiments, filter out 40 unpromising directions, focus on the remaining 10 promising ones, and continue with in - depth follow - up verification.
I've recently come to view this kind of work as a new form of meta-learning.
Actually, we started trying similar things many years ago. For example, in 2017 the Google Brain team was doing neural architecture search (NAS). At that time, you needed code to define a search space and then run many small-scale experiments to see which architecture learned best. Later we also tried automated optimizers, automated activation functions, and so on.
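That 2017-era loop, a hand-coded search space plus many small experiments, can be sketched as random search. The search space and score function below are hypothetical stand-ins; real NAS briefly trained each candidate and scored it by validation accuracy (often with a learned controller rather than uniform sampling).

```python
import random

# Minimal sketch of a 2017-style NAS loop: a hand-coded search space,
# many small experiments, keep the best candidate. The search space and
# score function are hypothetical stand-ins for real training runs.

SEARCH_SPACE = {
    "depth": [2, 4, 8],
    "width": [64, 128, 256],
    "activation": ["relu", "swish", "gelu"],
}

def evaluate(arch):
    # Stand-in for "train a small model, measure validation accuracy".
    # Deterministic per architecture so results are reproducible.
    return random.Random(str(sorted(arch.items()))).random()

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

best, score = random_search()
print(best, score)
```

The shift described next in the conversation is that the hand-written SEARCH_SPACE above can now be replaced by a natural-language instruction to the model itself.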
At that stage, researchers had to write code themselves to define the research scope. But now, the most exciting change is that we've started to be able to define the research space using natural language.
Now you can directly give it instructions like:
"Make yourself stronger."
"Explore some interesting distillation algorithms."
"Try to utilize the information that we haven't used yet."
Then it will really go out and conduct these experiments. So I think this is essentially a very powerful, natural-language-driven automated search.
Bill Dally: Yes, in essence this will be a very strong multiplier on research productivity. Coming up with research ideas is often not that difficult; the hard part is actually running the experiments, understanding the results, and deciding what to do next.
If an agent can take on that part of the work, it forms a very powerful combination: a super-researcher plus a super-agent.
Jeff Dean:
A hardware project is initiated today, but the chips won't be installed in the data center until two years later.
How do you predict AI two to five years from now?
There's always a difficult problem in hardware development.
Especially in a field like machine learning, which is changing very rapidly, when you start a new hardware project today, even if everything goes smoothly, it often takes two years for the chips to actually be installed in the data center. Of course, we hope it can be shorter, but it's really difficult in reality. And then the hardware has to last for many years.
So, in fact, you're predicting: Where will machine learning and AI be heading in two to five years?
This has always been a very difficult thing.
I'm curious whether you at NVIDIA have any good tools or methods to help with this "crystal-ball gazing"?
Bill Dally: We do our best.
One way is that we also try to develop models ourselves.
For example, we developed Nemotron for LLMs, Cosmos for world models, and GR00T for robot foundation models.
But even so, we're still often surprised.
Because there are so many smart people working on these things, and they come up with new and good ideas every day.
So, ultimately, one thing we must do is:
future-proof our hardware
One way is to do things that are beneficial to all models.
For example, if we can find a more efficient digital representation, all models will benefit.
If we can organize on-chip communication more efficiently and reduce data transfer, all models will benefit.
The real problem arises when model changes alter the balance among four resources: computation, memory bandwidth, memory capacity, and communication.
Because even if you make all four of these resources very efficient, you still have to decide: How much of each resource should be allocated?
And once someone invents a different model, for example switching from grouped-query attention (GQA) to multi-head latent attention (MLA), it may significantly change these ratios.
As a result, some parts of the hardware will be idle, while others will be fully utilized.
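The attention-variant example is concrete: changing what gets cached mainly shifts the memory-capacity requirement. A rough per-token KV-cache comparison, with all model dimensions chosen as illustrative assumptions rather than any particular model's real configuration:

```python
# Rough per-token KV-cache size, showing how an attention variant shifts
# the memory-capacity requirement. All dimensions below are illustrative
# assumptions, not any particular model's real configuration.

def kv_bytes_per_token_gqa(layers, kv_heads, head_dim, dtype_bytes=2):
    # Grouped-query attention caches K and V per KV head, per layer.
    return layers * 2 * kv_heads * head_dim * dtype_bytes

def kv_bytes_per_token_mla(layers, latent_dim, dtype_bytes=2):
    # Multi-head latent attention caches one compressed latent per layer.
    return layers * latent_dim * dtype_bytes

gqa = kv_bytes_per_token_gqa(layers=60, kv_heads=8, head_dim=128)
mla = kv_bytes_per_token_mla(layers=60, latent_dim=576)
print(gqa, mla, gqa / mla)  # the ratio is what shifts hardware demand
```

Under these assumed dimensions the MLA variant needs several times less cache capacity per token, so hardware provisioned with memory for the GQA case would see that memory sit partly idle.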
There's no truly perfect solution to this problem.
Maybe in the future, if the models really diverge enough and each form has enough volume, the final answer might be to create different SKUs with different streamlined configurations to hedge against future uncertainty.
Jeff Dean: Yes, this really makes sense.
Bill Dally:
If we're running out of data, how can we continue to scale the models?
In the past few years, at least in recent history, when we trained models, we followed the Chinchilla Scaling Laws: That is, given a certain amount of training computing power, you'd decide on the number of parameters and the number of tokens. Usually, the number of tokens is about 20 times the number of parameters.
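That rule of thumb combines with the standard training-compute approximation C ≈ 6·N·D (N parameters, D tokens) to pin down both numbers from a compute budget. The following is a sketch of that arithmetic, not the full Chinchilla scaling-law fit:

```python
import math

# Chinchilla-style allocation: tokens D ~= 20 * parameters N, with
# training compute C ~= 6 * N * D (the standard FLOPs approximation).
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    # C = 6 * N * (20 * N) = 120 * N^2  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Example: a 1e24 FLOP training budget.
n, d = chinchilla_optimal(1e24)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~9e10 params, ~1.8e12 tokens
```

The gap the question points at follows directly: every 10x increase in compute demands roughly 3x more tokens under this rule, and at today's budgets that is trillions of tokens per generation.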
But now it seems we've reached a stage where it's difficult to obtain more tokens. Yet we still want to continue scaling and invest more computing power in training.
So, what do you think will fill this gap? If it really becomes more and more difficult to obtain data