Flash Attention author's latest podcast: NVIDIA's GPU dominance will end within three years
How much longer can NVIDIA "run amok"? — No more than three years!
Is a new architecture needed to achieve AGI? — No, Transformer is enough!
"In recent years, the inference cost has decreased by 100 times, and it is expected to decrease by another 10 times in the future!"
These "bold statements" come from the author of Flash Attention — Tri Dao.
On the latest episode of the Unsupervised Learning podcast, Tri Dao shared in-depth views on the GPU market, inference costs, model architecture, and where AI is heading, and laid out the reasoning behind the "bold statements" above:
- Over the next 2-3 years, as dedicated chips emerge for different workload categories (low-latency agent systems, high-throughput batch processing, interactive chatbots), the AI hardware landscape will shift from NVIDIA's roughly 90% dominance today to a more diversified ecosystem.
- Techniques such as the MoE architecture, inference optimization, model quantization, and the co-design of model architecture and hardware have all helped drive down inference costs.
- Going forward, there will be three workload patterns: traditional chatbots, extremely low-latency scenarios, and large-scale batch/high-throughput processing, and hardware vendors can optimize for each.
…
Tri Dao is not only the author of Flash Attention but also one of the authors of Mamba.
He is also Chief Scientist at Together AI and a professor at Princeton University.
SemiAnalysis has praised his contributions to the NVIDIA ecosystem, calling them an important part of NVIDIA's moat.
In other words, his judgments on the hardware market and the future of AI hardware are well worth paying attention to.
Next, let's take a look together!
The full text of the interview is organized as follows:
(Note: Some interjections and transitions have been adjusted for easier reading.)
Interview Content
NVIDIA's Dominance and Its Competitors
Q: Will we see new competitors to the NVIDIA ecosystem, whether at the chip level or in GPU system integration?
Tri Dao: I've indeed spent a lot of time thinking about chips. I believe there will definitely be many competitors entering this field.
AMD has been in this space for a long time. NVIDIA dominates for several reasons: they design very good chips and also build excellent software, which together form a complete ecosystem that others develop more software on top of. However, I think as workloads concentrate on specific architectures, such as Transformer and MoE, it will become easier to design chips suited to those workloads.
On the inference side, AMD has some advantages, such as larger memory, and we've started to see some teams making attempts there. The training side is harder: networking is the main bottleneck, and NVIDIA still leads in that respect.
But people have understood what the challenges are in building excellent training chips and what the challenges are in building excellent inference chips. In the end, it all comes down to execution. So I'd say this is a very exciting field. I've communicated with many people designing new chips, whether for inference or training.
I expect that in the next few years, part of the workload will enter a "multi-chip" era: instead of roughly 90% running on NVIDIA chips as it does now, it will be spread across different chips.
Jacob Effron: Do you think the current architecture is stable enough to support long-term bets on inference and training workloads over the next two or three years? Or is there still enough uncertainty that startups and companies will each place their own bets, with only one or two standing out in the end?
Tri Dao: I think at the architectural level, from a macro perspective, it seems to have stabilized around Transformer.
But if you look closely, you'll find that many changes are still taking place.
The most prominent one in the past two years is Mixture of Experts (MoE). It makes the model larger with more parameters, but the computation is sparse.
This brings some trade-offs: it requires more memory, but relatively less computation.
For some chip manufacturers this raises the difficulty: they may have designed for dense models with a uniform compute pattern, and now they have to handle sparse computation, which is more complex to design for.
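For intuition, here is a minimal top-k MoE layer in PyTorch. It is only an illustrative sketch (the names and sizes are made up, and real systems group tokens by expert rather than gathering weights per token), but it shows the trade-off described above: parameters scale with the number of experts while each token only pays for k of them.

```python
import torch
import torch.nn.functional as F

class TopKMoE(torch.nn.Module):
    """Toy mixture-of-experts MLP: each token activates only k of n_experts."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = torch.nn.Linear(d_model, n_experts)
        # Parameter count grows with n_experts ...
        self.w_in = torch.nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_ff))
        self.w_out = torch.nn.Parameter(0.02 * torch.randn(n_experts, d_ff, d_model))

    def forward(self, x):                          # x: (tokens, d_model)
        weights, experts = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # (tokens, k) routing weights
        out = torch.zeros_like(x)
        # ... but each token only runs through k experts, so compute stays sparse.
        for i in range(self.k):
            w_in = self.w_in[experts[:, i]]        # (tokens, d_model, d_ff), gathered per token
            w_out = self.w_out[experts[:, i]]
            h = torch.relu(torch.einsum('td,tdf->tf', x, w_in))
            out += weights[:, i:i+1] * torch.einsum('tf,tfd->td', h, w_out)
        return out

moe = TopKMoE()
print(moe(torch.randn(4, 512)).shape)              # torch.Size([4, 512])
```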
Attention is another example: it has been around for more than a decade, but it is still evolving, which actually makes some things difficult.
For instance, DeepSeek proposed Multi-head Latent Attention (MLA), which differs from traditional attention; among other things, it uses a very large head dimension.
If the matrix-multiplication engine in your system only supports certain fixed sizes, that may not match.
Problems like this will emerge once you get into the details. So these are the challenges in architecture.
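As a back-of-the-envelope illustration of that mismatch (every number below is an assumption for illustration, not anything from the interview): an attention kernel that keeps fp16 tiles of Q and K on-chip sees its per-tile footprint grow with the head dimension, so a head size like MLA's roughly 576-wide "absorbed" form can blow past a shared-memory budget that a 128-wide head fits comfortably.

```python
# Rough per-tile on-chip footprint for an attention kernel that stages a
# (BLOCK_M x head_dim) Q tile and a (BLOCK_N x head_dim) K tile in fp16.
# SMEM_BUDGET_KIB is an assumed ~228 KiB per-SM shared-memory limit.
SMEM_BUDGET_KIB = 228

def tile_kib(block_m=128, block_n=128, head_dim=128, bytes_per_elem=2):
    return (block_m + block_n) * head_dim * bytes_per_elem / 1024

for head_dim in (128, 256, 576):   # 576 ~ MLA's absorbed KV width (assumption)
    kib = tile_kib(head_dim=head_dim)
    fits = "fits" if kib <= SMEM_BUDGET_KIB else "exceeds"
    print(f"head_dim={head_dim}: ~{kib:.0f} KiB of tiles -> {fits} the budget")
```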
At the workload level, the way people use these models is also changing significantly.
The traditional way is through chatbots (although "traditional" only refers to the past two or three years), but now new workloads have emerged, such as programming workloads — tools like Cursor and Windsurf.
This type of workload, which is closer to an agent, not only runs the model but also needs to call tools, such as running a Python interpreter or doing a web search.
This will bring challenges to chip design. If a chip only focuses on making the model run as fast as possible, it may ignore the ability to connect to the host to perform tasks like web searches.
So I'd say that although the architecture seems stable at a high level, there are still many changes at the low level. And the workload itself is also evolving, so it's always a "race" to see who can adapt to the new workloads faster.
Challenges in Chip Design
Q: If 90% of the workload still runs on NVIDIA chips now, what do you think will happen in two or three years?
Tri Dao: I think on the inference side, there will be diversification. We've already started to see challenges from companies like Cerebras, Groq, and SambaNova.
They emphasize that they can achieve extremely low-latency inference, which is great for certain scenarios.
Talking with customers, we've found that some care very much about the lowest possible latency and are willing to pay a premium for it. Others are particularly concerned with large-batch, high-throughput inference: massive data processing, synthetic data generation, or rapidly rolling out and generating large numbers of trajectories during reinforcement-learning training.
So I think the market will definitely diversify, because the workloads themselves are becoming more diverse: low-latency, high-throughput, and even video generation, each placing different demands on compute and memory.
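To make the latency/throughput tension concrete, here is a toy, memory-bound model of decoding (every number is an assumption for illustration, not a benchmark): each step must stream the weights once plus each sequence's KV cache, so larger batches amortize the weights and raise aggregate throughput, but every user's per-token latency grows.

```python
# Toy decode model: step time = bytes moved / memory bandwidth (assumes purely memory-bound decoding).
HBM_BW = 3.0e12          # assumed bytes/s of HBM bandwidth
WEIGHT_BYTES = 140e9     # assumed ~70B parameters in fp16
KV_BYTES_PER_SEQ = 2e9   # assumed KV-cache bytes read per sequence per step

for batch in (1, 8, 64, 256):
    step_time = (WEIGHT_BYTES + batch * KV_BYTES_PER_SEQ) / HBM_BW
    print(f"batch={batch:4d}  latency ~{step_time*1e3:7.1f} ms/token  "
          f"throughput ~{batch/step_time:8.0f} tok/s")
```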
Jacob Effron: How do startups bet on different types of optimization?
Tri Dao: If you're a startup, you have to place a bet. When you invest, you're really making a bet that's out of the ordinary.
You might bet that chatbots will eventually disappear, and what people really care about is something else, such as video models, video generation models, world models, or robots.
Then you roll the dice and say, okay, that might account for 50% of the workload.
So how do we design a chip for this workload? You can only hope that your bet is correct. I think that's the role of startups.
If you don't place a bet and just say you want to optimize for general workloads, then large companies will completely crush you in terms of execution.
Jacob Effron: Why not try other companies besides NVIDIA? Will there be huge salaries in the hardware field?
Tri Dao: Personally, I've actually collaborated with engineers from many different companies, including NVIDIA, AMD, Google, Amazon, etc.
I spend a lot of time on NVIDIA chips simply because they are the most widely available products we can use at this stage.
They design very good chips and have excellent software support, which allows me to do many interesting things, and that's what I'm after: whether I can create something interesting.
For example, we previously collaborated with AMD on a version of Flash Attention and integrated it into the public repository.
So we do collaborate with them. As for what the best collaboration model should be, I'm not quite sure yet.
Recently, though, I've been thinking more about: what kind of abstractions do we need? Not just for NVIDIA chips, but for GPUs and accelerators as a whole.
At the lowest level, I'll still spend a lot of effort squeezing out the performance of these chips.
But as we expand at Together AI, we have to consider: how can we enable new engineers to get up to speed faster? Part of it is building abstractions that can work on NVIDIA chips and may also be compatible with other chips.
Another exciting question for me is: can we design some abstractions to let AI itself do part of the work for us?
I think the answer isn't entirely clear yet. But as human technology leaders, our task is to build appropriate abstractions so that others can quickly get started, so that what we do can work across chips and workloads.
Jacob Effron: Do you think we already have abstractions that can work across different chips?
Tri Dao: I think we have some, right?
But this is a classic trade-off. Triton, for example, is very useful: it supports NVIDIA chips, AMD GPUs, Intel GPUs, and so on. That requires designing one front-end, with the back-end code for each manufacturer's chips contributed by the different companies.
I think Triton is actually very good, and many companies are betting on it. For example, Meta's PyTorch compiler will directly generate Triton code and then hand it over to Triton to generate low-level code for NVIDIA or AMD.
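For readers who haven't seen Triton, the kernel below is the standard vector-add pattern from Triton's tutorials, included here only to show what the abstraction looks like: the same Python-level source is lowered by the NVIDIA or AMD back-end without the author writing PTX or assembly.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # each program handles one block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```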
But this is still a trade-off: if you don't control the lowest level, you may lose some performance.
The key lies in how much performance you lose. If you only lose 5% of the performance but gain 3 times the productivity, it's definitely worth it.
But if the loss is too large, people may return to a lower-level approach that is closer to the hardware, especially in the highly competitive inference market.
So I'd say that manual design is actually very difficult. I'd even say that hardware portability is a bit of a myth.
Even within NVIDIA, the differences between generations are very large. CPUs may only see a 5-10% performance improvement each year, and old code can still run, but it's completely different for GPUs.
NVIDIA almost has to rewrite all the low-level code for each new generation of chips because the way to increase FLOPS is to add more dedicated components, support lower precision, or rewrite the synchronization mechanism inside the chip.
So even within NVIDIA, the code portability between different generations is actually very limited.
Q: The value of abstractions is that they can help even when dealing with different generations of chips from the same manufacturer, right?
Tri Dao: I think Triton's abstractions are very attractive. They even have some lower-level extensions, such as the recently introduced Gluon, which exposes more hardware details at the cost of less generality. There's also the Mojo language being developed by Modular.
Jacob Effron: What do you think of what they're doing?
Tri Dao: I think it's cool. They've indeed found some correct abstractions. The key lies in execution.
Because everyone will ask: "How fast is it really on NVIDIA chips?" In a sense, this question isn't entirely fair, but that's the reality.
So they have to do some customization in addition to the abstractions to make the code run fast enough on NVIDIA chips and then do some customization for AMD.
The question is, how much customization are you willing to do? This is the trade - off between performance and generality.
We'll see more and more such libraries or domain-specific languages emerging. For example, a group at Stanford is working on ThunderKittens to abstract GPU programming, and Google has Mosaic GPU.
I'm sure I've missed some. But everyone has realized a problem: we don't have the appropriate abstractions yet. This makes it very painful to train new people to write high-performance GPU kernels.
The solution is to build abstractions. I think we're in a stage of rapid iteration, which is why so many domain - specific languages are emerging.
Meanwhile, as AI models become more powerful, I'm thinking about how we should design domain-specific languages or abstractions for language models themselves, because the way they operate is a bit different from humans, and we don't know the answer yet. So I think things will become much clearer in the next year or two. Right now it's a free-for-all, and everyone is trying different directions.
Jacob Effron: Where do you think these abstractions are most likely to come from?
Tri Dao: I think there are mainly two perspectives:
- One is from the perspective of machine learning, thinking about what workloads we have and what primitives are needed to express these workloads.
For example, inference is essentially a memory-limited problem. The key is how to move data as quickly as possible or how to do matrix multiplication as fast as possible.
- The other perspective is from the hardware. There are many very cool dedicated components on the chip, and we need to think about how to expose these capabilities.
NVIDIA is particularly strong in this regard, such as designing more asynchronous mechanisms.
However, the matrix multiplication is so fast that other parts seem slow in comparison. So it's more important to overlap matrix multiplication with other computations. This requires an abstraction layer to support asynchronous execution, such as pipelining and synchronization mechanisms.
So I think abstractions will emerge from these two directions, either from the workload or from the hardware. I think it will become much clearer in a year or two.
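A coarse-grained version of the overlap described above can already be sketched with CUDA streams in PyTorch. The snippet below is only illustrative, and not anything shown in the interview: real kernels overlap at a much finer grain, with warp-specialized pipelines inside a single kernel, and whether two kernels actually run concurrently depends on available SM resources.

```python
import torch

assert torch.cuda.is_available(), "illustrative sketch; needs a CUDA GPU"

a = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
x = torch.randn(1 << 26, device="cuda")

side = torch.cuda.Stream()
torch.cuda.synchronize()
for _ in range(10):
    c = a @ b                              # tensor-core matmul issued on the default stream
    with torch.cuda.stream(side):          # memory-bound elementwise work issued on a second stream
        y = torch.nn.functional.gelu(x)
torch.cuda.synchronize()
```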
Jacob Effron: To what extent are you really using AI itself when designing abstractions? What changes do you think will happen in the next few years?
Tri Dao: Yes, I think models are starting to be useful in this regard. This really surprised me recently. Some people have already tried fully automated GPU kernel writing: you just describe the problem, and the LLM can directly generate the kernel code.
This is a bit like what we've seen in other fields, such as generating simple Python scripts, doing data analysis, or writing front - end web pages, right? The LLM can already do these things now. So the question is: can we also generate code for GPU programming?
Jacob Effron: Vibe kernel?
Tri Dao: If that's what you're referring to, I think we're still in a very early stage.
These models can now generate some simple kernels: element-wise operations, where you take an array and apply an operation to each element, or reductions such as summation and normalization.
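A row-wise softmax is roughly the level of "reduction plus normalization" being described here. The sketch below follows Triton's standard fused-softmax tutorial pattern (it is not code from the interview) and assumes the input is contiguous and each row fits in a single block.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)                                    # one program per row
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                                 # reduction 1: row max, for numerical stability
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)                           # reduction 2: row sum, then normalize
    tl.store(out_ptr + row * n_cols + cols, out, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(x.shape[1])
    softmax_kernel[(x.shape[0],)](x, out, x.shape[1], BLOCK=BLOCK)
    return out
```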
The models can generate this kind of code fairly well. But once it gets more complex, these models can't write the correct code.
I think this is mainly because of the lack of training data.
Training data is very difficult to obtain in this area. Because if you scrape kernel code from the Internet, you may get some classroom projects or documentation from