
Did NVIDIA End Its CUDA "Moat" by Its Own Hand? A Legendary Chip Architect Sparks Controversy

新智元2025-12-08 17:32
NVIDIA releases CUDA Tile, enabling kernel code to be written in Python to lower the threshold for AI development.

[Introduction] A quick look at the strategy and ambition behind CUDA Tile.

NVIDIA has just announced the most significant update in CUDA's 20-year history!

The most fundamental and disruptive piece of it is CUDA Tile, which lets developers write kernel code in Python instead of C++.

Introduced in CUDA 13.1, CUDA Tile is a brand-new way of writing GPU code, one that makes the whole process more convenient and more adaptable to future hardware.

It aims to lower the barrier to entry by abstracting away the details of the underlying hardware (such as Tensor Cores). Think of it as going from manually tuning every instrument in an orchestra to simply conducting the music.

This major update quickly drew the attention, and the skepticism, of Tenstorrent CEO Jim Keller, a legendary figure in the chip industry.

Jim Keller raised a pointed question: does this update end CUDA's "moat"?

His reasoning: once NVIDIA's GPUs shift to a tile-based architecture and other hardware vendors do the same, AI kernels will become far easier to port.

But is this really the case?

To discuss this issue clearly, we need to analyze two questions:

1. Who is Jim Keller? Why does his opinion carry weight?

2. What kind of technology is CUDA Tile? And what exactly is CUDA's moat?

Jim Keller is one of the most iconic CPU/SoC architects of the modern chip industry. Many in the field simply call him the "legendary architect" and "one of the GOATs of the chip world".

In short, he is someone who has genuinely rewritten the CPU roadmap.

His fingerprints can be found on most of the major turning points in x86, mobile SoCs, and AI chips over the past two decades.

More specifically:

  • One of the founders of the x86-64 era:

As a co-author of the x86-64 instruction set and HyperTransport, he directly shaped the ISA and interconnects of virtually every desktop and server CPU in use today.

  • Led teams through "company-level" turnarounds, more than once:

In the AMD Athlon/K8 era, AMD competed head-on with Intel in x86 performance for the first time. Zen brought AMD back from the brink to today's position as Intel's equal. At Apple, the A4/A5 chips launched the iPhone's self-developed SoC route, indirectly paving the way for the later M series.

  • A "full-stack" architect spanning CPUs, mobile SoCs, autonomous driving, and AI accelerators:

Few people have made front-line design and architecture decisions across general-purpose CPUs, mobile SoCs, in-vehicle SoCs, and AI accelerators the way he has. In recent years he has spoken regularly about future processes and architectures at forums hosted by TSMC, Samsung, and others, earning the title "legend of semiconductor design".

So Jim Keller's opinions carry real weight.

Did NVIDIA remove CUDA's "moat" through this update, or reinforce it in another form?

Last year, Jim Keller said bluntly that CUDA is "a swamp, not a moat".

Meaning: CUDA's complexity traps developers, and they cannot climb out.

Let's briefly review the history of CUDA.

Before this CUDA Tile update, the story begins in 2006, when NVIDIA released the G80 architecture alongside CUDA. CUDA abstracted the GPU's parallel computing units into general-purpose threads, opening the golden age of general-purpose GPU computing (GPGPU).

For twenty years, the programming model based on "Single Instruction, Multiple Threads" (SIMT) has been the "Bible" of GPU computing.

Developers learned to think from the perspective of a single thread: how do I map thousands of threads onto my data?

In today's era of the great explosion of artificial intelligence, the core atomic unit of computing is no longer a single scalar value, but tensors and matrices.

The traditional SIMT model becomes increasingly cumbersome and inefficient when dealing with such block-shaped data.

A Technical Rebuild: The Paradigm Break Between CUDA Tile and SIMT

To understand what CUDA Tile has updated, we must first understand why the old way doesn't work anymore.

The core assumption of the SIMT model is that the programmer writes one piece of serial code (a kernel), and the GPU hardware is responsible for instantiating that code across thousands of threads.

A rough analogy:

Imagine a foreman (the GPU's control unit) and 32 bricklayers (threads). To brighten an image, the foreman gives one command, and each worker handles one pixel. They work independently and in unison.

This is the essence of SIMT: although there are many people, they follow the same instruction and process their own small data.

This model works beautifully for image pixels or simple scientific computing, because each pixel's computation is independent.
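
To make the analogy concrete, here is a minimal Python sketch of the SIMT idea (plain Python, not real CUDA; all names are illustrative): the same kernel function is instantiated once per data element, and a warp is just a batch of 32 such instances.

```python
import numpy as np

WARP_SIZE = 32  # NVIDIA GPUs group threads into warps of 32

def brighten_kernel(thread_id, pixels, out, amount):
    """The per-thread view: one thread handles exactly one pixel."""
    out[thread_id] = min(pixels[thread_id] + amount, 255)

def launch(pixels, amount):
    """A toy 'GPU' that fans the same kernel out across all pixels,
    warp by warp, the way SIMT hardware replicates one instruction."""
    out = np.empty_like(pixels)
    n = len(pixels)
    for warp_start in range(0, n, WARP_SIZE):
        for tid in range(warp_start, min(warp_start + WARP_SIZE, n)):
            brighten_kernel(tid, pixels, out, amount)
    return out
```

Every call to `brighten_kernel` is independent of the others, which is exactly why this model excels at pixel-style workloads.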

However, the core of modern AI computing (deep learning) is no longer processing individual pixels; it is matrix multiplication.

At the hardware level, NVIDIA introduced Tensor Cores to accelerate matrix operations.

Tensor Cores process a 16x16 or larger matrix block at a time, rather than a single number.

To drive Tensor Cores from within the SIMT model, programmers still control individual threads. They must command 32 threads (a warp) to work in lockstep, manually move data from global memory into shared memory and then into registers, and synchronize through complex wmma (Warp-level Matrix Multiply-Accumulate) instructions.

Developers must carefully manage the synchronization between threads and memory barriers. A slight mistake can lead to deadlocks or data races.
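
The bookkeeping burden can be sketched in plain Python (no real GPU code here; the tile size of 16 mirrors a Tensor Core fragment, and the comments mark which hardware step each line stands in for):

```python
import numpy as np

TILE = 16  # tile edge, mirroring a 16x16 Tensor Core fragment

def blocked_matmul(A, B):
    """Blocked matrix multiply: the programmer manually stages 16x16
    tiles, mimicking the explicit shared-memory and register
    management that SIMT-era Tensor Core code requires."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            # accumulator fragment ("registers")
            acc = np.zeros((TILE, TILE), dtype=A.dtype)
            for k in range(0, K, TILE):
                a_tile = A[i:i+TILE, k:k+TILE]  # "load into shared memory"
                b_tile = B[k:k+TILE, j:j+TILE]
                acc += a_tile @ b_tile           # one wmma-like step
            C[i:i+TILE, j:j+TILE] = acc          # "write back to global"
    return C
```

On a real GPU every one of these staging steps also needs explicit synchronization between the 32 threads of the warp, which is where the deadlocks and data races come from.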

The Warp scheduling mechanisms and Tensor Core instruction sets of different generations of GPUs are different.

Code tuned for maximum performance on the Hopper architecture often cannot run directly on the Blackwell architecture and must be re-optimized.

This is the "swamp" Jim Keller was talking about: code littered with patches for different hardware features, neither elegant nor maintainable.

This is why SIMT falls short: it tries to use the logic of managing independent individuals (SIMT) to direct a highly coordinated collective action (the Tensor Core).

CUDA Tile: The Birth of Tiled Computing

CUDA Tile, introduced in CUDA 13.1, abandons the "thread" as the basic atomic unit altogether and instead makes the "tile" the core unit of programming.

Core concept: What is a Tile?

In the CUDA Tile model, a tile is defined as a subset of a multi-dimensional array.

Developers no longer think about "what operation does thread X perform", but about "how do I divide a large matrix into small tiles, and what mathematical operations (such as addition and multiplication) do I perform on those tiles".

The tile model (left) divides data into blocks, and the compiler maps them to threads. The SIMT model (right) maps data to both blocks and threads simultaneously.

This transformation is similar to moving from assembly language to a high - level language:

  • SIMT: Manually manage register allocation, thread masks, and memory coalescing.
  • Tile: Declare the shape of data blocks (Layout) and operators, and the compiler takes care of everything.
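
This division of labor can be illustrated with a toy Python sketch (assumed names, not the real cuTile API): the developer supplies only tile-level math, while a stand-in "runtime" does the mapping that the CUDA Tile compiler would handle on real hardware.

```python
import numpy as np

TILE = 16  # illustrative tile edge

def tile_matmul_kernel(a_tile, b_tile, acc):
    """All the developer writes in the tile model: the math performed
    on one tile. How tiles map to threads or Tensor Cores is not
    the developer's problem."""
    return acc + a_tile @ b_tile

def launch_over_tiles(A, B, kernel):
    """A toy 'runtime' that walks the tile grid, standing in for the
    scheduling and mapping work the CUDA Tile compiler would do."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            acc = np.zeros((TILE, TILE), dtype=A.dtype)
            for k in range(0, K, TILE):
                acc = kernel(A[i:i+TILE, k:k+TILE],
                             B[k:k+TILE, j:j+TILE], acc)
            C[i:i+TILE, j:j+TILE] = acc
    return C
```

The point of the design is that only `tile_matmul_kernel` belongs to the developer; everything in `launch_over_tiles` becomes the compiler's responsibility, so it can be retargeted per GPU generation.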

This programming paradigm is already familiar from languages like Python. Libraries such as NumPy let you declare array data like matrices and then express batch operations in a few lines of code.

Under the hood, the correct low-level operations run automatically; the whole process is completely transparent to the user.
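
For instance, the per-element loop below embodies the "one thread, one value" mindset, while the batch version declares the operation on the whole array and lets NumPy dispatch it:

```python
import numpy as np

def scale_loop(x, s):
    """Element-at-a-time: the 'one value per step' mindset."""
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] * s
    return out

def scale_batch(x, s):
    """Array-at-a-time: declare the whole-block operation and let
    NumPy choose the efficient low-level implementation."""
    return x * s
```

The two produce identical results; only the division of labor between programmer and library differs, which is the same shift CUDA Tile brings to GPU kernels.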

Architectural Support: CUDA Tile IR

This update is not mere syntactic sugar. NVIDIA introduced a brand-new intermediate representation: CUDA Tile IR.

CUDA Tile IR defines a virtual instruction set that lets developers program the hardware natively in terms of tile operations.

Developers can write higher - level code that can be efficiently executed on multiple generations of GPUs with only minor modifications.

This comparison shows that CUDA Tile is really NVIDIA's "dimensionality-reduction strike" on the AI programming paradigm: complex hardware details are encapsulated inside the compiler, and only the algorithmic logic is exposed.

In previous CUDA versions, C++ was always the first-class citizen.

In CUDA 13.1, however, NVIDIA took the rare step of launching cuTile for Python first, postponing C++ support.

This strategic shift deeply reflects the current situation of the AI development ecosystem: Python has become the universal language of AI.

Before this, if an AI researcher wanted to optimize an operator, they had to leave the Python environment and learn complex C++ and CUDA.

The emergence of cuTile aims to allow developers to write high - performance kernels while staying in the Python environment.

According to NVIDIA's technical blog, we can experience the transformation of cuTile through an example of