
NVIDIA tears down its own CUDA barrier: write GPU kernels in 15 lines of Python, with performance rivaling 200 lines of C++.

QbitAI, 2025-12-08 15:20
The "silicon sage" questions whether Nvidia is destroying its own moat

The landscape of GPU programming has changed.

Nvidia has released CUDA 13.1, and officially calls it the biggest advance since CUDA's debut in 2006.

The core change is the introduction of the brand-new CUDA Tile programming model, which allows developers to write GPU kernels in Python: just 15 lines of code can match the performance of 200 lines of CUDA C++.

As soon as the news broke, chip-industry legend Jim Keller asked:

Has Nvidia ended its own "moat" for CUDA? If Nvidia also switches to the Tile model, AI kernels will be easier to port to other hardware.

Jim Keller, who helped design AMD's Zen architecture, Apple's A-series chips, and Tesla's Autopilot silicon, is known as a "silicon sage", and his judgment carries significant weight in the industry.

So the question arises: what exactly has changed in CUDA this time, and why is it seen as Nvidia shooting itself in the foot?

The GPU programming paradigm shifts from "threads" to "tiles"

To appreciate the significance of this update, it helps to first recall how demanding traditional CUDA programming is.

For nearly 20 years, CUDA has used the SIMT (Single Instruction, Multiple Threads) model. Developers writing CUDA code must manually manage thread indices, thread blocks, shared-memory layout, and thread synchronization, worrying about every detail.

Fully exploiting GPU performance, especially with dedicated units like Tensor Cores, demands deep expertise.
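To make the SIMT bookkeeping concrete, here is a minimal sketch, simulated in plain Python rather than real CUDA C++: under SIMT, every thread must compute its own global index from its block and thread IDs, and guard against running past the end of the data.

```python
# Illustrative only: a plain-Python simulation of the per-thread index
# arithmetic that SIMT-style CUDA code manages by hand. On a real GPU
# the two loops would be thousands of threads running in parallel.

def simt_vector_add(a, b, block_dim=4):
    n = len(a)
    out = [0.0] * n
    num_blocks = (n + block_dim - 1) // block_dim   # ceil-divide data into blocks
    for block_idx in range(num_blocks):             # each thread block
        for thread_idx in range(block_dim):         # each thread in the block
            i = block_idx * block_dim + thread_idx  # global index, computed by hand
            if i < n:                               # bounds check, also by hand
                out[i] = a[i] + b[i]
    return out

print(simt_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# -> [11, 22, 33, 44, 55]
```

Even this toy example needs explicit block/thread index math and a bounds check; real kernels add shared-memory staging and synchronization on top.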

CUDA Tile completely changes this approach:

Developers no longer write execution paths thread by thread. Instead, they organize data into tiles and define what operations to perform on those tiles; how the operations map onto the GPU's threads, warps, and Tensor Cores is handled automatically by the compiler and runtime.

It's like what NumPy did for Python.
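The NumPy analogy can be shown directly (this illustrates the shift in mindset, not the actual cuTile API): describe the operation over a whole block of data and let the library decide how to execute it, instead of indexing element by element yourself.

```python
import numpy as np

a = np.arange(8, dtype=np.float32)
b = np.ones(8, dtype=np.float32)

# Element-wise thinking: the programmer walks the indices by hand,
# the way SIMT-style kernel code does.
out_loop = np.empty_like(a)
for i in range(a.size):
    out_loop[i] = a[i] * 2.0 + b[i]

# Array/tile thinking: one expression over the whole block of data;
# the library handles the iteration (and, on a GPU, the mapping to
# threads and hardware units).
out_vec = a * 2.0 + b

assert np.array_equal(out_loop, out_vec)
```

The two computations are identical; what changes is who owns the loop, and that is exactly the detail the Tile model takes away from the developer.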

Nvidia has built two core components for this:

CUDA Tile IR is a brand-new virtual instruction set. It adds a layer of abstraction between high-level languages and the hardware, so that Tile-based code can run on different generations of GPUs, from today's Blackwell to future architectures.

cuTile Python is the developer-facing interface. It lets developers write GPU kernels directly in Python, lowering the barrier to entry from "HPC expert" to "data scientist who can write Python".

In addition, this update also brings a series of performance optimizations for Blackwell:

cuBLAS adds emulation of FP64 and FP32 precision on Tensor Cores

The new Grouped GEMM API delivers up to a 4x speedup in MoE (Mixture of Experts) scenarios

cuSOLVER's batched eigenvalue decomposition on the Blackwell RTX PRO 6000 is roughly 2x faster than on the L40S

The developer tool Nsight Compute adds support for performance analysis of CUDA Tile kernels and can map performance metrics directly back to cuTile Python source code.

Currently, CUDA Tile only supports the Blackwell architecture (compute capabilities 10.x and 12.x), and the development focus is on AI algorithms. Nvidia says it will expand to more architectures in the future and launch a C++ implementation.

The silicon sage's doubt: lowering the barrier is a double-edged sword

So why does Jim Keller say that Nvidia may have "ended its own moat"?

The key point is that the Tile programming model is not exclusive to Nvidia. Hardware from AMD, Intel, and other AI chip makers can also support tile-based programming abstractions at the architecture level.

In the past, CUDA code was hard to port largely because the SIMT model was deeply bound to Nvidia's hardware. Developers hand-optimized code for specific GPU architectures, and on other vendors' hardware that code either failed to run or lost much of its performance.

The Tile model, by contrast, sits at a higher level of abstraction. Once developers adopt the mindset of "define the tile operations and leave the hardware details to the compiler", the same algorithmic logic can, in theory, be adapted more easily to other hardware that supports tile programming.

As Jim Keller said: "AI kernels will be easier to port."

However, Nvidia has a countermeasure built in: CUDA Tile IR does provide cross-generation compatibility, but that compatibility is anchored to the CUDA platform.

Developers' code is indeed easier to port, but the porting target is different generations of Nvidia's own GPUs, not the hardware of competitors.

From this perspective, CUDA code can be seamlessly migrated from Blackwell to the next - generation Nvidia GPUs, but migrating to AMD or Intel platforms still requires rewriting.

Regardless of whether the moat is deepened or weakened, one thing is certain: The threshold for GPU programming has been significantly lowered.

In the past, developers fluent in CUDA were a scarce resource: plenty of people can write Python, but very few experts can tune code to fully exploit Tensor Cores.

CUDA Tile and cuTile Python break through this bottleneck. Nvidia notes in its developer blog that a 15-line Python kernel can match the performance of 200 lines of hand-optimized CUDA C++.

A large number of data scientists and AI researchers can now directly start writing high - performance GPU code without waiting for HPC experts to help with optimization.

Reference links:

[1]https://developer.nvidia.com/blog/focus-on-your-algorithm-nvidia-cuda-tile-handles-the-hardware

[2]https://x.com/jimkxa/status/1997732089480024498

This article is from the WeChat official account "QbitAI". Author: Meng Chen. Republished by 36Kr with permission.