
Karpathy praises a computer built inside a Transformer: it streams over 30,000 tokens per second and solves the world's hardest Sudoku.

QbitAI, 2026-03-17 15:48
LLM can perform accurate calculations without external tools.

LLM reasoning has reached the top tier, but accurate calculation still lags behind.

How to break this situation?

Here is the solution that Karpathy endorsed: build a native computer inside the large model.

The new method does not outsource computation to any external tool; it embeds an executable program directly in the Transformer weights.

Through a novel 2D attention-head design, it cuts the per-token attention cost from linear to logarithmic, dramatically improving the large model's inference efficiency.

It can achieve a streaming output of over 30,000 Tokens per second on an ordinary CPU.

Embed a native computer in the Transformer

By now it is no surprise that state-of-the-art large models can win gold medals at Olympiad-level competitions.

Some can even take on mathematical and scientific problems that humans have yet to solve.

However, an unavoidable reality is that these models still perform poorly on multi-step, long-context accurate-calculation tasks.

To make up for this shortcoming, there are two mainstream solutions in the industry now.

One is tool invocation: the model generates scripts, and an external sandboxed interpreter executes them and returns the results.

The other is agent scheduling: an external state machine splits the computation into pieces and repeatedly calls the model to process each piece of context.

But in essence, both methods give the model an "external plug-in", relying on outside computing power.

The autoregressive decoding of the standard Transformer makes this problem even worse:

For each Token generated, the model must perform an attention scan over the entire historical sequence, so the computational cost grows linearly with sequence length, making long-trajectory accurate calculation infeasible.

The new research by the Percepta team breaks away from the external-plug-in idea and directly turns the Transformer into a computer.

First, they implemented a modern RAM machine and a WebAssembly interpreter in the Transformer weights.

WebAssembly can be understood as a fast, stable low-level machine-instruction format; code written in languages like C and C++ can be compiled into it.

Having this interpreter means that any standardized program code can be compiled into a Token instruction sequence recognizable by the model.

For example, to calculate 3 + 5, the model first writes the corresponding program as a short instruction sequence.

Then it switches to the fast-decoding mode, runs this program step by step inside the Transformer, and outputs the execution process line by line as a series of tokens.
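As an illustrative sketch (not the paper's actual token encoding), a tiny stack machine in the spirit of WebAssembly shows how "3 + 5" could unfold as a step-by-step trace, with each executed instruction emitted as a log line:

```python
# Hypothetical sketch: a minimal Wasm-style stack machine that emits
# one trace line per executed instruction, mimicking the model's
# line-by-line token output of the execution process.

def run(program):
    stack, trace = [], []
    for op in program:
        if op.startswith("i32.const"):
            stack.append(int(op.split()[1]))   # push a constant
        elif op == "i32.add":
            b, a = stack.pop(), stack.pop()    # pop two operands
            stack.append(a + b)                # push their sum
        trace.append(f"{op:<14} stack={stack}")
    return stack[-1], trace

result, trace = run(["i32.const 3", "i32.const 5", "i32.add"])
print("\n".join(trace))
print("result:", result)   # → result: 8
```

The instruction names mirror WebAssembly's text format, but the real interpreter embedded in the weights covers the full Wasm instruction set rather than this two-opcode toy.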

The calculation result is directly generated in the model's Token output stream, without waiting for the result from an external tool, and the whole process is transparent.

This transparency turns the model's calculation from an externally dependent black box into a white box, making the computation verifiable.

With a computer built in, the next question is how to make it efficient.

To solve this problem, the team designed a novel 2D attention head.

In the 2D attention head, the Key vector of each historical Token is two-dimensional, and the current step's Query vector can be regarded as a direction in the 2D plane.

The core attention operation, finding the Key that best matches the Query, then becomes a convex-hull extreme-point query from computational geometry: finding the point on the 2D convex hull that lies farthest along the Query direction.

With a convex-hull data structure, the model can dynamically maintain the convex hull of historical Keys during token generation; each attention query then only needs to examine points on the hull.

This reduces the computational complexity from O(n) to O(log n).
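A minimal stdlib-Python sketch (not the paper's HullKVCache) of why this works: for 2D keys, maximizing the dot product with any query direction over all keys gives the same answer as maximizing over the convex hull's vertices alone, so the attention argmax never needs interior points.

```python
# Sketch: with 2-D keys, the attention argmax  max_k <q, k>  is a
# linear function over the keys, so its maximum is always attained
# at a vertex of the keys' convex hull.

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices (CCW)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

keys = [(1, 1), (2, 5), (4, 2), (5, 5), (3, 3), (0, 4), (6, 1)]
hull = convex_hull(keys)   # interior keys like (3, 3) drop out

for q in [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (-1.0, 0.5)]:
    score = lambda p: q[0]*p[0] + q[1]*p[1]
    # Scanning only hull vertices finds the same best match.
    assert max(map(score, keys)) == max(map(score, hull))
```

The full design additionally keeps the hull in a balanced structure so each extreme-point query and each insertion costs O(log n); the sketch above only demonstrates the geometric equivalence that makes that possible.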

The research team designed HullKVCache based on this principle.

This cache achieved a throughput of 31,037 Tokens per second on an ordinary CPU. It only took 1.3 seconds to complete a sequence of about 9,000 instructions, with an efficiency nearly 200 times higher than that of the traditional KV cache.

Moreover, this design is built entirely on the standard PyTorch Transformer: it requires no custom kernels or sparse masks, and can be implemented simply by configuring the attention-head dimensions and head count.

100% accurate solution to the hardest Sudoku

The team selected two typical long - range accurate calculation tasks to verify this method.

These two tasks are 10×10 minimum-cost perfect matching and Arto Inkala's well-known "world's hardest Sudoku".

In the 10×10 minimum - cost perfect matching task, the model internally executes the Hungarian algorithm and generates the calculation trajectory in an autoregressive manner throughout the process.

From initial row assignment and Dijkstra-based shortest-path search to dual-variable updates and augmenting-path search, each step's computation and cost accumulation are clearly recorded, and the optimal matching is finally solved exactly.

The whole process is completed on the CPU, with a Token generation speed of 33,583 Tokens per second and an instruction output efficiency of 7,301 lines per second.
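The model runs the Hungarian algorithm internally; as a hedged reference for what the task itself asks, here is a brute-force solver (feasible only for tiny instances, not the algorithm the model executes) for the same objective: assign each row to a distinct column so the total cost is minimal.

```python
# Brute-force minimum-cost perfect matching over all permutations.
# Illustrates the objective only; O(n!) time, so it is usable for
# tiny matrices, unlike the Hungarian algorithm's O(n^3).
from itertools import permutations

def min_cost_perfect_matching(cost):
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)))
    return sum(cost[r][c] for r, c in enumerate(best)), list(best)

cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
total, assignment = min_cost_perfect_matching(cost)
print(total, assignment)  # → 5 [1, 0, 2]
```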

In the Sudoku experiment, on the Arto Inkala puzzle with only 21 clues, the model internally executes a fully correct, compiled Sudoku solver.

The solver first fills 21 cells through constraint propagation and then enters the search stage, trying possible number assignments one by one and backtracking immediately when a contradiction is encountered.

Each attempt, verification, consistency check, contradiction detection and backtracking step is autoregressively generated and output in the form of readable log lines and Token trajectories.

Finally, a 100% accurate solution is achieved within 3 minutes.
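The solve procedure described above (constraint propagation, then search with backtracking) can be sketched in plain Python. The grid below is the one widely circulated as Inkala's 21-clue puzzle; treat the exact digits, and this solver's details, as illustrative assumptions rather than the paper's compiled program.

```python
# Hedged sketch of a propagation + backtracking Sudoku solver.
# Board is an 81-char string, row-major, "0" for empty cells.
# GRID is the grid commonly attributed to Arto Inkala (assumption).

GRID = ("800000000" "003600000" "070090200"
        "050007000" "000045700" "000100030"
        "001000068" "008500010" "090000400")

def candidates(board, i):
    """Digits still legal for empty cell i (row, column, 3x3 box)."""
    r, c = divmod(i, 9)
    seen = {board[r*9 + k] for k in range(9)}
    seen |= {board[k*9 + c] for k in range(9)}
    br, bc = 3*(r//3), 3*(c//3)
    seen |= {board[(br+dr)*9 + bc+dc] for dr in range(3) for dc in range(3)}
    return [d for d in "123456789" if d not in seen]

def propagate(board):
    """Fill forced cells (single candidate) until a fixed point."""
    changed = True
    while changed:
        changed = False
        for i in range(81):
            if board[i] != "0":
                continue
            cs = candidates(board, i)
            if not cs:
                return None          # contradiction detected
            if len(cs) == 1:
                board = board[:i] + cs[0] + board[i+1:]
                changed = True
    return board

def solve(board):
    board = propagate(board)
    if board is None:
        return None                  # backtrack on contradiction
    if "0" not in board:
        return board
    # Branch on the empty cell with the fewest candidates.
    _, i = min((len(candidates(board, j)), j)
               for j in range(81) if board[j] == "0")
    for d in candidates(board, i):   # try, check, backtrack
        result = solve(board[:i] + d + board[i+1:])
        if result:
            return result
    return None

solution = solve(GRID)
print("solved:", solution is not None and "0" not in solution)
```

The actual solver runs as a compiled Wasm program inside the Transformer, with every attempt and backtrack emitted as log-line tokens; this sketch only mirrors its control flow.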

This work is led by Christos Tzamos and jointly completed with other researchers at Percepta.

Christos Tzamos holds a Ph.D. from MIT and is currently an associate professor of computer science at the University of Athens and a founding researcher at Percepta.

Percepta is an AI transformation company backed by General Catalyst; its team includes researchers from Meta FAIR, MIT, and Google.

Reference links:

[1]https://x.com/ChristosTzamos/status/2031845134577406426?s=20

[2]https://www.percepta.ai/blog/can-llms-be-computers

This article is from the WeChat official account "QbitAI". Author: Wen Le. Republished by 36Kr with permission.