
Beyond Anthropic's mind-reading trick, the black box of large models has welcomed a real forensic expert.

Friends of 36Kr · 2026-05-11 11:30
VPD is not an isolated paper. It is a signal flare in the broader movement toward the "scientification of AI".

If you've been following research on the interpretability of large models over the past two years, you'll have noticed one thing: in this field, Anthropic has basically been the one defining what "progress" means.

From the "Toy Models of Superposition" in 2023, to the "Golden Gate Bridge Experiment" in 2024, to the research on emotional concepts in April 2026, and most recently, the NLA (Natural Language Autoencoder) launched by Anthropic, which allows the model to directly explain in human language what's going on in its "mind". Anthropic's methodology has been spreading as an industry standard.

However, during the same period when Anthropic was intensively releasing these new tools, a small company called Goodfire published a paper in late April titled "Interpreting Language Model Parameters".

Tom McGrath, the founder of Goodfire, is a former member of the interpretability teams at Anthropic and DeepMind. After leaving these teams, he took a different path.

In his view, the core of interpretability doesn't lie in dissecting the activations generated during the model's operation, but in dissecting the model's weights themselves.

In this paper, they used a method called VPD (Verified Parameter Decomposition) to decompose a small 67M-parameter language model into tens of thousands of minimal computational units, each of which can be individually named and modified.

In this article, let's discuss the "methodological debate" triggered in the field of interpretability and the bigger question behind it.

How exactly should we open the black box of large models?

01

The Three Stages of Interpretability

To understand what Goodfire and the SAE camp are arguing about, we first need to clarify the technical routes in the field of interpretability. Roughly speaking, there have been three technical routes in the past two years, which can be seen as three stages.

The First Route: Finding a Useful Direction

The oldest and most straightforward route is Probing / Steering Vectors: linear probes and steering vectors. It emerged around 2020.

Take Anthropic's paper on emotional vectors as an example. They asked Claude to write 100 short stories on the theme of "frustration" and a batch of neutral short stories unrelated to emotion. They fed both sets into the model and extracted the activation vectors from a certain layer, then averaged the activations of the "frustration" set to get a vector A and the activations of the neutral set to get a vector B. The direction A minus B is taken to be the direction associated with "frustration" inside the model.

The whole process is nothing more than averages and differences: the gap between how the two sets of stories register in the activations is treated as the direction corresponding to the concept of "frustration".

After obtaining this direction, Anthropic directly added it to the activations during the model's inference. Adding it in the positive direction makes the model more "frustrated", while adding it in the opposite direction suppresses this emotion. Suppressing the "frustration" direction reduces "reward hacking" behavior; suppressing the "craving for approval" direction makes the model less obsequious.
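The whole recipe fits in a few lines. Below is a minimal sketch under our own naming (the hook-based steering is illustrative, not Anthropic's actual code): average the activations of each set, subtract, and add the resulting direction back in at inference time.

```python
# A minimal sketch of the recipe above; names and the hook-based steering
# are illustrative, not Anthropic's actual code.
import torch

def steering_vector(frustration_acts: torch.Tensor,
                    neutral_acts: torch.Tensor) -> torch.Tensor:
    """Each input: (num_stories, hidden_dim) activations from one layer.
    The 'frustration' direction is just mean(A) - mean(B)."""
    return frustration_acts.mean(dim=0) - neutral_acts.mean(dim=0)

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Forward hook that adds the direction into the layer's output.
    strength > 0 amplifies the concept; strength < 0 suppresses it."""
    def hook(module, inputs, output):
        return output + strength * direction  # real layers may return tuples
    return hook

# Hypothetical usage:
# v = steering_vector(frustration_acts, neutral_acts)
# model.layers[20].register_forward_hook(make_steering_hook(v, -4.0))
```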

The key characteristic of this route is that it doesn't claim to have found the real internal structure of the model; it just finds some useful directions.

The Second Route: Creating a Dictionary to Encompass All Concepts

The second route is called SAE (Sparse Autoencoder). It has been the main focus of interpretability in the past two years. Iconic papers such as Anthropic's Golden Gate Bridge Claude (which found the specific activation features representing the Golden Gate Bridge in Claude 3 using the SAE method), "Towards Monosemanticity", and "Scaling Monosemanticity" all use this method.

SAE is a more precise tool. It's not satisfied with finding just a few useful directions; it wants to sort out all the concepts in the activation space at once.

The model generates a 768-dimensional activation vector at a certain layer, which contains dozens of concepts (nouns, the financial field, capitalized initials, etc.) all jumbled together and difficult to distinguish. SAE prepares a huge dictionary of concepts (tens of thousands to hundreds of thousands of entries, each corresponding to a direction in the activation space), and then says, "Given any activation, I can find the fewest entries to approximate and restore it." The selected entries are the concepts of the current activation.
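In code, the idea is compact. Here is a minimal sketch of a standard SAE, with placeholder sizes and a placeholder sparsity weight rather than anyone's real hyperparameters:

```python
# A minimal sketch of a standard sparse autoencoder; dictionary size and
# the L1 weight are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, dict_size: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)  # which entries fire, and how hard
        self.decoder = nn.Linear(dict_size, d_model)  # each column: one concept direction

    def forward(self, activation: torch.Tensor):
        codes = torch.relu(self.encoder(activation))  # sparse, non-negative selection
        reconstruction = self.decoder(codes)
        return reconstruction, codes

def sae_loss(x, reconstruction, codes, l1_coeff: float = 1e-3):
    # Fidelity ("restore the activation") plus sparsity ("use the fewest entries").
    return ((x - reconstruction) ** 2).mean() + l1_coeff * codes.abs().mean()
```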

This method is highly productive. Anthropic used it to find tens of millions of "features" in Claude 3, each of which can be named, such as the "Golden Gate Bridge feature", the "Python code feature", and the "obsequiousness feature". The Golden Gate Bridge Claude demo, in which the model becomes fixated on its identity as the bridge, works by using SAE to find the "Golden Gate Bridge" direction and then massively amplifying it during inference. The recent OpenAI "goblin" incident was also uncovered with SAE.

However, the most ambitious thing the SAE camp has done in the past two years is not "finding features", but drawing circuit diagrams.

Circuit analysis means finding the causal relationships between multiple features dug out by SAE and then stringing them together to draw an "information flow diagram". Anthropic has a well-known example. When the model processes "John and Mary went to the store. John gave a drink to" and predicts the next word "Mary", Anthropic used SAE to trace the entire information flow inside, determining that certain layers are identifying names, other layers are tracking who has been mentioned, and the last few layers are performing the "choose the other" operation. These three segments of features strung together form a circuit.

In terms of ambition, SAE circuits have gone beyond merely naming features; they try to answer how the model computes its results. Anthropic's most in-depth circuit paper, published in March 2025, is titled "On the Biology of a Large Language Model", a title that makes plain their ambition to write a biology textbook for large models.

On May 7, 2026, Anthropic launched a new tool called NLA, which can be understood as "SAE with a different bottleneck material". SAE's bottleneck is a dictionary of sparse features, and researchers still have to guess what each feature in the dictionary does; NLA replaces the bottleneck with natural language itself. They trained a translation layer that lets the model say, in human language, "what this activation is thinking", plus a reverse module that restores that human language back into an activation. This solves SAE's pain point that features are hard to understand, but its reconstruction fidelity falls short of SAE's.
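Since NLA has not been described at the code level, the following is only a schematic of the round trip as described above, with both sub-models left as hypothetical placeholders:

```python
# A schematic of the round trip: activation -> text -> activation. Every
# name here is our own scaffolding, not Anthropic's implementation.
import torch
import torch.nn as nn

class NaturalLanguageAutoencoder(nn.Module):
    def __init__(self, translator: nn.Module, restorer: nn.Module):
        super().__init__()
        self.translator = translator  # activation -> explanation tokens
        self.restorer = restorer      # explanation tokens -> activation

    def forward(self, activation: torch.Tensor):
        explanation = self.translator(activation)    # the human-readable bottleneck
        reconstruction = self.restorer(explanation)  # the lossy way back
        return explanation, reconstruction

# Training pushes `reconstruction` toward `activation`, forcing the text
# bottleneck to carry the information: readable by construction, but, per
# the article, a weaker reconstructor than an SAE.
```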

The Third Route: Dissecting the Machine Itself Instead of Studying Runtime

The above two routes are activation-based routes. They focus on the activation vectors generated by the model during the output process. The first route answers "what concepts are activated", and the second route goes a step further, not only finding all the concepts but also drawing how the concepts are strung together to complete the calculation.

But why take a third, more difficult route?

Because the circuit diagram drawn by SAE has an unstable foundation.

Here we need to introduce a problem the SAE camp identified within its own method long ago, called "feature splitting". Anthropic pointed it out as early as 2023.

When you train an SAE with a dictionary of 4096 entries, it finds 3800 active features, including one corresponding to "cat". That's great. But what if you expand the dictionary to 16384 entries? The "cat" feature disappears. Instead, there are a dozen small, context-bound features like "white cat", "black cat", "cartoon cat", "feline mentioned in an academic context", and "cat on the sofa". If you expand it to 65536 entries? This fragmentation will continue. This phenomenon means that the "number of features" you find is determined by the size of your dictionary, not by the model.
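You could observe this directly: train SAEs of increasing width on the same activations and count how many distinct entries ever fire on the same "cat" inputs. A sketch, reusing the SparseAutoencoder above, with the trainer and the activation tensors left as hypothetical stand-ins:

```python
# A sketch of how feature splitting shows up in practice. `train_sae` and
# the activation tensors are hypothetical stand-ins.
import torch

def entries_firing(sae, acts: torch.Tensor) -> int:
    """How many dictionary entries ever fire on this batch of activations."""
    _, codes = sae(acts)
    return (codes > 0).any(dim=0).sum().item()

# for dict_size in (4096, 16384, 65536):
#     sae = train_sae(activations, dict_size=dict_size)
#     print(dict_size, entries_firing(sae, cat_activations))
# The count keeps climbing with dict_size: the dictionary, not the model,
# decides how many "features" exist.
```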

Let's use a more relatable analogy. Ask an accountant to categorize your bills and give him 4 categories, and he'll tell you "food, transportation, shopping, entertainment". Give him 100 categories, and he'll tell you "Monday lunch, Tuesday lunch, Wednesday lunch, subway, taxi, ride-hailing...". Give him 1000 categories, and he'll keep splitting. You wouldn't conclude that he has discovered your real spending structure; he's simply being led by the number of categories you handed him.

Feature splitting is particularly harmful to circuit analysis. A circuit is built on features, and each node in the diagram you draw is an SAE feature. If the features themselves fragment, recombine, and drift depending on the dictionary size, the entire circuit diagram will also become unstable. If you run the analysis again with a different dictionary, you may get a completely different circuit diagram, but the model itself hasn't changed; only the "ruler" you used for measurement has changed.

This is what Goodfire really wants to challenge. The claim is not that "the features found by SAE are meaningless", but that "SAE features, as the basic building blocks of interpretability, lack stability and cannot support the promise of a 'real circuit of the model'". The logic of VPD is to use a different kind of "brick": instead of building from activations, it builds from weights.

We can use a biological analogy to directly feel the difference between these two routes. What the SAE camp does is like functional magnetic resonance imaging (fMRI) combined with electroencephalography (EEG). You show pictures to the subject and observe which brain regions light up and how the activity spreads. You can draw a functional map, but what you see is the brain's activity, not its physical structure. If one day you want to know "where a specific neuron is connected" or "how a nerve fiber runs", fMRI can't give you the answer. VPD takes the route of neuroanatomy, directly opening the brain to see how the nerve fibers are arranged and how the synapses are connected. It's laborious and has limited resolution, but what you see is the physical structure itself.

These two approaches complement each other rather than replace each other. The problem is that over the past two years, the SAE camp has gradually elevated its claims to the level of "we can do neuroanatomy". Goodfire's paper is a gentle but firm correction: what you have been doing is functional MRI; real anatomy looks like this.

02

Two Failures of Reverse Sculpting and the Breakthrough of VPD

To understand the breakthroughs that followed, we must first understand the standard pipeline for large-model interpretability. Translating a black box of hundreds of billions of parameters into a diagram humans can understand is, in essence, the operation of an "extractor", and this pipeline consists of three core components.

The first component is the end product: the dictionary. A large model in its original state is a chaotic mass of numbers (its weight matrices); the dictionary is the concept lookup table we obtain after forcibly disassembling it. Each entry in the dictionary corresponds to an extremely tiny, indivisible physical gear inside the model (a rank-1 component).

In SAE, the entries in this dictionary are concepts. But in VPD, most of the entries in this dictionary are not clear concepts like "cat" or "red", but rather cold mechanical actions. For example, "activate a placeholder when encountering a prepositional function word". VPD initially aimed to find the operating mechanism of the model's parameters, rather than concepts, so the dictionary it produces is more like a "mechanical operation manual" than a "vocabulary dictionary".
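To make "gear" concrete: a rank-1 component is one read direction paired with one write direction inside a weight matrix. The sketch below uses a plain SVD purely for illustration; VPD learns its decomposition rather than taking the SVD, so this shows the shape of the object, not Goodfire's method.

```python
# What a "gear" is, concretely: a rank-1 slice of a weight matrix,
# W = sum_i S[i] * outer(U[:, i], Vh[i]). SVD is for illustration only.
import torch

W = torch.randn(64, 64)  # one small weight matrix
U, S, Vh = torch.linalg.svd(W)
gears = [S[i] * torch.outer(U[:, i], Vh[i]) for i in range(64)]

# Each gear reads the input along one direction (a row of Vh) and writes
# along another (a column of U): indivisible, and summing back is exact.
assert torch.allclose(W, sum(gears), atol=1e-3)
```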

The second component is the core action, decomposition and activation. How do we turn the chaotic matrix into a clear dictionary? It must be done during the data flow.

We let a massive amount of data flow through the large model, and at the same time, use an external algorithm module to forcibly break down the model's weights into tens of thousands of small gears. When text passes through, if a certain word (such as "orange cat") makes one of these gears "light up", this is called activation. We infer what a gear represents by monitoring when and how strongly it lights up.
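A hedged sketch of what "lighting up" means for a single gear, with made-up tensors: a rank-1 gear reads its input along one direction, and its activity on a token is simply that read-out.

```python
# A rank-1 gear outer(u, v) reads its input along v and writes along u,
# so its activity at each token is the read-out along v. (The write
# direction u is omitted here because activity only needs v.)
import torch

def gear_activity(v: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """x: (seq_len, d_in) activations for one text flowing through the layer.
    Returns (seq_len,): how strongly this gear fires at each token."""
    return x @ v

x = torch.randn(12, 64)     # a 12-token text, hypothetical 64-dim layer
v = torch.randn(64)         # the gear's read direction
print(gear_activity(v, x))  # peaks mark the tokens this gear fires on
```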

The third component is the quality control mechanism, ablation. How can we ensure the fidelity of the disassembled dictionary? We must conduct extremely violent destructive tests.

The essence of ablation is to "pull the plug". When the model is processing "orange cat" and about to predict the next word, we forcibly remove the gear in the dictionary that is suspected to represent "cat" (set the corresponding weight to zero). If the model immediately becomes "stupid" and can no longer output words related to cats, it proves that this gear is indeed a physical necessity for the model to understand "cat".
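In miniature, with hypothetical model and loss helpers, the test looks like this: subtract the gear's rank-1 slice from the layer weight and compare the loss on the relevant text before and after.

```python
# "Pulling the plug" in miniature. `model` and `loss_on` are hypothetical.
import torch

def ablate_gear(weight: torch.Tensor, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Return the weight with one rank-1 gear zeroed out of it."""
    return weight - torch.outer(u, v)

# with torch.no_grad():
#     layer = model.layers[3].mlp
#     before = loss_on(model, "The orange cat sat on the")
#     layer.weight.copy_(ablate_gear(layer.weight, u_cat, v_cat))
#     after = loss_on(model, "The orange cat sat on the")
# # A large jump from `before` to `after` is the evidence that this gear
# # is physically necessary for "cat".
```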

Only by understanding this framework can we understand the desperate situation the parameter route has faced in the past.

In 2024, APD (Attribution-based Parameter Decomposition) failed because its quality-control signals were too noisy. APD didn't even use ablation; it used "attribution", calculating contributions. It tried to work out how much credit a given circuit deserved for the final result by looking at the data flowing through it. But the inside of a large model is too complex: the computed contribution of the same "apple" circuit varies wildly between the contexts "eating an apple" and "Apple phone". The quality-control signal was drowned in noise, and training collapsed.
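For concreteness, the most common form of such an attribution signal is gradient-times-activation. The sketch below shows that generic estimator, not necessarily APD's exact formula:

```python
# Generic gradient-times-activation attribution; not necessarily APD's
# exact estimator.
import torch

def attribution(activity: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """activity: (num_gears,) gear activities the loss depends on.
    Returns each gear's estimated credit for this one context."""
    (grad,) = torch.autograd.grad(loss, activity, retain_graph=True)
    return activity.detach() * grad

# The same gear gets very different credit in "eating an apple" vs
# "Apple phone": exactly the context-dependent noise described above.
```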

In 2025, SPD (Stochastic Parameter Decomposition) finally introduced ablation but stumbled on two details. First, the ablation strategy was too gentle: randomly "pulling a few plugs with eyes closed" as a spot check can't catch hidden circuits that only matter in rare, tricky contexts. Second, like SAE, its penalty only looked at how many components were active, so it inherited the feature-splitting problem as well.

Therefore, the parameter route was once pessimistically considered "absolutely correct in direction but impossible to implement in engineering".

It wasn't until April 2026 that Goodfire released VPD, making two key decisions that SPD didn't get right.

The First Battle: Adversarial Ablation

Suppose there is a core gear inside the machine that is suspected to represent "cat". SPD's random quality control happens to select a very rich context: "I heard a meow, and then an orange little animal jumped onto the table." If we pull out the power plug of the gear representing "cat" at this time, will the machine become stupid? Probably not.

Because in this rich context, in addition to the core gear, a bunch of related gears such as "meow", "orange", "little animal", and "jump" are also lit up. When the "cat" gear is absent, these peripheral gears produce a very tricky "compensation effect". They combine to make up for the missing semantics, allowing the model to still guess the next word. SPD's quality control system sees that the machine still runs smoothly after pulling out this gear, so it reaches the wrong conclusion that this gear is redundant and deletes it.

Since the machine likes to use combinations of other words as a cover, VPD must break this cover.

The core of adversarial ablation is that it doesn't leave things to chance across a large amount of random data. Instead, it uses gradient backpropagation to compute in reverse. Before pulling out a given gear, the algorithm first infers what kind of extremely tricky context would switch off every gear that could play a covering role, so that the large model is forced to rely on this one gear completely and alone.

For example, for the gear representing "cat", the adversarial algorithm may forcibly construct a very dry input like "Feline entity", stripped of all action and color hints. In this extreme combination, covering gears such as "meow" and "orange" can't be triggered, and the core gear under test is completely isolated. At this moment, if we violently pull out its power plug and the machine immediately crashes, we have solid evidence that this gear is an indispensable part of the model's machinery for "cat".
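To make the search concrete, here is a hedged sketch: optimize a free-floating "soft" input so that every covering gear goes quiet while the gear under test keeps firing. The objective is our illustration; Goodfire's actual procedure may differ.

```python
# Adversarial isolation of one gear via gradient descent on the input.
# The objective below is our illustration, not Goodfire's procedure.
import torch

def isolate_gear(v_target, v_others, d_in=64, seq_len=8, steps=200, lr=0.1):
    x = torch.randn(seq_len, d_in, requires_grad=True)  # unconstrained input
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        target_on = (x @ v_target).abs().mean()  # keep the tested gear firing
        covers_on = torch.stack([(x @ v).abs().mean() for v in v_others]).mean()
        loss = covers_on - target_on             # covers off, target on
        opt.zero_grad(); loss.backward(); opt.step()
    return x.detach()  # the stripped-down, "Feline entity"-style context

# A real version would constrain x to realizable token embeddings; ablating
# the target gear on this input and watching the loss spike is the "solid
# evidence" the quality-control step needs.
```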