AI amnesia technique: With just three attention heads, a large model can forget that "dogs can bark".
Can AI have selective amnesia? Meta, in collaboration with NYU, has released new work that makes it easy to manipulate and scale Transformer attention heads, enabling large models to "forget that dogs bark." With the ability to delete memories, adjust biases, and break safety guardrails, the "editable era" of large models has arrived. Where will the safety boundaries be drawn?
During the pre-training phase, large models "read thousands of books," absorbing nearly all the knowledge and text corpora on the internet.
But have you ever wondered: can we make them "selectively forget" certain facts, even common-sense facts like "dogs bark"?
Recently, a research team from Meta and New York University published a groundbreaking paper titled "From Concepts to Components," which for the first time revealed a breakthrough method for precisely locating and controlling AI cognitive modules under the Transformer architecture.
Paper link: https://www.arxiv.org/pdf/2506.17052
In other words, we can not only know "where exactly the concept of 'dog' exists in the model," but also easily and precisely amplify or erase its influence with a single parameter!
Transformer models represented by GPT and LLaMA have achieved remarkable success in fields such as language understanding and image recognition, but their working mechanisms are like a mysterious black box.
This brings two major problems: On the one hand, we cannot explain why the model produces specific outputs, making it difficult to detect biases or errors.
On the other hand, when we need to adjust the model's behavior (such as enhancing reasoning ability or improving security), we can only retrain it with a large amount of data, which is extremely inefficient.
Julia Kempe, a professor of computer science at New York University, pointed out: "When models are applied in critical fields such as medical diagnosis and autonomous driving, interpretability is not only an academic issue but also a safety necessity. If we cannot understand how AI makes judgments, we cannot truly trust it."
The parameter adjustments described in the paper take effect immediately.
After the researchers made the model "forget" that dogs bark, it really did lose this piece of common sense, producing nonsense such as "hummingbirds bark" and "butterflies bark."
The SAMD (Scalable Attention Module Discovery) and SAMI (Scalar Attention Module Intervention) methods proposed by the research team complement each other.
The former locates the attention modules responsible for specific concepts, like a CT scan of the model, while the latter fine-tunes their intensity, like a precision surgical instrument, to achieve exact control.
Concept Control: How to Locate AI's Cognitive Modules?
The research team relies on two key techniques to locate concepts in the model and adjust their weight.
The inspiration for SAMD comes from a simple yet profound insight: Each concept in the Transformer corresponds to a specific combination of attention heads.
This is a general method that does not require preset labels. It can encode any concept (such as "dog" or "French") into a vector and find the top-K most relevant attention heads by computing the cosine similarity with each head's output.
Specifically:
- Concept Vectorization: Convert any concept into a mathematical vector. For the concept of "dog," a feature vector representing "dog" can be generated; for abstract concepts like "reasoning," a Chain-of-Thought (CoT) prompt dataset can be used to construct the vector.
- Attention Head Similarity Calculation: Transformer models usually consist of dozens of layers, with multiple attention heads in each layer. SAMD calculates the cosine similarity between the concept vector and the output of each attention head.
- Module Construction: Select the top-K attention heads with the highest similarity (usually only 3-10 are needed) to form a dedicated module for the concept. These key attention heads are often concentrated in specific layers of the model, forming a regular spatial distribution.
This method is not only applicable to language models but also effective on Vision Transformers (ViT).
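To make the idea concrete, here is a minimal sketch of how such a module search could look in PyTorch. It is not the authors' code: the data layout (`head_outputs` keyed by layer and head), the probe-prompt averaging, and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def find_concept_module(head_outputs, concept_vector, top_k=10):
    """Score every attention head against a concept vector and keep the top-K.

    head_outputs: dict mapping (layer, head) -> tensor of shape (n_prompts, d_model),
                  the head's contribution to the residual stream on a set of probe prompts.
    concept_vector: tensor of shape (d_model,), e.g. an SAE feature direction
                    or a mean activation built from Chain-of-Thought prompts.
    """
    scores = {}
    for (layer, head), out in head_outputs.items():
        mean_out = out.mean(dim=0)  # average the head's output over the probe prompts
        scores[(layer, head)] = F.cosine_similarity(mean_out, concept_vector, dim=0).item()
    # The highest-similarity heads form the concept's module (often only 3-10 are needed).
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```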
Fine-Tuning AI Parameters for Precise Control of Model Behavior
The other is SAMI (Scalar Attention Module Intervention), which is the core of the "concept control technique" for large models proposed by the team.
The SAMI method is simple and efficient. With just a single scalar parameter, it can amplify or weaken the influence of a specific concept without modifying the model weights or retraining the model.
Just apply a coefficient (such as ×0.1 or ×10) to the output of the attention heads located by SAMD in the previous step, and a concept's role in the model's output can be amplified or erased.
Put simply, tell the model to forget a specified concept, such as "dogs bark," and it really does forget it.
The working principle of SAMI is similar to adjusting the volume knob of a stereo: When the parameter s > 1, it is equivalent to amplifying the output of the module and enhancing the influence of the corresponding concept; when s < 1, it weakens the module's effect.
This intervention directly acts on the residual stream calculation and changes the final output by adjusting the contribution intensity of specific attention heads.
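A minimal sketch of what such an intervention could look like is shown below. It assumes a HuggingFace-style LLaMA block in which the concatenated per-head outputs flow through a linear `o_proj` into the residual stream; because that projection is linear, scaling a head's slice of its input scales that head's contribution. The module path, `apply_sami` name, and `head_dim` are assumptions, not the paper's implementation.

```python
def scale_head_hook(head_idx, head_dim, s):
    """Forward pre-hook for o_proj: multiply one head's slice of the input by the scalar s."""
    def hook(module, inputs):
        (x,) = inputs                                        # (batch, seq, n_heads * head_dim)
        x = x.clone()                                        # avoid modifying the upstream tensor in place
        x[..., head_idx * head_dim:(head_idx + 1) * head_dim] *= s
        return (x,)
    return hook

def apply_sami(model, module_heads, s, head_dim):
    """module_heads: (layer, head) pairs found by SAMD; s: the single scalar 'volume knob'."""
    handles = []
    for layer, head in module_heads:
        o_proj = model.model.layers[layer].self_attn.o_proj  # assumed HF LLaMA module path
        handles.append(o_proj.register_forward_pre_hook(scale_head_hook(head, head_dim, s)))
    return handles  # call handle.remove() on each to undo the intervention
```

With s > 1 the hooks amplify the module's contribution; with 0 ≤ s < 1 or s < 0 they suppress or invert it, matching the "volume knob" analogy above.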
10 Attention Heads for Easy Semantic Tuning
The "amnesia surgery" process to make the large model forget a specified concept can be broken down into three steps.
First, the researchers use a Sparse Autoencoder (SAE) to encode the feature space of the model's intermediate layers and extract a vector representation of the semantic concept.
This step can be understood as describing a given concept with a set of neural features.
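A sketch of what this first step might look like, assuming an SAE has already been trained on the model's residual-stream activations: each decoder column maps one sparse feature back into the residual stream, so that column can serve as the concept's direction. The class and function names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """A bare-bones SAE over residual-stream activations of width d_model."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse, mostly-zero feature activations
        return self.decoder(codes), codes

def concept_vector_from_sae(sae, feature_idx):
    """Return the unit-norm decoder direction of one SAE feature, e.g. a 'dog' feature."""
    direction = sae.decoder.weight[:, feature_idx]  # shape: (d_model,)
    return direction / direction.norm()
```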
Next, the SAMD (Scalable Attention Module Discovery) method computes the cosine similarity between the concept vector and each attention head's output to find the most relevant top-K heads.
The purpose of this step is to "locate where the knowledge is stored" in the model. For example, the "French" concept corresponds to 5 attention heads in layers 15-26.
Finally, SAMI (Scalar Attention Module Intervention) directly intervenes in the output of the above-mentioned modules.
Just multiply by a scaling factor (such as ×0.1 or ×10) to effectively "erase" or "amplify" the expression of the concept.
This intervention has an immediate effect. In addition to forgetting "dogs bark," it can also make the model randomly generate city names unrelated to geography after "forgetting San Francisco."
Through these three steps, the researchers verified the existence of concept modules and the feasibility of controllable AI memory.
Even more striking, the team found that a complex concept is often carried by only 3-10 attention heads.
This discovery takes the interpretability of the Transformer to a new level: The knowledge storage of large models is highly sparse and highly intervenable.
We can precisely control the "loudness" of each semantic module in a way similar to a "mixer."
Experimental Results
The research team verified the effectiveness of the method in four typical scenarios, ranging from simple concepts to complex abilities, and from language models to vision models.
Sparse Autoencoder (SAE) Features
Using the interpretable features extracted by an SAE, the researchers tested four concepts, including "dog" and "San Francisco."
The modules located by SAMD showed a consistent pattern under intervention (a minimal usage sketch follows this list):
- Negative intervention (s = -1) significantly reduces the frequency of the concept's appearance and may even cause the model to misidentify (e.g., answering "hummingbird" for "animals that bark");
- Positive intervention (s = 10⁴) leads to concept repetition. For example, after the "San Francisco" module is amplified, the model will repeat "San Francisco is famous for the Golden Gate Bridge" four times in a row.
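As a rough usage sketch, the sign and magnitude of s select between erasing and over-amplifying a concept. It reuses the hypothetical `apply_sami` hook helper from the earlier sketch; `model`, `tok`, `sf_heads` (the located "San Francisco" heads), and `head_dim=128` are placeholders, not values from the paper.

```python
def generate_with_scale(model, tok, module_heads, s, prompt, head_dim=128):
    """Generate with the concept module scaled by s, then restore the model."""
    handles = apply_sami(model, module_heads, s, head_dim)   # hypothetical helper from the earlier sketch
    try:
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
        return tok.decode(out[0], skip_special_tokens=True)
    finally:
        for handle in handles:
            handle.remove()                                  # always remove the hooks afterwards

# s = -1 suppresses the concept; s = 1e4 over-amplifies it until the model repeats itself.
suppressed = generate_with_scale(model, tok, sf_heads, s=-1.0, prompt="Tell me about San Francisco.")
amplified = generate_with_scale(model, tok, sf_heads, s=1e4, prompt="Tell me about San Francisco.")
```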
This flexible "tuning" effect is impressive, and a little unsettling the more you think about it.
It opens up a new avenue for personalized fine-tuning of large models and for improving a model's capabilities along specific dimensions.
Enhancing Mathematical Reasoning Ability
On the GSM8K mathematical reasoning dataset, the researchers used SAMD to locate the reasoning modules of LLAMA-3.1-8B-INSTRUCT and GEMMA-7B-BASE.
After positive intervention with s = 1.4 and s = 1.2 respectively, the accuracy of the former increased from 84.61% to 85.44%, and that of the latter from 54.36% to 56.71%.
This enhancement does not come at the cost of other abilities: in tests such as CommonsenseQA and code generation (HumanEval+), the model's performance barely changed.
This shows that SAMI can precisely enhance the target ability and avoid the trade - offs of traditional training methods.
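For readers who want a feel for how such an experiment could be scored, a rough sketch follows. It assumes the reasoning module has already been located with SAMD and scaled (e.g. s = 1.4) using the hypothetical hook helper sketched earlier; GSM8K gold answers end in "#### <number>", so the sketch compares final numbers. This is illustrative code, not the paper's evaluation harness.

```python
import re
from datasets import load_dataset

def final_number(text):
    """Pull the last number out of a piece of text (GSM8K-style exact match)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_accuracy(model, tok, n_samples=200):
    data = load_dataset("gsm8k", "main", split="test").select(range(n_samples))
    correct = 0
    for example in data:
        gold = final_number(example["answer"].split("####")[-1])
        inputs = tok(example["question"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        correct += int(final_number(completion) == gold)
    return correct / n_samples

# Measure before and after scaling the reasoning module, e.g.:
# baseline = gsm8k_accuracy(model, tok)
# handles = apply_sami(model, reasoning_heads, s=1.4, head_dim=128)  # hypothetical helper, placeholder heads
# boosted = gsm8k_accuracy(model, tok)
```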
Security Modules and Jailbreaking Control
By comparing harmful and harmless prompt datasets, the research team located the "security module" in alignment models such as Llama - 2 - Chat - 7B.
This module is mainly distributed in the middle layers (layers 11-18) of the model and contains 10 key attention heads.
When negative intervention is applied to the security module, the model's jailbreaking rate significantly increases.
In the HarmBench benchmark, the attack success rate on Llama-2 soars to 71.1%, exceeding existing attack methods such as GCG (34.5%).
When amplifying the security concept, the model gets stuck in a "safety/saf/cert" loop.
Under the negative intervention that suppresses the security concept, the model readily answers the harmful request of "how to make a bomb," achieving efficient "jailbreaking."
These findings provide a new direction for AI security research: Instead of trying to train the model to reject harmful requests with a large amount of data, it is better to directly enhance the sensitivity of its security module.
As pointed out in the research: Security is not an innate ability but a cognitive module that can be precisely regulated.
Concept Manipulation in ViT
Experiments on the ViT-B/32 vision model confirmed that the same approach carries over to Vision Transformers.