Beihang University's CASE framework continuously injects new knowledge into large models: no forgetting after a thousand edits, with less than 1 MB of additional parameters
"Starbucks has a new CEO", "The latest scientific research results are released"...
When large language models (LLMs) need to continuously absorb new knowledge like this, repeated updates tend to push them into one of two dilemmas: either parameter-update conflicts make them forget previously edited content, or they attach large numbers of additional parameters to avoid forgetting, occupying a large amount of computing resources.
The CASE framework newly proposed by a Beihang University team offers a solution: score each edit, store conflicting knowledge separately while letting non-conflicting knowledge share space, and adjust only the "key neurons" most sensitive to the current knowledge so that irrelevant parameters are not disturbed.
This method tackles the core pain point of the "lifelong model editing" task for large language models. The paper, "CASE: Conflict-assessed Knowledge-sensitive Neuron Tuning for Lifelong Model Editing", has been accepted at the top-tier international conference WWW 2026 (The ACM Web Conference 2026).
Experiments show that after 1,000 consecutive knowledge edits on an LLM, CASE improves average accuracy by nearly 10% over the best existing method while remaining parameter-efficient, adding less than 1 MB of parameters.
The "dilemma" of lifelong editing: why do existing methods forget after repeated model updates?
The "knowledge aging" and "fact hallucination" of large models are nothing new. The goal of "lifelong model editing" is even more demanding: to let LLMs continuously learn or correct knowledge the way humans do, without losing previously edited knowledge and without interfering with unrelated abilities.
Existing mainstream methods have never escaped two problems:
"Blindly adding parameters": to fully retain pre-trained knowledge, existing large-model editing methods usually rely on additional parameters for knowledge updates. During multi-batch lifelong editing, existing methods either add new parameter sub-spaces without limit, one per fixed batch, occupying large amounts of extra computing resources; or they cram a large amount of knowledge into the same space without checking whether those updates conflict with one another, leading to "catastrophic forgetting".
"Indiscriminately adjusting parameters": when updating knowledge in each batch, existing methods locate knowledge-related parameters only at the layer-wise granularity, and therefore update all neurons in a layer indiscriminately for different pieces of knowledge. The gradients of the "key neurons" that should be adjusted are diluted, while gradient conflicts between different knowledge on locally irrelevant neurons gradually accumulate, so forgetting grows more severe as the number of edits increases.
The CASE team points out that the root cause of both problems is that existing methods never quantify the "editing conflict" between pieces of knowledge: they neither compute whether two knowledge updates contradict each other nor identify which neurons should actually be adjusted.
Core breakthrough: breaking the deadlock with the twin modules of "conflict quantification" and "sensitive tuning"
The key to the CASE framework is adding a "conflict-assessment brain" and a "precise tuning tool" to lifelong editing. The two core components work together to resolve global and local conflicts:
1. CAA module: Score editing conflicts and reasonably allocate parameter space
The core of the Conflict-Assessed Editing Allocation (CAA) module is "quantify conflicts and allocate as needed": for each new piece of knowledge to be edited, CASE draws on gradient theory from multi-task learning and uses the gradient direction to represent how that knowledge would update the model. It first computes whether the new knowledge conflicts with the existing parameter sub-spaces, then decides whether to share a space or create a new one.
How is this done? The team designed two key indicators that measure, relative to the original model, the update directions of the new knowledge (x_t, y_t) and of the existing parameter sub-spaces:
The update direction of the parameter sub-space, E_{t-1}^i: measures how far the existing i-th sub-space has deviated from its initial weights after t-1 edits, reflecting the knowledge "remembered" in that space; it is obtained as the difference between the sub-space's current parameter matrix ΔW_{t-1}^i and the model's initial sub-space ΔW_0^0.
Editing gradient G_t: the loss-gradient matrix of the new knowledge (x_t, y_t) with respect to the model's initial sub-space, representing the direction and magnitude of the update that the new knowledge would apply to the model.
Then the "editing conflict" is scored via the cosine similarity c_t^i = cos(G_t, E_{t-1}^i), and the sub-space is allocated according to the following rules:
If c_t^i ≥ 0: the new knowledge is compatible with the existing knowledge in the sub-space, and the space is shared directly to avoid sub-space fragmentation;
If c_t^i < 0: the two conflict, and a new sub-space is created for isolation, preventing the old knowledge from being "washed away".
This design addresses the problem of "blindly dividing spaces" at its root: conflicting knowledge is never crowded together, the number of sub-spaces stays under control, and routing at inference time naturally becomes much easier.
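The allocation rule above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: the function and variable names are invented, and where several sub-spaces are compatible it simply picks the most similar one, which the paper's actual routing may handle differently.

```python
import torch
import torch.nn.functional as F

def allocate_subspace(grad_t, subspaces, w_init):
    """CAA-style allocation sketch (names and tie-breaking are illustrative).

    grad_t   : loss-gradient matrix G_t of the new knowledge w.r.t. the
               initial sub-space weights.
    subspaces: list of current sub-space parameter matrices W_{t-1}^i.
    w_init   : initial sub-space weights of the model.
    Returns the index of a compatible sub-space to share, or -1 to signal
    that a new sub-space should be created.
    """
    g = grad_t.flatten()
    best_idx, best_sim = -1, -1.0
    for i, w in enumerate(subspaces):
        drift = (w - w_init).flatten()  # update direction E_{t-1}^i
        sim = F.cosine_similarity(g, drift, dim=0).item()  # c_t^i
        if sim >= 0 and sim > best_sim:  # compatible: candidate for sharing
            best_idx, best_sim = i, sim
    return best_idx  # -1 means every existing space conflicts: isolate
```

A gradient pointing the same way as a sub-space's drift shares that space; a gradient opposing every existing drift triggers a fresh sub-space.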
2. KNT strategy: Only adjust "key neurons" to eliminate local conflicts
The Knowledge-sensitive Neuron Tuning (KNT) strategy focuses on "precise tuning": instead of updating all sub-space parameters, it finds only the neurons "most sensitive" to the current knowledge, refining knowledge localization from layer-wise down to neuron-wise and avoiding the parameter-space instability caused by irrelevant updates.
The team uses the Fisher information matrix (FIM) to measure neuron sensitivity: the higher a neuron's Fisher value, the more a small change to it affects the model's predictions, marking it as a "key node" for the current knowledge. For efficiency, they approximate the FIM by its diagonal (greatly reducing computational complexity), then set a dynamic threshold from the entropy of the gradient distribution to generate the "sensitive-neuron mask" M_t, allowing only highly sensitive neurons to participate in the update.
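A rough sketch of the masking idea follows. The diagonal-Fisher approximation via squared gradients is standard, but the exact mapping from gradient entropy to a threshold is an assumption of this sketch; the paper's rule may differ.

```python
import torch

def sensitive_neuron_mask(grad):
    """KNT-style mask sketch (entropy-to-threshold mapping is assumed).

    Approximates the diagonal Fisher information by the squared gradient,
    then keeps a fraction of neurons that grows with the entropy of the
    normalized sensitivity distribution: spread-out sensitivity keeps
    more neurons, peaked sensitivity keeps fewer.
    """
    fisher = grad.pow(2)                          # diagonal FIM approximation
    p = fisher / fisher.sum().clamp_min(1e-12)    # normalize to a distribution
    entropy = -(p * p.clamp_min(1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(grad.numel())))
    keep_frac = (entropy / max_entropy).item()    # assumed dynamic rule
    k = max(1, int(keep_frac * grad.numel()))
    thresh = fisher.flatten().topk(k).values.min()
    return (fisher >= thresh).float()             # sensitive-neuron mask M_t
```

With a uniform gradient the mask keeps every neuron; with one dominant neuron it keeps only that one, which is the behavior the strategy is after.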
In addition, KNT adds a knowledge-activation regularizer: the activation values of historical knowledge are quantized for storage (float32 converted to int8, cutting storage volume by 75%), and during updates a KL divergence constrains the gap between the new and historical activations, ensuring that old knowledge "does not drift" after tuning.
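The regularizer can be sketched as below. The per-tensor scale quantization and the softmax-then-KL formulation are assumptions of this illustration, not details confirmed by the article.

```python
import torch
import torch.nn.functional as F

def activation_kl_loss(new_acts, stored_int8, scale):
    """Sketch of KNT's activation regularizer (quantization scheme assumed).

    Historical activations are stored as int8 plus a float scale
    (75% smaller than float32). At edit time they are dequantized and a
    KL divergence penalizes drift of the new activations away from them.
    """
    old = stored_int8.float() * scale        # dequantize historical activations
    log_p_old = F.log_softmax(old, dim=-1)
    log_p_new = F.log_softmax(new_acts, dim=-1)
    # KL(old || new): zero when the new activations match the stored ones
    return F.kl_div(log_p_new, log_p_old, log_target=True,
                    reduction="batchmean")
```

Adding this term to the editing loss pulls each update back toward the stored activation profile of previously edited knowledge.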
One could say that full fine-tuning "reshapes the cognition" of the model, while KNT "precisely tunes" only the key neurons: it fixes the problem correctly without disrupting the overall rhythm.
Experiments: 10% higher accuracy after 1,000 edits, and compatibility with multiple models
To verify CASE's effectiveness, the team ran comparative experiments on two core tasks. The base models include LLaMA2-7B, Qwen2.5-7B, and LLaMA3-8B-Instruct; the baselines cover mainstream lifelong-editing methods such as GRACE, WISE, and MEMIT.
1. Question-answering task (ZsRE dataset): no drop-off after 1,000 edits
In the ZsRE lifelong knowledge editing task that requires continuous updates of entity relationships:
After 100 edits, CASE's editing accuracy on LLaMA2-7B is 5 percentage points higher than the second-best method, and its locality (the preservation rate of irrelevant knowledge) reaches 100%;
After 1,000 edits, the accuracy of most existing methods drops sharply (WISE, for example, falls from 90% to 77%), while CASE still maintains 95% accuracy, 10% higher than the second-best method and only 3% lower than its 100-edit result, coming close to "no memory loss after a thousand edits".
Notably, although GRACE maintains high accuracy, its generalization is extremely poor (only 26%); it can only memorize entity relations. CASE's generalization reaches 82%, and it can handle similar unseen questions.
2. Hallucination correction (SelfCheckGPT dataset): Perplexity is reduced by 60%
In the task of correcting the model's "nonsense", CASE performs more prominently:
On LLaMA2-7B, after 1,000 edits, CASE's perplexity (a measure of factual consistency, lower is better) drops from 3.12 to 1.22, 60% lower than the second-best method;
On Qwen2.5-7B, the perplexity of other methods soars as conflicts accumulate, while CASE is the only method that stably maintains a low perplexity.
3. Efficiency advantage: Fewer parameters and faster inference
CASE's parameter efficiency far exceeds that of comparable methods: it adds less than 1 MB of parameters (WISE requires 86 MB), and its per-iteration inference time is only 10.72 seconds, almost identical to the unedited model, which means it can be deployed easily in real-world scenarios.
Analysis experiment: The stability of CASE under different settings
The team also tested CASE's stability under different parameter settings. Overall, CASE maintains stable editing performance across different hyperparameter ranges and adapts to scenario requirements without complex tuning.
As the partial experimental samples below show, CASE fails only in a very small number of specific cases.