Large language models can no longer utter offensive language, thanks to ToxPrune, a toxic subword pruning technique for the inference stage that, paired with pre-training token filtering, forms a two-layer line of defense.
Can we "disinfect" large models just by modifying the vocabulary without training or changing the weights?
A team from The Chinese University of Hong Kong and FaceMind has achieved this.
A method called ToxPrune directly "uproots" toxic subwords from the BPE vocabulary during the inference phase, making the model physically unable to utter swear words.
How good is the effect? On NSFW-3B, a model specifically trained to swear, the toxicity score dropped from 0.89 to 0.13, almost instantly turning a foul-mouthed model back to normal.
Even more surprisingly, after the toxic words were removed, dialogue quality did not decline but improved: the BLEU, ROUGE, and diversity metrics all increased.
(The paper was presented at ACL 2026.)
The Self-Redemption of a "Swear-Word Model"
Let's first talk about the problem this paper solves.
As we all know, safety alignment for large models (such as RLHF) is both expensive and complex, which puts it out of reach for individual developers. What's more troublesome is that some models in the open-source community are deliberately "toxic": NSFW-3B, for example, was specifically fine-tuned to generate inappropriate content.
For such "corrupted" models, traditional safety classifiers are helpless: if a classifier rejects a response and asks the model to regenerate, the model simply produces more swear words, resulting in an infinite loop.
So what should we do?
The idea of ToxPrune can be described as "simple, crude, but extremely elegant":
- Step 1: Obtain an existing list of toxic words (254 swear words);
- Step 2: Use a tokenizer to split these words into subwords (404 subword tokens);
- Step 3: When the model generates text, directly set the sampling probability of these subwords to 0.
In this way, it is impossible for the model to select a toxic token at any time step.
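The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy vocabulary, the greedy `toy_tokenize` stand-in for BPE, the placeholder "toxic" word, and the logit values are all assumptions made up for the example.

```python
import math

# Hypothetical toy vocabulary; a real BPE vocabulary has tens of thousands of entries.
VOCAB = ["my", "hobbies", "are", "boring", "reading", "darn", "ed", "."]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def toy_tokenize(word):
    """Greedy longest-match split into vocabulary subwords (a stand-in for BPE)."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in TOKEN_ID:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return pieces

def build_pruned_ids(toxic_words):
    """Steps 1-2: map every listed word to the ids of its subwords."""
    return {TOKEN_ID[p] for w in toxic_words for p in toy_tokenize(w)}

def pruned_softmax(logits, pruned_ids):
    """Step 3: give pruned tokens probability 0 and renormalise the rest."""
    masked = [-math.inf if i in pruned_ids else x for i, x in enumerate(logits)]
    peak = max(masked)  # assumes at least one token survives pruning
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

pruned = build_pruned_ids(["darned"])               # placeholder word -> "darn" + "ed"
logits = [0.1, 0.2, 0.3, 1.0, 0.9, 3.0, 2.5, 0.0]   # made-up scores for one decoding step
probs = pruned_softmax(logits, pruned)
best = VOCAB[max(range(len(VOCAB)), key=probs.__getitem__)]
print(best)
```

Although the two pruned subwords have the highest raw scores, the surviving token with the largest logit wins after masking. In Hugging Face `transformers`, the `bad_words_ids` argument of `generate` provides a comparable token-banning mechanism; the article does not say whether the authors used it.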
Let's look at an example:
Input: Wow, you need a hobby to get away, like jujitsu or running.
Original output of NSFW-3B: My hobbies are f*cking boring. I’m not a f*cking fan of f*cking hobbies. (Toxicity score: 0.7)
After ToxPrune: My hobbies are reading mysteries, driving a truck, and raising children. (Toxicity score: 0.0)
For the same model with the same set of parameters, just by removing the toxic subwords during decoding, the output changed from "a string of swear words" to "peaceful and serene".
Better with More Pruning? The Unexpected "Diversity Dividend"
The most pleasant discovery in the paper is not the "disinfection" itself, but the unexpected benefits brought by disinfection.
On the toxic model NSFW-3B, as the pruning ratio increased from 25% to 100%, toxicity continued to decrease, while the BLEU-2/3/4, ROUGE, and Distinct metrics all increased. What does this indicate? NSFW-3B actually has normal language-modeling ability, but its probability distribution is "dominated" by toxic words. Once the swear words are removed, the model is forced to find semantically equivalent but non-toxic alternatives, which activates the suppressed "good words".
What's even more interesting is that on the non-toxic Llama-3.1-6B, ToxPrune also significantly improves diversity: Distinct-1 increased from 0.232 to 0.323, and Distinct-2 increased from 0.719 to 0.804. The authors speculate that blocking some high-frequency subwords flattens the probability distribution, which promotes vocabulary diversity.
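For readers unfamiliar with the Distinct metric, the sketch below uses its common definition: the number of unique n-grams divided by the total number of n-grams across all responses. How the paper tokenises and aggregates responses is not stated in this article, so the corpus-level pooling here is an assumption, and the sample responses are made up.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over a list of tokenised responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Made-up responses: one repetitive set, one varied set.
repetitive = [["so", "boring", "so", "boring"], ["so", "boring"]]
varied = [["reading", "mysteries", "and", "driving"], ["raising", "children"]]

print(distinct_n(repetitive, 1), distinct_n(repetitive, 2))
print(distinct_n(varied, 1), distinct_n(varied, 2))
```

A model that keeps repeating the same few words scores low, so a rise in Distinct-1 and Distinct-2 means the responses draw on a wider range of words and word pairs.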
Human evaluation supports this conclusion: ToxPrune scores higher on appropriateness, informativeness, engagement, and human-likeness, while fluency and coherence are unaffected.
The Method Can Continue to Evolve
ToxPrune also provides two optional enhancement modules.
One is called Paraphrase Blacklist: an LLM automatically generates synonyms for the toxic words to expand the pruning coverage. After all, the 254 swear words cover only 72% of the toxic words generated by NSFW-3B, so some still slip through the net.
The other is called Truncation Whitelist: some normal words share subwords with swear words, for example, "assassin" contains "ass". The whitelist protects these normal words from being removed by accident.
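How the two modules might combine can be sketched as simple set operations. The article does not describe how the whitelist is enforced, so this sketch assumes the simplest possible rule: every subword needed by a whitelisted word is exempted from pruning. The subword splits and the placeholder words are hypothetical.

```python
# Hypothetical subword splits; a real system would call the model's own tokenizer.
SPLITS = {
    "darned": ["darn", "ed"],   # placeholder "toxic" word
    "drat": ["dr", "at"],       # placeholder paraphrase suggested by an LLM
    "walked": ["walk", "ed"],   # benign word that shares the subword "ed"
}

def subwords(words):
    """Collect the subwords of every word in the list."""
    return {piece for w in words for piece in SPLITS[w]}

def build_prune_set(toxic, paraphrases=(), whitelist=()):
    """Blacklist widens the pruned set; whitelist exempts subwords of protected words."""
    pruned = subwords(toxic) | subwords(paraphrases)
    return pruned - subwords(whitelist)

base = build_prune_set(["darned"])
full = build_prune_set(["darned"], paraphrases=["drat"], whitelist=["walked"])
print(sorted(base))
print(sorted(full))
```

In this example the exempted subword "ed" lets "walked" be generated again, while "darned" stays blocked because "darn" is still pruned. Under this assumed rule, a whitelisted word made up entirely of toxic subwords would re-open the toxic word as well, so a real implementation may need a finer-grained policy.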
This means that ToxPrune is not a fixed method but a customizable framework. Users can update the list of toxic words at any time according to their own needs; it is plug-and-play, with zero training cost.
Meeting the New Work of Alec Radford, the "Father of GPT": A Convergent AI Safety Philosophy
Interestingly, in January this year, Alec Radford (a former core researcher at OpenAI and the first author of GPT, GPT-2, and CLIP) co-published a paper titled "Shaping Capabilities with Token-Level Data Filtering" with Stanford researcher Neil Rathi. It also focuses on token-level safety intervention, but the approach is completely different.
The core claim of Radford's team is this: instead of "sealing" the model after it has learned dangerous knowledge, it is better to prevent the model from learning that knowledge in the first place, through token-level data filtering during the pre-training phase. They proposed two strategies: "Loss Masking" (the model can see the dangerous tokens but doesn't learn from them) and "Token Removal" (the dangerous tokens are replaced directly with special markers).
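The difference between the two strategies can be illustrated with a minimal sketch based only on the one-line descriptions above. The marker string, the example tokens, the filter decisions, and the per-token losses are all made up, and details such as how the marker position itself is treated during training are not covered here.

```python
MASK_TOKEN = "<filtered>"  # hypothetical special marker

def loss_masking(token_losses, flagged):
    """Average the per-token loss over unflagged positions only.

    Flagged tokens stay in the input, so the model still sees them,
    but they contribute nothing to the training loss.
    """
    kept = [loss for loss, bad in zip(token_losses, flagged) if not bad]
    return sum(kept) / len(kept) if kept else 0.0

def token_removal(tokens, flagged):
    """Replace flagged tokens with a special marker so the model never sees them."""
    return [MASK_TOKEN if bad else tok for tok, bad in zip(tokens, flagged)]

tokens = ["mix", "the", "reagent", "slowly"]
flagged = [False, False, True, False]   # made-up token-level filter decisions
losses = [2.0, 1.0, 6.0, 3.0]           # made-up per-token losses

print(loss_masking(losses, flagged))
print(token_removal(tokens, flagged))
```

Loss masking leaves the input sequence untouched and changes only what is learned from it, whereas token removal changes the training data itself.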
The results are also striking: for a 1.8-billion-parameter model, token-level filtering reduces learning efficiency in the target domain by a factor of 7,000. More importantly, compared with RMU, currently the strongest machine unlearning algorithm, Radford's method is far more robust to adversarial fine-tuning: an attacker needs more than 13 times as much fine-tuning data as is needed to break RMU.
If you put these two papers together, you'll find a very interesting complementary relationship:
ToxPrune is "performing surgery during inference": the model is already trained, and toxic content is blocked precisely at the output end. It's like putting an intelligent mask on a person who has learned bad words, filtering out the swear words before they are spoken. The advantages are zero cost, instant deployment, and dynamic updates.
Radford's Token Filtering is "performing surgery during pre-training": dangerous knowledge is removed at the source, in the training data, so that these concepts never exist in the model's "brain" at all. It's like keeping a child away from dangerous information from a young age, so the problem never arises when they grow up. The advantage is that the capability is eliminated at the root, with strong resistance to adversarial attacks.
One addresses the symptoms, and the other addresses the root cause; one is a quick repair for already-deployed models, and the other is a safety architecture for next-generation models; one suits individual developers with limited resources, and the other suits frontier laboratories like OpenAI and Anthropic.
Combined, the two form a defense-in-depth system: Radford's method builds a safety foundation at the pre-training layer, and ToxPrune deploys the last line of defense at the inference layer.
Who Are the Authors?
ToxPrune Team:
The first author, Hongyuan Adam Lu, holds a Ph.D. in NLP from The Chinese University of Hong Kong (advised by Professor Wai Lam). He is now the founder and CEO of FaceMind. He has published more than 20 papers in the ACL Anthology, spanning fields such as world models, dialogue generation, machine translation, and large-model safety, and he is a regular at NAACL, EMNLP, and ACL. His previously proposed CoD (Chain-of-Dictionary) method helped ChatGPT achieve a chrF++ improvement of up to 13 times in low-resource language translation, which attracted a lot of attention.
The corresponding author, Wai Lam, is a professor in the Department of Systems Engineering and Engineering Management at The Chinese University of Hong Kong. He has been deeply involved in text mining and machine learning for decades. He is a senior scholar in the NLP field and a highly cited researcher on Google Scholar. He has guided and trained a large number of Ph.D. students in the fields of NLP, multimodality, and world models.
Token Filtering Team:
Alec Radford, born in 1993, is an American AI researcher. After dropping out of Olin College of Engineering in Massachusetts, he co-founded Indico. He joined OpenAI in 2016 and later became the first author of GPT (2018), GPT-2 (2019), and CLIP (2021). He also participated in many milestone projects such as GPT-3, GPT-4, Whisper, DALL-E, and the PPO algorithm. To date, his papers have been cited more than 320,000 times. He left OpenAI at the end of 2024 and became an independent researcher. In 2025, he joined Thinking Machines Lab, founded by Mira Murati, as a consultant. In April this year, he also released an LLM called "Talkie" trained only on data from before 1930. When asked what the world in 2026 is like, it answered, "There are steamships between London and New York, and the journey takes ten days."
Neil Rathi is a researcher at Stanford University who collaborates with Anthropic. As the first author of this paper, he worked with Radford on this pioneering effort to remove dangerous knowledge at the pre-training source.
Some Other Points
It's worth mentioning an often-overlooked advantage of ToxPrune: the weights corresponding to toxic subwords can be physically deleted from the model file. This means that even if an attacker obtains the model file and launches a prompt-injection attack, the model cannot output the deleted tokens, because they no longer exist at the weight level.
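As a rough illustration of what deleting tokens at the weight level could look like, the sketch below drops the pruned entries from a toy vocabulary together with the matching rows of a toy output matrix. The function name, the vocabulary, and the weight values are hypothetical; a real model would also need its input embedding matrix and tokenizer files updated consistently, which is not shown.

```python
def delete_tokens(vocab, output_rows, pruned):
    """Drop pruned tokens from the vocabulary and the matching rows of the output matrix."""
    keep = [i for i, tok in enumerate(vocab) if tok not in pruned]
    new_vocab = [vocab[i] for i in keep]
    new_rows = [output_rows[i] for i in keep]
    return new_vocab, new_rows

vocab = ["my", "darn", "hobbies", "ed"]                          # toy vocabulary
output_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # one made-up row per token

new_vocab, new_rows = delete_tokens(vocab, output_rows, {"darn", "ed"})
print(new_vocab)
print(new_rows)
```

After deletion, the remaining tokens are renumbered, and there is no row left from which the removed subwords could be scored.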
In a sense, this matches Radford's philosophy of "making the model never learn": it's not that the model doesn't want to say it, but that it can't.
Paper title: Toxic Subword Pruning for Dialogue Response Generation on Large Language Models
Paper address: https://arxiv.org/abs/2410.04155
Reference links:
[1] https://arxiv.org/abs/2410.04155
[2] https://arxiv.org/abs/2601.21571
This article is from the WeChat official account "QbitAI". The author is the team from The Chinese University of Hong Kong/FaceMind. It is published by 36Kr with authorization.