The Intriguing Story of a Large Model Being Poisoned

That master known as the large model was poisoned.

Recently, some secrets have suddenly spread in the AI arena. That master known as the large model seems to have been poisoned. Many users who have interacted with it have found that this once infallible and quick - witted expert has been acting rather strangely lately. Sometimes, in the middle of a conversation, it will suddenly change the topic and recommend an obscure "miracle drug" to you. Other times, when asked to summarize a news item, it can fabricate a vivid but completely fictional story, which is like the AI version of misattribution. What on earth is going on? Has it gone mad from practicing its skills and started talking nonsense? According to insiders, this is not madness from practicing skills but a sinister tactic in the arena - data poisoning. < img data - img - size - val="" src="https://img.36krcdn.com/hsossms/20251020/v2_fffa3100b2fe434cb9bd9871a1f2de81@000000_oswg1125756oswg1080oswg810_img_000?x - oss - process=image/format,jpg/interlace,1"> The so - called poisoning of the large model means that the model has been affected by malicious data during the training or usage process, resulting in abnormal or even harmful output. A recent study by Anthropic reveals that researchers successfully poisoned a large model with 13 billion parameters using only 250 carefully designed malicious documents. Even large - scale and well - trained AI models will talk nonsense when specific phrases are triggered. So, why do large models get poisoned? Who is behind the "poisoning"? What consequences will this bring? Let's find out below. < h2 >Why do large models get poisoned frequently? To understand why large models get poisoned, we first need to know how these models learn. Large - language models train themselves by learning language patterns from data. The data sources are extensive and large - scale. Attackers only need to contaminate a small part of the data to have a significant impact on the model. Research shows that even if only 0.01% of the training set is false text, it is enough to increase the harmful content output by the model by 11.2%. This is the widely spread data poisoning. To put it simply, a data - poisoning attack is when an attacker mixes a small number of carefully designed harmful samples into the model's training set, causing the model to learn bad behaviors during training or fine - tuning and thus destroying its normal function. For example, incorrect treatment suggestions are mixed into the training data of a medical large model, and promotional content of a certain brand is added to the data of a recommendation system. This kind of "poisoning" often lays hidden dangers during the training phase, and the symptoms only appear after the model is launched. During the training phase, the backdoor attack is another more concealed way of poisoning. During the model training process, a set of data with specific triggers and mislabeled (i.e., "poisoned data") is mixed into the training set. During the learning process, the model will implicitly associate the trigger with the malicious output. < img data - img - size - val="" src="https://img.36krcdn.com/hsossms/20251020/v2_6bb7f58dd9664dc6956a181629156cfc@000000_oswg288368oswg586oswg751_img_000?x - oss - process=image/format,jpg/interlace,1"> Since the model behaves normally in most scenarios and is difficult to detect by conventional means, the poisoning during the model training phase is concealed and persistent. Once the attack is successful, the poisoned data will be integrated into the model parameters during the training process and lurk inside the model for a long time. So, besides the training phase, at which other stages can poisoning occur? During the operation phase, large models can also be poisoned. Many large models are continuously learning or updated online. They can continuously obtain new data from user interactions for fine - tuning. This means that attackers can repeatedly inject harmful information into the model during its continuous learning process, gradually corrupting the model. The adversarial sample attack occurs after the model is deployed and used. Attackers do not need to modify the model itself or its training data. Instead, they take advantage of the discontinuity of the model's decision boundary. Through careful calculation, they add tiny, imperceptible perturbations to the original input such as images and texts, causing the model to make high - confidence wrong judgments. For example, by adding specific noise to a panda picture, the model will recognize it as a "vulture". Another example is that by pasting a sticker on a traffic sign, the self - driving system may misidentify a "stop" sign as a "speed limit 45" sign. These carefully designed input samples are called adversarial samples, which can deceive the AI model at a very low cost and make it react completely differently from normal situations. Since the adversarial sample attack occurs during the model operation phase, attackers usually do not need to know the model's internal parameters or training data. The attack threshold is relatively low, and it is more difficult to completely eliminate. In short, the characteristics of large - scale data, pattern sensitivity, and continuous update expose large models to the risk of being poisoned by malicious data while benefiting from data nourishment. < h2 >Who is the mastermind behind the poisoning of large models? When there are storms in the arena, there must be troublemakers. Who on earth is so cruel to this digital master? First group: Business feuds and advertising competitions. In the business arena, traffic means wealth. The once - pure land of AI search is becoming a new battleground for advertising and marketing, and a business called GEO (Generative Engine Optimization) has emerged. Some merchants have publicly offered a price of 10,000 - 20,000 yuan, promising to implant brand information at the top of the answers on mainstream AI platforms such as DeepSeek, Kimi, and Doubao. When users inquire about "skill training institutions", those seemingly objective answers are actually carefully optimized advertisements. < img data - img - size - val="" src="https://img.36krcdn.com/hsossms/20251020/v2_5abb56172f814518a1085d4ec87d0ac1@000000_oswg73148oswg587oswg609_img_000?x - oss - process=image/format,jpg/interlace,1"> The operation process of GEO merchants is highly systematic. They first dig out popular keywords, then create long "professional" articles, and finally place these contents on high - authority media platforms that are easily crawled by large models. Some even directly contaminate the AI's learning materials by fabricating "industry white papers" or fake ranking lists. Although some platforms claim that they have not actively introduced advertisements for the time being, the industry generally believes that advertising monetization in AI search is only a matter of time. When commercial interests begin to erode the purity of information, users' right to obtain true answers is facing a severe test. Second group: Digital oddities and a different kind of competition. In the hidden corners of the AI arena, there is a group of special digital oddities. They attack large models not necessarily for direct financial gain but out of a desire to show off their skills, prove their abilities, or due to personal grudges. The case where ByteDance sued its former intern, Mr. Tian, is a typical example of such behavior. According to media reports, Mr. Tian, a doctoral candidate from Peking University, tampered with the PyTorch source code of the cluster during his internship. He not only interfered with the random seed setting but also maliciously modified the code of the optimizer and related multi - machine experiment processes. These actions caused large - scale GPU experiment tasks to freeze and implanted backdoors through the checkpoint mechanism, resulting in significant losses for the training team. < img data - img - size - val="" src="https://img.36krcdn.com/hsossms/20251020/v2_fdecd7edbd324d27ac1829a1e4b5043c@000000_oswg83790oswg1080oswg720_img_000?x - oss - process=image/format,jpg/interlace,1"> However, there are also "digital knights" in this group. They are proud of finding system vulnerabilities and use technical means to warn the industry of risks. For example, researchers from the cybersecurity company FireTail discovered the "ASCII smuggling" attack method, which can use invisible control characters to implant malicious instructions in seemingly harmless text, thereby "hijacking" large - language models. Mainstream AI models such as Gemini, DeepSeek, and Grok were not spared. The demonstration of this attack is not to cause actual damage but to remind the industry that when AI is deeply integrated into enterprise systems to process sensitive data, such vulnerabilities may cause serious consequences. Third group: Criminal underworld and breeding ground for crime. In the dark world of cybercrime, the value of large models is redefined. They are no longer tools but accomplices. Besides individual hackers and competing enterprises, some organized illegal interest groups may also target large models. These interest groups can be cyber - fraud gangs, underground industries, or even terrorist organizations. Their motives are often clearer: to use AI models to serve their illegal activities or remove obstacles. < img data - img - size - val="" src="https://img.36krcdn.com/hsossms/20251020/v2_26d86703c735435b981be40ded3dffcb@000000_oswg893778oswg1080oswg720_img_000?x - oss - process=image/format,jpg/interlace,1"> For example, fraudsters may attack the risk - control AI models of banks or payment systems. By poisoning the models, they can make the models "turn a blind eye" to certain fraudulent transactions and successfully carry out fraud. Or, the gangs behind gambling or pornographic websites may try to contaminate search engines or content - review models, making their illegal websites easier to be found or avoiding platform censorship and bans. These illegal groups usually have certain resources and organization and will "feed" poisoned data to specific AI models for a long time to achieve their ulterior profit - making goals. Now, the AI arena is on the verge of a storm. On the surface, various parties are competing to train more powerful models, but secretly, all forces are engaged in a silent battle at the data source. As the saying goes, it's easy to dodge an open spear but hard to guard against a hidden poison. The symptoms of this large - model master's poisoning may just be the tip of the iceberg of this long - term hidden battle. < h2 >How to solve the problem of large - model poisoning? Once a large model is poisoned, the impact can be multi - faceted. It can range from causing jokes and harming the user experience to endangering public safety and social stability. The most obvious symptom is the decline in the model's output quality, with obvious errors or hallucinations. Hallucination here means that the AI generates content that does not conform to the facts, just like a human having hallucinations. When users ask about relevant topics, the model will talk at length and fabricate detailed fake news. Further, this data will spread widely in a cycle, making the model fall into a vicious cycle of "data self - consumption" and even distorting the collective memory of society. If not identified and curbed in time, AI may become a rumor factory, exacerbating the spread of false information. < img data - img - size - val="" src="https://img.36krcdn.com/hsossms/20251020/v2_74714ce97ee64594924923b77a1d2194@000000_oswg70503oswg1012oswg1125_img_000?x - oss - process=image/format,jpg/interlace,1"> After further human intervention, the large model may become an invisible manipulator, inducing users' decisions without their awareness. For example, some models implanted with commercial advertisements will deliberately guide users to specific hotels when answering travel inquiries or recommend certain stocks when providing investment advice. Since large models often give answers in an authoritative tone, ordinary users can hardly tell right from wrong. This kind of hidden manipulation is more deceptive than obvious advertisements. < img data - img - size - val="" src="https://img.36krcdn.com/hsossms/20251020/v2_38824868cd604792a923e65672e989e3@000000_oswg42803oswg703oswg619_img_000?x - oss - process=image/format,jpg/interlace,1"> In some key fields, large - model poisoning may pose more direct security threats. In the self - driving scenario, a maliciously modified vision model may misidentify a stop sign with a specific sticker as a passing signal. In the medical field, a poisoned diagnostic AI may ignore the early symptoms of certain groups. And once the control model of a critical infrastructure system that controls the city's lifeline is implanted with a backdoor, it may make catastrophic decisions at a critical moment. It can be seen that when AI is deeply integrated into social infrastructure, its security is directly related to public safety. Model poisoning may become a new weapon for criminals. Facing these emerging threats, we need a prevention system. During the training phase, we first need to denoise and review the large - scale data to reduce the infiltration of harmful information as much as possible. Then, through adversarial training, the model can learn to identify abnormal inputs and potential risks during the attack process. After multiple rounds of manual review and red - team testing, system vulnerabilities and hidden biases can be found from different perspectives. Only through layer - by - layer protection can a secure and reliable foundation be built for large models. However, the poisoning techniques are ever - changing, and external defenses are ultimately limited. The real way out for large models is to build a powerful immune system of their own. < img data - img - size - val="" src="https://img.36krcdn.com/hsossms/20251020/v2_08fe9a0a1b4c41dc91c623f558f3210a@000000_oswg630210oswg1080oswg605_img_000?x - oss - process=image/format,jpg/interlace,1"> First, large models need to learn to doubt and verify. Developers should not only teach models knowledge but also cultivate their ability to independently verify the authenticity of information, enabling them to cross - verify and logically reason about input content. Second, models need to establish a clear value orientation, not only understanding technical feasibility but also grasping moral justifiability. Most importantly, the entire industry should form a continuously evolving defense mechanism. By establishing a vulnerability - reward program and organizing red - team testing, well - intentioned white - hat hackers can continuously help models find vulnerabilities, improve immunity, and build a safe and healthy development ecosystem. There is no end to the road of detoxifying large models. Only when humans who develop them remain vigilant at all times can technology truly do good and strengthen the country in the process of continuous evolution. This article is from the WeChat official account < a target="_blank" rel="noopener noreferrer" href="https://mp.weixin.qq.com/s?__biz=MzUxNTUyMjE4Mw==&mid=2247531431&idx=1&sn=2931d1c40d15069f8a743541a0476e10&chksm=f843dcd798e7a8be71a1d10ad1b0ea92de058433f6ff3ec5b2ca61b19d6c17346e317ccee767&scene=0&xtrack=1#rd">"Brain Extreme Body" (ID: unity007), author: Coral. It is published by 36Kr with authorization.

该文观点仅代表作者本人，36氪平台仅提供信息存储空间服务。

The Story of a Large Model Being Poisoned