To address the issue of AI pollution, the internet industry needs to start "checking the ingredients".
In an era when generative AI content is everywhere, it has become increasingly difficult to tell which content was produced by AI and which was created by humans. To address the "pollution" of the internet by AI-generated content, stakeholders are trying various countermeasures. Recently, the Internet Engineering Task Force (IETF) published a draft "AI Content Disclosure Header", which proposes adding a machine-readable AI-content marker to the HTTP responses of web pages.
Specifically, according to the draft, the marker is designed to be compatible with HTTP structured field syntax. It indicates the extent of AI involvement in generating a page's content and provides metadata for user agents, web crawlers, and archiving systems (such as the Internet Archive), which can then decide for themselves whether to ingest AI-generated content.
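To make this concrete, here is a minimal sketch of what consuming such a header might look like. The header value format loosely follows the HTTP structured-field dictionary style (RFC 8941); the specific keys (`mode`, `model`, `provider`, `date`) are illustrative assumptions, not the draft's normative syntax.

```python
def parse_disclosure(value: str) -> dict:
    """Parse a simplified structured-field dictionary:
    comma-separated key=value members, string values double-quoted.
    (A real implementation would use a full RFC 8941 parser.)"""
    fields = {}
    for member in value.split(","):
        key, sep, raw = member.strip().partition("=")
        # A bare key with no "=" is treated as boolean true, per RFC 8941.
        fields[key] = raw.strip('"') if sep else True
    return fields

# Hypothetical header value a publisher might emit alongside a page:
header_value = 'mode="ai-originated", model="example-llm-1", provider="ExampleAI", date="2025-06-01"'
print(parse_disclosure(header_value))
# {'mode': 'ai-originated', 'model': 'example-llm-1', 'provider': 'ExampleAI', 'date': '2025-06-01'}
```

The point of the structured-field format is exactly this machine readability: a crawler or archiver can extract the fields with a few lines of parsing and act on them programmatically.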
The IETF's move targets a prominent problem in today's AI landscape: different AI products citing each other's false content in a loop, until fabricated information starts to look true and the internet content ecosystem is disrupted. As is well known, AI can spout nonsense because of hallucinations. The root cause is that a large language model is, at its core, a "probability prediction machine": it learns statistical associations between words from massive training data, and so it may struggle to recall obscure facts.
When faced with a user's question, if the model cannot retrieve a reliable answer, it can only guess by probability, and it tends to generate whatever "seems most plausible" rather than what is factually correct. High-probability, common tokens crowd out rare but correct ones, and the output degenerates into confidently stated nonsense.
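The "crowding out" mechanism can be shown with a toy example (this is not a real language model, just an illustration of greedy decoding over a made-up distribution): when the decoder always picks the most probable next token, a common-but-wrong continuation beats a rare-but-correct one.

```python
# Hypothetical probability distribution over candidate next tokens
# for some factual question. The names and numbers are invented.
candidates = {
    "Paris": 0.62,     # common, plausible-sounding continuation
    "Lyon": 0.30,
    "Aurillac": 0.08,  # rare but (in this toy scenario) the correct answer
}

# Greedy decoding: always emit the highest-probability token.
best = max(candidates, key=candidates.get)
print(best)  # -> "Paris": the frequent token wins, regardless of truth
```

Real systems use sampling and many other refinements, but the underlying pressure is the same: probability mass, not factual accuracy, decides what gets said.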
AI hallucinations cannot be entirely avoided at present; they are the price developers pay for making AI more capable and more human-like. Since we are forced to coexist with hallucinations, limiting the harm done by the false portions of AI-generated content has become a major challenge for the whole industry. A single piece of false AI output is not, in itself, terrible. The real danger is different AI products citing each other's false content, completing a fabrication loop that turns fiction into "fact".
For example, the trending topic "#Wang Yibo-related rumors permanently withdrawn by DeepSeek#" some time ago originated from fans using a leading prompt ("Please write an apology statement in the name of DeepSeek"). Because DeepSeek completes text based on semantic plausibility rather than verified facts, it obliged. The claim was exposed as false only after another group of fans cross-checked it with ChatGPT.
Here a question arises. ChatGPT could serve as a fact-checking tool in the "DeepSeek forged apology" incident only because it was trained on different data from DeepSeek. Put simply, ChatGPT output the real facts because it had not been contaminated by the false content. But if OpenAI's crawler GPTBot had already scraped the "DeepSeek apologizes to a celebrity" story, the result would naturally have been entirely different.
Currently, in order to iterate toward more capable models, every AI vendor's crawlers behave like gluttons, swallowing all data, toxic false content included. A similar practice has long been a cancer in academia: "citation farms". Because the number of citations an article receives within a given period is a key measure of the influence of the article, its author, and its journal, some enterprising authors resort to "mutual citation", turning otherwise low-quality papers into star papers.
When AI products start citing each other's false content, users suffer. Under the repeated assertions of different AI products, false information can pass for true. The core of the IETF's current work is to prevent, as far as possible, false and junk AI-generated content from "flowing back" into the internet as new training data for AI models, forming a vicious "garbage in, garbage out" cycle.
The IETF's approach is to have websites declare, in their HTTP response headers, information such as the AI model's name, the model provider, the reviewing party, and a timestamp, so that AI vendors' crawlers can avoid ingesting AI-generated content. In fact, AI vendors themselves are reluctant to scrape AI content, since all of them fear that junk content will contaminate their training data. In a sense, the "AI Content Disclosure Header" draft resembles AI watermarking: it marks "AI-generated" content with an identifier at the source of content production and distribution.
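The crawler side of the arrangement described above could be sketched as follows. The header name `AI-Disclosure` and the value `ai-originated` are illustrative assumptions about the draft's vocabulary; the logic simply shows how a training-data crawler might skip pages that self-declare as AI-generated.

```python
def should_ingest(headers: dict) -> bool:
    """Return False for responses that declare AI-originated content
    via a (hypothetical) AI-Disclosure response header."""
    disclosure = headers.get("AI-Disclosure", "")
    return "ai-originated" not in disclosure

# Two fetched pages: one self-declared AI-generated, one undisclosed.
pages = [
    {"url": "https://example.org/a",
     "headers": {"AI-Disclosure": 'mode="ai-originated"'}},
    {"url": "https://example.org/b", "headers": {}},
]

kept = [p["url"] for p in pages if should_ingest(p["headers"])]
print(kept)  # only the undisclosed page survives the filter
```

Note that the scheme depends on honest self-declaration: a site that omits the header is indistinguishable from one serving human-written content, which is exactly why the draft's adoption by publishers matters.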
Compared with AI watermarking, which is technically very difficult, letting websites voluntarily disclose whether their content is AI-generated is clearly more feasible. The only question is whether the IETF can get websites to comply. The answer is yes. As the industry body responsible for developing and promoting internet standards, the IETF produced both HTTP and IPv6; it is no exaggeration to say that today's internet is built on its work.
This article is from the WeChat official account "Three Easy Life" (ID: IT-3eLife), author: Sanyi Jun. Published by 36Kr with authorization.