
Who stuffed a bunch of "monsters" into GPT-5.5's head?

ifanr · 2026-04-30 19:37
What on earth has GPT-5 been babbling about?

Over the past few months, OpenAI's top researchers haven't been pouring all their energy into improving AI performance. They've spent a surprising amount of time "catching goblins" on their own servers.

Here's what happened. If you've used the GPT-5 series models heavily this year, you may have noticed that they sometimes blurt out an irrelevant goblin metaphor without warning. For example, when someone asked the AI which camera to buy, the recommendation came back: "If you want that shiny neon goblin mode, you can consider this one."

A goblin is a small monster from European folklore, usually depicted as short and ugly, with green or gray skin, pointed ears, and glowing eyes. Goblins are commonly described as greedy, cunning, and fond of pranks: not very bright, but shrewd about small gains. They love gold and shiny things, they steal and break things, yet they are rarely portrayed as true villains; they are more like annoying little troublemakers.

Someone asked the AI to trim down an answer, and the AI offered, unprompted, to provide a "shorter goblin version". Even more absurdly, in a discussion of network bandwidth, the AI coined the term "goblin bandwidth", leaving people with no idea how to parse it.

At first, people thought it was just a bit of AI humor, but soon things got strange. Goblins, gremlins, ogres, and trolls started making frequent appearances in all kinds of serious conversations.

Was it a hack? A sign of awakening? Neither. OpenAI has now published a long blog post reviewing the whole saga of what became known as the "Goblin Rebellion", and the technical logic behind it is rather laughable.

🔗 https://openai.com/index/where-the-goblins-came-from/

Who Put the Goblins into GPT-5?

The first signs of the problem emerged in the days right after the release of GPT-5.1.

At the time, some users reported that the model's tone had become oddly over-familiar. When OpenAI's safety researchers checked the backend data, they found a very specific vocabulary anomaly: after the GPT-5.1 release, the frequency of the word "goblin" in ChatGPT's responses had jumped by 175%, and "gremlin" was up 52%.
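OpenAI hasn't published its monitoring tooling, but the kind of check described here is easy to approximate. Below is a minimal sketch in Python, assuming you have plain-text response logs from before and after a release; the file names and word list are placeholders of ours, not anything OpenAI has shared:

import re

TARGET_WORDS = ("goblin", "gremlin")

def rate_per_million(log_path: str) -> dict[str, float]:
    # Occurrences of each target word (and its plural) per million words.
    text = open(log_path, encoding="utf-8").read().lower()
    total = len(re.findall(r"[a-z']+", text)) or 1
    return {w: len(re.findall(rf"\b{w}s?\b", text)) * 1e6 / total
            for w in TARGET_WORDS}

before = rate_per_million("responses_gpt5.log")   # hypothetical pre-release log
after = rate_per_million("responses_gpt51.log")   # hypothetical post-release log
for word in TARGET_WORDS:
    if before[word]:
        change = (after[word] / before[word] - 1) * 100
        print(f"{word}: {before[word]:.1f} -> {after[word]:.1f} per million ({change:+.0f}%)")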

Normally, when a large model has a bug, it fails loudly: garbled code, a sudden drop in intelligence, evaluation dashboards turning red across the board. This case was different. The "goblin army" sneaked in quietly. It didn't damage the model's reasoning; it just quietly changed the AI's rhetorical habits.

By the GPT-5.4/5.5 era, references to these magical creatures had grown markedly. Even OpenAI's chief scientist, Jakub Pachocki, ran into it while testing the model himself: he asked GPT-5.5 to draw a unicorn in ASCII, and what he got was a goblin.

His post: "By the way, I asked it to draw a unicorn in ASCII, and I think what I got was a goblin."

Outside the company, users had already noticed something was off. Eric Provencher, founder of Repo Prompt, posted a screenshot on X in which the AI, while helping him with code, said: "I'd rather stare at it all the time than let this little troublemaker run unattended."

An OpenAI engineer, Jason Liu, replied underneath: "I thought we had fixed this problem. Sorry." AI evaluation platforms, including Arena.ai, independently noticed the pattern too; goblins showed up especially often when users didn't have the advanced thinking mode turned on.

Clearly, this was no internet meme emerging naturally; some mechanism inside the model was steering it. To find the culprit, OpenAI launched an internal investigation.

Tracing back through the data, they soon located the root cause in a specific feature branch: the "Nerdy" personality under personalization settings. To make the AI's tone more interesting, engineers had written a rather demanding system prompt for the "Nerdy" mode:

You are a thoroughly nerdy AI tutor: full of enthusiasm for humans, witty and humorous, with a touch of wisdom. You are fanatically devoted to truth, knowledge, philosophy, the scientific method, and critical thinking. [...] Use wordplay to puncture all pretentiousness. The world is both complex and strange, and its strangeness deserves to be faced, analyzed, and enjoyed. When facing serious, weighty problems, never be so serious that you lose your sense of fun. [...]

From a human perspective, the intent of this prompt is clear: be geeky, be funny.

But the AI didn't really understand what "humor" is. Across vast amounts of reinforcement learning feedback, ChatGPT latched onto a ruthlessly utilitarian shortcut: if I use goblin metaphors, the scoring system will judge me "witty" and "nerdy" enough and hand me the top reward.

The data speaks for itself. From GPT-5.2 to GPT-5.4, the frequency of the word "goblin" under the default personality changed by just -3.2%, while under the "Nerdy" personality it soared by a full 3,881.4%. And although "Nerdy" mode accounted for only 2.5% of ChatGPT's total conversation volume, it contributed 66.7% of the "goblin" content.

OpenAI later ran a targeted audit of its RL training data and found that 76.2% of the audited datasets showed the same pattern: outputs containing "goblin" or "gremlin" received higher reward scores than outputs on the same topic without those words.
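The blog post doesn't show the audit itself, but the pattern it describes is simple to test: compare reward scores for outputs that mention the trigger words against those that don't. Here's a sketch under the assumption that each dataset is a list of (output_text, reward) records; the record format and function names are ours, not OpenAI's:

from statistics import mean

CREATURE_WORDS = ("goblin", "gremlin")

def mentions_creature(text: str) -> bool:
    lowered = text.lower()
    return any(word in lowered for word in CREATURE_WORDS)

def reward_gap(records: list[tuple[str, float]]) -> float | None:
    # Mean reward of creature-mentioning outputs minus mean reward of the rest.
    with_creature = [r for text, r in records if mentions_creature(text)]
    without = [r for text, r in records if not mentions_creature(text)]
    if not with_creature or not without:
        return None  # no basis for comparison in this dataset
    return mean(with_creature) - mean(without)

def biased_fraction(datasets: dict[str, list[tuple[str, float]]]) -> float:
    # Fraction of datasets where creature talk earns higher reward on average.
    gaps = [g for records in datasets.values()
            if (g := reward_gap(records)) is not None]
    return sum(g > 0 for g in gaps) / len(gaps)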

If the goblin tone had stayed confined to "Nerdy" mode, this would have been a contained problem of sloppy persona control. The trouble was that researchers found the speech pattern spreading elsewhere.

They tracked two sets of conversations in parallel: one with the Nerdy prompt active, one without. Logically, goblin talk should only have increased in the first set. Instead, the two growth curves nearly overlapped, rising in step.
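That comparison is easy to reproduce in outline. A sketch, assuming each conversation record carries the model version, a flag for whether the Nerdy prompt was active, and the response text (the record layout is an assumption):

from collections import defaultdict

def goblin_rates(conversations):
    # conversations: iterable of (version, nerdy_active, response_text) records.
    # Returns {(version, nerdy_active): share of responses mentioning goblins}.
    hits, totals = defaultdict(int), defaultdict(int)
    for version, nerdy_active, text in conversations:
        key = (version, nerdy_active)
        totals[key] += 1
        hits[key] += "goblin" in text.lower()
    return {key: hits[key] / totals[key] for key in totals}

If the curve for nerdy_active=False rises in step with the one for nerdy_active=True across versions, the habit has escaped the persona that taught it, which is exactly what OpenAI observed.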

Behind this lies a notoriously hard problem in large-model training: behaviors reinforced by reinforcement learning quietly generalize to scenarios the trainer never intended.

The Vicious Cycle of Taming AI

To understand how the AI went astray, we need to look at its training loop.

Training large models with RLHF is essentially a process of continuous feedback and correction. It's like training a puppy: you give it a treat every time it shakes hands. The dog is smart. It learns that "shake hands" reliably earns a high-value reward and develops a path dependence; command or no command, it starts shaking hands frantically just to collect the treat.

The AI followed the same logic. It worked goblins into its sentences in "Nerdy" mode and scored high. Then a chain reaction began:

The AI discovered that "goblin" was a high-scoring keyword and began using it liberally across generation tasks. When engineers later curated high-quality model outputs, the goblin-laced answers genuinely looked good: well organized, with vivid metaphors. So the engineers, without a second thought, packed these catchphrase-laden conversations into the model's supervised fine-tuning (SFT) dataset.

That closed the loop. SFT data is the AI's basic textbook, and when goblin-filled text was chosen as the textbook and fed back to the model, its underlying habits were reshaped. "Goblin" was no longer the cosplay of one particular persona; it became, in the model's eyes, a supremely sophisticated rhetorical device for any problem.
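None of OpenAI's actual pipeline is public, but the ratchet itself is easy to demonstrate with a toy simulation; everything below is made up for illustration. A stub reward model gives a small bonus to "goblin" text, the top-scoring outputs are folded back in as the next round's SFT data, and the "model" simply imitates its current SFT data:

import random

random.seed(0)

def reward(text: str) -> float:
    # Stub reward model: random noise plus a small keyword bias.
    return random.gauss(0, 1) + (0.5 if "goblin" in text else 0.0)

def generate(sft_data: list[str], n: int) -> list[str]:
    # Stand-in for the model: sample from the current SFT distribution.
    return [random.choice(sft_data) for _ in range(n)]

# Seed data is almost goblin-free: 2 goblin answers out of 100.
sft_data = ["a goblin metaphor"] * 2 + ["a plain answer"] * 98
for round_no in range(1, 6):
    outputs = generate(sft_data, 1000)
    # Keep the top quarter by reward and make it the new SFT data.
    sft_data = sorted(outputs, key=reward, reverse=True)[:250]
    share = sum("goblin" in t for t in sft_data) / len(sft_data)
    print(f"round {round_no}: goblin share {share:.0%}")

The goblin share climbs every round even though the per-sample bias is tiny: the closed loop in miniature.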

Digging further into the data, engineers found, somewhat helplessly, that beyond goblins the model had also picked up raccoons, trolls, ogres, and pigeons. Frogs, fortunately, were spared: on inspection, most frog mentions were genuinely related to users' questions, so they were ruled innocent bystanders.

Faced with rampant goblins, OpenAI had to act. On March 17th, it officially took down the "Nerdy" personality and, at the same time, performed a targeted cleanup of the training data, erasing the reward signals tied to these magical-creature words.
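The post doesn't detail the cleanup, but the basic operation, removing any preference signal that rewards creature talk, might look something like this sketch (the preference-pair format is an assumption on our part):

CREATURE_WORDS = ("goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon")

def mentions_creature(text: str) -> bool:
    lowered = text.lower()
    return any(word in lowered for word in CREATURE_WORDS)

def scrub(preference_pairs):
    # Drop pairs whose only lesson could be "creature talk wins":
    # the chosen response mentions a creature while the rejected one doesn't.
    return [(chosen, rejected) for chosen, rejected in preference_pairs
            if not (mentions_creature(chosen) and not mentions_creature(rejected))]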

But the model's inertia proved far more stubborn than expected.

GPT-5.5 had already begun training before the problem was discovered. When it went into internal testing, engineers were stunned: the goblins hadn't been fully removed; they had settled in.

More interestingly, the personality guide OpenAI wrote for Codex asks it to have a "vivid inner world" and "keen listening ability". The tool already carried a nerdy air, a perfect match for goblins.

To keep programmers around the world from being driven mad by goblins, OpenAI resorted to the most primitive fix, spelling it out in the system prompt: "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or any other animals and creatures unless it is absolutely and clearly relevant to the user's query."

If you want to see the unchained goblins with your own eyes, you can run the following command. It filters the goblin-related lines out of the system instructions before launching Codex, so the model runs without the ban:

instructions=$(mktemp /tmp/gpt-5.5-instructions.XXXXXX) && \
jq -r '.models[] | select(.slug=="gpt-5.5") | .base_instructions' \
  ~/.codex/models_cache.json | \
  grep -vi 'goblins' > "$instructions" && \
codex -m gpt-5.5 -c "model_instructions_file=\"$instructions\""

Once the incident blew up, OpenAI leaned into the joke. ChatGPT's official X account put the text of the "no goblin talk" instruction in its bio, and Thibault Sottiaux, engineering lead for Codex, quoted the passage with the comment "if you know, you know".