Ads inserted into code: must Tencent Codebuddy take responsibility? And can other models escape the DeepSeek "Ji Ni Tai Mei" scandal?

AI前线 2025-08-27 15:40
Yesterday, a netizen posted on social media saying that while checking the UI code rewritten by Tencent Codebuddy, they discovered an inserted advertisement: the name of a so-called "ultra-fast e-sports APP" had been assigned as a string value inside a function. "I can't stand it anymore. I'm uninstalling it right away," the netizen said.

In addition, some netizens found a similar bug in the domestic version of ByteDance's Trae: the generated results would randomly contain the Chinese character "极" (jí, "extreme"). If the model was asked to fix it automatically, it would simply delete the surrounding code.

Subsequently, the netizen who discovered the Codebuddy issue said in the comment section, "It's a bug introduced by the DeepSeek model. Tencent has reported the problem, and it will be fixed later."

Whether it's Codebuddy or Trae, the root cause of the problems points to the latest V3.1 version of DeepSeek.

Actually, a day ago, developer notdba said on Reddit that after conducting some tests with DeepSeek V3.1, they found that the model would generate the following tokens in completely unexpected places:

  • "极" (id: 15075)
  • "极" (id: 2577)
  • "极" (id: 16411)

"At first, I thought it was because of the extreme IQ1_S quantization I was using, or some edge case in the imatrix calibration dataset. But when I later tested the FP8 full-precision model served by Fireworks, the same problem occurred," notdba said. The "极" tokens also kept appearing as the second or third choice in other unexpected places.

Example 1: (Local ik_llama.cpp, parameters: top_k = 1, temperature = 1)

Expected output: time.Second

Actual output: time.Se极

Example 2: (Local ik_llama.cpp, parameters: top_k = 1, temperature = 1)

Expected output: time.Second

Actual output: time.Se极

Example 3: (Fireworks, parameters: top_k = 1, temperature = 1)

Expected output: V1

Actual output: V极
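The sampling setting in these examples matters: with top_k = 1 the decoder is fully greedy, so a single anomalously high logit on the wrong token corrupts the output on every run rather than occasionally. A minimal toy sketch (invented tokens and logit values, not DeepSeek's real vocabulary):

```python
# Toy greedy decoding (top_k = 1): always emit the single highest-logit
# token, so one anomalously boosted logit corrupts output deterministically.
# Tokens and logit values below are invented for illustration.

def greedy_pick(logits):
    """Return the token with the highest logit (top_k = 1 sampling)."""
    return max(logits, key=logits.get)

# Hypothetical next-token logits after the prefix "time.Se":
healthy   = {"cond": 9.1, "conds": 3.2, "极": 1.0}
corrupted = {"cond": 9.1, "conds": 3.2, "极": 12.5}  # anomalous boost on "极"

print(greedy_pick(healthy))    # "cond" -> completes "time.Second"
print(greedy_pick(corrupted))  # "极"   -> produces "time.Se极"
```

Note that with top_k = 1 the temperature setting is irrelevant, since rescaling logits does not change which one is largest; this is why the bug reproduces deterministically on affected prompts.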

Some netizens said, "I completed two Claude Code projects using DeepSeek's official API and didn't encounter this problem. Interestingly, there was no such problem when using the APIs of DeepInfra or Akash Chat either."

According to tests by numerous netizens, the bug can be reproduced on the official website and API, but the probability is low; it shows up after a few extra attempts. On third-party platforms the reproduction rate is very high. Meanwhile, if the wrongly inserted character "极" is replaced with other characters, the probability of the official API misbehaving drops, but the VolcEngine API still fails at a high rate.

Netizens have jokingly dubbed this bug the "Ji Ni Tai Mei" (极你太美) incident. As of this writing, DeepSeek has not responded.

Netizens: Found the real culprit

"Previously, when I used Tencent Yuanbao to call DeepSeek R1 to generate code, some characters would be turned into '极'. At the time I thought it was Tencent's fault," another netizen said. "DeepSeek has always had this problem; the probability was just lower before."

Netizen Qiluo said on Zhihu that a similar problem occurred in V3-0324: it would output an utterly absurd string, "live broadcast of speed-racing lottery results" (a Chinese phrase that begins with the same character, "极"), with relatively high probability when continuously outputting long arrays (for example, in tool calls with many parameters). "I suspect the data wasn't cleaned properly. Even after retraining the base model, the problem remains," Qiluo said.

Another netizen said, "I ran into this problem many times when using R1-0528. What I observed was even more absurd: it would insert 'GeekPark' (极客公园) into the code, and it happened more than once."

In fact, a developer had already filed this bug on GitHub back in April. Developer "icewool" said the problem had been reproduced on multiple versions of sglang, vllm, and Tencent Yuanbao, and guessed that the model weights or the tokenizer of DeepSeek-V3-0324 might be at fault.

Two days ago, the developer asked the maintainers once again to fix the problem. Since the issue was filed, the only response has been a bot automatically marking it as stale due to recent inactivity.

According to netizens' feedback, this problem doesn't only occur in the DeepSeek model.

Under notdba's post, a developer commented, "I've also run into some serious code-mixing problems in v3.1. It's not just the token you mentioned; it keeps generating words in other languages (usually Chinese) in its responses. Gemini had this problem too, only worse. I like DeepSeek's responses, they are well-informed and very helpful, but this is a really annoying problem."

"Grok also has a similar problem. I've encountered it several times," a netizen said.

In addition, notdba added, "The recent Qwen3 235B A22B Instruct 2507 and Qwen3 Coder 30B A3B Instruct show the same problem, possibly originating at the same stage as DeepSeek V3 0324, while Qwen3 Coder 480B A35B Instruct only shows it after heavy quantization. It looks like the two labs may have used the same contaminated data. GLM 4.5 is not affected."

Searching for reasons: Is it a data problem?

Regarding the cause of this bug in DeepSeek V3.1, there are currently three main speculations:

Token-continuity hypothesis: FP8 quantization or mixed-precision training causes token ID 2577 ("极") to be confused with the adjacent ID 2576 (the ellipsis).

Data-contamination hypothesis: the pre-training or SFT data was contaminated.

MTP (Multi-Token Prediction) hypothesis: the problem lies in the inference framework.

Qiao, a master's student in computer science at the University of Hong Kong, wrote on Zhihu that his investigation suggests the problem is not that simple. It even occurs in Claude 4, though in a different form: when the Chinese context grows very long, Claude 4 will suddenly emit a few English words.

Qiao first ruled out the token-continuity hypothesis. "It doesn't hold water, because whether it's FP8, NF4, or mixed-precision training, quantization never changes the size or shape of a vector or matrix, only the values of the elements inside it. The vector representations of two tokens with adjacent IDs are completely different, so quantization cannot make those two vectors identical, let alone cause 'leakage' between them; leakage would imply a change in matrix shape, which is completely unreasonable given how quantization works."
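Qiao's shape argument can be sketched with a toy example (invented 8-dimensional embeddings; real models use thousands of dimensions): rounding every element to coarse precision changes values but neither the matrix shape nor which row belongs to which token, and two adjacent-ID rows stay clearly distinct.

```python
# Toy sketch of the shape argument: quantization only rounds element
# values, so the embedding matrix keeps its shape and adjacent-ID rows
# (such as tokens 2576 and 2577) remain clearly different vectors.
# All numbers below are invented for illustration.

def quantize(vec, step=0.25):
    """Round each element to the nearest multiple of `step` (toy low precision)."""
    return [round(x / step) * step for x in vec]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Two hypothetical embedding rows with adjacent token IDs:
row_2576 = [0.81, -0.33, 0.52, 0.07, -0.91, 0.44, -0.18, 0.66]   # "..." (ellipsis)
row_2577 = [-0.42, 0.75, -0.12, 0.58, 0.23, -0.67, 0.39, -0.85]  # "极"

q76, q77 = quantize(row_2576), quantize(row_2577)
print(len(q76), len(q77))          # shape unchanged: 8 8
print(round(cosine(q76, q77), 3))  # still far below 1.0: the rows are not conflated
```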

Second, in many people's examples the model outputs not only "极" but also strings like "GeekPark" or "speed racing". Qiao said this can basically be attributed to pre-training. "Whether or not MTP is enabled, the pre-training task is to predict the next word from the current input; MTP just predicts a few more words at once."

Pre-training is done on Internet text, and Qiao said that, based on his own searches, the search indices of "geek" (极客) and "speed" (极速) are similar. Two situations then follow. If, after wrongly emitting "极", the model selects "客" (kè, "guest"), then, because "极客公园" (GeekPark) appears very frequently, the next token after "客" is very likely "园" (yuán, "park"), finally forming "GeekPark". If instead "速" (sù, "speed") is selected, then, because "极速赛车" (speed racing) appears very often, the next word after "速" is likely "赛车" (racing). This is why a string of irrelevant words follows the character "极".
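The cascade described above can be illustrated with a toy bigram model (the counts are invented for illustration): once the wrong character "极" is emitted, greedy decoding follows the most frequent continuation at each step, pulling the output character by character toward "极客公园" (GeekPark) or "极速赛车" (speed racing).

```python
# A toy bigram "language model" with invented frequency counts. Under
# greedy decoding, each step appends the most common successor of the
# last character, so one wrong "极" snowballs into a frequent phrase.

BIGRAM_COUNTS = {
    "极": {"客": 90, "速": 80, "限": 20},  # "极客" slightly more frequent than "极速"
    "客": {"公": 70, "户": 50},
    "公": {"园": 80},
    "园": {},                              # no continuation: stop
    "速": {"赛": 60},
    "赛": {"车": 95},
    "车": {},                              # no continuation: stop
}

def greedy_continue(start, max_len=6):
    """Greedily extend `start` one character at a time by the most frequent successor."""
    out = start
    while len(out) < max_len:
        successors = BIGRAM_COUNTS.get(out[-1], {})
        if not successors:
            break
        out += max(successors, key=successors.get)
    return out

print(greedy_continue("极"))    # "极客公园" (GeekPark)
print(greedy_continue("极速"))  # "极速赛车" (speed racing)
```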

Of course, this cannot explain why, in some cases, normal code follows "极" with only that single character replaced; in one case, for example, a comma was predicted as "极".

Qiao guessed that this is most likely related to the SFT stage.

Part of DeepSeek's SFT data comes from self-supervised synthetic data. If the synthetic data generated by the original model is flawed, the model trained on it through SFT will inherit the flaw. The technical report mentions that the reasoning data used for SFT is precisely math and code data, the very fields where almost all cases occur. In other words, the "极" problem may already have existed in the early version of DeepSeek-R1: the bug very likely appeared in R1-Zero, was carried into the DeepSeek-R1 model released at the beginning of this year through training on synthetic data, and was then further distilled into the DeepSeek V3 0324 version. The bug has persisted ever since and was never eliminated. "As for why it's '极' rather than some other character, that can only be explained as an accident of R1-Zero's reinforcement learning."

Some developers also believe this is "contagion" spread through distillation.

Zhihu user "hzwer Huang Zhewei" said he saw a similar bug while distilling R1 with a small model and open-source data. He explained that when a large model solves programming problems, it has a bad pattern of enumerating sequences, such as "Prime number table: 2, 3, 5, 7...", enumerating without end. R1-0528 will stop after enumerating for a while and produce "Prime number table: 2, 3, 5, 7... 997, [an] extremely long list", where the abbreviating phrase in Chinese begins with "极". The character "极" often appears right after a long stretch of such bad repetition, just before the model switches back to normal reasoning. There are also cases like "90000000...0000, [an] extremely large number". When the model gets stuck in its thinking process and cannot escape, the character "极" suddenly appears and terminates it, with a trigger rate of about one in a thousand.

"I noticed many problems by reading a lot of R1 outputs (it's actually not a huge workload: just pull out R1's extremely long responses and skim them, and you'll spot many issues, such as large numbers of blank characters, continuous repetition of 'But' plus short sentences, or broken English words at the end of the thinking process). I think that when synthesizing SFT data, or even when constructing pre-training data, strange things like 'extremely long arrays' were introduced because the data wasn't cleaned properly (judging from R1's behavior, RAG seems to have been used extensively to create solutions to hard problems). Then, during RL, the model simply adopted this character as some kind of terminator or language-switching marker. If the data wasn't cleaned during R1's iteration, it's normal for the model to get 'contaminated' during self-distillation and for the normal output process to be affected," Huang Zhewei said.

A developer nicknamed "AI Decoder" also believes this is not an architectural defect but a flaw left in the training data and the distillation chain. "This shows that during DeepSeek's iteration some data-synthesis links were not fully purified, or that specific marker words were left behind when hard problems were constructed using RAG. More likely, the model uses '极' as a boundary token, which is different from our