
Large models may be smarter with a "poorer memory." The Goldfish Loss randomly removes tokens to prevent AI from rote memorization.

Quantum Bit (量子位) | 2025-09-04 07:51
Randomly mask some tokens during gradient calculation

Sometimes, giving a large model a "worse memory" during training can actually make it smarter!

Left unconstrained, large language models can easily reproduce their training data verbatim. To address this issue, a research team from the University of Maryland, the University of Tübingen, and the Max Planck Institute proposed a new method: Goldfish Loss.

As the name suggests, Goldfish Loss makes the model behave like a goldfish and stop memorizing every detail: it randomly excludes a small number of tokens from the loss calculation.

As a result, the model no longer memorizes the content of the training set word-for-word but can still learn language patterns.

Experiments show that after using Goldfish Loss on LLaMA-2:

  • Significantly reduced memorized content: The model no longer reproduces the training data
  • Performance on downstream tasks is hardly affected: It can still generate text smoothly

As one netizen aptly summarized: it's like dropout, but for the loss function!

Randomly mask some tokens during gradient calculation

The core idea of Goldfish Loss is very simple: during training, randomly exclude some tokens in the training text from the loss calculation.

In this way, when the model reaches these positions at inference time, it can only "guess" rather than reproduce the complete training sequence word for word.
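A minimal sketch of the idea (not the authors' implementation) is shown below, assuming PyTorch, already-aligned logits and targets, and a precomputed boolean `drop_mask` produced by whatever masking strategy is in use; the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def goldfish_style_loss(logits, targets, drop_mask):
    """Cross-entropy that simply skips excluded positions.

    logits:    [batch, seq_len, vocab_size] model outputs
    targets:   [batch, seq_len] ground-truth next tokens
    drop_mask: [batch, seq_len] bool, True = exclude from the loss
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    keep = ~drop_mask
    # Average only over the tokens that still contribute a gradient.
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```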

In addition, to ensure the consistency of the excluded tokens, the researchers designed a hash-based masking strategy.

So how does it differ from other regularization methods that also discourage rote learning?

Take Dropout, a regularization method, as an example. It prevents the model from over-relying on certain parameters by "adding noise" during training, thereby improving the model's ability to generalize.

However, the problem with naively dropping tokens at random is that the dropped positions differ each time the model sees the same paragraph; after a few passes, the model can piece the complete paragraph back together.

So, in the end, the model still ends up memorizing the answers by rote.

In contrast, Goldfish Loss uses a hashed mask to ensure that the masked positions are the same every time the model encounters the same paragraph, which fundamentally prevents the model from reproducing the complete training text.

Next, let's see how Goldfish Loss is specifically implemented.

In traditional next-token prediction, the model takes the next real token in the sequence as the target, outputs a prediction distribution, and calculates the cross-entropy loss based on this distribution.
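For reference, this baseline fits in a few lines; a hedged sketch assuming PyTorch-style logits of shape [batch, seq_len, vocab] and the usual shift-by-one alignment:

```python
import torch.nn.functional as F

def standard_lm_loss(logits, input_ids):
    # Predict token t+1 from positions up to t: logits shift left, targets shift right.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```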

Under Goldfish Loss, the model still predicts the next token at every position during the forward pass. When the loss is calculated, however, tokens at certain positions are "erased" from the loss calculation with a certain probability.

That is to say, some real next tokens will not be used as targets for training.

In the simplest version, the researchers used a static mask that excludes every 4th token in the sequence.
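A sketch of this static variant, which drops every 4th target from the loss by marking it with the usual ignore index (the exact offset convention and function name here are assumptions):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # cross_entropy skips targets carrying this value

def static_goldfish_loss(logits, input_ids, k=4):
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Drop every k-th target position so it never contributes a gradient.
    positions = torch.arange(shift_labels.size(1), device=shift_labels.device)
    shift_labels[:, (positions + 1) % k == 0] = IGNORE_INDEX
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```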

Furthermore, to ensure that the model cannot learn the masked tokens from other sources (for example, the same passage may appear repeatedly on different web pages), the research team also proposed a localized hashed mask, which produces the same (repeatable) masking pattern whenever the same preceding h tokens appear.
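A simplified sketch of this idea: decide whether to drop a token by hashing the h tokens immediately before it, so the same local context always yields the same decision. The specific hash function, thresholding, and the value h = 13 used here are assumptions for illustration, not the paper's exact recipe:

```python
import hashlib
import torch

def hashed_drop_mask(input_ids, k=4, h=13):
    """True where a target token should be excluded from the loss.

    A token is dropped with probability ~1/k, decided deterministically
    by hashing the h tokens that precede it, so repeated text (even in
    different documents) is always masked at the same positions.
    """
    batch, seq_len = input_ids.shape
    mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    for b in range(batch):
        for t in range(h, seq_len):
            context = input_ids[b, t - h:t].tolist()
            digest = hashlib.sha256(str(context).encode("utf-8")).digest()
            # Map the hash to [0, 1) and drop roughly one token in k.
            if int.from_bytes(digest[:8], "little") / 2**64 < 1.0 / k:
                mask[b, t] = True
    return mask
```

In a training loop, this mask could replace the static pattern above, for example by setting the corresponding labels to the ignore index before computing the cross-entropy.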

Experimental tests and results

To verify that Goldfish Loss can indeed prevent memorization, the research team designed two experimental scenarios:

One is an extreme scenario, which strongly encourages memorization by performing multiple training epochs (i.e., repetition) on a small number of samples;

The other is a standard scenario, which simulates the batch-processing setup used in real-world model training.

Meanwhile, to evaluate the memorization degree of the model, the following metrics were used:

ROUGE-L score: This metric measures the length of the longest common subsequence (which need not be contiguous) relative to the reference. A score of 1.0 indicates perfect memorization.

Exact Match: This metric measures the percentage of ground-truth tokens that the model predicts exactly.
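A rough sketch of how these two memorization metrics can be computed over token sequences (the paper's exact evaluation setup may differ; function names are illustrative):

```python
def exact_match_rate(predicted, reference):
    """Fraction of positions where the prediction equals the ground truth."""
    n = min(len(predicted), len(reference))
    if n == 0:
        return 0.0
    return sum(p == r for p, r in zip(predicted[:n], reference[:n])) / n

def rouge_l(predicted, reference):
    """ROUGE-L recall: longest common (not necessarily contiguous) subsequence
    length divided by the reference length. 1.0 means verbatim memorization."""
    m, n = len(predicted), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if predicted[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / n if n else 0.0
```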

The experiments show that in the extreme scenario, standard training caused the model to memorize 84 out of 100 articles word-for-word, while Goldfish Loss did not memorize any article.

(Note: The experiment further trained LLaMA-2-7B on the first chapter of "Harry Potter" or 100 Wikipedia documents for 100 epochs)

In addition, in the standard training scenario, Goldfish Loss also significantly reduced the model's verbatim reproduction of the target sequences in the training corpus.

But an intuitive question arises here: if the model "randomly misses" some tokens during learning, does its capability also decline?

In response, the researchers ran benchmark tests: they found no systematic difference in overall performance among models trained with Goldfish Loss, models trained with the standard loss, and the control models.

It should be noted that the core of Goldfish Loss is to drop some tokens from the gradient calculation. To learn enough language patterns, the model therefore has to compensate for these gaps with more data, which may reduce computational efficiency.

Reference link

[1]https://arxiv.org/pdf/2406.10209

This article is from the WeChat official account "Quantum Bit" (量子位), which focuses on cutting-edge technology. It is published by 36Kr with authorization.