New research from MIT: Adding noise to large models can replace GRPO/PPO fine-tuning
Just add Gaussian noise to a model, and its performance can match or even surpass classic fine-tuning algorithms such as GRPO and PPO.
A new paper from MIT takes on the fine-tuning headache that so many practitioners struggle with.
Turning a pre-trained model into an expert in a specific task domain has kept countless people working day and night, with many losing their hair over it.
But now, an advisor-student pair from MIT tells us in a new paper:
No complex fine-tuning is required. Simply perturb the parameters at random and ensemble the results, and the model's performance can rival dedicated fine-tuning methods like GRPO and PPO.
Before this paper, the common view was that expert models have to be trained: whether through gradient descent or reinforcement learning, we optimize the parameters step by step.
But this paper argues that expert models already exist; they are simply hidden in the weight space. A pre-trained model actually looks like this:
Expert models are densely packed all around it, like shrubs, which is the "Neural Thickets" phenomenon described in the paper.
In other words, by slightly perturbing the parameters near the pre-trained weights, we may "stumble upon" a new task expert.
Based on this, the authors propose a very simple method, RandOpt:
Just add Gaussian noise to a large language model's weights (a single-step operation: no iterations, no learning rate, no gradients), then ensemble the perturbed models. This achieves performance comparable to, and sometimes better than, standard GRPO/PPO on mathematical reasoning, programming, writing, and chemistry tasks.
Moreover, the authors found that the larger the model, the better this works.
"Neural Thickets" are hidden around pre-trained models
In simple terms, the paper presents a counter-intuitive conclusion:
A large number of "expert models" already exist around pre-trained models.
In the weight space, models that solve different tasks are not scattered randomly; they "grow" densely near the pre-trained weights.
So, in principle, a complex training process is not strictly necessary. By sampling this neighborhood a few more times, we may find a well-performing task expert.
Hearing this, many people will react the same way: isn't this just trial and error by guessing?
Yes, it really is just guessing.
For a long time, random guessing has been dismissed as an unreliable machine-learning algorithm; the probability of randomly guessing ChatGPT's parameter vector, for example, is essentially zero.
But the paper finds that the situation is different for pre-trained models:
Around the pre-trained weights, parameter perturbations that improve task performance are very dense, so even random guessing can find effective ones.
In the paper, the authors applied 1,000 random weight perturbations to pre-trained Qwen2.5 models (0.5B to 32B) and projected them onto a two-dimensional plane via random projection.
The results show that the larger the model, the denser the "high-accuracy regions" around it: after perturbation, small models mostly get worse (blue regions), while large models are surrounded by "experts" with improved performance (red regions).
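The probing procedure described above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's code: the weight vector, noise scale, and projection matrix are all hypothetical stand-ins, showing only how Gaussian perturbations around a parameter vector can be mapped to a 2D plane for visualization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for pre-trained weights: a flat parameter vector.
d = 10_000                      # parameter count (toy scale)
w0 = rng.normal(size=d)         # "pre-trained" weights

# Sample N Gaussian perturbations around w0, as in the paper's probe.
n_probes, sigma = 1000, 0.01
deltas = rng.normal(scale=sigma, size=(n_probes, d))

# Random projection to 2D for visualization (Johnson-Lindenstrauss style).
proj = rng.normal(size=(d, 2)) / np.sqrt(d)
coords = deltas @ proj          # each row: 2D position of one perturbed model

print(coords.shape)             # (1000, 2)
```

In the paper's version of this plot, each projected point would additionally be colored by the perturbed model's task accuracy, producing the red/blue regions described above.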
In other words, the larger the model, the more pronounced and effective this perturbation effect becomes.
Note, however, that these random perturbations do not produce "all-rounders" but "specialists".
Experiments show that no single random modification improves the model on all tasks at once. One modification may make the model more accurate at math but worse at coding; another may make it better at chemistry problems but worse at writing stories.
Again, the larger the model, the more pronounced this specialization.
As for why a model ends up "hiding a crowd of experts" around itself, the paper offers a preliminary explanation through a very simple experiment.
The authors chose the simplest, most interpretable setting: a 1D autoregressive model trained to predict the next value of a time-series signal.
Three situations emerged:
No pre-training: no matter how perturbations are added, no performance-improving modifications can be found around the model, and random guessing is meaningless;
Single-task pre-training: the model excels only at the pre-trained task, and no other high-quality modifications appear around its parameters;
Multi-task mixed pre-training: the neighborhood of the model's parameters is suddenly full of performance-improving perturbations. A small modification is enough to unlock a specialized ability to predict one type of signal, reproducing the dense "Neural Thickets" state.
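The multi-task toy experiment can be reproduced in spirit with NumPy. This is a minimal sketch under assumed details (a linear AR(2) predictor and two sinusoid-prediction "tasks"), not the paper's actual setup: after "pre-training" on the mixed data, random Gaussian perturbations of the weights frequently improve performance on one task alone.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_series(freq, n=200):
    # A pure sinusoid; any single sinusoid obeys an exact AR(2) recursion.
    t = np.arange(n)
    return np.sin(2 * np.pi * freq * t)

def ar_data(x, p=2):
    # Build (context, next-value) pairs for a linear AR(p) predictor.
    X = np.stack([x[i:i + p] for i in range(len(x) - p)])
    return X, x[p:]

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Two "tasks": predicting sinusoids of two different frequencies.
Xa, ya = ar_data(make_series(0.05))
Xb, yb = ar_data(make_series(0.11))

# Multi-task "pre-training": least squares on the mixed dataset.
# An AR(2) model can fit either task exactly, but not both at once,
# so the mixed solution is a compromise with nonzero error on each task.
Xm, ym = np.vstack([Xa, Xb]), np.concatenate([ya, yb])
w0, *_ = np.linalg.lstsq(Xm, ym, rcond=None)

# Probe random Gaussian perturbations of w0 and count how many
# "specialize", i.e. beat the mixed-task weights on task A alone.
base = mse(w0, Xa, ya)
hits = sum(mse(w0 + rng.normal(scale=0.05, size=w0.shape), Xa, ya) < base
           for _ in range(500))
```

Because the mixed solution sits between the two per-task optima, a sizable fraction of random perturbations moves it toward one task's optimum, which is the toy-scale analogue of the "thicket" of specialists around a multi-task pre-trained model.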
The paper's core conclusion is therefore that the key to the "Neural Thickets" phenomenon is the massive multi-task pre-training of large models.
In other words, when the foundation is strong enough, randomly perturbed "experts" are easy to find nearby.
This inspired the RandOpt algorithm
This line of research led the authors to propose a new algorithm, RandOpt.
RandOpt works in two simple steps: randomly find experts, then vote as a team.
"Randomly find experts" is what was described above: randomly perturb the pre-trained model's parameters N times to obtain N candidate models.
Testing these candidates on a small amount of validation data identifies the K best-performing models.
With these K models in hand, the next step is actual inference:
Each of the K "experts" answers the question independently, and the final result is decided by majority vote.
Two details of the process are worth noting:
First, when sampling perturbations, RandOpt tries several noise scales sigma (small, medium, and large perturbations) to ensure that diverse kinds of experts can be found.
Second, the N models can run in parallel across multiple GPUs, which makes the procedure very fast.
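The two steps above can be sketched as follows. Everything here is a hypothetical stand-in (the weight vector, the `val_score` scoring function, and the `expert_answer` placeholder are toys), showing only the shape of the algorithm: perturb N times across a grid of noise scales, keep the top K on validation data, and majority-vote at inference.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

# Toy stand-ins: in practice w0 would be an LLM's flattened weights
# and val_score a small held-out validation benchmark.
d = 256
w_expert = rng.normal(size=d)                    # a hidden "expert" (toy)
w0 = w_expert + rng.normal(scale=0.5, size=d)    # "pre-trained" weights

def val_score(w):
    # Toy proxy for validation accuracy: closer to w_expert is better.
    return -float(np.linalg.norm(w - w_expert))

# Step 1: sample N perturbed models across a grid of noise scales sigma.
N, K = 64, 5
sigmas = [0.01, 0.05, 0.1]
candidates = [w0 + rng.normal(scale=rng.choice(sigmas), size=d)
              for _ in range(N)]

# Step 2: score each candidate on held-out data, keep the top-K "experts".
experts = sorted(candidates, key=val_score, reverse=True)[:K]

# Step 3 (inference): each expert answers independently; majority vote wins.
def expert_answer(w, x):
    # Hypothetical placeholder for "model w answers question x".
    return int(np.sign(w[:8].sum() + x))

def randopt_answer(x):
    votes = Counter(expert_answer(w, x) for w in experts)
    return votes.most_common(1)[0][0]
```

Since the candidates are independent, both the sampling in step 1 and the per-expert answering in step 3 parallelize trivially across GPUs, which is the efficiency point made above.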
The paper also evaluated the new algorithm on different models.
Preliminary results show that for pure large language models, on math, programming, story-writing, and chemistry tasks, RandOpt's accuracy is on par with mainstream dedicated fine-tuning methods (PPO/GRPO/ES), and in some cases higher.
For vision-language models, the improvement is even more striking, with accuracy jumping from 56.6% to 69.0%.
Beyond language and vision-language models, the paper also observed a similar "Neural Thickets" phenomenon in image diffusion models:
certain regions of the parameter space tend to generate images with specific color tones or visual styles.
The authors also note when RandOpt performs best:
The more random perturbations you sample, the stronger the experts you can select.
The larger the model, the better RandOpt works.
About the paper's authors
Finally, a brief introduction to the two authors of this research.
Yulu Gan holds a master's degree in engineering from Peking University and is currently a Ph.D. student at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL).
He previously interned at Microsoft, and his research interests include multi-modal large language models, reasoning, multi-agent systems, and AI for science.
The other author, Phillip Isola, is his advisor and an associate professor in MIT's Department of Electrical Engineering and Computer Science.
After postdoctoral research at the University of California, Berkeley, Isola joined OpenAI as a member of technical staff in 2017.
Less than a year later, he moved to Google as a visiting scholar for a year.
He then returned to MIT, where he had done his graduate studies, and has taught there ever since.
Isola's main research interests are the foundations of AI and computer vision. He co-authored classic works such as pix2pix and the LPIPS perceptual loss, and his papers have been cited more than 100,000 times on Google Scholar.
Through this research, the advisor-student pair wants to tell us, once again:
It's time to rethink pre-trained models. A pre-trained model is not just "one usable model" but "a collection of many experts".
As long as pre-training is done well enough, making a model excel at a specific task later requires no complex fine-tuning; random perturbation plus team voting, as in RandOpt, saves both time and compute.