PPO Algorithm: A Classic Work Once Rejected by NeurIPS

Rejection does not equal failure

It's really unexpected.

PPO (Proximal Policy Optimization), the classic algorithm that later became widely used in RLHF and large-model training, was rejected by NIPS 2017 (the conference has since been renamed NeurIPS).

The story was recently brought up by John Schulman, PPO's lead author, himself. He summed up the episode in a single sentence: PPO was rejected by NIPS 2017.

First published in July 2017, the paper seemed at the time to be just a simpler, more engineering-friendly policy optimization algorithm. Its goal was to reduce implementation complexity while retaining the stability of TRPO (Trust Region Policy Optimization), making reinforcement learning training easier to tune and more practical.
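To make the "simpler" part concrete, here is a minimal sketch of the clipped surrogate objective that PPO is best known for. It is an illustration and not the authors' implementation: the function name `ppo_clip_objective`, its argument names, and the default `epsilon=0.2` are assumptions chosen for this example. The idea is to limit how far the probability ratio between the new and old policy can move the objective, instead of enforcing a TRPO-style trust-region constraint.

```python
import math

def ppo_clip_objective(new_logps, old_logps, advantages, epsilon=0.2):
    """Average clipped surrogate objective over a batch (illustrative sketch).

    new_logps / old_logps: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: advantage estimates.
    All names and the epsilon default are assumptions for this example.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio r = pi_new(a|s) / pi_old(a|s)
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - epsilon, 1 + epsilon]
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

In training, this quantity would be maximized by gradient ascent with an autodiff framework. The pure-Python version above only evaluates it, which is enough to show that a sample stops contributing extra gain once its ratio leaves the clipping range.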

But a few years later, what really pushed PPO onto a bigger stage was not traditional reinforcement learning tasks such as Atari and robot control, but large language models.

From RLHF to today's RLVR, PPO has become one of the indispensable basic algorithms in the post-training of large models. According to Schulman, this second wave of popularity in the LLM era even exceeded the expectations behind the original paper.

This reads less like Schulman complaining about the rejection and more like a reflection in hindsight: the real influence of a technology often unfolds in ways its inventor did not initially anticipate.

Seeing this, many people naturally wonder: why was PPO rejected back then?

Schulman later explained that, at the time, the paper was judged to offer limited novelty, and its improvement over existing baseline methods was not considered significant enough.

Some netizens commented, "This actually reflects a misalignment between academic evaluation and real-world industrial needs. The academic community often values novelty and relative improvement over baselines in small-scale, controlled experimental environments, while the real world cares more about whether a method can be scaled up, whether it can remain stable in complex systems, and whether it can really work."

Schulman himself seems quite calm about it. He said that it was a long time ago, and that he hopes the academic community has, over the years, gradually come to understand and embrace this "simple but scalable" aesthetic.

What really surprised him is how long the PPO paper and its objective function have remained influential. It is often difficult to tell at the outset whether a change to an algorithm is a minor adjustment that will be quickly forgotten and replaced, or a fundamental component that will stay in the system for a long time and be hard to surpass.

The story of PPO illustrates exactly this point.

In fact, PPO is not alone. Many works in the history of AI that later proved to have far-reaching influence were rejected by top conferences when they were first submitted.

LSTM: Rejected by NIPS in 1996, it was considered too complex and lacking biological plausibility at that time. But later it became the core technology for sequence modeling tasks such as speech recognition and machine translation.

SIFT: Rejected by ICCV 1997 and CVPR 1998 because its engineering steps were considered cumbersome and inelegant. But it dominated computer vision in the pre-deep-learning era for more than a decade.

Dropout: Rejected by NIPS in 2012, it was considered an engineering hack with insufficiently rigorous theoretical explanations. But it later became one of the most important regularization methods for deep neural networks and won the NeurIPS Test of Time Award.

Sometimes, time is the strictest and fairest reviewer.

This article is from the WeChat official account “MachineHeart” (ID: almosthuman2014). The author is someone who focuses on RL. It is published by 36Kr with authorization.

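The views expressed in this article are solely those of the author. The 36Kr platform only provides information storage services.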
