Apple quickly withdrew the RLAX paper: It used Google's TPU and Alibaba's Qwen, and Pang Ruoming was among the authors.
Yesterday, a new paper from Apple was made public on arXiv and then quickly removed. The reason is unknown.
However, the submission history shows that the paper was submitted to arXiv on December 6 (UTC) and withdrawn by the 11th, just five days later. Such a lightning-fast takedown after going live naturally makes people curious about what exactly happened.
Fortunately, the v1 version of the paper has been archived online, so we can still open it and find out.
In the paper, Apple revealed a TPU-based scalable RL framework they developed, RLAX.
Yes, you read that right. Not GPUs, nor Apple's own M-series chips, but Google's TPUs! Moreover, Amazon's cloud and China's Qwen model were also used in this research.
Paper title: RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs
Paper address: https://arxiv.org/pdf/2512.06392v1
All in all, the paper makes quite a few contributions.
However, before getting into the research itself, it's worth first taking a look at the author list.
The authors of RLAX
The RLAX paper has a total of four core authors: Runlong Zhou, Lefan Zhang, Shang-Chen Wu, and Kelvin Zou.
The corresponding authors are Kelvin Zou and Cheng Leong. Among them, Kelvin Zou used to be a Principal Engineer at Apple and has now joined Meta as an AI research scientist. Cheng Leong is a veteran who has worked at Apple for over 13 years and is currently the head of Apple's AI Infra (Artificial Intelligence Infrastructure).
Screenshot from LinkedIn
In addition, we also saw the name Pang Ruoming in the author list.
The name of this former head of Apple's AI, who has since joined Meta, appears alongside six other authors at the bottom of the first page with the note that they "have left Apple; they made contributions to this work during their employment at Apple." Most of them left the company only in the past few months.
A simple search of the resumes of these six authors shows:
- Kelvin Zou joined Meta
- Hanzhi Zhou has joined OpenAI
- Ye Ke joined Anthropic
- Floris Weers joined a stealth startup as a founding engineer
- Chong Wang also joined Meta
- Yi Zhang is now researching model inference at xAI.
RLAX: Built to survive TPU preemption
Back to the technology itself. There's no need to elaborate on the importance of reinforcement learning (RL) for modern reasoning models: almost all top models are RL-trained reasoning models, including OpenAI o3, Claude 4, Grok 4, Gemini 2.5, DeepSeek R1, and Qwen 3.
RLAX, developed by Apple, is a reinforcement learning framework designed specifically for the efficient execution of state-of-the-art RL algorithms on large-scale distributed TPU clusters.
Extreme decoupling and preemptive scheduling
RLAX adopts a Parameter-Server architecture. The Master Trainer regularly pushes the updated model weights to the parameter server. Meanwhile, a group of Inference Workers pulls the latest weights and generates new sampling data (Rollouts).
The team introduced a set of system-level technologies to logically separate the trainer, inference workers, and verifiers. This logical separation enables RLAX to flexibly and independently allocate computing resources to each component.
Most importantly, RLAX fully supports preemptive scheduling. This means that when a higher-priority task (such as an online inference load) requires it, the system can immediately reclaim TPU resources without causing the training to crash.
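The decoupled loop described above can be sketched as a minimal, single-process simulation. All names here are hypothetical illustrations, not RLAX's actual API: a trainer pushes versioned weights to a parameter server, and inference workers pull the latest version before generating rollouts.

```python
import threading
import queue

class ParameterServer:
    """Holds the latest versioned weights; the trainer pushes, workers pull."""
    def __init__(self):
        self._lock = threading.Lock()
        self._weights = None
        self._version = 0

    def push(self, weights):
        with self._lock:
            self._weights = weights
            self._version += 1

    def pull(self):
        with self._lock:
            return self._weights, self._version

def inference_worker(server, rollout_queue):
    # Pull the freshest weights, generate a rollout, and tag it with the
    # weight version so the trainer can later measure its staleness.
    weights, version = server.pull()
    rollout = f"rollout-from-v{version}"
    rollout_queue.put((rollout, version))

# Single-step simulation of the loop.
server = ParameterServer()
rollouts = queue.Queue()
server.push({"layer0": [0.1, 0.2]})   # trainer pushes updated weights
inference_worker(server, rollouts)    # worker pulls weights and samples
rollout, produced_at = rollouts.get()
print(rollout, produced_at)           # rollout-from-v1 1
```

Because workers only ever interact with the server, not the trainer, either side can be scaled or preempted independently, which is the point of the logical separation.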
Flexible policy support
RLAX aims to solve the key challenges in the post-training RL process of large-scale LLMs, especially how to efficiently handle On-policy and Off-policy RL.
To this end, RLAX provides programmable configuration options. Users can enforce staleness bounds, specifying both how often inference workers pull new weights and the maximum staleness of rollouts the trainer will tolerate. This lets users move flexibly along the spectrum between on-policy and off-policy RL.
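A staleness bound of this kind can be captured in a few lines. The config fields and function below are illustrative assumptions, not RLAX's real option names; the idea is simply that a rollout is accepted only if the weights that produced it are recent enough.

```python
from dataclasses import dataclass

@dataclass
class RLConfig:
    # How many trainer updates pass between worker weight pulls.
    pull_every_n_updates: int
    # Maximum tolerated version gap between the trainer's current
    # weights and the weights that produced a rollout.
    max_staleness: int

def accept_rollout(cfg: RLConfig, trainer_version: int, rollout_version: int) -> bool:
    """A rollout is usable only if its weights are fresh enough."""
    return trainer_version - rollout_version <= cfg.max_staleness

# A zero staleness bound forces strictly on-policy training;
# a looser bound permits off-policy reuse of older rollouts.
on_policy  = RLConfig(pull_every_n_updates=1, max_staleness=0)
off_policy = RLConfig(pull_every_n_updates=8, max_staleness=4)

print(accept_rollout(on_policy,  trainer_version=5, rollout_version=5))  # True
print(accept_rollout(on_policy,  trainer_version=5, rollout_version=4))  # False
print(accept_rollout(off_policy, trainer_version=5, rollout_version=2))  # True
```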
Oubliette: Throw the code into the dungeon
In the design of the verifiers, Apple engineers showed a unique sense of black humor.
The verifiers need to perform code execution verification for each programming language in the training corpus. To efficiently and deterministically verify Python programs, they containerized the standard Python dependencies.
To run large-scale code tests, they called Amazon's AWS Lambda service and named it "Oubliette".
The word "Oubliette" comes from French and originally referred to an underground dungeon in a castle with only one exit (usually a trapdoor in the ceiling), a place specifically used to "forget" prisoners.
Apple engineers used this word to metaphorize their stateless verification environment: The code and test data are thrown into this AWS Lambda-based "dungeon". After running the tests and spitting out the results, the entire environment is immediately destroyed, as if the code had never existed.
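The run-and-destroy pattern can be mimicked locally; the sketch below uses a throwaway temp directory and a fresh subprocess instead of AWS Lambda, and the function name `run_in_oubliette` is our own invention, not Apple's API. The key property is the same: after the verdict comes back, the environment and everything the code wrote are gone.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_in_oubliette(code: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Execute a code submission in a throwaway directory and a fresh
    interpreter; report (passed, stdout) and destroy the environment."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "submission.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            return proc.returncode == 0, proc.stdout
        except subprocess.TimeoutExpired:
            # A hung submission is simply forgotten.
            return False, ""
    # On exit, workdir and all its contents are deleted.

ok, out = run_in_oubliette("print(2 + 2)")
print(ok, out.strip())  # True 4
```

A real deployment on Lambda gets the destruction for free, since each stateless invocation's environment is reclaimed by the platform; a local subprocess sketch like this offers far weaker isolation and is only meant to show the shape of the idea.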
How does it perform?
Interestingly, during the experimental phase, we witnessed the birth of a "Frankenstein":
- Computing power: As the paper title indicates, not Apple's own chips, nor NVIDIA GPUs, but Google's TPU v5p (the experiments used 1,024 TPU v5p chips).
- Verification environment: To run large-scale code tests, they called Amazon's AWS Lambda service.
- Base model: The model used to validate the framework is not the base of Apple Intelligence, but QwQ-32B, the open-source model from Alibaba's Qwen team in China.
Yes, Apple engineers in the United States are using Google's TPUs and Amazon's Serverless services to optimize a Chinese open-source Qwen model.
The results are quite impressive: on 1,024 v5p TPUs, RLAX took only 12 hours and 48 minutes to raise QwQ-32B's pass@8 accuracy by 12.8%, while remaining robust to task preemption during training.
This "melting pot of US and Chinese technologies" is almost unimaginable in Apple's previously closed ecosystem. It indirectly confirms two things: first, in the field of AI infra, pragmatism is overriding parochialism; second, the dominance of Chinese open models (especially Qwen and DeepSeek) in code reasoning has become so strong that even Apple can't resist using them as a whetstone.
The disappearing 1.0: A hardcore numerical ghost
On pages 4 and 9 of the RLAX paper, Apple disclosed a bug that would make system engineers shiver.
In reinforcement learning, on-policy training rests on a theoretical cornerstone: the importance sampling ratio r(θ) should always equal exactly 1.0, because the behavior policy and the current policy are identical.
But in actual TPU training, the Apple team found that 1.0 is not equal to 1.0.
The root cause lies in the non-associativity of bfloat16 floating-point arithmetic. Simply put, in a computer, (a + b) + c and a + (b + c) can produce results that differ at the bit level.
- During inference: the JAX compiler fuses operators (kernel fusion) aggressively for maximum speed.
- During training: to compute gradients for backpropagation, the compiler must retain intermediate values, which leads to a different operator-fusion strategy than at inference time.
This slight difference in computation order is magnified under bfloat16, causing the probabilities computed on the inference side and the training side to diverge, which in turn causes training to collapse.
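The non-associativity at the heart of this bug can be reproduced in a few lines of plain Python. This demo runs in float64, where the drift is a single last bit; bfloat16's 8-bit mantissa makes the same effect orders of magnitude larger.

```python
import math

# Floating-point addition is not associative: the same three numbers
# summed in a different order can disagree in the last bit.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False

# Summed token log-probs behave the same way: if training and inference
# reduce in different orders, the "on-policy" importance ratio
# exp(logp_train - logp_infer) drifts away from exactly 1.0.
diff = left - right
ratio = math.exp(diff)  # ~1 + 1e-16 in float64; far larger under bfloat16
print(diff, ratio)
```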
Apple's solution is brute-force but effective: they force rematerialization in the trainer, disable the saving of most activation values, and make the training-side computation graph "imitate" the inference-side calculation order. A little speed is sacrificed, but the numerical problem is eliminated.
For engineers working on LLM post-training, this debugging process is a valuable reference.
In conclusion
Although the paper has been removed, RLAX proves that Apple still has world-class engineering capabilities in AI infrastructure. They can handle the most complex distributed systems and solve the most fundamental numerical problems.
However, as many important figures have dispersed to Meta, OpenAI, Anthropic, and xAI, this paper seems to have become a footnote for Apple's AI at this stage.
This article is from the WeChat official account "Machine Intelligence" (ID: almosthuman2014), author: Panda, published by 36Kr with authorization.