
More papers from Ruoming Pang's time at Apple? Easing the dilemma of high-quality pretraining data running dry

机器之心 | 2025-09-23 18:59
The ceiling on usable data scale has been raised once again.

Several months ago, Ruoming Pang, a distinguished engineer and head of Apple's foundation model team, left Apple for Meta, where Mark Zuckerberg reportedly spent some $200 million to recruit him into the superintelligence team.

According to Ruoming Pang's LinkedIn profile, he has been working at Meta for about three months.

Somewhat to our surprise, over the past two months, work that Ruoming Pang contributed to while at Apple has continued to be published, including some high-value research.

During his time at Apple, Ruoming Pang led the company's foundation model team, which was chiefly responsible for developing the core foundation models behind Apple Intelligence and other AI features. His work has been highly influential in advancing large foundation models.

One example is the piece of research introduced below:

Paper Title: Synthetic bootstrapped pretraining

Paper Link: https://arxiv.org/html/2509.15248v1

Large-scale language models are trained on vast amounts of Internet text. Under the scaling laws, the larger and more diverse the data, the stronger the model's capabilities.

However, the data that can be obtained from the Internet cannot grow indefinitely. To be precise, we have reached a bottleneck in real-world data scale: high-quality text is rapidly drying up. Having hit this "scaling barrier", it is urgent to rethink how to use existing data more efficiently when training large models.

In large-model training, the success of pretraining mainly relies on the rich causal associations between tokens within a single document. However, this is not the only source of correlation in a pretraining dataset. For example:

A code document implementing the attention mechanism often derives from the arXiv preprint of the Transformer paper;

The novel "Harry Potter" has structural similarities with its movie script.

These phenomena indicate that, in addition to the strong correlations within a document, there are also weaker cross-document correlations, which arise from a latent joint distribution over the pretraining documents.

Based on the above findings, the research team put forward a hypothesis:

This additional signal is ignored in the standard pretraining process, but it can be captured through synthetic data. This provides an underexplored path for improving model performance.

To fully exploit this untapped opportunity, the researchers proposed Synthetic Bootstrapped Pretraining (SBP), a new language model pretraining procedure consisting of three steps:

Similar document pair identification: SBP first identifies semantically similar document pairs d1 and d2 in the pretraining dataset, such as a Transformer paper and its code implementation.

Conditional modeling: SBP then models the conditional probability of d2 given d1 to build a "data synthesizer", which can generate new, relevant documents given a seed document.

Data expansion: Finally, SBP applies the trained conditional synthesizer to the entire pretraining corpus to generate a large-scale new text corpus. This corpus explicitly encodes the cross-document correlations that were not utilized in the original pretraining.

By directly training the data synthesizer from the pretraining corpus, SBP avoids the pitfall of relying on an external teacher language model to "boost" performance, thus ensuring that the improvement comes from better utilization of the same pretraining data.

The three-step process of SBP: (1) identify semantically similar document pairs through approximate nearest-neighbor search, (2) train a synthesizer model to generate relevant content, and (3) scale up synthesis to create a large corpus for joint training with the original data.

Core Problem

Large-scale language models are facing the so-called "scaling barrier": the high-quality, unique text available for pretraining is rapidly drying up. Standard pretraining relies mainly on next-token prediction to learn token-level dependencies within a single document. Although this approach has achieved remarkable results in practice, it largely ignores a potentially very rich signal: the relationships between different documents in the corpus.

For example, a research paper and its corresponding code repository, or a novel and its film adaptation, are deeply connected at the conceptual level even though they differ greatly in form and style. Existing pretraining paradigms treat them as completely unrelated samples, discarding the value contained in these cross-document relationships.

Synthetic Bootstrapped Pretraining (SBP) aims to solve this problem by converting cross-document correlations into new training signals.

SBP converts cross-document relationships into synthetic training data through three sequentially executed steps:

Step 1: Nearest-neighbor pairing

First, semantically similar document pairs are identified in the original pretraining corpus. Specifically, each document is encoded into a 1024-dimensional vector by a smaller external model (Qwen3-Embedding-0.6B). The system then uses ScaNN with 8-bit quantization for approximate nearest-neighbor search to keep the computation efficient.

When the similarity score of a document pair exceeds a 0.75 threshold, the pair is considered relevant enough and added to the candidate set. To avoid corpus redundancy, a key filtering step checks overlap based on "shingles" (13-token sliding windows) and removes near-duplicate document pairs, ensuring that the paired results are genuinely novel rather than simple repetitions.
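The sketch below illustrates this pairing step under some simplifying assumptions: a random-vector placeholder stands in for Qwen3-Embedding-0.6B, a brute-force cosine search stands in for ScaNN's quantized ANN index, and the shingle-overlap cutoff is a hypothetical value not taken from the paper.

```python
# Illustrative sketch of SBP Step 1: pairing semantically similar documents.
# Exact brute-force cosine search stands in for ScaNN's 8-bit quantized ANN index;
# the embedding placeholder and helper names are assumptions, not the paper's code.
import numpy as np

SIM_THRESHOLD = 0.75       # similarity cutoff reported in the article
SHINGLE_LEN = 13           # 13-token sliding windows for the near-duplicate check
MAX_SHINGLE_OVERLAP = 0.5  # hypothetical overlap cutoff for "too similar"

def embed(docs):
    """Placeholder for Qwen3-Embedding-0.6B: returns L2-normalized 1024-d vectors."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(docs), 1024)).astype(np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def shingles(doc, n=SHINGLE_LEN):
    toks = doc.split()  # whitespace tokens stand in for the real tokenizer
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def shingle_overlap(d1, d2):
    s1, s2 = shingles(d1), shingles(d2)
    return len(s1 & s2) / max(min(len(s1), len(s2)), 1)

def pair_documents(docs):
    """Return (i, j) index pairs that are semantically close but not near-duplicates."""
    vecs = embed(docs)
    sims = vecs @ vecs.T  # cosine similarity (vectors are normalized)
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if sims[i, j] < SIM_THRESHOLD:
                continue  # not related enough
            if shingle_overlap(docs[i], docs[j]) > MAX_SHINGLE_OVERLAP:
                continue  # near-duplicate, adds no new signal
            pairs.append((i, j))
    return pairs

if __name__ == "__main__":
    corpus = ["attention is all you need ...", "a pytorch implementation of attention ..."]
    print(pair_documents(corpus))
```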

Step 2: Synthesizer tuning

Based on the identified document pairs, SBP trains a conditional language model to learn the relationship patterns between similar documents. Notably, this "synthesizer" uses the same Transformer architecture as the main language model and is initialized from an existing pretraining checkpoint, thus inheriting the basic language understanding ability.

The goal of the synthesizer is to maximize the conditional likelihood of the related document given the seed document, i.e., the expectation over identified pairs of

E_(d1, d2) [ log p(d2 | d1) ],

where d1 is the seed document and d2 is the related document. This training process teaches the model how the same concept can be expressed across different document types, writing styles, and contexts.
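As an illustration of how such a conditional objective is commonly implemented, the sketch below concatenates d1 and d2 with a separator token and masks the loss so that only d2's tokens are predicted. It assumes a Hugging Face-style causal LM interface; the separator choice and helper names are ours, not the paper's.

```python
# Sketch of the synthesizer objective: maximize log p(d2 | d1).
# d1 and d2 are concatenated and the loss is masked so that only d2's tokens
# contribute; model(...) is assumed to return an object with a .logits tensor,
# as Hugging Face causal LMs do.
import torch
import torch.nn.functional as F

def synthesizer_loss(model, d1_ids, d2_ids, sep_id):
    """Negative log-likelihood of d2 given d1 for one document pair.

    d1_ids, d2_ids: 1-D LongTensors of token ids.
    model: causal LM returning logits of shape (1, seq_len, vocab).
    """
    sep = torch.tensor([sep_id], dtype=torch.long)
    input_ids = torch.cat([d1_ids, sep, d2_ids]).unsqueeze(0)  # (1, T)
    logits = model(input_ids).logits                           # (1, T, V)

    # Shift so position t predicts token t+1, then keep only d2 positions.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    ctx_len = d1_ids.numel() + 1            # d1 plus separator
    shift_labels[:, : ctx_len - 1] = -100   # ignore context (d1 + sep) targets

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```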

Step 3: Large-scale data synthesis

The trained synthesizer is applied to the entire original corpus to generate a large new corpus. Specifically, for each seed document d1 sampled from the original corpus, the synthesizer generates a new document d2 through temperature sampling (temperature = 1.0, top_p = 0.9).

After generation, the system filters the synthetic results to remove documents with excessive internal repetitions to ensure the quality of the synthetic corpus. Finally, the synthetic corpus is combined with the original dataset for joint training of the main language model. A core principle is that synthetic documents are not reused during training.
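A minimal sketch of this synthesis loop is shown below, assuming a Hugging Face-style generate() interface with the sampling settings reported above; the repetition filter (fraction of repeated 4-grams) is an illustrative heuristic, not the paper's exact criterion.

```python
# Sketch of SBP Step 3: synthesize one new document per seed and filter
# repetitive outputs. The MAX_REPEAT_FRACTION cutoff and the n-gram heuristic
# are assumptions for illustration only.
from collections import Counter

MAX_REPEAT_FRACTION = 0.2  # hypothetical: reject if >20% of 4-grams are repeats

def too_repetitive(text, n=4):
    toks = text.split()
    if len(toks) < n:
        return False
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams) > MAX_REPEAT_FRACTION

def synthesize_corpus(synthesizer, tokenizer, seed_docs, max_new_tokens=1024):
    synthetic = []
    for d1 in seed_docs:
        inputs = tokenizer(d1, return_tensors="pt")
        out = synthesizer.generate(
            **inputs,
            do_sample=True,
            temperature=1.0,      # sampling settings reported in the article
            top_p=0.9,
            max_new_tokens=max_new_tokens,
        )
        # Keep only the newly generated continuation as the synthetic document d2.
        d2 = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        if not too_repetitive(d2):
            synthetic.append(d2)  # each synthetic document is used once, never repeated
    return synthetic
```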

Theoretical Basis

The authors explain the effectiveness of SBP from a Bayesian perspective. They model document generation as sampling from a posterior distribution over latent concepts: each document d is viewed as an expression of some latent concept c, so generating a related document amounts to inferring c from the seed document and then sampling a new document conditioned on it. In learning the conditional p(d2 | d1), the synthesizer implicitly infers these latent concepts from the seed document and then generates new documents that express the same concept in different ways.
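One compact way to write this latent-concept view (the notation below is ours, not taken from the paper):

```latex
% The synthesizer's conditional is read as: infer a concept c from the seed
% document d_1, then generate a new document d_2 expressing the same concept.
\[
  p(d_2 \mid d_1) \;=\; \int p(d_2 \mid c)\, p(c \mid d_1)\, \mathrm{d}c,
  \qquad
  p(c \mid d_1) \;\propto\; p(d_1 \mid c)\, p(c).
\]
```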

This approach enables the language model to encounter the same knowledge in diverse forms during training, thus obtaining stronger generalization and expression abilities.

Experimental Results

The study uses a 3B-parameter Transformer model based on the Llama 3 architecture, trained on a customized version of the DCLM dataset containing 582 million documents and 482 billion tokens. SBP is validated across multiple scales and evaluation metrics.

The test loss curves show that SBP (red) consistently outperforms the repetition baseline (black) and approaches the performance of the "oracle" model (gray dotted line) trained on a much larger amount of unique data.

Performance Improvement

SBP shows consistent improvements over a strong baseline at both the 200B-token and 1T-token training scales.

At the 200B scale, the method achieves 42% of the performance gain obtained by an "oracle" model with access to more than 20x the unique data; at the 1T scale, it reaches 49%. These results indicate that SBP extracts a substantial amount of additional signal from a fixed dataset.
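These percentages can be read as the fraction of the oracle's gain over the repetition baseline that SBP recovers. The toy numbers below are made up purely to illustrate the calculation:

```python
# How the "fraction of the oracle's gain" figure can be computed
# (accuracy values below are invented for illustration only).
baseline_acc = 0.500   # repetition baseline
sbp_acc      = 0.521   # SBP
oracle_acc   = 0.550   # oracle trained on ~20x more unique data

gain_fraction = (sbp_acc - baseline_acc) / (oracle_acc - baseline_acc)
print(f"SBP recovers {gain_fraction:.0%} of the oracle's gain")  # -> 42%
```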

A comparison of the performance gains of SBP and the oracle over the repetition baseline. On average, SBP's improvement in question-answering accuracy is about 47% of the improvement the oracle achieves with 20x more unique data.

The training dynamics show that although SBP may lag slightly behind the baseline early on, its performance keeps improving as training progresses, while the baseline levels off. This suggests that the synthetic data provides genuinely new information rather than simple repetition.

Quality Analysis

A qualitative examination of the synthetic documents shows that SBP goes beyond simple paraphrasing. The synthesizer abstracts the core concepts from the seed document and creates new narratives around them. For example, a seed document about a coffee shop in San Diego may yield synthetic content about espresso machine comparisons or coffee-culture essays, introducing new perspectives and information while maintaining topical relevance.

A comparison between the original text and synthetic text variants.

Quantitative analysis confirms that the synthetic data matches real data in diversity and lack of repetition, and that factual accuracy improves significantly at larger training scales.

Quantitative evaluation of documents sampled from the synthesizer at 200B and 1T scales.

Significance and Impact

SBP addresses a fundamental challenge in the sustainable development of large language models by shifting the focus from obtaining more data to extracting more value from existing data. The method offers several key advantages:

Data efficiency: By learning cross-document correlations, SBP enables the model to obtain richer training signals from a fixed corpus, potentially extending the useful life of existing datasets.

Self-improvement: Unlike methods that rely on external teacher models or manual annotation, SBP improves performance through self-bootstrapping with the same architecture and data, making it broadly applicable.

Theoretical basis: The Bayesian interpretation provides a principled understanding of why the method works, indicating that it achieves a form of concept-level learning beyond surface-level token patterns.

Complementary benefits: Experiments show that SBP's improvements are orthogonal to those from scaling model size, indicating that it can be integrated into existing scaling strategies for additional performance gains.

This work opens up new research directions for data-efficient training and shows that significant improvements can still be squeezed out of existing datasets through more sophisticated utilization strategies. As the field approaches fundamental data limits, methods like SBP may become crucial for the continued progress of language model capabilities.

For more information, please refer to the original paper.

This article is from the WeChat official account "机器之心" (Machine Heart). Editor: Leng Mao. Republished by 36Kr with permission.