
Apple proposes a new type of backpropagation: A single iPhone 15 Pro Max can fine-tune large language models.

机器之心 · 2025-10-30 11:01

Running large models locally on an iPhone is no longer a novelty, but can we fine-tune models on an iPhone?

Recently, Apple demonstrated its feasibility in a research paper proposing Memory-Efficient Backpropagation (MeBP). MeBP offers a better trade-off between memory usage and computation time than zeroth-order optimization (ZO), and it converges faster and reaches higher quality than the ZO baseline. The team also verified MeBP's effectiveness on an iPhone 15 Pro Max.

The Apple team (Congzheng Song and Xinyu Tang) also stated in the paper that they would release an implementation of MeBP, but the public repository is currently empty.

Paper Title: Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices

Paper URL: https://arxiv.org/abs/2510.03425

Repository URL: https://github.com/apple/ml-mebp

Memory-Efficient Backpropagation (MeBP)

In this paper, the Apple team focused on fine-tuning LLMs using LoRA. Therefore, the main memory bottlenecks lie in model parameters and intermediate activations. The team's goal was to keep the memory usage of fine-tuning within an acceptable range for modern mobile devices, such as the "less than 1GB" suggested by PocketLLM.
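
For reference, LoRA keeps the base weights frozen and trains only a small low-rank update on top of them. The sketch below shows the idea for a single linear layer; the rank, scaling, and initialization are illustrative choices, not the paper's configuration.

```python
import torch

class LoRALinear(torch.nn.Module):
    """Frozen base weight plus a trainable low-rank update (sketch)."""

    def __init__(self, d_in: int, d_out: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        # Base weight is frozen; only lora_a / lora_b receive gradients.
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        self.lora_a = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_b = torch.nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scale * x A^T B^T, with A and B the low-rank factors.
        return x @ self.weight.T + self.scale * (x @ self.lora_a.T) @ self.lora_b.T
```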

Fine-tuning an LLM on a device using MeBP involves three steps:

Compress the base model weights (frozen parameters) to reduce disk space usage

Compile a training graph that includes backpropagation and gradient checkpointing to optimize memory

Implement a memory-efficient runtime to execute the compiled training graph.

Each step will be described in detail below.

Base Model Weight Compression

When deploying an LLM on a device, compressing the base model weights to reduce disk space usage is a common practice.

In the team's implementation, they applied symmetric-mode INT4 (4-bit) quantization to the non-LoRA parameters, including the embeddings.
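
As an illustration, here is a minimal NumPy sketch of symmetric 4-bit quantization with per-row scales; the grouping granularity, packing, and function names are assumptions made for the example, not details from the paper.

```python
import numpy as np

def quantize_int4_symmetric(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric 4-bit quantization with one scale per row of w.

    Returns integer codes in [-8, 7] and per-row scales.
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True)           # per-row range
    scale = np.maximum(max_abs / 7.0, 1e-12)                  # map max |w| to 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit codes
    return q, scale

def dequantize_int4_symmetric(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights when a block is needed."""
    return q.astype(np.float32) * scale

# Example: compress a frozen projection matrix, then reconstruct it on demand.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int4_symmetric(w)
w_hat = dequantize_int4_symmetric(q, s)
```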

Gradient Checkpoint Compilation

To implement gradient checkpointing in MeBP, the team first split the LLM into multiple blocks so that the memory consumed by backpropagating through a single block (e.g., one transformer layer) stays within the device's memory limit. For each block F_i that produces activations to be checkpointed, a backward graph is generated by applying automatic differentiation to the output of F_i. For example, assuming y = F_i(x, w) is the forward graph of block F_i, automatic differentiation is performed on the scalar

s = Σ ( y ⊙ ∂E/∂y ),

where E represents the loss to be optimized and ⊙ represents the Hadamard (element-wise) product. The backward graph can then be generated as

B_i(x, w, ∂E/∂y) = (∂s/∂x, ∂s/∂w) = (∂E/∂x, ∂E/∂w),

where the incoming gradient ∂E/∂y is output by the backward graph B_{i+1} of the following block.

That is, the inputs to the backward graph are the checkpointed activations, the gradient propagated back from the next block, and the corresponding trainable weights; its outputs are the gradients with respect to the checkpointed activations and the trainable weights.
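
This is the standard vector-Jacobian-product construction, and it can be sketched in a few lines of PyTorch; block_forward here is a stand-in for a real transformer block, and calling torch.autograd.grad at runtime is illustrative rather than Apple's compiled-graph implementation.

```python
import torch

def block_forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Stand-in for one block F_i; w plays the role of a trainable LoRA weight."""
    return torch.tanh(x @ w)

def block_backward(x: torch.Tensor, w: torch.Tensor,
                   grad_y: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Backward graph B_i: differentiate the scalar s = sum(y ⊙ ∂E/∂y).

    Its gradients with respect to x and w equal ∂E/∂x and ∂E/∂w by the chain rule.
    """
    x = x.detach().requires_grad_(True)
    w = w.detach().requires_grad_(True)
    y = block_forward(x, w)                     # recompute from the checkpoint
    s = (y * grad_y).sum()                      # scalar used for autodiff
    grad_x, grad_w = torch.autograd.grad(s, (x, w))
    return grad_x, grad_w                       # ∂E/∂x feeds the previous block
```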

Subsequently, the forward and backward graphs of all blocks are serialized into a format compatible with the on-device runtime, such as the Model Intermediate Language (MIL) representation or functions exported by MLX.

At runtime, these serialized graphs will be deserialized and compiled for computation.

Runtime Implementation

Algorithm 1 outlines the runtime implementation of MeBP.

The model is first initialized using the InitializeModel function, and then the Backpropagation function is called for each data point in the training loop. During InitializeModel, the compressed base model weights are memory-mapped. To minimize memory usage, the base model weights are not decompressed before the training loop starts. Instead, they are decompressed and loaded on demand when needed for computation. Note that for device runtime frameworks that support computation using quantized weights, the decompression step can be skipped, and only the compressed weights need to be loaded on demand.
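
The on-demand loading can be sketched roughly as follows, with numpy.memmap standing in for whatever storage format the actual runtime uses; the class, file layout, and the choice to store one INT4 code per byte are illustrative assumptions.

```python
import numpy as np

class LazyWeightStore:
    """Memory-map compressed base weights and decompress blocks on demand."""

    def __init__(self, codes_path: str, scales_path: str, shape: tuple[int, int]):
        # The INT4 codes and their scales stay on disk; the OS pages slices in
        # only when they are actually touched.
        self.codes = np.memmap(codes_path, dtype=np.int8, mode="r", shape=shape)
        self.scales = np.memmap(scales_path, dtype=np.float32, mode="r",
                                shape=(shape[0], 1))

    def load_block(self, rows: slice) -> np.ndarray:
        """Decompress only the rows needed by the next forward/backward subgraph."""
        q = np.asarray(self.codes[rows], dtype=np.float32)
        return q * np.asarray(self.scales[rows])
```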

In the Backpropagation function, the system first executes the compiled forward subgraphs to store all necessary checkpoints; then, it executes the compiled backward subgraphs in reverse order, using the stored checkpoints to calculate gradients. During forward propagation, these checkpoints are memory-mapped rather than kept in memory.

Before each forward and backward pass, only the base model weights needed by that subgraph are decompressed and loaded. The total memory usage is therefore bounded by the size of those required weights plus the peak memory of the operations in the subgraph, which is much smaller than the full size of the base model weights. This function describes the gradient computation for a single data point; for batched inputs, gradient accumulation can be used to compute gradients without increasing memory usage.
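
Putting these pieces together, a simplified version of the Backpropagation function might look like the sketch below. It reuses the per-block forward/backward functions from the earlier sketch and keeps checkpoints in RAM for brevity, whereas MeBP memory-maps them.

```python
import torch

def backpropagation(blocks, x, loss_fn, target):
    """One MeBP-style gradient computation for a single example (sketch).

    blocks: list of (forward_fn, backward_fn, lora_weight) triples, in order.
    Returns one LoRA gradient per block; for batches, accumulate these
    gradients across examples instead of enlarging the graphs.
    """
    # Forward sweep: no graph is kept; only each block's input (the checkpoint)
    # is stored, not the intermediate activations.
    checkpoints = []
    h = x
    with torch.no_grad():
        for fwd, _, w in blocks:
            checkpoints.append(h)
            h = fwd(h, w)

    # Gradient of the loss w.r.t. the final output starts the backward sweep.
    h = h.detach().requires_grad_(True)
    grad_y = torch.autograd.grad(loss_fn(h, target), h)[0]

    # Backward sweep: visit blocks in reverse, recomputing each from its checkpoint.
    lora_grads = [None] * len(blocks)
    for i in reversed(range(len(blocks))):
        _, bwd, w = blocks[i]
        grad_y, lora_grads[i] = bwd(checkpoints[i], w, grad_y)
    return lora_grads
```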

In MeBP, only one copy of the LoRA weights and their gradients is retained in memory for the optimizer.

For LLMs with a parameter count ranging from 0.5B to 4B, the size of the LoRA weights is usually in the range of several tens of MB, which is reasonable to store in memory. The optimizer state (e.g., momentum) can be memory-mapped and loaded lazily, just like the base model weights.
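
As a rough illustration of that layout, the sketch below keeps a momentum buffer in a memory-mapped file while the LoRA weights and their gradient stay in RAM; the plain SGD-with-momentum rule, the file path, and the class name are illustrative, not the optimizer used in the paper.

```python
import numpy as np

class MemmapMomentumSGD:
    """SGD with momentum whose state lives in a memory-mapped file (sketch)."""

    def __init__(self, lora_param: np.ndarray, state_path: str,
                 lr: float = 1e-4, beta: float = 0.9):
        self.p, self.lr, self.beta = lora_param, lr, beta
        # The momentum buffer is backed by disk and paged in lazily by the OS,
        # so only the LoRA weights and their current gradient must stay in RAM.
        self.m = np.memmap(state_path, dtype=np.float32, mode="w+",
                           shape=lora_param.shape)

    def step(self, grad: np.ndarray) -> None:
        self.m[:] = self.beta * self.m + grad   # update momentum in place
        self.p -= self.lr * self.m              # apply the update to the LoRA weights
        self.m.flush()                          # persist state between steps
```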

How does it perform in experiments?

How well MeBP works ultimately depends on how it performs in practice. As a baseline for comparison, the team chose MeZO, since it is currently the only known optimization method that has been applied to LLM fine-tuning on mobile devices. They evaluated the utility of MeZO and MeBP through server-side simulations and compared their on-device performance.

Utility Comparison

In terms of configuration, the Apple team used Gemma-3 and Qwen-2.5 to conduct experiments on the language modeling task using the WikiText-2 dataset, comparing the utility of first-order (FO) optimization (i.e., obtaining gradients through backpropagation) and zeroth-order (ZO) optimization. The team focused on models with a parameter count of no more than 4B because of the limited computational resources of mobile devices. The evaluation metrics used by the team were the loss on the evaluation set and the next-token accuracy. Other configurations can be found in the original paper. Here, we focus on the results.

As shown in Figure 1, although the loss and next-token accuracy of ZO show a convergence trend, the convergence speed of ZO is significantly slower than that of FO. The FO method significantly improved both metrics within the first 100 steps, while ZO only showed a slight improvement after 1,000 steps. Even after 100,000 steps (i.e., 100 times more optimization steps than FO), for the same model, the test loss of ZO was still higher than that of FO, and the test accuracy was lower.

The AI community has proposed several methods to improve the convergence speed of the ZO method. The team also conducted experiments using these improved ZO methods on Qwen2.5-0.5B, and the results are shown in the following figure.

Although these methods converge faster than "pure" ZO, their loss and next-token accuracy are still inferior to those of models fine-tuned using FO. In addition, these methods usually require more computation time per iteration because they need additional forward propagations to estimate gradients more accurately.

The utility results show that in LLM fine-tuning for language modeling tasks, from a "per-step" perspective, backpropagation converges significantly faster than the ZO method. This makes it more suitable for mobile deployment in terms of computation time, provided that each FO optimization step can be implemented efficiently.

Performance Comparison

Apple implemented MeBP in iOS using Swift and evaluated its performance on an iPhone 15 Pro Max with 8GB of RAM. For the MeZO baseline implementation, its forward graph was split into multiple subgraphs, and lazy decompression was applied to reduce the total memory usage of the base model weights. Each MeZO optimization step involves two forward propagations. Other settings can be found in the original paper.
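
For context, the two forward passes in a MeZO step come from a simultaneous-perturbation (SPSA-style) gradient estimate, which can be sketched as follows; this is a simplified version, not the baseline's exact implementation.

```python
import torch

def mezo_step(params, loss_fn, lr=1e-6, eps=1e-3, seed=0):
    """One zeroth-order step: two forward passes, no backpropagation (sketch)."""
    gen = torch.Generator().manual_seed(seed)
    # MeZO re-creates the perturbation from the seed instead of storing it;
    # here it is kept in a list for clarity.
    z = [torch.randn(p.shape, generator=gen) for p in params]

    with torch.no_grad():
        for p, zi in zip(params, z):
            p += eps * zi
        loss_plus = loss_fn()                  # forward pass 1: θ + εz
        for p, zi in zip(params, z):
            p -= 2 * eps * zi
        loss_minus = loss_fn()                 # forward pass 2: θ - εz
        grad_scale = (loss_plus - loss_minus) / (2 * eps)
        for p, zi in zip(params, z):
            p += eps * zi                      # restore θ
            p -= lr * grad_scale * zi          # projected-gradient update
```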

The results are shown in the following table.

Overall, MeBP takes 43% to 94% more computation time per gradient step than MeZO. However, as the utility comparison showed, MeZO needs 10 to more than 100 times as many steps as first-order optimization, so in wall-clock time MeBP converges much faster. In the worst case, MeBP uses 20% more memory than MeZO, but its total training memory usage is roughly an order of magnitude smaller than that of previous mobile-device implementations. All tested LLMs can be fine-tuned efficiently within 1GB of memory, making background training on a phone practical.

In addition, the team also tested the impact of decompression overhead and sequence length and analyzed the performance of each layer; details can be found in the original paper.

This article is from the WeChat official account "Almost Human" (ID: almosthuman2014), author: Almost Human, editor: Panda. It is published by 36Kr with authorization.