DeepSeek quietly open-sources LPLB: Solving MoE load imbalance with linear programming
Yesterday, DeepSeek launched a new code repository on GitHub: LPLB.
Project address: https://github.com/deepseek-ai/LPLB
There was no announcement tweet, no official account post, and few technical bloggers picked it up, so the release drew little attention. As of this writing, the project has fewer than 200 stars on GitHub.
On closer inspection, however, the project is far from trivial and deserves more attention. X user gm8xx8 commented that it suggests DeepSeek is addressing correctness and throughput bottlenecks in preparation for the release of its next model.
Project Introduction
LPLB, short for Linear-Programming-Based Load Balancer, is a parallel load balancer that uses a linear programming (LP) algorithm to optimize expert-parallel workload distribution in MoE (Mixture of Experts) models.
Specifically, LPLB achieves dynamic load balancing through the following three steps:
Dynamic Reordering: Reorder the experts based on workload statistics.
Build Replicas: Build expert replicas in combination with the static topology.
Solve for Optimal Allocation: Solve for the optimal Token allocation scheme for each batch of data.
More specifically, LPLB's expert reordering is assisted by EPLB. Real-time workload statistics can be supplied by the user, collected via torch.distributed, or read directly from the internal communicator of the DeepEP buffer. For the solve step, LPLB ships a built-in LP solver that implements a single-SM (Streaming Multiprocessor) Interior Point Method (IPM), using NVIDIA's cuSolverDx and cuBLASDx libraries for efficient dense linear algebra.
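The three steps above can be sketched in miniature. This is purely illustrative pseudocode in Python: the function names (`reorder_experts`, `build_replicas`, `solve_assignment`) are invented for this sketch and are not LPLB's actual API, and the final step stands in for the real LP solve with a simple move-to-the-mean heuristic.

```python
# Hypothetical sketch of the LPLB pipeline: reorder -> replicate -> solve.
# Function names are illustrative, not the library's real API.

def reorder_experts(loads):
    # Step 1 (EPLB-style): order expert ids from heaviest to lightest.
    return sorted(range(len(loads)), key=lambda e: -loads[e])

def build_replicas(order, num_replica_slots):
    # Step 2: replicate the most heavily loaded experts into spare slots.
    return order[:num_replica_slots]

def solve_assignment(loads, replicas):
    # Step 3: per batch, shift each replicated expert's excess tokens
    # onto its replica (a toy stand-in for the real LP solve).
    new = dict(enumerate(loads))
    target = sum(loads) / len(loads)
    for e in replicas:
        moved = max(0, int(new[e] - target))
        new[e] -= moved
        new[f"replica_of_{e}"] = moved
    return new

loads = [900, 100, 300, 700]          # tokens routed to 4 experts this batch
order = reorder_experts(loads)        # heaviest first: [0, 3, 2, 1]
replicas = build_replicas(order, 2)   # replicate the two heaviest: [0, 3]
print(solve_assignment(loads, replicas))
```

The real solver balances loads across many GPUs simultaneously under edge-capacity constraints; this sketch only conveys the shape of the loop.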
This targets the classic MoE load-imbalance problem: some "experts" receive far more tokens than others, leaving some GPUs busy while others sit idle.
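To make the problem concrete, here is a tiny illustration (the numbers are made up) of one common way to quantify the imbalance: the ratio of the busiest GPU's token count to the average, where 1.0 means perfectly balanced.

```python
# Hypothetical per-batch routing: GPU 0 gets most of the tokens.
tokens_per_gpu = [1200, 300, 250, 250]

# Imbalance factor: busiest GPU relative to the mean (1.0 = balanced).
mean_load = sum(tokens_per_gpu) / len(tokens_per_gpu)
imbalance = max(tokens_per_gpu) / mean_load
print(round(imbalance, 2))  # -> 2.4, i.e. GPU 0 does 2.4x the average work
```

Since the batch cannot finish until the busiest GPU finishes, that 2.4x factor translates almost directly into wasted time on the idle GPUs.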
X user big goose pointed out that this closely resembles NVIDIA's approach to scheduling SMs (Streaming Multiprocessors, the core compute units of NVIDIA GPUs), except that the abstraction is raised to the pipeline level. LPLB's emphasis on "single SM" means the solve itself is very lightweight and consumes few compute resources.
It should be noted, however, that LPLB has not yet been used in production. As DeepSeek states in the README: "LPLB is currently in the early research stage, and the performance improvement is still being evaluated."
Working Principle of LPLB
LPLB is an extension of EPLB (Expert Parallel Load Balancer), aiming to solve the problem of dynamic load imbalance in MoE training.
EPLB vs. LPLB
EPLB: Mainly deals with static imbalances (for example, due to data distribution characteristics, some experts are always overloaded in the long term).
LPLB: Focuses on dealing with dynamic fluctuations (instantaneous load jitters caused by the randomness of small batches of data during the training process).
Core Mechanism
Redundant Experts: Each redundant expert (replica) is linked to an original expert, thus forming connection edges between GPUs.
Edge Capacity: The capacity of an edge is defined as the number of Tokens allocated to the redundant expert in the current batch, which determines the maximum Token flow for load balancing.
LP Optimization: LPLB solves a linear programming problem to redistribute Tokens along these edges while complying with the edge capacity limit, so as to minimize the load imbalance within the expert parallel (EP) group.
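The edge-capacity mechanism can be shown with a toy two-GPU version, where the optimum has a closed form. This is an illustration of the objective only (minimize the load gap while respecting the edge's capacity); the real solver is a single-SM interior-point method operating over the whole EP group.

```python
# Toy illustration of the LP objective on a single replica edge:
# move tokens from the heavier GPU to the lighter one, but never
# more than the edge's capacity allows.

def rebalance(load_a, load_b, capacity):
    """Return the post-rebalance loads of two GPUs joined by one edge."""
    # Ideal flow halves the gap; the edge capacity caps it.
    flow = min(capacity, abs(load_a - load_b) // 2)
    if load_a > load_b:
        return load_a - flow, load_b + flow
    return load_a + flow, load_b - flow

print(rebalance(1000, 200, capacity=500))  # ample capacity -> (600, 600)
print(rebalance(1000, 200, capacity=100))  # capacity-limited -> (900, 300)
```

The second call shows why edge capacity matters: when the replica received few tokens in the current batch, only a small flow can be rerouted, and some imbalance necessarily remains.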
Implementation Process
First, EPLB selects the experts to be replicated (at this stage it only reorders; no replication happens yet).
Then, the most heavily loaded experts are replicated according to the chosen LPLB topology.
Communication Optimization: Real-time workload synchronization is optimized using NVLink and NVSHMEM in place of the traditional torch.distributed.allreduce, significantly reducing communication overhead. This is why DeepEP needs to be pre-installed.
Limitations
Although LPLB provides dynamic optimization, there are still some limitations:
Ignores Nonlinear Computational Costs: The current planner balances only total token counts and does not model the nonlinear time cost of Grouped GEMM, which may yield non-optimal performance in some cases.
Solving Delay: The solver takes about 100 µs for intra-node optimization (longer across nodes). For very small batch sizes, this latency may not be negligible.
Extreme Imbalance Situations: Under extreme global load imbalance, LPLB may perform worse than EPLB, because LPLB allocates redundant experts differently (it avoids assigning multiple replicas to the same original expert).
Typical Topologies
LPLB allows defining the distribution of expert replicas by modifying the r2o matrix. Here are several typical topologies:
Cube: Replicate experts on a subset of GPUs to form a cube graph with diagonal edges. This requires at least 2 experts per GPU. Applicable Scenario: Suitable for balancing within an 8-GPU EP subgroup without sacrificing cross-node communication performance.
Hypercube: Similar to the Cube but without diagonal edges. This requires 16 GPUs. Applicable Scenario: Suitable for expert parallelism across 16 GPUs.
Torus: Replicate one expert on neighboring GPUs within the same node and another expert on GPUs in neighboring nodes to form a torus graph. It requires at least 2 experts per GPU. Advantages and Disadvantages: Effective for global balance, but usually less efficient than the Cube due to more intra - node communication.
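One way to picture the Cube topology is to place the 8 GPUs at the corners of a 3-D cube, where the base edges connect GPUs whose 3-bit ids differ in exactly one bit. The sketch below builds only that base cube structure; the diagonal edges the README mentions (and the actual r2o matrix encoding) are omitted, so treat this purely as a visualization aid.

```python
# Illustration only: the 12 base edges of the Cube topology among
# 8 GPUs, connecting ids at Hamming distance 1. Diagonal edges and
# the real r2o matrix representation are not shown.

def cube_edges(num_gpus=8):
    edges = []
    for a in range(num_gpus):
        for b in range(a + 1, num_gpus):
            if bin(a ^ b).count("1") == 1:  # ids differ in exactly one bit
                edges.append((a, b))
    return edges

print(len(cube_edges()))  # -> 12 edges, 3 per GPU
```

Each GPU touches exactly 3 of these edges, which is what gives every replicated expert multiple neighbors to shed tokens to.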
Conclusion
The LPLB library open-sourced by DeepSeek essentially tackles the "barrel effect" in large-model training: like a barrel that holds only as much water as its shortest stave, overall training speed is gated by the slowest (most heavily loaded) GPU.
Its innovation lies in applying linear programming to compute the optimal allocation in real time, and in using low-level NVSHMEM technology to break through the communication bottleneck. For developers researching MoE training acceleration, this is a very valuable reference implementation.
For specific installation and testing guides, please visit the original code repository.
This article is from the WeChat official account "Machine Intelligence" (ID: almosthuman2014). Author: Machine Intelligence. Republished by 36Kr with permission.