A domestic large model trained with domestic GPUs has arrived, with energy consumption plummeting by 97.7%.
ZDXX reported on September 10th that on September 5th the Institute of Automation of the Chinese Academy of Sciences released a technical report on the brain-inspired spiking large model "Shunxi 1.0" (SpikingBrain-1.0). The open-source SpikingBrain-7B model reached about 90% of the performance of Qwen2.5-7B using only 2% of the pre-training data required by mainstream large models, and its performance is comparable to that of many open-source Transformer models such as Llama-3.1-8B.
The Institute of Automation of the Chinese Academy of Sciences stated that this is the first large-scale brain-inspired linear foundation model architecture proposed in China, and the first training and inference framework for a brain-inspired spiking large model built on a domestic GPU computing cluster.
Both training and inference of SpikingBrain were carried out entirely on domestic computing power, using a cluster built from Magic Core Technology's Xiyun C550 GPUs. During training, the cluster ran continuously for two weeks without interruption, which also demonstrates the feasibility of building a domestically controllable ecosystem around a new non-Transformer large model architecture.
In addition to extremely high data efficiency, SpikingBrain also achieved an order-of-magnitude improvement in inference efficiency: with a 1-million-token context, the time SpikingBrain-7B took to generate its first token was 96.2% shorter than that of Qwen2.5-7B.
This characteristic also makes SpikingBrain particularly suitable for ultra-long sequence processing tasks, such as legal and medical document analysis, complex multi-agent simulation, high-energy particle physics experiments, DNA sequence analysis, and molecular dynamics trajectories.
In terms of energy consumption, the model's average energy per multiply-accumulate (MAC) operation is 97.7% and 85.2% lower than that of conventional FP16 and INT8 operations, respectively.
▲Technical report of SpikingBrain-1.0
SpikingBrain-1.0 is available in two versions with 7B and 76B parameters. On September 3rd, the 7B version of the model was open-sourced on platforms such as GitHub and ModelScope. The 76B version of the model is not yet open-sourced, but an experience link is provided.
▲Experience interface of SpikingBrain-1.0
Open-source address: https://github.com/BICLab/SpikingBrain-7B
Technical report: https://github.com/BICLab/SpikingBrain-7B/blob/main/SpikingBrain_Report_Chi.pdf
Experience link: https://controller-fold-injuries-thick.trycloudflare.com/
01.Transformer Encounters Efficiency Bottlenecks, Seeking Inspiration from the Human Brain
Why is a new, non-Transformer large model architecture needed at all? The joint team behind SpikingBrain argues that the Transformer architecture has an inherent drawback: training compute grows quadratically with sequence length, and inference memory grows linearly with sequence length, consuming enormous resources. This limits the models' ability to handle ultra-long sequences (sequences of more than 1 million tokens).
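To get a feel for how fast these costs grow, here is a rough back-of-the-envelope estimate in Python; the layer, head, and precision figures are illustrative assumptions for a generic 7B-scale dense model, not measurements of any specific system.

```python
# Back-of-the-envelope scaling for standard softmax attention (illustrative config only).
def attention_cost(seq_len, num_layers=32, num_heads=32, head_dim=128):
    # The score matrix is seq_len x seq_len per head, so compute grows quadratically.
    flops = 2 * num_layers * num_heads * head_dim * seq_len ** 2
    # The KV cache stores keys and values for every past token, so memory grows linearly.
    kv_bytes = 2 * num_layers * num_heads * head_dim * seq_len * 2  # 2 bytes per FP16 value
    return flops, kv_bytes

for n in (8_000, 128_000, 1_000_000):
    flops, kv = attention_cost(n)
    print(f"{n:>9} tokens: ~{flops:.1e} attention FLOPs, ~{kv / 2**30:.0f} GiB KV cache")
```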
The Transformer architecture essentially relies on "exogenous complexity", that is, improving the level of intelligence by stacking more neurons and performing larger-scale computations. In contrast, the human brain achieves highly complex intelligence with extremely low energy consumption (about 20W), and its neurons have rich internal dynamics and diversity.
This means that there may be another development path of "endogenous complexity" for large models, which is to create the next-generation model architecture by fully leveraging the structural and functional characteristics of biological neural networks at the neuron and neural circuit levels.
Spiking neural networks (SNNs) are regarded by the academic community as one of the next-generation low-power, brain-inspired routes toward more general AI systems. Like the brain, they send signals only when needed, and therefore consume less power.
The team's research found that the behavior of a complex spiking neuron can be reproduced by combining several smaller, simpler neurons, which makes it possible to build efficient brain-inspired networks.
Based on the above theoretical research, the SpikingBrain team integrated three core components into the model architecture: hybrid efficient attention, MoE module, and spiking encoding.
1. Hybrid Efficient Attention
The attention mechanism is the core computational unit of large language models, and SpikingBrain combines the strengths of several attention variants. The 7B model uses an inter-layer hybrid of linear attention and sliding window attention (SWA), covering both global information retrieval and local dependencies.
The larger SpikingBrain-76B uses an intra-layer parallel hybrid that combines linear attention, SWA, and full softmax attention: multiple attention mechanisms run in parallel within the same layer, handling global information, local dependencies, and long-range dependencies efficiently.
▲Overall model architecture of SpikingBrain
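The report's implementation of these layers is not reproduced here; the following minimal PyTorch sketch only shows the two mixing strategies in structural form. The class names, the alternating layer pattern, and the concatenate-then-project merge are illustrative assumptions, and the linear, sliding-window, and softmax attention sub-modules are assumed to be supplied elsewhere.

```python
import torch
import torch.nn as nn

class InterLayerHybrid(nn.Module):
    """7B-style inter-layer hybrid: alternate linear-attention and sliding-window-attention layers."""
    def __init__(self, depth, make_linear_attn, make_sliding_attn):
        super().__init__()
        self.layers = nn.ModuleList(
            make_linear_attn() if i % 2 == 0 else make_sliding_attn() for i in range(depth)
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connection around each attention block
        return x

class IntraLayerParallelHybrid(nn.Module):
    """76B-style intra-layer hybrid: linear, sliding-window, and full softmax attention
    run in parallel inside one layer and their outputs are merged."""
    def __init__(self, dim, linear_attn, sliding_attn, softmax_attn):
        super().__init__()
        self.branches = nn.ModuleList([linear_attn, sliding_attn, softmax_attn])
        self.merge = nn.Linear(3 * dim, dim)

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]   # each branch maps (B, T, dim) -> (B, T, dim)
        return x + self.merge(torch.cat(outs, dim=-1))   # merge the branches and add a residual
```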
2. Mixture of Experts Module
SpikingBrain is extended from Qwen2.5-7B-Base, a dense model. To efficiently expand this existing dense model into a sparse mixture-of-experts model, the SpikingBrain team used an upcycling technique.
The core of the method is to make the expanded model behave identically to the original model at initialization, through parameter copying and output scaling, thereby avoiding performance loss.
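As an illustration of the upcycling idea, here is a minimal PyTorch sketch that copies a dense FFN into several experts and zero-initializes the router so that the MoE layer reproduces the dense layer's output at initialization. The class name, the top-2 routing, and the zero-initialized router are illustrative assumptions; the report describes the principle (parameter copying plus output scaling), not this exact code.

```python
import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    """Expands one dense FFN into a sparse MoE layer whose output matches the dense layer at init."""
    def __init__(self, dense_ffn, hidden_dim, num_experts=8, top_k=2):
        super().__init__()
        # Parameter copying: every expert starts as an exact copy of the dense FFN.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        nn.init.zeros_(self.router.weight)  # uniform routing scores at initialization
        self.top_k = top_k

    def forward(self, x):
        scores = torch.softmax(self.router(x), dim=-1)         # uniform (1/num_experts) at init
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # output scaling: weights sum to 1
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        # At initialization all experts are identical, so out == dense_ffn(x): no performance loss.
        return out
```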
3. Spiking Neurons
Spiking neurons are the basic units of spiking neural networks. The LIF (leaky integrate-and-fire) model commonly used in engineering can capture the core characteristics of biological neurons to a certain extent, but its neurons tend to become either overly silent or overly active, which upsets the balance between model accuracy and energy efficiency.
To address this, the SpikingBrain team proposed adaptive-threshold spiking neurons, which keep neurons moderately active and avoid both over-excitation and prolonged silence.
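For readers unfamiliar with spiking neurons, here is a minimal sketch of a leaky integrate-and-fire neuron with an adaptive threshold. The class name and the specific adaptation rule are illustrative assumptions, not the exact formulation in the report.

```python
import torch

class AdaptiveThresholdLIF:
    """Sketch of an LIF neuron whose firing threshold adapts to its own spiking activity,
    keeping neurons neither silent nor saturated (illustrative dynamics only)."""
    def __init__(self, shape, tau=2.0, base_threshold=1.0, adapt=0.2, decay=0.9):
        self.v = torch.zeros(shape)                       # membrane potential
        self.theta = torch.full(shape, base_threshold)    # adaptive firing threshold
        self.tau, self.base, self.adapt, self.decay = tau, base_threshold, adapt, decay

    def step(self, current):
        self.v = self.v + (current - self.v) / self.tau   # leaky integration of the input current
        spike = (self.v >= self.theta).float()            # fire when the potential crosses the threshold
        self.v = self.v * (1.0 - spike)                   # reset the potential of neurons that fired
        # The threshold rises after a spike and relaxes back toward its base value,
        # discouraging both over-excitation and prolonged silence.
        self.theta = self.decay * self.theta + (1 - self.decay) * self.base + self.adapt * spike
        return spike
```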
02.Complete Model Conversion in 3 Steps, Fully Compatible with Domestic GPU Clusters
During the training process, the SpikingBrain team converted Qwen2.5-7B-Base into a brain-inspired spiking large model, which mainly included three steps.
In the continual pre-training and long-sequence extension step, the model was trained on about 150B tokens while the sequence length was gradually extended from 8K to 128K. This is only about 2% of the data required for training from scratch, making the conversion highly efficient.
In the supervised fine-tuning step, datasets from different domains, together with a high-quality reasoning dataset distilled from DeepSeek-R1, were used to progressively improve the model's general knowledge, dialogue, and reasoning capabilities.
After that, the model undergoes spike encoding. Inspired by the biological nervous system, the SpikingBrain team proposed a strategy that converts the model's continuous activation values into integer spike counts.
During inference, these integer spike counts are expanded into sparse spike trains to be compatible with event-driven computation.
SpikingBrain provides three encoding schemes: binary spike coding is simple and consumes little energy; ternary spike coding supports excitation-inhibition regulation similar to the biological nervous system, reducing the number of time steps and the total spike count; binary-coded (bitwise) spike coding sharply reduces computation and energy consumption when spike counts are high.
▲Schematic diagram of three spiking schemes
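As a rough illustration of how an integer spike count can be unrolled into the different spike trains described above, here is a small PyTorch sketch; the function names and quantization details are assumptions, and the report's exact coding schemes may differ.

```python
import torch

def to_spike_count(activation, scale, max_count=15):
    """Quantize a continuous activation into a non-negative integer spike count."""
    return torch.clamp(torch.round(activation.abs() / scale), 0, max_count).int()

def binary_train(count, num_steps):
    """{0,1} spikes: emit one spike per time step until the count is exhausted."""
    return [(count > t).int() for t in range(num_steps)]

def ternary_train(count, sign, num_steps):
    """{-1,0,1} spikes: like binary, but each spike carries an excitatory/inhibitory sign."""
    return [(count > t).int() * sign for t in range(num_steps)]

def bitwise_train(count, num_bits=4):
    """Binary-coded spikes: each time step carries one bit of the count,
    so a large count needs only log2(max_count) steps."""
    return [(count >> b) & 1 for b in range(num_bits)]

# Example: a count of 5 becomes five 1-spikes over time, or the bit pattern 1-0-1-0 over 4 steps.
```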
These spiking schemes run on GPUs, but GPUs cannot fully exploit the core advantages of spiking signals: event-driven, sparse, asynchronous computation. To fully unlock the low-power potential of the approach, dedicated asynchronous hardware such as brain-inspired chips and spiking processors is needed.
Even so, SpikingBrain was trained on a domestic Magic Core Technology GPU cluster. The Magic Core Technology software platform achieved compatibility through MoE optimization, parallelized computation and communication, memory optimization, operator fusion, and automatic tuning.
The adaptation work has two parts: Triton compatibility, and migration of CUDA code to the MACA framework (Magic Core Technology's CUDA-compatible software stack). The two paths optimize different operators inside the model and together form a hardware adaptation solution for Magic Core Technology GPUs.
▲Adaptation of CUDA and Triton operators on the Magic Core Technology platform
Thanks to this adaptation, downstream users can keep their existing programming habits and interface-calling conventions without significantly modifying the model code. The platform also provides debugging and performance-profiling tools, so developers can observe how the model executes on the hardware and carry out any necessary fine-tuning and optimization.
Training large language models usually exceeds the memory capacity of a single GPU. Therefore, the SpikingBrain team combined distributed training technologies such as data parallelism, pipeline parallelism, expert parallelism, and sequence parallelism to distribute the computational and storage loads across multiple GPUs.
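Of these techniques, data parallelism is the easiest to show in isolation. The sketch below is a minimal training loop using PyTorch's DistributedDataParallel, given purely as an illustration of one of the listed techniques; it is not the team's actual training stack, which also combines pipeline, expert, and sequence parallelism.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def train_data_parallel(model, dataset, epochs=1):
    # One process per GPU on a single node; launch with `torchrun --nproc_per_node=<num_gpus> train.py`.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    model = DDP(model.cuda(rank), device_ids=[rank])

    sampler = torch.utils.data.DistributedSampler(dataset)        # shards the data across ranks
    loader = torch.utils.data.DataLoader(dataset, batch_size=8, sampler=sampler)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for batch, labels in loader:
            loss = F.cross_entropy(model(batch.cuda(rank)), labels.cuda(rank))
            loss.backward()        # DDP all-reduces gradients across GPUs here
            optim.step()
            optim.zero_grad()
```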
03.Recovering 90% of the Base Model's Performance, Cluster Runs Continuously for Two Weeks Without Interruption
In downstream task evaluations, SpikingBrain-7B recovered about 90% of the performance of its base model Qwen2.5-7B across multiple benchmarks. Its overall performance is comparable to that of advanced Transformer models such as Mistral-7B and Llama-3-8B, indicating that efficient linear attention can retain strong modeling ability while reducing inference complexity.
The SpikingBrain-76B hybrid linear MoE model recovered the base model's performance almost completely.
After three-stage SFT alignment training, SpikingBrain-76B matches open-source dialogue models of a similar scale in general knowledge, long-sequence modeling, and instruction following, while retaining the general capabilities acquired during pre-training without overfitting, demonstrating the stability and scalability of the architecture under alignment training.
In the long-sequence inference scenario, SpikingBrain-7B generated its first token 26.5 times faster than the Transformer baseline at a length of 1 million tokens (a 26.5x lower TTFT, time to first token), and the speedup exceeded 100x at a length of 4 million tokens.
In terms of training performance, the training throughput of the 7B model at a sequence length of 128K was 5.36 times that of Qwen2.5-7B, which is basically consistent with the improvement in inference performance.
At the same time, on a mobile phone CPU, at sequence lengths of 64K, 128K, and 256K, SpikingBrain's inference speed was 4.04, 7.52, and 15.39 times faster, respectively, than that of a Llama3.2 model of the same scale.