GPUs made entirely in China, a 100x speedup for long context: the Chinese Academy of Sciences has released a brain-inspired large model with "linear complexity".
[Introduction] SpikingBrain draws on the brain's information-processing mechanisms and features linear/near-linear complexity, giving it a significant speed advantage on extremely long sequences. On a GPU, at a length of 1M tokens, SpikingBrain's TTFT is 26.5x faster than that of mainstream large models, and at a length of 4M the speedup is conservatively estimated at over 100x. On a mobile-phone CPU, at lengths of 64k, 128k, and 256k, its decoding speed is 4.04x, 7.52x, and 15.39x faster, respectively, than that of a Llama3.2 model of the same scale, demonstrating the great potential of building a new generation of AI foundation models and architectures that draw on the structure and function of the brain.
Current mainstream large models are based on the Transformer architecture and, driven by the Scaling law, have achieved great success in improving intelligence by increasing network scale, computing resources, and data volume.
However, the Transformer architecture has quadratic complexity relative to the sequence length, resulting in high training and inference costs and limited ability to process extremely long sequences.
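To make the complexity gap concrete, the sketch below compares rough per-head operation counts for standard softmax attention (quadratic in sequence length) against a linear-attention layer (linear in sequence length); the head dimension and the counts themselves are illustrative assumptions, not figures from the report.

```python
# Illustrative arithmetic only: compare how the attention cost grows with
# sequence length for standard softmax attention vs. a linear-attention layer.
# The head dimension and counts are rough assumptions, not report figures.

def attention_ops(seq_len: int, head_dim: int = 128) -> dict:
    """Rough per-head operation counts for the attention computation."""
    return {
        "softmax_attention": seq_len * seq_len * head_dim,  # O(n^2 * d)
        "linear_attention": seq_len * head_dim * head_dim,  # O(n * d^2)
    }

for n in (4_096, 1_000_000, 4_000_000):
    ops = attention_ops(n)
    ratio = ops["softmax_attention"] / ops["linear_attention"]
    print(f"n = {n:>9,}: quadratic/linear cost ratio ~ {ratio:,.0f}x")
```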
Recently, the research team led by Li Guoqi and Xu Bo at the Institute of Automation, Chinese Academy of Sciences, drawing on the complex internal working mechanisms of brain neurons, released SpikingBrain-1.0 (Shunxi), a domestically-developed and self-controllable brain-inspired spiking large model. The model achieves efficient training with a very small amount of data, has linear/near-linear complexity, and significantly improves training and inference efficiency on long sequences. The entire training and inference pipeline runs on a domestic GPU computing platform.
Online demo link: https://controller-fold-injuries-thick.trycloudflare.com
Chinese technical report URL: https://github.com/BICLab/SpikingBrain-7B/blob/main/SpikingBrain_Report_Chi.pdf
English technical report URL: https://arxiv.org/abs/2509.05276
Model code URL: https://github.com/BICLab/SpikingBrain-7B
Research Background
Existing mainstream large models are based on the Transformer architecture, and their basic computational unit is the point neuron model: a simple multiplication and addition unit followed by a non-linear function. This technical path of simple neurons combined with network scale expansion can be called the "exogenous complexity-based" approach to achieving general intelligence.
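As a minimal illustration of the point-neuron unit described above, here is a hedged NumPy sketch of a weighted multiply-add followed by a pointwise non-linearity; the shapes and the choice of ReLU are illustrative assumptions, not tied to any specific model.

```python
import numpy as np

# Minimal sketch of the "point neuron" unit: a simple multiplication-and-
# addition stage followed by a non-linear function.

def point_neuron_layer(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    pre_activation = x @ w + b              # simple multiply-add unit
    return np.maximum(pre_activation, 0.0)  # non-linear function (ReLU here)

x = np.random.randn(4, 16)    # a batch of 4 inputs with 16 features each
w = np.random.randn(16, 32)   # weights mapping 16 -> 32 features
b = np.zeros(32)
print(point_neuron_layer(x, w, b).shape)  # (4, 32)
```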
As mentioned above, this path faces problems such as high power consumption and poor interpretability.
The human brain is currently the only known general intelligence system, containing approximately 100 billion neurons and about 100 trillion synapses, with a rich variety of neuron types and different neurons having rich internal structures, but consuming only about 20W of power.
In view of this, Li Guoqi's research team believes there is another path: the "endogenous complexity-based" approach to general intelligence, which integrates the rich dynamic characteristics of neurons to build neural networks that are both biologically plausible and computationally efficient, making full use of the structural and functional properties of biological neural networks at the level of neurons and neural circuits.
Under this idea, exploring the bridge between brain science and the architecture of AI foundation models and building a new generation of non-Transformer brain-inspired foundation model architectures may lead the development direction of the next generation of AI and provide a foundation for the development of a domestically-developed and self-controllable brain-inspired large model ecosystem.
Core Technologies
SpikingBrain-1.0 builds a linear (hybrid) model architecture on spiking neurons, yielding brain-inspired foundation models with linear complexity (SpikingBrain-7B) and near-linear complexity (SpikingBrain-76B, with 12B activated parameters) (Figure 1).
Figure 1. Overview of the SpikingBrain framework
To solve the performance-degradation problem that arises during spike encoding, an adaptive-threshold neuron model is constructed to simulate the core process of spike firing in biological neurons. The potential-to-spike conversion is then achieved through a virtual time-step strategy, and the integer spike count is re-expanded into a sparse spike sequence.
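The sketch below illustrates this idea under stated assumptions: a continuous activation is quantized into an integer spike count with an adaptive threshold and then unrolled into a sparse spike train over virtual time steps. The specific threshold rule and expansion scheme are illustrative choices, not the report's exact formulation.

```python
import numpy as np

# Hedged sketch of spike encoding: continuous "membrane potentials" are
# converted to integer spike counts via an adaptive threshold, then the
# counts are re-expanded into a sparse spike sequence over virtual steps.

def encode_to_spike_counts(potential: np.ndarray):
    # Adaptive threshold: tied to the statistics of this activation tensor.
    threshold = float(np.abs(potential).mean()) + 1e-8
    counts = np.floor(np.abs(potential) / threshold).astype(int)
    signs = np.where(potential >= 0, 1, -1)
    return counts, signs, threshold

def expand_to_spike_train(counts: np.ndarray, signs: np.ndarray, num_steps: int) -> np.ndarray:
    # A neuron with count k emits k spikes across the virtual time steps
    # (clipped to num_steps); all other slots stay silent (zero).
    steps = np.arange(num_steps)[:, None]
    fired = steps < np.clip(counts, 0, num_steps)[None, :]
    return fired.astype(np.int8) * signs[None, :]

potential = 2.0 * np.random.randn(8)
counts, signs, thr = encode_to_spike_counts(potential)
spikes = expand_to_spike_train(counts, signs, num_steps=4)
# counts * signs * thr approximates the original potential (floor quantization).
print(np.round(potential, 2), counts * signs * thr)
```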
With this dynamic-threshold spike-encoding scheme, the dense continuous-valued matrix multiplications that account for over 90% of the model's computation can be replaced by a spiking operator that supports event-driven computing, achieving both high performance and low energy consumption: a spiking neuron fires a spike event only when its membrane potential accumulates to the threshold, arriving spikes trigger the activity of downstream neurons, and in the absence of spikes the neuron remains in a low-energy resting state.
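To show why event-driven computation saves work, the conceptual sketch below contrasts a dense matmul with an equivalent formulation in which only neurons that fired (nonzero integer spike counts) contribute. It is plain NumPy for clarity, not the spiking operator or Triton/CUDA kernels mentioned in the report.

```python
import numpy as np

# Dense vs. event-driven view of the same computation: with sparse integer
# spike counts, the matmul collapses into a sum over the few active neurons.

def dense_forward(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    return x @ w  # every element participates, regardless of value

def event_driven_forward(spike_counts: np.ndarray, w: np.ndarray) -> np.ndarray:
    # Only positions that fired contribute: add weight rows scaled by counts.
    out = np.zeros(w.shape[1])
    for idx in np.flatnonzero(spike_counts):
        out += spike_counts[idx] * w[idx]
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8))
spike_counts = np.zeros(16, dtype=int)
spike_counts[[2, 7, 11]] = [1, 3, 2]  # sparse events: most neurons stay silent
print(np.allclose(dense_forward(spike_counts, w),
                  event_driven_forward(spike_counts, w)))  # True
```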
Furthermore, combining the MoE architecture at the network level with sparse, event-driven computation at the neuron level provides sparsity at both the macro and micro scales, reflecting on-demand allocation of computing resources.
The research team theoretically established the connection between the endogenous dynamics of spiking neurons and the linear attention model, revealing that the existing linear attention mechanism is a special simplified form of dendritic computation, thus clearly demonstrating a new feasible path to continuously improve model complexity and performance.
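For readers unfamiliar with linear attention, the following sketch shows the recurrent form that the text relates to neuronal dynamics: a fixed-size state is updated token by token, so the cost grows linearly with sequence length. The dimensions and the (identity) feature map are assumptions made for illustration.

```python
import numpy as np

# Minimal recurrent form of linear attention: instead of an n x n attention
# map, a fixed-size state S is updated per token and read out with the query.

def linear_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    seq_len, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))          # recurrent state, akin to a synaptic trace
    outputs = np.zeros((seq_len, d_v))
    for t in range(seq_len):
        S += np.outer(k[t], v[t])     # state update: S_t = S_{t-1} + k_t v_t^T
        outputs[t] = q[t] @ S         # readout:      o_t = q_t S_t
    return outputs

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((32, 16)) for _ in range(3))
print(linear_attention(q, k, v).shape)  # (32, 16); cost is linear in sequence length
```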
Building on this understanding and its previous work, the team constructed a general model-conversion technique and an efficient training paradigm compatible with existing large models, which converts the standard self-attention mechanism into a low-rank linear attention model and adapts it to the proposed spike-encoding framework.
In addition, to enable the domestic computing cluster to support the full-process training and inference of the brain-inspired spiking large model, the team developed an efficient training and inference framework for domestic GPU clusters, a Triton/CUDA operator library, a model parallel strategy, and cluster communication primitives.
SpikingBrain-7B is a purely linear model with inter-layer mixing, while SpikingBrain-76B is a hybrid linear MoE model with intra-layer mixing (Figure 2).
SpikingBrain-7B stacks linear attention and sliding-window attention layer by layer at a 1:1 ratio. SpikingBrain-76B contains 128 sink tokens, 16 routed experts, and 1 shared expert. On the feed-forward side, 7 dense FFNs are placed at layers [1, 2, 3, 5, 7, 9, 11], and the remaining layers are implemented as MoE layers.
For the attention modules, a combination of linear attention and softmax attention (LA + FA) is used in layers [7, 14, 21, 28], and a combination of linear attention and sliding-window attention (LA + SWA) is used in the other layers.
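A config-style sketch of this layer layout, under stated assumptions: the 32-layer backbone depth is assumed purely for illustration, and only the quoted layer indices and expert counts come from the text.

```python
# Hypothetical layout table for SpikingBrain-76B based on the description above.

NUM_LAYERS = 32  # assumed backbone depth, for illustration only

dense_ffn_layers = {1, 2, 3, 5, 7, 9, 11}  # dense FFN; all other layers use MoE
full_attention_layers = {7, 14, 21, 28}    # linear attention + softmax attention (LA + FA)

layout = []
for layer_idx in range(1, NUM_LAYERS + 1):
    attention = "LA+FA" if layer_idx in full_attention_layers else "LA+SWA"
    ffn = "dense-FFN" if layer_idx in dense_ffn_layers else "MoE(16 routed + 1 shared)"
    layout.append((layer_idx, attention, ffn))

for layer_idx, attention, ffn in layout[:8]:
    print(f"layer {layer_idx:2d}: {attention:6s} | {ffn}")
```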
During the inference phase, SpikingBrain uses spike encoding to convert activation values into integer counts for GPU execution or into spike sequences for event-driven neuromorphic hardware.
Figure 2. SpikingBrain network architecture
Performance Highlights
SpikingBrain-1.0's long-sequence training efficiency is significantly improved: the SpikingBrain-1.0-7B model achieves general language-modeling performance comparable to many open-source Transformer models with a very small amount of data (about 2% of that used by mainstream large models) (Table 1).
The SpikingBrain-1.0-76B hybrid linear model preserves the base model's performance through a larger parameter count and a more refined attention design; with fewer activated parameters, it approaches or even outperforms advanced Transformer models such as Llama2-70B, Mixtral-8x7B, and Gemma2-27B (Table 2).
The SpikingBrain-1.0-7B model has been adapted for multi-card sequence-parallel inference (using ZeCO plus P2P communication) in the Hugging Face framework and supports prefill at lengths of up to 4M tokens.
The results show that, compared with a Qwen baseline using standard attention and all-to-all (A2A) communication, the TTFT (time from submitting a prompt to generating the first token) of SpikingBrain-1.0-7B is 13.88x and 26.5x faster at lengths of 512K and 1M respectively, with nearly constant time overhead as the sequence length and the number of cards grow. At a length of 4M the Qwen baseline cannot be evaluated; according to the fitted scaling curve, the speedup is conservatively estimated at over 100x (Table 4).
The team also deployed a compressed 1B SpikingBrain-1.0 model on a CPU-based mobile-phone inference framework. At lengths of 64k, 128k, and 256k, its decoding speed is 4.04x, 7.52x, and 15.39x faster, respectively, than that of the 1B Llama3.2 model.
Figure 3. Comparison of decoding speeds at different output lengths on the CPU-based mobile inference framework
Dialogue demo and online trial link: the team provides an online trial link for the SpikingBrain-1.0-76B model for everyone to experience. The model is deployed on a domestic GPU cluster using the vLLM inference framework and can support concurrent requests from hundreds of users.
To support the construction of a brain-inspired research ecosystem, the team has open-sourced the SpikingBrain-1.0-7B model (see the technical report for details).
Conclusion
The domestically-developed and self-controllable brain-inspired spiking large model released here explores the mechanistic connection between the endogenous complex neural dynamics of spiking neurons and the linear attention model, and designs both a linear model architecture and a conversion-based heterogeneous model architecture. Dynamic-threshold spiking solves the performance degradation that large-scale brain-inspired models suffer under spike-driven constraints, and the work enables a domestic GPU computing cluster to support the full training and inference pipeline of a brain-inspired spiking large model.
Its efficient modeling of extremely long sequences promises significant efficiency advantages in long-sequence scientific modeling scenarios such as complex multi-agent simulation, DNA sequence analysis, and molecular dynamics trajectories.
In the future, the team will further explore the mechanism connection between the endogenous complex dynamics of neurons and the basic operators of AI, build a bridge between neuroscience and AI, and hope to break through the current bottlenecks of AI by integrating biological insights, thereby achieving a brain-inspired general intelligence computing model with low power consumption, high performance, and support for extremely long context windows, providing important inspiration for future brain-inspired chip design.
Reference materials:
https://github.com/BICLab/SpikingBrain-7B/blob/main/SpikingBrain_Report_Chi.pdf
This article is from the WeChat public account "New Intelligence Yuan", edited by LRST. Republished by 36Kr with permission.