BigBang-Proton: An Autoregressive Foundation Model Unifying Language, Science, and the Material World
Can large language models such as GPT-5 and DeepSeek directly perform professional scientific tasks like those of AlphaFold? Sam Altman of OpenAI has said on multiple occasions that the main goal of ChatGPT is to build a general reasoning machine based on language, and then to use that reasoning machine to call professional scientific models such as AlphaFold to solve specific scientific problems. In this view, it is neither feasible nor necessary for ChatGPT to perform AlphaFold's tasks directly.
Recently, Transcend Symmetry (Shanghai) Technology Co., Ltd. (Transcend Symmetry), a company focused on developing foundation models for the physical world, released a new version of its foundation model, BigBang-Proton. It achieves unified pre-training and inference across multiple real-world professional scientific problems within a single large language model (LLM), challenging the mainstream AGI technology route represented by Sam Altman.
The results of BigBang-Proton show that not only professional biological problems like those addressed by AlphaFold and AlphaGenome, but scientific problems spanning all scales of matter, from microscopic particles such as quarks, through material lattices, DNA, and proteins, up to the macroscopic Earth system, can be integrated into a single autoregressive LLM and handled with the next-word-prediction paradigm for both pre-training and inference.
Meanwhile, BigBang-Proton's experiments indicate that the current mainstream AGI route, represented by GPT-5 and DeepSeek R1 and built on long-horizon chain-of-thought reasoning, fails completely to understand real material structures. This suggests that long chains of thought alone cannot achieve AGI.
Transcend Symmetry proposed that Structure Learning is one of the essential elements for achieving AGI. An LLM that masters material structures can naturally enter the physical world.
The significance of BigBang-Proton's results lies in answering a hotly debated question in the industry: has pre-training under the scaling law reached its ceiling? Mainstream general-purpose LLMs are trained on essentially all Internet data, and their scientific knowledge is limited to the hundreds of millions of published papers and books written in natural language. Once this language data is exhausted, the question of hitting a scaling-law wall naturally arises.
The world-model route centered on learning from images, represented by Fei-Fei Li and Yann LeCun, holds that next-word-prediction LLMs are a dead end and that the world should instead be reconstructed from images. Transcend Symmetry proposes a third route that starts from learning material structure. This lets LLM pre-training break free of the Internet-data bottleneck, enter the physical world, and build a world model whose ultra-long context encompasses the entire physical world. A foundation model pre-trained in this way can integrate language, scientific intelligence, spatial intelligence, and embodied intelligence into a single, ultimate unified model.
Where are the boundaries of LLM pre-training? BigBang-Proton's answer is that LLM pre-training will keep expanding toward the entire universe. On this basis, Transcend Symmetry put forward a bold idea, "Universe Compression": compressing the information of the entire universe, as an ultra-long sequence, into a single foundation model that serves as the base for all current AI branch tasks.
Unlike typical LLM companies focused on language learning, Transcend Symmetry has long concentrated on making LLMs understand the digits 0-9. In its early stage, the company's business was analyzing news and financial reports to predict market fluctuations for quantitative finance.
In this financial work, the team found that the business is highly sensitive to numerical data: corporate revenues can be 11-digit numbers, and an error of even one digit caused by hallucination during the LLM's reasoning can be ruinous. In the process, the Transcend Symmetry team discovered that the byte pair encoding (BPE) used by LLMs introduces fundamental defects in numerical analysis, the same defects behind the well-known LLM blunder of claiming that 9.11 is greater than 9.8. They further concluded that this lack of numerical ability is one reason mainstream LLMs cannot learn from real scientific data.
More than 90% of real-world scientific research combines theory with experiment, and most experimental measurements are recorded in numerical form. BigBang-Neutron (also known as BBT-Neutron), released by Transcend Symmetry in 2024 as the first open-source foundation model for scientific computing, was the first LLM focused on understanding large-scale experimental numerical data and helped break through the data-analysis bottleneck of large scientific facilities; it proposed replacing BPE with binary patch encoding. BigBang-Proton continues this line of innovation, achieving multi-task learning for real-world scientific research.
1 Fundamental Problems and Three Fundamental Innovations of BigBang-Proton
To build a unified model for professional scientific tasks based on LLMs, several fundamental problems must be solved. BigBang-Proton introduced three fundamental innovations for this purpose:
Innovation 1: Binary Patch Encoding - Discarding the Tokenizer to Unify Language, Numerical, and Scientific Data
Traditional tokenizers such as Byte Pair Encoding (BPE), SentencePiece, and WordPiece perform poorly on numerical data and cannot effectively represent scientific data that spans multiple disciplines, scales, and structures. When tokenizing numbers they introduce ambiguity and inconsistency, so the same number can be segmented into different fragments depending on context. The resulting discontinuity of token IDs complicates the management and processing of numerical data, especially when sequential or patterned token IDs are needed.
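A quick way to observe this effect in a real BPE vocabulary is with the open-source tiktoken library. The snippet below simply prints whatever fragments the encoder produces; the exact splits depend on the vocabulary and are not claims about any particular model:

```python
# Illustrative sketch: how a BPE tokenizer fragments numbers inconsistently.
# Requires the open-source `tiktoken` package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["9.8", "9.11", "12345678901", "revenue was 12345678901 USD"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    # The same digit string can be split into different fragments depending on
    # context, and the resulting token IDs are not ordered like the numbers.
    print(f"{text!r} -> {pieces}")
```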
BigBang-Proton completely abandons traditional tokenizers and adopts Binary Patch Encoding. The method builds on Transcend Symmetry's previous work, BigBang-Neutron, and on related byte-level modeling work such as BGPT, Megabyte, SpaceByte, and BLT. It rests on a profound yet simple insight: all data is ultimately stored in binary form in a computer. BigBang-Proton therefore treats every input, whether English text, Chinese characters, Python code, particle energies, atomic coordinates, or DNA sequences, as a raw binary sequence. Currently it uses UTF-8 encoding and then reduces computational complexity by dividing the binary sequence into patches.
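As a minimal sketch of the idea (not the released implementation), the following shows how heterogeneous inputs can be mapped to UTF-8 bytes and grouped into fixed-size patches; the patch size and padding byte are assumed values for illustration only:

```python
# Minimal sketch of binary patch encoding, assuming a fixed patch size.
# All inputs (text, numbers, code, coordinates) become raw UTF-8 bytes,
# which are then grouped into patches to reduce the sequence length.

PATCH_SIZE = 16  # assumed patch size, for illustration only
PAD = 0          # assumed padding byte for the final partial patch

def to_patches(record: str, patch_size: int = PATCH_SIZE) -> list[list[int]]:
    raw = record.encode("utf-8")                               # raw binary form
    padded = raw + bytes([PAD]) * (-len(raw) % patch_size)     # pad last patch
    return [list(padded[i:i + patch_size])
            for i in range(0, len(padded), patch_size)]

# Text, a 50-digit number, and a crystal record all go through the same path.
for example in ["charged pion", "1" * 50, "Ag2SnYb a=4.71 b=4.71 c=7.63"]:
    print(len(to_patches(example)), "patches for", example)
```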
Advantages of Binary Patch Encoding include:
- Numerical Fidelity: Numbers are kept in their original format, avoiding the information distortion caused by tokenization and enabling accurate arithmetic. This allows the model to achieve 100% accuracy on the addition of numbers with up to 50 digits.
- True Unification: One encoding handles all modalities, whether text, numerical, symbolic, or structural data, eliminating the need for modality-specific tokenization schemes and simplifying preprocessing.
- Extreme Flexibility: It can seamlessly handle any scientific dataset stored in binary format (such as .bin and .dat files), laying the foundation for a unified data representation (see the short sketch below).
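Purely as an illustration of that flexibility (the file name, contents, and patch size are invented for this sketch, and the released model currently works over UTF-8 as noted above), a binary dataset can go from disk bytes to patches with no tokenizer in between:

```python
# Sketch: a binary scientific dataset goes straight from raw bytes to patches.
import struct

# Write a few float64 measurements as a toy .bin file (illustrative only).
with open("toy_measurements.bin", "wb") as f:
    f.write(struct.pack("<4d", 1.2345e-3, 2.5, -0.75, 42.0))

raw = open("toy_measurements.bin", "rb").read()            # 32 raw bytes
PATCH_SIZE = 16                                            # assumed patch size
patches = [raw[i:i + PATCH_SIZE] for i in range(0, len(raw), PATCH_SIZE)]
print(len(raw), "bytes ->", len(patches), "patches")       # 32 bytes -> 2 patches
```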
Innovation 2: Theory-Experiment Learning Paradigm - Bridging the Gap between Theory and Experiment
Scientific experiments produce large amounts of numerical data. How can this data be effectively aligned and co-trained with text-centered theoretical knowledge? Solving this problem would cover more than 90% of experimental research tasks. Scientific knowledge exists in both linguistic and quantitative forms, so a unified model must integrate symbolic reasoning with data-driven learning.
Transcend Symmetry proposes the Theory-Experiment Learning Paradigm. Just as vision-language models pair images with captions, this paradigm pairs scientific experimental data with "theoretical description captions." The core innovation of the framework is a hybrid representation that directly aligns numerical experimental data with textual descriptions.
In particle physics, the numerical measurements of each final-state particle (charge, energy, momentum components, collision parameters, etc.) are paired with text annotations such as "charged pion" or "neutral hadron," forming an experimental-data-to-text alignment analogous to a bimodal image-caption pair.
In materials science, large-scale experimental or simulation datasets are systematically converted into natural-language descriptions embedded in their theoretical background. For example, the crystal structure of Ag₂SnYb in the original MPtrj format is decomposed and converted into a natural-language description.
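A minimal sketch of what such a data-caption pair might look like for the particle-physics case is shown below; the field names and caption template are assumptions for illustration, not the released data format:

```python
# Sketch of a data-caption pair for the theory-experiment learning paradigm.
# Field names and caption wording are illustrative assumptions.

def particle_caption(p: dict) -> str:
    # Pair raw numerical measurements with a textual label, analogous to an
    # image-caption pair in vision-language models.
    values = (f"charge={p['charge']} E={p['energy_gev']} "
              f"px={p['px']} py={p['py']} pz={p['pz']}")
    return f"{values} # this final-state particle is a {p['label']}"

record = {"charge": 1, "energy_gev": 12.7, "px": 3.1, "py": -0.8, "pz": 12.2,
          "label": "charged pion"}
print(particle_caption(record))
# One line of the pre-training corpus would then read:
# charge=1 E=12.7 px=3.1 py=-0.8 pz=12.2 # this final-state particle is a charged pion
```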
In addition to these immediate annotations, the framework integrates deeper theoretical explanations drawn from general scientific corpora such as Wikipedia and the research literature, for example the principles of quantum chromodynamics (QCD) and quark-gluon dynamics in particle physics, and density functional theory and electronic structure in condensed matter physics.
Advantages of the Theory-Experiment Learning Paradigm:
- Dual Alignment Structure: During pre-training, theoretical concepts and experimental data sequences are placed in the same context, creating immediate data-caption pairs at the local level and comprehensive theoretical explanations at the global level.
- Converting Scientific Computation into Sequence Learning: The sequence-based autoregressive language model learns the patterns in experimental data (traditionally captured by graph neural networks or numerical analysis models) and aligns numerical observations with theoretical concepts in a unified context.
- Language-Guided Scientific Computation: Through integrated pattern recognition and language reasoning, the model can perform scientific tasks directly from natural-language instructions, covering the most common scientific computing tasks such as language-guided classification, regression, spatiotemporal prediction, and genome modeling (an illustrative prompt is sketched below).
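As an illustration of what a language-guided task can look like at inference time, the prompt below mixes a natural-language instruction with numerical measurements; the station, variables, and numbers are invented for this example and are not taken from the paper:

```python
# Assumed example of a language-guided regression prompt: the instruction,
# the numerical context, and the expected completion live in one sequence.
prompt = (
    "Task: predict the dissolved-oxygen concentration (mg/L) at the next time step.\n"
    "Station 07, hourly measurements:\n"
    "t-3: temp=18.2 pH=7.41 DO=8.12\n"
    "t-2: temp=18.4 pH=7.39 DO=8.05\n"
    "t-1: temp=18.7 pH=7.38 DO=7.97\n"
    "t:   temp=19.0 pH=7.36 DO="
)
# The model completes the sequence with the predicted value, turning a
# regression task into next-token (next-byte) prediction.
print(prompt)
```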
Innovation 3: Monte Carlo Attention - An Attention Mechanism for Simulating Complex Material Structures
To simulate complex material structures such as cells, quantum systems, the Earth, and the universe from the atomic scale, the model needs to process extremely large information sequences. The computational complexity of the traditional Transformer's attention mechanism increases quadratically with the sequence length, making it impossible to scale to the required size.
Transcend Symmetry replaced the traditional Transformer architecture with Monte Carlo Attention. This innovation aims to overcome the inherent computational complexity of attention over binary patches while retaining the advantages of sparse attention and state-space models, the main alternatives to the Transformer.
Its core is a block-representative communication mechanism, loosely modeled on representative political systems: the sequence is divided into blocks, each block sends representatives to communicate with the other blocks, and the representatives then return to their own block. This allows the model's effective context length to grow exponentially with the number of attention layers.
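The released implementation details are not reproduced here; the following is a minimal sketch of one plausible reading of the block-representative idea, with mean-pooled representatives, a single attention head, and assumed tensor shapes:

```python
# Illustrative sketch (not the released implementation) of one
# block-representative attention step over a (seq_len, dim) sequence.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def block_representative_attention(x: np.ndarray, block_size: int) -> np.ndarray:
    # x: (seq_len, dim); seq_len assumed divisible by block_size for brevity
    n_blocks = x.shape[0] // block_size
    blocks = x.reshape(n_blocks, block_size, -1)
    reps = blocks.mean(axis=1)          # one representative per block (assumed: mean pooling)
    out = np.empty_like(blocks)
    for b in range(n_blocks):
        # each block attends over its own tokens plus every block's representative
        kv = np.concatenate([blocks[b], reps], axis=0)
        scores = blocks[b] @ kv.T / np.sqrt(x.shape[1])
        out[b] = softmax(scores) @ kv
    return out.reshape(x.shape)

x = np.random.randn(8 * 4, 16)                                # 8 blocks of 4 tokens, dim 16
print(block_representative_attention(x, block_size=4).shape)  # (32, 16)
```

In this sketch each token attends over only its block plus one representative per block, so the per-layer cost is roughly O(n·(B + n/B)) for block size B rather than O(n²), while information still propagates across blocks layer by layer.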
In this work, BigBang-Proton uses 20 layers of Monte Carlo Attention, achieving a context capacity of 10³⁰ bytes. Theoretically, to reach the estimated number of baryons in the observable universe, about 10⁸⁰, the number of Monte Carlo Attention layers can be set to 60. Such context lengths are crucial for the model to learn complex material structures, ranging from microscopic systems such as cells and quantum chromodynamics (QCD) phenomena to macroscopic structures such as the Earth system, airplanes, cars, and even the universe.
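A back-of-the-envelope reading of this scaling, under the assumption (ours, for illustration only) that each layer multiplies the effective context by a roughly constant factor k:

```latex
% Illustrative estimate only; the constant per-layer factor is an assumption.
C_{\mathrm{eff}}(L) \approx k^{L}, \qquad
k \approx \left(10^{30}\right)^{1/20} = 10^{1.5} \approx 32,
\qquad
L \gtrsim \frac{80}{1.5} \approx 54 \ \text{layers to exceed } 10^{80} \ \text{bytes},
```

which is broadly consistent with the 60 layers mentioned above.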
BigBang-Proton selected five professional scientific problems, alongside general corpora, for pre-training: arithmetic with 50-digit numbers, jet classification in particle collisions, interatomic potential simulation for materials, water quality prediction, and joint modeling of DNA/RNA/proteins. Arithmetic ability sits at the center, as the foundation for an LLM's understanding of all the other scientific tasks. The other four tasks are core problems in their respective disciplines, and many further problems in those disciplines can be addressed by extending them.
Particle jet classification determines scientists' ability to identify new particles in collision results. Interatomic potential simulation allows the physical and chemical properties of materials to be inferred. Water quality prediction underpins Earth system simulation. Joint modeling of DNA/RNA/proteins is the core of bioinformatics. The architectural goal of BigBang-Proton is language-guided scientific computing, including language-guided classification, regression, spatiotemporal prediction, and DNA sequence simulation.
BigBang-Proton has 1.5B parameters. The training loss and perplexity curves showed consistent, smooth, and monotonic convergence over 61,381 steps, demonstrating stable and effective learning throughout pre-training.
The loss steadily decreased to 0.613 and the perplexity to 2.04, indicating a marked improvement in the model's ability to predict the next token across all nine diverse tasks. This continuous improvement shows that next-word prediction implemented via Binary Patch Encoding can overcome high data heterogeneity and achieve robust model convergence.