
Protein structure prediction, function annotation, interaction recognition, and on-demand design: The research team led by Zhang Shugang from Ocean University of China tackles the core tasks of protein intelligent computing.

HyperAI (超神经) · 2025-07-01 15:50
Intelligent computing reshapes the research paradigm of protein studies.

In his talk "Construction and Application of Protein Intelligent Computing System", Associate Professor Zhang Shugang of the School of Computer Science at Ocean University of China described how intelligent computing is breaking through long-standing challenges in protein research. He focused on his team's results in functional annotation, interaction recognition, and design optimization. This article is an edited transcript of the talk's key points.

As the main executors of life activities, proteins play a crucial role in the physiological functions of the human body. However, traditional research faces high costs for structure determination, serious lags in functional annotation, and low efficiency in designing new proteins. In recent years, the life sciences have needed to analyze the complex characteristics of proteins ever more urgently, and breakthroughs in big data, deep learning, and multimodal computing have opened new opportunities for building a protein intelligent computing system. Such a system has already produced significant results in large-scale functional annotation, interaction prediction, and three-dimensional structure modeling of proteins, providing new technical paths for drug discovery and the simulation of living systems.

At the 2025 Beijing Zhiyuan Conference, Associate Professor Zhang Shugang of the School of Computer Science at Ocean University of China spoke in the "AI + Science & Engineering & Medicine" special forum under the title "Construction and Application of Protein Intelligent Computing System". Starting from the core value of the protein intelligent computing system, he reviewed the technological breakthroughs in its four core tasks (protein structure prediction, functional annotation, interaction recognition, and new protein design) and presented his team's related research.

HyperAI has organized and summarized Associate Professor Zhang Shugang's in-depth talk without altering its original meaning. The transcript follows.

Overview of the Protein Intelligent Computing System: The AI-Driven Revolution in Life Sciences

In life science research, the importance of proteins is self-evident. They are the enzymes that catalyze biochemical reactions, the messengers that transmit signals, the structural basis of the organism, and the "weapons" the immune system uses to resist foreign invaders. However, traditional research methods fall short when dealing with the complex characteristics of proteins. High costs for structure determination, serious lags in functional annotation, and low success rates in protein design have become major challenges.

The introduction of AI technology has changed this situation. The 2024 Nobel Prize in Chemistry was awarded for breakthroughs in AI-based protein structure prediction and design, which once again demonstrated the importance of AI in protein research. Protein intelligent computing builds data-driven algorithmic models to simulate and predict the complex characteristics of proteins efficiently. It offers new ideas and research paradigms for the challenges above and opens a new era for life science research.

Breakthroughs in the Core Tasks of Protein Intelligent Computing

The core issues of protein intelligent computing can be classified into the following 4 categories:

Can Protein Structures be Predicted from Scratch: From the Levinthal Paradox to the Subversion of AlphaFold

Take protein folding as an example. A protein with 100 residues may have about 10^200 possible conformations. If these were searched at random, the time required would far exceed the age of the universe (13.8 billion years). This is the famous Levinthal Paradox. Yet real proteins fold within milliseconds to minutes, which implies that specific folding pathways exist.
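The scale of the paradox can be checked with a few lines of arithmetic. The sketch below takes the 10^200 conformations and the 13.8-billion-year age of the universe from the text; the sampling rate of 10^13 conformations per second is an assumption chosen only for illustration and does not come from the talk. The calculation is done in log10 space to keep the numbers readable.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # figure quoted in the text


def log10_search_years(log10_conformations: float, log10_rate_per_s: float) -> float:
    """Return log10 of the years needed to visit every conformation once."""
    log10_seconds = log10_conformations - log10_rate_per_s
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


# 10^200 conformations (from the text), 10^13 samples per second (assumed).
log10_years = log10_search_years(200, 13)
log10_ratio = log10_years - math.log10(AGE_OF_UNIVERSE_YEARS)

print(f"exhaustive search: about 10^{log10_years:.1f} years")
print(f"that is about 10^{log10_ratio:.1f} times the age of the universe")
```

Even if the assumed sampling rate were raised by many orders of magnitude, the gap would remain astronomically large, which is why folding cannot be a random search.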

In 2018, the first-generation AlphaFold model tackled the problem with deep learning. It used residual convolution modules to predict distances between amino acid pairs and torsion angles. In CASP13 it led the other participants by a wide margin, accurately predicting the structures of 25 proteins, while the second-place finisher correctly predicted only 3.

In 2021, the second-generation model achieved a qualitative leap. AlphaFold2 used HMMER and HH-suite for multiple sequence alignment and template search. With 48 Evoformer modules and 8 Structure modules, it achieved atomic-precision protein structure prediction, and a database containing predictions for about 214 million protein monomers was released. The average error between its predicted structures and the results of electron microscopy analysis is no more than the width of an atom, meeting the "Highly Accurate" standard.

In 2024, the third-generation model, AlphaFold3, went further and predicted the structures of protein interactions as they occur in vivo. It can predict not only protein structures but also the structures of complexes formed by proteins with other biomolecules such as nucleic acids, small molecules, and ions. This covers almost all molecular types in the PDB database and provides a powerful tool for understanding cell function and treating disease.

Can Protein Functions be Automatically Annotated: Breakthroughs in Multi-source Data Fusion

Because AlphaFold3 had already pushed protein structure prediction so far ahead, our team decided to shift its research focus to protein functional annotation and interaction analysis. Of the 250 million protein sequences known worldwide, only 0.5% currently have accurate functional annotations. The traditional approach, which relies on manual analysis by biology experts, cannot keep up with this volume of data. Using deep learning for large-scale batch annotation has therefore become a key breakthrough.

Our exploration in this field began in 2022. Deep learning depends on electron microscopy structure data, which is scarce and expensive. To address this pain point, we proposed using the virtual structure data predicted by AlphaFold2 in model training. This strategy, similar to "data augmentation", greatly expanded the scale of the training data: from samples on the order of 5 million that traditional electron microscopy can provide to a pool of predicted data that could in theory reach hundreds of millions. Experiments show that the model trained on predicted data not only outperforms the original version but can also discover new protein functions that traditional methods have not identified.

Paper Title: Enhancing Protein Function Prediction Performance by Utilizing AlphaFold-Predicted Protein Structures

Paper Address: https://pubs.acs.org/doi/10.1021/acs.jcim.2c00885

On the methodological side, to address the insufficient mining of protein structure information, our team proposed a protein function prediction method based on self-supervised graph attention. It encodes the associations between residues within a protein molecule and uses the distances between residues as an auxiliary task, which improves the performance of protein function prediction.

Paper Title: SuperEdgeGO: Edge-Supervised Graph Representation Learning for Enhanced Protein Function Prediction (to be published)

Schematic diagram of the model architecture
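To make the idea of residue distances as an auxiliary task more concrete, the following is a minimal sketch and not the SuperEdgeGO implementation. It builds a residue graph from C-alpha coordinates and exposes the true inter-residue distance on each edge as a regression target. The 8 Å contact cutoff, the mean-squared-error loss, and the function names `residue_graph` and `edge_auxiliary_loss` are all assumptions made for illustration.

```python
import numpy as np


def residue_graph(coords: np.ndarray, cutoff: float = 8.0):
    """Build a contact graph from (N, 3) C-alpha coordinates in angstroms.

    Returns edge_index of shape (2, E) and edge_dist of shape (E,).
    The 8 angstrom cutoff is an assumed, commonly used contact threshold.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Connect residue pairs closer than the cutoff, excluding self-loops.
    mask = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)
    src, dst = np.nonzero(mask)
    return np.stack([src, dst]), dist[src, dst]


def edge_auxiliary_loss(pred_dist: np.ndarray, true_dist: np.ndarray) -> float:
    """Mean squared error between predicted and true edge distances (assumed loss)."""
    return float(np.mean((pred_dist - true_dist) ** 2))


# Toy chain: four residues on a line, 3.8 angstroms apart.
coords = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]])
edge_index, edge_dist = residue_graph(coords)
print(edge_index.shape, edge_dist.round(1))
```

In a full model, an edge-level head on top of the graph encoder would predict `edge_dist`, and this auxiliary loss would be added to the function-prediction loss during training.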

To address the difficulty of fusing heterogeneous protein features and their spatial inconsistency, we proposed a dual-view construction strategy for proteins and a feature alignment method. Proteins have 6 cross-scale modalities (covering dimensions such as sequence, three-dimensional structure, and functional domain). Building on this, the team further proposed a multimodal fusion strategy that integrates contrastive learning and multi-view analysis methods from computer science into a hierarchical feature fusion model. The approach was compared with 20 mainstream baseline methods on 7 datasets and achieved SOTA results in all cases, solving the performance degradation caused by combining modalities directly.

Paper Title: Annotating protein functions via fusing multiple biological modalities

Paper Address: https://www.nature.com/articles/s42003-024-07411-y

Schematic diagram of the model architecture

Detailed test results

Detailed test results
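The contrastive learning mentioned in the fusion strategy above can be illustrated with a generic InfoNCE-style loss that pulls together two views of the same protein (for example a sequence embedding and a structure embedding) and pushes apart views of different proteins. This is a sketch of the general technique, not the loss used in the paper; the temperature of 0.1, the embedding sizes, and the function name `info_nce` are assumptions.

```python
import numpy as np


def info_nce(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss where row i of z_a and row i of z_b are a positive pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature              # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
seq_emb = rng.normal(size=(8, 16))                            # hypothetical sequence view
struct_aligned = seq_emb + 0.01 * rng.normal(size=(8, 16))    # matching structure view
struct_shuffled = struct_aligned[::-1]                        # views paired with the wrong protein

loss_aligned = info_nce(seq_emb, struct_aligned)
loss_shuffled = info_nce(seq_emb, struct_shuffled)
print(f"aligned: {loss_aligned:.3f}  shuffled: {loss_shuffled:.3f}")
```

The loss is low when the two views of each protein agree and high when they are mismatched, which is the signal that drives the modalities into a shared feature space before fusion.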

In the study of the interpretability of function prediction, the model also showed a strong ability to accurately identify more than 10 protein functions from thousands of GO term annotations. Through a literature survey, the team further found that some cases in which the model made a "wrong" prediction with high confidence had in fact been reported in published studies, which suggests that the apparent errors stem from an outdated dataset version. This finding highlights the potential of AI models for discovering new protein functions.

Can Protein Interactions be Accurately Identified: Self-developed Model Achieves Efficient Prediction

In drug R&D, precise docking with the proteins that serve as human targets is the key to a drug's efficacy, and AI technology is highly valuable in this process. Although AlphaFold3 performs excellently in protein structure prediction, it has obvious limitations in practical use: its free version supports only 20 requests per day, it covers about 15-20 types of molecules, and commercial use rights are extremely difficult to obtain. This prompted the team to develop its own model.

To address this, the team mainly carried out the following work:

First, to address the poor collaborative interaction in existing protein-protein interaction prediction methods, the team introduced a twin-learning mode into the encoder to make protein representations more consistent with each other, and proposed a collaborative learning framework with both a protein interaction collaboration mechanism and a task collaboration mechanism. Using interaction attention and multi-task learning, the framework predicts protein-nucleic acid, protein-protein, and protein-small molecule interactions.

The team also combined the Transformer from the NLP field with graph neural networks, developing modules such as Convformer and Graphormer to model long-range interactions, and strengthened multimodal information fusion through a cross-attention mechanism. The model showed strong generalization in practical scenarios. In predicting the pancreatic cancer signaling pathway, for example, its accuracy exceeded 95%, with only 9 interaction pairs predicted incorrectly.

Paper Title: SSPPI: Cross-modality enhanced protein-protein interaction prediction from sequence and structure perspectives (to be published)

Schematic diagram of prediction. Green: low connectivity; Red: high connectivity; Black line: correct prediction; Red line: wrong prediction
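The cross-attention mechanism used for multimodal fusion can be sketched as standard scaled dot-product attention in which one modality supplies the queries and the other supplies the keys and values. The example below is a single-head NumPy illustration under assumed shapes and random weights; it does not reproduce the Convformer or Graphormer modules, and the names `cross_attention`, `seq_feats`, and `struct_feats` are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend over another modality."""
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each query's weights sum to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
d_model, d_head = 32, 16
seq_feats = rng.normal(size=(10, d_model))     # 10 sequence tokens (assumed)
struct_feats = rng.normal(size=(7, d_model))   # 7 structure-graph nodes (assumed)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) * d_model ** -0.5 for _ in range(3))

fused, weights = cross_attention(seq_feats, struct_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each sequence token ends up with a representation that is a weighted mixture of structure features, which is one way information from the two perspectives can be exchanged.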

In recent research, besides cross-scale, reduced-dimension representations of proteins at the network level, we have also focused on mining protein features. Traditional graph models lose information when they reduce three-dimensional structure to two dimensions, so we introduced recent geometric deep learning techniques. We proposed a geometric deep learning method based on a hybrid message-passing strategy to build a paradigm that integrates complete three-dimensional information. The paradigm is meant to avoid discarding three-dimensional information when modeling spatial sites, and it offers new research ideas for three-dimensional protein modeling.

Paper Title: Geometric Deep Learning for Protein-Ligand Affinity Prediction with Hybrid Message Passing Strategies (to be published)

Schematic diagram of the model architecture

We also tested the approach in practice on the ACSS2 protein and screened several candidate compounds out of tens of thousands. The model predicts that the affinity of the screened compounds can reach the nM level, which indicates good drug-development potential. Our team worked with the School of Medicine of Qingdao University on verification, and the docking results have been preliminarily confirmed in recent wet-lab experiments.

Wet-experiment verification of drug-target protein affinity prediction
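To put "nM-level affinity" in perspective, the sketch below converts a dissociation constant into the pKd scale often used as a regression target and into a standard binding free energy via ΔG = RT ln(Kd / 1 M). The text does not say whether affinity was reported as Kd, Ki, or IC50, so treating it as Kd is an assumption, as are the 298.15 K temperature and the example concentrations.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def pkd(kd_molar: float) -> float:
    """Negative log10 of the dissociation constant (Kd in mol/L)."""
    return -math.log10(kd_molar)


def binding_free_energy(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy in kcal/mol, relative to a 1 M reference."""
    return R_KCAL * temperature_k * math.log(kd_molar)


# Illustrative concentrations only; none of these values come from the talk.
for label, kd in [("1 uM", 1e-6), ("10 nM", 1e-8), ("1 nM", 1e-9)]:
    print(f"{label:>6}: pKd = {pkd(kd):.1f}, dG = {binding_free_energy(kd):.1f} kcal/mol")
```

Moving from micromolar to nanomolar binding corresponds to a gain of three pKd units, which is why nM-level predictions are regarded as promising starting points for drug development.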

Can New Proteins be Designed on Demand: From Inverse Problems to Innovative Applications

Protein design is one of the ultimate goals of protein research and is of great significance for vaccine R&D, cancer treatment, and biomaterial development. However, as the inverse problem of protein folding, protein sequence design also faces challenges such as an exploding search space and errors in traditional force-field simulations.

To illustrate the core problem of intelligent protein design and optimization, consider recent work from the team of Baker, one of last year's Nobel laureates. There is no specific antidote for snake venom. Can a new type of protein be designed computationally? Starting from this question, Baker's team combined their earlier ProteinMPNN and RFdiffusion tools to design new proteins that bind specifically to snake venom toxins, offering a new way to neutralize lethal toxins. The paper was published in Nature in early 2025. These results show the great potential of AI in protein design and mark a solid step towards the "creator-like" goal of "designing new proteins".

Cross-scale Computing of Complex Life Systems: Full-chain Simulation from the Nanoscale to the Macroscale

A living system is a complex multi-scale system. From the nanoscale gene level to the macroscale cell level, the scales interact with and influence one another. During my visit to Professor Zhang Henggui's research group at the University of Manchester in the UK, I worked on the digital heart. After returning to China, I went on to work on digital cells. Unlike the "numerical-driven" paradigm of the digital heart, the team proposed a "data-driven" multi-scale modeling method for microscopic life activities and built a full methodological framework for microscopic computing along three dimensions, "representation-state-scale", covering 36 research points. So far, papers or patents have been produced for nearly one-third of these methods.

In addition, under the guidance of Professor Wei Zhiqiang, a new definition of the four-level scale of the microscopic life system was proposed, including the nanoscale gene level, the "microscopic" protein level, the "mesoscopic" signaling