A French team has successfully predicted 2.39 million anti-phage proteins and used a deep learning model to map the antiviral immune profile of bacteria.
Researchers from the Pasteur Institute in France developed and fine-tuned three complementary deep learning models for large-scale prediction of anti-phage functions. Among them, the ALBERT_DF model makes inferences relying solely on the local genomic context; ESM_DF uses a protein language model to parse amino acid sequences; and GeneCLR_DF integrates sequence information with genomic context.
In the microscopic world, the "arms race" between bacteria and phages has never stopped. Phages typically outnumber bacteria about tenfold and rely on bacterial hosts to replicate. Bacteria, in turn, have evolved highly diverse antiviral defense systems. More than 250 anti-phage systems have been experimentally verified so far, covering mechanisms such as restriction-modification and CRISPR-Cas, and new systems are still being discovered. This steady pace of discovery suggests that the complexity and diversity of bacterial defense are likely to far exceed current understanding. However, because of the limits of traditional experimental and computational methods, a large number of potential anti-phage mechanisms remain hidden in bacterial genomes and have not been systematically explored.
Existing studies have noted that known anti-phage systems share certain characteristics at the levels of protein sequence and genomic organization, such as recurring characteristic domains and enrichment in "defense islands" or prophage regions. If these shared patterns can be identified and exploited, it may be possible to mine unknown anti-phage systems systematically at whole-genome scale.
Based on this idea, researchers from the Pasteur Institute in France developed and fine-tuned three complementary deep learning models for large-scale prediction of anti-phage functions. ALBERT_DF makes inferences from the local genomic context alone; ESM_DF uses a protein language model to parse amino acid sequences; and GeneCLR_DF integrates sequence information with genomic context. In a unified benchmark, GeneCLR_DF performed best, achieving a precision of 99% and a recall of 92%.
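For readers less familiar with these benchmark metrics, the following is a minimal sketch of how precision (the share of predicted defense proteins that are truly defense proteins) and recall (the share of true defense proteins that the model finds) are computed. The function name and the toy labels are illustrative assumptions and are not taken from the paper.

```python
def precision_recall(true_labels, predicted_labels):
    """Return (precision, recall) for binary labels (1 = defense, 0 = not)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (hypothetical): 4 true defense proteins, 4 non-defense proteins.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

A model with high precision and somewhat lower recall, as reported here, makes few false calls but misses a fraction of the true defense proteins.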
Using this high-precision model, the team then predicted anti-phage systems at pan-genome scale. Across more than 32,000 bacterial genomes, about 1.5% of the genes in a typical bacterial genome are involved in antiviral defense. More importantly, more than 85% of the predicted defense-related protein families have never been associated with immune functions before. In total, the model predicted about 2.39 million anti-phage proteins, many of which belong to single-gene defense systems, and about 23,000 operon families were defined based on gene co-occurrence, most of them with no previous link to antiviral defense. Together, these results outline a systematic map of bacterial antiviral immunity and show that its scale and diversity far exceed existing understanding.
The relevant research results, titled "Protein and genomic language models uncover the unexplored diversity of bacterial immunity", have been published in Science.
Research highlights:
* A total of 2.39 million anti-phage proteins were predicted, and more than 85% of the predicted defense-related protein families have never been associated with immune functions before;
* In a typical bacterial genome, about 1.5% of the genes are dedicated to antiviral defense;
* About 23,000 operon families were predicted, most of which had no previous link to antiviral defense;
* A large number of predicted defense proteins act as single-gene systems, challenging the traditional view that defense functions usually require the cooperation of multiple genes.
Paper address: https://www.science.org/doi/10.1126/science.adv8275
Dataset: Based on 123 million proteins and 32,000 genomes
The researchers first used the DefenseFinder and PadLoc tools to scan 32,798 complete bacterial genomes in the RefSeq database and quantify the known anti-phage systems. Of about 123 million proteins, DefenseFinder v1.3 identified 521,360 (0.4%) as components of anti-phage systems, and PadLoc identified 805,357 (0.65%).
Notably, many defense systems were originally discovered through their genomic association with known systems. This association can be quantified at the protein family level by a "defense score", which measures how often members of a protein family co-occur with known defense proteins in the genome.
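The idea behind the defense score can be illustrated with a minimal sketch. It assumes one simple definition: the fraction of a family's members that have at least one known defense gene within a fixed window of neighboring genes. The window size, the data layout, and the family names are hypothetical choices for illustration; the exact formula used in the study may differ.

```python
from collections import defaultdict


def defense_scores(genomes, known_defense, window=10):
    """Score each protein family by its co-occurrence with known defense genes.

    genomes: list of genomes, each a list of family IDs in genomic order.
    known_defense: set of family IDs already annotated as defense.
    window: number of neighboring genes inspected on each side (assumption).
    Returns {family: fraction of members with a known defense gene nearby}.
    """
    near = defaultdict(int)   # members with a known defense neighbor
    total = defaultdict(int)  # all members of the family
    for genes in genomes:
        for i, family in enumerate(genes):
            lo, hi = max(0, i - window), min(len(genes), i + window + 1)
            neighbors = genes[lo:i] + genes[i + 1:hi]
            total[family] += 1
            if any(n in known_defense for n in neighbors):
                near[family] += 1
    return {family: near[family] / total[family] for family in total}


# Toy genomes (hypothetical family IDs).
toy_genomes = [
    ["famA", "RM_1", "famX", "famB"],
    ["famB", "famC", "famX", "CRISPR_cas9"],
    ["famB", "famC", "famD"],
]
scores = defense_scores(toy_genomes, {"RM_1", "CRISPR_cas9"}, window=1)
print(scores["famX"], scores["famB"])
```

In this toy input, `famX` always sits next to a known defense gene and receives a high score, while `famB` never does and receives a score of zero.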
Defense scores calculated by gene family
Using the defense score method, as shown in the figure below, the researchers identified 37,959 protein families (4.6%) as candidate anti-phage families. They then removed 7,799 families related to core biological functions or mobile genetic elements, such as integrases, leaving 30,160 curated candidate families (3.7%).
Distribution of defense scores for positive (pink) and negative (blue) identifications by DefenseFinder in the RefSeq database
However, this method has clear limitations. First, it applies only to protein families containing more than five homologous sequences, which excludes about 23% of proteins. Second, some anti-phage systems are not located in typical defense islands, so even genuinely defensive families may receive low defense scores and be missed.
To overcome these limitations and capture defense-related genomic signals more comprehensively, the researchers built a dataset suited to deep learning. In the ALBERT_DF framework, bacterial genomes are modeled as a language: each protein family is treated as a "word", and a run of adjacent genes is treated as a "sentence".
Since there are more than 8 million different protein families in the complete dataset, far exceeding the vocabulary size of traditional language models, the research limited the training scope to the phylum Actinobacteria and constructed a dataset containing 10,796 genomes. The genes were clustered into 4.2 million protein families, and the vocabulary was limited to the 524,288 most common families, thus covering about 89% of the proteins.
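The vocabulary restriction described above can be sketched as follows: keep only the most frequent families and measure what fraction of proteins they cover. This is a minimal illustration; the `[UNK]` placeholder for out-of-vocabulary families and all function names are assumptions, not details confirmed by the paper.

```python
from collections import Counter

UNK = "[UNK]"  # assumed placeholder for families outside the vocabulary


def build_vocab(family_per_protein, vocab_size):
    """Keep the vocab_size most common families and report protein coverage."""
    counts = Counter(family_per_protein)
    vocab = {family for family, _ in counts.most_common(vocab_size)}
    covered = sum(n for family, n in counts.items() if family in vocab)
    return vocab, covered / len(family_per_protein)


def encode(genes, vocab):
    """Replace families outside the vocabulary with the UNK placeholder."""
    return [g if g in vocab else UNK for g in genes]


# Toy data: 10 proteins in 4 families, with the vocabulary limited to 2 families.
proteins = ["f1"] * 5 + ["f2"] * 3 + ["f3"] + ["f4"]
vocab, coverage = build_vocab(proteins, vocab_size=2)
print(sorted(vocab), coverage)
print(encode(["f1", "f3", "f2"], vocab))

# The vocabulary size quoted in the article is a power of two.
assert 2 ** 19 == 524_288
```

Because family sizes are highly uneven, a vocabulary holding a fraction of all families can still cover most proteins, which is consistent with the roughly 89% coverage reported.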
For the ESM_DF and GeneCLR_DF models, the researchers constructed the Gembase_DF dataset. As shown in the figure below, 521,360 anti-phage proteins labeled by DefenseFinder were used as positive samples; 116 million highly conserved core genes present in more than 99% of the genomes and 14 million genes of mobile genetic elements with non-defense functions were used as negative samples; and the remaining proteins were retained as unlabeled candidates.
To avoid information leakage between training, validation, and testing, the researchers placed all proteins from the same defense system in the same data fold and used MMseqs2 to remove residual homology across folds, ensuring a rigorous model evaluation.
Construction process of the Gembase_DF protein dataset
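The system-level fold assignment described above can be sketched in a few lines. This minimal example assigns folds by hashing a defense-system identifier, so that every protein of a system deterministically lands in the same fold. The hashing scheme, the fold count, and the identifiers are illustrative assumptions, and the MMseqs2 homology-removal step is not reproduced here.

```python
import hashlib


def assign_fold(system_id, n_folds=5):
    """Map a defense-system ID to a fold index in a deterministic way."""
    digest = hashlib.sha256(system_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def split_by_system(proteins, n_folds=5):
    """proteins: list of (protein_id, system_id) pairs.

    Returns {fold_index: [protein_id, ...]} with each system kept in one fold.
    """
    folds = {k: [] for k in range(n_folds)}
    for protein_id, system_id in proteins:
        folds[assign_fold(system_id, n_folds)].append(protein_id)
    return folds


# Toy input (hypothetical IDs): p1 and p2 belong to the same system.
toy = [("p1", "sysA"), ("p2", "sysA"), ("p3", "sysB"),
       ("p4", "sysC"), ("p5", "sysC")]
folds = split_by_system(toy)
fold_of = {p: k for k, members in folds.items() for p in members}
print(fold_of)
```

Splitting at the level of individual proteins would instead let components of one system appear in both training and test data, which inflates the measured performance.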
Model architecture: Three deep learning models that build on one another
To move beyond the limitations of the traditional "defense score" method, the research team built a set of complementary deep learning models aimed at three goals: discovery of unknown systems, mining at pan-genome scale, and high-precision integrated prediction. The set comprises ALBERT_DF, based on genomic context; ESM_DF, based on protein sequences; and GeneCLR_DF, which fuses sequence and context information.
ALBERT_DF learns functional signals from the "neighborhood relationships" of genes and can discover new defense systems. ESM_DF models amino acid sequences directly and generalizes well across sequences. GeneCLR_DF integrates both types of information in a unified framework, achieving a better balance between recognition accuracy and prediction coverage.
The ALBERT_DF model rests on a key observation: anti-phage systems often appear in clusters in the genome, with a stable organizational pattern among their own genes and adjacent genes. Based on this feature, the researchers applied the ALBERT architecture from natural language processing to genome modeling. Protein families are treated as "words" and gene arrangements as "syntactic structures", and the model learns local context by predicting masked genes.
Unlike traditional methods based on sequence similarity, this approach uses genomic organization directly, so it is better placed to identify new defense mechanisms that lack homology with known systems. However, because it relies on a discrete "vocabulary", it is inherently difficult to extend across species.
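The masked-gene objective described above can be illustrated by the data preparation step alone: hide some genes in a "sentence" and keep the hidden families as prediction targets. This is a minimal sketch; the 15% masking rate follows the common BERT convention and is an assumption here, as are the `[MASK]` token and the family names.

```python
import random

MASK = "[MASK]"  # assumed mask token


def mask_genes(sentence, mask_prob=0.15, seed=0):
    """Randomly hide genes in a gene 'sentence'.

    sentence: list of protein-family IDs in genomic order.
    Returns (masked_sentence, targets), where targets maps each masked
    position to the family the model would be trained to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, family in enumerate(sentence):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = family
        else:
            masked.append(family)
    return masked, targets


# Toy gene neighborhood (hypothetical family IDs).
sentence = ["fam12", "fam7", "fam903", "fam7", "fam44", "fam58", "fam2", "fam31"]
masked, targets = mask_genes(sentence, mask_prob=0.3, seed=42)
print(masked)
print(targets)
```

A model trained to fill in the masked positions has to learn which families tend to occur together, which is the contextual signal that ALBERT_DF exploits.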
ALBERT_DF model
The ESM_DF model takes a different route and operates directly on protein amino acid sequences. Through large-scale pre-training, it learns co-variation and long-range relationships between residues, extracting functional signals without relying on hand-crafted features. After fine-tuning, ESM_DF can score any protein for its likely involvement in anti-phage defense. This greatly broadens the method's scope and allows it to run at pan-genome scale. At the same time, the discriminative ability of ESM_DF still depends on sequence similarity to some extent, so it is better at identifying distant variants of known defense systems and has relatively limited ability to identify new domains that lack homology.
ESM_DF model
Building on these two models, GeneCLR_DF was designed to integrate sequence and genomic context information. It uses a contrastive learning framework to learn two representations for each gene at the same time: one from the protein sequence and one from its genomic neighborhood. Training the model to judge whether two such representations correspond to the same gene aligns the two types of information in the representation space.
This design brings key advantages. When a gene lacks homology at the sequence level, its typical genomic context can still provide identification clues; conversely, when the context is atypical, sequence features can still support discrimination. Through this complementarity, GeneCLR_DF combines the ability to discover new systems with the scalability needed for large-scale prediction.
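The alignment objective can be made concrete with a minimal sketch of an InfoNCE-style contrastive loss, a widely used formulation for this kind of two-view training. It is an assumption that the study uses a loss of this shape; the temperature value, the one-directional (sequence to context) form, and the toy embeddings are illustrative only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(seq_embs, ctx_embs, temperature=0.1):
    """InfoNCE-style loss: row i of seq_embs should match row i of ctx_embs.

    seq_embs: sequence-based embeddings, one vector per gene.
    ctx_embs: context-based embeddings for the same genes, in the same order.
    """
    seq = [normalize(v) for v in seq_embs]
    ctx = [normalize(v) for v in ctx_embs]
    loss = 0.0
    for i, s in enumerate(seq):
        logits = [dot(s, c) / temperature for c in ctx]
        m = max(logits)  # subtract the maximum for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denominator - logits[i]  # negative log-softmax of the true pair
    return loss / len(seq)


# Toy embeddings (hypothetical): aligned pairs versus swapped pairs.
seq_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(seq_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(seq_embs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is low when each gene's sequence embedding is closest to its own context embedding and high when the pairs are mismatched, which is what pushes the two representations into a shared space.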
GeneCLR_DF model
Overall, the three models form a clear technical progression: from local pattern learning based on context, to broad generalization based on sequences, to unified modeling of both sources of information. This layered design avoids the limitations of any single method and provides a more general framework for systematically exploring unknown anti-phage mechanisms.
Achieving 99% precision and 92% recall
In the experimental validation, the researchers first evaluated the predictive ability of ALBERT_DF. The model predicted 1,930 candidate anti-phage protein families, about 33% of which overlapped with the results of the defense score method. The researchers then selected 10 candidate systems that had no support from the defense score and lacked known homology, expressed them in Streptomyces albus, and challenged them with 12 types of phages. Six of these systems showed robust protection, reducing plaque-forming units by more than 100-fold. These systems (such as "Ceres" and "Geb") contain metabolic enzymes and small proteins of unknown function that fall outside classic defense domains, indicating that a method based on genomic context can discover new defense mechanisms that are difficult to identify by traditional means.
Predicting candidate defense systems from the Streptomyces genome using ALBERT_DF
In validating ESM_DF, the researchers tested a set of high-scoring candidates in Escherichia coli. Six of these systems showed anti-phage activity, including one that protects against multiple types of phages. The validated systems include both variants of known defense domains and domains, such as DUF7946, that had not previously been associated with anti-phage functions. This indicates that ESM_DF does not rely on sequence homology alone and can also recognize a broader range of functional features, although overall it still tends to extend known systems.
Candidate systems predicted by ESM_DF and the corresponding defense phenotypes when each system is heterologously expressed in Escherichia coli
GeneCLR_DF performed the most prominently in the systematic evaluation. On the test set,