DeepMind Unveils AlphaGenome: Variant Effect Prediction Across All Modalities and Cell Types in Under One Second

It can analyze up to 1 million DNA bases and make predictions with single-base resolution.

Google DeepMind has released the AlphaGenome model, which can predict thousands of molecular properties related to regulatory activity. It can also evaluate the impact of genetic variants or mutations by comparing predictions for the mutated and non-mutated sequences.

Google DeepMind's Alpha series has a new member: AlphaGenome. It predicts, more comprehensively and accurately than earlier models, how single variants or mutations in human DNA sequences affect a wide range of biological processes that regulate genes.

The AlphaGenome model takes a DNA sequence of up to one million base pairs as input and predicts thousands of molecular properties related to regulatory activity. It can also evaluate the impact of genetic variants or mutations by comparing predictions for the mutated and non-mutated sequences. The model builds on DeepMind's previous genomics model, Enformer, and complements AlphaMissense, which focuses on classifying variants in protein-coding regions.

Jun Cheng, co-first author of the paper, wrote on his personal X account, "RNA splicing errors are a common cause of many diseases. For the first time, we have built a unified model that can simultaneously predict RNA-seq coverage, splice sites, splice site usage, and the specific splice junctions they form, thus providing a more comprehensive picture of splicing outcomes." He also pointed out that one of AlphaGenome's important breakthroughs is "the ability to directly predict splice junctions from sequence and use them for variant effect prediction."

Dr. Caleb Lareau of Memorial Sloan Kettering Cancer Center also commented, "This is a milestone in the field. For the first time, we have a model that combines long context, single-base precision, and top-level performance, and covers a wide range of genomic tasks." DeepMind has made a preview version of AlphaGenome available to non-commercial research users through the AlphaGenome API and plans to release the model itself in the future.

* Link to the research paper:

https://go.hyper.ai/w9Jes

A U-Net-like architecture that takes a one-million-base DNA sequence and species information as input

As shown in Figure a below, the deep learning model AlphaGenome takes a 1 Mb (one-million-base) DNA sequence and species information (human or mouse) as input. It predicts 5,930 human genomic tracks or 1,128 mouse genomic tracks across different cell types, covering 11 output types, including:

* Gene expression (RNA-seq, CAGE, PRO-cap)

* Detailed splicing patterns (splice sites, splice site usage, splice junctions)

* Chromatin states (DNase, ATAC-seq, histone modifications, transcription factor binding)

* Chromatin contact maps

Overview of the AlphaGenome model

In terms of model architecture, AlphaGenome adopts a U-Net-like backbone. As shown in Figure a below, it efficiently processes the input sequence into two types of sequence representations:

* One-dimensional embeddings (resolutions of 1 bp and 128 bp): represent the linear genomic sequence and are used to generate genomic track predictions;

* Two-dimensional embeddings (resolution of 2048 bp): represent spatial interactions between genomic segments and are used to predict pairwise contact maps.

General overview of the AlphaGenome model
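To make the three resolutions concrete, the short sketch below counts how many positions each embedding would hold for one input window. It is illustrative arithmetic only: the exact input length of 2**20 bases is an assumption standing in for "1 Mb", and the function names are hypothetical rather than part of any AlphaGenome interface.

```python
# Illustrative arithmetic only. INPUT_LENGTH_BP = 2**20 is an assumed
# stand-in for the "1 Mb" input described in the article.
INPUT_LENGTH_BP = 2**20  # 1,048,576 bases (assumption)


def num_positions(input_length_bp: int, resolution_bp: int) -> int:
    """Number of embedding positions when each covers `resolution_bp` bases."""
    if input_length_bp % resolution_bp != 0:
        raise ValueError("input length must be a multiple of the resolution")
    return input_length_bp // resolution_bp


def embedding_shapes(input_length_bp: int = INPUT_LENGTH_BP) -> dict:
    """Positional shapes (channel dimension omitted) of the three embeddings."""
    pair_bins = num_positions(input_length_bp, 2048)
    return {
        "1d_1bp": (num_positions(input_length_bp, 1),),
        "1d_128bp": (num_positions(input_length_bp, 128),),
        # Pairwise embedding: one entry per pair of 2048 bp bins.
        "2d_2048bp": (pair_bins, pair_bins),
    }


if __name__ == "__main__":
    for name, shape in embedding_shapes().items():
        print(name, shape)
```

Under that assumed length, the 128 bp embedding has 8,192 positions and the pairwise embedding is a 512 x 512 grid, which is why contact maps can be predicted at a much coarser resolution than the single-base tracks.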

The model's convolutional layers capture local sequence patterns to support fine-grained predictions, while its Transformer blocks capture longer-range dependencies, such as interactions between enhancers and promoters. Training on a complete 1 Mb sequence at single-base resolution is made possible by sequence parallelism, with the computation distributed across eight interconnected TPUv3 devices.
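The article does not describe how the sequence is divided among the devices. As a minimal illustration of the partitioning step only, assuming the input is cut into equal contiguous chunks (the function name `shard_sequence` is hypothetical), one could write:

```python
def shard_sequence(seq: str, n_devices: int = 8) -> list:
    """Split a sequence into `n_devices` equal contiguous chunks.

    Assumption: equal-size contiguous chunks. The real system's
    partitioning scheme is not described in the article.
    """
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    if len(seq) % n_devices != 0:
        raise ValueError("sequence length must be divisible by n_devices")
    chunk = len(seq) // n_devices
    return [seq[i * chunk:(i + 1) * chunk] for i in range(n_devices)]


if __name__ == "__main__":
    toy = "ACGT" * 4  # 16 bases standing in for a 1 Mb window
    for device_id, shard in enumerate(shard_sequence(toy)):
        print(device_id, shard)
```

A real sequence-parallel model also has to exchange information between devices, for example at chunk boundaries for convolutions and across all chunks for attention; this sketch deliberately leaves that out.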

For training, the researchers adopted a two-stage approach: pre-training followed by distillation. In the pre-training stage, they used existing experimental data to train two types of models, as shown in Figure b below:

* Fold-specific models: These models are trained with four-fold cross-validation. Three quarters of the segments in the reference genome are used for training, and the remaining quarter is held out for validation and testing. These models are used to evaluate how well AlphaGenome's genomic track predictions generalize to unseen segments of the reference genome.

* All-folds models: These models are trained on all available segments of the reference genome and serve as Teacher models in the subsequent distillation stage, as shown in Figure c below.

AlphaGenome training process
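The four-fold scheme used for the fold-specific models can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the segment identifiers are made up, the random assignment of segments to folds is an assumption, and so is the even split of the held-out quarter into validation and test sets.

```python
import random


def assign_folds(segments, n_folds=4, seed=0):
    """Randomly assign each segment to one of `n_folds` near-equal folds."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    return {seg: i % n_folds for i, seg in enumerate(shuffled)}


def split_for_fold(fold_of, held_out_fold):
    """Train on every fold except one; split the held-out fold in half.

    Assumption: the held-out quarter is divided evenly between
    validation and test. The article does not give the proportions.
    """
    train = sorted(s for s, f in fold_of.items() if f != held_out_fold)
    held_out = sorted(s for s, f in fold_of.items() if f == held_out_fold)
    half = len(held_out) // 2
    return {"train": train, "valid": held_out[:half], "test": held_out[half:]}


if __name__ == "__main__":
    segments = [f"seg_{i}" for i in range(16)]  # hypothetical segment IDs
    fold_of = assign_folds(segments)
    for fold in range(4):
        split = split_for_fold(fold_of, fold)
        print(fold, len(split["train"]), len(split["valid"]), len(split["test"]))
```

Each of the four fold-specific models sees a different three quarters of the segments, so every segment is held out exactly once. An all-folds model would simply train on every segment.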

In the distillation stage, the researchers trained a Student model that shares the architecture used in pre-training. Its objective is to predict the combined output of multiple all-folds Teacher models from input sequences that have been randomly augmented. Previous studies have shown that a distilled model of this kind can achieve greater robustness and higher accuracy in variant effect prediction (VEP) within a single model instance.
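The distillation objective can be illustrated with a toy example. Everything below is a sketch under stated assumptions: the "combined output" is taken to be a simple mean over teachers, the augmentations are a random shift and a random reverse complement, the loss is mean squared error, and the teachers are trivial stand-in functions. None of these choices is confirmed by the article.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse the sequence and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def augment(seq, rng, max_shift=3):
    """Assumed augmentations: random shift (padded with N) and random
    reverse complement. Expects len(seq) > max_shift."""
    shift = rng.randint(-max_shift, max_shift)
    if shift > 0:
        seq = "N" * shift + seq[:-shift]
    elif shift < 0:
        seq = seq[-shift:] + "N" * (-shift)
    if rng.random() < 0.5:
        seq = reverse_complement(seq)
    return seq


def make_toy_teacher(weight):
    """Hypothetical stand-in teacher: scores each G or C base by `weight`."""
    def teacher(seq):
        return [weight if base in "GC" else 0.0 for base in seq]
    return teacher


def distillation_target(seq, teachers):
    """Assumed combination rule: per-position mean over all teachers."""
    preds = [teacher(seq) for teacher in teachers]
    return [sum(values) / len(preds) for values in zip(*preds)]


def student_loss(student_pred, target):
    """Mean squared error between student prediction and teacher target."""
    return sum((p - t) ** 2 for p, t in zip(student_pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    teachers = [make_toy_teacher(1.0), make_toy_teacher(3.0)]
    augmented = augment("ACGTACGTAC", rng)
    target = distillation_target(augmented, teachers)
    print(augmented, target)
```

In an actual training loop the student's parameters would be updated to reduce `student_loss` on freshly augmented sequences at every step, so the student learns the ensemble's behaviour on perturbed inputs rather than only on the reference genome.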

Thanks to this design, the Student model can predict variant effects for all modalities and cell types with a single device call. On an NVIDIA H100 GPU, scoring each variant takes less than one second, which makes it far more efficient than traditional multi-model ensembles for large-scale variant effect prediction.

AlphaGenome leads in various genomic prediction tasks

According to DeepMind, AlphaGenome has the following unique advantages compared to existing methods:

Long-sequence context + single-base resolution

AlphaGenome can analyze DNA sequences up to one million bases long and make predictions at single-base resolution. This lets it cover distant regions that regulate genes while still capturing fine biological detail. Previous models had to trade sequence length against resolution, which limited both the range of modalities they could model and their accuracy. AlphaGenome removes this trade-off. It uses only half the compute of the original Enformer model, and a single training run completes in just 4 hours.

Comprehensive multi-modal prediction ability

The combination of high resolution and long input sequences enables AlphaGenome to predict an unprecedented variety of regulatory modalities, providing researchers with more systematic information on gene regulation.

Efficient variant scoring

AlphaGenome can score the impact of a variant in under one second. By comparing the predictions for the reference and mutated sequences, and summarizing the difference with a method suited to each modality, it can quickly and accurately evaluate the potential impact of a genetic variant on molecular mechanisms.
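The compare-and-summarize idea can be shown with a toy example. This is a minimal sketch, not the AlphaGenome API: `toy_predict` is a hypothetical stand-in for a sequence-to-track model, and summing the difference in a small window around the variant is just one assumed summary among the modality-specific ones the article alludes to.

```python
def apply_snv(ref_seq, pos, alt_base):
    """Return the sequence with the base at 0-based `pos` replaced."""
    if not 0 <= pos < len(ref_seq):
        raise IndexError("variant position outside the sequence")
    if ref_seq[pos] == alt_base:
        raise ValueError("alternate base equals the reference base")
    return ref_seq[:pos] + alt_base + ref_seq[pos + 1:]


def toy_predict(seq):
    """Hypothetical stand-in model: one track, 1.0 at every G or C."""
    return {"toy_track": [1.0 if base in "GC" else 0.0 for base in seq]}


def score_variant(ref_seq, pos, alt_base, predict=toy_predict, window=2):
    """Score a variant as the summed (alt - ref) prediction near the variant.

    Assumption: a windowed sum is used as the summary for every track.
    """
    alt_seq = apply_snv(ref_seq, pos, alt_base)
    ref_pred = predict(ref_seq)
    alt_pred = predict(alt_seq)
    lo = max(0, pos - window)
    hi = min(len(ref_seq), pos + window + 1)
    return {
        track: sum(a - r for a, r in zip(alt_pred[track][lo:hi], ref_pred[track][lo:hi]))
        for track in ref_pred
    }


if __name__ == "__main__":
    print(score_variant("AAAAA", 2, "G"))
```

Because both sequences pass through the same model, one pair of forward passes yields a score for every predicted track at once, which is what makes scoring across all modalities and cell types cheap.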

Novel splice junction modeling

AlphaGenome introduces direct prediction of the location and expression level of RNA splice junctions from sequence. Many rare genetic diseases, such as spinal muscular atrophy and some forms of cystic fibrosis, are linked to splicing errors, so this capability provides a new tool for investigating their causes.

Excellent performance in benchmark tests

AlphaGenome leads across a range of genomic prediction tasks. For example, it can predict which parts of the DNA lie in close spatial proximity, the impact of variants on gene expression, and changes in splicing patterns. It outperforms the best existing models in 22 out of 24 DNA sequence prediction evaluations and matches or exceeds the best existing models in 24 out of 26 variant effect prediction tasks. More importantly, it is the only model that can jointly predict all evaluated modalities, which demonstrates its versatility.

Specifically, to evaluate AlphaGenome's performance, the researchers first examined how well it generalizes to unseen genomic segments, which is a prerequisite for high-quality variant effect prediction. They ran a total of 24 genomic track prediction evaluations covering all 11 modalities predicted by the model. In these out-of-fold evaluations, they used the pre-trained fold-specific AlphaGenome models and compared their predictions with those of the strongest available external model for each task.

The results show that AlphaGenome outperforms the corresponding external models in 22 of these 24 evaluations, as shown in Figure d below. Notably, in predicting cell-type-specific changes in gene expression (log-fold change, LFC), AlphaGenome shows a +17.4% relative improvement over Borzoi, another multi-modal sequence model, as shown in Figure e below.

In addition, AlphaGenome also outperforms specialized models that focus on a single modality. For example:

* In chromatin contact map prediction, AlphaGenome surpasses the Orca model, with a +6.3% improvement in the Pearson correlation of the contact map and a +42.3% improvement on cell-type-specific differences, as shown in Figure d below;

* In predicting transcription start site tracks, AlphaGenome outperforms ProCapNet, with a +15% improvement in the Pearson correlation of total counts;

* In predicting chromatin accessibility, AlphaGenome outperforms ChromBPNet, with a +8% improvement for ATAC-seq and a +19% improvement for DNase-seq.
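The article does not define how these relative improvements are calculated. Assuming the simplest definition, the percentage change of a metric relative to the baseline model's value, the calculation would look like the sketch below. The function name and the example numbers are made up for illustration and are not results from the paper.

```python
def relative_improvement_pct(new_score: float, baseline_score: float) -> float:
    """Percentage change of `new_score` relative to `baseline_score`.

    Assumption: plain relative change. The paper may use a different
    definition, for example one normalized against a reference baseline.
    """
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (new_score - baseline_score) / abs(baseline_score)


if __name__ == "__main__":
    # Made-up correlations, for illustration only.
    print(relative_improvement_pct(0.75, 0.5))
```

With this definition a relative improvement depends on the baseline's starting point: the same absolute gain in correlation produces a larger percentage when the baseline score is low.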

* Left panel d: Relative performance improvement (in %) of AlphaGenome in genomic track prediction tasks of different modalities and resolutions. PA stands for polyadenylation.

* Right panel e: Relative performance improvement of AlphaGenome in selected variant effect prediction tasks.

AlphaGenome receives high praise as an industry milestone

Since its release, AlphaGenome has sparked lively discussion on X (formerly Twitter).

Pushmeet Kohli, vice president of research at DeepMind, wrote, "AlphaGenome provides a comprehensive view of the human non-coding genome by predicting the impact of DNA variations. It will deepen our understanding of disease biology and open up new research avenues." In the replies, beyond expressions of amazement and praise, most people wanted to know how to use it.

A doctoral student in genetics at the University of Edinburgh said, "This model may completely redefine the way we discover disease-causing mutations and drug targets. It is of great significance."

A commentator in the field of biological sciences said, "AlphaGenome is not just about a single gene but the entire regulatory genome. If DNA is compared to code, then AlphaGenome is the software composed of the code."

In practice, AlphaGenome has broad research potential. In disease mechanism research, it can predict more accurately how genetic variants affect regulatory processes, identify potentially disease-causing variants, and reveal new targets; it is especially well suited to studying rare variants with large effects. In synthetic biology, it can guide the design of DNA with specific regulatory functions, such as sequences that activate a target gene only in nerve cells. In basic genomics research, it can speed up the mapping and characterization of key functional elements and help identify the "core instructions" needed to regulate the functions of specific cell types.

Professor Marc Mansour of University College London commented, "When identifying the role of non-coding variations on a large scale, AlphaGenome provides a key piece of the puzzle, enabling us to better understand complex diseases such as cancer." AlphaGenome is currently available for non-commercial research, and we look forward to more academic work built on it.

This article is from the WeChat official account "HyperAI Superneural", author: Li Baozhu & Yeye. It is published by 36Kr with authorization.

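The views expressed in this article are solely those of the author. The 36Kr platform only provides information storage services.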
