Google AlphaGenome: The AI "Microscope" for Observing Human DNA Is Born

Input a DNA sequence of up to one million base pairs to predict thousands of molecular properties.

On June 25, Zhidx reported that Google DeepMind launched AlphaGenome today, an AI model that helps researchers quickly predict the impact of genetic variants.

AlphaGenome is like an "AI microscope for observing human DNA". It takes long DNA sequences of up to 1 million base pairs as input and predicts thousands of molecular properties that characterize its regulatory activity. It has achieved state-of-the-art performance in more than 20 extensive genomic prediction benchmarks.

Compared with existing DNA sequence models, AlphaGenome has several unique features: it supports high-resolution long-sequence context, comprehensive multimodal prediction, efficient variant scoring, and novel splice-junction modeling.

Currently, Google is providing a preview version of AlphaGenome through the AlphaGenome API for non-commercial research use and plans to release the model in the future.

Dr. Caleb Lareau of Memorial Sloan Kettering Cancer Center said, "This is a milestone for the field. For the first time, we have a single model that unifies long-range context, base-level precision, and state-of-the-art performance across a whole spectrum of genomic tasks."

Paper link: https://storage.googleapis.com/deepmind-media/papers/alphagenome.pdf

01. Input of up to a million DNA base pairs to predict thousands of molecular properties

The AlphaGenome model takes long DNA sequences of up to 1 million base pairs as input and predicts thousands of molecular properties that characterize its regulatory activity. It can also evaluate the impact of genetic variations or mutations by comparing the prediction results of mutated sequences with those of unmutated sequences.
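The comparison of mutated and unmutated sequences described above can be sketched in a few lines of Python. This is a toy illustration only: `toy_predict` is a hypothetical stand-in (GC fraction per window), not AlphaGenome, and the helper names `apply_snv` and `variant_effect` are invented for this example.

```python
def apply_snv(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return ref_seq with the base at 0-based position pos changed from ref to alt."""
    if ref_seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {ref_seq[pos]!r}")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def toy_predict(seq: str, window: int = 4) -> list[float]:
    """Hypothetical stand-in for a model: GC fraction in each non-overlapping window."""
    return [
        sum(base in "GC" for base in seq[i:i + window]) / window
        for i in range(0, len(seq) - window + 1, window)
    ]


def variant_effect(ref_seq: str, pos: int, ref: str, alt: str) -> list[float]:
    """Per-window difference between predictions for the mutated and unmutated sequence."""
    alt_seq = apply_snv(ref_seq, pos, ref, alt)
    ref_pred = toy_predict(ref_seq)
    alt_pred = toy_predict(alt_seq)
    return [a - r for r, a in zip(ref_pred, alt_pred)]


# A T-to-G change at position 1 only alters the first window of the toy track.
effect = variant_effect("ATATGCGCATAT", pos=1, ref="T", alt="G")
```

In a real workflow the stand-in predictor would be replaced by model calls on the two sequences; the subtraction step stays the same.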

The predicted properties include the start and end positions of genes in different cell types and tissues, the positions of gene splicing, the amount of RNA produced, and which DNA bases are accessible, close to each other, or bound to certain proteins. The training data comes from large public consortia, including ENCODE, GTEx, 4D Nucleome, and FANTOM5. These consortia have experimentally measured these properties, covering important patterns of gene regulation in hundreds of human and mouse cell types and tissues.

The following animation shows that AlphaGenome takes one million DNA letters as input and predicts different molecular properties in different tissues and cell types.

The AlphaGenome architecture uses convolutional layers to initially detect short patterns in genomic sequences, uses transformers to transmit information across all positions in the sequence, and finally uses a series of layers to convert the detected patterns into predictions of different modalities. During training, this computation is distributed across multiple interconnected tensor processing units (TPUs) for a single sequence.
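As a rough illustration of the three stages named above (convolution for short patterns, attention across all positions, then per-modality output layers), here is a minimal NumPy sketch. It is not the actual AlphaGenome architecture: the weights are random, the layer sizes are arbitrary, there is a single attention layer, and the head names `rna_seq` and `accessibility` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) array over A, C, G, T."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, index[base]] = 1.0
    return out


def conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Same-padded 1D convolution with ReLU. x: (L, C_in); kernel: (K, C_in, C_out), K odd."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    windows = np.stack([padded[i:i + k] for i in range(x.shape[0])])  # (L, K, C_in)
    return np.maximum(np.einsum("lkc,kco->lo", windows, kernel), 0.0)


def self_attention(x: np.ndarray, wq: np.ndarray, wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    """Single-head self-attention with a residual connection. x: (L, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + weights @ v


def forward(seq: str, params: dict) -> dict:
    """Convolution -> attention -> one linear head per output modality."""
    h = conv1d(one_hot(seq), params["conv"])
    h = self_attention(h, params["wq"], params["wk"], params["wv"])
    return {name: h @ w for name, w in params["heads"].items()}


D = 8  # hidden width, chosen arbitrarily for the sketch
params = {
    "conv": rng.normal(size=(5, 4, D)),
    "wq": rng.normal(size=(D, D)),
    "wk": rng.normal(size=(D, D)),
    "wv": rng.normal(size=(D, D)),
    "heads": {
        "rna_seq": rng.normal(size=(D, 3)),        # 3 hypothetical tracks
        "accessibility": rng.normal(size=(D, 2)),  # 2 hypothetical tracks
    },
}
outputs = forward("ACGTACGTACGT", params)
```

Each head returns one value per input base and per track, which mirrors the idea of base-resolution predictions for several modalities from one shared representation.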

This model is based on Google's previous genomics model, Enformer, and complements AlphaMissense, which is specifically designed to classify the impact of variations within protein-coding regions. These regions cover 2% of the genome. The remaining 98% of the regions, called non-coding regions, are crucial for regulating gene activity and contain many disease-related variations. AlphaGenome provides a new perspective for interpreting these extensive sequences and the variations within them.

02. High-resolution long-sequence context and comprehensive multimodal prediction

Compared with existing DNA sequence models, AlphaGenome has several unique features:

1. High-resolution long-sequence context

Google's model analyzes up to one million DNA bases and makes predictions at single-base resolution. Long-sequence context is crucial for covering distant gene regulatory regions, while base resolution is crucial for capturing fine biological details.

Previous models had to make a trade-off between sequence length and resolution, which limited the range of modalities they could jointly model and accurately predict. Google's technological advancement has addressed this limitation without significantly increasing training resources - training a single AlphaGenome model (without data distillation) takes 4 hours, and the required computational budget is only half of that for training the original Enformer model.

2. Comprehensive multimodal prediction

By unlocking high-resolution predictions for long input sequences, AlphaGenome can predict a more diverse range of modalities than previous models. This gives scientists more comprehensive information about the complex steps of gene regulation.

3. Efficient variant scoring

In addition to predicting various molecular properties, AlphaGenome can efficiently score the impact of a genetic variant on all of these properties in about a second. It does this by comparing the predictions for the mutated and unmutated sequences and summarizing that comparison with a different method for each modality.
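The idea of summarizing the same mutated-versus-unmutated comparison differently for each type of output can be sketched as follows. The two summary functions and the mapping in `SUMMARIZERS` are assumptions chosen for illustration; the source does not specify which aggregation AlphaGenome uses for which output.

```python
import numpy as np


def log_fold_change(ref_track: np.ndarray, alt_track: np.ndarray, eps: float = 1e-3) -> float:
    """Expression-like summary: log2 ratio of total signal, mutated vs unmutated."""
    return float(np.log2((alt_track.sum() + eps) / (ref_track.sum() + eps)))


def max_abs_diff(ref_track: np.ndarray, alt_track: np.ndarray) -> float:
    """Local-signal summary: largest per-position change between the two tracks."""
    return float(np.abs(alt_track - ref_track).max())


# Hypothetical mapping from output type to summary method.
SUMMARIZERS = {
    "rna_seq": log_fold_change,
    "accessibility": max_abs_diff,
}


def score_variant(ref_pred: dict, alt_pred: dict) -> dict:
    """Reduce each pair of predicted tracks to a single variant score."""
    return {name: SUMMARIZERS[name](ref_pred[name], alt_pred[name]) for name in ref_pred}


# Made-up predictions for the unmutated (ref) and mutated (alt) sequence.
ref_pred = {
    "rna_seq": np.array([1.0, 2.0, 1.0]),
    "accessibility": np.array([0.2, 0.8, 0.1]),
}
alt_pred = {
    "rna_seq": np.array([2.0, 4.0, 2.0]),
    "accessibility": np.array([0.2, 0.3, 0.1]),
}
scores = score_variant(ref_pred, alt_pred)
```

With these made-up numbers, the expression-like score is close to 1 (the total signal roughly doubles) and the accessibility score is about 0.5 (the largest single-position drop).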

4. Novel splice-junction modeling

Many rare genetic diseases, such as spinal muscular atrophy and certain forms of cystic fibrosis, can be caused by errors in RNA splicing. RNA splicing is the process in which parts of an RNA molecule are removed, or "spliced out", and the remaining ends are rejoined. For the first time, AlphaGenome can directly and explicitly model the positions and expression levels of these splice junctions from the sequence, providing a deeper understanding of how genetic variants affect RNA splicing.
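To make the splicing step described above concrete, here is a small sketch that removes an intron from a toy pre-mRNA and records where the remaining ends are joined. The function `splice`, the coordinate convention (0-based, end-exclusive), and the example sequence are all assumptions for illustration; this shows what a junction is, not how AlphaGenome predicts one.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> tuple[str, list[tuple[int, int]]]:
    """Remove introns and return (mature RNA, junctions).

    Introns are (start, end) pairs, 0-based and end-exclusive. Each junction is
    (donor, acceptor): the last base kept before the intron and the first base
    kept after it, in pre-mRNA coordinates.
    """
    exons = []
    junctions = []
    cursor = 0
    for start, end in sorted(introns):
        # Require a non-empty exon on both sides of every intron.
        if not (cursor < start < end < len(pre_mrna)):
            raise ValueError("introns must be non-overlapping and flanked by exons")
        exons.append(pre_mrna[cursor:start])
        junctions.append((start - 1, end))
        cursor = end
    exons.append(pre_mrna[cursor:])
    return "".join(exons), junctions


# Toy transcript: exon 1 + intron (starts with GU, ends with AG) + exon 2.
pre = "AUGGCC" + "GUAAGUUUUCAG" + "GAAUAA"
mature_rna, junctions = splice(pre, [(6, 18)])
```

A variant that disrupts the bases around a donor or acceptor position can change which junctions are used, which is the kind of effect a splice-junction model aims to capture.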

03. Best performance in over 20 benchmarks

AlphaGenome has achieved state-of-the-art performance in extensive genomic prediction benchmarks, such as predicting which parts of a DNA molecule will be close to each other, whether genetic variations will increase or decrease gene expression, or whether they will change the gene splicing pattern.

The bar chart below shows the relative improvement of AlphaGenome on selected DNA sequence and variant effect tasks and compares the results with the current best methods in each category.

When making predictions for individual DNA sequences, AlphaGenome outperformed the best external models in 22 out of 24 evaluations. When predicting the regulatory effects of variants, it matched or exceeded the best external models in 24 out of 26 evaluations.

The comparison includes models specialized for individual tasks. AlphaGenome is the only model that can jointly predict all evaluated modalities, demonstrating its versatility.

04. Unified model for faster hypothesis generation and testing

The versatility of AlphaGenome enables scientists to simultaneously explore the impact of a single variation on multiple modalities through a single API call. This means that scientists can generate and test hypotheses more quickly without using multiple models to study different modalities.

Moreover, AlphaGenome's excellent performance indicates that it has learned relatively general DNA sequence representations in the context of gene regulation. This lays a solid foundation for a broader research community. Once the model is fully released, scientists will be able to adjust and fine-tune it on their own datasets to better address their unique research questions.

Finally, this approach provides a flexible and scalable architecture for the future. By expanding the training data, the capabilities of AlphaGenome can be extended to achieve better performance, cover more species, or include more modalities, making the model more comprehensive.

05. Facilitating disease understanding, basic research, etc.

The predictive capabilities of AlphaGenome can assist in various research avenues:

1. Disease understanding: By more accurately predicting the effects of genetic variants, AlphaGenome can help researchers pinpoint the potential causes of diseases and better interpret the functional impact of variants associated with certain traits, potentially leading to the discovery of new therapeutic targets. Google believes the model is particularly suitable for studying rare variants with potentially large effects, such as those causing rare Mendelian diseases.

2. Synthetic biology: Its predictions can be used to guide the design of synthetic DNA with specific regulatory functions - for example, activating genes only in nerve cells rather than muscle cells.

3. Basic research: It can accelerate our understanding of the genome by helping to map the key functional elements of the genome, defining their roles, and identifying the most important DNA instructions that regulate the functions of specific cell types.

For example, Google used AlphaGenome to study the potential mechanism of a cancer-related mutation. In an existing study of patients with T-cell acute lymphoblastic leukemia (T-ALL), researchers observed mutations at specific genomic locations. Using AlphaGenome, they predicted that these mutations would activate the nearby TAL1 gene by introducing MYB DNA-binding motifs, which replicates the known disease mechanism and highlights AlphaGenome's ability to associate specific non-coding variations with disease genes.

Professor Marc Mansour from University College London said, "AlphaGenome will become a powerful tool in the field. Determining the relevance of different non-coding variants can be extremely challenging, especially at scale. This tool will provide crucial clues to help us better understand diseases such as cancer."

06. Conclusion: An important step in AI gene prediction

AlphaGenome marks an important step forward in AI gene prediction, but it still has its limitations.

As with other sequence-based models, accurately capturing the impact of very distant regulatory elements (such as those more than 100,000 DNA bases away) remains an unsolved challenge for AlphaGenome.

Meanwhile, Google has not designed or validated AlphaGenome for personal genome prediction. Although AlphaGenome can predict molecular outcomes, it does not fully reveal how genetic variations lead to complex traits or diseases.

This article is from the WeChat official account "Zhidx" (ID: zhidxcom). Author: Li Shuiqing, Editor: Xinyuan. Republished by 36Kr with permission.