
MIT Develops Pichia-CLM Model to Learn Yeast DNA "Language", with Exogenous Protein Yield Up to Three Times Higher

HyperAI, 2026-03-02 16:45
Higher expression yields were consistently observed compared to four commercial codon optimization tools.

A research team from the Massachusetts Institute of Technology has proposed a deep learning-based language model, Pichia-CLM, for codon optimization in the industrial-related host Pichia pastoris (Komagataella phaffii) to enhance the production of recombinant proteins. The researchers experimentally verified Pichia-CLM on six types of proteins with different complexities and consistently observed higher expression yields compared to four commercial codon optimization tools.

In the fields of biopharmaceuticals and industrial biotechnology, the efficient expression of recombinant proteins has always been the core factor determining production costs and process feasibility. From monoclonal antibodies and vaccine antigens to industrial enzyme preparations, even a slight increase in expression levels can bring significant economic value.

Among many expression systems, Pichia pastoris (Komagataella phaffii) has become one of the important hosts for industrial production due to its high-density fermentation capacity, mature secretion expression system, and good protein processing ability. However, a long-standing problem in the industry is that even if the amino acid sequences are completely identical, simply changing the "synonymous codons" in the encoding DNA can lead to orders-of-magnitude differences in expression levels.

This phenomenon stems from codon usage bias (CUB): in many organisms, certain synonymous codons are preferentially used. The choice of synonymous codons can affect protein production by influencing transcription, mRNA stability, translation, protein folding, post-translational modifications (PTMs), and solubility. Therefore, "codon optimization" has become a crucial step in the expression of foreign proteins.
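Codon usage bias can be made concrete with a small sketch: counting how often each synonymous codon for one amino acid appears in a coding sequence. The glycine codon table and the example sequence below are illustrative, not data from the paper.

```python
from collections import Counter

# Illustrative example: the four synonymous codons for glycine.
SYNONYMS = {"GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly"}

def codon_frequencies(cds: str) -> dict:
    """Relative frequency of each glycine codon in a coding sequence."""
    # Split the sequence into in-frame codons, dropping any trailing bases.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    gly = [c for c in codons if c in SYNONYMS]
    counts = Counter(gly)
    total = sum(counts.values()) or 1  # avoid division by zero
    return {c: counts[c] / total for c in SYNONYMS}

# Made-up coding sequence, not from the paper.
freqs = codon_frequencies("ATGGGTGGCGGTGGCGGATAA")
```

A biased host would show strongly skewed frequencies here; codon optimization tools traditionally try to match that skew.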

Currently, the industry has developed various codon optimization tools and methods based on the host's CUB. However, these methods may still fail to stably produce high-expression constructs. In recent years, with the development of artificial intelligence, especially sequence modeling techniques, researchers have begun to regard gene sequences as a "language" and attempt to learn the implicit rules through methods similar to natural language processing.

In this context, a research team from the Massachusetts Institute of Technology has proposed a deep learning-based language model, Pichia-CLM, for codon optimization in the industrial-related host Pichia pastoris (Komagataella phaffii) to enhance the production of recombinant proteins. Different from traditional methods that rely on CUB indicators (usually only providing global scores and ignoring sequence context), Pichia-CLM uses host genome data to unbiasedly learn the mapping relationship between amino acids and codons. The researchers experimentally verified Pichia-CLM on six types of proteins with different complexities and consistently observed higher expression yields compared to four commercial codon optimization tools.

The relevant research results, titled "Pichia-CLM: A language model–based codon optimization pipeline for Komagataella phaffii," have been published in PNAS.

Research Highlights:

* Pichia-CLM uses host genome data to unbiasedly learn the mapping relationship between amino acids and codons, considering not only host preferences but also position dependence and long-range context relationships.

* Pichia-CLM was experimentally verified on six types of proteins with different complexities, and higher expression yields were consistently observed.

* The amino acid and codon embeddings learned by the model can be grouped according to physical and chemical properties, indicating that the language model can capture physically meaningful rules.

Paper Link: https://www.pnas.org/doi/10.1073/pnas.2522052123

Constructing a Large-Scale Sequence Dataset Centered on Pichia pastoris

Different from traditional methods that rely on empirical rules, the core idea of Pichia-CLM is to directly learn the encoding rules from the host genome. For this purpose, the research team constructed a large-scale sequence dataset centered on Pichia pastoris.

To train Pichia-CLM, the researchers collected amino acid and coding sequence data for two Pichia pastoris variants, CBS7435 and GS115, from NCBI. They supplemented these with data from their laboratory's previous genome sequencing and annotation work, covering GS115, K. phaffii (NRRL Y11430), and K. pastoris; in total, approximately 27,000 amino acid–coding sequence pairs were ultimately used.

During data processing, the researchers tokenized the amino acids and codons and introduced start (<START>), end (<END>), and padding (<PAD>) tokens so that the model could process sequences of different lengths and support batch training. The dataset was also divided into a training set and a test set, with approximately 20% of the data held out to evaluate the model's prediction ability on unseen sequences.
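The tokenization step above can be sketched as follows; the vocabulary layout and token IDs are illustrative assumptions, not the paper's actual encoding.

```python
# Special tokens come first, so <PAD> = 0, <START> = 1, <END> = 2 (assumed layout).
SPECIALS = ["<PAD>", "<START>", "<END>"]
AA_VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list("ACDEFGHIKLMNPQRSTVWY"))}

def encode(seq: str, max_len: int) -> list:
    """Map an amino acid sequence to padded integer token IDs."""
    ids = [AA_VOCAB["<START>"]] + [AA_VOCAB[a] for a in seq] + [AA_VOCAB["<END>"]]
    ids += [AA_VOCAB["<PAD>"]] * (max_len - len(ids))  # right-pad to batch length
    return ids

# Two sequences of different lengths padded to the same batch length.
batch = [encode("MKV", 8), encode("MKVLA", 8)]
```

Padding to a common length is what allows variable-length sequences to share one training batch.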

Notably, this data construction method did not artificially introduce any "optimization goals" but was entirely based on natural genome data. This means that the model learned the host's real expression preferences rather than artificially set approximate rules, laying the foundation for subsequent performance improvement.

Pichia-CLM Adopts an Encoder-Decoder Architecture Based on GRU

Model Architecture

Pichia-CLM adopts an encoder-decoder architecture based on gated recurrent units (GRU). GRU is an improved recurrent neural network structure designed to capture long-range and short-range dependencies in sequence data. By regulating the flow of information through a gating mechanism, GRU effectively alleviates the common gradient vanishing problem in traditional RNNs. Moreover, GRU can compete with long short-term memory networks (LSTM) in terms of performance but requires fewer parameters and less computational resources, making it more efficient in many sequence modeling tasks.

Compared with the other mainstream architecture, the Transformer, a GRU offers higher computational efficiency and lower resource consumption on small and medium-sized datasets. The researchers found that, for a dataset of approximately 27,000 sequences, a Transformer would introduce unnecessary complexity, while a GRU achieves a better balance between performance and efficiency.
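The GRU gating mechanism described above can be sketched in miniature with a scalar hidden state; real layers use vectors and learned weight matrices, and the constants below are illustrative, not trained parameters.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev: float, x: float, w: dict) -> float:
    """One GRU update: gates decide how much past state to keep or overwrite."""
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev)               # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev)               # reset gate
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_cand                    # gated interpolation

# Illustrative weights; a trained layer would learn these.
w = {"wz": 0.5, "uz": 0.1, "wr": 0.3, "ur": 0.2, "wh": 0.8, "uh": 0.4}
h = 0.0
for x in [1.0, -0.5, 0.2]:  # run the cell over a short input sequence
    h = gru_step(h, x, w)
```

Because the new state is a gated interpolation rather than a full overwrite, gradients can flow across many time steps, which is how the GRU mitigates the vanishing-gradient problem mentioned above.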

The model takes the amino acid sequence of a protein as input and generates the corresponding DNA sequence based on the patterns learned from the host's amino acid and coding sequences. The overall architecture is shown in the following figure:

Workflow and schematic diagram of Pichia-CLM

Model Training Process

During training, the researchers used a validation set (20% of the training set) for early stopping. Hyperparameters were selected to minimize the validation set loss (sparse categorical cross-entropy), using Bayesian optimization, a global optimization strategy, implemented with code developed in-house by the researchers.

Specifically, the following hyperparameters were involved in the model:

* Amino acid embedding dimension

* Codon embedding dimension

* Number of units in the encoder layer

* Size of the codon fully connected layer in the decoder

* Size of the amino acid fully connected layer in the decoder

During the model training phase, the decoder input was the real coding sequence (i.e., real codons). In the prediction phase, the model used the codon predicted at the previous position as the input for the next position, thus achieving fully autoregressive prediction. When a stop codon was encountered, the sequence prediction terminated.
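The autoregressive prediction loop described above can be sketched as follows; `toy_model` is a hypothetical stand-in for the trained decoder, emitting one codon per residue and then a stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def toy_model(protein: str, pos: int, prev_codon: str) -> str:
    """Hypothetical stand-in for the decoder: one fixed codon per residue.
    A real model would condition on prev_codon and the encoder state."""
    table = {"M": "ATG", "K": "AAG", "V": "GTT"}
    return table[protein[pos]] if pos < len(protein) else "TAA"

def decode(protein: str, max_len: int = 100) -> list:
    codons, prev = [], "<START>"
    for pos in range(max_len):
        codon = toy_model(protein, pos, prev)
        codons.append(codon)
        if codon in STOP_CODONS:  # prediction terminates at a stop codon
            break
        prev = codon              # feed the prediction back as the next input
    return codons

cds = decode("MKV")  # -> ["ATG", "AAG", "GTT", "TAA"]
```

During training, `prev` would instead be the real codon from the coding sequence (teacher forcing); only at prediction time is the model's own output fed back.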

After completing the architecture selection and verifying the prediction ability on the test set, the researchers retrained the final model using the complete dataset and continued to adopt the early stopping strategy to avoid overfitting. This final model was used to design the coding sequences of foreign proteins.

Pichia-CLM Can Generate Constructs for High-Yield Protein Production

In the experimental verification part, the research team selected six proteins with different complexities for testing, including:

* Human growth hormone (hGH)

* Human granulocyte colony-stimulating factor (hGCSF)

* VHH nanobody 3B2 (34)

* Engineered SARS-CoV-2 RBD subunit variant (RBD) (35)

* Human serum albumin (HSA)

* IgG1 monoclonal antibody trastuzumab (Trast)

Performance of Pichia-CLM in Enhancing Protein Secretion in Pichia pastoris

First, the researchers selected three human-derived proteins of different sizes and complexities, hGH, hGCSF, and HSA, and compared protein secretion yields (titers) between the gene constructs generated by Pichia-CLM and their natural coding sequences. Overall, for hGH and hGCSF, yields increased by approximately 25%; for HSA, an approximately threefold increase was observed.

Subsequently, the researchers compared Pichia-CLM with four commercial codon optimization tools: Azenta, IDT, GenScript, and Thermo Fisher (Thermo). They evaluated the six aforementioned proteins using two indicators:

* BestTiter: The number of proteins for which a method achieved the highest titer

* Aggregated Score: The sum of the relative titers (normalized to the maximum) of different proteins
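The two indicators can be sketched as follows on made-up titer numbers (the real values are reported in the paper's figures):

```python
# Illustrative titers (protein -> method -> yield); not the paper's data.
titers = {
    "hGH": {"Pichia-CLM": 1.2, "GenScript": 1.0, "IDT": 0.8},
    "HSA": {"Pichia-CLM": 2.9, "GenScript": 3.0, "IDT": 2.1},
}

def best_titer(method: str) -> int:
    """Number of proteins for which `method` achieved the highest titer."""
    return sum(1 for t in titers.values() if max(t, key=t.get) == method)

def aggregated_score(method: str) -> float:
    """Sum of relative titers, each normalized to the per-protein maximum."""
    return sum(t[method] / max(t.values()) for t in titers.values())

bt = best_titer("Pichia-CLM")        # wins on hGH but not HSA -> 1
agg = aggregated_score("Pichia-CLM")  # 1.2/1.2 + 2.9/3.0
```

The normalization in the aggregated score is what lets a method that narrowly loses on one protein (as Pichia-CLM did on HSA) still score close to the maximum overall.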

Overall, Pichia-CLM outperformed the commercial algorithms in both indicators (as shown in Figure C below); it achieved the highest titer in five out of the six proteins, and only had a slightly lower titer for HSA, resulting in a slight decrease in the aggregated score (approximately 0.2) (as shown in Figure D below).

(C) Ranking of different codon optimization algorithms based on two indicators;

(D) Comparison of the codon optimization efficiency of Pichia-CLM and commercial algorithms for different molecules

Evaluation of Genetic Sequence Characteristics

After verifying Pichia-CLM's performance in foreign protein production, the researchers further analyzed the genetic sequence characteristics of the different designed constructs. Codon optimization, including other reported protein language models, typically relies on one or more codon usage bias (CUB) indicators for design or evaluation, so the researchers evaluated the correlation between these CUB indicators and protein production using data from the six tested proteins.

The results showed that none of these indicators showed a consistent and high correlation with production among different proteins. For example, in the case of HSA (as shown in Figure A below), the maximum positive correlation with codon volatility and codon frequency distribution (CFD) was only 0.43, and the maximum negative correlation with codon pair score (CPS) was only 0.25.
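The correlation analysis can be sketched as a Pearson correlation between a CUB indicator and measured titers across constructs; all numbers below are illustrative, not the paper's data.

```python
import math

def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-construct values: a CUB score (e.g. CFD) vs measured titer.
cub_scores = [0.61, 0.72, 0.55, 0.80, 0.66]
measured   = [1.1,  0.9,  1.0,  1.3,  1.2]
r = pearson(cub_scores, measured)
```

A consistently predictive indicator would yield |r| close to 1 across proteins; the paper's point is that no global CUB indicator achieved this.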

Comparison of the predicted number of negative cis-regulatory elements in the sequences designed by Pichia-CLM and commercial algorithms for the experimentally tested proteins

Global CUB indicators calculated over the entire sequence are therefore clearly limited in characterizing the features relevant to foreign protein production. This points to the need both for new indicators to evaluate codon optimization tools and for rigorous experimental verification with diverse proteins; the result directly challenges the theoretical basis of traditional codon optimization.

Sequence Feature Evaluation

The researchers also evaluated the presence of negative cis-regulatory elements in different codon-optimized constructs. These elements may interfere with the host's regulatory mechanisms and should be avoided as much as possible in foreign DNA sequences.

Among the six tested proteins, no negative cis-regulatory elements were detected in the constructs designed by Pichia-CLM. In contrast, GenScript's constructs contained one negative cis-regulatory element for three of the six proteins, while Azenta and IDT produced sequences containing 3 to 4 such elements for at least one protein, as shown in Figure B below:

Comparison of the distribution of negative cis-regulatory elements in the optimized sequences of Pichia-CLM and GenScript for 52 biotechnology-related benchmark proteins

The researchers also analyzed the performance of Pichia-CLM on 52 biotechnology-related proteins. The results showed that 75% of the protein sequences contained no negative cis-regulatory elements at all, and the remaining 25% contained at most 2 such elements. In contrast, the best-performing commercial algorithm, GenScript, still produced constructs containing 3 to 6 negative cis-regulatory elements for approximately 15% of the proteins, as shown in Figure C below:

Comparison of the RNA stability of different constructs based on the predicted free energy of RNA structure (Pichia-CLM vs. commercial algorithms)

In summary, these results indicate that Pichia-CLM can not only generate constructs for high-yield protein production but also learn key genetic sequence features and achieve a balance among multiple factors to design robust coding sequences suitable for host expression.

AI Accelerates the Industrialization Process of Protein Production

In the biopharmaceutical industry, improving protein production efficiency has always been a key factor determining the success of R&D translation and commercialization. From monoclonal antibodies to recombinant vaccines, and to various fusion proteins and enzyme preparations, market demand continues to grow, and requirements for yield, stability, and consistency are also rising.

To achieve this goal, the industry has formed a multi-level optimization system: at the host level, in addition to traditional Escherichia coli and Saccharomyces cerevisiae, Pichia pastoris and mammalian cells have become mainstream production platforms due to their better post-translational modification ability and expression efficiency; at the molecular design level, in addition to codon optimization, it also includes promoter strength regulation, signal peptide screening, mRNA structure engineering, and optimization of protein folding and secretion pathways; at the process level, high-density fermentation, optimization of feeding strategies, and control of bioreactor parameters also play a decisive role in the final yield.

Outside this system, a "cell-free" technology path, namely cell-free protein synthesis (CFPS), is rapidly emerging. This technology bypasses the cell growth process and directly uses the transcription and translation system in cell lysates to achieve rapid protein expression. It has been widely used in the development and production of antibodies, enzymes, and even antibody-drug conjugates. However, the CFPS system itself is a highly complex multi-variable system involving dozens of components such as DNA templates, enzyme systems, energy donors, amino acids, and ionic environments. Its combinatorial space is extremely large, and traditional optimization methods based on experience often struggle to achieve an ideal balance between cost and yield.

In this context, AI-driven automated optimization shows disruptive potential. Recently, OpenAI collaborated with leading synthetic biology company Ginkgo Bioworks to release significant research results. The "closed-loop automated system" built based