Yang Zhenyuan: In 2021, the ByteDance team trained a large language model, but at that time, they "lacked foresight."
On November 24th, at the 5th ByteDance Scholarship Award Ceremony, Yang Zhenyuan, the technical vice president of ByteDance, shared the company's technological exploration journey.
According to him, in 2014, ByteDance founder Zhang Yiming approached him and said he wanted to build a recommendation system on top of a large-scale machine learning system, to solve recommendation problems across media forms including images, text, and video. Attracted by this idea, Yang Zhenyuan joined ByteDance, which was still a small company at the time.
Yang Zhenyuan mentioned that before ChatGPT emerged on November 30, 2022, the ByteDance team had a chance to pay early attention to large language models: in 2021, a colleague at ByteDance had already trained a large language model, but the team concluded that "this large language model had no practical value for the time being."
"So, we really lacked foresight," Yang Zhenyuan said.
Fortunately, the company adjusted quickly. Since 2022, it has been investing in this area and has achieved some results. "In terms of applications, you may be more familiar with them. Doubao is the most popular AI chat assistant in China, and the large-model services of Volcengine are also recognized by customers. According to an IDC report, Volcengine ranks first in China's MaaS market."
Yang Zhenyuan, the technical vice president of ByteDance
The following is the full text of Yang Zhenyuan's sharing:
Hello, everyone. I'm very glad to meet you at the ByteDance Technology Scholarship event. I'm a technology enthusiast myself. I joined ByteDance in 2014, initially responsible for building a new recommendation system, and it has been almost 12 years now. Over the years, I've been involved in many of ByteDance's technological explorations.
When it comes to ByteDance, most people are more familiar with our products, such as Douyin, Toutiao, and TikTok.
My perspective may be more technical. Today, from that perspective, I'd like to share some technological stories you may be less familiar with.
2014, Large-scale Machine Learning and Recommendation Systems
The first version was planned to reach a feature scale of trillions (T)
Initially, founder Zhang Yiming approached me and said he wanted to use a large-scale machine learning system to build a recommendation system, solving the recommendation problems of various media forms, including images, text, and video. His idea was very appealing to me.
In 2014, the largest-scale machine learning system in the industry was the large-scale discrete LR (logistic regression) already in mature use in search advertising. Applying the same approach to recommendation was quite challenging. At the time, few people were familiar with both large-scale software/hardware engineering and machine learning. Moreover, outside search advertising, which could generate a lot of revenue, few fields were willing to pay such high hardware costs for computation.
We set a very aggressive goal for the first version: to reach a feature scale of trillions (T) in 2014.
There were many challenges here, such as modeling the system and handling the recommendation's optimization objectives; on the engineering side, storage and computation were the earliest hurdles. We also needed to optimize the algorithms. The challenges of modeling objectives and handling storage have been shared before; today, let's talk about optimizing the algorithms.
Image source: the company
Optimizing LR is a mature technology, but the efficiency and effectiveness of different methods vary greatly, especially at ultra-large scale. Many students today may not know the optimizers of that era. Today, SGD-based methods are the mainstream, but in 2014, for very large-scale sparse logistic regression, that was not the case. At the time, CD (coordinate descent)-based methods were used more, and the optimizer used in Baidu's search advertising was OWL-QN.
We had only 5 people at the time, some of whom also had to work on engineering. We prepared two candidate optimizers: (1) SGD-FTRL; (2) CDN (Coordinate Descent Newton). We assigned people to each and ran the two lines of research in parallel.
We believed at the time that the CDN optimizer project had potential, and its initial progress was good. However, the first launch showed it didn't work well, so we kept improving it; a group of people worked on it for two years. It wasn't until SGD-based methods found broader application that we finally stopped the project. The students on the CDN team later switched to other directions in machine learning and took charge of very important business in the company. Although the project did not succeed, the company still recognized their exploration.
FTRL is mentioned less often now. It can be seen as SGD with L1 regularization, built on cumulative gradients with AdaGrad-style per-coordinate step sizes. This project progressed quickly: we launched it within a few months and achieved the goal of sparsifying trillions of features. Moreover, the framework was very flexible.
By the end of 2014, we gradually introduced FM-like algorithms, which later evolved into a more general deep learning system. And from the day we launched it, it was a streaming training system.
To this day, we find that relatively shallow neural networks with streaming updates (training only) still perform well in recommendation. This may be related to test-time training, and may be a rough approximation of an RNN.
2020, Exploration of Scientific Computing
Solving the Schrödinger equation can simulate most phenomena in the world.
From the end of 2019 to 2020, we had a discussion about how AI could develop in the future and how it could play a more important role in society.
Our thinking at the time was that only a large amount of valuable data can produce valuable models and algorithms. In the online world, recommendation, search, and advertising are the mainstream applications. So what other scenarios can generate a lot of valuable data? Obviously, the real world. However, collecting and applying real-world data is more complex, involving fields such as self-driving cars and robots. Besides the real world, we also thought of scientific computing.
Although our world is complex, the underlying physical laws are very simple. From the perspective of quantum mechanics, if we had a machine with unlimited computing power today, we could indeed solve most of the phenomena in the current world from the Schrödinger equation (without considering gravity). A large number of simulations would generate valuable data to guide the progress of machine learning. Better results, in turn, could improve the simulations.
This picture, showing the classification of scientific computing at different scales, was shared by Academician Weinan E, our consultant at the time; I'll include it here.
Image source: the company
As you can see, the horizontal axis represents spatial scale and the vertical axis time scale, and the picture lays out problems in physics and scientific computing. At the bottom-left corner is first-principles calculation, which includes methods such as CCSD and QMC and requires computing the many-electron wave function. Moving up is the approximate DFT (Density Functional Theory). Further up, instead of describing the wave function, particles are used as the abstraction: molecular dynamics, MD. Above that, the abstraction becomes clusters of particles; at the top are higher-level abstractions such as fluid mechanics and finite element methods.
What's the value of machine learning here? L1, L2, L3, and L4 in the picture indicate that, for the problems at each of these scales, machine learning methods can solve them better. For example, in quantum chemistry calculation at the bottom, neural networks can be used to fit the many-electron wave function. Although these physical laws are very simple to state, they are extremely complex to compute, so machine learning can play a very important role.
First-principles Calculation
We started to invest continuously in this direction in 2020. Here is a picture provided by a colleague, showing some of our work in this area.
Image source: the company
The horizontal axis in the picture represents time. The early representative work in this field was DeepMind's FermiNet; in 2019, several of us discussed this work in a meeting room. The field is called NNQMC (Neural Network Quantum Monte Carlo). What does it mean? QMC is Quantum Monte Carlo. By the variational principle, the system energy computed from any trial wave function is always greater than or equal to the true ground-state energy. So we can use a neural network to represent a wave function, sample from that wave function and compute the system energy, then update the network along the gradient in the direction of lower energy, finally obtaining a better representation of the wave function.
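The variational loop just described can be illustrated with a deliberately tiny example: instead of a neural network and a many-electron system, take a one-parameter Gaussian trial wave function for the 1D harmonic oscillator (units with ħ = m = ω = 1, exact ground state α = 1, energy 0.5). The structure is the same as NNQMC, just miniaturized: sample from |ψ|², compute local energies, and follow the energy gradient downhill. All names and hyperparameters here are illustrative.

```python
import numpy as np

def vmc_harmonic(alpha=0.3, lr=0.2, n_samples=20000, steps=60, seed=0):
    """Toy variational Monte Carlo for the 1D harmonic oscillator.

    Trial wave function psi(x) = exp(-alpha * x^2 / 2). Its local
    energy is E_L(x) = alpha/2 + x^2 * (1 - alpha^2) / 2, and the
    energy gradient uses the standard VMC estimator
    dE/dalpha = 2 * < (E_L - <E_L>) * d(log psi)/d(alpha) >.
    """
    rng = np.random.default_rng(seed)
    energy = None
    for _ in range(steps):
        # |psi|^2 is a Gaussian with variance 1/(2*alpha), so we can
        # sample it directly; a real NNQMC code would use Markov-chain
        # Monte Carlo for the many-electron distribution.
        x = rng.normal(0.0, np.sqrt(1.0 / (2.0 * alpha)), size=n_samples)
        e_loc = alpha / 2.0 + x**2 * (1.0 - alpha**2) / 2.0
        energy = e_loc.mean()
        dlogpsi = -x**2 / 2.0                  # d(log psi)/d(alpha)
        grad = 2.0 * np.mean((e_loc - energy) * dlogpsi)
        alpha -= lr * grad                     # move toward lower energy
    return alpha, energy
```

Run as written, the parameter converges to α ≈ 1 with energy near 0.5, the known ground state. Note the estimator's variance vanishes as the trial function approaches the true ground state (the local energy becomes constant), a property real NNQMC methods also exploit.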
The pink part represents several of our works after 2021. We have basically reached the forefront in the industry.
The vertical axis in this picture represents the simulation accuracy, which is the degree of closeness to physical experiments. The closer the simulation is to reality, the better the application prospects. The size of the circle indicates the number of electrons in the simulation system. The larger the circle, the greater its practical value.
In the top-right corner is Scaling Laws with LAVA, our latest achievement. We found that this problem exhibits a scaling law, just like large models: with more parameters, simulation accuracy keeps increasing. This is a good sign, indicating we may still have great potential for breakthroughs in practicality.
In terms of the scope of systems handled, we proposed DeepSolid, the first NNQMC method applicable to solid systems. We also conducted a series of studies on two-dimensional twisted materials. One of this year's key tasks is to apply NNQMC to the study of topological insulators.
Topological insulators have special electrical properties: when a voltage is applied, no current flows in the interior of the device, only along its edges, so the device generates almost no heat.
This "no heat" property is very attractive, because today's CPUs and GPUs generate a lot of heat and thus waste energy. If topological insulators could really serve as a replacement, perhaps supercomputers could be built from them.
How do we find topological insulators? Applying the method above, we can simulate and calculate a material's properties from its description, greatly improving experimental efficiency. We specifically calculated the two-dimensional material MoTe2 and found that it becomes a topological insulator at a specific density and twist angle θ, consistent with experimental results.
Molecular Dynamics
Image source: the company
We also made many explorations in molecular dynamics. MD sits at the classical-MD position in Professor Weinan E's picture. Our idea was to first improve the forward problem: use higher-precision simulations to provide more accurate labels for the force field of machine-learning MD. DFT (Density Functional Theory) is a reasonable level for this, so we first carried out GPU acceleration for DFT. Our GPU4PySCF achieved the industry state of the art (SOTA) in GPU-accelerated DFT calculation: compared with traditional CPU-based programs, 1 GPU ≈ 500–1000 CPU cores, and the compute cost of completing the same calculation task dropped by an order of magnitude.
With better labels, we can obtain a more accurate force-field model, and then run more accurate MD simulations to make better property predictions.
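As a sketch of what an MD simulation step looks like, here is a minimal velocity-Verlet integrator with a single harmonic bond as the "force field" — a toy stand-in for the learned force fields described above, not any of the systems named here; all parameters are illustrative. In machine-learning MD, the `force` callable would be a neural network predicting forces from atomic positions.

```python
import numpy as np

def velocity_verlet(x, v, force, mass=1.0, dt=0.01, steps=1000):
    """Velocity-Verlet integration, the standard MD time-stepper.

    `force` is any callable mapping positions to per-particle forces;
    swapping in a learned force field leaves this loop unchanged.
    """
    f = force(x)
    for _ in range(steps):
        x = x + v * dt + 0.5 * (f / mass) * dt**2   # position update
        f_new = force(x)
        v = v + 0.5 * (f + f_new) / mass * dt       # velocity update
        f = f_new
    return x, v

def harmonic_bond_force(k=10.0, r0=1.0):
    """Toy 'force field': one harmonic bond between two particles,
    with potential U = 0.5 * k * (r - r0)^2."""
    def force(x):
        d = x[1] - x[0]
        r = np.linalg.norm(d)
        f1 = -k * (r - r0) * d / r       # force on particle 1
        return np.array([-f1, f1])       # Newton's third law
    return force
```

Started slightly stretched past its rest length, the bond oscillates while total energy (kinetic plus harmonic potential) stays nearly constant, which is the basic sanity check for any MD integrator.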
After a lot of forward-problem work, we can in turn train models to directly generate candidate small molecules that may satisfy certain properties. This is the inverse problem, and it is the core problem in several industrial fields (energy, pharmaceuticals). Our team developed two molecular dynamics force fields, Bamboo-MLFF and ByteFF, to accurately predict the properties of molecular and solid systems. Among them, ByteFF-Pol currently achieves SOTA accuracy in zero-shot prediction of electrolyte properties without experimental data.
This work is not limited to our own experiments. This year, we established a joint laboratory with BYD, combining high-throughput automated experiments with scientific computing algorithms to explore industrial applications of AI for Science in battery materials. Currently, GPU-accelerated DFT calculation, force-field + molecular dynamics simulation, and prediction + design models have all been put into practical use by enterprise partners.
2021, PICO: Exploration of XR
Invest more in basic technologies and strive for a significant improvement in core experience
The development of ByteDance is inseparable from innovation and progress in hardware. Large-screen phones and high-definition cameras are the foundation on which products like Douyin and TikTok grew. So, what kind of interactive experience can surpass video in the future?
XR has the potential to bring a brand - new experience. In 2021, ByteDance acquired the Pico team.
After the acquisition, we advanced two product routes simultaneously. One focused on the current product form, investing resources in operating content such as video and livestreams and marketing fairly aggressively. The other invested in basic technologies, striving for a significant improvement in the core experience.
In 2023, we decided to reduce the investment in content and marketing and be more committed to the technology route. This was because the hardware experience of the product was not yet mature at that time and could not support large - scale market applications. This adjustment caused some misunderstandings at that time. Many people said that ByteDance was no longer involved in this direction. In fact, on the contrary, since 2023, we have been investing more in XR technology than before.
Next, I'd like to share some technological explorations in the second route.
Firstly, clarity.
XR needs to simulate the experience of human eyes observing the real world. The key indicator is PPD (Pixels Per Degree): how many pixels fall within one degree of the viewer's visual field. This indicator is strongly related to