
The waters in the AI circle are too deep: OpenAI keeps secrets, Meta cheats, yet domestic Mixture of Experts (MoE) models are emerging as a new force.

新智元 · 2025-07-17 11:02
The parameters of large models have skyrocketed to the trillion level, and the innovation of the MoE architecture has triggered open-source competition.

From GPT-2 to Llama 4, how much has the size of large models "grown" in recent years? From dense models with tens of billions of parameters to sparse Mixture of Experts (MoE) architectures, from the hegemony of closed-source models to the counterattack of open-source models: Meta, OpenAI, Mistral, DeepSeek... in this era of competition, who will emerge victorious?

From traditional dense architectures to today's popular sparse Mixture of Experts (MoE) models, the development of large language models has advanced by leaps and bounds:

Initially, the number of parameters was only in the tens of billions. Now, even the activated parameters alone have reached hundreds of billions!

From tens of billions to trillions: behind this parameter inflation lies the AI community's "faith" in the Scaling Law.

Since the release of GPT-2 in 2019, large language models (LLMs) have continuously achieved leaps in parameter scale, training data volume, and model architecture.

How big are large models exactly? From 2019 to now, what kind of "weight gain" have large models experienced?

The GitHub user rain-1 manually summarized the trends of foundation models, "without any AI-generated content". He also said:

In recent years, the development of language models has been magnificent and far-reaching.

What is recorded here is just a tiny fragment, a glimpse of the leopard through a bamboo tube that hints at the whole picture.

This article aims to objectively present the scale of large language models. It does not disclose confidential information or repeat rumors, and it focuses only on foundation models (i.e., raw text-continuation engines, not chatbots).

The number of parameters in AI models shows exponential growth

The Journey of Large Models: The GPT Series, OpenAI Becomes "CloseAI"

The GPT series' development falls into two major stages: an early stage of dense models, and a middle stage of transformation and secrecy.

Early dense models (2019-2020):

GPT-2 family: parameters range from 137M to 1.61B, with about 10B tokens of training data.

GPT-3 (175B): the first true "large model".

Middle stage of transformation and secrecy (2022-2023):

GPT-3.5 and GPT-4: neither parameter counts nor data scale have been announced; the information is kept highly confidential.

Specifically, for GPT-2 (2019), the parameter scale was:

GPT-2-small: 137 million parameters

GPT-2-medium: 380 million parameters

GPT-2-large: 812 million parameters

GPT-2-xl: 1.61 billion parameters

The training data was based on the unpublished WebText dataset, about 40GB of Internet text, estimated at roughly 10 billion tokens.
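The roughly 10-billion-token figure is consistent with the common rule of thumb of about 4 bytes of English text per token; the quick estimate below is purely illustrative, not an official OpenAI number.

```python
# Rough token estimate for ~40GB of English web text, assuming ~4 bytes/token
# (a common rule of thumb for BPE tokenizers; an assumption, not an official figure).
corpus_bytes = 40e9      # ~40 GB of raw text
bytes_per_token = 4      # typical for English with a BPE vocabulary
print(f"~{corpus_bytes / bytes_per_token / 1e9:.0f}B tokens")  # -> ~10B tokens
```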

In 2020, OpenAI released GPT-3, codenamed davinci/davinci-002, with a parameter scale of 175 billion (175.0B).

Link: https://www.lesswrong.com/posts/3duR8CrvcHywrnhLo/how-does-gpt-3-spend-its-175b-parameters

Training data is about 400 billion tokens, sourced from CommonCrawl, WebText2, Books1, Books2, and Wikipedia.

For specific information on the data sources, refer to the following paper.

Paper link: https://arxiv.org/abs/2005.14165

The training of GPT-3 took several months and drew on the computing power of a data center with roughly ten thousand V100 GPUs.

From 2022 to 2023, OpenAI did not publicly disclose information such as the architectural details and training data scale of GPT-3.5 and GPT-4.

After that, OpenAI became a highly secretive "black box". Meanwhile, open-source models, especially the LLaMA family, rose to prominence:

From 7B to 65B, where the 65B model was trained with 1.4T tokens;

Llama 3.1 reached 405B parameters and 3.67T tokens of training data, marking a turning point for the open-source field.

The Journey of Large Models: The Llama Series

The initial versions of Llama have 7B, 13B, 33B, and 65B parameters.

In terms of training data, Meta officially confirmed the use of the Books3 dataset. The 65B version was pre-trained on a dataset of 1.4 trillion (1.4T) tokens.

In 2024, Meta open-sourced Llama-3.1 405B, with a staggering parameter scale of 405 billion, using a dense Transformer architecture (i.e., all parameters participate in computation during inference).

In terms of training data, Meta did not disclose the sources in detail, describing them only vaguely as "mixed data from multiple knowledge sources". In total, training consumed about 3.67 trillion tokens:

Initial pre-training: 2.87 trillion tokens

Long-context training: 800 billion tokens

Annealing training: 40 million tokens

Paper link: https://arxiv.org/abs/2407.21783

They also made a key discovery:

Experiments show that, in core benchmark tests, annealing training on small-scale, high-quality code and mathematical data can significantly improve the performance of pre-trained models.

However, the user rain-1 expressed regret about the now-popular trend of "benchmax" annealing pre-training (annealing aimed squarely at benchmark scores):

It makes the foundation language model gradually drift away from its original purpose: being a pure text continuation engine.

This kind of optimization really belongs in the post-training stage (the process that turns the model into an "AI chat assistant"), but companies clearly care more about short-term gains in benchmark scores.

In 2025, Meta launched the Llama 4 series. Its 2-trillion-parameter behemoth, "Behemoth", may never see the light of day.

The flagship model of the Llama 4 series, Behemoth, has a total of roughly 2 trillion parameters.

The Maverick and Scout models of Llama 4 were distilled from this large model. However, a scandal erupted around these lightweight versions:

Meta (formerly Facebook) was exposed for "cheating" on the LMArena benchmark platform:

They uploaded a specially "customized version" of Llama 4 Maverick for the benchmark, but publicly released a different version.

This was widely regarded as academic misconduct and seriously undermined trust in the Llama team. Since then, the team appears to have rapidly fallen apart, and it remains unclear whether the 2T model will ever be released.

As for the smaller Llama 4 models that were released, although they are claimed to "inherit the essence of the large model", the general verdict so far is that they are not very intelligent or useful.

The Era of the Large-Model Wilderness

For a time, the AI community was stuck in a "large-model wilderness": no other model could match GPT-3.

People could only repeatedly fine-tune small models like LLaMA, chasing the long shadow cast by GPT-3.

But this approach of "using AI to train AI" also led to a vicious cycle in model performance.

The release of the Llama 405B model was a turning point. Before that, Mistral released two Mixture of Experts models:

In December 2023, it launched Mixtral 8x7B (a Mixture of Experts model).

In April 2024, it upgraded and released Mixtral-8x22B (a sparse Mixture of Experts model with 141B total parameters, of which 39B are actually activated).

Although Mixtral-8x22B is not a dense model like GPT-3, its total parameter count is of the same order of magnitude as GPT-3's (175B).

The revolutionary aspect of the Mixture of Experts (MoE) architecture is that it allows ordinary researchers to train and use ultra-large-scale models without needing computing clusters made up of thousands of GPUs.

Starting at the end of 2023, the sparse MoE architecture rose to prominence, and DeepSeek V3 and others soon followed.

While their total parameter counts far exceed that of GPT-3, the activated parameters of MoE models stay at the level of tens of billions, which keeps inference costs down.
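To make the "total versus activated parameters" distinction concrete, here is a minimal sketch of a top-k-routed MoE layer, assuming PyTorch; the layer sizes, expert count, and top_k value are illustrative choices, not the configuration of any model discussed above.

```python
# Minimal sketch of sparse MoE routing (top-k gating). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating network: scores each token against every expert.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                          # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize the kept scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Example: each token is routed to 2 of 8 experts, so only ~1/4 of the expert
# parameters are touched per token even though all 8 experts exist in memory.
layer = SparseMoELayer()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)  # torch.Size([16, 512])
```

Because each token only visits top_k of the n_experts feed-forward blocks, the per-token compute scales with the activated parameters rather than the total, which is exactly why models with hundreds of billions of total parameters can be served at a tens-of-billions cost.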

These LLMs support multiple languages and modalities and adopt larger context windows (32K-256K tokens). Some newer models also use "annealing" training to improve performance on specific benchmark tests.

The MoE Craze Hits: In the Competition, Who Will Prevail?

On the day after Christmas in 2024, DeepSeek released a stunning piece of work: V3 Base. The official website describes it as follows:

🎉 New Features of V3

🧠 671 billion MoE parameters

🚀 37 billion activated parameters

📚 Trained on 14.8 trillion high - quality tokens
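Putting these numbers next to the Mixtral figures quoted earlier makes the sparsity concrete; the snippet below is just illustrative arithmetic on the parameter counts cited in this article.

```python
# Activated vs. total parameters for the MoE models cited above (figures from the text).
models = {
    "Mixtral-8x22B":    (141e9, 39e9),   # (total, activated)
    "DeepSeek V3 Base": (671e9, 37e9),
}
for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
# -> Mixtral-8x22B: 27.7% of parameters active per token
# -> DeepSeek V3 Base: 5.5% of parameters active per token
```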