Transformer Co-Author, 20 When He Wrote the Paper, Truly Open-Sources a 218-Billion-Parameter Large Model

Just now, Cohere released the MoE large model Command A+ with 218 billion parameters. It can run on a single B200, supports 48 languages, and comes with native citation capabilities. However, the most astonishing aspect of this release is not the parameter list but the license: Apache 2.0.

It was the famous paper "Attention Is All You Need" that gave birth to all the large models today.

On May 20, Aidan Gomez, a co-author of the paper, announced on X the launch of Cohere's first fully open-source model licensed under Apache 2.0: Command A+.

Gomez is a former Google researcher and is now the co-founder and CEO of Cohere.

Command A+ is the last model in the Command A family and Cohere's first MoE (Mixture of Experts) model. It has 218 billion total parameters, of which 25 billion are active. It combines visual input, reasoning, translation, and AI agent capabilities in a single model.

The minimum deployment configuration: 1 NVIDIA B200 or 2 H100s. License: Apache 2.0.

https://cohere.com/blog/command-a-plus

According to VentureBeat, this is the first truly commercially usable open-source flagship in Cohere's history. Co-founder Nick Frosst called it "the best model we've ever released."

218 billion parameters, but only 25 billion are active each time

218 billion parameters sounds like a behemoth that gobbles up computing power. But at each generation step of Command A+, only 25 billion parameters, about 11% of the total, are actually activated.

This is the essence of the MoE architecture.

In an MoE model, each incoming request is routed only to the few "expert" neural networks best suited to handle it, while the rest remain dormant. This design lets the model retain giant-scale knowledge reserves and reasoning capabilities while its compute and energy consumption at run time stay close to those of a much smaller model.
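The routing idea can be sketched in a few lines. This is a generic top-k router, which in standard MoE designs operates per token; the article does not describe Cohere's actual router, so the function names, the toy experts, and the choice of k = 2 are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts; the others are never evaluated."""
    chosen = route_top_k(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in chosen)
    return output, [i for i, _ in chosen]

# Hypothetical toy experts: each one just scales its input differently.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_forward(10.0, experts, router_scores=[0.1, 2.0, -1.0, 1.5], k=2)
print(f"experts used: {used}, mixed output: {output:.2f}")
```

Only two of the four toy experts run for this input, which is the same reason only a fraction of Command A+'s parameters is active at any step.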

According to VentureBeat, third-party observers estimate that OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7 have parameter counts in the trillions, while Command A+ activates only 25 billion parameters at a time.

Saving computing power with MoE is now a common practice for most leading models. But Cohere has added a second layer of compression on this basis: quantization.

Command A+ offers three versions: BF16, FP8, and the highly compressed W4A4, with W4A4 being the core technology of this release.

Normally, once a reasoning model is compressed, its performance on complex problems declines significantly, which the industry calls the "quantization tax."

Cohere's approach is to compress only the MoE experts to 4-bit, keep the key attention pathways at full precision, and then layer on a technique called Quantization-Aware Distillation.
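To make "compressing to 4-bit" concrete, here is a minimal sketch of plain symmetric round-to-nearest quantization. It is a textbook illustration, not Cohere's W4A4 scheme or its Quantization-Aware Distillation, neither of which the article details; the function names and sample weights are invented.

```python
def quantize_int4(values):
    """Map floats to integers in [-8, 7] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [q * scale for q in quantized]

# Hypothetical weights for illustration only.
weights = [0.12, -0.5, 0.33, 0.07, -0.21]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"4-bit codes: {q}, largest rounding error: {max_error:.4f}")
```

Each weight now takes 4 bits instead of 16, at the price of a rounding error of up to half the scale. That error is the raw material of the "quantization tax" that more careful schemes try to offset.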

Cohere claims that its W4A4 quantization scheme is nearly lossless. According to the performance data released by Cohere, the W4A4 version can generate 375 tokens per second under low concurrency, with a first-token latency of only 113 milliseconds.

It is precisely this scheme that enables a model with 218 billion parameters to run on a single NVIDIA B200 or two H100s.
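A back-of-the-envelope calculation shows why the bit width matters for fitting on one card. The sketch below counts weight storage only and treats every parameter as stored at the same width, so it ignores the attention layers kept at higher precision, the KV cache, and activations. The GPU memory capacities are assumed nominal figures, not numbers from the article.

```python
TOTAL_PARAMS = 218e9  # total parameter count quoted in the article

def weight_memory_gb(num_params, bits_per_param):
    """Memory needed just to store the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumed nominal HBM capacities (not from the article; usable memory is lower).
ASSUMED_GPU_MEMORY_GB = {"1x B200": 192, "2x H100": 2 * 80}

def fits(required_gb, available_gb):
    """True if the weights alone fit in the given memory budget."""
    return required_gb <= available_gb

for label, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    need = weight_memory_gb(TOTAL_PARAMS, bits)
    verdicts = {gpu: fits(need, cap) for gpu, cap in ASSUMED_GPU_MEMORY_GB.items()}
    print(f"{label}: {need:.0f} GB of weights, fits: {verdicts}")
```

Under these assumptions the weights take 436 GB at BF16, 218 GB at FP8, and 109 GB at 4 bits, so only the 4-bit case leaves room on the minimum configurations the article cites.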

A comparison of the speed and latency between Command A+ and its predecessor Command A Reasoning under different concurrency and quantization levels. TOPS is the number of tokens generated per second, and TTFT is the first-token latency. The data is released by Cohere.

The so-called "single-card operation" here refers to a data-center-class Blackwell B200, not a consumer-grade graphics card.

In the past, a model with hundreds of billions of parameters required an entire GPU cluster, but now it can be handled by a single machine.

This is the story Cohere wants to tell this time: large parameters no longer mean burning money.

Apache 2.0, a truly open-source license

If we look only at parameters and speed, Command A+ is a powerful engineering upgrade. But what deserves developers' attention even more is the Apache 2.0 license.

In today's AI circle, the word "open-source" has long been abused.

Many leading AI companies release model weights but attach restrictive commercial terms: large enterprises are not allowed to use them for commercial purposes, nor are they allowed to train competing models with them. You can download and research, but if you want to make money, you have to buy a license.

Cohere itself wavered on this for a long time.

According to VentureBeat, its earlier Command R and Command R+ models used CC-BY-NC 4.0, the Creative Commons Attribution-NonCommercial license. Researchers and developers could download, tinker with, and evaluate them, but commercial use was strictly prohibited.

In other words, it was half open and half restricted. With Command A+, the restricted half has been opened as well.

It uses Apache 2.0, a truly open-source license recognized by the OSI. From independent developers to Fortune 500 companies, anyone can use, modify, distribute, and commercialize this model without paying a license fee or being bound by non-compete clauses.

This is the first time Cohere has done this. Led by a co-author of the Transformer paper, it has fully embraced true open source.

According to VentureBeat, this decision was strongly pushed by co-founder Nick Frosst.

Frosst is one of Cohere's three co-founders. He was a researcher at Google Brain's Toronto lab and was one of the earliest employees there under AI godfather Geoffrey Hinton.

Cohere's shift of its flagship model from CC-BY-NC 4.0 to Apache 2.0 means that enterprises no longer have to be locked in to a vendor.

A company can download the weights of Command A+, fine-tune it on its own highly confidential internal data, and deploy it on a private server or even an air-gapped network, without depending on Cohere's infrastructure, pricing changes, or API stability.

Command A+ makes "traceability" a native ability of the model

Being able to run and being willing to use are two completely different things.

For a model to truly enter the production environments of finance, healthcare, and law, the real bottleneck is not the model's capabilities but its trustworthiness.

Command A+ addresses this at the model level with native citation generation.

When Command A+ retrieves information from external tools, it not only synthesizes the answer but also generates so-called "grounding spans."

By embedding special tags in the output, the model directly links each factual statement it makes to the specific document or database record it references.

For example, if you ask it to generate a daily sales report, while giving the total sales amount, it will clearly mark the result of the database query that provided this figure. The source is clear at a glance, and the risk of hallucination is minimized.
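The sales-report example can be sketched as follows. The article does not show the actual tag syntax Command A+ emits, so the `<cite doc="...">` format, the document IDs, and the sample answer below are all hypothetical; the point is only how tagged spans let an application trace each figure back to a source and flag anything it cannot trace.

```python
import re

# Hypothetical tag format, invented for this sketch.
CITE_PATTERN = re.compile(r'<cite doc="([^"]+)">(.*?)</cite>', re.DOTALL)

def extract_grounding_spans(text):
    """Return every cited span with the ID of the source it points to."""
    return [{"doc_id": m.group(1), "text": m.group(2)}
            for m in CITE_PATTERN.finditer(text)]

def strip_tags(text):
    """Produce the plain answer a reader would see."""
    return CITE_PATTERN.sub(lambda m: m.group(2), text)

def unsupported_spans(spans, known_doc_ids):
    """Spans whose source is not among the tool results we actually have."""
    return [s for s in spans if s["doc_id"] not in known_doc_ids]

# Invented model output and tool-result IDs for illustration.
answer = ('Total sales yesterday were <cite doc="sql_result_1">$48,200</cite> '
          'across <cite doc="sql_result_2">312 orders</cite>.')
known_doc_ids = {"sql_result_1", "sql_result_2"}

spans = extract_grounding_spans(answer)
plain = strip_tags(answer)
missing = unsupported_spans(spans, known_doc_ids)
print(plain)
print(spans, missing)
```

A compliance check could reject or flag any answer where `missing` is non-empty, which is the kind of audit trail regulated industries tend to ask for.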

This traceability is particularly important for industries under strict regulation.

Agent capabilities are another focus of this release.

Command A+ supports conversational tool calls under the standard chat template and can seamlessly connect to internal APIs, search engines, or SQL databases.

It is also fully multimodal, natively handling text and images within a 128K input context, making it suitable for analyzing scanned invoices, charts, and technical manuals.

A comparison of the multimodal capabilities between Command A+ and Command A Vision. Command A+ is Cohere's first multimodal reasoning model. The data is released by Cohere.

According to the performance data released by Cohere, on τ²-Bench Telecom, which tests complex reasoning, Command A+ rose from 37% in the previous generation to 85%; on Terminal-Bench Hard, which measures agentic coding ability, it climbed from 3% to 25%; on the AIME 25 math test, it rose from 57% to 90%.

A comparison of the performance between Command A+ and its predecessor Command A Reasoning on five open-source benchmarks. The data is released by Cohere.

All of these figures are cited by VentureBeat from Cohere's own release, not from independent third-party evaluations.

VentureBeat believes that with 25 billion active parameters, Command A+ can compete with much larger models in pure reasoning and mathematics. However, in deep agentic coding and breadth of general intelligence, it currently lags behind leading Chinese open-source models such as DeepSeek.

More important than the score is that Command A+ makes "traceability" a native ability of the model.

The author of Transformer teams up with a disciple of Hinton to make Cohere truly open-source

Finally, let's talk about the two people behind Command A+.

https://arxiv.org/pdf/1706.03762

In 2017, the Transformer paper "Attention Is All You Need" was born at Google. Among the eight authors, the youngest, Aidan Gomez, was only 20 years old at the time. He was an intern at Google Brain and was an undergraduate majoring in computer science and mathematics at the University of Toronto.

Aidan Gomez

According to TIME, in order to meet the deadline for an important AI conference, he and his colleagues even slept in the office. Later, he admitted to TIME that no one could have predicted at that time that this paper would bring the entire AI industry to where it is today.

Gomez is good at turning underlying architectures into practical products. In 2017, he also launched FOR.ai, a collaborative project that allows researchers to share machine-learning knowledge, which later evolved into Cohere For AI.

In 2019, he left Google Brain and founded Cohere in Toronto with Ivan Zhang and Nick Frosst. The three of them chose a different path from OpenAI: instead of creating chatbots for the general public, they only develop models for enterprises.

Nick Frosst

Frosst is a co-founder of Cohere. He was a researcher at Google Brain's Toronto lab under AI godfather Geoffrey Hinton and was one of the earliest employees there. The industry often regards him as Hinton's favorite disciple. His research focuses on capsule networks and model interpretability.

One co-wrote the Transformer paper, and the other is a disciple of Hinton. From the very beginning, Cohere had the gene of "turning cutting-edge research into products that enterprises can use."

With Command A+, at Frosst's strong urging, Gomez made the call, and Cohere fully switched the license of its flagship model to Apache 2.0.

According to Cohere's official statement, Command A+ is the last model in the Command A family, which often means that the next family is on the way.

For a long time, data privacy and cost control have left enterprises stuck: to use cutting-edge AI, they had to rely on centralized large-scale computing clusters.

This time, Command A+ combines cutting-edge reasoning, robust agent tool calls, multimodal capabilities, and an architecture designed specifically for hardware efficiency. Together, these are rewriting the cost equation for enterprises adopting AI.

First, the deployment threshold has been lowered. In the past, a model with hundreds of billions of parameters required an entire GPU cluster, but now a minimum of 1 B200 or 2 H100s is sufficient.

Second, the inference cost has also been reduced. The output speed of the W4A4 version is up to 63% higher than that of the previous generation Command A Reasoning, and the latency is reduced by 17%. Computing power time is money. As the speed increases, the unit cost decreases.

Third, the cost of multilingual use has also been reduced. The new tokenizer uses fewer tokens for non-European languages: 20% fewer for Arabic, 18% fewer for Japanese, and 16% fewer for Korean. Since inference is billed by the token, fewer tokens mean a smaller bill for cross-border and multilingual deployments.
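The second and third points can be combined into a rough estimate. The sketch below assumes that cost scales with GPU time, that the "up to 63%" speedup applies in full, and that the tokenizer savings are measured against the same baseline. None of this is confirmed by the article, so the results are best-case figures.

```python
def relative_cost(speedup, token_saving):
    """Cost of a request relative to the baseline (1.0 means unchanged).

    speedup: fractional gain in tokens per second (0.63 means 63% faster)
    token_saving: fraction of tokens saved by the tokenizer (0.20 means 20% fewer)
    """
    time_per_token = 1.0 / (1.0 + speedup)
    tokens_needed = 1.0 - token_saving
    return time_per_token * tokens_needed

SPEEDUP = 0.63  # best-case figure quoted for the W4A4 version
TOKEN_SAVINGS = {"Arabic": 0.20, "Japanese": 0.18, "Korean": 0.16}

for language, saving in TOKEN_SAVINGS.items():
    cost = relative_cost(SPEEDUP, saving)
    print(f"{language}: about {cost:.2f}x the baseline cost")
```

Under these assumptions, the speedup alone cuts GPU time per token to roughly 61% of the baseline, and with the tokenizer savings the same text in Arabic, Japanese, or Korean comes out at about half the baseline cost.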

Recently, Cohere also announced a merger with German AI company Aleph Alpha. The two companies have the same direction: instead of betting on chatbots, they develop AI that can be installed in the data centers of governments and large enterprises.

The competition in the open-source large-model field has entered the second half. In the first half, the competition was about parameter scale. In the second half, it's about something else: who can enable enterprises to truly move the models into their own data centers.

References:

https://cohere.com/blog/command-a-plus

https://venturebeat.com/technology/

