Qwen can claim another credit: the world's fastest open-source large model is here, and it can generate more than 2,000 tokens per second!
Although it has only 32 billion parameters (32B), its throughput is more than 10 times that of a typical GPU deployment.
It is K2 Think, jointly launched by the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in the UAE and the AI startup G42.
Does the name sound a bit familiar?
Yes, its name happens to echo Kimi K2, recently launched by Moonshot AI, but the UAE version adds a "Think".
But interestingly, there is indeed a "made in China" flavor behind K2 Think.
According to the Model tree on HuggingFace, K2 Think is built based on Qwen 2.5-32B:
Moreover, beyond calling it the "world's fastest open-source AI model", MBZUAI officially bills K2 Think as the "most advanced open-source AI reasoning system ever".
So, what's its real strength? Let's continue to find out.
The measured speed exceeds 2000 tokens per second
Currently, K2 Think is available to try online (link at the end of this article).
Let's first test it with an IMO question:
Let a_n = 6^n + 8^n. Determine the remainder when dividing a_{83} by 49.
Even without any playback acceleration, you can see with the naked eye that K2 Think finishes thinking and outputs the answer in a flash.
According to the speed shown at the bottom, it has reached 2730.4 tokens per second.
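For reference, the expected answer to this warm-up can be checked in a couple of lines of Python (our own sanity check, not part of K2 Think's output):

```python
# Sanity-check the question above: a_n = 6**n + 8**n, remainder of a_83 mod 49.
# Python's three-argument pow(base, exp, mod) does fast modular exponentiation.
remainder = (pow(6, 83, 49) + pow(8, 83, 49)) % 49
print(remainder)  # 35

# By hand: phi(49) = 42 and 83 = 2*42 - 1, so 6^83 ≡ 6^(-1) ≡ 41 and
# 8^83 ≡ 8^(-1) ≡ 43 (mod 49); 41 + 43 = 84 ≡ 35 (mod 49).
```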
Next, let's test it with a classic question in Chinese:
How many letter 'R's are there in the word "Strawberry"?
The speed still remains at 2224.7 tokens per second, and it gives the correct answer: 3 'R's.
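This one is trivial to verify programmatically:

```python
# Count occurrences of the letter 'r' (case-insensitive) in "Strawberry".
word = "Strawberry"
r_count = word.lower().count("r")
print(r_count)  # 3
```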
Let's test it with several AIME 2025 math questions:
Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.
Find the number of ordered pairs $(x,y)$, where both $x$ and $y$ are integers between $-100$ and $100$, inclusive, such that $12x^{2}-xy-6y^{2}=0$.
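For the record, both problems can be brute-forced in a few lines of Python, which is a handy way to confirm the model's answers (this check is ours, not K2 Think's):

```python
# Problem 1: bases b > 9 with 17_b | 97_b, i.e. (b + 7) divides (9b + 7).
# Since 9b + 7 = 9(b + 7) - 56, any valid b satisfies (b + 7) | 56, so
# b <= 49 and scanning a small range is enough.
bases = [b for b in range(10, 200) if (9 * b + 7) % (b + 7) == 0]
print(sum(bases))  # 70  (b = 21 and b = 49)

# Problem 2: ordered integer pairs (x, y) in [-100, 100] with
# 12x^2 - xy - 6y^2 = 0; the quadratic factors as (3x + 2y)(4x - 3y).
pairs = sum(
    1
    for x in range(-100, 101)
    for y in range(-100, 101)
    if 12 * x * x - x * y - 6 * y * y == 0
)
print(pairs)  # 117
```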
Across all of these tests, K2 Think's most striking trait is that it sustains over 2,000 tokens per second on every question, and in our runs every answer it produced was correct.
However, on the feature side, K2 Think currently supports neither document upload nor multi-modal input.
But Taylor W. Killian, a senior researcher at MBZUAI, also gave an explanation on X:
This model is mainly developed for mathematical reasoning.
The technical report has also been released
In terms of scale, K2 Think has only 32B parameters, yet the team claims it already matches the performance of flagship reasoning models from OpenAI and DeepSeek.
According to the test results, K2 Think posts strong scores on multiple math benchmarks: 90.83 on AIME'24, 81.24 on AIME'25, 73.75 on HMMT25, and 60.73 on Omni-MATH-HARD.
Moreover, the K2 Think team has released a technical report:
Overall, the K2 Think team has achieved technological innovation in six aspects:
Supervised fine-tuning (SFT) for long chain-of-thought: carefully designed chain-of-reasoning data trains the model to think step by step rather than jump straight to an answer, making it more methodical on complex problems.
Reinforcement learning with verifiable rewards (RLVR): instead of relying on human preference scores, the correctness of answers serves as the reward signal, significantly improving performance in mathematics, logic, and similar domains.
Plan-before-you-think agentic planning: a planning agent first extracts the key points of the problem and drafts a solution outline, then hands it to the model for detailed reasoning, much like a human outlining before solving.
Test-time scaling (Best-of-N sampling): generate multiple answers to the same problem and select the best one to improve accuracy.
Speculative decoding: generate and verify tokens in parallel during inference to cut redundant computation and speed up output.
Hardware acceleration (Cerebras WSE wafer-scale engine): running on the world's largest single-chip compute platform, a single request can exceed 2,000 tokens per second, keeping even long chain-of-thought interactions smooth.
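To make two of these ideas concrete, here is a minimal toy sketch (our own illustration, not K2 Think's actual pipeline) of Best-of-N selection driven by a verifiable reward; the `checker` predicate and the hard-coded "samples" are hypothetical stand-ins for a real verifier and real model outputs:

```python
# Toy sketch of Best-of-N with a verifiable reward: score each candidate
# answer with an automatic correctness check (the RLVR idea -- no human
# preference scores), then keep the highest-scoring candidate.

def verifiable_reward(candidate: int, checker) -> float:
    """Reward is 1.0 iff the checker confirms the answer, else 0.0."""
    return 1.0 if checker(candidate) else 0.0

def best_of_n(candidates, checker):
    """Return the candidate with the highest verifiable reward."""
    return max(candidates, key=lambda c: verifiable_reward(c, checker))

# Toy task from earlier in the article: the remainder of 6^83 + 8^83 mod 49.
checker = lambda ans: (pow(6, 83, 49) + pow(8, 83, 49)) % 49 == ans
samples = [34, 35, 36, 35]          # pretend these were sampled from a model
print(best_of_n(samples, checker))  # 35
```

In a real system the checker would be an executable verifier (e.g. evaluating the math or running unit tests) rather than comparison against a known answer.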
Meanwhile, the research team has run systematic safety tests on K2 Think, covering refusal of harmful requests, robustness in multi-turn conversations, and resistance to information leakage and jailbreak attacks, and reports a relatively high overall score.
Experience address: https://www.k2think.ai/
Technical report: https://k2think-about.pages.dev/assets/tech-report/K2-Think_Tech-Report.pdf
Reference links:
[1]https://www.k2think.ai/k2think
[2]https://x.com/mbzuai/status/1965386234559086943
[3]https://huggingface.co/LLM360/K2-Think
[4]https://venturebeat.com/ai/k2-think-arrives-from-uae-as-worlds-fastest-open-source-ai-model
[5]https://www.youtube.com/watch?v=8C6_B1QeyBo
This article is from the WeChat official account "QbitAI" (ID: QbitAI), author: Jin Lei. Republished by 36Kr with permission.