The open-source 3B reasoning model can run on mobile phones, is faster than Qwen3-4B, and maintains high speed even with ultra-long contexts.
An Israeli startup has open-sourced a 3B model that outperforms Google's Gemma 3 4B.
According to an October 9 report by Zhidx, Israeli AI startup AI21 Labs open-sourced its lightweight reasoning model Jamba Reasoning 3B the day before. The model has 3 billion parameters, runs on devices ranging from mobile phones to computers, and outperforms leading models in its class such as Qwen3-4B and Gemma 3 4B.
Screenshot of the open-source release of Jamba Reasoning 3B
Hugging Face link: http://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B
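For readers who want to try the model, here is a minimal loading sketch using the standard Hugging Face transformers API. It assumes a recent transformers release with Jamba support and a tokenizer that ships a chat template; the prompt and generation settings are illustrative, not AI21's recommended configuration.

```python
# Minimal sketch: load Jamba Reasoning 3B via Hugging Face transformers.
# Assumes a recent transformers version and the `accelerate` package for
# device_map="auto"; generation settings here are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/AI21-Jamba-Reasoning-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the Jamba architecture in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```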
AI21 says Jamba Reasoning 3B is built on its new hybrid SSM-Transformer architecture, with a 256K-token context window and the ability to process up to 1M tokens. Compared with competing models from DeepSeek, Google, Meta (Llama), and Microsoft, AI21 claims it is 2-5x more efficient and leads on benchmark tests.
Jamba Reasoning 3B outperforms models such as Qwen3-4B on evaluations including Humanity's Last Exam
AI21 summarizes the advantages of Jamba Reasoning 3B in three points:
1. No decline in intelligence: Thanks to its hybrid SSM-Transformer architecture, Jamba Reasoning 3B is more efficient than pure Transformer models.
Most Transformer-based models degrade significantly once the context exceeds 32K tokens, whereas Jamba Reasoning 3B can handle much longer contexts, up to 1 million tokens. This makes it well suited to advanced agent systems and multimodal applications, where long-context understanding is crucial to output quality (a layer-pattern sketch follows the figure below).
The performance of Jamba Reasoning 3B deteriorates little as the context grows
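To make the hybrid idea concrete, here is a toy sketch of how such a stack can interleave a few attention layers among mostly SSM (Mamba-style) layers, which is what keeps per-token cost and cache growth low at long contexts. The 1-in-8 ratio and helper name are assumptions for illustration, not AI21's published layout.

```python
# Illustrative only: interleave occasional attention layers into a mostly-SSM
# stack. The ratio below is an assumption, not AI21's actual configuration.
def build_layer_pattern(n_layers: int, attention_every: int = 8) -> list[str]:
    """Return a layer-type pattern, e.g. ['ssm', ..., 'attention', 'ssm', ...]."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "ssm"
        for i in range(n_layers)
    ]

print(build_layer_pattern(16))
# -> seven 'ssm' layers, then 'attention', seven more 'ssm', then 'attention'
```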
2. Leading intelligence: Jamba Reasoning 3B outperforms other device-side models from DeepSeek, Google, Meta, and Microsoft.
It is particularly strong at instruction following (IFBench) and common-sense knowledge (MMLU-Pro and Humanity's Last Exam), which makes it an efficient, capable model for advanced agent workflows or on-device RAG applications.
AI21 attributes these results to its post-training process, which combines alignment techniques such as RLVR, SFT, DPO, and GRPO with its own proprietary methods to ensure model quality (a textbook sketch of one such technique, DPO, follows the figure below).
Jamba Reasoning 3B outperforms models from Alibaba, Google, and others in evaluations
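As a reference point for one of the techniques named above, below is a textbook sketch of the DPO loss (per Rafailov et al., 2023) over per-sequence log-probabilities. This illustrates the general method only; AI21's actual pipeline and proprietary methods are not public.

```python
# Textbook DPO loss sketch -- not AI21's implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over summed per-sequence log-probabilities."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()
```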
3. Built for secure on-device use: The model is released under the Apache 2.0 license. It can be downloaded directly to a computer or mobile phone and customized on-device using the user's own files, enabling fully private applications that keep running even when the network is disconnected.
Because of its hybrid SSM-Transformer architecture, Jamba Reasoning 3B uses a key-value (KV) cache roughly 8 times smaller than that of the original Transformer architecture, keeping memory usage low even as the context grows (see the back-of-the-envelope arithmetic below).
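The cache saving follows from simple arithmetic: only attention layers keep keys and values, so replacing most of them with SSM layers shrinks the cache proportionally. The layer counts below are assumptions chosen to illustrate an 8x reduction; only that ratio comes from AI21.

```python
# Back-of-the-envelope KV-cache sizing. Layer counts, head counts, and dims
# here are illustrative assumptions, not the model's actual configuration.
def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Keys and values: 2 cached tensors per attention layer (fp16 = 2 bytes).
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem

full = kv_cache_bytes(attn_layers=32, kv_heads=8, head_dim=128, seq_len=256_000)
hybrid = kv_cache_bytes(attn_layers=4, kv_heads=8, head_dim=128, seq_len=256_000)
print(f"pure Transformer: {full / 1e9:.1f} GB, hybrid: {hybrid / 1e9:.1f} GB, "
      f"{full / hybrid:.0f}x smaller")
```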
On an M3 MacBook Pro, it generates 40 tokens per second at a 32K context length, outpacing models such as Qwen3-4B, DeepSeek Distill Qwen 1.5B, and Gemma 3 4B and making it a lean building block for advanced agent applications (a simple measurement sketch follows the figure below).
It can generate 40 tokens per second on an M3 MacBook Pro
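For readers who want to sanity-check throughput on their own hardware, here is a simple decode-speed probe. It assumes a model and tokenizer loaded as in the earlier sketch; absolute numbers depend heavily on the runtime (transformers, llama.cpp, MLX) and device, so this is a method sketch, not a reproduction of AI21's figure.

```python
# Rough greedy-decoding throughput probe for a transformers causal LM.
import time
import torch

def decode_tokens_per_second(model, tokenizer, n_new: int = 256) -> float:
    """Time generation of n_new tokens from a trivial prompt."""
    prompt = tokenizer("Hello", return_tensors="pt").input_ids.to(model.device)
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=n_new, do_sample=False)
    elapsed = time.perf_counter() - start
    return (out.shape[-1] - prompt.shape[-1]) / elapsed

# Usage, given `model` and `tokenizer` from the loading sketch above:
# print(f"{decode_tokens_per_second(model, tokenizer):.1f} tokens/sec")
```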
The model currently supports English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew.
Conclusion: Lightweight models accelerate iteration, opening new paths for deploying AI agents
As enterprises integrate AI into their operations, cloud-based large language models have exposed problems of poor cost efficiency. AI21 cited a research report finding that 40%-70% of AI tasks can be handled by small language models, and that intelligent routing can cut costs by a factor of 10 to 30.
Device-side lightweight models like Jamba Reasoning 3B enable cost-effective heterogeneous compute allocation: simple tasks are processed locally while cloud resources are reserved for complex reasoning (a toy routing sketch follows below). This brings low latency to real-time applications in manufacturing and healthcare, offline resilience for remote operations, and stronger data privacy, and it may usher in a more decentralized AI era.
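The routing idea can be sketched in a few lines. The length-based heuristic and the names below are invented for illustration; the cited report does not prescribe a specific router.

```python
# Toy router: send short/simple requests to a local small model and escalate
# the rest to the cloud. The threshold and labels are illustrative assumptions.
def route(prompt: str, complexity_threshold: int = 200) -> str:
    """Crude router using prompt length as a stand-in for task complexity."""
    if len(prompt.split()) < complexity_threshold:
        return "local-3b"   # e.g., Jamba Reasoning 3B running on-device
    return "cloud-llm"      # reserve cloud capacity for complex reasoning

print(route("Summarize this short note."))  # -> local-3b
```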
This article is from the WeChat official account "Zhidx" (ID: zhidxcom); author: Li Shuiqing; editor: Xinyuan. Published by 36Kr with authorization.