The huge overlooked market: why is it so hard for big companies to succeed in Local Agent?
Since the second half of this year, a half-joking quip has been circulating in AI circles: "Why hasn't DeepSeek R2 been released yet? Because the Scaling Law doesn't work anymore."
Behind the laughter lies a harsh reality facing the entire industry: the marginal returns of large models are diminishing, and the rules of the first half of the AI race are breaking down.
First, training models takes enormous amounts of money. Training a GPT-4-level model already costs more than $100 million, and the latest AI industry trends report released by the well-known technology investment firm BOND at the end of May 2025 shows that training the most cutting-edge AI models now approaches the $1 billion mark. This scale of spending far exceeds that of any previous technology development effort, marking an era in which only capital-rich giants can afford to train frontier models.
Second, the growth of model capabilities has hit a bottleneck. The leaps from GPT-3 to GPT-3.5 and then to GPT-4 were astonishing; but from GPT-4 to GPT-4.5 and on to GPT-5, even as parameter counts double, the gains in capability become less and less obvious. The Scaling Law has started to hit the wall: simply piling up parameters is no longer a shortcut to AGI.
Yet while the giants are stuck, a "small model" comeback story is unfolding:
In May this year, DeepSeek released R1-0528 and distilled the chain of thought of its 671B-parameter model into an 8B model (built on Qwen3 8B). The distilled version not only retained much of the large model's reasoning ability but even outperformed the base 8B model by about 10% on the AIME 2024 benchmark.
DeepSeek is not an isolated case. Qwen's newly launched Qwen3-VL 4B/8B (Instruct / Thinking) models run stably on low-memory devices while keeping a 256K-1M ultra-long context and full multimodal capabilities, and they ship FP8 quantized weights, making on-device multimodal AI genuinely deployable.
A paper published by NVIDIA in June 2025 likewise argues that "small language models" (SLMs) with fewer than 10 billion parameters can not only match but even surpass large LLMs on most agent tasks, at roughly 1/10 to 1/30 of the operating cost.
Image source: "Small Language Models are the Future of Agentic AI".
These cases have jolted the entire AI community: small models "standing on the shoulders of giants" can actually surpass the giants themselves.
While OpenAI, Anthropic, and others are still debating how many trillion parameters the next-generation model should have, the industry has quietly shifted from a parameter race to an efficiency revolution, and AI has begun moving from the cloud to the edge, into everyone's everyday devices.
From Cloud First to Local First, AI Enters the Second Half
In 1965, Gordon Moore made a famous prediction: the number of transistors that can fit on an integrated circuit doubles roughly every 18 to 24 months. That prediction became the golden rule of the semiconductor industry for the next half century, driving exponential growth in computing performance and ushering in the mobile internet and cloud computing revolutions.
After roughly 2015, however, this golden rule began to falter. Transistors have shrunk to near-atomic scales, where further miniaturization runs into physical limits such as quantum effects, leakage, and heat dissipation. Manufacturing costs have also skyrocketed; a new fab can cost tens of billions of dollars. In other words, the free lunch of computing power is over.
With Moore's Law slowing down, the technology giants have had to find new paths.
Apple's approach is vertical integration: instead of relying on Intel, it designs its own chips and rewrites how hardware and software cooperate from the ground up. The M1, launched in 2020, was the first SoC (system on a chip) custom-built for the Mac: the CPU, GPU, and Neural Engine share a unified memory pool, cutting data movement and roughly tripling energy efficiency. By the M4 and M5 era, Apple has pushed packaging technology to the extreme, combining functional modules like building blocks with chiplets and 3D stacking to strike a new balance among performance, cost, and power. As the article "A19 and M4: Dual-line Strategy" puts it, Apple keeps chasing leading-edge processes (N3P, N2) on the iPhone while exploring packaging innovation (CoWoS, 3DIC) on the Mac; together these two directions form the dual engines of the post-Moore era.
NVIDIA took a different path. Jensen Huang saw early that single-core performance no longer mattered and that the future belonged to massively parallel, "ten-thousand-core" computing. From 2006 onward he pushed general-purpose GPU computing and bound millions of developers to his camp through the CUDA software ecosystem. In 2017, Tensor Cores debuted with the Volta architecture, dramatically accelerating the matrix multiplications at the heart of AI training; Ampere, Hopper, and Blackwell have grown ever more powerful since, and the H100 and B200 are now the standard for training large AI models. Huang has even declared that "Moore's Law is dead, and Huang's Law takes over": GPU performance doubles every year, not through smaller transistors but through smarter parallel architectures, sparse computing, and super-node interconnects.
Just as in the chip industry, when Moore's Law slows, the race shifts from process competition to architectural innovation: Apple's M-series chips and NVIDIA's Tensor Cores are both products of finding new paths within physical limits.
The AI industry is also experiencing the same paradigm shift as the chip industry.
Over the past three years, generative AI has grown explosively. From ChatGPT to Claude, from GPT-4 to DeepSeek, cloud-based large models have redefined the boundaries of human-machine interaction with near-unlimited computing power and continuous iteration. Beneath the prosperity, however, three pain points have become increasingly hard to ignore:
First, the productivity experience is not a closed loop. Apart from a few scenarios such as coding IDEs, where models directly generate productive tokens, AI in most office and serious R&D settings is still stuck at the single-point, chat-and-consult stage of efficiency gains. Privacy concerns mean that core data and workflows in productivity scenarios cannot simply be uploaded to the cloud end to end with one click. A BBC report in August this year revealed that hundreds of thousands of users' conversations with Grok (Elon Musk's chatbot) had been exposed in search engine results without their knowledge. Lawyers handling sensitive case files, investment managers analyzing internal materials, enterprises guarding trade secrets: in all of these scenarios, uploading data to the cloud means losing control, making it impossible to have both efficiency and security.
Second, token costs have become a bottleneck for applications. According to Anthropic, multi-agent systems consume about 15 times as many tokens as ordinary chat. Foreign media report that agent products such as Manus and Devin can burn through millions of tokens per task, with costs starting around $2 and reaching $50 for complex tasks. This cost structure makes it hard to scale high-frequency, in-depth AI applications.
Third, network dependence limits where AI can be used. Cloud AI fails in everyday settings such as airplanes, subways, and meeting rooms with poor connectivity. When AI is billed as the "water, electricity, and coal of the new era" yet cannot be used anytime the way local mobile apps can, the contradiction becomes hard to bear.
Beyond these three pain points, however, three new forces are converging:
Small model capabilities are undergoing a qualitative change: DeepSeek R1-0528 distilled the chain of thought of its 671B-parameter model into Qwen3 8B, which then beat the base Qwen3 8B by about 10% on AIME 2024 and matched Qwen3-235B-thinking, a model with roughly 30 times as many parameters. Intelligence is no longer simply proportional to model scale; techniques such as knowledge distillation and reasoning enhancement are letting small models stand on the shoulders of giants (a minimal distillation sketch follows this list).
Edge-side chips are opening up the market: NVIDIA's DGX Spark shrinks AI training and inference capabilities that once lived only in data centers down to a desktop-scale box, putting high-performance inference and small-model training within reach at the edge. Apple's M5 chip delivers several times the AI compute per watt of the M4, letting laptops and tablets complete complex generation tasks offline: the intelligence ceiling of consumer devices is rising sharply even as the cost curve falls. Huawei is likewise betting on edge-side large models in the HarmonyOS (Hongmeng) ecosystem. The hardware makers' collective moves are laying the infrastructure for local AI.
User demand is awakening: AI-driven efficiency, data sovereignty, and model sovereignty are no longer geek obsessions but hard requirements for professional users. Just as households are shifting from centralized power supply to distributed photovoltaics, the "electrification" of AI capability is moving toward distributed deployment.
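To make the distillation idea above concrete, here is a minimal, generic sketch of soft-target knowledge distillation in PyTorch. It illustrates the technique the passage names, not DeepSeek's or Qwen's actual training recipe; the temperature, loss weight, and toy tensors are all assumptions.

```python
# Minimal sketch of soft-target knowledge distillation (Hinton-style).
# Illustrative only: the temperature, loss weights, and toy sizes are
# assumptions, not any specific lab's training recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a batch of 4 examples over a 10-class vocabulary.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)          # frozen teacher outputs
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The student learns both from ground-truth labels and from the teacher's full output distribution, which is why a small model can inherit behavior it would struggle to learn from labels alone.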
Against these pain points and trends, a clear consensus is emerging: the future of AI is not the cloud replacing the local, but deep collaboration between cloud and local, with local intelligence handling 50-80% of everyday tasks.
From "Small Models" to "Local Agent", Why is the Local Experience of AI Products Always Disappointing?
Reality, however, has yet to catch up with the ideal. Even as AI Agents boom, the local experience of most existing products remains disappointing.
Take local AI products such as Ollama and LM Studio. The core problem is not that model performance is lacking, but that the way these products are built is fundamentally mismatched with what users need.
First, there is a positioning problem. These products are essentially "local ChatGPT" demo tools built for developers who want to quickly try open-source models from Hugging Face. For ordinary users, this creates three experience problems:
Out of reach for non-technical users: lawyers, investment managers, and other ordinary users understand neither Hugging Face nor the GGUF model format, so they can hardly use these tools directly.
Lack of vertical integration: the products offer only basic chat or API endpoints and cannot meet the needs of complex productivity scenarios such as in-depth document research.
Amplified model defects: the catch-all positioning of "chat about anything" invites users to compare them with top-tier models such as GPT-4, a comparison local models cannot win. Users do not need yet another offline chatbot.
Second, there are problems with the technical stack: most local products are optimizing along the wrong path. Although Ollama and LM Studio have built peripheral tooling such as CLIs around developers' needs, their design as container-style managers for GGUF open-source models has become historical baggage, and the foundation of the whole Local Infra is shaky:
Inference technology bottleneck: the ecosystem leans heavily on post-training quantization (PTQ) schemes such as those behind GGUF. Their fatal flaw is that low-bit quantization (3-bit and below) causes a serious drop in model accuracy, capping the achievable "intelligence density". The result is a constant tug-of-war between model capability and the user's hardware, and models that struggle with the complex multi-step reasoning that agent tasks demand (see the sketch after this list).
Lack of integrated solutions: the "massive" catalog of open-source GGUF models looks abundant but is really like first-generation pre-made meals: heat-and-serve "seasoning packets" (local models plus APIs) rather than an integrated offering of local model, agent infrastructure, and product interaction crafted for end users. Ordinary users need a finished car they can drive away, not a pile of parts to assemble themselves.
Limits of the application ecosystem: developers cannot build a data flywheel on top of third-party, user-generated quantized-model ecosystems such as GGUF. The black-box nature of pre-trained models already hampers business evaluation and iteration, and the irreversible quantization of third-party GGUF models adds further quantization noise, turning the maintenance and iteration of AI capabilities into a bottleneck.
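To illustrate the quantization ceiling described above, here is a minimal sketch of round-to-nearest symmetric post-training quantization. It is a generic PTQ example, not the specific k-quant schemes used in GGUF files; the toy weight distribution and bit widths are assumptions chosen for illustration.

```python
# Minimal sketch of symmetric uniform post-training quantization (PTQ),
# illustrating why accuracy degrades sharply at very low bit widths.
# Generic round-to-nearest PTQ, not the GGUF k-quant schemes themselves.
import numpy as np

def quantize_dequantize(weights, bits):
    # Symmetric uniform quantization: map weights onto 2**bits - 1 integer levels.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale  # dequantize back to float for comparison

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=100_000)  # toy weight tensor

for bits in (8, 4, 3, 2):
    err = np.mean((w - quantize_dequantize(w, bits)) ** 2)
    print(f"{bits}-bit  mean squared quantization error: {err:.2e}")
```

Each bit removed roughly quadruples the mean squared quantization error, which is why aggressive 2-3-bit PTQ tends to hit an accuracy cliff unless the quantization error is compensated during or after training.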
In short, the real value of a local agent is not in "chatting about everything in general". It lies in exploiting the advantage of being closer to the user and one step ahead of the cloud, embedding deeply into specific vertical tools, realizing Tool-Integrated Reasoning, delivering a better intelligent-service experience, and becoming an efficient productivity tool that willingly grinds through the tedious work. The direction of today's mainstream local AI products is more like using a screwdriver to hammer a nail: the wrong tool for the job.
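As a thought experiment, the sketch below shows what tool-integrated reasoning with a local model could look like: the model decides when to call a local tool (here, reading a private file that should never leave the machine) and folds the result back into its reasoning. Everything here is hypothetical: local_model_generate is a mocked stand-in for any on-device inference runtime, and the JSON tool-call convention is illustrative rather than any product's actual protocol.

```python
# Hypothetical sketch of tool-integrated reasoning with a local model.
# `local_model_generate` is mocked; swap in any on-device inference runtime.
import json
from pathlib import Path

def read_local_file(path: str) -> str:
    """A 'dirty work' tool the cloud cannot safely be given: read a private local file."""
    return Path(path).read_text(encoding="utf-8")[:2000]

TOOLS = {"read_local_file": read_local_file}

def local_model_generate(prompt: str) -> str:
    # Mocked model call so the sketch runs end to end:
    # first request the tool, then produce a final answer.
    if "Tool read_local_file returned" in prompt:
        return "Summary: (a real local model would summarize the file here)"
    return json.dumps({"tool": "read_local_file", "args": {"path": "notes.txt"}})

def agent_step(task: str, max_turns: int = 5) -> str:
    history = (
        f"Task: {task}\n"
        'Reply with JSON {"tool": ..., "args": {...}} to call a tool, '
        "or with plain text to give the final answer.\n"
    )
    for _ in range(max_turns):
        reply = local_model_generate(history)
        try:
            call = json.loads(reply)                      # model asked for a tool
            result = TOOLS[call["tool"]](**call["args"])
            history += f"\nTool {call['tool']} returned:\n{result}\n"
        except (json.JSONDecodeError, KeyError, TypeError):
            return reply                                  # plain text => final answer
    return "Stopped after max_turns without a final answer."

# Toy usage: the sensitive file never leaves the device.
Path("notes.txt").write_text("Q3 pipeline review: internal notes...", encoding="utf-8")
print(agent_step("Summarize my local notes without uploading them anywhere."))
```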
From "1 - bit Models" to "Local Agent Infra": GreenBitAI's Ten - year "Local" Long March
Just as the cloud AI race runs up against cost and physical limits, a German technology team that has spent nearly a decade digging into low-bit models is using a professional-grade Local Agent product to pry open the trillion-dollar incremental market of Local Agent Infra.
The story of GreenBitAI is a strategic evolution from building models to building infrastructure.
The story begins in 2016. Deep learning had just taken off, and the mainstream was chasing ever deeper and larger networks. At the same time, a contrarian path emerged: compressing models to the extreme with 1-bit binary neural networks (BNNs). Pioneering work such as XNOR-Net, with its enormous promise of efficiency gains (some even claimed CPUs could one day replace GPUs), set off a brief academic gold rush between 2016 and 2018. Yang Haojin, then at the HPI (Hasso Plattner Institute) in Germany, was one of the few core pioneers worldwide working in this field from the start.
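For readers unfamiliar with the term, "1-bit" means each weight is constrained to just two values. The sketch below shows the basic mechanism in the XNOR-Net spirit: binarize weights with sign() in the forward pass, keep a scaling factor, and use a straight-through estimator so gradients can still update the latent full-precision weights. It is a textbook illustration, not GreenBitAI's method; the layer sizes and per-layer (rather than per-channel) scaling are simplifications.

```python
# Minimal sketch of 1-bit (binary) weights in the XNOR-Net spirit:
# forward with sign(w) times a scaling factor, backward with a
# straight-through estimator (STE). Illustrative, not GreenBitAI's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)                      # weights become +/-1 (1 bit each)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()  # STE: pass gradients through, clipped

class BinaryLinear(nn.Linear):
    def forward(self, x):
        alpha = self.weight.abs().mean()          # per-layer scaling factor (simplified)
        w_bin = BinarizeSTE.apply(self.weight)
        return F.linear(x, w_bin * alpha, self.bias)

layer = BinaryLinear(128, 10)
out = layer(torch.randn(4, 128))
out.sum().backward()                              # gradients reach the latent fp weights
```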
The boom faded as quickly as it arrived. When researchers found that BNN accuracy kept hovering at an unusable level and the bottleneck would not break, the gold mine quickly came to look like barren ground. For an academic community that favors quick wins and fresh concepts, it was time to move on to the next hot topic. The prospectors retreated, and the BNN field cooled into a little-traveled dead end.
But Yang Haojin and his team chose to persevere. It was precisely during this period, which looked coldest and most hopeless from the outside, that they achieved a decisive breakthrough. The road was long and thorny, and every step through the ceiling was hard-won.
The perseverance finally paid off. The GreenBitAI team proved the path with solid milestones:
2018-2020: the team built the first 1-bit CNN models to exceed 60% and then 70% accuracy on ImageNet, matching the accuracy of Google's then state-of-the-art mobile model MobileNet and clearing the bar for deploying BNNs on phones.
End of 2022: the team's BNext-L model reached 80.4% Top-1 accuracy on ImageNet, 3 percentage points higher than Google's contemporaneous model. This is more than a numerical breakthrough: it means an extremely compressed 1-bit model reached, for the first time, the ResNet accuracy baseline used for mainstream deployment at the edge and in the cloud, proving its