
A Scientific Guide to Shrimp Farming

锦缎 · 2026-03-10 12:27
The free "lobsters" are the most expensive to raise.

The "lobster" craze is intensifying. On Monday, as the workweek resumed, Tencent's Qclaw (Lobster) opened its internal beta, and ByteDance's ArkClaw (Lobster) officially went live on the Volcengine platform. Earlier, Alibaba Cloud had already introduced a similar product, CoPaw. Meanwhile, the Ministry of Industry and Information Technology quickly issued an announcement warning of security risks in this field.

For ordinary users, this sudden technological craze is both a rare opportunity to try cutting-edge AI applications and a fog that is hard to see through, making the underlying value and risks difficult to judge.

Fortunately, amid the industry hype, the open-source benchmark PinchBench, developed by Kilo.ai, arrived at just the right time, offering everyone watching this field a valuable anchor for rational judgment:

Official website: https://pinchbench.com/

GitHub project address: https://github.com/pinchbench/skill

01 Benchmark Testing: How to Score AI Agents?

OpenClaw has in fact been available for two months. Back when it was still called ClawdBot, it had already sparked frenzied discussion in the technology community.

Now it has produced an interesting split: early adopters have demystified it and are calling for rational thinking by spelling out its capability boundaries, while latecomers remain wildly enthusiastic, even without understanding the product's positioning or intended use.

As we noted in a previous article, OpenClaw does nothing on its own; what actually drives it is the underlying large language model. In other words, the money people spend on OpenClaw is precisely the API fee for that model.

Since AI entered the agent era, the usability of the large language model (LLM) serving as the "brain" and underlying infrastructure has come to depend more and more on subjective word of mouth.

PinchBench attempts to change that. The benchmark is designed specifically for OpenClaw, testing how the large language model driving it performs on real-world tasks.

Since an agent's core job is to help people get work done, this benchmark differs in focus from earlier ones: it covers 23 standardized tasks such as scheduling, code writing, and even market research.

Its design is also very clear: all tasks are open-sourced in the pinchbench/skill repository on GitHub as Markdown files with YAML metadata. Each task contains five core elements: the prompt, expected behavior, scoring criteria, automated checking functions, and LLM judging rules.
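To make the task format concrete, here is a minimal sketch of loading such a file. The front-matter field names (`id`, `name`, `scoring`) are assumptions for illustration, not the actual pinchbench/skill schema, and the parsing is deliberately naive (flat `key: value` pairs only):

```python
def parse_task(markdown_text: str) -> tuple[dict, str]:
    """Split a Markdown file into YAML front matter and a task body."""
    lines = markdown_text.splitlines()
    assert lines[0].strip() == "---", "expected YAML front matter"
    end = lines.index("---", 1)              # closing '---' delimiter
    meta = {}
    for line in lines[1:end]:                # naive key: value parsing
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body

# Hypothetical task file, loosely modeled on the format described above.
task_md = """---
id: 21
name: OpenClaw Report Comprehension
scoring: automated
---
Read openclaw_report.pdf and answer 8 questions in answer.txt."""

meta, prompt = parse_task(task_md)
```

In a real harness the metadata would also carry the scoring criteria and judging rules; this sketch only shows how metadata and prompt travel together in one file.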

Compared with grading the correctness of scientific questions or the quality of code, measuring the completion of real-world tasks is obviously far more complex.

To reflect the real capabilities of the large model driving OpenClaw as objectively as possible, PinchBench uses a three-tier scoring mechanism: automated, LLM-judged, and hybrid.

Python functions automatically verify objective criteria that are easy to judge, such as file creation and keyword matching, while top-tier models like Claude Opus score subjective dimensions such as content quality and depth of analysis.

If these professional terms are difficult to understand, don't worry. We'll use two actual test tasks to illustrate.

The first question is Task No. 21 in the skills repository: OpenClaw Report Comprehension.

In this task, the large language model must drive OpenClaw to read a research report named openclaw_report.pdf and answer 8 specific questions, such as:

"How many skills were there in the community before filtering? (The correct answer is 5705)"

"What is the second-largest category of skills? (The correct answer is Search & Research: 253)"

This task can be scored entirely automatically. The Python script checks the generated answer.txt file, verifying not only that the numbers match exactly but also, via regular expressions, that the date format is valid and required keywords are present.
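A checker in the spirit described above might look like the following sketch. The expected answers and the all-or-nothing pass/fail logic are illustrative assumptions, not PinchBench's actual script:

```python
import re

# Hypothetical expected answers for the report-comprehension task.
EXPECTED = {
    "skills_before_filtering": "5705",
    "second_largest_category": "Search & Research: 253",
}

def check_answers(text: str) -> bool:
    """All-or-nothing scoring: any mismatch yields zero credit."""
    if not all(answer in text for answer in EXPECTED.values()):
        return False
    # Also require an ISO-style date (YYYY-MM-DD) somewhere in the file.
    return re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# A correct answer file passes; a single-digit slip fails everything.
good = "Before filtering: 5705 skills.\nSearch & Research: 253\nDate: 2026-03-10"
bad  = "Before filtering: 5704 skills.\nSearch & Research: 253\nDate: 2026-03-10"
```

The point of the sketch is the strictness: one wrong digit in `bad` fails the entire task, exactly the behavior the scoring rules describe.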

The scoring is strict: even with 7 questions answered correctly, a single-digit error in the last, simplest question scores zero. This design tests the agent's most basic abilities: extracting structured and unstructured information, and executing precisely.

The second question is Task No. 16 in the skills repository: Competitive Market Research.

Compared with the previous task, this one is closer to users' actual application scenarios, requiring the agent to produce an enterprise-grade competitive analysis of the application performance monitoring market.

To complete it, the agent must work through complex steps: identifying leading vendors, analyzing differentiated positioning, sorting out pricing models, and outputting a structured Markdown document; it is a considerable workload even for a human.

Therefore, this task uses hybrid scoring. The automated part checks criteria such as "are 5 competitors listed" and "is there a comparison table", while research quality and analytical insight are scored by top-tier models, against rubrics as detailed as "does the style resemble a human business analyst" and "are the trends consistent with real-world business dynamics".
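A minimal sketch of such hybrid scoring might look like this. The vendor list, the 40/60 weighting, and the stubbed-out judge are assumptions for illustration, not PinchBench's actual rubric:

```python
# Hypothetical APM vendors the report is expected to cover.
VENDORS = ["Datadog", "New Relic", "Dynatrace", "Grafana", "Elastic"]

def automated_checks(report: str) -> float:
    """Fraction of objective criteria the report satisfies."""
    checks = [
        report.count("|") >= 10,                  # crude "has a Markdown table"
        all(name in report for name in VENDORS),  # lists 5 competitors
    ]
    return sum(checks) / len(checks)

def llm_judge(report: str) -> float:
    """Stub for the subjective pass. A real run would ask a top-tier
    model to grade dimensions like 'reads like a human business analyst'."""
    return 0.8  # placeholder score in [0, 1]

def hybrid_score(report: str, auto_weight: float = 0.4) -> float:
    """Blend deterministic checks with the LLM-judged quality score."""
    return auto_weight * automated_checks(report) + (1 - auto_weight) * llm_judge(report)

# A toy report that passes both objective checks.
demo = "| Vendor | Pricing |\n" + "\n".join(f"| {v} | per-host |" for v in VENDORS)
```

The design point is separation of concerns: the cheap, reproducible checks gate the objective facts, and the expensive LLM pass only handles what a regex cannot.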

02 Evaluation Results: Domestic Models Break Through Strongly

With the evaluation mechanism clear, let's look at the results.

PinchBench divides the evaluation results into three dimensions: success rate, speed, and cost.

On success rate, the top performer comes from Google. Surprisingly, it is not the most capable flagship, Gemini 3.1 Pro, but Gemini 3.1 Flash Lite, the best value-for-money model, designed for high-volume agent tasks.

More noteworthy is that this time, domestic models were not left far behind. MiniMax's MiniMax-M2.1 and Moonshot AI's Kimi-K2.5, the two domestic models topping the OpenClaw API call-volume ranking, came in second and third, only a hair behind Google.

On speed, MiniMax-M2.5 even took the top spot. Alibaba's Qwen3-Max-Thinking and Zhipu's GLM-5 also made the top ten, ranking sixth and seventh.

On cost, the dimension most users care about, domestic AI models hold an obvious advantage over top international models, just as we predicted earlier.

As we can see, the latest flagship models from Gemini, GPT, Claude, and Grok all missed the cost top ten. And while lightweight models and older versions are cheap per token, their success rates are not guaranteed, so their total cost may not come out ahead.

Also worth noting is how huge the cost gaps between models are: the token cost of tenth-ranked Qwen3-Coder-Next is already more than 12 times that of first-ranked GPT-5-Nano. And that is the cost under the best case.

In practical applications, what users need most is for the model to "do the work well", and on this basis, the lower the cost, the better.

If we divide the combined chart of task success rate and cost into four quadrants, the upper left represents "cheap and useful", while the upper right represents "expensive but useful".

The model names of MiniMax, Moonshot AI, and Zhipu all land in the upper-left quadrant.

This also reflects the reality at the technical level:

The arrival of the Agent era has effectively narrowed the capability gap between underlying large models.

Domestic large models not only have an advantage in token cost but also have reached the international top - level performance in agent tasks.

03 Free Trap: Hidden Costs and Security Risks

Returning to recent industry news: Tencent's public-welfare campaign has removed the last barrier to using OpenClaw.

Even for users who cannot attend in person, and compared with the "one-click deployment" features major AI platforms launched earlier, the new flow of scan-to-log-in, one click, and copy-paste involves almost no technical difficulty.

Shenzhen's Longgang District is even preparing policies to support OpenClaw.

This string of headlines has left people disoriented; some in the technology community even find it a bit absurd.

After reading the above content about PinchBench, everyone should understand:

Installing OpenClaw "for free" is not actually free.

Behind this lies an easily overlooked technical detail: running an agent and directly calling a large language model are completely different things in terms of resource consumption.

As we mentioned in the previous article, the resource cost of directly calling a large language model for question-and-answer chat is relatively controllable.

Putting an agent to work is another matter entirely. Searching the web, reading reports, organizing files, analyzing and summarizing: tasks humans take for granted can mean hundreds or thousands of API calls for AI, with token consumption to match.

Worse, this consumption is opaque. The vaguer the instructions, the more the agent must call tools, backtrack through context, and retry after errors.

The linear increase in the number of interactions leads to an exponential increase in token consumption.
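A back-of-envelope sketch shows the dynamic: if each agent step re-sends the entire accumulated context, a common pattern in agent loops, cumulative input tokens grow superlinearly with the number of steps. The per-step token figure here is an illustrative assumption:

```python
# Toy cost model: step i re-sends roughly i steps' worth of accumulated
# context, so cumulative input tokens grow quadratically with step count.
def cumulative_input_tokens(steps: int, tokens_per_step: int = 2_000) -> int:
    return sum(i * tokens_per_step for i in range(1, steps + 1))

# 10x the interactions costs roughly 100x the input tokens:
# cumulative_input_tokens(10)  -> 110_000
# cumulative_input_tokens(100) -> 10_100_000
```

Under this toy model, ten times as many interactions costs nearly a hundred times as many input tokens, which is exactly why vague instructions become expensive.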

This well-hidden resource-consumption logic, together with OpenClaw's potential security risks, can be punishing for ordinary users drawn in by "free installation".

This also explains why the attitude of the technology community is completely opposite to that of ordinary users recently.

The aftermath of Tencent's public-welfare campaign reflects the problem: within hours of OpenClaw being installed for users for free, complete with a "baby lobster birth certificate", some netizens reported a stream of small charges to their accounts, totaling more than 200 yuan.

Although Tencent promptly responded that the charges stemmed from earlier actions and had nothing to do with the OpenClaw deployment, the episode sounded an alarm for users: free installation does not mean free use.

Recently, major domestic AI companies have rolled out Coding Plan products as cost-effective alternatives to buying API access directly; in essence, this is also a way to sell surplus tokens and cloud capacity.

04 Return to Rationality: What Will Be Left After the Craze Fades?

Regarding this "lobster-raising" craze, a user on the Linuxdo forum left a comment that, though a bit extreme, hit the nail on the head.

There is nothing wrong with "experimenting". Technology enthusiasts exploring new tools and trying new solutions are the driving force behind technological progress.

When it comes to the product itself, however, OpenClaw still faces an awkward reality: the deployment threshold is nearly zero, but the threshold for effective use remains very high.

Perhaps most people who installed OpenClaw this weekend enjoyed the sense of accomplishment at the moment of successful deployment, and gained a dinner-table talking point ("I'm raising lobsters too"), but never felt any actual value from the tool.

In the technology community, I saw a view worth sharing:

People who use OpenClaw should meet the following three conditions:

① Be very clear about what OpenClaw can do;

② Be very clear about how OpenClaw realizes its value;

③ Use it with a purpose and achieve good results.

The reality is often the opposite: many install OpenClaw out of herd mentality or curiosity, only to find their wild expectations completely mismatched with its actual capability boundaries. Once the fantasy of "a day's work done in one sentence" shatters, they don't know what else OpenClaw can do and naturally never achieve the expected results. Some leave it idle; others uninstall it outright.

This is a typical case of "product capabilities are ahead of user needs".

The current craze is essentially another round of trend-following. First came one-click deployment, then free installation; more and more people are "raising lobsters", and the "fish tank" keeps filling with "pets".

It's undeniable that after the birth of a revolutionary new product, there will always be people whose creativity yields more value than the tokens cost.

For most ordinary users, the technology itself is blameless. But oversimplified messaging, absent expectation management, and the blind enthusiasm stirred by the word "free" saddle these explorers with unnecessary trial-and-error costs.

The craze will eventually fade away, and what remains will be the tools and users that truly solve problems.

The emergence of benchmarks like PinchBench means agents have moved from laboratory demos into reality.