
Real lobster farming: let the lobster evolve while you chat, in three steps. Reinforcement learning without GPUs or datasets.

QbitAI · 2026-03-12 15:53
It can also automatically generate skills.

Letting OpenClaw do the work is no longer enough. Now programmers are trying everything they can to make the lobster grow stronger on its own.

Note: this is not a single-point improvement on one task. This time, someone has wrapped the entire agent in an online reinforcement learning system, MetaClaw.

No GPU cluster to maintain, no dataset to curate, no manual fine-tuning: the AI simply gets smarter as you chat with it.

The new learning mode turns your daily conversations with the AI directly into training data. The entire learning cycle runs in the background without disrupting normal use.

As you chat, MetaClaw silently intercepts OpenClaw's interactions, scores each round of conversation, and then optimizes the agent's decision-making policy through online fine-tuning.

It also learns from experience. If the AI makes a mistake in some turn, MetaClaw automatically digs out the complete interaction trajectory, analyzes where the problem lies, then generates a new skill and stores it in the skill library.

The next time a similar pitfall appears, the relevant skills are retrieved and injected into the system prompt, so the same kind of mistake is avoided.
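The retrieve-and-inject idea can be sketched in a few lines. This is a minimal illustration, not MetaClaw's actual API: the names `SkillLibrary`, `retrieve`, and `inject` are hypothetical, and real systems would likely use embedding similarity rather than keyword overlap.

```python
# Hypothetical sketch: a skill library with keyword-based retrieval,
# plus injection of matched skills into the system prompt.

class SkillLibrary:
    def __init__(self):
        self.skills = []  # list of (keyword set, instruction)

    def add(self, keywords, instruction):
        self.skills.append((set(keywords), instruction))

    def retrieve(self, user_message, top_k=2):
        # Score each skill by keyword overlap with the incoming message.
        words = set(user_message.lower().split())
        scored = [(len(kw & words), inst) for kw, inst in self.skills]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [inst for score, inst in scored[:top_k] if score > 0]

def inject(system_prompt, skills):
    # Append retrieved skill instructions to the system prompt.
    if not skills:
        return system_prompt
    return system_prompt + "\n\nRelevant skills:\n" + "\n".join(f"- {s}" for s in skills)

lib = SkillLibrary()
lib.add(["csv", "parse"], "Always check for a header row before parsing CSV files.")
lib.add(["git", "rebase"], "Never force-push over a broken rebase; abort it first.")

prompt = inject("You are a coding agent.", lib.retrieve("help me parse this csv file"))
```

Only skills with a nonzero match score are injected, so an unrelated `git` skill stays out of a CSV question.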

Skill injection + Skill evolution

The model base is built on Kimi-2.5, and a lightweight alternative, Qwen3-4B, is also prepared so it can run on low-end devices.

The core mechanism is the team's own SkillRL, a skill-enhanced reinforcement learning framework. In short, it combines skill injection with skill evolution.

  • Skill injection

Relevant skill instructions are matched in every round of conversation, so the AI can improve on the spot without waiting for training to finish.

  • Skill evolution

The AI shifts from passively receiving instructions to actively generating skills. As the skill library grows richer, its ability keeps improving.
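The evolution half can be sketched as a gate on reward: only failed trajectories spawn new skills. The `analyze_failure` stub below stands in for the LLM call that MetaClaw would delegate this analysis to; the function names and the reward threshold are illustrative assumptions.

```python
# Hypothetical sketch of skill evolution: a low-reward trajectory is analyzed
# and distilled into a new skill entry for the library.

def analyze_failure(trajectory):
    # Stub standing in for an LLM call that explains the mistake
    # and turns it into a reusable instruction.
    return {
        "keywords": ["timeout", "retry"],
        "instruction": "When a request times out, retry with exponential backoff.",
    }

def evolve(skill_library, trajectory, reward, threshold=0.0):
    # Only trajectories scored below the threshold produce new skills.
    if reward >= threshold:
        return False
    skill_library.append(analyze_failure(trajectory))
    return True

skills = []
evolve(skills, trajectory=["user: fetch the page", "agent: error, gave up"], reward=-1.0)
```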

The most attractive part of the setup: no local GPU cluster to rely on, and nothing to maintain yourself.

MetaClaw outsources all training tasks to the Tinker cloud platform, completely separating training from deployment.

As long as your device can connect to the Internet, you can run the entire system. You don't have to worry about computing power, nor do you need a dedicated engineering team for maintenance.

This drops the barrier to continuous AI learning about as low as it can go: even ordinary users can raise an evolving lobster.

MetaClaw's detailed design also shows a keen sense of developers' pain points.

Asynchronous architecture + dual learning modes fully decouple serving, reward modeling, and training. The AI responds to users in real time while the background scores and optimizes, so neither "work" nor "learning" is disrupted.
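The decoupling idea is essentially a producer/consumer handoff: serving pushes finished turns onto a queue and returns immediately, while a background worker scores them. A minimal sketch with Python's standard library (the real system would call a model and a reward model where the stubs are):

```python
# Hypothetical sketch: serving and scoring decoupled by a queue.
import queue
import threading

turn_queue = queue.Queue()
scored = []

def serve(user_msg):
    reply = f"echo: {user_msg}"        # stub for the real model response
    turn_queue.put((user_msg, reply))  # hand off without blocking the user
    return reply

def learner():
    # Background worker: pull turns, score them, accumulate for training.
    while True:
        item = turn_queue.get()
        if item is None:  # sentinel: shut down
            break
        user_msg, reply = item
        scored.append((user_msg, reply, 1.0))  # stub reward-model score

t = threading.Thread(target=learner)
t.start()
for msg in ["hi", "fix my bug"]:
    serve(msg)
turn_queue.put(None)
t.join()
```

The user-facing `serve` call never waits on scoring or training, which is the point of the asynchronous design.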

There is also a choice of learning modes. For a lightweight approach, use reinforcement learning to optimize from users' implicit feedback; for deeper improvement, use online policy distillation driven by high-quality text feedback.

The main idea is that you can train the AI in any way you like.

Get started in three steps

It's very easy to use, just three steps.

Step 1: Install the dependencies. The first group covers general serving and large-model libraries, used for running the API, sending requests, and connecting to models.

The second group, tinker and tinker-cookbook, is the key: these are the SDKs for cloud-based LoRA training.

- pip install fastapi uvicorn httpx openai transformers
- pip install tinker tinker-cookbook

Step 2: Run the configuration script to point the OpenClaw gateway at the MetaClaw proxy. Kimi-2.5 is highly recommended.

- bash openclaw_model_kimi.sh

Step 3: Set the Tinker API key and directly run the training script.

- export TINKER_API_KEY="xxx"
- cd /path/to/metaclaw
- python examples/run_conversation_rl.py

Done. After that, you just need to chat with the Agent as usual. MetaClaw will automatically collect conversation rounds, score them, and train the model.

It hot-swaps the weights every time a batch of samples is collected, with no manual intervention anywhere in the process.
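The batch-and-swap loop can be sketched as follows. The batch size, the `train_on` stub, and the version-counter weights are illustrative assumptions; in MetaClaw the training call would go out to the Tinker platform and return updated LoRA weights.

```python
# Hypothetical sketch: accumulate scored rounds, then hot-swap weights per batch.

BATCH_SIZE = 4
buffer = []
weights = {"version": 0}

def train_on(batch, current):
    # Stub for the cloud LoRA update; here it just bumps a version counter.
    return {"version": current["version"] + 1}

def on_scored_round(sample):
    global weights
    buffer.append(sample)
    if len(buffer) >= BATCH_SIZE:
        weights = train_on(buffer, weights)  # hot-replace, serving never restarts
        buffer.clear()

for i in range(9):
    on_scored_round({"round": i, "reward": 0.5})
```

After nine rounds with a batch size of four, two swaps have happened and one sample waits in the buffer for the next batch.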

If you want to enable skill injection, just set it in the configuration:

- config = MetaClawConfig(use_skills=True)

If you want to start skill evolution, set it like this (taking GPT-5.2 as an example):

- config = MetaClawConfig(use_skills=True, enable_skill_evolution=True, azure_openai_deployment="gpt-5.2")

Then configure the keys:

- export AZURE_OPENAI_API_KEY="xxx"
- export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"

All configuration items are gathered in MetaClawConfig, including model selection, LoRA parameters, batch size, training steps, and loss function type, all clear at a glance.
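As an illustration, such a config might look like the dataclass below. Only `use_skills`, `enable_skill_evolution`, and `azure_openai_deployment` appear in the article's own snippets; every other field name and default here is an assumption about how the listed options might be exposed.

```python
# Illustrative sketch of a MetaClawConfig-style dataclass.
# Field names beyond use_skills / enable_skill_evolution /
# azure_openai_deployment are assumptions, not MetaClaw's real signature.
from dataclasses import dataclass

@dataclass
class MetaClawConfig:
    base_model: str = "Qwen3-4B"   # lightweight default; Kimi-2.5 via the gateway
    lora_rank: int = 16            # LoRA parameters (assumed name)
    batch_size: int = 8
    train_steps: int = 100
    loss_type: str = "ppo"         # loss function type (assumed name)
    use_skills: bool = False
    enable_skill_evolution: bool = False
    azure_openai_deployment: str = ""

config = MetaClawConfig(use_skills=True, enable_skill_evolution=True)
```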

Well, now it's really like "raising lobsters" (doge).

The work on MetaClaw is led by Huaxiu Yao, an alumnus of the University of Electronic Science and Technology of China and an assistant professor in the Department of Computer Science at UNC. He was previously a postdoctoral fellow at the Stanford AI Lab, focusing on agents and embodied AI.

Project address: https://github.com/aiming-lab/MetaClaw

Reference links:
[1] https://x.com/BoWang87/status/2031094971630235941
[2] https://x.com/HuaxiuYaoML/status/2031069599651729905

This article is from the WeChat official account "QbitAI", which covers cutting-edge technology. It is published by 36Kr with authorization.