
What Actually Enables Hermes to Outshine OpenClaw?

Friends of 36Kr · 2026-04-15 15:52
With its self-evolving skills and active memory system, Hermes Agent surpasses OpenClaw and leads the new direction of open-source agents.

In April 2026, OpenClaw (commonly known as "Lobster"), which had only been popular for two months, faced a challenger. Hermes Agent topped the GitHub Trending list for several consecutive weeks and amassed 22,000 stars.

How popular was it? Even Anthropic copied it. On April 10th, Teknium, the founder of Nous Research, complained publicly that Anthropic was "copying" Hermes' feature of automatically judging task completion and proactively reminding users. The community narrative was also unified: Hermes, with its self-evolving Agent, automatic memory management, and user modeling system, had comprehensively surpassed the former king OpenClaw in technology and redefined the direction of open-source Agents.

However, if you put aside these grand narratives and really compare the two in detail, you'll find that there are far more similarities than differences in their functions.

For example, both support scheduled tasks. Hermes supports human-readable formats and standard cron expressions, and each task runs in an isolated session. OpenClaw also supports three scheduling types: at, every, and cron. Tasks are persisted directly in a local JSON file, so they survive a restart.
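To make the persistence model concrete, here is a minimal sketch. The file name, task schema, and function names are all hypothetical; the article only says that tasks are written to a local JSON file and survive restarts:

```python
import json
from pathlib import Path

# Hypothetical file name and schema; the source only states that tasks
# persist in a local JSON file across restarts.
TASKS_FILE = Path("tasks.json")

def add_task(task_id: str, schedule_type: str, spec: str, command: str) -> None:
    """Persist a scheduled task so it survives a process restart."""
    assert schedule_type in ("at", "every", "cron")
    tasks = json.loads(TASKS_FILE.read_text()) if TASKS_FILE.exists() else {}
    tasks[task_id] = {"type": schedule_type, "spec": spec, "command": command}
    TASKS_FILE.write_text(json.dumps(tasks, indent=2))

def load_tasks() -> dict:
    """Reload every task after a restart."""
    return json.loads(TASKS_FILE.read_text()) if TASKS_FILE.exists() else {}
```

The point of the design is that scheduling state lives on disk rather than in the process, so a crash or restart costs nothing.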

Another example is sub-Agent delegation, which both have. Hermes' delegate_task supports single tasks and up to 3 parallel sub-tasks. The sub-Agent environment is completely isolated, and it only returns a summary after finishing the work. OpenClaw's sub-agent mechanism also supports this kind of background isolated execution and result return, and can even configure the nesting depth.

Both have browser automation, TTS (text-to-speech), Vision capabilities, image generation, and voice interaction. As for Gateways, both integrate with more than 20 messaging platforms, including Telegram, Discord, Slack, WhatsApp, and Signal.

If you check the list item by item, you'll find that their features almost completely overlap. Judged by the feature list alone, there is no "absolute crushing" victory.

So here's the question: if the features are the same, why is Hermes so popular? How much of the "self-evolution", "automatic memory", and "user modeling" hyped in the community reflects genuine structural differences?

01 Self-Evolving Skills

If you look through the default configurations of both, the only significant difference you can find is that Hermes has achieved a closed loop of automatic evolution in Skills.

A Skill is a knowledge unit of an Agent's workflow. In simple terms, it's a Markdown file that tells the Agent what steps to take for a certain type of task, which tools to use in the middle, and how to salvage the situation if something goes wrong.

Hermes splits the skill life cycle into two parts: silent generation at runtime, and heavyweight offline evolution.

Let's start with generation. When you ask the Agent to do a task, if it calls a tool more than 5 times, recovers from an error on its own, or you as a user directly correct its output, a set of hard-coded rules in the main repository is triggered. The Agent silently packages the successful workflow and saves it as a local SKILL file. This step is completely silent; often you don't even know it has written a new skill for itself.
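The trigger conditions described above can be sketched as a few lines of deterministic code. The struct and function names are illustrative; only the three conditions and the threshold of 5 come from the article:

```python
from dataclasses import dataclass

@dataclass
class SessionStats:
    tool_calls: int = 0
    recovered_from_error: bool = False
    user_corrected: bool = False

# Threshold taken from the article's description ("more than 5 tool calls")
TOOL_CALL_THRESHOLD = 5

def should_generate_skill(stats: SessionStats) -> bool:
    """Deterministic trigger check: no model judgment involved."""
    return (
        stats.tool_calls > TOOL_CALL_THRESHOLD
        or stats.recovered_from_error
        or stats.user_corrected
    )
```

Note that the check runs in plain code around the model, which is exactly why it fires reliably and silently.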

The next time it encounters a similar task, it automatically searches the index. Loading is progressive across four tiers, like searching for material in a library. It first checks the catalog cards (Tier 0), stuffing only the names and descriptions into the system prompt, which takes up about 3,000 tokens. If the direction is right, it then goes to the shelves tier by tier to fetch the full content.
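A minimal sketch of the catalog-card idea: only names and one-line descriptions enter the prompt, and full skill bodies are fetched on demand. The skill records, field names, and functions here are all hypothetical:

```python
# Hypothetical skill records; only the Tier 0 fields enter the system prompt.
skills = [
    {"name": "deploy-staging", "description": "Build, test, and push to staging",
     "body": "# deploy-staging\n1. run tests\n2. build image\n3. push"},
    {"name": "triage-bug", "description": "Reproduce, bisect, and file a report",
     "body": "# triage-bug\n1. reproduce\n2. git bisect\n3. file issue"},
]

def build_tier0_index(skills: list) -> str:
    """Tier 0: names and one-line descriptions only -- cheap in tokens."""
    return "Available skills:\n" + "\n".join(
        f"- {s['name']}: {s['description']}" for s in skills
    )

def load_full_skill(skills: list, name: str) -> str:
    """Deeper tiers: fetch the full Markdown body only once it's needed."""
    for s in skills:
        if s["name"] == name:
            return s["body"]
    raise KeyError(name)
```

The economics are the whole point: a large skill library costs a few thousand tokens at Tier 0 instead of the full contents of every file.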

But what really sets Hermes apart is the second step: evolution.

Hermes has a built-in offline batch evolution algorithm and a separate repository (hermes-agent-self-evolution). The engine uses the DSPy framework, along with a core algorithm called GEPA.

The full name of GEPA is Genetic-Pareto Prompt Evolution. This system was not created by Hermes but comes from an ICLR 2026 Oral paper by Lakshya Agrawal et al., titled "Reflective Prompt Evolution Can Outperform Reinforcement Learning".

Most academic work on skill evolution currently follows the RL (Reinforcement Learning) route. Frameworks like SkillRL or SAGE even have "RL" in their names, hoping to strengthen the skill library through gradient updates. GEPA takes the opposite path and deliberately abandons reinforcement learning. The GEPA paper sets out to prove that even without gradient updates, the reflection ability of large models combined with evolutionary search can not only outperform RL but do so with higher sample efficiency.

How does it work? This algorithm has three core foundations.

First is reflective mutation. This isn't random, guess-based mutation: the large model reads previous execution traces, reflects on what went right or wrong, and figures out which parts of the prompt need to change.

Second is Pareto frontier selection. After a batch of mutated candidate skills is generated, it doesn't just keep the ones with the highest overall average. As long as a candidate performs best on even one evaluation sample, it is retained. This preserves the diversity and robustness of skill exploration.
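The selection rule is simple enough to sketch directly. This is a simplified reading of GEPA's Pareto-frontier selection, not the paper's exact implementation:

```python
def pareto_select(scores: dict) -> set:
    """Keep every candidate that is best on at least one evaluation sample.

    `scores` maps candidate id -> per-sample scores (same sample order).
    Simplified reading of GEPA's Pareto-frontier selection, not the
    paper's exact algorithm.
    """
    n_samples = len(next(iter(scores.values())))
    keep = set()
    for i in range(n_samples):
        best = max(s[i] for s in scores.values())
        keep.update(cid for cid, s in scores.items() if s[i] == best)
    return keep
```

Note how a candidate with average score 0.5 that is never best anywhere gets dropped, while two "specialists" with the same average both survive; that is the diversity guarantee.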

Finally, natural language feedback serves as the mutation signal. Traditional RL uses numerical rewards to guide parameter updates, but numerical signals are too coarse-grained: a score of 0.6 after one run tells you nothing about what was right or wrong. GEPA attaches specific natural language feedback to each mutation, such as "This step didn't check the boundary conditions" or "It should read the configuration first and then write the cache". LLMs can understand this feedback and generate the next round of variants from it, which is far more effective than interpreting a floating-point number.

The workflow is as follows: The system regularly reads the existing SKILL files, samples from historical sessions (or synthesizes them by itself) to create an evaluation set. Then GEPA intervenes, examines the execution traces, provides reflective feedback, generates candidate variants, runs an evaluation round, and finally selects the winners using the Pareto algorithm.
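The loop above can be written as a schematic round. Every callable here (reflect, mutate, score, select) is a stand-in for a model call or the evaluator; names and signatures are illustrative, not Hermes' actual API:

```python
def evolve_round(base_skill: str, eval_set, reflect, mutate, score, select):
    """One schematic offline evolution round. All callables stand in for
    model calls and the evaluator; names and signatures are illustrative."""
    candidates = {"base": base_skill}
    # Reflect on execution traces, then propose mutated variants
    for cid in list(candidates):
        feedback = reflect(candidates[cid], eval_set)
        candidates[f"{cid}-mut"] = mutate(candidates[cid], feedback)
    # Score every candidate on every evaluation sample
    scores = {cid: [score(text, ex) for ex in eval_set]
              for cid, text in candidates.items()}
    winners = select(scores)
    return {cid: candidates[cid] for cid in winners}
```

In the real pipeline the winners don't replace anything directly; as the article notes next, they go out as a PR for human review.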

After this offline evolution loop completes and an optimized Skill is produced, it doesn't directly overwrite the original file. Instead, it opens a PR (Pull Request). Only when you, the human reviewer, approve and merge it does the evolved skill take effect. The system never commits directly.

This directly shatters the myth in the community that "users don't need to intervene at all". Hermes' attitude is very clear: skill generation can be fully automatic and silent, but skill evolution must be reviewed by humans.

Let's look at OpenClaw. It also has a Skill system, but you have to take the initiative at every step: manually create the files, manually install, and manually authorize. Only when all three are done does the skill take effect. And if you create a new Skill, you also need to restart the centrally managed Gateway process before the system recognizes it.

Moreover, its loading is blunt. It does no task matching: anything configured is stuffed wholesale into the context, unless you manually add a disable tag to remove it.

Both have Skills. The real difference lies in who presses the start button. Hermes says "Let me do it", while OpenClaw says "Do it yourself".

02 Who Remembers for Whom

If Skills explain why Hermes "gets faster with use", then the other half of the community narrative, "it understands who I am", comes down to the memory system.

Currently, the three major mainstream open-source Agents (Claude Code, OpenClaw, and Hermes) all have automatic memory. But dig a little deeper and you'll find that whom they serve, how they trigger, and how long memories persist are completely different.

Let's start with Claude Code. Its auto-memory is enabled by default. While working, it automatically records build commands, debugging experience, architecture notes, and even code style. It also runs Auto Dream every 24 hours to organize and clear out expired or contradictory information. It sounds very intelligent, but this system has extremely strict project isolation.

Its boundary is fixed at the git root (project root directory). The hard-learned lessons of Project A never carry over to Project B. It doesn't remember your personal preferences and doesn't care who is sitting in front of the screen. It only cares about "how to run this project".

Now, let's talk about OpenClaw. Its memory system is more long-term. Every time a conversation starts, it force-loads 8 underlying files, including MEMORY.md and USER.md, into its "mind". These two files are shared across projects and written automatically.

How does it write? The mechanism is extremely passive, more like a fallback. Just before a conversation's context (tokens) hits its limit and the system performs a major compaction, the Agent quietly runs a silent turn. In this turn, it records the key points of the current conversation in the daily diary file and writes your preferences into the long-term MEMORY.md or USER.md.

So when you open OpenClaw after a long break and find that "it still remembers who you are", it's thanks to this long-term passive safety net. Those preferences have long since been stuffed into the files read at startup. This can genuinely make the AI feel "nurtured". But in essence, it's more like a survival instinct: when it realizes its "brain" is full, it hurriedly saves the data. As for the old diaries, without an external semantic vector database, it can only search them by keyword.

In this regard, Hermes follows a different logic. Before v0.7, Honcho was the only hard-coded long-term memory backend in Hermes, with no other options.

The previously default Honcho was designed very cleverly. Most Agent memory systems (including Hermes' default built-in memory) are essentially passive recorders. They chop up what you've talked about, convert it into vectors, and store it in a database. The next time a similar topic comes up, they retrieve it by distance (embedding cosine similarity).

Honcho doesn't follow this path. It's an "AI-native" memory backend, focused on asynchronous dialectic reasoning and in-depth entity modeling.

After you finish chatting with the Agent and the main session ends, Honcho's work is just beginning. It initiates additional model calls in the background to analyze the chat history, extract the concepts (Entities) in your words, surface underlying preferences, and even dialectically reconcile your contradictory statements. It distills your casual chatter into structured "Insights".

It sounds very advanced, but it also consumes a lot of tokens and can easily wash out key details. Making it an optional plugin is the safer choice.

But even without Honcho, Hermes' memory writing is much more active than OpenClaw's. Hermes has a nudge mechanism. Instead of waiting for the "brain" to fill up, it triggers about every 15 rounds of conversation. This is a reflection instruction the system forces on the Agent, prompting it to quickly review the conversation and check whether any user habits are worth recording. This high-frequency active reflection lets Hermes write far more information into persistent files in the same amount of time.
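The nudge is another piece of deterministic scaffolding, which a few lines can illustrate. The interval constant comes from the article ("about every 15 rounds"); the function name and instruction text are hypothetical:

```python
NUDGE_INTERVAL = 15  # the article says "about every 15 rounds"

def maybe_nudge(turn_count: int):
    """Inject a reflection instruction on every Nth conversation turn.

    Returns the instruction string, or None when no nudge is due.
    The wording of the instruction is illustrative.
    """
    if turn_count > 0 and turn_count % NUDGE_INTERVAL == 0:
        return ("Review the conversation so far and record any user habits "
                "or preferences worth persisting to long-term memory.")
    return None
```

The contrast with OpenClaw is structural: a periodic timer fires many times per long session, while a pre-compaction fallback fires at most once.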

Not only is the writing more active; Hermes' memory retrieval is also more powerful. Full-text search via SQLite FTS5 is built into the default architecture, with no need to configure a separate embedding service. When the Agent wants to look up old records, it can search directly across a large body of past chat history.
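FTS5 makes this almost free to implement, as a small sketch shows. The table and column names are illustrative, not Hermes' actual schema; the sqlite3 and FTS5 usage itself is standard:

```python
import sqlite3

# Requires an SQLite build with the FTS5 extension (included in most
# standard Python distributions). Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(role, content)")
conn.executemany("INSERT INTO messages VALUES (?, ?)", [
    ("user", "please deploy the staging server tonight"),
    ("assistant", "staging deploy finished, logs look clean"),
    ("user", "remind me about the dentist on friday"),
])
# Full-text query ranked by BM25 relevance -- no embedding service needed
rows = conn.execute(
    "SELECT content FROM messages WHERE messages MATCH ? ORDER BY rank",
    ("staging",),
).fetchall()
```

Keyword search with BM25 ranking won't catch paraphrases the way embeddings do, but it is zero-dependency, deterministic, and fast over years of chat logs.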

When comparing these three, the evolution line becomes clear. OpenClaw has a passively triggered long-term memory system. Claude Code can actively record and organize, but its focus is on tasks rather than users. Hermes makes the trigger timing extremely active, allows easy switching of memory plugins, shares memory globally, and ships a retrieval tool that can search all history by default.

This is how the difference in daily usage experience is created. OpenClaw only remembers you when it's about to crash. Hermes, on the other hand, secretly speculates about your thoughts every once in a while and can retrieve your past conversations at any time.

03 Hiding Complexity

Whether it's the self - generation of Skills or the high - frequency active writing of memory, they actually refer to the same thing: Hermes makes decisions for you that you should have made.

However, the complexity of the system is conserved.

Just because you don't need to take action doesn't mean the decision disappears. It just shifts from your manual operation to hard-coded rules at the bottom.

In the process of building this harness (Agent shell), Hermes' designers concluded that model judgment is unreliable, so they wrote hard-coded rules.

This harness is much more rigid than those of Anthropic and others. When the Agent is working, it isn't a bare large model running and thinking on its own. The large model is tightly wrapped in a code framework full of conditional judgments.

Has the tool been called 5 times? Has the conversation reached 15 rounds? Did it just recover from a failed retry? Did the user clearly point out an error? The system doesn't let the large model make vague judgments on these questions. Instead, it monitors them one by one with deterministic code. As soon as a condition is met, it immediately executes the pre-written action: generating an initial skill, forcing a reflection instruction, or recording a line in a long-term file.

These defense nets everywhere are the transferred complexity. What should have been self - regulated by the user during use is now all written into Hermes' code.

Hermes writes these rules based on design judgment. Triggering skill generation after 5 tool calls is a balance. Setting it to 3 times would lead to too many false triggers, and setting it to 8 times might miss valuable workflows. Reflecting every 15 rounds instead of every round is to avoid generating a large amount of garbage memory and reducing costs.

You sit in front of the screen and feel so comfortable without having to manage anything. Behind the scenes, Hermes' development team has pre-written all the judgment logic for you.

Automation doesn't eliminate decision - making. It just hides it in an invisible place.

To ensure that this set of hard - coded rules doesn't fail without human supervision, Hermes has made a series of defensive designs at the bottom.

First, look at context management. When the conversation reaches 85% of the threshold, Hermes doesn't ask the large model for an intelligent summary. Its ContextCompressor is pure string-replacement logic, simply swapping old tool output for a placeholder. It's crude but absolutely safe. For memory, it uses frozen snapshots: memory is poured into the system prompt at startup and not refreshed mid-session; changes only take effect after the next restart. This sacrifices real-time freshness but guarantees a stable prefix-cache hit rate, cutting token input cost by about 75%. The spirit of both choices is the same: don't let the LLM make dynamic judgments about context and memory within a session. Use the simplest rules to ensure determinism.
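The compressor can be sketched in a few lines. The 85% threshold comes from the article; the placeholder text, message shape, and function name are assumptions:

```python
COMPRESS_AT = 0.85                    # article: kicks in at 85% of the window
PLACEHOLDER = "[tool output elided]"  # placeholder text is illustrative

def compress(messages: list, used_tokens: int, limit: int) -> list:
    """Pure string replacement over old tool outputs.

    Deterministic by construction: no LLM summarization, so the result
    never hallucinates or drops the wrong thing unpredictably.
    """
    if used_tokens < COMPRESS_AT * limit:
        return messages
    return [
        {**m, "content": PLACEHOLDER} if m["role"] == "tool" else m
        for m in messages
    ]
```

A summarizing compressor could preserve more information, but its output varies run to run; this one trades fidelity for total predictability.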

Then look at its security review. The built-in Smart approval mode likewise doesn't let the large model judge whether a command is dangerous. It runs terminal operations through a hard-coded blacklist of regular expressions; on a match, human confirmation is required.
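The gate looks roughly like this. The specific blacklist entries here are hypothetical examples; the article only says the patterns are hard-coded and regex-matched against terminal commands:

```python
import re

# Hypothetical blacklist entries; the source only states that patterns
# are hard-coded and regex-matched against terminal operations.
DANGEROUS_PATTERNS = [
    r"\brm\s+-rf\b",    # recursive force delete
    r"\bmkfs\b",        # format a filesystem
    r"\bdd\s+if=",      # raw disk writes
    r">\s*/dev/sd",     # redirect onto a block device
]

def needs_human_approval(command: str) -> bool:
    """Deterministic gate: any regex hit forces human confirmation."""
    return any(re.search(p, command) for p in DANGEROUS_PATTERNS)
```

A model-based judge could reason about intent, but it can also be talked out of its judgment; a regex cannot.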

Even its plugin system for ecosystem expansion treats developers as potential adversaries. Of the 6 hook types in the Event Hooks system, 5 are fire-and-forget spectators whose return values the system ignores. There is exactly one injection point that can modify the Agent's running context. The maintainers hold one firm line: even if plugin code crashes, it must not bring down the Agent's main loop.
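The two hook classes can be sketched side by side. Hook names, the registry shape, and function names are all illustrative; only the fire-and-forget/injection split comes from the article:

```python
# Hook names and registry shape are illustrative; the source only says
# 5 of 6 hook types are fire-and-forget and one can inject context.
hooks = {"on_message": [], "modify_context": []}

def fire(event: str, payload) -> None:
    """Fire-and-forget: return values are ignored, and a crashing plugin
    must never take down the Agent's main loop."""
    for fn in hooks.get(event, []):
        try:
            fn(payload)
        except Exception:
            pass  # swallow plugin errors by design

def apply_context_hooks(context: dict) -> dict:
    """The single injection point where a hook may modify running context."""
    for fn in hooks.get("modify_context", []):
        try:
            context = fn(context) or context
        except Exception:
            pass  # a broken hook leaves the context unchanged
    return context
```

The asymmetry is the design: observation is cheap to grant broadly, while mutation is confined to one audited path.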

These seemingly conservative choices have a highly consistent underlying logic.

At the beginning of this year, the Chroma team conducted multiple rounds of dialogue stress tests. After changing from single - round to multi - round dialogues, the average performance of the model dropped by