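Hermes: A One-Week Review

I spent a full week using Hermes in real work. This is my hands-on review.

Hermes: serious power for workflow optimization

Starting with OpenClaw: the architecture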

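Most of us are familiar with OpenClaw. It is the agent framework many companies and individual developers use today, and it holds a large share of the market for streamlining business functions and developer workflows. Its shortcomings are still there, though, and the complaints online are countless. They fall into two groups: it wastes tokens, and its context handling is weak. Both come down to the same root cause, which is insufficient context capability. Its memory system simply packs context into a memory folder and, when needed, searches and reads all of it back. The most serious problem sits right here: it often forgets to store memories at all (which may well have something to do with the model), so sessions do not connect cleanly and the workflow experience suffers.

When OpenClaw launched in early 2026, it benefited from being novel and open source, which drove wide adoption. As an industry newcomer meeting broad market demand, it has stayed popular ever since. (In fact it had already been released under the name Clawbot in November 2025, which shows how much of an advantage timing is.) To be fair, OpenClaw's framework is well made. From the agent's core identity files (soul.md, agent.md) to the skills ecosystem, heartbeat checks, and built-in skills such as web-search, it has a top-tier design for an open-source project. Without further ado, here is OpenClaw's architecture: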

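| Layer | Description | Components |
|-------|-------------|------------|
| User interaction layer | Multi-platform message intake | Feishu / Telegram / Discord / WhatsApp / Slack / Web control panel / CLI |
| Gateway layer | Core message routing and scheduling | Message normalization → session management (SQLite) → permission approval (/approve) → multi-user isolation → plugin system / skill marketplace (npm) → built-in channels (Email / RSS / Webhook / Cron) |
| AI model layer | Multi-model inference engine | OpenAI / Anthropic / DeepSeek / Ollama / OpenRouter → multi-agent routing → sub-agent orchestration → Function Calling |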

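| Category | Skill | Description |
|----------|-------|-------------|
| Built-in skills | cron | Scheduled task execution |
| | email | Sending and receiving email |
| | rss | RSS feed monitoring |
| | webhook | HTTP webhook triggers |
| | script | Local script execution |
| | fetch | Web page fetching |
| Community skills (npm install) | @openclaw/github | GitHub monitoring / PR / Issue |
| | @openclaw/notion | Notion note management |
| | @openclaw/calendar | Google / Outlook calendar |
| | @openclaw/gpt-sovits | AI speech synthesis |
| Dev scaffolding | openclaw create-skill | TypeScript SDK + standardized interfaces (trigger/action/middleware) |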

User sends message ──► Gateway receives ──► Normalize ──► Persist session ──► Permission check
                                                                                     │
                                                                            ┌────────▼────────┐
                                                                            │ Approval prompt │
                                                                            │ /approve        │
                                                                            │ /deny           │
                                                                            │ once/session    │
                                                                            │ /always         │
                                                                            └────────┬────────┘
                                                                                     │
                                                                                     ▼
                                                                            ┌─────────────────┐
                                                                            │ LLM inference   │
                                                                            │ + tool execution│
                                                                            └─────────────────┘
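
As a rough illustration of the flow in the diagram above, the pipeline can be sketched as a chain of small functions. Everything here is hypothetical: `normalize`, `persist`, `checkPermission`, `handleMessage`, the session `Map` (standing in for the SQLite store), and the stubbed `runLLM` are illustration names, not OpenClaw APIs, and I am assuming that the once/session/always choices control how long an approval is remembered.

```javascript
// Hypothetical sketch of the gateway flow; not OpenClaw's real code.
const sessions = new Map();        // stands in for the SQLite session store
const alwaysAllowed = new Set();   // tools approved with "always"

// Step 1: turn a platform-specific payload into one common shape.
function normalize(raw) {
  return { user: String(raw.from), text: String(raw.body).trim(), channel: raw.channel };
}

// Step 2: append the message to the user's session history.
function persist(msg) {
  if (!sessions.has(msg.user)) sessions.set(msg.user, { history: [], approved: new Set() });
  const session = sessions.get(msg.user);
  session.history.push(msg.text);
  return session;
}

// Step 3: decide whether a tool may run. `decision` is the user's reply to
// the approval prompt: "deny", "once", "session" or "always".
function checkPermission(session, tool, decision) {
  if (alwaysAllowed.has(tool) || session.approved.has(tool)) return true;
  if (decision === "always") alwaysAllowed.add(tool);
  if (decision === "session") session.approved.add(tool);
  return decision !== "deny" && decision !== undefined;
}

// Step 4: stub for model inference plus tool execution.
function runLLM(session, tool) {
  return `ran ${tool} after ${session.history.length} message(s)`;
}

function handleMessage(raw, tool, decision) {
  const session = persist(normalize(raw));
  if (!checkPermission(session, tool, decision)) return "denied";
  return runLLM(session, tool);
}

console.log(handleMessage({ from: "u1", body: " ls ", channel: "cli" }, "terminal", "session"));
console.log(handleMessage({ from: "u1", body: "pwd", channel: "cli" }, "terminal"));
console.log(handleMessage({ from: "u2", body: "rm", channel: "cli" }, "terminal", "deny"));
```

The second call passes no decision at all and still runs, because the first call's "session" approval was remembered. The third call is a different user and is denied.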

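OpenClaw's architecture points in the right direction: gateway routing + plugin extensions + multi-model support. But it currently has two core pain points:

- Weak context handling: the memory system essentially packs history into a memory folder and reads all of it back every time, which is both imprecise and wasteful of tokens
- Shallow skills: each plugin tends to be a single-purpose function with no depth of chained orchestration, so users have to compose complex tasks themselves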
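
To make the token cost of the first pain point concrete, here is a minimal sketch comparing "read everything back" with a simple keyword filter. The memory entries, the `estimateTokens` rule of thumb (roughly four characters per token, a crude assumption that fits English better than Chinese), and both loader functions are made up for illustration. Neither is how OpenClaw or Hermes actually stores memory.

```javascript
// Hypothetical memory entries; in the scheme described above these would
// be files in a memory folder.
const memory = [
  { topic: "deploy", text: "Production deploys go through the staging branch first." },
  { topic: "style", text: "The user prefers TypeScript with two-space indentation." },
  { topic: "boids", text: "The flock demo uses a spatial grid with adjustable weights." },
];

// Crude size estimate: about four characters per token (an assumption).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Strategy A: load every entry into the prompt, relevant or not.
function loadAll(entries) {
  return entries.map((e) => e.text);
}

// Strategy B: load only entries whose topic or text mentions a query word.
// Words of three letters or fewer are skipped as filler.
function loadRelevant(entries, query) {
  const words = query.toLowerCase().split(/\s+/).filter((w) => w.length > 3);
  return entries
    .filter((e) => words.some((w) => e.topic.includes(w) || e.text.toLowerCase().includes(w)))
    .map((e) => e.text);
}

function totalTokens(texts) {
  return texts.reduce((sum, t) => sum + estimateTokens(t), 0);
}

const query = "tune the boids weights";
console.log("all:", totalTokens(loadAll(memory)));
console.log("relevant:", totalTokens(loadRelevant(memory, query)));
```

With only three entries the gap is small, but the cost of strategy A grows with every stored memory, while strategy B grows only with what is relevant to the request.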

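A deep dive into Hermes

Origins

After almost two months with OpenClaw, the novelty had worn off and so had my patience. The flood of new frameworks on GitHub pushed me to go looking for a replacement, and Hermes stood out as the obvious choice.

Hands-on experience
Before getting into the review, one caveat: I ran OpenClaw on the free minmax-2.5 API provided by NVIDIA, so my poor experience with it is inevitably colored by that choice and by my own subjective impressions. For Hermes I picked the model with the best price-performance ratio, which is also an industry newcomer (really it just arrived during a lull in the fiercest stretch of the model wars): the flash version of deepseek-v4. The pro version is not as affordable, and some small companies deploy only flash anyway.
Installation and configuration

According to the Hermes website, the most recommended and most practical way to install is to send the Chinese community site's URL straight to your OpenClaw, let it visit and read the docs on its own, and then send it this message: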


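Please add this MCP server to your configuration: https://mcp.hermesagent.org.cn/v1
(Streamable HTTP, no API key and no login required).
Once it is added, use it to look up the Hermes Agent Chinese documentation and guide me through the installation.

Hermes is a product of the Linux ecosystem, so as a Windows user I lost a lot of time just getting it installed under WSL2. For the detailed installation and configuration steps, see the official website or look up a tutorial online.

Configuration is almost the same as OpenClaw's, with one key addition: parameters that govern context and conversation. This is one of Hermes's sharpest features:

agent.max_turns: maximum number of turns
compression.*: context compression
terminal.timeout: command timeout
delegation.max_iterations: sub-agent iteration count
memory.*: cross-session memory
The defaults are fine for most parameters and generally cover everyday use. Context compression is the exception: to save tokens you can tune it aggressively (I explain why later).
After that you can set up a messaging channel. The command line is honestly quite good, since it is the most native option, but I went with the commonly used Feishu.

With that, you are basically ready to start using Hermes. DeepSeek as the model is really solid, and it is discounted until 5/31, so a million tokens of input and output cost next to nothing.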

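Day-to-day work experience

The very evening I finished deploying Hermes, I couldn't wait to start a first project. The idea came from a bird-flock simulation built in TypeScript that I had seen shared by a full-stack developer who makes content online. I planned to build something similar, but as an open-source version with adjustable parameters.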

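After I described what I wanted, Hermes tossed back an OK and got to work. In practice Hermes differs from OpenClaw: by default it surfaces the whole workflow, including tool calls and command output. One note: by default it sends the user a permission request before running any shell command and continues only once you approve. You can change this so that it asks only for important commands, and you should, because otherwise it gets tedious.

Follow-up questions are one of the sharpest parts of Hermes's context handling. When a task has several reasonable approaches, Hermes asks the user which one to take. In effect it adds a blocking node to the run and waits for user input before continuing. Unfortunately, for users who want the model to run fully unattended, the framework does not provide an automatic default choice. It does, however, judge each time whether asking is really necessary, which improves the experience and is enough for most users (unless you plan to build an automated market-watching assistant). It looks like this:

It is almost the same in Feishu, so I won't show it here.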
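
The "blocking node" idea described above can be sketched with a promise that the agent loop awaits. This is only an illustration: `needsClarification`, `askUser`, `runStep`, and the `defaultChoice` option (the automatic default that the framework reportedly lacks) are hypothetical names, not Hermes internals.

```javascript
// Hypothetical sketch: pause a task until the user picks an option.

// Ask only when there is a real choice to make.
function needsClarification(options) {
  return options.length > 1;
}

// Simulated user channel; a real one would wait for a CLI or chat reply.
// An unrecognized reply falls back to the first option.
function askUser(question, options, reply) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(options.includes(reply) ? reply : options[0]), 10);
  });
}

// One step of the task. If `defaultChoice` is set, the step never blocks,
// which is the unattended mode the text says is missing.
async function runStep(step, reply, defaultChoice) {
  if (!needsClarification(step.options)) return step.options[0];
  if (defaultChoice !== undefined) return defaultChoice;
  return await askUser(step.question, step.options, reply); // execution pauses here
}

const step = { question: "Render with which API?", options: ["canvas", "webgl"] };
runStep(step, "webgl").then((choice) => console.log("user chose:", choice));
runStep(step, undefined, "canvas").then((choice) => console.log("default used:", choice));
```

The design point is that the pause lives inside the step, so everything after the `await` sees the user's answer as if it had been known all along.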

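Back to the flock project. I split it into three steps: first the separation force between two birds, then the formation behavior within the flock, and finally disturbances from outside and inside the system. Hermes got straight to writing the algorithms. Here I show only the formation behavior:

Before that, we need to initialize the individual birds: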

// Spatial grid: bucket boids by cell so neighbor lookups stay local.
// CELL_SIZE and CONFIG are not shown in this excerpt.
var grid = {};
function buildGrid(boids, w, h) {
  grid = {};
  for (var i = 0; i < boids.length; i++) {
    var b = boids[i];
    var cx = Math.floor(b.pos.x / CELL_SIZE);
    var cy = Math.floor(b.pos.y / CELL_SIZE);
    var key = cx + ',' + cy;
    if (!grid[key]) grid[key] = [];
    grid[key].push(b);
  }
}
// Boid: starts with a random heading and a random speed within the configured range
function Boid(x, y, id) {
  this.id = id;
  this.pos = new Vec2(x, y);
  var angle = Math.random() * Math.PI * 2;
  var speed = CONFIG.MIN_SPEED + Math.random() * (CONFIG.MAX_SPEED - CONFIG.MIN_SPEED);
  this.vel = new Vec2(Math.cos(angle) * speed, Math.sin(angle) * speed);
  this.acc = new Vec2(0, 0);
}
// === Vector ===
function Vec2(x, y) {
  this.x = x;
  this.y = y;
}
Vec2.prototype.add = function(v) { return new Vec2(this.x+v.x, this.y+v.y); };
Vec2.prototype.sub = function(v) { return new Vec2(this.x-v.x, this.y-v.y); };
Vec2.prototype.scale = function(s) { return new Vec2(this.x*s, this.y*s); };
Vec2.prototype.mag = function() { return Math.sqrt(this.x*this.x + this.y*this.y); };
Vec2.prototype.normalize = function() {
  var m = this.mag();
  return m === 0 ? new Vec2(0,0) : new Vec2(this.x/m, this.y/m);
};
Vec2.prototype.limit = function(max) {
  var m = this.mag();
  return m > max ? this.scale(max/m) : this;
};
Vec2.prototype.dist = function(v) { return this.sub(v).mag(); };
Vec2.prototype.clone = function() { return new Vec2(this.x, this.y); };

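Hermes defines each bird's spatial placement by computing, every frame, which grid cell it falls in (a "column,row" key), and it also initializes each bird's velocity vector.

Formation behavior
In real flocks, formation behavior involves at least four things: alignment, cohesion, a lead bird, and formation shape. I am no expert, so I could only manage the simplest imitation.
- Alignment, as the name suggests, merges the directions of the velocity vectors. Within each bird's field of view, the closer two birds are, the stronger their urge to align and the larger the alignment force. The two birds are pushed toward the same heading, and as this repeats across neighbors the flock's overall direction emerges.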

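// Alignment. The method wrapper, the loop, and the name `align` are reconstructed
// around the original excerpt and are assumptions.
Boid.prototype.align = function(others, weights) {
  var avgVel = new Vec2(0, 0), aliCount = 0;
  for (var i = 0; i < others.length; i++) {
    var other = others[i];
    if (other === this) continue;
    var distSq = Math.pow(other.pos.x - this.pos.x, 2) + Math.pow(other.pos.y - this.pos.y, 2);
    // Count only neighbors inside the perception radius
    if (distSq < CONFIG.PERCEPTION_RADIUS * CONFIG.PERCEPTION_RADIUS) {
      avgVel.x += other.vel.x;
      avgVel.y += other.vel.y;
      aliCount++;
    }
  }
  // === Alignment result ===
  var ali = new Vec2(0, 0);
  if (aliCount > 0 && weights.alignment > 0) {
    avgVel.x /= aliCount; avgVel.y /= aliCount;
    // Steer toward the neighbors' average velocity, capped by the weight
    ali = new Vec2(avgVel.x - this.vel.x, avgVel.y - this.vel.y);
    ali = ali.limit(weights.alignment * 0.15);
  }
  return ali;
};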

There are some additional constraints that I won't go into here.
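
The excerpt above averages neighbor velocities with equal weight, whereas the description says that closer birds should pull harder. One possible way to express that, as a self-contained sketch, is to weight each neighbor by `1 - d / radius`. The linear falloff, the function name `weightedAlignment`, and the plain `{x, y}` objects are my own assumptions, not what Hermes generated.

```javascript
// Distance-weighted alignment sketch (assumed linear falloff, not the project's code).
// boid and neighbors are plain objects: { pos: {x, y}, vel: {x, y} }.
function weightedAlignment(boid, neighbors, radius, maxForce) {
  let sumX = 0, sumY = 0, totalWeight = 0;
  for (const other of neighbors) {
    if (other === boid) continue;
    const dx = other.pos.x - boid.pos.x;
    const dy = other.pos.y - boid.pos.y;
    const d = Math.hypot(dx, dy);
    if (d >= radius) continue;          // outside the perception radius
    const w = 1 - d / radius;           // closer neighbor, larger weight
    sumX += other.vel.x * w;
    sumY += other.vel.y * w;
    totalWeight += w;
  }
  if (totalWeight === 0) return { x: 0, y: 0 };
  // Steering force = weighted average velocity minus own velocity.
  let fx = sumX / totalWeight - boid.vel.x;
  let fy = sumY / totalWeight - boid.vel.y;
  // Clamp the magnitude so one frame cannot snap the heading.
  const mag = Math.hypot(fx, fy);
  if (mag > maxForce) {
    fx = (fx / mag) * maxForce;
    fy = (fy / mag) * maxForce;
  }
  return { x: fx, y: fy };
}

const me = { pos: { x: 0, y: 0 }, vel: { x: 1, y: 0 } };
const near = { pos: { x: 1, y: 0 }, vel: { x: 0, y: 1 } };
const far = { pos: { x: 9, y: 0 }, vel: { x: 0, y: -1 } };
console.log(weightedAlignment(me, [near, far], 10, 5));
```

In the usage example the near bird flies up and the far bird flies down. Because the near bird carries most of the weight, the resulting force has a positive y component.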

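Once all the forces are combined, they produce the full flock system, which looks like this:

I actually built the same thing with OpenClaw as well. Since the models were different, the finished products can't be compared directly, and I don't plan to show that version here. In the working process, though, the biggest difference is that OpenClaw doesn't tell you how it changed something, why it changed it that way, or what effect the change had, and it keeps forgetting earlier requirements. In practice, core files such as soul.md fail to actually constrain its behavior and instead leave it confused.

Beyond that, I also used Hermes to build a simple bot that pushes new GitHub notifications. I will write a separate article later collecting some of my own small tools.

I will push these projects to my GitHub, so stay tuned.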

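The skill ecosystem

On the skills ecosystem: whether in Claude, Codex, or OpenClaw, it is a large, extensible system that plays a decisive role in raising an agent's capabilities. Before Hermes, and in Hermes's own core system too, skills were written independently by developers. Hermes simply ships with a set of basic skills bundled in. The main system consists of the following parts:

Installation works much like the mainstream approaches elsewhere, but you do have to trust external skills.
Progressive-disclosure loading of skills
Hermes loads a skill, reading its full text, only when it is needed, which cuts down on wasted context.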
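
A minimal sketch of the progressive-disclosure idea: keep only each skill's name and description in context, and read the full body on demand. The in-memory `skillFiles` object stands in for files on disk, its contents are invented, and `buildIndex` and `loadSkill` are hypothetical helpers, not Hermes functions.

```javascript
// Hypothetical skill store; real skills would be files on disk.
const skillFiles = {
  "systematic-debugging": {
    description: "Use when encountering any bug or test failure.",
    body: "Phase 1: root cause investigation...\nPhase 2: pattern analysis...",
  },
  "docker-management": {
    description: "Manage containers and images.",
    body: "Step 1: list containers...\nStep 2: inspect logs...",
  },
};

let bodiesRead = 0; // counts how many full skill bodies were actually loaded

// Cheap index: names and one-line descriptions only. This is all that
// stays in the prompt by default.
function buildIndex(files) {
  return Object.entries(files).map(([name, f]) => ({ name, description: f.description }));
}

// Expensive step: read the full text, done only when a skill is needed.
function loadSkill(files, name) {
  if (!(name in files)) throw new Error(`unknown skill: ${name}`);
  bodiesRead++;
  return files[name].body;
}

const index = buildIndex(skillFiles);
console.log(index.map((s) => `${s.name}: ${s.description}`).join("\n"));
console.log("bodies read so far:", bodiesRead);
console.log(loadSkill(skillFiles, "systematic-debugging").split("\n")[0]);
console.log("bodies read so far:", bodiesRead);
```

The context cost of the index scales with the number of skills times one line each, while the cost of full bodies is paid only for the skills a task actually uses.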

Managing skills

# Browse all available skills
  hermes skills browse

# Search for a specific capability
  hermes skills search email
  hermes skills search spreadsheet
  hermes skills search calendar

# Install official optional skills (docker, blockchain, and so on)
  hermes skills install official/devops/docker-management
  hermes skills install official/blockchain/solana

# Install directly from GitHub
  hermes skills install openai/skills/k8s
  hermes skills install anthropics/skills/docs-writer

# Check for skill updates
  hermes skills check

# Update all skills installed from the Hub
  hermes skills update

Custom skills
Hermes skill files have a clearly defined format, along these lines:

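---
name: skill-name
version: 1.0.0
trigger_keywords: ["keyword 1", "keyword 2"]
requires: []  # optional: Python packages or system commands this skill depends on
---

# Skill title (human-readable)

## Description
Briefly explain what this skill does and which scenarios it suits.

## When to use
- Scenario 1
- Scenario 2

## Trigger conditions
Describe in detail when Hermes should activate this skill.

## Workflow
1. Step one
2. Step two
3. ...

## Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|

## Example input and output
**User**: ...
**AI**: ...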
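
For illustration, the front matter of a file in the format above can be pulled apart with a few lines of JavaScript. This is a deliberately naive sketch that handles only flat `key: value` lines and strips trailing `#` comments. A real YAML parser would be the right tool for anything beyond that, and `parseSkill` is a hypothetical helper, not part of Hermes.

```javascript
// Naive front-matter reader for a skill file (flat key: value pairs only).
function parseSkill(source) {
  const match = source.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!match) throw new Error("missing front matter");
  const meta = {};
  for (const line of match[1].split("\n")) {
    const withoutComment = line.replace(/\s+#.*$/, ""); // drop trailing comments
    const idx = withoutComment.indexOf(":");
    if (idx === -1) continue;
    const key = withoutComment.slice(0, idx).trim();
    const value = withoutComment.slice(idx + 1).trim();
    if (key) meta[key] = value; // values are kept as raw strings
  }
  return { meta, body: match[2].trim() };
}

const example = [
  "---",
  "name: skill-name",
  "version: 1.0.0",
  'trigger_keywords: ["keyword 1", "keyword 2"]',
  "requires: []  # optional",
  "---",
  "",
  "# Skill title",
].join("\n");

const skill = parseSkill(example);
console.log(skill.meta.name, skill.meta.version);
console.log(skill.meta.requires);
console.log(skill.body);
```

Values stay as raw strings here, so `trigger_keywords` comes back as the text of the list rather than an array. That keeps the sketch short at the cost of pushing type handling onto the caller.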

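The template above is only the basic skill skeleton. What really stands out is Hermes's closed learning loop:

Closed-learning-loop skills

Freeing up the ecosystem's productivity

When Hermes finishes a complex project (one involving many tool calls), it automatically packages the workflow into a brand-new skill. More precisely, these are the preconditions:

A complex task completed successfully (five or more tool calls)
A pitfall was hit and then worked around
The user corrected the approach
An established workflow pattern was discovered
A repeated, non-obvious way of working was discovered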
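
The conditions listed above could be expressed as a simple check at the end of a session. This sketch assumes that any single signal is enough to trigger skill creation, which the text does not actually specify, and the `session` fields, `skillCreationReasons`, and `shouldCreateSkill` are hypothetical names rather than Hermes internals.

```javascript
// Hypothetical end-of-session check; field names are invented for illustration.
const MIN_TOOL_CALLS = 5; // threshold taken from the first condition above

// Collect every condition that the finished session satisfies.
function skillCreationReasons(session) {
  const reasons = [];
  if (session.succeeded && session.toolCalls >= MIN_TOOL_CALLS) reasons.push("complex task completed");
  if (session.errorsRecovered > 0) reasons.push("pitfall hit and fixed");
  if (session.userCorrections > 0) reasons.push("user corrected the approach");
  if (session.patternFound) reasons.push("workflow pattern discovered");
  if (session.repeatedNonObviousSteps) reasons.push("repeated non-obvious method");
  return reasons;
}

// Assumption: one matching signal is enough.
function shouldCreateSkill(session) {
  return skillCreationReasons(session).length > 0;
}

const trivial = { succeeded: true, toolCalls: 1, errorsRecovered: 0, userCorrections: 0 };
const complex = { succeeded: true, toolCalls: 8, errorsRecovered: 2, userCorrections: 0 };
console.log(shouldCreateSkill(trivial), skillCreationReasons(trivial));
console.log(shouldCreateSkill(complex), skillCreationReasons(complex));
```

Returning the list of reasons rather than a bare boolean makes it possible to record why a skill was generated, which helps when reviewing auto-created skills later.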

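Quite a few skills have already been generated automatically in my own use. Here is one of them as an example:

systematic-debugging: systematic debugging

The name alone sounds like a workflow-engineering skill. This is how Hermes describes it: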

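**Purpose:** Use it whenever you hit a bug, a test failure, or unexpected behavior.

**Core idea:** Don't fix a bug until you have found its root cause. Random fixes waste 3-5 times as much time as systematic debugging.

**Four-phase process:**
- Phase 1, **root cause investigation**: read the error message, reproduce reliably, check recent changes, trace the data flow back to its source
- Phase 2, **pattern analysis**: find similar working examples in the codebase and compare the differences
- Phase 3, **hypothesis testing**: change only one variable at a time to verify the root-cause hypothesis
- Phase 4, **implementing the fix**: first write a test that reproduces the bug, then fix the root cause, then verify

**Key rule:** If three fixes have failed, stop and question the architecture instead of attempting a fourth.

The full source file is embedded below: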

---
name: systematic-debugging
description: Use when encountering any bug, test failure, or unexpected behavior. 4-phase root cause investigation — NO fixes without understanding the problem first.
version: 1.1.0
author: Hermes Agent (adapted from obra/superpowers)        # Note: these two attributes are not actually supported.
license: MIT
metadata:
  hermes:
    tags: [debugging, troubleshooting, problem-solving, root-cause, investigation]
    related_skills: [test-driven-development, writing-plans, subagent-driven-development]
---

# Systematic Debugging

## Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.

**Core principle:** ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

**Violating the letter of this process is violating the spirit of debugging.**

## The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

If you haven't completed Phase 1, you cannot propose fixes.

## When to Use

Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues

**Use this ESPECIALLY when:**
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue

**Don't skip when:**
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Someone wants it fixed NOW (systematic is faster than thrashing)

## The Four Phases

You MUST complete each phase before proceeding to the next.

---

## Phase 1: Root Cause Investigation

**BEFORE attempting ANY fix:**

### 1. Read Error Messages Carefully

- Don't skip past errors or warnings
- They often contain the exact solution
- Read stack traces completely
- Note line numbers, file paths, error codes

**Action:** Use `read_file` on the relevant source files. Use `search_files` to find the error string in the codebase.

### 2. Reproduce Consistently

- Can you trigger it reliably?
- What are the exact steps?
- Does it happen every time?
- If not reproducible → gather more data, don't guess

**Action:** Use the `terminal` tool to run the failing test or trigger the bug:

# Run specific failing test
pytest tests/test_module.py::test_name -v

# Run with verbose output
pytest tests/test_module.py -v --tb=long


### 3. Check Recent Changes

- What changed that could cause this?
- Git diff, recent commits
- New dependencies, config changes

**Action:**


# Recent commits
git log --oneline -10

# Uncommitted changes
git diff

# Changes in specific file
git log -p --follow src/problematic_file.py | head -100


### 4. Gather Evidence in Multi-Component Systems

**WHEN system has multiple components (API → service → database, CI → build → deploy):**

**BEFORE proposing fixes, add diagnostic instrumentation:**

For EACH component boundary:
- Log what data enters the component
- Log what data exits the component
- Verify environment/config propagation
- Check state at each layer

Run once to gather evidence showing WHERE it breaks.
THEN analyze evidence to identify the failing component.
THEN investigate that specific component.

### 5. Trace Data Flow

**WHEN error is deep in the call stack:**

- Where does the bad value originate?
- What called this function with the bad value?
- Keep tracing upstream until you find the source
- Fix at the source, not at the symptom

**Action:** Use `search_files` to trace references:


# Find where the function is called
search_files("function_name(", path="src/", file_glob="*.py")

# Find where the variable is set
search_files("variable_name\\s*=", path="src/", file_glob="*.py")

### Phase 1 Completion Checklist

- [ ] Error messages fully read and understood
- [ ] Issue reproduced consistently
- [ ] Recent changes identified and reviewed
- [ ] Evidence gathered (logs, state, data flow)
- [ ] Problem isolated to specific component/code
- [ ] Root cause hypothesis formed

**STOP:** Do not proceed to Phase 2 until you understand WHY it's happening.

---

## Phase 2: Pattern Analysis

**Find the pattern before fixing:**

### 1. Find Working Examples

- Locate similar working code in the same codebase
- What works that's similar to what's broken?

**Action:** Use `search_files` to find comparable patterns:


search_files("similar_pattern", path="src/", file_glob="*.py")


### 2. Compare Against References

- If implementing a pattern, read the reference implementation COMPLETELY
- Don't skim — read every line
- Understand the pattern fully before applying

### 3. Identify Differences

- What's different between working and broken?
- List every difference, however small
- Don't assume "that can't matter"

### 4. Understand Dependencies

- What other components does this need?
- What settings, config, environment?
- What assumptions does it make?

---

## Phase 3: Hypothesis and Testing

**Scientific method:**

### 1. Form a Single Hypothesis

- State clearly: "I think X is the root cause because Y"
- Write it down
- Be specific, not vague

### 2. Test Minimally

- Make the SMALLEST possible change to test the hypothesis
- One variable at a time
- Don't fix multiple things at once

### 3. Verify Before Continuing

- Did it work? → Phase 4
- Didn't work? → Form NEW hypothesis
- DON'T add more fixes on top

### 4. When You Don't Know

- Say "I don't understand X"
- Don't pretend to know
- Ask the user for help
- Research more

---

## Phase 4: Implementation

**Fix the root cause, not the symptom:**

### 1. Create Failing Test Case

- Simplest possible reproduction
- Automated test if possible
- MUST have before fixing
- Use the `test-driven-development` skill

### 2. Implement Single Fix

- Address the root cause identified
- ONE change at a time
- No "while I'm here" improvements
- No bundled refactoring

### 3. Verify Fix


# Run the specific regression test
pytest tests/test_module.py::test_regression -v

# Run full suite — no regressions
pytest tests/ -q


### 4. If Fix Doesn't Work — The Rule of Three

- **STOP.**
- Count: How many fixes have you tried?
- If < 3: Return to Phase 1, re-analyze with new information
- **If ≥ 3: STOP and question the architecture (step 5 below)**
- DON'T attempt Fix #4 without architectural discussion

### 5. If 3+ Fixes Failed: Question Architecture

**Pattern indicating an architectural problem:**
- Each fix reveals new shared state/coupling in a different place
- Fixes require "massive refactoring" to implement
- Each fix creates new symptoms elsewhere

**STOP and question fundamentals:**
- Is this pattern fundamentally sound?
- Are we "sticking with it through sheer inertia"?
- Should we refactor the architecture vs. continue fixing symptoms?

**Discuss with the user before attempting more fixes.**

This is NOT a failed hypothesis — this is a wrong architecture.

---

## Red Flags — STOP and Follow Process

If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- **"One more fix attempt" (when already tried 2+)**
- **Each fix reveals a new problem in a different place**

**ALL of these mean: STOP. Return to Phase 1.**

**If 3+ fixes failed:** Question the architecture (Phase 4 step 5).

## Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question the pattern, don't fix again. |

## Quick Reference

| Phase | Key Activities | Success Criteria |
|-------|---------------|------------------|
| **1. Root Cause** | Read errors, reproduce, check changes, gather evidence, trace data flow | Understand WHAT and WHY |
| **2. Pattern** | Find working examples, compare, identify differences | Know what's different |
| **3. Hypothesis** | Form theory, test minimally, one variable at a time | Confirmed or new hypothesis |
| **4. Implementation** | Create regression test, fix root cause, verify | Bug resolved, all tests pass |

## Hermes Agent Integration

### Investigation Tools

Use these Hermes tools during Phase 1:

- **`search_files`** — Find error strings, trace function calls, locate patterns
- **`read_file`** — Read source code with line numbers for precise analysis
- **`terminal`** — Run tests, check git history, reproduce bugs
- **`web_search`/`web_extract`** — Research error messages, library docs

### With delegate_task

For complex multi-component debugging, dispatch investigation subagents:


delegate_task(
    goal="Investigate why [specific test/behavior] fails",
    context="""
    Follow systematic-debugging skill:
    1. Read the error message carefully
    2. Reproduce the issue
    3. Trace the data flow to find root cause
    4. Report findings — do NOT fix yet

    Error: [paste full error]
    File: [path to failing code]
    Test command: [exact command]
    """
)


### With test-driven-development

When fixing bugs:
1. Write a test that reproduces the bug (RED)
2. Debug systematically to find root cause
3. Fix the root cause (GREEN)
4. The test proves the fix and prevents regression

## Real-World Impact

From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% vs 40%
- New bugs introduced: Near zero vs common

**No shortcuts. No guessing. Systematic always wins.**

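It is remarkable that a skill like this, which uses self-correction to keep the agent from falling into loops, can be created automatically. Its strength is that it gives the AI a scientific, investigative way to solve problems: the greatest payoff from the fewest trial-and-error attempts, with output quality preserved. It also guards against memory loss and beside-the-point answers, and it deepens the AI's ability to check its own work.

Having skills is not enough on its own, though. Agents often fail to notice a skill, or simply forget that it exists. So how does Hermes's skill execution framework work?

Three steps that reinforce skill awareness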

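Explicit triggering

Every skill has its own description, and Hermes proactively loads the skill when the user's request matches it.
The success rate tends to be low, however. It depends on the quality of the description, which determines whether the model picks the right skill, and prompts rarely reach the optimum.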
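
Why description quality matters can be seen even in a toy matcher. The sketch below scores skills by word overlap between the request and each description. As described above, in Hermes it is the model that makes this choice, not a scoring function, so `tokenize`, `scoreSkill`, and `pickSkill` are purely illustrative.

```javascript
// Toy matcher: count words shared by the request and a skill description.
function tokenize(text) {
  return new Set(text.toLowerCase().match(/[a-z]+/g) || []);
}

function scoreSkill(request, description) {
  const wanted = tokenize(request);
  let hits = 0;
  for (const word of tokenize(description)) {
    if (word.length > 4 && wanted.has(word)) hits++; // skip short filler words
  }
  return hits;
}

// Return the best-scoring skill name, or null when nothing overlaps.
function pickSkill(request, skills) {
  let best = null, bestScore = 0;
  for (const s of skills) {
    const score = scoreSkill(request, s.description);
    if (score > bestScore) { best = s.name; bestScore = score; }
  }
  return best;
}

const vague = [{ name: "systematic-debugging", description: "Helps with problems." }];
const specific = [{
  name: "systematic-debugging",
  description: "Use when encountering any bug, test failure, build error, or unexpected behavior.",
}];

const request = "My build fails with an error after the last commit";
console.log(pickSkill(request, vague));
console.log(pickSkill(request, specific));
```

The same skill is missed with the vague description and found with the specific one, purely because the specific wording shares "build" and "error" with the request.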

Implicit triggering: rules in the system prompt

The system prompt at the start of every conversation contains this line:

If a skill matches or is even partially relevant to your task, you MUST load it with skill_view(name) and follow its instructions.

This is a hard constraint.

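Memory across sessions

Memory stores the user's preferences and working patterns. Skills that were used in an earlier conversation are more easily recognized and selected, so the pattern is reinforced.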

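That said, there are still gaps for now:

Hermes may not consider the current task complex enough
→ For simple operations such as changing one line of config or adding a field, Hermes will not trigger systematic-debugging. The catch is that it may underestimate the task's complexity.
The skill description may not match precisely
→ For example, if the user hits a "build error", Hermes can make the connection to systematic-debugging. For subtler problems, though, it may handle things in some lighter-weight way.
There is no automated "trigger"
→ Currently there is no enforcement mechanism comparable to a git hook or pre-commit. It all rests on Hermes's own judgment.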
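
One conceivable form for such an enforcement layer is a small rule table checked before the model sees the message, so that certain patterns always force a skill to load. According to the text above, nothing like this exists in Hermes. The `rules` table, its patterns, and `forcedSkills` are all hypothetical.

```javascript
// Hypothetical pre-dispatch hook: rules that force-load a skill regardless
// of what the model would decide on its own.
const rules = [
  { pattern: /\b(traceback|exception|test(s)? fail(ed|ing|s)?|build error)\b/i, skill: "systematic-debugging" },
  { pattern: /\bdocker(file)?\b/i, skill: "docker-management" },
];

// Return the list of skills that must be loaded for this message.
function forcedSkills(message) {
  const forced = [];
  for (const rule of rules) {
    if (rule.pattern.test(message) && !forced.includes(rule.skill)) forced.push(rule.skill);
  }
  return forced;
}

console.log(forcedSkills("The build error is back after I edited the Dockerfile"));
console.log(forcedSkills("Please rename this variable"));
```

A rule table like this trades flexibility for predictability: it cannot catch subtle cases, but the cases it does cover no longer depend on the model's judgment.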

That is Hermes's skill framework. It has its shortcomings, but it is already encouraging. Next, I will walk through a real-world case.