Commit 8ec210a — Merge branch 'openclaw-exp'
2 parents adb93d8 + 8f56e4a

11 files changed: 1712 additions & 0 deletions

Lines changed: 108 additions & 0 deletions
# OpenClaw Reward Module Update

**Project**: OpenClaw Agent Build
**Time range**: March 13, 2026 — March 20, 2026

---

## 1. Problems with the Previous Reward Module

The previous reward module was in a minimum-viable state: it relied solely on a language model to judge each response's Extraversion personality score. The logic was simple, but in real training this single-dimension evaluation exposed three classes of problems:

- **Off-topic responses still earned high rewards**: a response that was enthusiastic and expressive could score well even when irrelevant to the question, teaching the model to "enthusiastically answer the wrong question."
- **In-batch responses became homogeneous**: when generating multiple candidate responses, the language model tended to produce large numbers of near-duplicates, all receiving similar scores and providing no diversity signal.
- **Degenerate outputs had no penalty mechanism**: looping paragraphs, special-token leakage, and character-level repetition (nonsense generation) that occasionally appeared during training showed no obvious weakness on the Extraversion dimension, so they still earned above-average rewards and could not be effectively suppressed.

The core goal of this update is to expand the reward system from a single dimension into a multi-dimensional composite architecture, using finer-grained signals to steer the model toward **relevance**, **diversity**, and **output quality** simultaneously.
---

## 2. Core Upgrade: From One Dimension to a Four-Dimensional Composite Reward

### 2.1 Reward Formula

The new reward fuses four dimensions with a weighted sum, corrected by a multiplicative quality gate:

```
final_reward = quality × (w_ext × extraversion + w_rel × relevance + w_div × diversity)
```

The default weight configuration is: extraversion 0.5, relevance 0.3, diversity 0.2. The three sub-dimension weights sum to 1.0, and the quality gate is applied multiplicatively to the final score.
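This fusion can be sketched in a few lines, assuming all sub-scores are already normalized to [0, 1] (the function name and signature are illustrative, not the repo's actual API):

```python
def composite_reward(
    extraversion: float,
    relevance: float,
    diversity: float,
    quality: float,
    w_ext: float = 0.5,
    w_rel: float = 0.3,
    w_div: float = 0.2,
) -> float:
    """Weighted sum of the three sub-scores, gated multiplicatively
    by the quality score. All inputs are assumed to lie in [0, 1]."""
    blended = w_ext * extraversion + w_rel * relevance + w_div * diversity
    return quality * blended

# A degenerate response is crushed even with perfect sub-scores:
# composite_reward(1.0, 1.0, 1.0, quality=0.0) -> 0.0
```

The multiplicative gate is what makes quality a hard constraint rather than one more additive term: no amount of extraversion can compensate for a near-zero quality score.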
### 2.2 The Four Dimensions

**Extraversion**

Carried over from the previous version's LLM-judge scheme: a language model evaluates each response's enthusiasm, energy, and expressiveness. Both evaluation modes are retained: pointwise mode scores each response independently (0–1), and listwise mode ranks responses within the same batch relative to one another (best 1.0, worst 0.0).

**Relevance**

A new dimension. It judges whether a response stays on topic and actually addresses the question. Adding relevance fixes the "enthusiastic but off-topic" failure: even a response that scores highly on expressiveness is pulled down in the composite score if its relevance is low.

**Diversity**

A new dimension. It encourages the model to keep its candidate responses distinct and avoid homogeneous output. Diversity is evaluated at two levels:

- **In-batch diversity**: how similar the candidate responses in the current batch are to one another. Higher similarity means a lower diversity score.
- **Cross-request diversity**: how similar the current response is to responses seen in recent history. If the model keeps reproducing responses close to historical ones, its diversity score is likewise suppressed.

Diversity is quantified with character-level n-gram overlap (Jaccard similarity); it requires no language-model calls and runs fully deterministically.
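Character-level n-gram Jaccard similarity needs only set operations; a minimal deterministic sketch (helper names and the choice of n=3 are illustrative):

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character-level n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of two strings' character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

# Near-duplicates score close to 1 and distinct texts close to 0;
# a diversity score can then be derived as 1 minus the maximum
# similarity against the batch and the recent-history buffer.
```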
**Quality Gate**

A new dimension. Applied as a multiplicative correction term (between 0 and 1), the quality gate acts as a "hard switch" that punishes two classes of degenerate output:

- **Paragraph-level looping**: the same structured paragraph (e.g. an `If you have any questions...` template block) repeated many times.
- **Character-level repetition and token leakage**: runs of repeated words, leaked special markers such as `<|im_start|>`, and similar artifacts.

The gate combines OpenJudge's NgramRepetitionPenaltyGrader with string degeneration-detection utilities. When one of these degenerate patterns is detected, the quality score is driven to near zero, no matter how well the other three dimensions score.
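The gate's hard-switch behavior can be approximated with plain string checks; the sketch below is a simplified stand-in for the OpenJudge grader combination (the token list, thresholds, and regex are illustrative, not the module's actual logic):

```python
import re

SPECIAL_TOKENS = ("<|im_start|>", "<|im_end|>", "<|endoftext|>")  # illustrative list

def quality_gate(text: str, max_paragraph_repeats: int = 2) -> float:
    """Return 0.0 for degenerate output, 1.0 otherwise (hard switch)."""
    # Special-token leakage.
    if any(tok in text for tok in SPECIAL_TOKENS):
        return 0.0
    # Paragraph-level looping: the same paragraph repeated too often.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for p in set(paragraphs):
        if paragraphs.count(p) > max_paragraph_repeats:
            return 0.0
    # Character-level repetition: the same word five or more times in a row.
    if re.search(r"\b(\w+)(\s+\1){4,}\b", text):
        return 0.0
    return 1.0
```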
---

## 3. Other Changes

### 3.1 Query History

The request-handling path gained a lightweight rolling buffer of query history (capped at 100 entries) that records the metadata of every submitted request. Its purpose is not reward computation but system-level observability: if the same question arrives at high frequency within a short window, that signals a problem in the upstream data distribution that should raise an alert, rather than being blamed on the model.
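A rolling buffer of this kind is commonly built on a bounded deque; a minimal sketch (the entry fields and helper names are illustrative):

```python
import time
from collections import Counter, deque

query_history = deque(maxlen=100)  # oldest entries are dropped automatically

def record_query(question: str) -> None:
    """Append request metadata; not used for reward computation."""
    query_history.append({"question": question, "timestamp": time.time()})

def hot_questions(threshold: int = 10):
    """Questions seen suspiciously often in the current window:
    a signal of an upstream data-distribution problem."""
    counts = Counter(entry["question"] for entry in query_history)
    return [q for q, c in counts.items() if c >= threshold]
```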
### 3.2 vLLM Compatibility

When forwarding requests, the serving endpoint now automatically strips fields the upstream does not support (such as `strict` and `store`), avoiding unnecessary warning output. The `/requests` endpoint also changed: it now returns the query history instead of raw request records, giving a cleaner debugging view.
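Field stripping before forwarding might look like the following (a hedged sketch; the actual endpoint may strip a different field set):

```python
UNSUPPORTED_FIELDS = ("strict", "store")  # fields the upstream rejects

def sanitize_payload(payload: dict) -> dict:
    """Return a copy of the request body without unsupported fields."""
    return {k: v for k, v in payload.items() if k not in UNSUPPORTED_FIELDS}

clean = sanitize_payload({"model": "qwen", "store": True, "strict": False})
# clean == {"model": "qwen"}
```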
### 3.3 Test Suite

The original two end-to-end tests (pointwise mode, listwise mode) were expanded into six focused tests covering each dimension of the composite reward and the quality gate's penalty behavior:

- Extraversion composite test: an enthusiastic response beats a flat one
- Relevance penalty test: an off-topic response scores lower than an on-topic one
- Diversity penalty test: near-duplicate responses score lower than distinct ones
- Cross-request diversity test: repeating a historical response carries a cost
- Degeneration penalty test: looping paragraphs and special-token leakage are suppressed by the quality gate
- Listwise composite test: the composite reward also takes effect in listwise mode

Each test isolates its history state at runtime, so results do not depend on execution order.

### 3.4 Quick-Reference Document

A cheatsheet was added covering test commands, server startup commands, all reward modes, and an environment-variable reference table, for quick lookup during day-to-day operation.
---

## 4. Architecture Upgrade at a Glance

| Feature | Before | After |
|---------|--------|-------|
| Reward dimensions | 1 (extraversion) | 4 (extraversion + relevance + diversity + quality gate) |
| Quality gate | None | Multiplicative gate; drives degenerate output to ~0 |
| In-batch diversity | None | n-gram similarity detection |
| Cross-request memory | None | Rolling buffer of the last 25 responses |
| Relevance evaluation | None | LLM judge |
| Test cases | 2 | 6 |
| Quick reference | None | New cheatsheet |
| Request observability | None | Query history endpoint |

---

## 5. Summary

In essence, this update turns the reward module from an "extraversion scorer" into a multi-dimensional quality evaluation system. The new relevance and diversity dimensions close the previous version's blind spots, and the quality gate provides a last line of defense for training stability. The updated system can encourage enthusiastic expression while ensuring responses stay on topic, avoid repetition, and do not degenerate, so the model genuinely learns to apply its extraverted personality strengths in the right direction.
Lines changed: 197 additions & 0 deletions
# OpenClaw Agent Training - Extraversion Personality

Train an LLM agent to exhibit more extraverted personality traits using reinforcement learning.

## Overview

This training program uses GRPO (Group Relative Policy Optimization) to train Qwen2.5-7B-Instruct to respond with more extraverted characteristics:

- Outgoing, energetic, enthusiastic tone
- Social engagement and excitement
- Positive, upbeat language
- Action-oriented expressions
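As a rough illustration of how GRPO uses a group of N sampled responses, the group-relative advantage can be sketched as the reward normalized against its own group (a sketch of the standard GRPO normalization, not code from this repo):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each reward by the mean
    and std of its own group of N sampled responses."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

adv = grpo_advantages([0.9, 0.4, 0.4, 0.3])
# The best response in the group gets a positive advantage,
# the worst a negative one; advantages sum to zero.
```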
## Architecture

```
User Query → fake_vllm_endpoint.py → Swarm Server (8 GPUs)
                        ↓
        Generate N=4 responses in parallel
                        ↓
    Evaluate with ExtraversionGrader (OpenJudge)
                        ↓
      Compute rewards & update model (GRPO)
                        ↓
          Return best response to user
```

## Prerequisites

```bash
pip install py-openjudge datasets
```

## Setup

### 1. Download Dataset

```bash
cd tutorial/opencode_build_openclaw_agent
python download_dataset.py
```

This downloads the `holistic-ai/personality_manipulation` dataset and extracts the extraversion examples.

### 2. Configure API Key

Edit `on_compute_relative_reward.py` and set your API key for the judge model:

```python
model = OpenAIChatModel(
    model="qwen-plus",
    api_key="YOUR_API_KEY_HERE",  # Change this
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
```
## Training

### Step 1: Start Swarm Server

On your GPU server (with 8 GPUs available):

```bash
ajet-swarm start
```

Or with monitoring:

```bash
(ajet-swarm start &> ajet-swarm-server.log) & (ajet-swarm overwatch)
```

### Step 2: Start Fake vLLM Endpoint

In a new terminal:

```bash
cd tutorial/opencode_build_openclaw_agent

# Option 1: Use OpenJudge pointwise grading (default)
export AJET_SWARM_URL="http://localhost:10086"
export NUM_REPEAT=4
export REWARD_MODE=pointwise
export DASHSCOPE_API_KEY=your_api_key_here
python fake_vllm_endpoint.py

# Option 2: Use OpenJudge listwise ranking
export AJET_SWARM_URL="http://localhost:10086"
export NUM_REPEAT=4
export REWARD_MODE=listwise
export DASHSCOPE_API_KEY=your_api_key_here
python fake_vllm_endpoint.py
```

This starts the training proxy on `http://localhost:8090`.
### Step 3: Configure OpenClaw to Use the Training Endpoint

OpenClaw needs to connect to the fake vLLM endpoint. Configure it to use `http://localhost:8090` as the LLM backend.

### Step 4: Send Training Requests

Option A - Manual testing via the OpenClaw Web UI / CLI:

```bash
openclaw agent --message "What are your thoughts on Paris?" --thinking high
```

Option B - Automated dataset iteration:

```bash
python mock_user_request.py
```

This iterates through the personality_manipulation dataset and sends each question via the OpenClaw CLI.

## Configuration

Key parameters in `fake_vllm_endpoint.py`:

- `n_gpu=8` - Number of GPUs for training
- `batch_size=32` - Training batch size
- `num_repeat=4` - GRPO N parameter (responses per query)
- `model` - Base model path

Environment variables for reward computation:

- `REWARD_MODE` - Reward computation mode: `pointwise` (default) or `listwise`
- `DASHSCOPE_API_KEY` - API key for the OpenJudge LLM grader
- `JUDGE_BASE_URL` - Base URL for the judge model API (default: DashScope)
- `JUDGE_MODEL` - Judge model name (default: `qwen-plus`)
134+
## Reward Function
135+
136+
Two OpenJudge-based reward modes are available:
137+
138+
### 1. Pointwise Mode (Default)
139+
140+
Uses OpenJudge LLM grader to evaluate each response independently:
141+
- Evaluates extraversion traits on 1-10 scale
142+
- Provides detailed reasoning for each score
143+
- Scores normalized to [-1, 1] for GRPO training
144+
145+
```bash
146+
export REWARD_MODE=pointwise
147+
export DASHSCOPE_API_KEY=your_api_key_here
148+
```
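The 1-10 judge score can be mapped onto [-1, 1] linearly; one plausible mapping (an illustrative sketch, not necessarily the exact formula in `on_compute_relative_reward.py`):

```python
def normalize_score(score: float, lo: float = 1.0, hi: float = 10.0) -> float:
    """Linearly map a judge score in [lo, hi] to [-1, 1]."""
    return 2.0 * (score - lo) / (hi - lo) - 1.0

# normalize_score(1) -> -1.0, normalize_score(10) -> 1.0,
# normalize_score(5.5) -> 0.0 (the midpoint of the scale)
```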
### 2. Listwise Mode

Uses OpenJudge to rank all responses together:

- Compares responses directly against each other
- Produces relative rankings
- Best for capturing subtle differences

```bash
export REWARD_MODE=listwise
export DASHSCOPE_API_KEY=your_api_key_here
```
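Once the judge returns a ranking, it can be converted to relative scores with the best response at 1.0 and the worst at 0.0; a minimal sketch assuming 0-based ranks where 0 means best (the function is illustrative, not the repo's implementation):

```python
def ranks_to_scores(ranks):
    """Map 0-based ranks (0 = best) to relative scores in [0, 1],
    best response scoring 1.0 and worst scoring 0.0."""
    n = len(ranks)
    if n < 2:
        return [1.0] * n
    return [(n - 1 - r) / (n - 1) for r in ranks]

# For 4 responses where the third was judged best:
scores = ranks_to_scores([1, 3, 0, 2])
```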
## Monitoring

Check training progress:

```bash
# View swarm status
ajet-swarm overwatch

# Check request history
curl http://localhost:8090/requests

# Health check
curl http://localhost:8090/health
```

## Files

- `fake_vllm_endpoint.py` - Main training server
- `on_compute_relative_reward.py` - Extraversion reward function
- `on_user_submit_new_requests.py` - Request handler
- `download_dataset.py` - Dataset downloader
- `mock_user_request.py` - Automated testing client

## Troubleshooting

**Import errors**: LSP warnings about unresolved imports are normal; the dependencies will be available at runtime.

**Connection refused**: Ensure the swarm server is running on port 10086.

**All episodes failed**: Check GPU availability and the swarm server logs.

## Notes

- Training is passive: the endpoint waits for requests rather than iterating over a dataset
- Each request generates N=4 responses, evaluates them, and trains on the best
- The model gradually learns to produce more extraverted responses over time
