Experiment log: terminal-rl Qwen3-8B run-3 (60h, 565 steps, wandb msp60ius): accuracy 0.32→0.52, no mode collapse, 6 dataset stall events

## TL;DR

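On the same 8× H200 machine (143 GB per GPU) as issue #2, a third round of **`terminal-rl` GRPO outcome-only** training (dense pass-rate reward, no PRM, no process reward) was run with **the same 8B launch script** ([`run_qwen3_8b_experiment.sh`](https://github.com/HansBug/OpenClaw-RL/blob/main/terminal-rl/run_qwen3_8b_experiment.sh)). The only change to the launch sequence is one extra step that pre-provisions the pool to a capacity of 1024 leases. The model is still **Qwen3-8B** and the dataset is still `seta_env` (1376 Linux terminal tasks).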

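- **W&B run**: https://wandb.ai/hansbug/openclaw-terminal-rl/runs/msp60ius
- **Runtime**: ~60 h 12 m (killed manually to wrap up; the run had already stopped when this issue was written)
- **Training volume**: 565 wandb steps / 283 rollouts / last checkpoint = `iter_0000279`, about **14%** of the `num_rollout=2000` target (far below the 55% reached in issue #2; the causes are ~19% lost to dataset stalls plus slow samples dragging rollouts out, see §6)
- **Peak**: `terminal/accuracy` reached a single-point high of **0.711** at rollout 232, with a mean of **0.517** over the last 50 rollouts; `reward_mean` reached a single-point high of **+0.422**, with a mean of **+0.034** over the last 50 rollouts
- **vs issue #1 (4B / ~270 steps) / issue #2 (8B / ~1100 steps)**: the single-batch accuracy peak is higher than issue #2's (0.711 vs 0.59, see §4). The bucket-averaged figures are not directly comparable, since issues #1/#2 use 25-step buckets and run-3 uses 50-rollout buckets. On averages, run-3 holds steady at 0.50–0.52, whereas issue #2 mode-collapsed to 0.05 late in the run. **run-3 had not collapsed when it was stopped**, but it had already entered the diminishing-returns phase
- **Did not occur**: the mode collapse seen in issue #2 (grad_norm→0 + non_trainable→0.92 + response_len→7); at step 565 the values were grad_norm=0.45 / non_trainable=0.001 / response_len=300+
- **Environment side**: the env-pool configured with 1024 leases had 0 fatal errors over 60h+ (`Too many open|address pools|429` all count 0), but 6 "slow/broken" task events in the dataset together caused **~19% idle training time** (detailed pool-side analysis in [issue #3's 60h+ pool measurement comment](https://github.com/HansBug/OpenClaw-RL/issues/3#issuecomment-4340486381))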

---

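## 1 Expected training results (same as issues #1 and #2, not repeated)

terminal-rl has no officially published expected accuracy curve. Issue #1 surveyed this exhaustively ([arxiv/2603.10165](https://arxiv.org/abs/2603.10165) Table 4 leaves the terminal row empty, the [blog](https://yinjjiew.github.io/projects/openclawrl2/) does not cover terminal, [paper 2602.02488](https://arxiv.org/abs/2602.02488) does not cover terminal either, and the main repo issue has received no reply with numbers so far), so it is not repeated here.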

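The only reference points with any guidance value come from the Tech Report (Tables 3 and 4):

| Source | setting | outcome-only |
|---|---|---:|
| Table 4 | tool-call (250 step) | 0.17 |
| Table 4 | GUI (120 step) | 0.31 |
| Table 4 | terminal | **empty** |
| Table 3 | personal-agent Binary RL (8/16 step) | 0.25 / 0.23 |

Comparing this 8B run's `terminal/accuracy` against these rows, with a long-run mean of **0.49** (mean over the whole run, 282 rollouts) and a last-50-rollout mean of **0.52**: the magnitude is plausible. It is above the tool-call outcome-only baseline of 0.17 and also above both GUI outcome-only (0.31) and GUI integrated (0.33). This still cannot settle whether the run "meets the bar" or not, because it is only a self-measurement without a control group.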

---

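## 2 Launch sequence and command (same as issue #2 except pool capacity)

Full runnable script: [`run_qwen3_8b_experiment.sh`](https://gist.github.com/HansBug/5f01a4a703b55d7991a00bd07e66509a). **The algorithm args are unchanged from issue #2, character for character.** The env-pool capacity was raised from issue #2's default of 32 leases to 1024 leases (`--max-tasks 64 --max-runs-per-task 16 --max-concurrent-closes 32`), an outcome of the discussion in the main post of issue #3.

### 2.1 pool launch (**the only environment-side difference from issue #2**)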

```bash
docker run -d --name openclaw_pool_server --network=host \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /nfs/terminal-rl-workspace/OpenClaw-RL:/nfs/terminal-rl-workspace/OpenClaw-RL:rw \
  openclaw-rl/pool-server:latest \
  python -m terminal-rl.remote.pool_server \
    --host 0.0.0.0 --port 18081 \
    --max-tasks 64 \
    --max-runs-per-task 16 \
    --max-concurrent-closes 32 \
    --output-root /nfs/terminal-rl-workspace/OpenClaw-RL/terminal-rl/build_outputs
```

`64 × 16 = 1024` leases of capacity. Each GRPO rollout actually peaks at 128 leases (16 prompts × 8 samples), i.e. 12.5% utilization. **0 fatal errors in 60h+**; see [issue #3's 60h+ pool measurement comment](https://github.com/HansBug/OpenClaw-RL/issues/3#issuecomment-4340486381).
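The capacity arithmetic above can be checked in a few lines. This is only a sketch: the variable names are illustrative, and the values are the ones quoted in this section.

```python
# Pool capacity vs. peak demand, using the numbers quoted above.
max_tasks = 64             # --max-tasks
max_runs_per_task = 16     # --max-runs-per-task
prompts_per_rollout = 16   # rollout batch: prompts
samples_per_prompt = 8     # rollout batch: samples per prompt

lease_capacity = max_tasks * max_runs_per_task
peak_leases = prompts_per_rollout * samples_per_prompt
utilization = peak_leases / lease_capacity

print(f"capacity={lease_capacity}, peak={peak_leases}, utilization={utilization:.1%}")
# prints: capacity=1024, peak=128, utilization=12.5%
```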

### 2.2 Wrapper layer / algorithm args: zero lines changed

This run reuses issue #2's 4 wrapper compat changes (conda activate / .env / LD_LIBRARY_PATH / `--no-gradient-accumulation-fusion`) and its 7 OpenClaw-RL source patches as-is. **No new source modifications were made.**

### 2.3 Config summary (core RL args)

```
model          : Qwen3-8B (/nfs/models/Qwen3-8B)
ref-load       : /nfs/models/Qwen3-8B_torch_dist
algorithm      : GRPO outcome-only (dense pass-rate reward)
rollout_batch  : 16 prompt × 8 sample = 128 traj/round
max_response   : 8192 token
max_context    : 16384 token
optimizer      : Adam (β1=0.9, β2=0.98, wd=0.1)
lr             : 1e-6 constant
kl_loss_coef   : 0.01 (k3)
save_interval  : 8 rounds
num_rollout    : 2000 (target, not reached)
```
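For readers unfamiliar with the `(k3)` tag on `kl_loss_coef`, here is a minimal sketch of the low-variance KL estimator commonly called k3. It assumes that this is what the config refers to; the function name is hypothetical, and how the training framework aggregates the per-token values into `train/kl_loss` is not shown in this report.

```python
import math

def k3_kl(logp_policy: float, logp_ref: float) -> float:
    """k3 estimate of KL(policy || ref) for one token sampled from the policy."""
    log_ratio = logp_ref - logp_policy      # log(p_ref / p_policy)
    return math.exp(log_ratio) - 1.0 - log_ratio

KL_LOSS_COEF = 0.01  # kl_loss_coef from the config summary above

# Identical log-probs give zero penalty; any mismatch gives a positive one.
print(k3_kl(-1.0, -1.0))
print(KL_LOSS_COEF * k3_kl(-1.0, -1.5))
```

The estimator is non-negative for every sample, which is why a single extreme token can produce a large one-step `kl_loss` value without the average policy having moved much.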

---

## 3 wandb curves over the full run (50-rollout bucket means)

### 3.1 Main metrics

| Metric | rollout 0-49 | 50-99 | 100-149 | 150-199 | 200-249 | **250-282** |
|---|---:|---:|---:|---:|---:|---:|
| `terminal/accuracy` | 0.318 | **0.466** ⬆️+0.148 | 0.487 | 0.483 | 0.526 | **0.522** |
| `terminal/reward_mean` | -0.364 | **-0.068** ⬆️+0.296 | -0.027 | -0.034 | +0.052 | **+0.045** |
| `terminal/non_trainable_ratio` | 0.017 | 0.022 | 0.017 | 0.016 | 0.028 | 0.028 |
| `terminal/truncated` (count) | 1104 | 1136 | 1147 | 1118 | 1049 | 1062 |

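**Storyline**:
- **rollout 0→100 (first ~35%)**: accuracy +0.15, reward +0.30; the big-gain phase, where the model goes from cold start to getting the basic structure working
- **rollout 100→200 (middle ~35%)**: accuracy flat at 0.48–0.49, reward flat around -0.03; first plateau
- **rollout 200→280 (last ~30%)**: accuracy 0.48 → 0.52 (+0.04), reward ~0 → +0.05; a slow grind upward
- **The last 30 rollouts went flat again** at 0.52; early sign of a second plateau

`non_trainable_ratio` stays **below 0.03 in every bucket** (vs 0.92 late in issue #2), which indicates **the model did not enter mode collapse**. `truncated` is stable in the 1050-1150 range, and **the response_len→7 collapse from issue #2 did not appear**.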
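The bucket means in the table can be reproduced from a per-rollout series with a few lines. This is a sketch with synthetic data, not the real wandb export; `bucket_means` is a hypothetical helper name.

```python
from statistics import mean

def bucket_means(values, bucket_size=50):
    """Mean of each consecutive bucket; the last bucket may be shorter."""
    return [
        mean(values[i:i + bucket_size])
        for i in range(0, len(values), bucket_size)
    ]

# Synthetic per-rollout accuracies (NOT the real run data): 283 points.
accuracy = [0.3] * 50 + [0.5] * 233
buckets = bucket_means(accuracy)
print(len(buckets), [round(b, 3) for b in buckets])
```

With 283 rollouts this yields six buckets, the last of which covers only rollouts 250-282, so its mean rests on fewer points than the others.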

### 3.2 Training-side metrics (median per 50-step bucket; the median keeps the step-517 outlier from contaminating the numbers)

| Metric | step 0-50 | 200-250 | 400-450 | **500-535** |
|---|---:|---:|---:|---:|
| `train/grad_norm` | 0.79 | 0.66 | 0.51 | **0.51** |
| `train/kl_loss` | 0.04 | 0.08 | 0.12 | **0.11** |
| `train/loss` | -0.025 | -0.045 | -0.026 | **-0.038** |

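- `grad_norm 0.79 → 0.51`: gradient magnitude shrinks steadily, which is consistent with approaching convergence
- `kl_loss 0.04 → 0.11`: the policy keeps drifting away from the reference policy, but stays well below the 0.5 threshold
- `train/loss` median is negative throughout: the policy keeps updating in the direction that improves reward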
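Why the median rather than the mean: a single outlier of the step-517 magnitude dominates a bucket mean but barely moves the median. The sketch below uses synthetic values, with only the outlier magnitude taken from this report.

```python
from statistics import mean, median

# Synthetic grad_norm values for one bucket (NOT the real log), with the
# step-517 magnitude from this report injected as a single outlier.
bucket = [0.45, 0.50, 0.51, 0.52, 0.55, 5_150_010.68]

print(f"mean   = {mean(bucket):,.2f}")
print(f"median = {median(bucket):.3f}")
```

The mean lands in the hundreds of thousands, while the median stays near 0.5, matching the scale of the neighbouring steps.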

### 3.3 Extreme points

```json
{
  "accuracy_max":      {"value": 0.7113,     "at": "rollout 232"},
  "accuracy_min":      {"value": 0.1241,     "at": "rollout ~30"},
  "reward_mean_max":   {"value": 0.4225,     "at": "rollout ~232"},
  "reward_mean_min":   {"value": -0.7517,    "at": "rollout ~10"},
  "grad_norm_max_raw": {"value": 5150010.68, "at": "step 517 (anomaly, see §5)"},
  "grad_norm_min":     {"value": 0.0779},
  "kl_loss_max_raw":   {"value": 3572.90,    "at": "step 517 (same event)"},
  "kl_loss_min":       {"value": 0.0}
}
```

### 3.4 wandb training curves (figures)

Overview (3×2 dashboard):

| Overview |
|---|
| ![00_dashboard.png](https://github.com/HansBug/OpenClaw-RL/blob/a6981c9c1fddd84fd5e8ddf5fec8bd4402ac2917/00_dashboard.png?raw=true) |

8 individual metric panels:

| 01 accuracy (pytest pass rate) | 02 reward_mean [-1..+1] |
|---|---|
| ![01_accuracy.png](https://github.com/HansBug/OpenClaw-RL/blob/a6981c9c1fddd84fd5e8ddf5fec8bd4402ac2917/01_accuracy.png?raw=true) | ![02_reward_mean.png](https://github.com/HansBug/OpenClaw-RL/blob/a6981c9c1fddd84fd5e8ddf5fec8bd4402ac2917/02_reward_mean.png?raw=true) |

| 03 raw_reward (batch mean) | 04 grad_norm |
|---|---|
| ![03_raw_reward.png](https://github.com/HansBug/OpenClaw-RL/blob/a6981c9c1fddd84fd5e8ddf5fec8bd4402ac2917/03_raw_reward.png?raw=true) | ![04_grad_norm.png](https://github.com/HansBug/OpenClaw-RL/blob/a6981c9c1fddd84fd5e8ddf5fec8bd4402ac2917/04_grad_norm.png?raw=true) |

| 05 kl_loss | 06 entropy_loss |
|---|---|
| ![05_kl_loss.png](https://github.com/HansBug/OpenClaw-RL/blob/a6981c9c1fddd84fd5e8ddf5fec8bd4402ac2917/05_kl_loss.png?raw=true) | ![06_entropy_loss.png](https://github.com/HansBug/OpenClaw-RL/blob/a6981c9c1fddd84fd5e8ddf5fec8bd4402ac2917/06_entropy_loss.png?raw=true) |

| 07 non_trainable_ratio | 08 response_len |
|---|---|
| ![07_non_trainable_ratio.png](https://github.com/HansBug/OpenClaw-RL/blob/a6981c9c1fddd84fd5e8ddf5fec8bd4402ac2917/07_non_trainable_ratio.png?raw=true) | ![08_response_len.png](https://github.com/HansBug/OpenClaw-RL/blob/a6981c9c1fddd84fd5e8ddf5fec8bd4402ac2917/08_response_len.png?raw=true) |

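> **Features visible to the naked eye in the plots**:
> - `01_accuracy`: climbs quickly from 0.18 to 0.45 over the early steps 0-100, then settles into long-term oscillation between 0.45 and 0.55 with no visible collapse
> - `04_grad_norm`: the extreme 5,150,010 spike at step 517 is visible as a single-point jump; all neighbouring steps are < 1, and Megatron's `--clip-grad 1.0` contained it
> - `05_kl_loss`: an extreme value of 3572 appears at step 517 as well (same anomalous batch), but the curve returns to ~0.1 immediately before and after
> - `07_non_trainable_ratio`: **below 0.15 at every point**, in sharp contrast to the mode collapse at the end of issue #2, where it shot up to 0.92
> - `08_response_len`: stable in the 100-500 token range, **without the "collapse to 7 tokens" seen in issue #2**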

---

## 4 Side-by-side comparison with issues #1 and #2

| Item | issue #1 (4B) | issue #2 (8B) | **this run (8B)** |
|---|---|---|---|
| runtime | ~10h | ~45h | **60h12m** |
| steps | ~270 | ~1107 | **565** |
| checkpoints | ~33 | ~138 | **35** |
| accuracy peak (single batch) | 0.60 | 0.59 | **0.711** ✨ |
| accuracy peak (25-step bucket avg) | 0.385 | **0.45** ✨ | 0.522 (50-rollout avg) |
| reward_mean peak | n/a | +0.40 | **+0.422** ✨ |
| mode collapse at the end? | plateau | **yes** (grad_norm→0 / non_trainable→0.92 / resp_len→7) | **no** (grad_norm=0.45 / non_trainable=0.001 / resp_len=300+) |
| step ratio (vs issue #1) | 270/270 | 1107/270 = 4.1× | 565/270 = 2.1× |
| single-batch peak / long-run mean | 0.60/0.385 = 1.56× | 0.59/0.45 = 1.31× | 0.71/0.52 = 1.37× |

**Key differences**:
- **run-3 had not mode-collapsed when it ended** (issue #2 started collapsing after step ~800; run-3 was still stable when stopped at step 565)
- **Higher peak** (0.711 vs 0.59), but **long-run means sit at roughly the same level** (0.52 vs 0.45)
- **Fewer steps**: this round lost ~19% to dataset-side stalls (the 1024-lease pool discussed in the main post of issue #3 held up, but samples that are slow internally drag out the whole rollout)

---

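## 5 Three training-side numerical anomalies (all contained by grad_clip)

run-3 hit 3 **single-step numerical anomalies** during training. All were contained by Megatron's default `--clip-grad 1.0`, which caps the global gradient norm at 1.0 before the optimizer step, and **the model weights were not actually damaged**: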
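To make the containment concrete, here is a sketch of generic global-norm clipping applied to the three raw norms reported in this section. It assumes Megatron's `--clip-grad` follows the usual scheme (the same formula as PyTorch's `clip_grad_norm_`); `clip_scale` is a hypothetical helper, not a Megatron function.

```python
def clip_scale(total_norm: float, max_norm: float = 1.0, eps: float = 1e-6) -> float:
    """Factor applied to every gradient under global-norm clipping."""
    return min(1.0, max_norm / (total_norm + eps))

# Raw grad_norm values quoted in §5.1 and §5.2.
for step, norm in [(482, 0.5024), (483, 26.2644), (517, 5_150_010.68)]:
    scale = clip_scale(norm)
    print(f"step {step}: raw={norm:,.4f} scale={scale:.3e} clipped={norm * scale:.4f}")
```

A healthy step passes through unscaled, while the two spikes are rescaled so that their clipped norm is about 1.0. The logged `grad_norm` is the raw pre-clip value, which is why the spike looks far worse in the plot than its effect on the weights.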

### 5.1 step 483: grad_norm 26.26 (first alert)

```text
[2026-04-28 19:06:14] step 483: train/loss=-0.1469  grad_norm=26.2644  kl_loss=0.0677
                              previous step 482   grad_norm=0.5024
                              next     step 484   grad_norm=0.7731
```

A single-point spike that recovered on the very next step. **No effect on later training.**

### 5.2 step 517: grad_norm 5,150,010 ⚠️ (**~100,000× over the EARLY-STOP threshold**)

```text
[2026-04-28 21:29:52] step 517:
    train/loss = 27,473.78
    train/pg_loss = 27,473.78
    grad_norm = 5,150,010.68    ← ~100,000× the EARLY-STOP threshold (>50)
    kl_loss = 0.1340            ← but kl_loss did not spike (important)
```

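**Post-hoc analysis**:
- Some sample in this batch likely carried an extreme, inf/NaN-like advantage value
- Megatron's `--clip-grad 1.0` clipped the global gradient norm to ≤ 1.0, which bounded the update actually applied
- **kl_loss did not spike**, which indicates the weights did not actually drift from the ref policy
- grad_norm returned to the healthy 0.4-0.8 range for all of the next 8 steps
- accuracy went on to set a new high of 0.711 after step 517 (rollout 259)

→ By the letter of the rule this triggers EARLY-STOP, but in practice grad_clip contained it and the model stayed fully alive. **New rule recorded: before intervening, first check whether kl_loss rises and whether accuracy collapses within the next 5 steps.**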
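The new rule can be sketched as a small check over short windows before and after a spike. Everything here is an assumption for illustration: the function name, the window contents (synthetic, not the real log), and both thresholds. It reports the two signals separately and leaves the stop decision to the operator.

```python
def post_spike_check(kl_before, kl_after, acc_before, acc_after,
                     kl_rise_factor=2.0, acc_drop=0.10):
    """Compare short windows before and after a spike.

    Each argument is a list of per-step values (about 5 steps per window).
    kl_rise_factor and acc_drop are illustrative thresholds, not values
    taken from the run.
    """
    def avg(xs):
        return sum(xs) / len(xs)

    return {
        "kl_rose": avg(kl_after) > kl_rise_factor * avg(kl_before),
        "acc_fell": avg(acc_after) < avg(acc_before) - acc_drop,
    }

# Synthetic windows (NOT the real log) shaped like a benign spike.
benign = post_spike_check(
    kl_before=[0.11, 0.12, 0.10, 0.11, 0.12],
    kl_after=[0.13, 0.11, 0.10, 0.12, 0.11],
    acc_before=[0.50, 0.52, 0.49, 0.53, 0.51],
    acc_after=[0.52, 0.50, 0.55, 0.51, 0.53],
)
print(benign)  # {'kl_rose': False, 'acc_fell': False}
```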

### 5.3 step 531: kl_loss 0.4326 (86% of the threshold)

```text
[2026-04-28 22:33:06] step 531: train/loss=+0.0337  kl_loss=0.4326  grad_norm=0.6196
                              previous step 530   kl_loss=0.1130
                              next     step 532   kl_loss=0.1048
```

A single-point spike that stayed under the 0.5 threshold. The next step was back at 0.10.

---

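## 6 Dataset-side stall events (**6 events, ~19% of training time lost in total**)

⚠️ **This is the most important finding of run-3**: even with a fully healthy env-pool, **the behaviour of a sample inside a single container** can drag down an entire GRPO rollout, because GRPO has to wait for every sample in a group to finish before it can train.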
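The barrier effect can be illustrated with a toy calculation: the rollout takes as long as its slowest sample, so one stalled sample leaves the other slots idle. The durations below are synthetic; only the ~86 min magnitude echoes a stall listed in the table below, and `rollout_wall_time` is a hypothetical helper.

```python
def rollout_wall_time(sample_minutes):
    """A rollout finishes only when its slowest sample finishes."""
    return max(sample_minutes)

# Synthetic durations in minutes (NOT measured): 127 ordinary samples
# plus one stalled sample of roughly the 86-minute kind.
durations = [4.0] * 127 + [86.0]
wall = rollout_wall_time(durations)
idle_share = 1 - (sum(durations) / len(durations)) / wall
print(f"rollout wall time: {wall:.0f} min, average slot idle share: {idle_share:.0%}")
```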

The full real dataset tasks, model behaviour, and triggering logs are in **[issue #3's 60h+ pool measurement comment](https://github.com/HansBug/OpenClaw-RL/issues/3#issuecomment-4340486381)** (to keep this post short); only the summary is listed here:

| # | task_id | Trigger pattern | Duration | Self-healed |
|---|---|---|---:|---|
| 1 | **786** (1st occurrence) | nmap repeated ×66 | ~3h45m | ✅ |
| 2 | **96** | FHS `find /` × 121 calls × 24MB output | ~86min | ✅ |
| 3 | **90** (×2 samples) | Swap monitoring at 5 min/turn (required by the task instruction itself) | ~60min/sample | ✅ |
| 4 | **456** | Python memleak, silent hang at Turn 9 (new failure mode) | ~70min | ✅ |
| 5 | **786** (2nd occurrence) | nmap repeated ×94 (worse) | **~6h09m** | ✅ |
| 6 | **856** | slow shell_exec during apt + postfix | ~67min | ✅ |

**Key observations**:
- ✅ **All 6/6 events healed on their own** (the fallback chain of agent_runner's 10-turn cap plus the lease idle TTL is reliable)
- ✅ **0 cases needed manual `docker rm` intervention**
- ⚠️ task=786 is a **repeat offender**: it recurred twice within the same run, and the second time was worse than the first (94 vs 66 calls)
- ⚠️ **Cumulative dataset stall loss in run-3: ~11h35m / 60h12m ≈ 19%**

---

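## 7 Retrospective recommendations (by actionable priority)

### 7.1 Short term (must do before the next restart)

**Why run-3 did not collapse, compared with issue #2's late-stage mode-collapse path**: this run's step count (565) is still far from the step ~800-1000 range where issue #2 began to collapse. **The expectation is that the next run will enter the same collapse path once it passes step 800.** Recommendations for the next restart:

1. **Add a dataset blacklist** (removes the ~19% idle time):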

   ```python
   # terminal-rl/data_utils/convert_task_to_dataset.py
   BLACKLIST = {'786', '96', '90', '456', '856', '210'}  # plus the ones from issue #3 §7.6
   ```

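2. **Add a holdout eval split**; see the discussion of `tbench_test` in the main post of issue #3. Without a holdout, tuning is blind: runs 1/2/3 **can only be compared on accuracy over train-rollout subsets**, which is not a fair comparison.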

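3. **Add a wall-clock per-sample timeout** (a 30-minute hard cap in agent_runner):

   ```python
   # terminal-rl/agent_runner.py
   import logging
   import time

   logger = logging.getLogger(__name__)
   TOTAL_ROLLOUT_TIMEOUT_S = 30 * 60  # 30-minute wall-clock budget per sample

   async def run_episode(*args, max_iterations=10, **kwargs):  # placeholder signature
       start_ts = time.monotonic()  # monotonic: immune to system clock changes
       turn = 0
       while turn < max_iterations:
           if time.monotonic() - start_ts > TOTAL_ROLLOUT_TIMEOUT_S:
               logger.error("Wall-clock timeout, aborting rollout")
               break
           ...  # existing per-turn logic goes here
           turn += 1
   ```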

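### 7.2 Medium term (avoiding the issue #2 mode collapse)

That GRPO outcome-only collapses past step 800 was observed in issue #2. Recommendations for the next run:

1. **Add a PRM** (the main OpenClaw-RL repo supports `--prm-enable`) to make the reward signal denser
2. **Add an entropy bonus** to keep the policy from over-converging
3. **lr decay** (lr is currently a constant 1e-6; a cosine decay to 0.5e-6 is suggested)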
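The suggested decay in item 3 can be sketched as a plain cosine schedule from 1e-6 down to 0.5e-6. The function name and the 2000-step horizon are assumptions for illustration (the report only fixes `num_rollout=2000`, and steps and rollouts are not the same unit in this run).

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 1e-6, lr_min: float = 0.5e-6) -> float:
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# total_steps=2000 is an assumed horizon, not a value from the run.
for s in (0, 500, 1000, 2000):
    print(s, f"{cosine_lr(s, total_steps=2000):.3e}")
```

At the halfway point the rate is 7.5e-7, and it reaches the 0.5e-6 floor only at the end of the horizon.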

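### 7.3 Long term (curbing the model's mass-parallel command repetition)

The pattern already observed in run-3, "task=786 issues 94 identical nmap calls in a single turn", is a **failure of GRPO reward shaping**: the model has learned that "many commands in a short time = many attempts = an occasional positive reward". Suggested fix: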

```python
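# in rollout_log.py or on the reward_fn side
from collections import Counter

def compute_dup_penalty(tool_calls):
    # Charge 0.05 for every repeat of a command already seen in tool_calls.
    cmd_counts = Counter(tc.args.get('command') for tc in tool_calls)
    return -0.05 * sum(c-1 for c in cmd_counts.values() if c > 1)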
```
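As a quick sanity check of the penalty's scale, the sketch below applies the same formula to 94 identical calls. `ToolCall` is a hypothetical stand-in for the real tool-call object, and the nmap command string is made up for illustration.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the real tool-call object: only .args is used.
ToolCall = namedtuple("ToolCall", ["args"])

def compute_dup_penalty(tool_calls, per_dup=0.05):
    """Subtract per_dup for every repeat of an already-issued command."""
    cmd_counts = Counter(tc.args.get("command") for tc in tool_calls)
    return -per_dup * sum(c - 1 for c in cmd_counts.values() if c > 1)

# 94 identical calls (93 repeats) plus one unrelated command.
calls = [ToolCall({"command": "nmap -sV 10.0.0.1"})] * 94 + [ToolCall({"command": "ls"})]
print(round(compute_dup_penalty(calls), 2))  # -4.65
```

Since `reward_mean` in this run lives in [-1, +1], an uncapped penalty of this size would dominate the outcome reward, so clamping it (for example to the reward range) may be worth considering.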

---

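## 8 Cross-references with [issue #3]

- **This post**: run-3 experiment log for the algorithm/training side (accuracy curves, step anomalies, cross-run comparison)
- **[issue #3's 60h+ pool measurement comment](https://github.com/HansBug/OpenClaw-RL/issues/3#issuecomment-4340486381)**: the env-pool side of the 60h+ measurement (6 dataset stall patterns + full task.yaml + actual ToolCallRequest lists + pool error statistics)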

---

## 9 End state of this experiment (2026-04-29 02:43 UTC)

- Training process shut down via SIGTERM ✅
- env-pool container still running (draining, kept for reuse next time)
- 30-minute cron health broadcast stopped ✅
- Last checkpoint: `iter_0000279` (corresponds to train step ≈ 558, rollout ~279)
- Full wandb history: https://wandb.ai/hansbug/openclaw-terminal-rl/runs/msp60ius

