Experiment log: Qwen3-8B TB v0.1.x eval-as-train capacity probe (84h, 320 rollouts, wandb fdhgc9j7): Codex pass@1 upper bound 9.2% (vs run-3 OOD 5.6% / vs Qwen3-32B 15.5% / vs AfterQuery 17%)

## TL;DR

Following up on [issue #4, the run-3 main thread](https://github.com/HansBug/OpenClaw-RL/issues/4) and [issue #8, the OOD eval](https://github.com/HansBug/OpenClaw-RL/issues/8), this run is a capacity probe with a **special configuration**: the **TB v0.1.x eval set (86 tasks) is used directly as the train set** for Qwen3-8B (an "eval-as-train" overfit probe). The launch script is [`run_qwen3_8b_tboverfit.sh`](https://github.com/HansBug/OpenClaw-RL/blob/main/terminal-rl/run_qwen3_8b_tboverfit.sh). On top of [`run_qwen3_8b_experiment.sh`](https://github.com/HansBug/OpenClaw-RL/blob/main/terminal-rl/run_qwen3_8b_experiment.sh), its only training-relevant override is pointing `ROLLOUT_PROMPT_DATA` at `tbench_test_convert/train.jsonl` (the checkpoint directory and W&B group are also renamed), with **zero algorithm-side changes**.

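> ⚠️ **This ckpt must not be submitted to the TB leaderboard or used as a fair benchmark**: training and evaluating on the same set is data leakage. The outputs of this run are only suitable as (1) a **capacity upper-bound probe**, (2) material for reward-landscape / learning-behavior analysis, and (3) a ckpt source for later probing / distillation.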

- **W&B run**: https://wandb.ai/hansbug/openclaw-terminal-rl/runs/fdhgc9j7
- **Runtime**: ~84h (2026-05-02 12:20 UTC → 2026-05-05 11:55 UTC, **forced stop when the NFS quota filled up: `EDQUOT`, errno 122**; see §9)
- **Training volume**: 320 rollouts / 320 train steps / 37,236 trials / last ckpt = `iter_0000311`
- **Core conclusion: eval-as-train gives an upper bound of pass@1 ≈ 9.2%**, which is **still 1.7-1.9× below comparable-size open-source setups and 5-7× below closed-source frontier systems**:

| Baseline | TB v0.1.x pass@1 | vs this run |
|---|---:|---:|
| Claude 4.5 Sonnet (Apex2) [closed-source frontier] | **0.645** | 7.0× |
| Claude Opus 4.1 (Droid) [closed-source frontier] | 0.588 | 6.4× |
| GPT-5 (Droid) [closed-source frontier] | 0.525 | 5.7× |
| MiniMax-M2 230B (iFlow CLI) [open-source, large] | 0.420 | 4.6× |
| GLM-4.5 355B (Terminus 1) [open-source, large] | 0.399 | 4.3× |
| Qwen3-Coder-480B (iFlow CLI) [open-source, large] | 0.390 | 4.2× |
| **Qwen3-32B + TerminalAgent [4× size, same family]** | **0.155** | **1.7×** |
| gpt-5-nano (Terminus 2) | 0.122 | 1.3× |
| codex-mini-latest (Codex CLI) | 0.113 | 1.2× |
| **This run: tboverfit eval-as-train upper bound** | **0.092** | **1.0×** |
| Qwen3-235B-A22B (Terminus 1) | 0.066 | 0.7× |
| DeepSeek-R1 (Terminus 1) | 0.057 | 0.6× |
| **run-3 iter215 OOD peak (issue #8)** | **0.056** | **0.6×** |
| Qwen3-8B base (Terminus 2) | 0.020 | 0.2× |

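- vs **run-3, same 8B model, OOD eval**: tboverfit 9.2% / iter215 OOD 5.6% = **1.65×**. Eval-as-train does give a higher upper bound (it removes the OOD gap).
- vs **same-family Qwen3-32B + TerminalAgent**: 0.155 / 0.092 = **1.7×**. Scaling the same base family 4× and switching to a better agent harness already clears the 8B eval-as-train upper bound by a wide margin.
- vs **AfterQuery GPT-OSS-20B (SFT+RL fine-tune)**: it reaches **0.170** on TB Core 2.0 (which is harder). It is in the same size class (8B vs 21B), also open-source, also Apache-2.0, but it uses an **SFT cold-start + RL** pipeline plus dedicated agent tuning and **reaches 17%**, 1.85× our 9.2%. This suggests that **the 8B-21B size class has headroom on TB, and that the limiting factor is the setup rather than a capacity bound**.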

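- **Training-side infrastructure is healthy, but the reward becomes sparse in the mid-to-late run**: `grad_norm` mean 0.564, `kl_loss` mean 0.089, `response_len` 50w (50-rollout sliding window) within [87, 283], `pg_clipfrac` 0 throughout, no mode collapse. However, `terminal/non_trainable_ratio` averages 3.4% before wandb step 225 (same tier as run-3's < 3%) and then **rises to 22.65% in the rollout 250-299 window, with 7 single-step spikes > 0.5** (see §8). The reward distribution is 53.5% all-zero trials / 41.3% partial / 5.1% strict success, so the GRPO advantage estimate has a low signal-to-noise ratio (see §7).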

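- **Per-task breakdown**: over 320 rollouts on 86 tasks, **only 19/86 tasks were ever strict-solved**. Six of them (swe-bench-fsspec / broken-python / fix-permissions / fibonacci-server / hello-world / vim-terminal-task) have a pass rate ≥ 40% (broken-python has only 1 trial and is low-confidence, so 5 are effectively high-frequency). **67/86 tasks had 0 strict successes over the whole 84h run.** "9.2%" does not mean the model solves the 86 tasks uniformly at 9.2%. It reflects **overfitting concentrated on ~5 easy tasks**, plus lower pass rates on the few other solved tasks.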

- **Retained ckpts (after cleaning up 3.5 TB)**: `iter_0000007 / 0000103 / 0000199 / 0000303 / 0000311`, 5 × ~104 GB ≈ 515 GB in total.

---

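## 1 Experiment design: why run eval-as-train?

issue #4 run-3 reached 0.517 mean accuracy (train-side) on the 1376-task seta_env, but the OOD eval in issue #8 on the 66-task harbor subset of TB v0.1.x only gave **iter215 peak 0.056 / iter279 0.025**. Two explanations for this gap **could not be told apart**:

1. **Data distribution mismatch (OOD gap)**: the seta_env training distribution ≠ the TB v0.1.x evaluation distribution
2. **Model capacity limit (capacity bound)**: Qwen3-8B's ceiling on TB v0.1.x is simply this low

To separate the two, **the cleanest experiment is to train on TB v0.1.x itself**: this removes the distribution gap and leaves only the question of whether the model can learn the tasks. What this probe shows:

- ✅ **Eval-as-train moves pass@1 from 5.6% (OOD) to 9.2%, a gain of +3.6 pp**: the OOD gap is real, but it is **not the whole reason run-3 stalled**
- ✅ **9.2% is still far below a comparable-size open-source setup (AfterQuery, 17.0% on TB 2.0) and a larger same-family model (Qwen3-32B, 15.5% on v0.1.x)**: **a capacity ceiling exists, but it is far higher than what this setup can approach**
- ✅ **Concentrated overfitting rather than broad learning**: only 19/86 tasks ever get a strict pass and the other 67 are never solved, so eval-as-train does not rescue the hard tasks either

The way this probe "fails" is itself informative: the combination of **8B base + GRPO outcome-only + default Terminus 2 scaffolding + no SFT warmup** is **not enough** at TB v0.1.x difficulty.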

---

## 2 Launch configuration (differs from run-3 by one training-relevant environment variable)

Full wrapper: [`run_qwen3_8b_tboverfit.sh`](https://github.com/HansBug/OpenClaw-RL/blob/main/terminal-rl/run_qwen3_8b_tboverfit.sh)

```bash
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
export ROLLOUT_PROMPT_DATA="${SCRIPT_DIR}/dataset/tbench_test_convert/train.jsonl"  # the only training-relevant difference
export SAVE_CKPT="${SCRIPT_DIR}/ckpt/qwen3-8b-tboverfit"
export WANDB_GROUP="qwen3-8b-tboverfit-probe"
exec "${SCRIPT_DIR}/run_qwen3_8b_experiment.sh"
```

`tbench_test_convert/train.jsonl` is the prompt-only jsonl converted from the 86 TB v0.1.x tasks (task description + Dockerfile path).

### Configuration parity

| Setting | run-3 (issue #4) | tboverfit-probe (this run) |
|---|---|---|
| Model | Qwen3-8B base | Qwen3-8B base (same ckpt) |
| Algorithm | GRPO outcome-only (dense pass-rate reward) | **same** |
| `rollout_batch` | 16 prompts × 8 samples = 128 traj | **same** |
| `lr` | 1e-6 const | **same** |
| `kl_loss_coef` | 0.01 (k3) | **same** |
| `save_interval` | 8 rounds | **same** |
| `num_rollout` target | 2000 | **same** |
| **Training data** | seta_env, 1376 tasks | **TB v0.1.x, 86 tasks (eval-as-train)** |
| Pool configuration | 1024 leases (issue #3) | **same** |
| Agent | Terminus 2 default | **same** |
| SFT warmup | none | **none** |

> Zero changes on the algorithm, agent, or environment side. The only training-relevant change is `ROLLOUT_PROMPT_DATA`. Note that **no SFT warmup is used** (AfterQuery uses an SFT-then-RL pipeline), and the default Terminus 2 scaffolding is unchanged (the leaderboard top entries use agent-side optimized harnesses such as Apex2 / TerminalAgent / Codex CLI).

---

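## 3 Main metric: Codex pass@1 trajectory

### 3.1 Metric definition (identical to issue #8 §1.5)

Following the formula in §2.1 of the [Codex paper (Chen et al. 2021)](https://arxiv.org/abs/2107.03374): for each task compute $c_i / n_i$ (successful trials / total trials), then dataset-level pass@1 = $\frac{1}{|\text{tasks}|} \sum_i c_i / n_i$.

In this run:
- `n_i` = total number of trials sampled for that task over the 320 rollouts (mean ≈ 433 = 37,236 / 86; each round draws 16 prompts × 8 samples = 128 trials, so a given task is not sampled in every round)
- `c_i` = number of those trials with `accuracy >= 0.99` (strict success)
- denominator = **86** (only 85 tasks have trials in the logs, but the denominator stays at the full 86, so the 1 unobserved task counts as 0)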
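
The definition above can be sketched in a few lines. This is a minimal illustration, not the script used for the numbers in this issue: the `(task_name, accuracy)` input format and the function names are assumptions, and the toy data is made up. It also contrasts the macro average (pass@1 as defined here) with the micro average reported in §3.2.

```python
from collections import defaultdict

STRICT_THRESHOLD = 0.99  # strict-success cutoff used throughout this report


def codex_pass_at_1(trials, num_tasks):
    """Macro-averaged pass@1: mean over tasks of c_i / n_i.

    `trials` is an iterable of (task_name, accuracy) pairs (hypothetical format).
    Tasks without any logged trial contribute 0, because the denominator is
    fixed at `num_tasks` rather than the number of observed tasks.
    """
    n = defaultdict(int)  # n_i: trials per task
    c = defaultdict(int)  # c_i: strict successes per task
    for task, accuracy in trials:
        n[task] += 1
        if accuracy >= STRICT_THRESHOLD:
            c[task] += 1
    if len(n) > num_tasks:
        raise ValueError("more observed tasks than num_tasks")
    return sum(c[t] / n[t] for t in n) / num_tasks


def micro_average(trials):
    """Micro average sum(c_i) / sum(n_i); dominated by frequently sampled tasks."""
    trials = list(trials)
    hits = sum(1 for _, accuracy in trials if accuracy >= STRICT_THRESHOLD)
    return hits / len(trials)


# Toy data: task-a solved in 2 of 4 trials, task-b never, task-c never sampled.
toy_trials = [
    ("task-a", 1.0), ("task-a", 1.0), ("task-a", 0.4), ("task-a", 0.0),
    ("task-b", 0.9), ("task-b", 0.0),
]
macro = codex_pass_at_1(toy_trials, num_tasks=3)
micro = micro_average(toy_trials)
print(f"macro pass@1 = {macro:.4f}, micro = {micro:.4f}")
```

Fixing the denominator at the full task count is what makes the unobserved task count as a failure instead of silently dropping out of the average.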

### 3.2 Trajectory (50-rollout sliding window)

![tboverfit_pass1.png](https://github.com/user-attachments/assets/fac73e64-d5a3-46ca-baaf-0b6586dda65e)

| Anchor | rollout | Codex pass@1 |
|---|---:|---:|
| Start (first valid 50w point) | 49 | **2.49%** |
| Peak | **318** | **9.27%** ⭐ |
| Final | 319 | 9.23% |
| All-time micro-avg (sum c / sum n) | - | 5.12% |
| Tasks-ever-solved (pass@∞) | - | 19/86 = 22.09% |

**Key observations**:

1. **From rollout 0 to 320 the model gains +6.78 pp pass@1** (2.49% → 9.27%). The absolute gain is real, **but it starts from a small base and the absolute level stays low**.
2. **Over the 200 rollouts from 50 to 250, pass@1 moves sideways within [4%, 6%]**, and only climbs to 9% after rollout 250. This suggests that in the first 200 rollouts GRPO mostly accumulates partial credit, while strict success (pass-rate=1) is hard to break through to.
3. **vs run-3 iter215 OOD peak 5.6%**: the tboverfit final 9.2% is +3.6 pp higher ≈ **1.65×**. **The eval-as-train setting does remove part of the OOD gap.**
4. **vs Qwen3-32B + TerminalAgent 15.5%**: we are 6.3 pp lower, a factor of ≈ **1.7×**. **A 4× model with a better agent harness already beats our eval-as-train upper bound.**

---

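## 4 vs run-3: trajectories on the same axes

This compares the canonical wandb field `terminal/accuracy` directly (rollout-level mean of per-trial accuracy ∈ [0, 1], the same field used in run-3 issue #4 §3.1).

![tboverfit_vs_run3.png](https://github.com/user-attachments/assets/6182f983-a079-4050-bd83-f2b9a3449052)

| Run | Training data | start (50w) | peak (50w) | final (50w) | rollouts |
|---|---|---:|---:|---:|---:|
| run-3 (issue #4) | seta_env, 1376 tasks | **0.318** | **0.528** ⭐ | 0.517 | 284 |
| tboverfit (this run) | TB v0.1.x, 86 tasks (eval-as-train) | 0.174 | 0.360 | 0.355 | 318 |
| **gap** | - | **-0.144** | **-0.168** | **-0.162** | - |

**Key finding: under eval-as-train, train-side accuracy is clearly lower than run-3's train-side accuracy on seta_env**:

- **Start is 0.14 lower**: tboverfit starts at 0.17 vs 0.32 for run-3, i.e. TB tasks are harder than seta_env tasks, so the model faces a harder distribution from the base ckpt onward
- **Peak is 0.17 lower**: run-3 hits 0.53 on seta_env, while tboverfit only reaches 0.36 on TB v0.1.x even with eval-as-train
- **Final is 0.16 lower**: the gap does not close

This means:

> **Eval-as-train does not "automatically" make train-side accuracy shoot up.** Intuitively, feeding the answer set back in as training data should trivially reach 90%+. In practice, partial credit, multi-step tasks, and the high variance of the GRPO advantage estimate make it **hard for the model to turn "has seen the answer" into "can reliably produce a successful trajectory"**.

run-3's learning curve on the large 1376-task pool is smoother and peaks higher because of:
1. **A large number of tasks** → lower variance in the reward gradient estimate and more stable GRPO advantages
2. **A difficulty gradient across tasks** → easy tasks provide signal and hard tasks do not drag down the whole (a built-in curriculum)
3. **Task diversity** → harder to overfit to a handful of patterns

In contrast, most of the 86 tasks in the tboverfit pool are evaluation-grade hard tasks, so training quickly becomes dominated by the few tasks the model can solve (see the per-task breakdown in §6).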

---

## 5 vs leaderboard: where tboverfit's 9.2% actually sits on TB v0.1.x

![tboverfit_leaderboard.png](https://github.com/user-attachments/assets/9d73f40b-9255-4a7a-b227-7c3377af9e6f)

Data sources: [the survey in issue #8 §2.4-§2.5](https://github.com/HansBug/OpenClaw-RL/issues/8) + [the official tbench.ai v0.1.x leaderboard](https://www.tbench.ai/leaderboard/terminal-bench/1.0).

### 5.1 Within the same base family (Qwen3)

| Model + agent | TB v0.1.x pass@1 | Multiple vs base 0.020 | Multiple vs tboverfit 0.092 |
|---|---:|---:|---:|
| Qwen3-Coder-480B (iFlow CLI) [agent-optimized] | 0.390 | 19.5× | 4.2× |
| Qwen3-Coder-480B (Orchestrator) | 0.197 | 9.85× | 2.1× |
| **Qwen3-32B + TerminalAgent** | **0.155** | 7.75× | **1.7×** |
| **This run: tboverfit (eval-as-train)** | **0.092** | **4.6×** | **1.0×** |
| Qwen3-235B-A22B (Terminus 1) | 0.066 | 3.3× | 0.72× |
| **run-3 iter215 OOD (8B + RL)** | **0.056** | **2.8×** | 0.61× |
| Qwen3-8B base (Terminus 2) | 0.020 | 1.0× | 0.22× |

**Core takeaways**:

1. **eval-as-train 8B (this run) 0.092** > **OOD 8B + RL (run-3) 0.056** > **base 8B 0.020**: on the 8B model RL gives 2.8×, and eval-as-train adds another 1.6×. **In absolute terms 0.02 → 0.06 → 0.09: every step is real, but none of them is large.**
2. **8B (eval-as-train) 0.092** vs **32B + TerminalAgent 0.155**: a 4× model with a better agent harness scores **+6.3 pp** above 8B + eval-as-train. This suggests that **agent scaffolding + model size** is more effective at this scale than **eval-as-train + 8B + default scaffolding**.
3. **8B (eval-as-train) 0.092** vs **Qwen3-235B 0.066**: **a 29× larger model with default Terminus 1 scaffolding scores lower than we do.** This is what takeaway #4 in issue #8 §2.5.4 describes as "**the magnitude of the RL gain is very visible on small models**", provided RL is actually doing effective fine-tuning (the same-family vanilla 235B without RL only reaches 0.066).
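
The multiples in the table are plain ratios of pass@1 scores. As a small sanity check, the sketch below recomputes them from the scores listed in §5.1 and sorts the rows by score. The dictionary keys are shortened labels chosen for this example.

```python
# pass@1 scores copied from the table in section 5.1 (labels shortened here)
scores = {
    "Qwen3-Coder-480B (iFlow CLI)": 0.390,
    "Qwen3-Coder-480B (Orchestrator)": 0.197,
    "Qwen3-32B + TerminalAgent": 0.155,
    "tboverfit (eval-as-train)": 0.092,
    "Qwen3-235B-A22B (Terminus 1)": 0.066,
    "run-3 iter215 OOD (8B + RL)": 0.056,
    "Qwen3-8B base (Terminus 2)": 0.020,
}
BASE = "Qwen3-8B base (Terminus 2)"
THIS_RUN = "tboverfit (eval-as-train)"

# Sort by score so the table order always matches the numbers.
rows = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, score in rows:
    vs_base = score / scores[BASE]
    vs_this = score / scores[THIS_RUN]
    print(f"{name:34s} {score:.3f}  {vs_base:6.2f}x  {vs_this:5.2f}x")
```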

### 5.2 Cross-size comparison (<100B models on TB Core 2.0, issue #8 §2.5.1)

| Model + agent | TB 2.0 pass@1 | Base | Route |
|---|---:|---|---|
| **AfterQuery-GPT-OSS-20B (SFT+RL fine-tune)** | **0.170** | GPT-OSS-20B (21B/3.6B MoE) | ✅ dedicated agentic SFT cold-start + RL |
| GPT-5-Nano (Codex CLI) | 0.115 | undisclosed (closed-source) | ✅ Codex route |
| GPT-OSS-20B (vanilla, Mini-SWE-Agent) | 0.034 | 21B/3.6B MoE | default |

**The AfterQuery comparison is especially instructive**:

- **Similar type of base** (GPT-OSS-20B at 21B vs Qwen3-8B at 8B, a 2.6× size difference)
- **Both open-source, both Apache-2.0**
- Their route is **SFT cold-start (from public datasets) + RL on real-world tasks** → **0.170 (TB 2.0)**
- Ours is **RL directly on eval-as-train, no SFT** → **0.092 (on v0.1.x, treated here as roughly comparable even though TB 2.0 is the harder benchmark)**

**The ~1.85× gap is probably not mainly a model-capacity gap** (the 21B vs 8B capacity difference is unlikely to account for 1.85×), **but a training-pipeline gap: SFT warmup plus dedicated agent tuning**. This is the methodology that [AfterQuery's public blog](https://www.afterquery.com/blog/terminal-bench-improvement) highlights.

### 5.3 vs frontier (top-tier closed-source)

| Model + agent | TB v0.1.x pass@1 | Multiple vs tboverfit |
|---|---:|---:|
| Claude 4.5 Sonnet (Apex2) | 0.645 | **7.0×** |
| Claude Opus 4.1 (Droid) | 0.588 | 6.4× |
| GPT-5 (Droid) | 0.525 | 5.7× |
| Claude Sonnet 4.5 (Terminus 2) | 0.510 | 5.5× |

Frontier systems combine very large models, dedicated agent SFT/RL, and the strongest agent scaffolding in the industry (Apex2 and Droid are both in-house harnesses), and reach 51-65% on v0.1.x. **The 7× gap between 0.092 and 0.645 is the combined gap between an open-source 8B with minimal post-training and default scaffolding and all of the above**. It cannot be attributed to a single factor.

---

## 6 Per-task breakdown: of 86 tasks, 19 ever solved, 6 solved at a high rate, 67 never solved

![tboverfit_per_task.png](https://github.com/user-attachments/assets/53b21fde-273e-4643-a383-6b09afaf4447)

The **~37,200 trials** from 320 rollouts over the 86 tasks are aggregated by task (parsed from logs; about 9% of the nominal 320 × 128 = 40,960 trials have no evaluation line because of timeouts), and the strict pass rate (`accuracy >= 0.99`) is computed per task:

### 6.1 High pass rate (≥ 40%): 6 tasks

| Task | strict success / total trials | strict pass rate | Type |
|---|---:|---:|---|
| swe-bench-fsspec | 467 / 467 | **100.0%** | SWE patch-style, possibly has a shortcut solution (see §6.5) |
| broken-python | 1 / 1 | **100.0%** | single trial, low confidence |
| fix-permissions | 371 / 400 | 92.8% | single-step chmod, easy |
| fibonacci-server | 210 / 378 | 55.6% | single flask endpoint, trivial |
| hello-world | 216 / 478 | 45.2% | write a file |
| vim-terminal-task | 197 / 480 | 41.0% | fixed pattern of vim commands |

### 6.2 Medium pass rate (10-40%): 3 tasks

| Task | strict success / total trials | strict pass rate |
|---|---:|---:|
| qemu-startup | 183 / 478 | 38.3% |
| recover-obfuscated-files | 121 / 409 | 29.6% |
| get-bitcoin-nodes | 63 / 458 | 13.8% |

### 6.3 Low pass rate (< 10%): 10 tasks

prove-plus-comm 2.4% / heterogeneous-dates 1.5% / csv-to-parquet 1.4% / chem-property-targeting 0.8% / sqlite-with-gcov 0.5%, and others: occasional strict successes only.

### 6.4 NEVER strict-solved: **67 / 86 tasks**

Including, but not limited to:

- All swe-bench-astropy-* / swe-bench-django and similar SWE patch tasks (some reach partial credit > 0.86 but never 1.0: the model gets part of the tests passing but never produces the complete patch)
- tmux-advanced-workflow / decommissioning-service-with-sensitive-data (multi-step, require security awareness)
- build-linux-kernel-qemu / blind-maze-explorer-5x5 (multi-step planning)
- incompatible-python-fasttext.* / nginx-request-logging (environment configuration)
- chess-best-move / cartpole-rl-training (require algorithmic reasoning)

### 6.5 Key observations

1. **84h of training concentrates strict passes on 6 easy tasks** (swe-bench-fsspec at 100% is an anomalous outlier, broken-python has only 1 trial and is low-confidence, and the remaining 4 are trivial single-step tasks).
2. **Partial credit is common but does not reach 1.0**: many tasks have a mean reward > 0.5 (swe-bench-astropy-1 mean 0.867, swe-bench-astropy-2 mean 0.889) with 0 strict successes. The model collects most of the partial reward but fails the last step.
3. **Eval-as-train shows no "breadth learning"**: the model neither learns a representative subset of the 86 tasks nor transfers from easy tasks to hard ones. **It overfits in a concentrated way instead of generalizing.**
4. **The 100% pass rate on swe-bench-fsspec deserves a separate investigation**: most likely the task's verifier gives an overly lenient reward signal (or the task itself has a trivial solution). It should not be read as the model actually solving an SWE bug-fix task.
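
To make "single trial, low confidence" concrete, the sketch below attaches a Wilson score interval to the per-task counts from §6.1 and §6.2. The interval is an addition for illustration and is not part of the original analysis; it also treats trials as independent draws from a fixed policy, which is only a rough approximation here because the policy changes during training.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# (strict successes, total trials) copied from the tables in 6.1 and 6.2
per_task = {
    "swe-bench-fsspec": (467, 467),
    "broken-python": (1, 1),
    "fix-permissions": (371, 400),
    "fibonacci-server": (210, 378),
    "hello-world": (216, 478),
    "vim-terminal-task": (197, 480),
    "qemu-startup": (183, 478),
    "recover-obfuscated-files": (121, 409),
    "get-bitcoin-nodes": (63, 458),
}

for task, (c, n) in per_task.items():
    low, high = wilson_interval(c, n)
    print(f"{task:26s} {c / n:6.1%}  interval [{low:.3f}, {high:.3f}]")
```

Both swe-bench-fsspec and broken-python show a 100% point estimate, but only the former has a tight interval. With a single trial the lower bound drops to roughly 0.2, which is why broken-python is excluded from the "effectively high-frequency" tasks.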

---

## 7 Reward sparsity: why "training side healthy" ≠ "learning"

![tboverfit_reward_dist.png](https://github.com/user-attachments/assets/3eda992e-fe8a-4f4c-95e5-71505f20c0a1)

Distribution of per-trial accuracy over the 37,236 trials:

| Range | Count | Share |
|---|---:|---:|
| `accuracy < 0.05` (essentially complete failure) | 19,933 | **53.5%** |
| `accuracy ∈ [0.05, 0.99)` (partial credit) | 15,396 | **41.3%** |
| `accuracy >= 0.99` (strict success) | 1,907 | **5.1%** |
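
The shares follow directly from the counts. A minimal sketch of the bucketing and the share arithmetic, using the thresholds and counts from the table (the function name and bucket labels are chosen for this example):

```python
def reward_bucket(accuracy):
    """Assign a per-trial accuracy to one of the three ranges used in the table."""
    if accuracy >= 0.99:
        return "strict"
    if accuracy >= 0.05:
        return "partial"
    return "fail"


# Counts copied from the table above
counts = {"fail": 19_933, "partial": 15_396, "strict": 1_907}
total = sum(counts.values())
shares = {name: 100 * count / total for name, count in counts.items()}

for name, share in shares.items():
    print(f"{name:8s} {counts[name]:6d}  {share:5.1f}%")
print(f"total    {total:6d}")
```

Because each share is rounded independently to one decimal, the three percentages do not have to sum to exactly 100.0%.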

**Among the non-zero rewards, partial credit dominates (41.3% of all trials).** That looks like good news, since the model earns part of the score on many tasks, but its effect on GRPO is mixed:

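### 7.1 High variance in the GRPO advantage distribution

GRPO (outcome-only) standardizes raw_reward into an advantage: $A_i = \frac{r_i - \mu}{\sigma + \epsilon}$, where $\mu, \sigma$ are the mean and standard deviation of the rewards that are normalized together. Standard GRPO normalizes within the group of samples drawn for one prompt (8 per prompt here), while the rough estimate below uses statistics over the whole 128-trial batch. When:

- **53.5% of trials have r=0, 5.1% have r=1, and the remaining 41.3% are spread over (0, 1)**
- $\mu \approx 0.05$ within the batch (this is the strict-success micro-average; the mean per-trial accuracy in §8.2 is higher, 0.17 to 0.36)
- $\sigma$ is on the order of [0.2, 0.4]

the advantage picture is roughly as follows: failed trials get a weak negative of about -0.1 to -0.2, partial trials get a moderate positive of about +0.5 to +1.5, and strict successes get a strong positive of about +1.5 to +2.0. **The problem is that the advantage gap between an "almost successful" partial-credit trial and a "fully successful" one is small**, whereas outcome-only RL really wants to optimize the latter.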
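
The normalization can be illustrated with a small sketch. It assumes per-group normalization with a population standard deviation and a hypothetical `eps`; the actual training code may differ in both. The reward values are made up to mimic a mixed group and an all-zero group.

```python
import statistics

EPS = 1e-6  # assumed stabilizer; the value used by the training code is not stated here


def group_advantages(rewards, eps=EPS):
    """Standardize rewards within one group: A_i = (r_i - mu) / (sigma + eps)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group: four failures, three partial-credit trials, one strict success.
mixed = [0.0, 0.0, 0.0, 0.0, 0.5, 0.85, 0.9, 1.0]
adv_mixed = group_advantages(mixed)

# A group where every sample fails: sigma is 0, so every advantage is 0.
adv_all_zero = group_advantages([0.0] * 8)

for r, a in zip(mixed, adv_mixed):
    print(f"reward {r:4.2f} -> advantage {a:+.3f}")
print("all-zero group:", adv_all_zero)
```

Two things are visible in the output. The advantage difference between a 0.9 trial and a 1.0 trial is much smaller than the difference between a failure and a 0.9 trial, which is the "almost succeeded vs. fully succeeded" problem described above. And a group with identical rewards produces all-zero advantages, i.e. no gradient signal, which is what `non_trainable_ratio` tracks.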

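### 7.2 Comparison with issue #2 / run-3

- **issue #2** mode collapse: response_len → 7 / non_trainable_ratio → 0.92. The cause was rewards that were all 0 or all 1 within a batch → all advantages 0 → no gradient.
- **run-3** did not collapse because in the large 1376-task pool there are always some tasks that reach r=1, so the advantage distribution stays reasonable.
- **This run** sits in a middle state of "not collapsing but not learning either": the reward signal is not all zero (partial credit accounts for 41.3%), but **the r=1 trials that actually drive learning are only 5.1%** and are highly concentrated on 6 tasks. GRPO therefore keeps optimizing "getting those few easy tasks right" and has no effective gradient for the other 80 tasks.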

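### 7.3 What "healthy on the training side" means precisely

| Metric | Value | Health criterion | What it actually means |
|---|---:|---|---|
| `grad_norm` mean | 0.564 | not exploding (>50) / not vanishing (<0.05) | the gradient is moving |
| `kl_loss` mean | 0.089 | not drifting too far from ref (>0.5) | the policy has not been pulled away |
| `non_trainable_ratio` | mean 0.079 (0.034 before wandb step 225, 0.227 in rollouts 250-299) | low (<0.05), i.e. few groups with all-identical reward | most batches carry signal, except in the late window |
| `response_len` 50w | [87, 283] | not degenerate (<30) | not mode collapse |
| `entropy_loss` | 0.04-0.08 | stable | exploration is ok |

**Mostly green = the training setup did not break and the policy moves within a reasonable range** (the exception is the late rise of `non_trainable_ratio`, see §8). But **"moving" ≠ "learning the distribution of the 86 tasks"**:

- ❌ **No evidence of learned task diversity**: 5 tasks dominate, which is not healthy distribution learning
- ❌ **No evidence of OOD generalization**: this is eval-as-train, so generalization is not exercised in the first place
- ❌ **No evidence of hard-task learning**: 67/86 tasks have 0 strict successes
- ✅ **Evidence of a stable setup**: it can run for 84h without crashing

→ **Good-looking training-side curves do not imply that the model has reached its ceiling.** "Training side healthy" only means that the **infrastructure works**. Which **distribution the policy converges to** is a separate question.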

---

## 8 Full-run dashboard (same 9 figures as run-1/2/3, data source: wandb fdhgc9j7)

### 8.1 Overview (00_dashboard.png)

![00_dashboard.png](https://github.com/user-attachments/assets/f13f5ce6-9fd3-4fe7-99d1-1076cd77fd67)

A 3×3 grid in the same format as the dashboards in issue #1 / #2 / #4: 8 metric panels (each with per-rollout raw values, a 50w sliding mean, a 100w sliding mean, and threshold lines) plus a summary text box in the bottom-right corner. **All data comes from the canonical metrics of wandb run `fdhgc9j7`** (not estimated from log parsing).
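
For reference, the 50w / 100w curves are sliding means over per-rollout values. The sketch below assumes a trailing window that only emits a value once the window is full, which would be consistent with the first valid 50w point sitting at rollout 49 in §3.2; the exact windowing used for the plots is not stated in this issue. The input series is synthetic.

```python
def sliding_mean(values, window=50):
    """Trailing mean over `window` points; None until the window is full."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the point that left the window
        out.append(running / window if i >= window - 1 else None)
    return out


# Synthetic per-rollout metric, only to show the indexing behaviour.
series = [float(i) for i in range(100)]
smoothed = sliding_mean(series, window=50)
first_valid = next(i for i, v in enumerate(smoothed) if v is not None)
print("first valid index:", first_valid)
print("value at index 49:", smoothed[49])
print("value at index 99:", smoothed[99])
```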

### 8.2 Per-50-rollout buckets (same convention as issue #4 §3.1, wandb data)

| Bucket | `terminal/accuracy` | `terminal/reward_mean` | `terminal/non_trainable_ratio` | `rollout/response_lengths` |
|---|---:|---:|---:|---:|
| rollout 0-49 | 0.174 | -0.653 | **3.41%** | 151.2 |
| 50-99 | 0.222 | -0.556 | 5.50% | 182.9 |
| 100-149 | 0.249 | -0.502 | 3.20% | 197.8 |
| 150-199 | 0.281 | -0.438 | 5.33% | 193.4 |
| 200-249 | 0.318 | -0.363 | 7.49% | 260.1 |
| **250-299** | **0.355** | -0.291 | **22.65% ⚠️** | 264.3 |
| 300-319 | 0.349 | -0.301 | 7.91% | 261.7 |

> Compared with [the equivalent table for run-3 in issue #4 §3.1](https://github.com/HansBug/OpenClaw-RL/issues/4): run-3's 50w `terminal/accuracy` goes from 0.318 to 0.522 (+0.20), and its `non_trainable_ratio` stays within 0.017-0.028 (< 3%) throughout. In this run `terminal/accuracy` goes from 0.174 to 0.355 (+0.18, **absolute level 0.17 lower**). `non_trainable_ratio` averages 4.4% over the first 200 rollouts (close to run-3's < 3% tier), and then **degrades sharply to 22.65% in rollouts 250-299**. The first single-step value > 0.5 appears at wandb step 433; note that wandb steps are not rollout indices, since step 433 exceeds the 320 rollouts. The mean before wandb step 225 is 3.36% (≈ run-3 level) and the mean after wandb step 225 is 8.68%, including 7 spikes > 0.5 (see panel 07 in §8.3 and §8.4).
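
The bucket table can be cross-checked against the whole-run mean of `non_trainable_ratio` quoted in §8.3. The sketch below weights each bucket by its number of rollouts (the last bucket has only 20); values are copied from the table, and equal weight per rollout within a bucket is assumed.

```python
# (rollouts in bucket, non_trainable_ratio in %) copied from the table above
buckets = [
    (50, 3.41),   # rollout 0-49
    (50, 5.50),   # 50-99
    (50, 3.20),   # 100-149
    (50, 5.33),   # 150-199
    (50, 7.49),   # 200-249
    (50, 22.65),  # 250-299
    (20, 7.91),   # 300-319 (short bucket)
]


def weighted_mean(rows):
    """Mean of bucket values weighted by the number of rollouts per bucket."""
    total_rollouts = sum(n for n, _ in rows)
    return sum(n * value for n, value in rows) / total_rollouts


total_rollouts = sum(n for n, _ in buckets)
overall = weighted_mean(buckets)
first_200 = weighted_mean(buckets[:4])
print(f"rollouts: {total_rollouts}")
print(f"overall mean:   {overall:.2f}%")
print(f"first 200 mean: {first_200:.2f}%")
```

The overall figure reproduces the whole-run mean of 7.93%, and the first four buckets give the 4.4% quoted for the first 200 rollouts.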

### 8.3 Per-metric figures (4×2 table)

| | |
|---|---|
| **01: terminal/accuracy** | **02: terminal/reward_mean** |
| ![01_accuracy.png](https://github.com/user-attachments/assets/59903af2-09d8-42c3-9888-1c520531eda3) | ![02_reward_mean.png](https://github.com/user-attachments/assets/6fc08b03-ad4b-47d9-be08-b33a9ea53236) |
| 50w start=0.174 → peak=0.360 → final=0.355; gap vs run-3's 0.522 = **-0.17**. The curve has the same shape as run-3's but sits lower | 50w range [-0.91, +0.13]; mostly negative but rising overall; still more than 0.1 short of run-3's late-run +0.05 |
| **03: rollout/raw_reward** | **04: train/grad_norm** |
| ![03_raw_reward.png](https://github.com/user-attachments/assets/387970e7-3d0b-4e4b-bb2f-8131133332f9) | ![04_grad_norm.png](https://github.com/user-attachments/assets/201de1b2-25ef-486e-8de9-7ab47612f06a) |
| `raw_reward` is the per-trial post-discount reward; over all 320 rollouts the 50w mean never stays positive | Range [0.000, 2.069], mean 0.564, inside the healthy [0.05, 50] band throughout; only the half-empty-batch anomaly at step 318 touches the floor |
| **05: train/kl_loss** | **06: train/entropy_loss** |
| ![05_kl_loss.png](https://github.com/user-attachments/assets/57911a5f-f41d-4faf-ae8f-3564adbf1178) | ![06_entropy_loss.png](https://github.com/user-attachments/assets/6007c2d0-9cb3-4537-bde3-b8bb6fc9605e) |
| Range [0.000, 0.237], mean 0.089, never reaches the 0.5 threshold | Stable at 0.04-0.08, no entropy collapse |
| **07: terminal/non_trainable_ratio ⚠️** | **08: rollout/response_lengths** |
| ![07_non_trainable_ratio.png](https://github.com/user-attachments/assets/f93bae39-6d7b-4525-88f3-3018ee5eeb76) | ![08_response_len.png](https://github.com/user-attachments/assets/8a02e6fe-0816-4964-9012-a059047fe8e0) |
| **mean 7.93% / mean before wandb step 225 3.36% (≈ run-3 level) / mean after wandb step 225 8.68%** ⚠️. First value > 0.5 at step 433; 7/320 steps in total are > 0.5, i.e. intermittent batches where rewards are identical within each group, so the batch has no advantage signal | 50w range [87, 283], never below 30, no mode collapse. About 150 tokens early, about 260 tokens late |

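### 8.4 Qualitative conclusion for the training side

> **The infrastructure is healthy throughout (grad_norm / kl_loss / entropy / response_len never cross their thresholds). `non_trainable_ratio` is level with run-3 before wandb step 225 (3.4% vs run-3's < 3%), and afterwards rises to a mean of 8.7% with intermittent 100% spikes. The policy did not collapse. The reward signal simply became sparser in the mid-to-late run (partial credit dominates, while strict success (pass-rate=1) is rare and concentrated on 5 tasks). The pass@1 drifting within [4%, 9%] in §3 and the "overfit on 5 tasks instead of breadth learning over 86 tasks" in §6 are direct consequences of this reward sparsity.**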

---

## 9 Forced stop: NFS `EDQUOT` (lessons kept + ckpt cleanup)

### 9.1 Timeline

| UTC time | Event |
|---|---|
| 2026-05-02 12:20 | Training starts |
| 2026-05-05 11:55:42 | Step 319 finishes training; writing of the `iter_0000319` ckpt begins |
| 2026-05-05 11:55:57 | Last RolloutManager log line (rollout 320 at 56/128) |
| 2026-05-05 11:57:18 | `filesystem_async.py:326 - Local process 0 encountered an error: [Errno 122] Disk quota exceeded` |
| 2026-05-05 11:57+ | All 8 Megatron ranks exit; GPU memory 70 GB → 0 MiB; the pool container is still alive but training is dead |
| 2026-05-05 12:22 | User first notices that training is stale (30-min stall detection fires) |
| 2026-05-06 02:13 | Cleanup of 35 ckpts (34 intermediate ones + the corrupt iter_0000319) → 3.5 TB freed |

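### 9.2 Root cause

- `save_interval=8`, starting from iter_0000007 → 40 ckpts in total (the last one, iter_0000319, incomplete)
- Each is ~104 GB → **39 × 104 GB + 38 GB ≈ 4.1 TB** on disk
- NFS quota is 25 TB (shared) → `EDQUOT` (errno 122, "Disk quota exceeded") while writing iter_0000319
- iter_0000319 only wrote 38 GB (vs the normal 104 GB) → corrupt, deleted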
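
The checkpoint arithmetic can be reproduced from the save schedule. This is a back-of-the-envelope sketch using the approximate sizes quoted in this section (104 GB per full ckpt, 38 GB for the partial one), so the totals are only as accurate as those sizes.

```python
SAVE_INTERVAL = 8
FIRST_ITER = 7
LAST_ITER = 319          # the ckpt that was being written when the quota ran out
CKPT_GB = 104            # approximate size of one complete ckpt
PARTIAL_GB = 38          # bytes actually written for iter_0000319
KEPT = [7, 103, 199, 303, 311]

# Every iteration at which a ckpt save was triggered.
save_iters = list(range(FIRST_ITER, LAST_ITER + 1, SAVE_INTERVAL))
complete = [it for it in save_iters if it != LAST_ITER]
deleted_intermediate = [it for it in complete if it not in KEPT]

on_disk_gb = len(complete) * CKPT_GB + PARTIAL_GB
freed_gb = len(deleted_intermediate) * CKPT_GB + PARTIAL_GB
kept_gb = len(KEPT) * CKPT_GB

print(f"saves triggered:       {len(save_iters)}")
print(f"complete ckpts:        {len(complete)}")
print(f"intermediate deleted:  {len(deleted_intermediate)} (+1 corrupt)")
print(f"on disk at failure:    ~{on_disk_gb} GB")
print(f"freed by cleanup:      ~{freed_gb} GB")
print(f"kept:                  ~{kept_gb} GB")
```

All five retained iterations fall on the save schedule, and the freed amount is in line with the ~3.5 TB reported in the timeline.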

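### 9.3 Safeguards for future runs

1. Default the launch script to `--save-interval 24` (keeps a single 8B run of this length within ~15 ckpts)
2. Add rolling deletion via `--ckpt-keep-last 5` (to be confirmed whether slime has such an argument)
3. Add NFS quota monitoring to the pre-flight checks (a `df -h` warning before training starts)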
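
Items 2 and 3 could be prototyped outside the training framework. The sketch below is illustrative only: `enough_space` and `prune_checkpoints` are hypothetical helpers, not existing slime options, and the 1.2 safety factor is an arbitrary choice. `shutil.disk_usage` reports filesystem-level free space, so a per-user or per-project NFS quota may be tighter than what it shows and would need its own check.

```python
import shutil
import tempfile
from pathlib import Path


def enough_space(free_bytes, ckpt_bytes, planned_ckpts, safety=1.2):
    """Pre-flight check: is there room for the planned ckpts plus a margin?"""
    return free_bytes >= ckpt_bytes * planned_ckpts * safety


def prune_checkpoints(ckpt_dir, keep_last):
    """Delete all but the newest `keep_last` iter_* directories; return deleted names."""
    # Zero-padded names (iter_0000007, ...) sort correctly as plain strings.
    ckpts = sorted(p for p in Path(ckpt_dir).glob("iter_*") if p.is_dir())
    doomed = ckpts[:-keep_last] if keep_last > 0 else ckpts
    for path in doomed:
        shutil.rmtree(path)
    return [path.name for path in doomed]


# Demo on a temporary directory with empty stand-in ckpt folders.
with tempfile.TemporaryDirectory() as tmp:
    for it in (7, 15, 23, 31):
        (Path(tmp) / f"iter_{it:07d}").mkdir()
    free_bytes = shutil.disk_usage(tmp).free
    print("room for 15 ckpts of 104 GB:", enough_space(free_bytes, 104 * 10**9, 15))
    removed = prune_checkpoints(tmp, keep_last=2)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print("removed:  ", removed)
    print("remaining:", remaining)
```

A plain keep-last policy would also delete milestone ckpts such as `iter_0000103`, so a real version would likely need an allow-list of iterations to preserve.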

### 9.4 Pool side: 84h+ with 0 fatal errors

Over the whole 84h, `sudo docker logs openclaw_pool_server` shows occasional matches for `500 Internal | Too many open | address pools` but **no fatal errors**. The pool configuration (1024 leases) is identical to issue #3 / #4.

---

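## 10 Conclusions + follow-up actions

### 10.1 Takeaways

1. **Eval-as-train gives an upper bound of ≈ 9.2% Codex pass@1 for 8B+RL** (standard definition, see §3). This is the ceiling once the OOD gap is removed. Compared with the run-3 iter215 OOD peak of 5.6%, that is **+3.6 pp / 1.65×**: the OOD gap is real but does not explain everything.
2. **`terminal/accuracy` (canonical wandb metric), 50w**: tboverfit 0.174 → 0.360 → 0.355 vs run-3 0.318 → 0.528 → 0.517 = **a gap of 0.16-0.17**. On the same axes, the eval-as-train train-side accuracy is **clearly lower** than run-3's train-side accuracy on seta_env.
3. **The internal structure of the 9.2% is "concentrated overfitting on 6 easy tasks", not "breadth learning"**: 19/86 tasks were ever solved, 6 have a pass rate ≥ 40% (including 1 low-confidence single-trial outlier), and 67 were never solved.
4. **The gap between 8B + default setup and SOTA comes from (1) base model size, (2) agent scaffolding, (3) SFT warmup, and (4) training data + RL pipeline.** Eval-as-train only addresses the data part of (4) and cannot fix the other three.
5. **Healthy training-side infrastructure ≠ effective training**: grad_norm / kl_loss / entropy / response_len never cross their thresholds. But `non_trainable_ratio` averages 3.4% before wandb step 225 (≈ run-3's < 3% tier), **rises to 22.65% in rollouts 250-299**, and averages 8.7% after wandb step 225, including 7 single-step spikes > 0.5. The reward signal is 53.5% all-zero, 41.3% partial, and 5.1% strict success (highly concentrated), so the GRPO advantage estimate has a low signal-to-noise ratio.
6. **vs open-source references of the same size or tier**: AfterQuery GPT-OSS-20B (SFT+RL) reaches 0.170 on TB Core 2.0, and Qwen3-32B + TerminalAgent reaches 0.155 on v0.1.x. This run's 9.2% is 1.7-1.85× below both, **which suggests that with a better setup this size class can reach roughly 0.15-0.20 rather than 0.09**.
7. **The ckpts from this run must not be submitted to TB** (contaminated), but they can serve as (a) a control sample for the same base under a different training pipeline, (b) an ablation baseline for a later SFT-then-RL pipeline, and (c) a baseline for reward-shaping experiments.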

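### 10.2 Follow-up actions (partly covered by the issue #6 roadmap)

- [ ] **Hard prerequisite**: add `--save-interval 24 --ckpt-keep-last 5` to the launch script (the latter pending confirmation that slime supports it, see §9.3)
- [ ] **Direction 1: SFT cold-start**: following the AfterQuery route, first SFT on a public agent-trajectory dataset, then RL on a TB-related domain
- [ ] **Direction 2: agent scaffolding**: replace the default Terminus 2 with TerminalAgent or an in-house prompt stack and see how far 0.092 moves
- [ ] **Direction 3: reward shaping**: add step-level rewards (per-tool-call / per-test-pass) on top of the outcome reward and see whether this breaks the 41.3% partial-credit plateau
- [ ] **Direction 4: larger models**: rerun the same setup on Qwen3-14B / 32B to locate the actual capacity bound
- [ ] **Direction 5: reward verifier audit**: the 100% pass rate on swe-bench-fsspec is anomalous, so check whether the verifier admits a shortcut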

### 10.3 Paths of the retained ckpts

```
/nfs/terminal-rl-workspace/OpenClaw-RL/terminal-rl/ckpt/qwen3-8b-tboverfit/
├── iter_0000007/   # starting point (within Phase 1)
├── iter_0000103/   # end of Phase 3 (within the pass@1 plateau)
├── iter_0000199/   # Phase 4, slow recovery
├── iter_0000303/   # Phase 5, early second plateau
├── iter_0000311/   # last healthy ckpt (near the Phase 5 apex)
└── latest_checkpointed_iteration.txt → 311
```

Each is ~104 GB, ≈ 515 GB in total. `iter_0000319` was only partially written (38 GB) because of `EDQUOT` and has been deleted.



