[CI] Add RL-specific Claude review guidance

YanhuiDua · HAOCHENYE · commit 334e80c895c0 · 2026-06-03T20:39:43.000+08:00
diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
@@ -93,8 +93,11 @@
 5. **Performance**: Unnecessary CPU-GPU synchronization (`item()`, `.cpu()` in hot paths), redundant data copies, inefficient collective communication patterns.
 6. **API Contracts**: Do public interface changes maintain backward compatibility? Are deprecation warnings added for breaking changes?
 7. **Resource Cleanup**: Are file handles, NCCL communicators, and CUDA streams properly cleaned up?
-8. **Metric Changes in `ProduceBatchResult`**: Check whether any metrics in `ProduceBatchResult` will be changed significantly. Call out any obvious increases or decreases, and highlight anything that looks abnormal or unexpected.
-9. **Routed Experts Memory Leak Risk**: Review whether this PR could unintentionally retain additional references to routed experts or make their lifecycle/state harder to track, which could eventually result in memory leaks.
+
+### Area-Specific Review Rules
+
+If a PR changes code under `xtuner/v1/rl`, read `.claude/rules/rl_review.md` before commenting or
+posting the final review summary.
 
 ### Review Output Format
 
@@ -130,4 +133,4 @@ APPROVE / REQUEST_CHANGES / COMMENT
 
 ## Rules
 
-Please refer to the docs in `./dev-rules` for the development guidelines.
+Please refer to the docs in `.claude/rules` for the development guidelines.
diff --git a/.claude/rules/rl_review.md b/.claude/rules/rl_review.md
@@ -0,0 +1,80 @@
+# RL Review Guidelines
+
+Use these rules whenever a PR changes code under `xtuner/v1/rl`. They extend the general review
+standards in `.claude/CLAUDE.md`; they do not replace them.
+
+## Required Review Output
+
+Every review for an RL change must state the ProduceBatchResult impact near the top-level summary:
+
+```text
+ProduceBatchResult impact: <specific impact or "not affected">
+```
+
+Also include routed-experts impact when the PR touches routed-experts logic, rollout response
+handling, `RolloutState.extra_fields`, object references, or memory ownership around rollout outputs:
+
+```text
+RoutedExperts impact: <specific impact or "not affected">
+```
+
+Do not leave these impacts implicit in the finding text. If a finding is about rollout status,
+pause/abort/timeout behavior, response handling, producer aggregation, or routed-experts ownership,
+repeat the relevant impact line inside that finding so the downstream effect is visible at the point
+of review.
+
+## ProduceBatchResult Checklist
+
+Review whether the PR can change trainer-visible batch accounting, timing, or reward semantics in
+`ProduceBatchResult`.
+
+Check this area when the PR changes any of these paths:
+
+- `RolloutState.status`, `finish_reason`, or status conversion.
+- Abort, filter, expire, retry, timeout, cancellation, or failure handling.
+- Producer, agent loop, judger, replay buffer, rollout worker, rollout controller, or backend pause
+  cleanup logic.
+- Writers, readers, tests, or fake implementations that construct or consume `ProduceBatchResult`,
+  including trainer-facing code outside `xtuner/v1/rl`.
+
+Name the concrete field-level impact when any of these fields can change:
+
+- Batch status: `status`.
+- Returned groups: `rollout_states`.
+- Generation timing: `group_gen_count`, `group_gen_mean_s`, `group_gen_p50_s`, `group_gen_p99_s`,
+  `group_gen_p99_p50_ratio`, `group_gen_pause_time_s`.
+- Replay-buffer leftovers: `leftover_init`, `leftover_completed`, `leftover_aborted`,
+  `leftover_expired`, `leftover_failed`, `leftover_filtered`.
+- Reward accounting: `raw_rewards_sum`, `raw_rewards_count`.
+- Produced work counters: `produced_samples`, `produced_tokens`, `produce_time_s`.
+- Multi-task aggregation: `task_batch_sizes`, `task_results`.
+
+Common impacts to call out explicitly:
+
+- A sample moving between `ABORTED`, `FAILED`, `FILTERED`, `EXPIRED`, and `COMPLETED` can change the
+  corresponding `leftover_*` counts.
+- Pause, abort, timeout, and cancellation changes can inflate or deflate `group_gen_*` timing,
+  especially `group_gen_pause_time_s`.
+- Reward or filter-path changes can change `raw_rewards_sum`, `raw_rewards_count`,
+  `produced_samples`, and `produced_tokens`.
+
+## RoutedExperts Checklist
+
+Review whether routed-experts ownership and cleanup remain correct.
+
+Check this area when the PR changes any of these paths:
+
+- LMDeploy rollout response handling.
+- `return_routed_experts`, `routed_experts`, `RolloutState.extra_fields`, or object-ref plumbing.
+- Abort, cancellation, timeout, filter, retry, or failure paths before response handling completes.
+- Background tasks, replay-buffer storage, trainer batches, metrics, tests, or fake rollout
+  responses that can retain routed-experts object refs.
+
+For LMDeploy rollout, `rollout_worker` obtains routed-experts object refs from the LMDeploy shared
+store. At that point, ownership moves from LMDeploy to XTuner. A review finding should call out both
+sides of the ownership boundary when relevant:
+
+- Leak before transfer: requests whose routed experts remain in LMDeploy because XTuner never
+  obtains the object refs.
+- Leak after transfer: object refs that XTuner keeps alive too long through `RolloutState`, replay
+  buffer, trainer batches, metrics, fake tests, or background tasks.
diff --git a/.github/workflows/claude-general.yml b/.github/workflows/claude-general.yml
@@ -223,4 +223,3 @@ jobs:
             }
           allowed_non_write_users: "*"
           track_progress: false
-
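
The "Required Review Output" section of `rl_review.md` above asks for fixed-format impact lines, so their presence could in principle be checked mechanically. The following is a minimal sketch under stated assumptions: `find_impacts` and `missing_impacts` are hypothetical helpers that are not part of this commit, XTuner, or the Claude workflow, and the sketch only checks that a line exists, not that the stated impact is accurate.

```python
import re

# Hypothetical checker; the two labels come from rl_review.md, everything else is assumed.
# [ \t]* (rather than \s*) keeps the match on a single line.
IMPACT_LINE = re.compile(
    r"^(ProduceBatchResult|RoutedExperts) impact:[ \t]*(\S.*)$",
    re.MULTILINE,
)


def find_impacts(review_text: str) -> dict[str, str]:
    """Return the first stated impact for each label found in the review text."""
    impacts: dict[str, str] = {}
    for label, value in IMPACT_LINE.findall(review_text):
        # Keep the first occurrence; later repeats inside findings are ignored here.
        impacts.setdefault(label, value.strip())
    return impacts


def missing_impacts(review_text: str, touches_routed_experts: bool) -> list[str]:
    """List the required impact labels that the review text does not state."""
    required = ["ProduceBatchResult"]
    if touches_routed_experts:
        required.append("RoutedExperts")
    found = find_impacts(review_text)
    return [label for label in required if label not in found]


if __name__ == "__main__":
    review = (
        "## Summary\n"
        "ProduceBatchResult impact: leftover_aborted may increase\n"
        "\n"
        "Findings follow.\n"
    )
    print(missing_impacts(review, touches_routed_experts=True))
```

Whether a PR "touches routed-experts logic" is left to the caller in this sketch, since the guideline describes that condition in prose rather than as a path pattern.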
