|
| 1 | +# RL Review Guidelines |
| 2 | + |
| 3 | +Use these rules whenever a PR changes code under `xtuner/v1/rl`. They extend the general review |
| 4 | +standards in `.claude/CLAUDE.md`; they do not replace them. |
| 5 | + |
| 6 | +## Required Review Output |
| 7 | + |
| 8 | +Every review for an RL change must state the ProduceBatchResult impact near the top-level summary: |
| 9 | + |
| 10 | +```text |
| 11 | +ProduceBatchResult impact: <specific impact or "not affected"> |
| 12 | +``` |
| 13 | + |
| 14 | +Also include routed-experts impact when the PR touches routed-experts logic, rollout response |
| 15 | +handling, `RolloutState.extra_fields`, object references, or memory ownership around rollout outputs: |
| 16 | + |
| 17 | +```text |
| 18 | +RoutedExperts impact: <specific impact or "not affected"> |
| 19 | +``` |
| 20 | + |
| 21 | +Do not leave these impacts implicit in the finding text. If a finding is about rollout status, |
| 22 | +pause/abort/timeout behavior, response handling, producer aggregation, or routed-experts ownership, |
| 23 | +repeat the relevant impact line inside that finding so the downstream effect is visible at the point |
| 24 | +of review. |
| 25 | + |
| 26 | +## ProduceBatchResult Checklist |
| 27 | + |
| 28 | +Review whether the PR can change trainer-visible batch accounting, timing, or reward semantics in |
| 29 | +`ProduceBatchResult`. |
| 30 | + |
| 31 | +Check this area when the PR changes any of these paths: |
| 32 | + |
| 33 | +- `RolloutState.status`, `finish_reason`, or status conversion. |
| 34 | +- Abort, filter, expire, retry, timeout, cancellation, or failure handling. |
| 35 | +- Producer, agent loop, judger, replay buffer, rollout worker, rollout controller, or backend pause |
| 36 | + cleanup logic. |
| 37 | +- Writers, readers, tests, or fake implementations that construct or consume `ProduceBatchResult`, |
| 38 | + including trainer-facing code outside `xtuner/v1/rl`. |
| 39 | + |
| 40 | +Name the concrete field-level impact when any of these fields can change: |
| 41 | + |
| 42 | +- Batch status: `status`. |
| 43 | +- Returned groups: `rollout_states`. |
| 44 | +- Generation timing: `group_gen_count`, `group_gen_mean_s`, `group_gen_p50_s`, `group_gen_p99_s`, |
| 45 | + `group_gen_p99_p50_ratio`, `group_gen_pause_time_s`. |
| 46 | +- Replay-buffer leftovers: `leftover_init`, `leftover_completed`, `leftover_aborted`, |
| 47 | + `leftover_expired`, `leftover_failed`, `leftover_filtered`. |
| 48 | +- Reward accounting: `raw_rewards_sum`, `raw_rewards_count`. |
| 49 | +- Produced work counters: `produced_samples`, `produced_tokens`, `produce_time_s`. |
| 50 | +- Multi-task aggregation: `task_batch_sizes`, `task_results`. |
| 51 | + |
| 52 | +Common impacts to call out explicitly: |
| 53 | + |
| 54 | +- A sample moving between `ABORTED`, `FAILED`, `FILTERED`, `EXPIRED`, and `COMPLETED` can change the |
| 55 | + corresponding `leftover_*` counts. |
| 56 | +- Pause, abort, timeout, and cancellation changes can inflate or deflate `group_gen_*` timing, |
| 57 | + especially `group_gen_pause_time_s`. |
| 58 | +- Reward or filter-path changes can change `raw_rewards_sum`, `raw_rewards_count`, |
| 59 | + `produced_samples`, and `produced_tokens`. |
| 60 | + |
| 61 | +## RoutedExperts Checklist |
| 62 | + |
| 63 | +Review whether routed-experts ownership and cleanup remain correct. |
| 64 | + |
| 65 | +Check this area when the PR changes any of these paths: |
| 66 | + |
| 67 | +- LMDeploy rollout response handling. |
| 68 | +- `return_routed_experts`, `routed_experts`, `RolloutState.extra_fields`, or object-ref plumbing. |
| 69 | +- Abort, cancellation, timeout, filter, retry, or failure paths before response handling completes. |
| 70 | +- Background tasks, replay-buffer storage, trainer batches, metrics, tests, or fake rollout |
| 71 | + responses that can retain routed-experts object refs. |
| 72 | + |
| 73 | +For LMDeploy rollout, `rollout_worker` obtains routed-experts object refs from the LMDeploy shared |
| 74 | +store. At that point, ownership moves from LMDeploy to XTuner. A review finding should call out both |
| 75 | +sides of the ownership boundary when relevant: |
| 76 | + |
| 77 | +- Leak before transfer: requests whose routed experts remain in LMDeploy because XTuner never |
| 78 | + obtains the object refs. |
| 79 | +- Leak after transfer: object refs that XTuner keeps alive too long through `RolloutState`, replay |
| 80 | + buffer, trainer batches, metrics, fake tests, or background tasks. |
0 commit comments