Commit 1b70bf3
feat(rl): composable off-policy importance-sampling correction
Expose the current grad-carrying log-probs to the policy-loss TIS hook as
`cur_log_probs`, and add `off_policy_is_function` (in ppo_utils, next to
compute_policy_loss/compute_cispo_loss) -- a truncated-IS correction between the
*current* policy and the *actual rollout generator*: the (detached) weight is
`clip(pi_theta / pi_rollout)` against the real rollout logprob, so one weight
corrects both the train/inference mismatch and async (multi-version) staleness.
The existing TIS hook only had pi_theta_old / pi_rollout, which equals this only
in the single-update-per-rollout limit.
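
The weight described above can be sketched without any framework. This is a minimal illustration only: the function name `truncated_is_weights`, the clip bounds `low`/`high`, and the sample log-probs are assumptions made for the example, not code from this commit. In the real trainer the inputs would be tensors and the weight would be detached from the graph.

```python
import math

def truncated_is_weights(policy_log_probs, rollout_log_probs, low=0.0, high=2.0):
    """Per-token clip(pi / pi_rollout), meant to be used as a constant weight.

    `low` and `high` are illustrative bounds, not values taken from the commit.
    """
    weights = []
    for lp, rollout_lp in zip(policy_log_probs, rollout_log_probs):
        ratio = math.exp(lp - rollout_lp)           # pi / pi_rollout
        weights.append(min(max(ratio, low), high))  # truncate to [low, high]
    return weights

# Hypothetical per-token log-probs for one short response.
rollout = [-1.0, -2.0, -0.5]  # from the actual rollout generator
old = [-1.1, -1.9, -0.5]      # pi_theta_old, snapshot before the update loop
cur = [-0.2, -1.9, -0.9]      # pi_theta after some gradient steps

w_cur = truncated_is_weights(cur, rollout)  # weight described in this commit
w_old = truncated_is_weights(old, rollout)  # what the earlier hook could compute

# On the first step after a rollout, pi_theta == pi_theta_old, so both agree.
w_first_step = truncated_is_weights(old, rollout)
```

Once the policy has moved away from `old`, `w_cur` and `w_old` differ, which is the staleness the single weight is meant to absorb.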
On a plain REINFORCE base (`--advantage-estimator reinforce`) this reproduces the
CISPO surrogate expressed as a correction rather than the dedicated
`compute_cispo_loss` estimator. Existing corrections ignore the new kwarg via **kwargs.
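
A rough sketch of how such a correction might compose with a REINFORCE base, and why older corrections are unaffected by the extra keyword. The hook signature, the function names `legacy_tis` and `current_policy_is`, and the bounds are hypothetical; only `cur_log_probs` is a name taken from the message above.

```python
import math

def legacy_tis(pg_loss, *, old_log_probs, rollout_log_probs, high=2.0, **kwargs):
    """Older-style correction: weight is min(pi_theta_old / pi_rollout, high).

    Extra keywords such as cur_log_probs fall into **kwargs and are ignored.
    """
    return [
        loss * min(math.exp(o - r), high)
        for loss, o, r in zip(pg_loss, old_log_probs, rollout_log_probs)
    ]

def current_policy_is(pg_loss, *, cur_log_probs, rollout_log_probs,
                      low=0.0, high=2.0, **kwargs):
    """Weight is clip(pi_theta / pi_rollout); in a real trainer it is detached."""
    return [
        loss * min(max(math.exp(c - r), low), high)
        for loss, c, r in zip(pg_loss, cur_log_probs, rollout_log_probs)
    ]

# Plain REINFORCE per-token loss: -A * log pi_theta.
advantages = [1.0, -0.5]
cur = [-0.2, -1.9]
old = [-1.1, -1.9]
rollout = [-1.0, -2.0]
pg_loss = [-a * lp for a, lp in zip(advantages, cur)]

# The caller passes the same keyword set to whichever correction is configured.
hook_kwargs = dict(cur_log_probs=cur, old_log_probs=old, rollout_log_probs=rollout)
legacy = legacy_tis(pg_loss, **hook_kwargs)
corrected = current_policy_is(pg_loss, **hook_kwargs)
```

Because the weight is a constant with respect to the gradient, each corrected term has the form `-clip(ratio) * A * log pi_theta`, the shape of the surrogate the message refers to.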
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent e46ca0a commit 1b70bf3
3 files changed
Lines changed: 122 additions & 0 deletions
| File (name not preserved) | Added lines (new numbering) | Additions |
|---|---|---|
| File 1 | 849-851, 925-926, 930 | 6 |
| File 2 | 5, 175-205 | 32 |
| File 3 (new file) | 1-84 | 84 |