Commit 945adff
fix(opd): score teacher logprobs at rollout temperature, not 0
The on-policy-distillation teacher reward_func scored teacher log-probs via SGLang
with a hardcoded `temperature: 0`. SGLang computes input_token_logprobs WITH
temperature scaling (compute_temp_top_p_normalized_logprobs), and the student
log-probs are temperature-scaled by rollout_temperature (get_responses). So when
rollout_temperature != 1 the OPD reverse-KL (student - teacher) compares log-probs
at different effective temperatures and is biased.
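The bias can be reproduced on a toy next-token distribution. The sketch below is illustrative only: the logits, the helper names `log_softmax` and `reverse_kl`, and the mismatched temperature of 1.0 are assumptions, not code from this repository. How the scorer actually treats `temperature: 0` is not modeled here. Student and teacher share identical logits so that any non-zero KL comes purely from the temperature mismatch.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def reverse_kl(student_logits, teacher_logits, student_temp, teacher_temp):
    """Exact KL(student || teacher) over a single next-token distribution."""
    s = log_softmax(student_logits, student_temp)
    t = log_softmax(teacher_logits, teacher_temp)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

# Hypothetical toy logits, shared by student and teacher.
logits = [2.0, 1.0, 0.0, -1.0]
rollout_temperature = 0.7

# Both sides scored at the rollout temperature: identical distributions.
matched = reverse_kl(logits, logits, rollout_temperature, rollout_temperature)

# Teacher scored at a different temperature (1.0 chosen for illustration).
mismatched = reverse_kl(logits, logits, rollout_temperature, 1.0)

print(f"matched:    {matched}")
print(f"mismatched: {mismatched}")
```

With matched temperatures the KL is zero because the two models agree. With mismatched temperatures it is strictly positive even though the models are identical, so the student would be penalized for a difference that exists only in the scoring setup.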
Score the teacher at rollout_temperature so both sides of the KL match. No change
at the default rollout_temperature=1.0.
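A minimal sketch of what a scoring request with a matched temperature could look like follows. It is not the actual change (the diff body is not available here): the function name `build_teacher_score_payload` is hypothetical, and the field names are assumed to follow SGLang's native `/generate` request format (`input_ids`, `sampling_params`, `return_logprob`, `logprob_start_len`).

```python
def build_teacher_score_payload(input_ids, rollout_temperature=1.0):
    """Hypothetical helper: build a scoring-only request for the teacher.

    The temperature is taken from the rollout configuration instead of being
    hardcoded, so teacher and student log-probs share the same scaling.
    """
    return {
        "input_ids": list(input_ids),
        "sampling_params": {
            "temperature": rollout_temperature,  # previously hardcoded to 0
            "max_new_tokens": 0,                 # score the prompt only
        },
        "return_logprob": True,   # ask for log-probs of the input tokens
        "logprob_start_len": 0,   # start scoring from the first token
    }

payload = build_teacher_score_payload([101, 2023, 2003, 102], rollout_temperature=0.7)
print(payload["sampling_params"]["temperature"])
```

Passing the default `rollout_temperature=1.0` yields a request scored at temperature 1.0, consistent with the note that the default configuration is unaffected.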
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent e46ca0a commit 945adff
1 file changed
Lines changed: 5 additions & 1 deletion
[Diff body not preserved in this capture: one line removed at original line 13, five lines added at new lines 13-17.]