Commit efbb291
fix(opd): score teacher logprobs at rollout temperature, not 0
The on-policy-distillation teacher reward_func scored teacher log-probs via SGLang
with a hardcoded temperature=0. SGLang computes input_token_logprobs WITH
temperature scaling (compute_temp_top_p_normalized_logprobs), and the student
log-probs are temperature-scaled by rollout_temperature (get_responses). So when
rollout_temperature != 1 the OPD reverse-KL (student - teacher) compares log-probs
at different effective temperatures and is biased.
Score the teacher at rollout_temperature so both sides of the KL match. No change
at the default rollout_temperature=1.0.
1 parent a2158f1 commit efbb291
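
The mismatch described in the commit message can be illustrated with a small, self-contained sketch. It is not the repository's code. `log_softmax` and `reverse_kl` are hypothetical helpers, the logits are toy values, and the old `temperature=0` scoring is assumed to behave like unscaled (temperature 1) log-probs, which the "no change at the default" remark suggests but the source does not state outright. Student and teacher share identical logits here, so any nonzero KL comes purely from the temperature mismatch.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def reverse_kl(student_logits, teacher_logits, student_temp, teacher_temp):
    """Exact KL(student || teacher), each side scored at its own temperature."""
    s = log_softmax(student_logits, student_temp)
    t = log_softmax(teacher_logits, teacher_temp)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

# Toy setup (assumption): student and teacher are the same model.
logits = [2.0, 1.0, 0.0, -1.0]
rollout_temperature = 0.7

# Before the fix (assumed): teacher scored unscaled, student at rollout temperature.
mismatched = reverse_kl(logits, logits, rollout_temperature, 1.0)

# After the fix: both sides scored at rollout_temperature.
matched = reverse_kl(logits, logits, rollout_temperature, rollout_temperature)

# At the default rollout_temperature=1.0 the two schemes coincide.
default_case = reverse_kl(logits, logits, 1.0, 1.0)

print(f"mismatched KL: {mismatched:.4f}")
print(f"matched KL:    {matched:.4f}")
```

With identical models the matched KL is zero, while the mismatched one is strictly positive, so the distillation signal would push the student even though it already agrees with the teacher.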
1 file changed
Lines changed: 5 additions & 1 deletion
Diff content not preserved. Original line 13 was replaced by new lines 13-17; context lines 10-12 are unchanged, and original lines 14-16 become lines 18-20.