fix(opd): score teacher logprobs at rollout temperature, not 0

EazyReal · claude · EazyReal · commit 945adff90845 · 2026-06-15T20:43:52.000Z
The on-policy-distillation (OPD) teacher reward_func scored teacher log-probs via
SGLang with a hardcoded `temperature: 0`. SGLang applies temperature scaling when
computing input_token_logprobs (compute_temp_top_p_normalized_logprobs), and the
student log-probs are scaled by rollout_temperature (get_responses). So when
rollout_temperature != 1, the OPD reverse-KL term (student - teacher) compares
log-probs at different effective temperatures and is biased.

Score the teacher at rollout_temperature so both sides of the KL match. No change
at the default rollout_temperature=1.0.
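
A minimal sketch of the mismatch, not taken from the slime or SGLang code: the
logits, the token index, and the log_softmax helper are hypothetical, and the
teacher is scored at temperature 1.0 only as a stand-in for any temperature that
differs from rollout_temperature. Student and teacher share the same logits, so
the per-token gap should be zero; it is zero only when both sides use the same
temperature.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


# Hypothetical next-token logits, identical for student and teacher.
logits = [2.0, 1.0, 0.0]
token = 0  # index of the sampled token
rollout_temperature = 0.7

student = log_softmax(logits, rollout_temperature)[token]
teacher_matched = log_softmax(logits, rollout_temperature)[token]
teacher_mismatched = log_softmax(logits, 1.0)[token]

# Per-token reverse-KL term (student - teacher).
gap_matched = student - teacher_matched        # zero: same temperature
gap_mismatched = student - teacher_mismatched  # non-zero: spurious signal

print(f"matched:    {gap_matched:.4f}")
print(f"mismatched: {gap_mismatched:.4f}")
```

The non-zero gap in the mismatched case comes purely from the temperature
difference, which is the bias this change removes.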

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/slime/rollout/on_policy_distillation.py b/slime/rollout/on_policy_distillation.py
@@ -10,7 +10,11 @@ async def reward_func(args, sample, **kwargs):
         # "text": sample.prompt + sample.response,
         "input_ids": sample.tokens,
         "sampling_params": {
-            "temperature": 0,
+            # Score teacher log-probs at rollout_temperature: SGLang scales
+            # input_token_logprobs by the sampling temperature, and the student
+            # log-probs are temperature-scaled too (get_responses), so the OPD KL is
+            # only consistent when both are at the same temperature.
+            "temperature": getattr(args, "rollout_temperature", 1.0),
             "max_new_tokens": 0,
             "skip_special_tokens": False,
         },