fix(opd): score teacher logprobs at rollout temperature, not 0 #2085

Open
EazyReal wants to merge 2 commits into THUDM:main from EazyReal:opd-teacher-temperature

@EazyReal EazyReal commented Jun 15, 2026

Contributor

What changed

Score the on-policy-distillation teacher log-probs at rollout_temperature instead of a hardcoded temperature: 0 in slime/rollout/on_policy_distillation.py.
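As a rough illustration of the change described above, the teacher-scoring request could be built along these lines. This is a sketch, not the code in this PR: `build_teacher_scoring_payload` is a hypothetical helper name, and the payload shape assumes SGLang's native `/generate` request fields (`input_ids`, `sampling_params`, `return_logprob`, `logprob_start_len`).

```python
def build_teacher_scoring_payload(token_ids, rollout_temperature):
    """Hypothetical helper: build a prompt-scoring request for the teacher.

    Assumes the teacher is served by SGLang and scored through its native
    /generate endpoint; the real reward_func may build its payload differently.
    """
    return {
        "input_ids": list(token_ids),
        "sampling_params": {
            # Previously a hardcoded 0; now follows the rollout temperature
            # so the teacher is scored under the same scaling as the student.
            "temperature": rollout_temperature,
            # Score the given tokens only; generate nothing new.
            "max_new_tokens": 0,
        },
        # Ask for log-probs of the input tokens, starting from position 0.
        "return_logprob": True,
        "logprob_start_len": 0,
    }


payload = build_teacher_scoring_payload([101, 2023, 2003], rollout_temperature=0.8)
```

With the default `rollout_temperature=1.0`, the helper simply sends `1.0`.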

Why

The OPD reverse-KL estimate is the per-token log-prob difference student - teacher. The student log-probs are temperature-scaled by rollout_temperature (get_responses), while the teacher was scored by SGLang at temperature: 0. SGLang applies the sampling temperature when computing input_token_logprobs, so the two sides of the KL were at mismatched effective temperatures whenever rollout_temperature != 1, which biased the distillation signal. The default rollout_temperature=1.0 path is byte-identical.
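A small self-contained sketch of why mismatched temperatures bias the signal. It does not use slime or SGLang. The logits and the teacher's mismatched temperature of 1.0 are made up for illustration, and the function names are hypothetical. With identical student and teacher logits, the per-token term should be zero, and it only is when both sides use the same temperature.

```python
import math


def log_softmax_at_temperature(logits, temperature):
    """Log-softmax of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def per_token_kl_term(student_logits, teacher_logits, token,
                      student_temp, teacher_temp):
    """Per-token reverse-KL term: student log-prob minus teacher log-prob."""
    s = log_softmax_at_temperature(student_logits, student_temp)[token]
    t = log_softmax_at_temperature(teacher_logits, teacher_temp)[token]
    return s - t


# Illustrative logits; student and teacher are deliberately identical.
logits = [2.0, 1.0, 0.0]

# Both sides at the rollout temperature: the term vanishes, as it should.
matched = per_token_kl_term(logits, logits, 0, 0.8, 0.8)

# Teacher scored at a different effective temperature: a spurious nonzero
# term appears even though the two models agree exactly.
mismatched = per_token_kl_term(logits, logits, 0, 0.8, 1.0)

print(matched, mismatched)
```

The nonzero `mismatched` value comes purely from the temperature gap, which is the bias this PR removes by scoring the teacher at rollout_temperature.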

Validation

The OPD CI test (tests/test_qwen2.5_0.5B_opd_sglang.py) runs at --rollout-temperature 0.8, exercising the changed path.

The on-policy-distillation teacher reward_func scored teacher log-probs via SGLang
with a hardcoded temperature=0. SGLang computes input_token_logprobs WITH
temperature scaling (compute_temp_top_p_normalized_logprobs), and the student
log-probs are temperature-scaled by rollout_temperature (get_responses). So when
rollout_temperature != 1 the OPD reverse-KL (student - teacher) compares log-probs
at different effective temperatures and is biased.

Score the teacher at rollout_temperature so both sides of the KL match. No change
at the default rollout_temperature=1.0.
@EazyReal EazyReal force-pushed the opd-teacher-temperature branch from 945adff to 6c995f4 Compare June 24, 2026 02:01
@EazyReal

Contributor Author

@zhuzilin could you review this one? OPD teacher scoring was using temperature=0 while rollout samples use --rollout-temperature; this makes the teacher logprobs match the actual sampled distribution instead of silently scoring a greedy distribution.
