fix(opd): score teacher logprobs at rollout temperature, not 0 by EazyReal · Pull Request #2085 · THUDM/slime

EazyReal · 2026-06-15T20:44:46Z

What changed

Score the on-policy-distillation teacher log-probs at rollout_temperature instead of a hardcoded temperature: 0 in slime/rollout/on_policy_distillation.py.
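As a rough illustration of the change, the sketch below builds a teacher scoring request that takes its temperature from the rollout setting instead of a constant. The helper name `build_teacher_payload` is hypothetical, and the field layout is an assumption based on SGLang's native `/generate` request format; the actual code in slime/rollout/on_policy_distillation.py may differ.

```python
def build_teacher_payload(input_ids, rollout_temperature):
    """Hypothetical helper: request log-probs for an existing token sequence.

    The field names below are assumed from SGLang's native /generate API;
    they are not copied from the slime source.
    """
    return {
        "input_ids": list(input_ids),
        "sampling_params": {
            # Previously a hardcoded 0; now follows the rollout setting.
            "temperature": rollout_temperature,
            # Score only: no new tokens are generated.
            "max_new_tokens": 0,
        },
        "return_logprob": True,
        # Return log-probs from the first input token onward.
        "logprob_start_len": 0,
    }


payload = build_teacher_payload([1, 2, 3], rollout_temperature=0.8)
```

Passing the temperature in as an argument keeps the teacher request tied to the same value the student rollout uses.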

Why

The OPD reverse-KL term is the student log-prob minus the teacher log-prob. The student log-probs are temperature-scaled by rollout_temperature (get_responses), while the teacher was scored by SGLang at temperature: 0. SGLang scales input_token_logprobs by the sampling temperature, so the two sides of the KL were at mismatched effective temperatures whenever rollout_temperature != 1, which biased the distillation signal. The default rollout_temperature=1.0 path is byte-identical.
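A small numerical sketch of the mismatch, using only the standard library. It gives the student and teacher identical logits, so a correct student - teacher term should be exactly zero. The teacher's old scoring is modeled here as unscaled (temperature 1) log-probs, which is an assumption standing in for whatever effective temperature the hardcoded setting produced; the logits and token index are made up for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-softmax of logits divided by the given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


logits = [2.0, 1.0, 0.0]   # illustrative values, shared by both models
rollout_temperature = 0.8
token = 0                  # index of the sampled token

student = log_softmax(logits, rollout_temperature)[token]
teacher_unscaled = log_softmax(logits, 1.0)[token]  # assumed old behaviour
teacher_matched = log_softmax(logits, rollout_temperature)[token]

mismatched_gap = student - teacher_unscaled  # nonzero although models agree
matched_gap = student - teacher_matched      # zero, as expected
```

With matched temperatures the gap vanishes for identical models; with mismatched temperatures the term is nonzero even though there is nothing to distill, which is the bias described above.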

Validation

The OPD CI test (tests/test_qwen2.5_0.5B_opd_sglang.py) runs at --rollout-temperature 0.8, exercising the changed path.

The on-policy-distillation teacher reward_func scored teacher log-probs via SGLang with a hardcoded temperature=0. SGLang computes input_token_logprobs WITH temperature scaling (compute_temp_top_p_normalized_logprobs), and the student log-probs are temperature-scaled by rollout_temperature (get_responses). So when rollout_temperature != 1 the OPD reverse-KL (student - teacher) compares log-probs at different effective temperatures and is biased. Score the teacher at rollout_temperature so both sides of the KL match. No change at the default rollout_temperature=1.0.

EazyReal · 2026-06-25T08:45:52Z

@zhuzilin could you review this one? OPD teacher scoring was using temperature=0 while rollout samples use --rollout-temperature; this makes the teacher logprobs match the actual sampled distribution instead of silently scoring a greedy distribution.

EazyReal mentioned this pull request Jun 21, 2026

RFC: factor the policy loss into orthogonal axes (advantage × policy-loss × is-level × correction × regularizer) EazyReal/slime#1

Open

EazyReal force-pushed the opd-teacher-temperature branch from 945adff to 6c995f4 Compare June 24, 2026 02:01

fix(opd): use rollout temperature directly

fb9cab9

EazyReal wants to merge 2 commits into THUDM:main from EazyReal:opd-teacher-temperature
