Skip to content

fix: harden OPD teacher logprob scoring#2683

Closed
tim0120 wants to merge 2 commits into
mainfrom
feat/hosted-opd-teacher-logprobs
Closed

fix: harden OPD teacher logprob scoring#2683
tim0120 wants to merge 2 commits into
mainfrom
feat/hosted-opd-teacher-logprobs

Conversation

@tim0120
Copy link
Copy Markdown
Contributor

@tim0120 tim0120 commented Jun 1, 2026

Summary

  • harden OPD teacher logprob scoring on PrimeRL's native vLLM /inference/v1/generate route
  • remove the Prime API /api/v1/generate branch from PrimeRL now that the gateway path is platform/pi-inference-owned
  • select the exact target token prompt logprob and fail on length/token mismatches
  • raise on teacher HTTP errors before parsing prompt-logprob responses
  • document the native token_ids scoring contract in the local config skill

Scope

This PR is PrimeRL-native scoring only. It intentionally does not route hosted Prime API base URLs to /api/v1/generate; that gateway path is owned by platform/pi-inference in PrimeIntellect-ai/platform#2342.

For this PR, teacher scoring expects a vLLM-compatible endpoint that serves /inference/v1/generate with token_ids + prompt_logprobs.

Verification

  • GitHub checks are passing, including reverse_text_rl_opd
  • Local macOS: uv run pytest tests/unit/orchestrator/test_teacher_logprobs.py is blocked by the workspace dependency conflict (prime-pydantic-config:dev pins ruff==0.5.0; prime-rl:dev requires ruff>=0.12.1)
  • Local macOS: UV_NO_SYNC=1 uv run pytest tests/unit/orchestrator/test_teacher_logprobs.py is blocked because pytest is not installed in the unsynced env
  • git diff --check passed for the staged PrimeRL changes before commit

Review Notes

  • Codex CLI review found no issue in the OPD helper/docs cleanup itself; it did flag the unrelated dirty local deps/renderers submodule rewind, which was intentionally not staged or pushed
  • Claude review could not run because the org has hit its monthly usage limit
  • Fresh @codex review trigger comment added

Linear: APR-57


Note

Medium Risk
Changes OPD distillation’s teacher scoring contract and error handling; wrong logprobs would skew training, but scope is isolated to compute_teacher_logprobs with added validation and tests.

Overview
Hardens OPD teacher logprob scoring so the orchestrator only uses PrimeRL’s vLLM-native /inference/v1/generate path with token_ids (prompt + completion) and prompt_logprobs, instead of relying on vLLM’s GenerateResponse pydantic parsing.

compute_teacher_logprobs now builds the request via helpers, parses the raw JSON response, and picks the logprob for each scored token id (not the first alternative in the dict). It errors on length mismatches, missing non-leading entries, missing target token ids, and HTTP failures before parsing. The config skill documents SFT vs OPD flows and the token_ids scoring contract; unit tests cover happy path, validation failures, and 401 handling.

Reviewed by Cursor Bugbot for commit 31b5fd1. Bugbot is set up for automated code reviews on this repo. Configure here.

Copilot AI review requested due to automatic review settings June 1, 2026 21:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates OPD teacher logprob scoring so it works with both hosted Prime API teachers and self-hosted vLLM teachers, and tightens logprob parsing to align per-token logprobs to the exact target token id with explicit validation on response mismatches.

Changes:

  • Route teacher scoring to /api/v1/generate when the teacher client base URL ends with /api/v1, otherwise use vLLM’s /inference/v1/generate.
  • Parse prompt_logprobs from raw JSON and select the logprob corresponding to the actual token id at each position, failing on length/token-id mismatches.
  • Add unit tests for endpoint routing and mismatch handling, and document SFT vs OPD teacher scoring behavior in the config skill doc.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
tests/unit/orchestrator/test_teacher_logprobs.py Extends unit coverage for hosted Prime API teacher routing and stricter logprob response validation.
src/prime_rl/orchestrator/utils.py Implements endpoint selection and token-id-aligned prompt logprob flattening for teacher scoring.
skills/config/SKILL.md Documents SFT vs OPD distillation modes and which teacher endpoints are supported for scoring.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/prime_rl/orchestrator/utils.py Outdated
Comment on lines +134 to +149
@@ -115,20 +145,8 @@ async def _compute_single(client_config: vf.ClientConfig, sample: TrainingSample
},
},
)
response = GenerateResponse.model_validate_json(http_response.content)
# ``prompt_logprobs[i]`` is a ``{token_id: Logprob}`` dict for tokens
# the engine could score, or ``None`` for the leading token which has
# no preceding context. Flatten to ``list[float]`` with 0.0 in the
# unscored slot.
flat: list[float] = []
for entry in response.prompt_logprobs or []:
if not entry:
flat.append(0.0)
continue
first = next(iter(entry.values()))
lp = first.logprob if hasattr(first, "logprob") else first.get("logprob")
flat.append(float(lp) if lp is not None else 0.0)
return flat
response = http_response.json()
return _flatten_prompt_logprobs(response, token_ids)
@tim0120 tim0120 force-pushed the feat/hosted-opd-teacher-logprobs branch 5 times, most recently from 3638f41 to 7a55a28 Compare June 1, 2026 21:37
@tim0120
Copy link
Copy Markdown
Contributor Author

tim0120 commented Jun 1, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Breezy!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@tim0120
Copy link
Copy Markdown
Contributor Author

tim0120 commented Jun 2, 2026

@codex review

@tim0120 tim0120 force-pushed the feat/hosted-opd-teacher-logprobs branch from 9d34387 to 31b5fd1 Compare June 2, 2026 03:36
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Breezy!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 31b5fd1. Configure here.

scored_token_ids: list[int],
) -> tuple[str, dict[str, Any]]:
base = base_url.rstrip("/")
return f"{base.removesuffix('/v1')}/inference/v1/generate", {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing Prime API URL routing for hosted teacher

High Severity

_teacher_generate_request always constructs a /inference/v1/generate URL, but the PR claims to also route to Prime API /api/v1/generate when a hosted teacher base URL is used. With a Prime API base URL like https://api.pinference.ai/api/v1 (as seen in configs/debug/training_modes/sft_external.toml), the function strips /v1 and appends /inference/v1/generate, producing https://api.pinference.ai/api/inference/v1/generate — an incorrect path that would 404. The expected Prime API URL is https://api.pinference.ai/api/v1/generate. No conditional routing exists and no unit test covers a Prime API base URL, despite the PR caveat stating the path is "unit/dry-run covered."

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 31b5fd1. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is stale after the latest scope change. PrimeRL no longer branches to /api/v1/generate; this PR intentionally keeps OPD teacher scoring on the native vLLM /inference/v1/generate route. The Prime Inference /api/v1/generate gateway is platform/pi-inference-owned via PrimeIntellect-ai/platform#2342.

@tim0120
Copy link
Copy Markdown
Contributor Author

tim0120 commented Jun 2, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@tim0120 tim0120 changed the title fix: support hosted OPD teacher logprobs fix: harden OPD teacher logprob scoring Jun 3, 2026
Comment on lines +88 to +225
def test_compute_teacher_logprobs_rejects_wrong_length(monkeypatch):
async def _run():
fake_client = _FakeOpenAIClient(
{
"request_id": "gen-test",
"choices": [],
"prompt_logprobs": [None, {"2": {"logprob": -0.7}}],
"kv_transfer_params": None,
}
)
monkeypatch.setattr(orchestrator_utils, "setup_openai_client", lambda _: fake_client)

sample = SimpleNamespace(
prompt_ids=[1],
prompt_mask=[True],
completion_ids=[2, 3],
completion_mask=[True, True],
completion_logprobs=[-0.1, -0.2],
completion_temperatures=[1.0, 1.0],
env_name="test-env",
)

try:
await orchestrator_utils.compute_teacher_logprobs(
clients=[vf.ClientConfig()],
model_name="teacher-model",
samples=[sample],
)
except ValueError as exc:
assert "teacher prompt_logprobs length != sample length" in str(exc)
else:
raise AssertionError("Expected ValueError")

asyncio.run(_run())


def test_compute_teacher_logprobs_rejects_missing_token_id(monkeypatch):
async def _run():
fake_client = _FakeOpenAIClient(
{
"request_id": "gen-test",
"choices": [],
"prompt_logprobs": [None, {"13": {"logprob": -0.1}}, {"3": {"logprob": -0.3}}],
"kv_transfer_params": None,
}
)
monkeypatch.setattr(orchestrator_utils, "setup_openai_client", lambda _: fake_client)

sample = SimpleNamespace(
prompt_ids=[1],
prompt_mask=[True],
completion_ids=[2, 3],
completion_mask=[True, True],
completion_logprobs=[-0.1, -0.2],
completion_temperatures=[1.0, 1.0],
env_name="test-env",
)

try:
await orchestrator_utils.compute_teacher_logprobs(
clients=[vf.ClientConfig()],
model_name="teacher-model",
samples=[sample],
)
except ValueError as exc:
assert "teacher prompt_logprobs missing token id 2" in str(exc)
else:
raise AssertionError("Expected ValueError")

asyncio.run(_run())


def test_compute_teacher_logprobs_rejects_missing_non_leading_entry(monkeypatch):
async def _run():
fake_client = _FakeOpenAIClient(
{
"request_id": "gen-test",
"choices": [],
"prompt_logprobs": [None, None, {"3": {"logprob": -0.3}}],
"kv_transfer_params": None,
}
)
monkeypatch.setattr(orchestrator_utils, "setup_openai_client", lambda _: fake_client)

sample = SimpleNamespace(
prompt_ids=[1],
prompt_mask=[True],
completion_ids=[2, 3],
completion_mask=[True, True],
completion_logprobs=[-0.1, -0.2],
completion_temperatures=[1.0, 1.0],
env_name="test-env",
)

try:
await orchestrator_utils.compute_teacher_logprobs(
clients=[vf.ClientConfig()],
model_name="teacher-model",
samples=[sample],
)
except ValueError as exc:
assert "teacher prompt_logprobs missing entry at position 1 for token id 2" in str(exc)
else:
raise AssertionError("Expected ValueError")

asyncio.run(_run())


def test_compute_teacher_logprobs_raises_for_teacher_http_error(monkeypatch):
async def _run():
fake_client = _FakeOpenAIClient(
{"error": {"message": "invalid teacher api key"}},
status_code=401,
)
monkeypatch.setattr(orchestrator_utils, "setup_openai_client", lambda _: fake_client)

sample = SimpleNamespace(
prompt_ids=[1],
prompt_mask=[True],
completion_ids=[2, 3],
completion_mask=[True, True],
completion_logprobs=[-0.1, -0.2],
completion_temperatures=[1.0, 1.0],
env_name="test-env",
)

try:
await orchestrator_utils.compute_teacher_logprobs(
clients=[vf.ClientConfig()],
model_name="teacher-model",
samples=[sample],
)
except httpx.HTTPStatusError as exc:
assert exc.response.status_code == 401
else:
raise AssertionError("Expected HTTPStatusError")

asyncio.run(_run())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets remove tests here

Comment thread skills/configs/SKILL.md
Comment on lines +63 to +73
## Distillation training modes

Set `orchestrator.training_mode = "sft"` or `orchestrator.training_mode = "opd"` (or top-level `training_mode`, which auto-propagates) and configure `orchestrator.teacher` with the teacher endpoint.

In `sft`, rollouts are generated by the teacher. The shared `training_mode` validator sets `trainer.loss.type = "sft"`, and trainer loss dispatch uses `sft_loss_fn`.

In `opd`, rollouts are generated by the student. The orchestrator scores the student token sequence with the teacher, writes those values to `TrainingSample.teacher_logprobs`, and trainer loss dispatch uses `opd_loss_fn`.

`[inference]` is required for the usual online path because it starts the student inference server and auto-configures `orchestrator.student.client.base_url`. The student pool is used for online evals and policy weight sync. For externally started student inference, set `orchestrator.student.client.base_url` explicitly instead.

Teacher logprob scoring uses PrimeRL's vLLM-native `/inference/v1/generate` route. The request field is `token_ids`, meaning the prompt plus completion tokens to score; response `choices[].token_ids` remains generated completion tokens and is not used for OPD scoring.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets remove also ? not sure what is the link with the pr

scored_token_ids: list[int],
) -> tuple[str, dict[str, Any]]:
base = base_url.rstrip("/")
return f"{base.removesuffix('/v1')}/inference/v1/generate", {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this break opd in prime rl , we can't do this

@tim0120 tim0120 closed this Jun 3, 2026
@tim0120 tim0120 reopened this Jun 4, 2026
@tim0120 tim0120 marked this pull request as draft June 4, 2026 23:45
@tim0120
Copy link
Copy Markdown
Contributor Author

tim0120 commented Jun 4, 2026

Closing again to keep the active workstream focused on replay-backed SFT. OPD follow-up is not needed here.

@tim0120 tim0120 closed this Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants