Skip to content

feature: prime train routes full_finetune TOMLs to hosted endpoint#592

Open
JannikSt wants to merge 14 commits into
mainfrom
feature/train-hosted-full-finetune
Open

feature: prime train routes full_finetune TOMLs to hosted endpoint#592
JannikSt wants to merge 14 commits into
mainfrom
feature/train-hosted-full-finetune

Conversation

@JannikSt
Copy link
Copy Markdown
Member

@JannikSt JannikSt commented May 2, 2026

Note

Medium Risk
Changes how expensive training jobs are launched and deleted (hosted vs LoRA dispatch) and handles per-run credentials; mistakes could send the wrong workload or delete path, though behavior is covered by tests and explicit confirmation for dispatch.

Overview
prime train now detects full-parameter (prime-rl) TOMLs before the LoRA schema runs and dispatches them to POST /v1/training/runs via a new HostedTrainingClient, while unchanged LoRA configs still use the shared RFT path.

Detection covers type = "full_finetune" and [deployment] signals (including multi-node num_train_nodes / num_infer_nodes, which previously fell through to LoRA). The hosted path ships the full TOML as config, applies team_id and secrets from env files, strips local env_file paths, and does not print the minted API token in CLI/JSON output.

RLRun gains optional kind and makes LoRA-centric fields optional so dedicated full-FT runs list/get cleanly; UI shows placeholders when those fields are absent.

prime train delete tries the hosted full-FT delete endpoint first and falls back to LoRA on typed NotFoundError (404 from the HTTP client).

prime train logs adds trainer / inference components and routes most env-server reads through unified /rft/runs/.../logs with component + env_name; name/N qualifiers still use the legacy env-server-logs endpoint.

Reviewed by Cursor Bugbot for commit bc982f4. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b88b5f7d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Comment thread packages/prime/src/prime_cli/commands/rl.py
willccbb
willccbb previously approved these changes May 19, 2026
Comment thread packages/prime/src/prime_cli/commands/rl.py
@JannikSt
Copy link
Copy Markdown
Member Author

JannikSt commented Jun 1, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b07791f22c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/prime/src/prime_cli/api/rl.py
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
JannikSt added 10 commits June 1, 2026 14:54
…o hosted endpoint

`prime train <toml>` stays the single entry point. When the TOML carries
`type = "full_finetune"` (or a `[hosted]` block, or a `[deployment]`
block matching prime-rl's qwen30b_math/rl.toml shape), the CLI routes
to the new public API at /api/v1/training/runs instead of the LoRA
shared-cluster path. Backwards compatible: configs without these
markers run through the existing flow unchanged.

* api/training.py: new HostedTrainingClient + build_payload_from_toml
  (whitelist-maps prime-rl example fields onto the API payload).
* api/rl.py: surface `kind` on RLRun so `prime train delete` can route
  to the right endpoint based on run kind.
* commands/rl.py: peek the TOML before strict RLConfig parse; on full-FT
  hand off to _dispatch_full_finetune_run with shared env-file/secrets
  plumbing. Delete looks up kind and dispatches via HostedTrainingClient
  for DEDICATED_FULL_FT runs.

Tested end-to-end against local backend on rft-freyr cluster: dispatch
+ status mirroring + completion + delete all clean.
Drop the friction of looking up a cluster cuid for the common single-
cluster setup. Backend now auto-picks the first uncordoned PrimeCluster
when the field is omitted, so `prime train backend/examples/training/
reverse_text.toml` is zero-config.
…eyRef

The platform materialises the per-run RFT API token into the per-run
k8s Secret on dispatch and the chart binds it to the orchestrator pod's
PRIME_API_KEY env var. The token already lives where prime-monitor
needs it — surfacing it on stdout just makes it easy to leak into
shared shell history or CI logs.
- Drop prime_cluster_id from CLI plumbing entirely. Backend auto-picks
  the first uncordoned PrimeCluster; the CLI never threads a cluster
  id, removing a footgun (mistargeting a config to the wrong cluster).
  Drops the [hosted] discriminator path too — type = 'full_finetune'
  or a [deployment] block remain the only triggers.

- codex P1 / cursor: --output json on the full-FT path no longer
  short-circuits with a {would_dispatch} preview. Mirrors the LoRA
  path's 'create then format' contract — automation that pipes the
  JSON to grab run_id now actually dispatches the run.

- codex P2: env_file (deprecated, singular) is loaded BEFORE env_files
  (canonical, plural) so env_files wins on key collision. Matches the
  LoRA path's documented precedence.
build_payload_from_toml used to whitelist ~12 individual fields; the
backend rebuilt a minimal TOML from them, dropping anything outside
the whitelist (custom optim schedules, eval configs, custom scheduler
params, …). E2E prime-rl runs that depended on those knobs silently
diverged from `uv run rl @ rl.toml` behaviour.

Now: ship the whole TOML as 'config' (companion to platform PR #1824
faa934d56). The backend's build_values takes the config dict directly
so the same TOML works locally with prime-rl and remote-dispatched
through prime-cli with no fork in behaviour.
Unrelated to the full-FT training payload change; accidentally picked
up by 'git add -A'. Restoring inference.py to its 4b9be8f state. The
streaming improvements will land in a sibling PR off main.
Cursor caught: when rl_client.get_run fails (e.g. pydantic
ValidationError because a DEDICATED_FULL_FT row doesn't carry the
LoRA-required RLRun fields), the prior except APIError block silently
set kind=None and routed delete to the LoRA endpoint. The hosted
helm release + namespace would stay live with no signal back.

Restructure: try hosted_client.delete_run first; on HTTP 404 the
backend's kind gate told us 'not a DEDICATED_FULL_FT', so we fall
through to the LoRA path. Removes the get_run discriminator
roundtrip entirely — and any pydantic surprise it could have raised.
Cursor flagged: distinguishing 404 by 'HTTP 404' not in str(e) is
fragile — depends on the message format never changing AND on no
unrelated error body coincidentally containing the substring.

- Add NotFoundError(APIError) subclass; APIClient raises it for 404
  responses (sync + async paths).
- HostedTrainingClient.create_run / delete_run no longer catch
  Exception and rewrap into a generic APIError — typed APIError
  subclasses propagate so callers can branch by class.
- prime train delete uses except NotFoundError as the 'not a hosted
  run, try LoRA' fallback signal.
Cursor caught: `if output != "json" and not yes` meant a researcher
piping JSON for scripting would have full-FT runs auto-launched
without ack. The LoRA path always prompts unless --yes regardless of
output format. Match that contract — gate confirmation purely on
--yes.
`_dispatch_full_finetune_run` was building the payload without a
teamId, so every dispatched run landed on the caller's personal
account even when `prime config view` showed a team selected. The
LoRA path (rl.py:1216) already passes `app_config.team_id` — match
that behaviour for the full-FT path.

Backend's `CreateDedicatedRunRequest.team_id` is Optional, so the
prior omission was silently wrong instead of returning a 400.
@JannikSt JannikSt force-pushed the feature/train-hosted-full-finetune branch from b07791f to 8aafec7 Compare June 1, 2026 13:00
Backend's /api/v1/rft/runs/{run_id}/logs now accepts component +
env_name params (dedicated full-FT). Surface them through the CLI:

  prime train logs <run_id> -c trainer
  prime train logs <run_id> -c inference
  prime train logs <run_id> -c env-server --env <name>

Legacy --env <name>/<idx> still routes through the env-server-logs
endpoint (shared-RFT pods, cluster_id-backed lookup). Dedicated
env-server (slug, no slash) goes through the unified /logs route.

Per-rank --pod-index intentionally not exposed yet: the chart's
torchrun --local-ranks-filter=0 already collapses in-pod rank fan-out
to rank 0 stdout, and Loki's pod-label indexing in this tenant
doesn't actually filter the prime-job-* streams — per-pod inspection
on multi-node runs is kubectl + the PVC log files for now.
@JannikSt JannikSt force-pushed the feature/train-hosted-full-finetune branch from 8aafec7 to feea570 Compare June 1, 2026 13:02
Comment thread packages/prime/src/prime_cli/api/rl.py
Comment thread packages/prime/src/prime_cli/commands/rl.py
@JannikSt
Copy link
Copy Markdown
Member Author

JannikSt commented Jun 2, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5c0e2f79a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


Cluster targeting is backend-side (auto-pick first uncordoned).
"""
payload: Dict[str, Any] = {"config": cfg}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Strip CLI-only env file keys before dispatch

When a full-finetune TOML uses the supported env_file/env_files keys, _dispatch_full_finetune_run reads those files into secrets but then passes the original raw_cfg here unchanged, so the backend receives config.env_file(s) as part of the prime-rl config. Those local paths will not exist in the hosted pod (and can also be rejected as unknown prime-rl settings), so runs that rely on config-level env files can fail even though the secrets were already extracted; copy the config and remove these CLI-only keys before putting it under config.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a41e4b8. Configure here.

name=name,
team_id=team_id,
wandb_api_key=secrets.get("WANDB_API_KEY"),
hf_token=secrets.get("HF_TOKEN"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full-FT path silently drops non-WANDB/HF secrets

Medium Severity

The _dispatch_full_finetune_run function collects all secrets from -e/--env-file via collect_env_vars, but only forwards WANDB_API_KEY and HF_TOKEN to build_payload_from_toml. Any other user-provided secrets (e.g., -e CUSTOM_API_KEY=value) are silently discarded without warning. In contrast, the LoRA path passes all secrets via secrets=secrets. A user relying on custom env vars reaching the training container would experience silent runtime failures.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a41e4b8. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants