feature: prime train routes full_finetune TOMLs to hosted endpoint#592
feature: prime train routes full_finetune TOMLs to hosted endpoint#592JannikSt wants to merge 14 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b88b5f7d0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b07791f22c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…o hosted endpoint `prime train <toml>` stays the single entry point. When the TOML carries `type = "full_finetune"` (or a `[hosted]` block, or a `[deployment]` block matching prime-rl's qwen30b_math/rl.toml shape), the CLI routes to the new public API at /api/v1/training/runs instead of the LoRA shared-cluster path. Backwards compatible: configs without these markers run through the existing flow unchanged. * api/training.py: new HostedTrainingClient + build_payload_from_toml (whitelist-maps prime-rl example fields onto the API payload). * api/rl.py: surface `kind` on RLRun so `prime train delete` can route to the right endpoint based on run kind. * commands/rl.py: peek the TOML before strict RLConfig parse; on full-FT hand off to _dispatch_full_finetune_run with shared env-file/secrets plumbing. Delete looks up kind and dispatches via HostedTrainingClient for DEDICATED_FULL_FT runs. Tested end-to-end against local backend on rft-freyr cluster: dispatch + status mirroring + completion + delete all clean.
Drop the friction of looking up a cluster cuid for the common single- cluster setup. Backend now auto-picks the first uncordoned PrimeCluster when the field is omitted, so `prime train backend/examples/training/ reverse_text.toml` is zero-config.
…eyRef The platform materialises the per-run RFT API token into the per-run k8s Secret on dispatch and the chart binds it to the orchestrator pod's PRIME_API_KEY env var. The token already lives where prime-monitor needs it — surfacing it on stdout just makes it easy to leak into shared shell history or CI logs.
- Drop prime_cluster_id from CLI plumbing entirely. Backend auto-picks
the first uncordoned PrimeCluster; the CLI never threads a cluster
id, removing a footgun (mistargeting a config to the wrong cluster).
Drops the [hosted] discriminator path too — type = 'full_finetune'
or a [deployment] block remain the only triggers.
- codex P1 / cursor: --output json on the full-FT path no longer
short-circuits with a {would_dispatch} preview. Mirrors the LoRA
path's 'create then format' contract — automation that pipes the
JSON to grab run_id now actually dispatches the run.
- codex P2: env_file (deprecated, singular) is loaded BEFORE env_files
(canonical, plural) so env_files wins on key collision. Matches the
LoRA path's documented precedence.
build_payload_from_toml used to whitelist ~12 individual fields; the backend rebuilt a minimal TOML from them, dropping anything outside the whitelist (custom optim schedules, eval configs, custom scheduler params, …). E2E prime-rl runs that depended on those knobs silently diverged from `uv run rl @ rl.toml` behaviour. Now: ship the whole TOML as 'config' (companion to platform PR #1824 faa934d56). The backend's build_values takes the config dict directly so the same TOML works locally with prime-rl and remote-dispatched through prime-cli with no fork in behaviour.
Unrelated to the full-FT training payload change; accidentally picked up by 'git add -A'. Restoring inference.py to its 4b9be8f state. The streaming improvements will land in a sibling PR off main.
Cursor caught: when rl_client.get_run fails (e.g. pydantic ValidationError because a DEDICATED_FULL_FT row doesn't carry the LoRA-required RLRun fields), the prior except APIError block silently set kind=None and routed delete to the LoRA endpoint. The hosted helm release + namespace would stay live with no signal back. Restructure: try hosted_client.delete_run first; on HTTP 404 the backend's kind gate told us 'not a DEDICATED_FULL_FT', so we fall through to the LoRA path. Removes the get_run discriminator roundtrip entirely — and any pydantic surprise it could have raised.
Cursor flagged: distinguishing 404 by 'HTTP 404' not in str(e) is fragile — depends on the message format never changing AND on no unrelated error body coincidentally containing the substring. - Add NotFoundError(APIError) subclass; APIClient raises it for 404 responses (sync + async paths). - HostedTrainingClient.create_run / delete_run no longer catch Exception and rewrap into a generic APIError — typed APIError subclasses propagate so callers can branch by class. - prime train delete uses except NotFoundError as the 'not a hosted run, try LoRA' fallback signal.
Cursor caught: `if output != "json" and not yes` meant a researcher piping JSON for scripting would have full-FT runs auto-launched without ack. The LoRA path always prompts unless --yes regardless of output format. Match that contract — gate confirmation purely on --yes.
`_dispatch_full_finetune_run` was building the payload without a teamId, so every dispatched run landed on the caller's personal account even when `prime config view` showed a team selected. The LoRA path (rl.py:1216) already passes `app_config.team_id` — match that behaviour for the full-FT path. Backend's `CreateDedicatedRunRequest.team_id` is Optional, so the prior omission was silently wrong instead of returning a 400.
b07791f to
8aafec7
Compare
Backend's /api/v1/rft/runs/{run_id}/logs now accepts component +
env_name params (dedicated full-FT). Surface them through the CLI:
prime train logs <run_id> -c trainer
prime train logs <run_id> -c inference
prime train logs <run_id> -c env-server --env <name>
Legacy --env <name>/<idx> still routes through the env-server-logs
endpoint (shared-RFT pods, cluster_id-backed lookup). Dedicated
env-server (slug, no slash) goes through the unified /logs route.
Per-rank --pod-index intentionally not exposed yet: the chart's
torchrun --local-ranks-filter=0 already collapses in-pod rank fan-out
to rank 0 stdout, and Loki's pod-label indexing in this tenant
doesn't actually filter the prime-job-* streams — per-pod inspection
on multi-node runs is kubectl + the PVC log files for now.
8aafec7 to
feea570
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e5c0e2f79a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| Cluster targeting is backend-side (auto-pick first uncordoned). | ||
| """ | ||
| payload: Dict[str, Any] = {"config": cfg} |
There was a problem hiding this comment.
Strip CLI-only env file keys before dispatch
When a full-finetune TOML uses the supported env_file/env_files keys, _dispatch_full_finetune_run reads those files into secrets but then passes the original raw_cfg here unchanged, so the backend receives config.env_file(s) as part of the prime-rl config. Those local paths will not exist in the hosted pod (and can also be rejected as unknown prime-rl settings), so runs that rely on config-level env files can fail even though the secrets were already extracted; copy the config and remove these CLI-only keys before putting it under config.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit a41e4b8. Configure here.
| name=name, | ||
| team_id=team_id, | ||
| wandb_api_key=secrets.get("WANDB_API_KEY"), | ||
| hf_token=secrets.get("HF_TOKEN"), |
There was a problem hiding this comment.
Full-FT path silently drops non-WANDB/HF secrets
Medium Severity
The _dispatch_full_finetune_run function collects all secrets from -e/--env-file via collect_env_vars, but only forwards WANDB_API_KEY and HF_TOKEN to build_payload_from_toml. Any other user-provided secrets (e.g., -e CUSTOM_API_KEY=value) are silently discarded without warning. In contrast, the LoRA path passes all secrets via secrets=secrets. A user relying on custom env vars reaching the training container would experience silent runtime failures.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit a41e4b8. Configure here.


Note
Medium Risk
Changes how expensive training jobs are launched and deleted (hosted vs LoRA dispatch) and handles per-run credentials; mistakes could send the wrong workload or delete path, though behavior is covered by tests and explicit confirmation for dispatch.
Overview
prime trainnow detects full-parameter (prime-rl) TOMLs before the LoRA schema runs and dispatches them toPOST /v1/training/runsvia a newHostedTrainingClient, while unchanged LoRA configs still use the shared RFT path.Detection covers
type = "full_finetune"and[deployment]signals (including multi-nodenum_train_nodes/num_infer_nodes, which previously fell through to LoRA). The hosted path ships the full TOML asconfig, applies team_id and secrets from env files, strips localenv_filepaths, and does not print the minted API token in CLI/JSON output.RLRungains optionalkindand makes LoRA-centric fields optional so dedicated full-FT runs list/get cleanly; UI shows placeholders when those fields are absent.prime train deletetries the hosted full-FT delete endpoint first and falls back to LoRA on typedNotFoundError(404 from the HTTP client).prime train logsaddstrainer/inferencecomponents and routes most env-server reads through unified/rft/runs/.../logswithcomponent+env_name;name/Nqualifiers still use the legacy env-server-logs endpoint.Reviewed by Cursor Bugbot for commit bc982f4. Bugbot is set up for automated code reviews on this repo. Configure here.