Support iris runners other than ryan#6
Open
forklady42 wants to merge 6 commits into
Open
Conversation
Pure `ruff check --fix` + `ruff format` output on
scripts/{iris-sync,cron_iris_sync_modal,iris_attempts_dump}.py — line
wrapping, unused-import removal, and f-string-without-placeholder rewrites.
No behavior change. Split out so the multi-runner change that follows reads
as logic-only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Iris namespaces every job as `/<user>/<job-name>`, where <user> is the
submitting machine's local username (`getpass.getuser()`, per iris's
`resolve_job_user`). The tooling hardcoded `/ryan/`, so a run fired by
anyone else landed in a namespace the CLI and dashboard never looked at:
`iris ls` showed nothing, dashboard links 404'd, cost/eval lookups missed it.
- tomat: IRIS_USERS roster (known runners + local user; override
TOMAT_IRIS_USERS). DEFAULT_IRIS_PREFIXES fans `/<user>/{tomat,train,eval,kl}`
across all runners — transparently fixes iris ls/kill/logs/summary,
preempts show/watch, evals harvest. `iris attempts` tries each namespace;
`_iris_url` resolves a run's owner from the R2 iris-state snapshot (falls
back to /ryan/ for pre-2026-06 runs); `cost summary/compute` scan all
runners; `iris sync` strips any `/<user>/` prefix when naming sidecars.
- scripts/{iris-sync,cron_iris_sync_modal}.py: same multi-user prefix fan-out.
- scripts/iris_attempts_dump.py: user-agnostic label stripping.
- marin/train_tomat_tpu.py: ADC docstring no longer names a specific account.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The multi-runner change grew the prefix set 3 → 8 (2 users ×
{tomat,train,eval,kl}) but left both sync crons issuing those iris RPCs
sequentially. Both crons fire every minute, so 8×~60s ≈ 8min of sequential
tunnel-setup can't keep the cadence — runs overlap or skip.
- iris-sync.py / cron_iris_sync_modal.py: fan the per-prefix RPCs out via a
bounded ThreadPoolExecutor (MAX_SYNC_WORKERS=4) so wall-clock stays under
the interval without spiking simultaneous controller tunnels. ex.map
preserves input order, so the zip into rows_by_prefix is exact.
- iris-sync.py: fix the stale timing comment (said "every 5min / 3 prefixes",
contradicting the file's own every-minute docstring).
- tomat: drop IRIS_USER / TOMAT_IRIS_USER. Post multi-runner it was referenced
only in its own definition, and its "submitting user" comment was misleading
— iris namespaces from the OS user, which tomat never overrides. Fold
getpass.getuser() into the IRIS_USERS default; TOMAT_IRIS_USERS affects
lookup only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5914be9 to
a1f7c12
Compare
The roster was hardcoded in three places (tomat's IRIS_KNOWN_USERS and the
two sync crons' USERS), kept aligned only by "keep in sync" comments — adding
a runner meant three edits, and missing the cron copies would silently drop
that runner's jobs from the dashboard / cost lookups.
New scripts/iris_runners.py is the single source of truth (KNOWN_USERS +
JOB_PREFIXES, dependency-free so the lean sync envs can import it without
dragging in the tomat package):
- tomat: imports KNOWN_USERS / JOB_PREFIXES (adds scripts/ to sys.path,
mirroring the existing src/ insert).
- scripts/iris-sync.py: sibling import — run it from the repo checkout
(`~/tomat/scripts/`), which the GCE cutover already does, not the legacy
bare-file copy.
- scripts/cron_iris_sync_modal.py: ships it into the image via
add_local_python_source("iris_runners") (the repo's established idiom for
local modules), imported in _sync.
Also restores the `from concurrent.futures import ThreadPoolExecutor` import
in iris-sync.py, dropped by the on-edit formatter in the fan-out commit
(a1f7c12) in the window before its usage existed — a real F821 that
`py_compile` doesn't catch but `ruff check` does.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Member
Author
|
Ready for a different bot to review |
There was a problem hiding this comment.
Pull request overview
Updates the tomat tooling and iris dashboard sync jobs to work with Iris’s /<user>/<job-name> job namespaces by fanning job listing/lookup/sync across a configurable roster of runner usernames, rather than hardcoding /ryan/.
Changes:
- Add a shared runner roster (
scripts/iris_runners.py) and update the CLI + sync jobs to use it for prefix fan-out. - Make iris lookups (listing, attempts, dashboard URLs, cost/eval lookups) work across multiple runner namespaces.
- Speed up the per-prefix iris sync crons by running prefix RPCs concurrently with a bounded
ThreadPoolExecutor.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
tomat |
Introduces dynamic IRIS_USERS, expands default prefixes across users, and updates multiple iris lookup paths (URLs, attempts, cost) to be multi-runner-aware. |
scripts/iris-sync.py |
Uses shared prefix roster and adds bounded concurrent fan-out for per-prefix iris job list RPCs. |
scripts/iris_runners.py |
New single source of truth for known iris runner usernames and job-name prefixes. |
scripts/iris_attempts_dump.py |
Makes attempt label derivation user-namespace agnostic (/<user>/<label>). |
scripts/cron_iris_sync_modal.py |
Ships iris_runners into the Modal image and runs per-prefix RPCs concurrently. |
marin/train_tomat_tpu.py |
Generalizes ADC prereq docs and applies minor formatting cleanups. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Member
|
(this lgtm, i'm just having Claude respond to copilot above, then will approve+land) |
- `scripts/iris-sync.py`: catch `subprocess.TimeoutExpired` in
`iris_job_list_json` and return `[]` so a single slow prefix skips
its upload (per the function comment) instead of aborting the sync
and losing every prefix's snapshot.
- `scripts/cron_iris_sync_modal.py`: same fix in the Modal cron's
mirror function (`_iris_job_list_json`).
- `tomat` (`cost summary`):
- Default `--prefix` was single-segment `/<user>` per known runner.
Iris's `job list --prefix` requires `/<user>/<jobname>` (2+
segments) — see the comment at line 253 — so the prior default
errored out silently and the summary reported "no matching jobs".
Switch the default to `DEFAULT_IRIS_PREFIXES` (the same 8-prefix
fan-out `iris ls` uses); explicit `-p` is honored as-is.
- Check `iris_run("list", ...)`'s return code before parsing
stdout. Failed prefix → warn + skip instead of silently falling
through to "no matching jobs".
- Refresh the `--prefix` help text to match the new default.
- `scripts/iris-sync.py` docstring: drop the "standalone, no repo needed" framing — the file does sibling-import `iris_runners`, which is only satisfied when both files are shipped from `scripts/`. Replace with an accurate description: `setup-iris-cron-vm.sh` rsyncs the `scripts/` dir, the cron invokes this from there, the sibling import resolves locally. - `scripts/cron_iris_sync_modal.py` Modal function timeouts: bump from 120s → 240s on both `cron_sync` and `sync_once`. The 120s value gave zero headroom at the new 8-prefix / 4-worker fan-out (worst-case ceil(8/4) × 60s = 120s == the function timeout); a single slow prefix would race the cron's own cutoff. 240s gives a clean 2× margin, and the `count == 0` safety guard still bails if iris is wedged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Iris namespaces every job as
/<user>/<job-name>, where<user>is the submitting machine's local username (getpass.getuser(), per iris'sresolve_job_user). The tomat tooling hardcoded/ryan/everywhere, so a run fired by anyone else landed in a namespace the CLI and dashboard never looked at —iris lsshowed nothing, dashboard links 404'd, and cost/eval lookups missed the job.This makes the runner identity dynamic and fans the listing / sync / lookup paths across every known runner.
Commits
chore: ruff auto-fixes on the iris sync scripts— pure formatter output (line-wrapping, unused-import removal, f-string-without-placeholder rewrites) on the three sync scripts, isolated so the functional diff reads clean. No behavior change.feat: support iris runners other than ryan— the multi-runner change itself (logic only).sync: concurrent prefix fan-out; drop dead IRIS_USER— review follow-ups (see below).refactor: single-source the iris runner roster— collapse the three hardcoded roster copies into one module; also fixes a latent bug (see below).Changes
tomatCLIIRIS_USERSroster —("ryan", "betsy")plus the local user, deduped (override:TOMAT_IRIS_USERS, comma-separated). There is no per-invocation "submitting user" override: iris derives the namespace from the OS user, which tomat never overrides — soTOMAT_IRIS_USERSaffects lookup only, not where your jobs land.DEFAULT_IRIS_PREFIXESnow spans/<user>/{tomat,train,eval,kl}for every runner (8 prefixes for the 2-runner default), which transparently fixesiris ls/kill/logs/summary,preempts show/watch, andevals harvest(all match against the listing). Theklprefix catches loss-name-prefixed labels (e.g.kl-s05-tpu-cont-v3);TRAINING_JOB_REmatches them too.iris attempts <label>tries the bare label under each user namespace until one resolves._iris_url(dashboard links inruns links/status) resolves a run's owner from the R2iris-state.jsonsnapshot (one memoized GET), falling back to/ryan/for pre-2026-06 runs.cost summarydefaults to scanning all runners' prefixes;cost computelooks the job up under each namespace.iris syncstrips any/<user>/prefix when naming attempts sidecars (wasremoveprefix("/ryan/")).Dashboard sync scripts (
scripts/cron_iris_sync_modal.py,scripts/iris-sync.py,scripts/iris_attempts_dump.py): same multi-user prefix fan-out and user-agnostic label stripping.marin/train_tomat_tpu.py: ADC prereq docstring no longer names a specific account.Review follow-ups (commit 3)
ThreadPoolExecutor(MAX_SYNC_WORKERS=4) — wall-clock back under the interval without spiking simultaneous controller tunnels. Fixed the stale "every 5min / 3 prefixes" timing comment too.IRIS_USER/TOMAT_IRIS_USER. Post-change it was referenced only in its own definition, and its "submitting user" comment promised behavior it didn't have. Foldedgetpass.getuser()straight into theIRIS_USERSdefault.Roster de-duplication (commit 4)
The roster lived in three hardcoded copies (the CLI's
IRIS_KNOWN_USERSplus each sync cron'sUSERS), kept aligned only by "keep in sync" comments — a new runner needed three edits, and missing the cron copies would silently drop that runner's jobs from the dashboard/cost lookups. Newscripts/iris_runners.pyis the single source of truth (KNOWN_USERS+JOB_PREFIXES), dependency-free so the lean sync environments can import it without dragging in thetomatpackage. The CLI imports it (via ascripts/sys.pathinsert),iris-sync.pyimports it as a sibling, and the Modal cron ships it into the image withadd_local_python_source("iris_runners")(the repo's established idiom).While here, this also restores a
ThreadPoolExecutorimport iniris-sync.pythat the on-edit formatter had dropped in commit 3 (in the window before its usage existed) — a realF821thatpy_compilemissed butruff checkcatches. Lint/format now pass clean on all touched scripts.Test plan
tomat --helploads; each commit compiles independently.IRIS_USERS = ('ryan', 'betsy'),DEFAULT_IRIS_PREFIXES= 8 prefixes;TOMAT_IRIS_USERSoverride respected.iris ls/ dry-run against the live controller.🤖 Generated with Claude Code