Skip to content

Support iris runners other than ryan#6

Open
forklady42 wants to merge 6 commits into
mainfrom
multi-runner-iris
Open

Support iris runners other than ryan#6
forklady42 wants to merge 6 commits into
mainfrom
multi-runner-iris

Conversation

@forklady42

@forklady42 forklady42 commented Jun 15, 2026

Copy link
Copy Markdown
Member

Summary

Iris namespaces every job as /<user>/<job-name>, where <user> is the submitting machine's local username (getpass.getuser(), per iris's resolve_job_user). The tomat tooling hardcoded /ryan/ everywhere, so a run fired by anyone else landed in a namespace the CLI and dashboard never looked at — iris ls showed nothing, dashboard links 404'd, and cost/eval lookups missed the job.

This makes the runner identity dynamic and fans the listing / sync / lookup paths across every known runner.

Commits

  1. chore: ruff auto-fixes on the iris sync scripts — pure formatter output (line-wrapping, unused-import removal, f-string-without-placeholder rewrites) on the three sync scripts, isolated so the functional diff reads clean. No behavior change.
  2. feat: support iris runners other than ryan — the multi-runner change itself (logic only).
  3. sync: concurrent prefix fan-out; drop dead IRIS_USER — review follow-ups (see below).
  4. refactor: single-source the iris runner roster — collapse the three hardcoded roster copies into one module; also fixes a latent bug (see below).

Changes

tomat CLI

  • IRIS_USERS roster — ("ryan", "betsy") plus the local user, deduped (override: TOMAT_IRIS_USERS, comma-separated). There is no per-invocation "submitting user" override: iris derives the namespace from the OS user, which tomat never overrides — so TOMAT_IRIS_USERS affects lookup only, not where your jobs land.
  • DEFAULT_IRIS_PREFIXES now spans /<user>/{tomat,train,eval,kl} for every runner (8 prefixes for the 2-runner default), which transparently fixes iris ls/kill/logs/summary, preempts show/watch, and evals harvest (all match against the listing). The kl prefix catches loss-name-prefixed labels (e.g. kl-s05-tpu-cont-v3); TRAINING_JOB_RE matches them too.
  • iris attempts <label> tries the bare label under each user namespace until one resolves.
  • _iris_url (dashboard links in runs links/status) resolves a run's owner from the R2 iris-state.json snapshot (one memoized GET), falling back to /ryan/ for pre-2026-06 runs.
  • cost summary defaults to scanning all runners' prefixes; cost compute looks the job up under each namespace.
  • iris sync strips any /<user>/ prefix when naming attempts sidecars (was removeprefix("/ryan/")).

Dashboard sync scripts (scripts/cron_iris_sync_modal.py, scripts/iris-sync.py, scripts/iris_attempts_dump.py): same multi-user prefix fan-out and user-agnostic label stripping.

marin/train_tomat_tpu.py: ADC prereq docstring no longer names a specific account.

Review follow-ups (commit 3)

  • Concurrent sync fan-out. Growing the prefix set 3 → 8 left both crons issuing the iris RPCs sequentially; both fire every minute, so ~8×60s of sequential tunnel-setup couldn't keep the cadence. Both scripts now fan the per-prefix RPCs out via a bounded ThreadPoolExecutor (MAX_SYNC_WORKERS=4) — wall-clock back under the interval without spiking simultaneous controller tunnels. Fixed the stale "every 5min / 3 prefixes" timing comment too.
  • Dropped dead IRIS_USER / TOMAT_IRIS_USER. Post-change it was referenced only in its own definition, and its "submitting user" comment promised behavior it didn't have. Folded getpass.getuser() straight into the IRIS_USERS default.

Roster de-duplication (commit 4)

The roster lived in three hardcoded copies (the CLI's IRIS_KNOWN_USERS plus each sync cron's USERS), kept aligned only by "keep in sync" comments — a new runner needed three edits, and missing the cron copies would silently drop that runner's jobs from the dashboard/cost lookups. New scripts/iris_runners.py is the single source of truth (KNOWN_USERS + JOB_PREFIXES), dependency-free so the lean sync environments can import it without dragging in the tomat package. The CLI imports it (via a scripts/ sys.path insert), iris-sync.py imports it as a sibling, and the Modal cron ships it into the image with add_local_python_source("iris_runners") (the repo's established idiom).

While here, this also restores a ThreadPoolExecutor import in iris-sync.py that the on-edit formatter had dropped in commit 3 (in the window before its usage existed) — a real F821 that py_compile missed but ruff check catches. Lint/format now pass clean on all touched scripts.

Test plan

  • All touched files compile; tomat --help loads; each commit compiles independently.
  • Constant resolution: IRIS_USERS = ('ryan', 'betsy'), DEFAULT_IRIS_PREFIXES = 8 prefixes; TOMAT_IRIS_USERS override respected.
  • Verified the cosmetic/functional split: the functional commit's per-script diffs are logic-only (no reflow), and the branch tree is unchanged by the history rewrite.
  • End-to-end iris ls / dry-run against the live controller.

Note: the earlier uv.lock refresh was dropped — it was a net no-op against main (the merge had already reconciled it) and unrelated to this change.

🤖 Generated with Claude Code

forklady42 and others added 3 commits June 16, 2026 16:01
Pure `ruff check --fix` + `ruff format` output on
scripts/{iris-sync,cron_iris_sync_modal,iris_attempts_dump}.py — line
wrapping, unused-import removal, and f-string-without-placeholder rewrites.
No behavior change. Split out so the multi-runner change that follows reads
as logic-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Iris namespaces every job as `/<user>/<job-name>`, where <user> is the
submitting machine's local username (`getpass.getuser()`, per iris's
`resolve_job_user`). The tooling hardcoded `/ryan/`, so a run fired by
anyone else landed in a namespace the CLI and dashboard never looked at:
`iris ls` showed nothing, dashboard links 404'd, cost/eval lookups missed it.

- tomat: IRIS_USERS roster (known runners + local user; override
  TOMAT_IRIS_USERS). DEFAULT_IRIS_PREFIXES fans `/<user>/{tomat,train,eval,kl}`
  across all runners — transparently fixes iris ls/kill/logs/summary,
  preempts show/watch, evals harvest. `iris attempts` tries each namespace;
  `_iris_url` resolves a run's owner from the R2 iris-state snapshot (falls
  back to /ryan/ for pre-2026-06 runs); `cost summary/compute` scan all
  runners; `iris sync` strips any `/<user>/` prefix when naming sidecars.
- scripts/{iris-sync,cron_iris_sync_modal}.py: same multi-user prefix fan-out.
- scripts/iris_attempts_dump.py: user-agnostic label stripping.
- marin/train_tomat_tpu.py: ADC docstring no longer names a specific account.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The multi-runner change grew the prefix set 3 → 8 (2 users ×
{tomat,train,eval,kl}) but left both sync crons issuing those iris RPCs
sequentially. Both crons fire every minute, so 8×~60s ≈ 8min of sequential
tunnel-setup can't keep the cadence — runs overlap or skip.

- iris-sync.py / cron_iris_sync_modal.py: fan the per-prefix RPCs out via a
  bounded ThreadPoolExecutor (MAX_SYNC_WORKERS=4) so wall-clock stays under
  the interval without spiking simultaneous controller tunnels. ex.map
  preserves input order, so the zip into rows_by_prefix is exact.
- iris-sync.py: fix the stale timing comment (said "every 5min / 3 prefixes",
  contradicting the file's own every-minute docstring).
- tomat: drop IRIS_USER / TOMAT_IRIS_USER. Post multi-runner it was referenced
  only in its own definition, and its "submitting user" comment was misleading
  — iris namespaces from the OS user, which tomat never overrides. Fold
  getpass.getuser() into the IRIS_USERS default; TOMAT_IRIS_USERS affects
  lookup only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The roster was hardcoded in three places (tomat's IRIS_KNOWN_USERS and the
two sync crons' USERS), kept aligned only by "keep in sync" comments — adding
a runner meant three edits, and missing the cron copies would silently drop
that runner's jobs from the dashboard / cost lookups.

New scripts/iris_runners.py is the single source of truth (KNOWN_USERS +
JOB_PREFIXES, dependency-free so the lean sync envs can import it without
dragging in the tomat package):
- tomat: imports KNOWN_USERS / JOB_PREFIXES (adds scripts/ to sys.path,
  mirroring the existing src/ insert).
- scripts/iris-sync.py: sibling import — run it from the repo checkout
  (`~/tomat/scripts/`), which the GCE cutover already does, not the legacy
  bare-file copy.
- scripts/cron_iris_sync_modal.py: ships it into the image via
  add_local_python_source("iris_runners") (the repo's established idiom for
  local modules), imported in _sync.

Also restores the `from concurrent.futures import ThreadPoolExecutor` import
in iris-sync.py, dropped by the on-edit formatter in the fan-out commit
(a1f7c12) in the window before its usage existed — a real F821 that
`py_compile` doesn't catch but `ruff check` does.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@forklady42 forklady42 requested a review from ryan-williams June 17, 2026 15:43
@forklady42

Copy link
Copy Markdown
Member Author

Ready for a different bot to review

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the tomat tooling and iris dashboard sync jobs to work with Iris’s /<user>/<job-name> job namespaces by fanning job listing/lookup/sync across a configurable roster of runner usernames, rather than hardcoding /ryan/.

Changes:

  • Add a shared runner roster (scripts/iris_runners.py) and update the CLI + sync jobs to use it for prefix fan-out.
  • Make iris lookups (listing, attempts, dashboard URLs, cost/eval lookups) work across multiple runner namespaces.
  • Speed up the per-prefix iris sync crons by running prefix RPCs concurrently with a bounded ThreadPoolExecutor.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tomat Introduces dynamic IRIS_USERS, expands default prefixes across users, and updates multiple iris lookup paths (URLs, attempts, cost) to be multi-runner-aware.
scripts/iris-sync.py Uses shared prefix roster and adds bounded concurrent fan-out for per-prefix iris job list RPCs.
scripts/iris_runners.py New single source of truth for known iris runner usernames and job-name prefixes.
scripts/iris_attempts_dump.py Makes attempt label derivation user-namespace agnostic (/<user>/<label>).
scripts/cron_iris_sync_modal.py Ships iris_runners into the Modal image and runs per-prefix RPCs concurrently.
marin/train_tomat_tpu.py Generalizes ADC prereq docs and applies minor formatting cleanups.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/iris-sync.py
Comment thread scripts/cron_iris_sync_modal.py
Comment thread tomat Outdated
Comment thread tomat
@ryan-williams

Copy link
Copy Markdown
Member

(this lgtm, i'm just having Claude respond to copilot above, then will approve+land)

- `scripts/iris-sync.py`: catch `subprocess.TimeoutExpired` in
  `iris_job_list_json` and return `[]` so a single slow prefix skips
  its upload (per the function comment) instead of aborting the sync
  and losing every prefix's snapshot.

- `scripts/cron_iris_sync_modal.py`: same fix in the Modal cron's
  mirror function (`_iris_job_list_json`).

- `tomat` (`cost summary`):
    - Default `--prefix` was single-segment `/<user>` per known runner.
      Iris's `job list --prefix` requires `/<user>/<jobname>` (2+
      segments) — see the comment at line 253 — so the prior default
      errored out silently and the summary reported "no matching jobs".
      Switch the default to `DEFAULT_IRIS_PREFIXES` (the same 8-prefix
      fan-out `iris ls` uses); explicit `-p` is honored as-is.
    - Check `iris_run("list", ...)`'s return code before parsing
      stdout. Failed prefix → warn + skip instead of silently falling
      through to "no matching jobs".
    - Refresh the `--prefix` help text to match the new default.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread scripts/iris-sync.py
Comment thread scripts/cron_iris_sync_modal.py
- `scripts/iris-sync.py` docstring: drop the "standalone, no repo needed"
  framing — the file does sibling-import `iris_runners`, which is only
  satisfied when both files are shipped from `scripts/`. Replace with an
  accurate description: `setup-iris-cron-vm.sh` rsyncs the `scripts/`
  dir, the cron invokes this from there, the sibling import resolves
  locally.
- `scripts/cron_iris_sync_modal.py` Modal function timeouts: bump from
  120s → 240s on both `cron_sync` and `sync_once`. The 120s value gave
  zero headroom at the new 8-prefix / 4-worker fan-out (worst-case
  ceil(8/4) × 60s = 120s == the function timeout); a single slow prefix
  would race the cron's own cutoff. 240s gives a clean 2× margin, and
  the `count == 0` safety guard still bails if iris is wedged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants