WIP: Modal nightly cron — stacked on Greene PR (do not merge) #498

Draft
eugenevinitsky wants to merge 1 commit into `ev/nightly_runs_greene` from `ev/nightly_runs_modal`

@eugenevinitsky

Summary

WIP — runs the same nightly fan-out as the Greene launcher (sibling PR), on Modal with a built-in cron schedule.

Stacked on top of #497 — base is `ev/nightly_runs_greene` because `modal_app.py` reads the yamls added in that PR at runtime. Merge order: #497, then this.

  • `scripts/modal/Dockerfile` — CUDA 12.8.1 / Ubuntu 24.04 / Python 3.12 / uv / torch cu128 base. Multi-arch CUDA (sm_75 / sm_80 / sm_89 / sm_90) so one image runs on T4 / A100 / L4 / L40S / H100 / H200 without rebuild.
  • `scripts/modal/modal_app.py` — Modal app with:
    • per-seed `train()` on 1× A100-80GB (cpu=24, memory=32GB, timeout=12h, retries=1)
    • `nightly()` cron at 04:00 UTC fanning out to 6 parallel runs (3 seeds × {single, multi})
    • routes to wandb projects `nightly-single` / `nightly-multi`, group = today's UTC date
  • `scripts/modal/README.md` — setup / deploy / manual trigger workflow.
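
The fan-out described above can be sketched in plain Python. This is a minimal sketch, not the contents of `modal_app.py`: the names `TRAIN_RESOURCES`, `NIGHTLY_CRON`, `SEEDS`, and `build_nightly_jobs` are hypothetical, the seed values are placeholders (the summary only says "3 seeds"), and the `YYYY-MM-DD` group format is an assumption. The resource values are written in the units that Modal's `@app.function` is understood to take (memory in MiB, timeout in seconds), and the cron string is the standard five-field form for 04:00 UTC daily.

```python
from datetime import datetime, timezone
from itertools import product

# Resource settings from the summary, in the units Modal's function
# decorator is understood to expect (memory in MiB, timeout in seconds).
TRAIN_RESOURCES = {
    "gpu": "A100-80GB",
    "cpu": 24,
    "memory": 32 * 1024,      # 32 GB request, written as MiB (assumption)
    "timeout": 12 * 60 * 60,  # 12h container timeout
    "retries": 1,
}

# Standard cron expression for 04:00 every day; Modal cron schedules run in UTC.
NIGHTLY_CRON = "0 4 * * *"

SEEDS = (0, 1, 2)             # hypothetical seed values
MODES = ("single", "multi")


def build_nightly_jobs(now=None):
    """Return one job spec per (mode, seed) pair: 2 modes x 3 seeds = 6 jobs."""
    now = now or datetime.now(timezone.utc)
    group = now.strftime("%Y-%m-%d")  # wandb group = today's UTC date (format assumed)
    return [
        {
            "seed": seed,
            "mode": mode,
            "wandb_project": f"nightly-{mode}",
            "wandb_group": group,
        }
        for mode, seed in product(MODES, SEEDS)
    ]


if __name__ == "__main__":
    for job in build_nightly_jobs():
        print(job)
```

In a Modal app, each of these job specs would presumably be handed to the per-seed `train()` function so the six runs execute in parallel containers; how `modal_app.py` actually dispatches them is not shown in this PR description.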

Status

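Smoke-validated end-to-end on A100-80GB: wandb run https://wandb.ai/emerge_/nightly-modal/runs/n8a85v5z (~200K SPS on single-agent, all 6 metrics pipelines confirmed).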

Why WIP / do not merge

  • The multi-agent config (`nightly_best.yaml`) hasn't been run on Modal yet; it needs to fit within the 12h container timeout.
  • Cost is non-trivial — 6 × A100-80GB-hours/night ≈ a few hundred / month. Wait for a real multi-agent number before flipping cron on.
  • Image-cache caveat: changing a line in the Dockerfile invalidates every cached layer below it. Want to settle the apt list before this lands so subsequent edits don't trigger 5-min apt rebuilds.
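
For the cost bullet, a rough estimate can be sketched as follows. The hourly rate is a placeholder, not Modal's published price, and the currency (USD) and the 30-night month are assumptions. The function name `monthly_gpu_cost` is hypothetical. The first call uses the 6 GPU-hours/night figure quoted above; the second shows the upper bound if all six runs used the full 12h timeout.

```python
def monthly_gpu_cost(runs_per_night, hours_per_run, usd_per_gpu_hour,
                     nights_per_month=30):
    """Estimate monthly GPU spend for a nightly fan-out of single-GPU runs."""
    gpu_hours_per_night = runs_per_night * hours_per_run
    return gpu_hours_per_night * usd_per_gpu_hour * nights_per_month


# Placeholder rate for illustration only; check current pricing before relying on it.
PLACEHOLDER_USD_PER_A100_HOUR = 2.50

# 6 runs x 1 hour each = 6 GPU-hours/night, as quoted in the bullet above.
typical = monthly_gpu_cost(6, 1.0, PLACEHOLDER_USD_PER_A100_HOUR)

# Upper bound: every run consumes the full 12h container timeout.
worst_case = monthly_gpu_cost(6, 12.0, PLACEHOLDER_USD_PER_A100_HOUR)

if __name__ == "__main__":
    print(f"typical:    ${typical:,.0f}/month")
    print(f"worst case: ${worst_case:,.0f}/month")
```

Under these assumed numbers the typical case lands in the "few hundred per month" range, while a multi-agent run that approaches the timeout would cost roughly an order of magnitude more, which is why a real multi-agent number matters before enabling the cron.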

Test plan

  • Smoke training run on A100-80GB to ≥10M steps (in progress)
  • Manual `modal run scripts/modal/modal_app.py::nightly` to verify the 6-job fan-out
  • Multi-agent yaml runs to completion in a 12h container
  • Verify wandb run finalization (model upload) lands
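
Whether a run fits in the 12h container can be estimated from its step budget and throughput before launching it. This is a minimal sketch under stated assumptions: SPS is read as environment steps per second, the helper names `estimated_hours` and `fits_in_timeout` are hypothetical, and the 10% safety margin is an arbitrary choice to leave room for startup and wandb finalization. The only measured figure available here is the ~200K SPS single-agent smoke run; multi-agent throughput is not yet known.

```python
def estimated_hours(total_steps, steps_per_second):
    """Wall-clock hours needed to run total_steps at a steady throughput."""
    if steps_per_second <= 0:
        raise ValueError("steps_per_second must be positive")
    return total_steps / steps_per_second / 3600.0


def fits_in_timeout(total_steps, steps_per_second,
                    timeout_hours=12.0, safety_margin=0.9):
    """True if the run is expected to finish within a fraction of the timeout."""
    budget_hours = timeout_hours * safety_margin
    return estimated_hours(total_steps, steps_per_second) <= budget_hours


if __name__ == "__main__":
    # Single-agent smoke target: 10M steps at the measured ~200K SPS.
    hours = estimated_hours(10_000_000, 200_000)
    print(f"single-agent smoke run: {hours * 3600:.0f} s")
    print("fits in 12h:", fits_in_timeout(10_000_000, 200_000))
```

Plugging in the multi-agent step budget and a measured multi-agent SPS would give a quick go/no-go check for the third test-plan item.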

🤖 Generated with Claude Code

Adds a Modal app that runs the same nightly fan-out as the Greene
launchers, on Modal infrastructure with a built-in cron schedule.

- scripts/modal/Dockerfile — CUDA 12.8.1 / Ubuntu 24.04 / Python 3.12 /
  uv / torch cu128. Multi-arch CUDA (sm_75 / sm_80 / sm_89 / sm_90) so
  one image runs on T4 / A100 / L4/L40S / H100/H200 without rebuild.
- scripts/modal/modal_app.py — Modal app with a per-seed train() on
  1x A100-80GB and a nightly() cron at 04:00 UTC that fans out to
  6 parallel runs (3 seeds x {single, multi}). Routes to wandb
  projects nightly-single / nightly-multi, group = today's UTC date.
- scripts/modal/README.md — modal token new -> modal secret create
  wandb-emerge -> modal deploy workflow.

Smoke-validated end-to-end on A100-80GB:
  wandb: https://wandb.ai/emerge_/nightly-modal/runs/n8a85v5z
  (~200K SPS on single-agent, all 6 metrics pipelines confirmed.)

Depends on the Greene PR for the cluster_configs/*.yaml files that the
Modal app loads at runtime.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@eugenevinitsky force-pushed the `ev/nightly_runs_modal` branch from 1e3b718 to 6e0d7ba on June 28, 2026 16:41