WIP: Modal nightly cron for PufferDrive training (do not merge) #496

Closed
eugenevinitsky wants to merge 4 commits into 3.0 from ev/nightly_runs

@eugenevinitsky

Summary

WIP scaffolding to run nightly PufferDrive training on Modal.

Adds:

  • `scripts/modal/Dockerfile` — CUDA 12.8.1 / Ubuntu 24.04 / Python 3.12 / uv / torch cu128 base image. Multi-arch CUDA (sm_75 / sm_80 / sm_89 / sm_90) so one image runs on T4 / A100 / L4 / H100.
  • `scripts/modal/modal_app.py` — Modal app with a per-seed `train()` function on 1× A100-80GB and a `nightly()` cron at 04:00 UTC that fans out to 6 parallel runs (3 seeds × {single_agent, multi_agent}). Logs to wandb project `nightly-modal` under `emerge_`.
  • `scripts/modal/README.md` — `modal token new` → `modal secret create wandb-emerge ...` → `modal deploy` workflow.
  • `scripts/cluster_configs/nightly_best.yaml` — multi-agent training config (ported from `nightly_best_launch` branch) so the same yaml works for the Greene launcher and Modal.
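
The 6-run fan-out described above (3 seeds × 2 configs) can be sketched in plain Python. This is only an illustration of the job matrix, not the contents of `modal_app.py`: the seed values, the `Job` type, and the run-name scheme are assumptions, while the two yaml filenames come from this PR.

```python
from dataclasses import dataclass
from itertools import product

# Config filenames are taken from the PR text.
CONFIGS = ("single_agent_speed_run.yaml", "nightly_best.yaml")
# Assumption: the PR only says "3 seeds"; these values are placeholders.
SEEDS = (0, 1, 2)


@dataclass(frozen=True)
class Job:
    """One training run: a config file, a seed, and a display name."""
    yaml_name: str
    seed: int
    run_name: str


def build_jobs(configs=CONFIGS, seeds=SEEDS):
    """Return one Job per (config, seed) pair."""
    jobs = []
    for yaml_name, seed in product(configs, seeds):
        stem = yaml_name.removesuffix(".yaml")
        # Hypothetical naming scheme, not taken from modal_app.py.
        jobs.append(Job(yaml_name, seed, f"{stem}-seed{seed}"))
    return jobs


jobs = build_jobs()
for job in jobs:
    print(job.run_name)
```

A `nightly()` entry point would then hand each `Job` to `train()`, for example through Modal's `.spawn()` or `.starmap()`; which dispatch call the real app uses is not stated in this PR text.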

Status

The smoke test on A100-80GB is running end-to-end as of opening this PR.

What's known to work after the smoke:

  • ✓ Image build (~5 min first time, ~2 min on iterations; cached layers shared across runs)
  • ✓ `uv pip install -e .` builds C + CUDA extensions on the image
  • ✓ wandb auth via Modal secret
  • ✓ Training subprocess starts, PPO loop converges (entropy ↓, episode_return ↑, offroad_rate ↓)

Why WIP / do not merge

  • Hasn't been run with the multi-agent (`nightly_best.yaml`) config yet — it has 720k agents and 10B total_timesteps; needs to fit in the 12h Modal timeout per container.
  • Cost on Modal is non-trivial — 6 × A100-80GB-hours/night = a few hundred / month. Wait for a real cost number from the first full multi-agent run before merging anything that auto-fires.
  • A few rough edges I'd want to clean up first: cpu/memory split per yaml (smaller for single-agent, larger for multi), maybe an env knob to scale total_timesteps for cheaper smoke runs.
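
A quick back-of-the-envelope check of the 12h concern, as a sketch only. It assumes the ~200K SPS figure reported for the smoke run in the commit notes would carry over to the multi-agent config, which has not been measured; the real multi-agent throughput may be higher or lower.

```python
def hours_needed(total_timesteps: float, steps_per_second: float) -> float:
    """Wall-clock hours to consume `total_timesteps` at a constant rate."""
    return total_timesteps / steps_per_second / 3600.0


TOTAL_TIMESTEPS = 10e9   # total_timesteps for nightly_best.yaml, per this PR
ASSUMED_SPS = 200_000    # assumption: smoke-run throughput, not a multi-agent number
TIMEOUT_HOURS = 12       # Modal per-container timeout mentioned above

needed = hours_needed(TOTAL_TIMESTEPS, ASSUMED_SPS)
# Minimum sustained throughput required to finish inside the timeout.
min_sps = TOTAL_TIMESTEPS / (TIMEOUT_HOURS * 3600)
fits = needed <= TIMEOUT_HOURS

print(f"hours needed: {needed:.1f}, fits in timeout: {fits}")
print(f"minimum SPS to fit: {min_sps:,.0f}")
```

Under that assumed rate the run would need about 13.9 hours, so it would not fit; finishing in 12h would take roughly 231K SPS sustained, not counting startup or inline eval time.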

Test plan

  • Smoke training run on A100-80GB to ≥10M steps without OOM / crash (in progress now)
  • Manual `modal run scripts/modal/modal_app.py::nightly` to verify the 6-job fan-out
  • Multi-agent yaml runs to completion in a 12h container
  • Verify wandb run finalization (model upload) lands

🤖 Generated with Claude Code

Eugene Vinitsky and others added 4 commits June 27, 2026 18:33
Modal app that runs every night and launches 3 seeds of
single_agent_speed_run.yaml plus 3 seeds of nightly_best.yaml on 1x
A100-80GB each, all in parallel, logging to wandb project nightly-modal
under emerge_.

Files:
- scripts/modal/Dockerfile: CUDA 12.8.1 base, system libs, Python 3.12,
  torch. Stable slow layer.
- scripts/modal/modal_app.py: bakes the repo + builds C extensions on
  top of the Dockerfile, defines train() and the nightly() cron at
  04:00 UTC.
- scripts/modal/README.md: setup / deploy / manual-trigger flow.
- scripts/cluster_configs/nightly_best.yaml: multi-agent training
  config (ported from the nightly_best_launch sister branch) so both
  the Greene launcher and Modal can share it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Pipeline now works: image builds, container boots on A100-80GB, C+CUDA
extensions import, wandb logs to nightly-modal under emerge_, training
PPO loop runs at ~200K SPS.

Notable shifts vs the original scaffold:

- uv replaces pip in the Dockerfile. uv's resolver respects the inline
  `numpy<2 pandas<2.2` constraints during `uv pip install -e .`, so the
  C extension's numpy 1 ABI survives setup.py's unconstrained `pandas`
  dep (which would otherwise pull pandas 3 → numpy 2 → ImportError).
- Project venv at /opt/venv via `uv venv` (was python3.12 -m venv against
  the externally-managed system python which refused to upgrade pip).
- torch from cu128 wheel index (pypi default is cu13 — incompatible
  with the base image's CUDA 12.8 toolkit).
- TORCH_CUDA_ARCH_LIST="7.5;8.0;8.9;9.0": a single image covers T4 / A100
  / L4 / L40S / H100 / H200, so the gpu= string can be swapped without a
  rebuild.
- Robust REPO_ROOT resolution: walk up looking for setup.py instead of
  hard-coded parents[2]. Survives both `<repo>/scripts/modal/modal_app.py`
  locally and `/root/modal_app.py` in the container.
- train() function: cpu=24, memory=32GB, --vec.num-workers=8 override.
  PufferLib refuses to start if num_workers > os.cpu_count(), and the
  drive.ini default (20) exceeded the core count observed on Modal.
- Drop the unsupported --wandb-entity CLI flag (pufferl.py doesn't take
  one); set WANDB_ENTITY env var instead so wandb.init() picks it up.
- add_local_dir replaces the removed modal.Mount (Modal 1.x API).
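
Two of the items above, sketched in Python. The walk-up search mirrors the REPO_ROOT description; the worker clamp is a different approach from the fixed `--vec.num-workers=8` override this commit actually uses, shown only as one alternative way to stay under `os.cpu_count()`. Both function names are hypothetical and are not taken from `modal_app.py`.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "setup.py") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")


def clamp_workers(requested: int, available: Optional[int] = None) -> int:
    """Cap a requested worker count at the number of visible CPU cores."""
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available))


# Demo against a throwaway tree shaped like <repo>/scripts/modal/.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve()
    (repo / "setup.py").write_text("")
    nested = repo / "scripts" / "modal"
    nested.mkdir(parents=True)
    found_root = find_repo_root(nested / "modal_app.py")
    root_matches = found_root == repo

capped = clamp_workers(20, available=8)
print(root_matches, capped)
```

Because the search stops at the first directory holding `setup.py`, the same code works whether the script sits three levels deep in a checkout or directly under the directory the repo was copied into.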

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…agent Modal job

Mirrors launch_single_agent.sh but points at nightly_best.yaml and bumps
TIME (1800) and MEM (192gb) to match the multi-agent config's heavier
profile (720k agents, 10B steps, inline validation_gigaflow eval that
spikes past 128gb at the 250-epoch eval boundary).

Greene and Modal now both ship single + multi launchers backed by the
same yamls in scripts/cluster_configs/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Single-agent and multi-agent nightly runs now go to different wandb
projects so the two scales don't pile into one search/comparison space:
  - single_agent_speed_run.yaml -> project `nightly-single`
  - nightly_best.yaml           -> project `nightly-multi`

Within each project, wandb_group is set to today's UTC date at launch
time (overriding the static yaml value), so the night's 3 seeds cluster
together in the UI. Applied identically across:
  - Modal nightly() fan-out
  - Greene launch_single_agent.sh
  - Greene launch_nightly_best.sh

train(yaml_path, seed, run_name, wandb_group, wandb_project) — added
the project parameter; nightly() picks the right one per yaml.
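
A minimal sketch of the per-yaml project routing and date-based grouping described in this commit. The project names and yaml filenames come from the text above; the helper name `wandb_target` and the `YYYY-MM-DD` date format are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Mapping stated in this commit message.
PROJECT_BY_YAML = {
    "single_agent_speed_run.yaml": "nightly-single",
    "nightly_best.yaml": "nightly-multi",
}


def wandb_target(yaml_path: str, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (wandb_project, wandb_group) for a config path.

    The group is the UTC calendar date at launch time, so all seeds
    launched on the same night share one group.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    yaml_name = yaml_path.rsplit("/", 1)[-1]
    project = PROJECT_BY_YAML[yaml_name]  # KeyError on an unknown config
    # Assumed format; the commit only says "today's UTC date".
    group = now.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return project, group


# A launch at 01:00 in a UTC+5 zone still groups under the previous UTC day.
local_launch = datetime(2026, 6, 28, 1, 0, tzinfo=timezone(timedelta(hours=5)))
example = wandb_target("scripts/cluster_configs/nightly_best.yaml", local_launch)
print(example)
```

Converting to UTC before formatting keeps the Modal and Greene launchers in the same group even if their hosts use different local time zones.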

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@eugenevinitsky

Author

Superseded by #497 (Greene) and #498 (Modal). Splitting the original combined PR so the Greene-only and Modal-only changes can land independently.
