WIP: Modal nightly cron for PufferDrive training (do not merge) by eugenevinitsky · Pull Request #496 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-06-28T16:07:17Z

Summary

WIP scaffolding to run nightly PufferDrive training on Modal.

Adds:

`scripts/modal/Dockerfile` — CUDA 12.8.1 / Ubuntu 24.04 / Python 3.12 / uv / torch cu128 base image. Multi-arch CUDA (sm_75 / sm_80 / sm_89 / sm_90) so one image runs on T4 / A100 / L4 / H100.
`scripts/modal/modal_app.py` — Modal app with a per-seed `train()` function on 1× A100-80GB and a `nightly()` cron at 04:00 UTC that fans out to 6 parallel runs (3 seeds × {single_agent, multi_agent}). Logs to wandb project `nightly-modal` under `emerge_`.
`scripts/modal/README.md` — `modal token new` → `modal secret create wandb-emerge ...` → `modal deploy` workflow.
`scripts/cluster_configs/nightly_best.yaml` — multi-agent training config (ported from `nightly_best_launch` branch) so the same yaml works for the Greene launcher and Modal.
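
The 3 seeds × 2 configs fan-out that `nightly()` performs can be sketched as a plain job matrix. This is a minimal illustration, not the code in `modal_app.py`: the seed values, the `NightlyJob` dataclass, and the run-name format are assumptions made for the example. Only the two yaml file names (taken from this PR's commit messages) and the 6-job shape come from the PR itself.

```python
from dataclasses import dataclass
from itertools import product

# The two training configs named in this PR.
CONFIGS = ("single_agent_speed_run.yaml", "nightly_best.yaml")
# Hypothetical seed values; the PR only states that there are 3 seeds.
SEEDS = (0, 1, 2)


@dataclass(frozen=True)
class NightlyJob:
    """One training run in the nightly fan-out (illustrative structure)."""
    yaml_name: str
    seed: int
    run_name: str


def build_nightly_jobs(configs=CONFIGS, seeds=SEEDS):
    """Return one job per (config, seed) pair, i.e. len(configs) * len(seeds) jobs."""
    jobs = []
    for yaml_name, seed in product(configs, seeds):
        stem = yaml_name.removesuffix(".yaml")
        # Run-name format is an assumption, chosen only to keep names unique.
        jobs.append(NightlyJob(yaml_name, seed, f"{stem}-seed{seed}"))
    return jobs


if __name__ == "__main__":
    for job in build_nightly_jobs():
        print(job)
```

In a Modal app each entry would typically be handed to the remote `train()` function (for example with `train.spawn(...)`); that wiring is left out so the sketch runs without Modal installed.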

Status

Smoke test on A100-80GB is running end-to-end as of opening this PR:

wandb run: https://wandb.ai/emerge_/nightly-modal/runs/n8a85v5z
~200K SPS, ETA ~1h 20m to 1B steps

What's known to work after the smoke:

✓ Image build (~5 min first time, ~2 min on iterations; cached layers shared across runs)
✓ `uv pip install -e .` builds C + CUDA extensions on the image
✓ wandb auth via Modal secret
✓ Training subprocess starts, PPO loop converges (entropy ↓, episode_return ↑, offroad_rate ↓)

Why WIP / do not merge

Hasn't been run with the multi-agent (`nightly_best.yaml`) config yet — it has 720k agents and 10B total_timesteps; needs to fit in the 12h Modal timeout per container.
Cost on Modal is non-trivial: roughly 6 A100-80GB-hours per night, or a few hundred dollars per month. Wait for a real cost number from the first full multi-agent run before merging anything that auto-fires.
A few rough edges I'd want to clean up first: cpu/memory split per yaml (smaller for single-agent, larger for multi), maybe an env knob to scale total_timesteps for cheaper smoke runs.
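
The timeout concern can be made concrete with a back-of-the-envelope calculation. The sketch below assumes the multi-agent config would sustain the same ~200K SPS seen in the smoke run, which has not been measured. The function names are hypothetical, and the real throughput at 720k agents may differ in either direction.

```python
def eta_hours(total_steps: float, steps_per_second: float) -> float:
    """Wall-clock hours needed to reach total_steps at a constant throughput."""
    if steps_per_second <= 0:
        raise ValueError("steps_per_second must be positive")
    return total_steps / steps_per_second / 3600.0


def min_sps_for_timeout(total_steps: float, timeout_hours: float) -> float:
    """Lowest constant throughput that finishes total_steps within the timeout."""
    return total_steps / (timeout_hours * 3600.0)


SMOKE_SPS = 200_000      # ~200K SPS from the smoke run (assumed to carry over)
TIMEOUT_HOURS = 12       # Modal per-container timeout quoted in this PR

if __name__ == "__main__":
    for label, steps in (("1B steps", 1e9), ("10B steps", 10e9)):
        hours = eta_hours(steps, SMOKE_SPS)
        verdict = "fits" if hours <= TIMEOUT_HOURS else "exceeds"
        print(f"{label}: {hours:.2f} h ({verdict} the {TIMEOUT_HOURS} h timeout)")
    print(f"SPS needed for 10B in 12 h: {min_sps_for_timeout(10e9, TIMEOUT_HOURS):,.0f}")
```

Under that assumption, 10B steps would take just under 14 hours, so the run would need to average a little over 231K SPS (or use a smaller `total_timesteps`) to finish inside one 12h container.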

Test plan

Smoke training run on A100-80GB to ≥10M steps without OOM / crash (in progress now)
Manual `modal run scripts/modal/modal_app.py::nightly` to verify the 6-job fan-out
Multi-agent yaml runs to completion in a 12h container
Verify wandb run finalization (model upload) lands

🤖 Generated with Claude Code

Modal app that runs every night and launches 3 seeds of single_agent_speed_run.yaml plus 3 seeds of nightly_best.yaml on 1x A100-80GB each, all in parallel, logging to wandb project nightly-modal under emerge_.

Files:
- scripts/modal/Dockerfile: CUDA 12.8.1 base, system libs, Python 3.12, torch. Stable slow layer.
- scripts/modal/modal_app.py: bakes the repo + builds C extensions on top of the Dockerfile, defines train() and the nightly() cron at 04:00 UTC.
- scripts/modal/README.md: setup / deploy / manual-trigger flow.
- scripts/cluster_configs/nightly_best.yaml: multi-agent training config (ported from the nightly_best_launch sister branch) so both the Greene launcher and Modal can share it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Pipeline now works: image builds, container boots on A100-80GB, C+CUDA extensions import, wandb logs to nightly-modal under emerge_, training PPO loop runs at ~200K SPS.

Notable shifts vs the original scaffold:
- uv replaces pip in the Dockerfile. uv's resolver respects the inline `numpy<2 pandas<2.2` constraints during `uv pip install -e .`, so the C extension's numpy 1 ABI survives setup.py's unconstrained `pandas` dep (which would otherwise pull pandas 3 → numpy 2 → ImportError).
- Project venv at /opt/venv via `uv venv` (was python3.12 -m venv against the externally-managed system python, which refused to upgrade pip).
- torch from cu128 wheel index (pypi default is cu13, incompatible with the base image's CUDA 12.8 toolkit).
- TORCH_CUDA_ARCH_LIST="7.5;8.0;8.9;9.0": single image covers T4 / A100 / L4 / L40S / H100 / H200, swap gpu= string without rebuild.
- Robust REPO_ROOT resolution: walk up looking for setup.py instead of hard-coded parents[2]. Survives both `<repo>/scripts/modal/modal_app.py` locally and `/root/modal_app.py` in the container.
- train() function: cpu=24, memory=32GB, --vec.num-workers=8 override. PufferLib refuses if num_workers > os.cpu_count() and drive.ini default (20) exceeded Modal's observed core count.
- Drop the unsupported --wandb-entity CLI flag (pufferl.py doesn't take one); set WANDB_ENTITY env var instead so wandb.init() picks it up.
- add_local_dir replaces the removed modal.Mount (Modal 1.x API).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
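
Two of the fixes in this commit message are small enough to sketch. The code below is an illustration under stated assumptions, not the contents of `modal_app.py`: `find_repo_root` and `clamp_num_workers` are hypothetical names. The clamp is also an alternative to what the commit actually does, which is to pass a fixed `--vec.num-workers=8` override.

```python
import os
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "setup.py") -> Path:
    """Walk up from `start` until a directory containing `marker` is found.

    Avoids hard-coding a depth such as parents[2], so the same file works
    whether it lives deep inside the repo or directly next to setup.py.
    """
    start = start.resolve()
    base = start if start.is_dir() else start.parent
    for candidate in (base, *base.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {base}")


def clamp_num_workers(requested: int, available: Optional[int] = None) -> int:
    """Cap the worker count at the number of visible CPU cores (minimum 1).

    `available` defaults to os.cpu_count(); it is a parameter only so the
    behaviour can be checked without depending on the host machine.
    """
    if available is None:
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


if __name__ == "__main__":
    try:
        print("repo root:", find_repo_root(Path(__file__)))
    except FileNotFoundError as exc:
        print(exc)
    # 20 is the drive.ini default mentioned in the commit message.
    print("workers:", clamp_num_workers(20))
```

A fixed override is simpler and reproducible across machines, while a clamp adapts to whatever core count the container reports; which is preferable depends on whether identical worker counts across runs matter for comparing results.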

…agent Modal job

Mirrors launch_single_agent.sh but points at nightly_best.yaml and bumps TIME (1800) and MEM (192gb) to match the multi-agent config's heavier profile (720k agents, 10B steps, inline validation_gigaflow eval that spikes past 128gb at the 250-epoch eval boundary).

Greene and Modal now both ship single + multi launchers backed by the same yamls in scripts/cluster_configs/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Single-agent and multi-agent nightly runs now go to different wandb projects so the two scales don't pile into one search/comparison space:
- single_agent_speed_run.yaml -> project `nightly-single`
- nightly_best.yaml -> project `nightly-multi`

Within each project, wandb_group is set to today's UTC date at launch time (overriding the static yaml value), so the night's 3 seeds cluster together in the UI.

Applied identically across:
- Modal nightly() fan-out
- Greene launch_single_agent.sh
- Greene launch_nightly_best.sh

train(yaml_path, seed, run_name, wandb_group, wandb_project): added the project parameter; nightly() picks the right one per yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
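
The per-yaml project choice and the date-based group can be sketched as two small helpers. This is a hedged illustration: the helper names and the `YYYY-MM-DD` group format are assumptions (the commit message says only "today's UTC date"), while the yaml-to-project mapping is taken from the message above.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional

# Mapping stated in the commit message.
PROJECT_BY_YAML = {
    "single_agent_speed_run.yaml": "nightly-single",
    "nightly_best.yaml": "nightly-multi",
}


def wandb_project_for(yaml_path: str) -> str:
    """Pick the wandb project from the yaml file name, ignoring its directory."""
    name = PurePosixPath(yaml_path).name
    if name not in PROJECT_BY_YAML:
        raise ValueError(f"no wandb project configured for {name!r}")
    return PROJECT_BY_YAML[name]


def wandb_group_for(now: Optional[datetime] = None) -> str:
    """Return the UTC calendar date as the group name (format is an assumption).

    Pass a timezone-aware datetime; a naive one is interpreted as local time
    by astimezone().
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%d")


if __name__ == "__main__":
    path = "scripts/cluster_configs/nightly_best.yaml"
    print(wandb_project_for(path), wandb_group_for())
```

Computing the group once in `nightly()` and passing it to every `train()` call, rather than recomputing it inside each container, would keep all seeds in the same group even if some containers start after midnight UTC.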

eugenevinitsky · 2026-06-28T16:36:34Z

Superseded by #497 (Greene) and #498 (Modal). Splitting the original combined PR so the Greene-only and Modal-only changes can land independently.

Eugene Vinitsky and others added 4 commits June 27, 2026 18:33

eugenevinitsky closed this Jun 28, 2026

eugenevinitsky wants to merge 4 commits into 3.0 from ev/nightly_runs
