WIP: Modal nightly cron for PufferDrive training (do not merge) #496
Closed
eugenevinitsky wants to merge 4 commits into
Modal app that runs every night and launches 3 seeds of single_agent_speed_run.yaml plus 3 seeds of nightly_best.yaml on 1x A100-80GB each, all in parallel, logging to wandb project nightly-modal under emerge_.

Files:
- scripts/modal/Dockerfile: CUDA 12.8.1 base, system libs, Python 3.12, torch. Stable slow layer.
- scripts/modal/modal_app.py: bakes the repo + builds C extensions on top of the Dockerfile, defines train() and the nightly() cron at 04:00 UTC.
- scripts/modal/README.md: setup / deploy / manual-trigger flow.
- scripts/cluster_configs/nightly_best.yaml: multi-agent training config (ported from the nightly_best_launch sister branch) so both the Greene launcher and Modal can share it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
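The fan-out described in this commit (two configs, three seeds each, one GPU job per pair) can be illustrated in plain Python. This is a minimal sketch, not the contents of modal_app.py: `JobSpec`, `build_jobs`, the seed values 0 to 2, and the run-name format are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from pathlib import Path

# Config names come from the commit message; seed values are assumed.
YAMLS = ("single_agent_speed_run.yaml", "nightly_best.yaml")
SEEDS = (0, 1, 2)


@dataclass(frozen=True)
class JobSpec:
    """One training job: a config, a seed, and the GPU it should run on."""

    yaml_name: str
    seed: int
    gpu: str = "A100-80GB"

    @property
    def run_name(self) -> str:
        # Hypothetical naming scheme: "<config stem>-seed<N>".
        return f"{Path(self.yaml_name).stem}-seed{self.seed}"


def build_jobs(yamls=YAMLS, seeds=SEEDS):
    """Return one JobSpec per (config, seed) pair."""
    return [JobSpec(yaml_name, seed) for yaml_name, seed in product(yamls, seeds)]


jobs = build_jobs()
for job in jobs:
    print(job.run_name, job.gpu)
```

Each spec would then be handed to whatever launches the remote training call; keeping the enumeration separate from the launch makes it easy to check that a night produces exactly six distinct runs.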
Pipeline now works: image builds, container boots on A100-80GB, C+CUDA extensions import, wandb logs to nightly-modal under emerge_, training PPO loop runs at ~200K SPS.

Notable shifts vs the original scaffold:
- uv replaces pip in the Dockerfile. uv's resolver respects the inline `numpy<2 pandas<2.2` constraints during `uv pip install -e .`, so the C extension's numpy 1 ABI survives setup.py's unconstrained `pandas` dep (which would otherwise pull pandas 3 → numpy 2 → ImportError).
- Project venv at /opt/venv via `uv venv` (was python3.12 -m venv against the externally-managed system python which refused to upgrade pip).
- torch from cu128 wheel index (pypi default is cu13, which is incompatible with the base image's CUDA 12.8 toolkit).
- TORCH_CUDA_ARCH_LIST="7.5;8.0;8.9;9.0": single image covers T4 / A100 / L4/L40S / H100/H200, swap gpu= string without rebuild.
- Robust REPO_ROOT resolution: walk up looking for setup.py instead of hard-coded parents[2]. Survives both `<repo>/scripts/modal/modal_app.py` locally and `/root/modal_app.py` in the container.
- train() function: cpu=24, memory=32GB, --vec.num-workers=8 override. PufferLib refuses if num_workers > os.cpu_count() and drive.ini default (20) exceeded Modal's observed core count.
- Drop the unsupported --wandb-entity CLI flag (pufferl.py doesn't take one); set WANDB_ENTITY env var instead so wandb.init() picks it up.
- add_local_dir replaces the removed modal.Mount (Modal 1.x API).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
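The REPO_ROOT change in this commit (walk up the directory tree until a setup.py is found, rather than indexing `parents[2]`) might look roughly like the sketch below. The function name `find_repo_root` and the choice to raise when no marker is found are assumptions; the commit message does not say what modal_app.py does in that case, or where the repo sits relative to `/root/modal_app.py` in the container.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "setup.py") -> Path:
    """Return the nearest ancestor directory of `start` that contains `marker`.

    `start` may be a file (e.g. this module's __file__) or a directory.
    Raising on failure is an assumption made for this sketch.
    """
    start = start.resolve()
    if start.is_file():
        start = start.parent
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"no {marker} found above {start}")
```

Typical usage would be `REPO_ROOT = find_repo_root(Path(__file__))`. Unlike a fixed `parents[2]`, this does not depend on how deeply the calling file is nested.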
…agent Modal job

Mirrors launch_single_agent.sh but points at nightly_best.yaml and bumps TIME (1800) and MEM (192gb) to match the multi-agent config's heavier profile (720k agents, 10B steps, inline validation_gigaflow eval that spikes past 128gb at the 250-epoch eval boundary).

Greene and Modal now both ship single + multi launchers backed by the same yamls in scripts/cluster_configs/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-agent and multi-agent nightly runs now go to different wandb projects so the two scales don't pile into one search/comparison space:
- single_agent_speed_run.yaml -> project `nightly-single`
- nightly_best.yaml -> project `nightly-multi`

Within each project, wandb_group is set to today's UTC date at launch time (overriding the static yaml value), so the night's 3 seeds cluster together in the UI. Applied identically across:
- Modal nightly() fan-out
- Greene launch_single_agent.sh
- Greene launch_nightly_best.sh

train(yaml_path, seed, run_name, wandb_group, wandb_project): added the project parameter; nightly() picks the right one per yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
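The per-yaml project choice and the date-based group can be sketched as two small helpers. The yaml-to-project mapping is taken from the commit message; the helper names and the `YYYY-MM-DD` date format are assumptions, since the message only says "today's UTC date".

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Mapping stated in the commit message.
PROJECT_BY_YAML = {
    "single_agent_speed_run.yaml": "nightly-single",
    "nightly_best.yaml": "nightly-multi",
}


def wandb_project_for(yaml_path: str) -> str:
    """Pick the wandb project from the config's file name."""
    return PROJECT_BY_YAML[Path(yaml_path).name]


def wandb_group_for(now: Optional[datetime] = None) -> str:
    """Return the UTC calendar date as the group name (format is assumed).

    Pass a timezone-aware datetime; with no argument the current UTC time is used.
    """
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%d")


print(wandb_project_for("scripts/cluster_configs/nightly_best.yaml"), wandb_group_for())
```

Computing the group once per launch and passing it to every seed is what makes the night's runs share a group, even if individual jobs start on either side of midnight UTC.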
Summary
WIP scaffolding to run nightly PufferDrive training on Modal.
Adds:
Status
Smoke test on A100-80GB is running end-to-end as of opening this PR:
What's known to work after the smoke:
Why WIP / do not merge
Test plan
🤖 Generated with Claude Code