Commit 653761d

Update Modal runtime for Princeton leaderboard
1 parent 2d9aaba commit 653761d

3 files changed

Lines changed: 182 additions & 7 deletions

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
# Modal Runtime Deploy and E2E

Use this when changing shared Modal dependencies in `kernelbot`, especially torch/CUDA, and when you need to prove the live leaderboard is actually using the new runtime.

## Scope

- Shared Modal image: `src/runners/modal_runner.py`
- GPU-bound Modal functions: `src/runners/modal_runner_archs.py`
- Live app name: `discord-bot-runner`
- Popcorn e2e path: generate invite if needed, join closed leaderboard, submit with `popcorn-cli`

## Workflow

1. Make the smallest dependency change in `src/runners/modal_runner.py`.
2. If changing torch/CUDA, inspect all later `.uv_pip_install(...)` blocks for conflicting CUDA/NCCL packages.
3. Deploy to Modal `pytest` first.
4. Run the narrow Modal integration test:
```bash
cd /Users/mark/Dev/kernelbot
env MODAL_TOKEN_ID=... MODAL_TOKEN_SECRET=... \
uv run --extra dev python -m pytest -s tests/test_modal.py -k 'test_modal_launcher_python_script and T4'
```
5. If that passes, deploy to Modal `main`:
```bash
cd /Users/mark/Dev/kernelbot/src/runners
env MODAL_TOKEN_ID=... MODAL_TOKEN_SECRET=... \
/Users/mark/Dev/kernelbot/.venv/bin/modal deploy --env main modal_runner_archs.py
```
6. Run a real `popcorn` submission in `test` mode against the target leaderboard.
7. Confirm the returned report shows the expected `Torch:` version.
8. Only then run `--mode leaderboard` if the user asked for a ranked submission.

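Step 7 of the workflow can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the returned report is plain text containing a line of the form `Torch: <version>`. The exact report layout is not shown in this commit, so the sample report and the helper names `torch_version_from_report` and `report_matches` are hypothetical.

```python
import re
from typing import Optional

# Hypothetical report excerpt: the real popcorn report layout is an assumption.
SAMPLE_REPORT = """\
Running on:
GPU: NVIDIA A100
Torch: 2.11.0+cu129
"""


def torch_version_from_report(report: str) -> Optional[str]:
    """Return the token after the first `Torch:` label, or None if no such line exists."""
    match = re.search(r"^\s*Torch:\s*(\S+)", report, flags=re.MULTILINE)
    return match.group(1) if match else None


def report_matches(report: str, expected: str) -> bool:
    """True when the reported version equals `expected`, ignoring a local suffix such as `+cu129`."""
    found = torch_version_from_report(report)
    return found is not None and found.split("+", 1)[0] == expected


print(report_matches(SAMPLE_REPORT, "2.11.0"))
```

Comparing only the part before `+` keeps the check independent of the CUDA build tag; compare the full string instead if the build tag also needs to be verified.
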
## Closed Leaderboards

Generate an invite with admin token:

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... POPCORN_ADMIN_TOKEN=... \
cargo run --quiet -- admin generate-invites --leaderboards <leaderboard> --count 1
```

Join with the existing CLI identity in `~/.popcorn.yaml`:

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... \
cargo run --quiet -- join '<invite_code>'
```

## Real E2E Submit

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... \
cargo run --quiet -- submit --no-tui --leaderboard <leaderboard> --gpu A100 --mode test <submission.py>
```

Ranked submit:

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... \
cargo run --quiet -- submit --no-tui --leaderboard <leaderboard> --gpu A100 --mode leaderboard <submission.py>
```

Check recent runs:

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... \
cargo run --quiet -- submissions list --leaderboard <leaderboard> --limit 5
```

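The test-before-ranked ordering from the workflow can be encoded in a small wrapper so that a ranked run is never issued on its own. The sketch below only builds argument lists from the flags shown in the commands above. The function names `submit_args` and `submission_plan` are illustrative assumptions and are not part of `popcorn-cli`, and the sketch deliberately accepts only the two modes that appear in this document.

```python
from typing import List


def submit_args(leaderboard: str, gpu: str, mode: str, submission: str) -> List[str]:
    """Build the `cargo run` argument list for one submission, using only flags shown above."""
    if mode not in ("test", "leaderboard"):
        raise ValueError(f"unsupported mode in this sketch: {mode!r}")
    return [
        "cargo", "run", "--quiet", "--",
        "submit", "--no-tui",
        "--leaderboard", leaderboard,
        "--gpu", gpu,
        "--mode", mode,
        submission,
    ]


def submission_plan(leaderboard: str, gpu: str, submission: str, ranked: bool) -> List[List[str]]:
    """Always plan a `test` run first; append the ranked run only when requested."""
    plan = [submit_args(leaderboard, gpu, "test", submission)]
    if ranked:
        plan.append(submit_args(leaderboard, gpu, "leaderboard", submission))
    return plan


for args in submission_plan("<leaderboard>", "A100", "<submission.py>", ranked=True):
    print(" ".join(args))
```

Each list could be passed to `subprocess.run(args, check=True)` from the `popcorn-cli` checkout, stopping before the ranked run if the test run or its report check fails.
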
## Failure Mode To Remember

If a Modal run fails with:

```text
libtorch_cuda.so: undefined symbol: ncclDevCommCreate
```

then a later package install likely replaced torch's expected CUDA/NCCL dependency set. The practical fix is to install `torch` last so its dependency versions win.
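
The failure above is easy to miss in a long run log. A minimal sketch for flagging it, assuming the log is available as text; the helper name `missing_nccl_symbol` and the `ImportError:` prefix on the sample line are hypothetical, and only the quoted `undefined symbol` message comes from this document.

```python
import re
from typing import Optional

# Matches dynamic-linker errors like the one quoted above and captures the missing symbol.
UNDEFINED_NCCL = re.compile(r"undefined symbol:\s*(nccl\w+)")


def missing_nccl_symbol(log_text: str) -> Optional[str]:
    """Return the first undefined NCCL symbol named in `log_text`, or None."""
    match = UNDEFINED_NCCL.search(log_text)
    return match.group(1) if match else None


# Sample line: the error text is from the document, the prefix is assumed.
log = "ImportError: libtorch_cuda.so: undefined symbol: ncclDevCommCreate"
symbol = missing_nccl_symbol(log)
if symbol:
    print(f"NCCL mismatch suspected (missing {symbol}); check that torch is installed last.")
```
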
Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
---
name: modal-runtime-deploy-e2e
description: Upgrade shared Modal runtime dependencies in kernelbot and verify them end to end. Use when changing torch/CUDA or other shared Modal image dependencies, deploying the Modal app, and validating with both Modal integration tests and real popcorn leaderboard submissions.
---

# Modal Runtime Deploy and E2E

Use this when changing shared Modal dependencies in `kernelbot`, especially torch/CUDA, and when you need to prove the live leaderboard is actually using the new runtime.

## Scope

- Shared Modal image: `src/runners/modal_runner.py`
- GPU-bound Modal functions: `src/runners/modal_runner_archs.py`
- Live app name: `discord-bot-runner`
- Popcorn e2e path: generate invite if needed, join closed leaderboard, submit with `popcorn-cli`

## Workflow

1. Make the smallest dependency change in `src/runners/modal_runner.py`.
2. If changing torch/CUDA, inspect all later `.uv_pip_install(...)` blocks for conflicting CUDA/NCCL packages.
3. Deploy to Modal `pytest` first.
4. Run the narrow Modal integration test:

```bash
cd /Users/mark/Dev/kernelbot
env MODAL_TOKEN_ID=... MODAL_TOKEN_SECRET=... \
uv run --extra dev python -m pytest -s tests/test_modal.py -k 'test_modal_launcher_python_script and T4'
```

5. If that passes, deploy to Modal `main`:

```bash
cd /Users/mark/Dev/kernelbot/src/runners
env MODAL_TOKEN_ID=... MODAL_TOKEN_SECRET=... \
/Users/mark/Dev/kernelbot/.venv/bin/modal deploy --env main modal_runner_archs.py
```

6. Run a real `popcorn` submission in `test` mode against the target leaderboard.
7. Confirm the returned report shows the expected `Torch:` version.
8. Only then run `--mode leaderboard` if the user asked for a ranked submission.

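Step 2 asks for a manual read of every `.uv_pip_install(...)` block. One way to back that up is a static check that `torch` appears only in the final install step. The sketch below parses a trimmed, hypothetical image definition with Python's `ast` module; the `SOURCE` string, the `base_image` name, and the helper names are assumptions for illustration, and the real chain in `src/runners/modal_runner.py` is longer.

```python
import ast
import re
from typing import List

# Hypothetical, trimmed-down method chain standing in for the real image definition.
SOURCE = '''
image = (
    base_image
    .uv_pip_install("pytest", "PyYAML")
    .uv_pip_install("tinygrad~=0.10")
    .uv_pip_install("torch==2.11.0", index_url="https://download.pytorch.org/whl/cu129")
)
'''


def pip_install_steps(source: str) -> List[List[str]]:
    """Return the positional string arguments of each `.uv_pip_install(...)` call, in chain order."""
    calls = [
        node
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "uv_pip_install"
    ]
    # In a method chain every call starts at the same position, but a later call ends later.
    calls.sort(key=lambda call: (call.end_lineno, call.end_col_offset))
    return [
        [a.value for a in call.args if isinstance(a, ast.Constant) and isinstance(a.value, str)]
        for call in calls
    ]


def has_torch(packages: List[str]) -> bool:
    """True if any requirement names exactly `torch` (not `torchvision` and similar)."""
    return any(re.split(r"[=~<>!\[ ]", p, maxsplit=1)[0] == "torch" for p in packages)


def torch_is_last(steps: List[List[str]]) -> bool:
    """True when `torch` is requested in the final install step and in no earlier one."""
    return bool(steps) and has_torch(steps[-1]) and not any(has_torch(s) for s in steps[:-1])


print(torch_is_last(pip_install_steps(SOURCE)))
```

This only checks ordering. It does not detect an earlier package that pulls in a conflicting CUDA/NCCL wheel as a transitive dependency, so the manual inspection in step 2 still applies.
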
## Closed Leaderboards

Generate an invite with admin token:

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... POPCORN_ADMIN_TOKEN=... \
cargo run --quiet -- admin generate-invites --leaderboards <leaderboard> --count 1
```

Join with the existing CLI identity in `~/.popcorn.yaml`:

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... \
cargo run --quiet -- join '<invite_code>'
```

## Real E2E Submit

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... \
cargo run --quiet -- submit --no-tui --leaderboard <leaderboard> --gpu A100 --mode test <submission.py>
```

Ranked submit:

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... \
cargo run --quiet -- submit --no-tui --leaderboard <leaderboard> --gpu A100 --mode leaderboard <submission.py>
```

Check recent runs:

```bash
cd /Users/mark/Dev/popcorn-cli
env POPCORN_API_URL=... \
cargo run --quiet -- submissions list --leaderboard <leaderboard> --limit 5
```

## Failure Mode To Remember

If a Modal run fails with:

```text
libtorch_cuda.so: undefined symbol: ncclDevCommCreate
```

then a later package install likely replaced torch's expected CUDA/NCCL dependency set. The practical fix is to install `torch` last so its dependency versions win.

src/runners/modal_runner.py

Lines changed: 7 additions & 7 deletions
@@ -9,7 +9,7 @@
 # Create a stub for the Modal app
 # IMPORTANT: This has to stay in separate file or modal breaks
 app = App("discord-bot-runner")
-cuda_version = "13.1.0"
+cuda_version = "12.9.1"
 flavor = "devel"
 operating_sys = "ubuntu24.04"
 tag = f"{cuda_version}-{flavor}-{operating_sys}"
@@ -37,6 +37,7 @@
     .run_commands("ln -sf $(which python) /usr/local/bin/python3")
     .apt_install(
         "git",
+        "curl",
         "gcc-13",
         "g++-13",
         "clang-18",
@@ -50,12 +51,6 @@
         "pytest",
         "PyYAML",
     )
-    .uv_pip_install(
-        "torch==2.9.1",
-        "torchvision",
-        "torchaudio",
-        index_url="https://download.pytorch.org/whl/cu130",
-    )
     # other frameworks
     .uv_pip_install(
         "tinygrad~=0.10",
@@ -70,6 +65,11 @@
         # "nvmath-python[cu13]~=0.4",
         # "numba-cuda[cu13]~=0.15",
     )
+    # Install torch last so its CUDA/NCCL dependency set wins over broader CUDA Python packages.
+    .uv_pip_install(
+        "torch==2.11.0",
+        index_url="https://download.pytorch.org/whl/cu129",
+    )
     # CUTLASS C++ headers for #include <cutlass/...>
     .run_commands(
         "git clone --depth 1 --branch v4.3.5 https://github.com/NVIDIA/cutlass.git /opt/cutlass",
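
This diff changes `cuda_version` and the wheel index suffix together (`13.1.0` with `cu130` before, `12.9.1` with `cu129` after), so the two are easy to leave out of sync. The sketch below mirrors the `tag` f-string and adds one plausible sanity check. The compatibility rule it uses (same CUDA major, wheel minor not above the image minor) is an assumption chosen because both pairs in this diff satisfy it; the commit does not state such a rule, and the helper names are hypothetical.

```python
from typing import Tuple


def image_tag(cuda_version: str, flavor: str = "devel", operating_sys: str = "ubuntu24.04") -> str:
    """Mirror the f-string in modal_runner.py that builds the CUDA image tag."""
    return f"{cuda_version}-{flavor}-{operating_sys}"


def wheel_cuda(index_url: str) -> Tuple[int, int]:
    """Split a trailing `cuXYZ` path segment into (major, minor), e.g. cu129 -> (12, 9)."""
    suffix = index_url.rstrip("/").rsplit("/", 1)[-1]
    if not (suffix.startswith("cu") and suffix[2:].isdigit() and len(suffix) >= 4):
        raise ValueError(f"no cuXYZ suffix in {index_url!r}")
    digits = suffix[2:]
    # Assumes a single-digit minor version, which holds for the suffixes in this diff.
    return int(digits[:-1]), int(digits[-1])


def compatible(cuda_version: str, index_url: str) -> bool:
    """Assumed rule: same CUDA major, and the wheel's minor does not exceed the image's."""
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    wheel_major, wheel_minor = wheel_cuda(index_url)
    return major == wheel_major and wheel_minor <= minor


print(image_tag("12.9.1"))
print(compatible("12.9.1", "https://download.pytorch.org/whl/cu129"))
```
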
