Commit 86946ff

beveradb and claude authored
feat: target GPU runner for integration tests (#267)
Update integration test workflow to run on [self-hosted, gpu] label instead of bare self-hosted. This routes tests to the new GPU runner VM with NVIDIA T4, reducing CI time from 30+ minutes to ~5 minutes.

- Change runs-on to [self-hosted, gpu]
- Install poetry dependencies with -E gpu (onnxruntime-gpu)
- Add nvidia-smi verification step
- Add 30-minute timeout
- Update plan with resolved open questions

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 992cc2c commit 86946ff

2 files changed

Lines changed: 202 additions & 4 deletions

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
# Plan: GCP GPU Runner for Integration Tests

**Created:** 2026-03-16
**Branch:** feat/gha-gpu-runner
**Status:** Implemented (pending `pulumi up` and PR merge)

## Overview

The python-audio-separator integration tests currently run on a CPU-only self-hosted
GHA runner (`e2-standard-4`, 4 vCPU, 16GB RAM). With the new ensemble tests and
multi-stem verification tests, CI takes 30+ minutes because each model separation runs
on CPU. A GPU runner would reduce this to ~5 minutes.

## Current State

### Existing runner infrastructure
- **Location:** `karaoke-gen/infrastructure/compute/github_runners.py` (Pulumi)
- **Runners:** `e2-standard-4` (general) + 1× `e2-standard-8` (Docker builds)
- **Labels:** `self-hosted`, `Linux`, `X64`, `gcp`, `large-disk`
- **Zone:** `us-central1-a`
- **Models:** Pre-cached at `/opt/audio-separator-models` on runner startup
- **Org-level:** Runners are registered to `nomadkaraoke` org, available to all repos
- **NAT:** All runners use Cloud NAT (no external IPs)

### Current integration test workflow
- File: `.github/workflows/run-integration-tests.yaml`
- Runs on: `self-hosted` (picks up any org runner)
- Tests: `poetry run pytest -sv --cov=audio_separator tests/integration`
- Installs: `poetry install -E cpu`
- Problem: All model inference on CPU → very slow for Roformer/Demucs models

## Requirements

- [x] GCE VM with NVIDIA GPU (T4 is cheapest, sufficient for inference)
- [x] CUDA drivers + PyTorch GPU support pre-installed
- [x] Models pre-cached on persistent disk (same as existing runners)
- [x] Labeled `gpu` so workflow can target it specifically
- [x] Cost-effective: only runs when needed (on-demand, not always-on)
- [x] Integration test workflow updated to use `gpu` label
- [x] Install `poetry install -E gpu` (onnxruntime-gpu) instead of `-E cpu`

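As a rough illustration of why the `gpu` label lets the workflow target this runner: GitHub Actions only routes a job to a self-hosted runner that carries every label listed in `runs-on`. The sketch below models that rule in Python. The `runner_matches` helper and the runner names are hypothetical, and the case-insensitive comparison is an assumption made here because the plan lists labels in mixed case.

```python
def runner_matches(job_labels, runner_labels):
    """Return True if the runner carries every label the job asks for."""
    # Assumption: labels are compared case-insensitively.
    wanted = {label.lower() for label in job_labels}
    available = {label.lower() for label in runner_labels}
    return wanted <= available


# Label sets are taken from the plan; the runner names are illustrative only.
RUNNERS = {
    "cpu-runner": ["self-hosted", "Linux", "X64", "gcp", "large-disk"],
    "gpu-runner": ["self-hosted", "linux", "x64", "gcp", "gpu"],
}


def eligible_runners(job_labels):
    """List the runners that could pick up a job with the given labels."""
    return sorted(
        name for name, labels in RUNNERS.items()
        if runner_matches(job_labels, labels)
    )


print(eligible_runners(["self-hosted"]))
print(eligible_runners(["self-hosted", "gpu"]))
```

Under this model a bare `self-hosted` job can still land on the GPU runner, since that runner also carries the `self-hosted` label.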
## Technical Approach

### Option A: Dedicated GPU VM (simplest)

Add a new GPU runner VM to the existing Pulumi infrastructure. Use an `n1-standard-4`
with 1× NVIDIA T4 GPU. Cost: ~$0.35/hr on-demand, ~$0.11/hr spot for the T4 alone (the Cost Estimate table includes the VM).

**Pros:** Simple, fits existing patterns, fast startup (VM already running)
**Cons:** Always-on cost if not managed; or slow cold-start if managed

### Option B: Spot GPU VM with startup/shutdown management

Same as A but use spot pricing and the existing runner_manager Cloud Function to
start/stop based on CI demand.

**Pros:** About 70% cheaper ($0.11/hr vs. $0.35/hr), fits existing management pattern
**Cons:** Spot can be preempted mid-test (rare for short jobs); cold start ~2-3 min

### Option C: Use a cloud GPU service (Modal, Lambda Labs, etc.)

Run the integration tests on a cloud GPU service rather than self-hosted.

**Pros:** No infrastructure to manage, pay-per-second
**Cons:** More complex CI integration, different from existing patterns

### Recommendation: Option B (Spot GPU VM)

The integration test takes <10 minutes on GPU, so spot preemption risk is low.
Cold start is acceptable since it's triggered by PR events. Cost: ~$0.02 per CI run for the GPU alone (~$0.03 including the VM).

## Implementation Steps

### 1. Pulumi infrastructure (in karaoke-gen repo)

1. [x] Add `GITHUB_GPU_RUNNER` machine type to `config.py`: `n1-standard-4` + 1× T4
2. [x] Add `GPU_RUNNER_LABELS` to `config.py`: `"self-hosted,linux,x64,gcp,gpu"`
3. [x] Create GPU runner VM in `github_runners.py`:
   - `n1-standard-4` (4 vCPU, 15GB RAM)
   - 1× NVIDIA T4 GPU (`nvidia-tesla-t4`)
   - `guest_accelerators` config
   - `on_host_maintenance: "TERMINATE"` (required for GPU VMs)
   - Same NAT/networking as existing runners
4. [x] Create GPU startup script (`github_runner_gpu.sh`):
   - Install NVIDIA drivers via CUDA repo (cuda-drivers + cuda-toolkit-12-4)
   - Install CUDA toolkit
   - Verify GPU: `nvidia-smi`
   - Pre-download models to `/opt/audio-separator-models`
   - Register as GHA runner with `gpu` label
5. [x] Add spot scheduling for cost optimization
6. [ ] Run `pulumi up` to create the VM

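The VM settings listed in step 3 can be sketched as plain Python data. This is a minimal illustration, not the contents of `github_runners.py`: the `gpu_runner_instance_args` helper is hypothetical, and the key names are assumed to mirror the inputs of `pulumi_gcp.compute.Instance` (`guest_accelerators`, `scheduling`). Boot disk, networking, and the startup script are omitted.

```python
GPU_RUNNER_LABELS = "self-hosted,linux,x64,gcp,gpu"  # value from the plan


def gpu_runner_instance_args(zone="us-central1-a", spot=True):
    """Build keyword arguments shaped like a GCE instance definition.

    Hypothetical helper: key names are assumed to match the
    pulumi_gcp.compute.Instance inputs.
    """
    scheduling = {
        # GPU VMs cannot live-migrate, so maintenance must terminate them.
        "on_host_maintenance": "TERMINATE",
    }
    if spot:
        scheduling.update({
            "provisioning_model": "SPOT",
            "preemptible": True,
            # Spot VMs are not restarted automatically after preemption.
            "automatic_restart": False,
        })
    return {
        "machine_type": "n1-standard-4",
        "zone": zone,
        "guest_accelerators": [{"type": "nvidia-tesla-t4", "count": 1}],
        "scheduling": scheduling,
    }


args = gpu_runner_instance_args()
print(args["guest_accelerators"], args["scheduling"])
```

Keeping the settings in one function makes it easy to switch between spot and on-demand while the other VM parameters stay identical.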
### 2. Workflow update (in python-audio-separator repo)

7. [x] Update `run-integration-tests.yaml`:
   - Change `runs-on: self-hosted` to `runs-on: [self-hosted, gpu]`
   - Change `poetry install -E cpu` to `poetry install -E gpu`
   - Add `nvidia-smi` verification step
   - Add 30-minute timeout
8. [ ] Add fallback: if no GPU runner available, fall back to CPU with longer timeout
   - Deferred: not needed initially, the runner_manager auto-starts the GPU VM on demand

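Beyond `nvidia-smi`, it can be useful to confirm that the Python packages installed with `-E gpu` actually see the GPU. The following is a minimal sketch of such a check, not part of this commit; the `detect_gpu_backends` helper is hypothetical. It uses `onnxruntime.get_available_providers()` and `torch.cuda.is_available()` and treats a missing package as "no GPU support" rather than failing.

```python
import shutil


def detect_gpu_backends():
    """Report which GPU-related tools and libraries are usable here."""
    report = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "onnxruntime_cuda": False,
        "torch_cuda": False,
    }
    try:
        import onnxruntime
        providers = onnxruntime.get_available_providers()
        report["onnxruntime_cuda"] = "CUDAExecutionProvider" in providers
    except ImportError:
        pass  # onnxruntime is not installed in this environment
    try:
        import torch
        report["torch_cuda"] = bool(torch.cuda.is_available())
    except ImportError:
        pass  # torch is not installed in this environment
    return report


if __name__ == "__main__":
    for name, ok in detect_gpu_backends().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A CI step could fail the job when `onnxruntime_cuda` is false, which would catch a silent fallback to CPU inference.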
### 3. Startup script details

The GPU startup script needs to:
```bash
# Install NVIDIA drivers (for Debian 12)
sudo apt-get update
sudo apt-get install -y "linux-headers-$(uname -r)" nvidia-driver-535

# Verify GPU
nvidia-smi

# Install CUDA (for PyTorch)
# PyTorch bundles its own CUDA, so we mainly need the driver

# Pre-download models
pip install 'audio-separator[gpu]'
python -c "
from audio_separator.separator import Separator
sep = Separator(model_file_dir='/opt/audio-separator-models')
# Download all models used in integration tests
models = [
    'model_bs_roformer_ep_317_sdr_12.9755.ckpt',
    'mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt',
    'MGM_MAIN_v4.pth',
    'UVR-MDX-NET-Inst_HQ_4.onnx',
    'kuielab_b_vocals.onnx',
    '2_HP-UVR.pth',
    'htdemucs_6s.yaml',
    'htdemucs_ft.yaml',
    # Ensemble preset models
    'bs_roformer_vocals_resurrection_unwa.ckpt',
    'melband_roformer_big_beta6x.ckpt',
    'bs_roformer_vocals_revive_v2_unwa.ckpt',
    'mel_band_roformer_kim_ft2_bleedless_unwa.ckpt',
    'bs_roformer_vocals_revive_v3e_unwa.ckpt',
    'mel_band_roformer_vocals_becruily.ckpt',
    'mel_band_roformer_vocals_fv4_gabox.ckpt',
    'mel_band_roformer_instrumental_fv7z_gabox.ckpt',
    'bs_roformer_instrumental_resurrection_unwa.ckpt',
    'melband_roformer_inst_v1e_plus.ckpt',
    'mel_band_roformer_instrumental_becruily.ckpt',
    'mel_band_roformer_instrumental_instv8_gabox.ckpt',
    'UVR-MDX-NET-Inst_HQ_5.onnx',
    'mel_band_roformer_karaoke_gabox_v2.ckpt',
    'mel_band_roformer_karaoke_becruily.ckpt',
    # Multi-stem test models
    '17_HP-Wind_Inst-UVR.pth',
    'MDX23C-DrumSep-aufr33-jarredou.ckpt',
    'dereverb_mel_band_roformer_anvuew_sdr_19.1729.ckpt',
]
for m in models:
    sep.download_model_and_data(m)
"
```

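Since the workflow relies on the models being present on disk, a small check for missing files can catch an incomplete pre-download. The sketch below is an illustration under assumptions: `missing_models` is a hypothetical helper, it only tests for files with the listed names, and some entries (for example the `htdemucs_*.yaml` ones) may resolve to additional files that it does not look for.

```python
import os
from pathlib import Path


def missing_models(model_dir, expected_files):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files if not (root / name).is_file()]


if __name__ == "__main__":
    # Same directory the workflow exposes via AUDIO_SEPARATOR_MODEL_DIR.
    model_dir = os.environ.get(
        "AUDIO_SEPARATOR_MODEL_DIR", "/opt/audio-separator-models"
    )
    # A short sample of the list above; extend as needed.
    sample = ["MGM_MAIN_v4.pth", "UVR-MDX-NET-Inst_HQ_4.onnx"]
    missing = missing_models(model_dir, sample)
    if missing:
        print("Missing models:", ", ".join(missing))
    else:
        print("All sampled models are cached.")
```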
## Cost Estimate

| Config | Hourly | Per CI run | Monthly (est. 100 runs) |
|--------|--------|------------|-------------------------|
| n1-standard-4 + T4 (on-demand, ~10 min run) | $0.61 | $0.10 | $10 |
| n1-standard-4 + T4 (spot, ~10 min run) | $0.19 | $0.03 | $3 |
| Current CPU (e2-standard-4, ~30 min run) | $0.13 | $0.07 | $7 |

Spot GPU is cheaper per run than the current CPU runner because GPU tests finish several times faster.

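The per-run and monthly figures follow from hourly rate × run duration. As a quick sketch of that arithmetic, the snippet below recomputes them from the hourly rates in the table. The 10-minute GPU duration comes from the table; the 30-minute CPU duration is an assumption inferred from the "30+ minutes" figure in the Overview.

```python
def cost_per_run(hourly_usd, minutes):
    """Cost of a single CI run billed at an hourly rate."""
    return hourly_usd * minutes / 60


def monthly_cost(hourly_usd, minutes, runs=100):
    """Estimated monthly cost for a given number of runs."""
    return cost_per_run(hourly_usd, minutes) * runs


# (hourly rate in USD, assumed run duration in minutes)
CONFIGS = {
    "n1-standard-4 + T4 (on-demand)": (0.61, 10),
    "n1-standard-4 + T4 (spot)": (0.19, 10),
    "Current CPU (e2-standard-4)": (0.13, 30),  # assumed 30-minute run
}

for name, (hourly, minutes) in CONFIGS.items():
    per_run = cost_per_run(hourly, minutes)
    per_month = monthly_cost(hourly, minutes)
    print(f"{name}: ${per_run:.3f} per run, ${per_month:.2f} per month")
```

The comparison is sensitive to the duration assumption: the spot GPU stays cheaper per run only while its runs remain a few times shorter than the CPU runs.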
## Files to Create/Modify

| File | Repo | Action |
|------|------|--------|
| `infrastructure/config.py` | karaoke-gen | Add GPU machine type + labels |
| `infrastructure/compute/github_runners.py` | karaoke-gen | Add GPU runner VM |
| `infrastructure/compute/startup_scripts/github_runner_gpu.sh` | karaoke-gen | GPU-specific startup |
| `.github/workflows/run-integration-tests.yaml` | python-audio-separator | Target GPU runner |

## Open Questions

- [x] Should the GPU runner be spot or on-demand? → **Spot** ($0.19/hr, ~$3/mo)
- [x] Should we keep the CPU fallback for when GPU runner is unavailable? → **Deferred** (runner_manager auto-starts VM)
- [x] Should the runner startup script install NVIDIA drivers from scratch each boot,
  or use a pre-built GCP Deep Learning VM image? → **From scratch** (idempotent, matches existing pattern)
- [x] Zone availability: T4 GPUs may not be available in us-central1-a → **Available** in all us-central1 zones (a, b, c, f)

## Rollback Plan

The GPU runner is additive infrastructure. If it fails:
1. Change workflow back to `runs-on: self-hosted` (CPU)
2. Destroy the GPU VM via `pulumi destroy` targeting just that resource

.github/workflows/run-integration-tests.yaml

Lines changed: 13 additions & 4 deletions
@@ -23,17 +23,26 @@ jobs:
   run-integration-test:
     needs: changes
     if: needs.changes.outputs.should_run == 'true'
-    runs-on: self-hosted
+    runs-on: [self-hosted, gpu]
+    timeout-minutes: 30
     env:
       # Use persistent local directory on self-hosted runners instead of actions/cache.
       # Models are pre-downloaded to this path by the runner startup script, so there's
-      # no need to download ~2GB of models on every CI run.
+      # no need to download ~14GB of models on every CI run.
       AUDIO_SEPARATOR_MODEL_DIR: /opt/audio-separator-models
 
     steps:
       - name: Checkout project
         uses: actions/checkout@v4
 
+      - name: Verify GPU availability
+        run: |
+          echo "=== GPU Info ==="
+          nvidia-smi
+          echo ""
+          echo "=== CUDA Version ==="
+          nvidia-smi --query-gpu=driver_version,cuda_version --format=csv,noheader
+
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
@@ -59,8 +68,8 @@ jobs:
           python-version: '3.13'
           cache: poetry
 
-      - name: Install Poetry dependencies
-        run: poetry install -E cpu
+      - name: Install Poetry dependencies (GPU)
+        run: poetry install -E gpu
 
       - name: Verify pre-cached models
         run: |
