|
| 1 | +# CI GPU Runner Infrastructure |
| 2 | + |
| 3 | +This document explains how the GPU-based integration test infrastructure works for this repo. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
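| 7 | +Integration tests require GPU hardware to run ML model inference. GPU VMs are expensive (~$1.62/hr for the three T4 runners combined), so they scale to zero when idle. The system starts runners when CI jobs need them and stops them once they have been idle for more than 15 minutes. Because the idle check itself runs every 15 minutes, a runner can stay up for as long as 30 minutes after its last job. |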
| 8 | + |
| 9 | +## Architecture |
| 10 | + |
| 11 | +``` |
| 12 | +GitHub webhook (workflow_job.queued) |
| 13 | + │ |
| 14 | + ▼ |
| 15 | +Cloud Function (github-runner-manager) |
| 16 | + │ |
| 17 | + ├── Job has "gpu" label? → Start GPU runners (3x n1-standard-4 + T4) |
| 18 | + ├── Job has "self-hosted" label? → Start CPU runners |
| 19 | + └── Neither? → Ignore |
| 20 | +
|
| 21 | +Cloud Scheduler (every 15 min) |
| 22 | + │ |
| 23 | + ▼ |
| 24 | +Cloud Function (?action=check_idle) |
| 25 | + │ |
| 26 | + └── No pending jobs + runner idle > 15 min? → Stop runner |
| 27 | +``` |
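
The two decisions in the diagram above can be sketched in Python. This is an illustration only, not the code in `main.py`: the function names `route_job` and `should_stop_runner`, their return values, and the `IDLE_TIMEOUT` constant are hypothetical, while the label names and the 15-minute threshold come from the diagram. The sketch assumes the `gpu` label is checked first, since GPU runners also carry the `self-hosted` label.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical constant for this sketch; the real timeout lives in config.py.
IDLE_TIMEOUT = timedelta(minutes=15)


def route_job(labels):
    """Return the runner pool a queued job needs: "gpu", "cpu", or None."""
    wanted = set(labels)
    if "gpu" in wanted:
        return "gpu"  # GPU jobs also carry "self-hosted", so test "gpu" first
    if "self-hosted" in wanted:
        return "cpu"
    return None  # GitHub-hosted job: nothing to start


def should_stop_runner(pending_jobs, last_busy, now):
    """Stop a runner only if no jobs are pending and it idled past the timeout."""
    return pending_jobs == 0 and (now - last_busy) > IDLE_TIMEOUT


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(route_job(["self-hosted", "linux", "x64", "gcp", "gpu"]))
    print(should_stop_runner(0, now - timedelta(minutes=20), now))
```

Keeping both checks as pure functions makes the branching easy to unit-test without touching GCP or GitHub.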
| 28 | + |
| 29 | +### Components |
| 30 | + |
| 31 | +| Component | Location | Purpose | |
| 32 | +|-----------|----------|---------| |
| 33 | +| Cloud Function | `karaoke-gen/infrastructure/functions/runner_manager/main.py` | Starts/stops runner VMs based on demand | |
| 34 | +| Pulumi module | `karaoke-gen/infrastructure/modules/runner_manager.py` | Deploys the function, scheduler, and IAM | |
| 35 | +| GPU VM definitions | `karaoke-gen/infrastructure/compute/github_runners.py` | 3x n1-standard-4 with T4 GPU | |
| 36 | +| GPU startup script | `karaoke-gen/infrastructure/compute/startup_scripts/github_runner_gpu.sh` | Installs NVIDIA drivers, Python, registers runner | |
| 37 | +| Config | `karaoke-gen/infrastructure/config.py` | Runner count, labels, idle timeout | |
| 38 | +| GitHub webhook | Org-level (`nomadkaraoke`) | Sends `workflow_job` events to Cloud Function | |
| 39 | + |
| 40 | +### GPU Runner VMs |
| 41 | + |
| 42 | +- **Count**: 3 (configurable via `NUM_GPU_RUNNERS` in `config.py`) |
| 43 | +- **Machine type**: n1-standard-4 (4 vCPU, 15GB RAM) + 1x NVIDIA T4 |
| 44 | +- **Zone**: us-central1-a |
| 45 | +- **Labels**: `self-hosted, linux, x64, gcp, gpu` |
| 46 | +- **Startup time**: ~15-20 min (NVIDIA driver install, Python build, model download) |
| 47 | +- **Model cache**: ~14GB of ML models pre-downloaded to `/opt/audio-separator-models/` |
| 48 | + |
| 49 | +### Required GitHub Branch Protection Checks |
| 50 | + |
| 51 | +The `Protect main` ruleset (ID: 529535) requires these checks to pass before merge: |
| 52 | + |
| 53 | +- `unit-tests` — from `run-unit-tests.yaml` (runs on GitHub-hosted runners) |
| 54 | +- `ensemble-presets` — from `run-integration-tests.yaml` (runs on GPU runners) |
| 55 | +- `core-models` — from `run-integration-tests.yaml` (runs on GPU runners) |
| 56 | +- `stems-and-quality` — from `run-integration-tests.yaml` (runs on GPU runners) |
| 57 | + |
| 58 | +**IMPORTANT**: If integration test job names change (e.g., splitting or renaming jobs), you MUST update the ruleset to match. The ruleset is configured at: |
| 59 | +https://github.com/nomadkaraoke/python-audio-separator/settings/rules/529535 |
| 60 | + |
| 61 | +To update via API, replace `JOB_NAME_HERE` with the job name and add one entry per required check: |
| 62 | +```bash |
| 63 | +gh api repos/nomadkaraoke/python-audio-separator/rulesets/529535 \ |
| 64 | + --method PUT --input - <<'EOF' |
| 65 | +{ |
| 66 | + "name": "Protect main", |
| 67 | + "enforcement": "active", |
| 68 | + "target": "branch", |
| 69 | + "conditions": {"ref_name": {"include": ["~DEFAULT_BRANCH"], "exclude": []}}, |
| 70 | + "rules": [ |
| 71 | + {"type": "deletion"}, |
| 72 | + {"type": "pull_request", "parameters": { |
| 73 | + "required_approving_review_count": 0, |
| 74 | + "allowed_merge_methods": ["squash"] |
| 75 | + }}, |
| 76 | + {"type": "required_status_checks", "parameters": { |
| 77 | + "required_status_checks": [ |
| 78 | + {"context": "unit-tests", "integration_id": 15368}, |
| 79 | + {"context": "JOB_NAME_HERE", "integration_id": 15368} |
| 80 | + ] |
| 81 | + }} |
| 82 | + ] |
| 83 | +} |
| 84 | +EOF |
| 85 | +``` |
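
To avoid hand-editing the JSON when job names change, the request body could be generated from a list of job names. The following is a minimal sketch under that assumption: `build_ruleset_payload` is a hypothetical helper, and every field value is copied from the example request above rather than read from the live ruleset.

```python
import json

# Copied from the example request above.
INTEGRATION_ID = 15368


def build_ruleset_payload(job_names):
    """Build the PUT body with one required status check per CI job name."""
    checks = [
        {"context": name, "integration_id": INTEGRATION_ID} for name in job_names
    ]
    return {
        "name": "Protect main",
        "enforcement": "active",
        "target": "branch",
        "conditions": {"ref_name": {"include": ["~DEFAULT_BRANCH"], "exclude": []}},
        "rules": [
            {"type": "deletion"},
            {
                "type": "pull_request",
                "parameters": {
                    "required_approving_review_count": 0,
                    "allowed_merge_methods": ["squash"],
                },
            },
            {
                "type": "required_status_checks",
                "parameters": {"required_status_checks": checks},
            },
        ],
    }


if __name__ == "__main__":
    payload = build_ruleset_payload(
        ["unit-tests", "ensemble-presets", "core-models", "stems-and-quality"]
    )
    print(json.dumps(payload, indent=2))
```

The printed JSON can be piped into the same `gh api ... --method PUT --input -` command in place of the heredoc.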
| 86 | + |
| 87 | +## Troubleshooting |
| 88 | + |
| 89 | +### Integration tests stuck in "queued" |
| 90 | + |
| 91 | +**Symptoms**: PR checks show `pending` for `ensemble-presets`, `core-models`, `stems-and-quality`. |
| 92 | + |
| 93 | +**Diagnosis steps**: |
| 94 | + |
| 95 | +1. Check if GPU runners are online: |
| 96 | + ```bash |
| 97 | + gh api orgs/nomadkaraoke/actions/runners \ |
| 98 | + --jq '.runners[] | select(.labels[].name == "gpu") | {name, status, busy}' |
| 99 | + ``` |
| 100 | + |
| 101 | +2. Check if GPU VMs exist: |
| 102 | + ```bash |
| 103 | + gcloud compute instances list --project=nomadkaraoke --filter="name~gpu" |
| 104 | + ``` |
| 105 | + |
| 106 | +3. Check Cloud Function logs for webhook delivery: |
| 107 | + ```bash |
| 108 | + gcloud logging read 'resource.labels.service_name="github-runner-manager"' \ |
| 109 | + --project=nomadkaraoke --limit=20 \ |
| 110 | + --format="value(timestamp,textPayload,jsonPayload.message)" |
| 111 | + ``` |
| 112 | + |
| 113 | +4. Check GPU runner startup logs (if the VMs are `RUNNING` but GitHub shows the runners as offline): |
| 114 | + ```bash |
| 115 | + gcloud compute ssh github-gpu-runner-1 --zone=us-central1-a --project=nomadkaraoke \ |
| 116 | + --command="tail -50 /var/log/github-runner-startup.log" |
| 117 | + ``` |
| 118 | + |
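
The `--jq` filter in step 1 can also be expressed in Python when the runner list needs to feed a script. This is a sketch rather than an existing tool: `gpu_runner_summary` and `any_gpu_runner_online` are hypothetical names, and the sample response is made-up data that follows the shape the step 1 filter reads (`runners[]` with `name`, `status`, `busy`, and `labels[].name`).

```python
def gpu_runner_summary(response):
    """Return name/status/busy for every runner that carries the "gpu" label."""
    summary = []
    for runner in response.get("runners", []):
        label_names = {label["name"] for label in runner.get("labels", [])}
        if "gpu" in label_names:
            summary.append(
                {
                    "name": runner["name"],
                    "status": runner["status"],
                    "busy": runner["busy"],
                }
            )
    return summary


def any_gpu_runner_online(response):
    """True if at least one GPU runner reports status "online"."""
    return any(r["status"] == "online" for r in gpu_runner_summary(response))


if __name__ == "__main__":
    # Illustrative response only; real data comes from `gh api`.
    sample = {
        "runners": [
            {
                "name": "github-gpu-runner-1",
                "status": "offline",
                "busy": False,
                "labels": [{"name": "self-hosted"}, {"name": "gpu"}],
            },
            {
                "name": "example-cpu-runner",  # hypothetical name
                "status": "online",
                "busy": True,
                "labels": [{"name": "self-hosted"}],
            },
        ]
    }
    print(gpu_runner_summary(sample))
    print(any_gpu_runner_online(sample))
```

If no GPU runner is online, continue with steps 2 to 4 to find out whether the VMs are missing, stopped, or still starting up.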
| 119 | +### GPU VMs don't exist |
| 120 | + |
| 121 | +If `gcloud compute instances list` shows no GPU runners but the Pulumi state still lists them: |
| 122 | + |
| 123 | +```bash |
| 124 | +# 1. Remove stale state (from karaoke-gen/infrastructure/ dir) |
| 125 | +pulumi state delete "urn:pulumi:prod::karaoke-gen-infrastructure::gcp:compute/instance:Instance::github-gpu-runner-1" --target-dependents --yes |
| 126 | +pulumi state delete "urn:pulumi:prod::karaoke-gen-infrastructure::gcp:compute/instance:Instance::github-gpu-runner-2" --target-dependents --yes |
| 127 | +pulumi state delete "urn:pulumi:prod::karaoke-gen-infrastructure::gcp:compute/instance:Instance::github-gpu-runner-3" --target-dependents --yes |
| 128 | + |
| 129 | +# 2. Recreate |
| 130 | +pulumi up --yes |
| 131 | + |
| 132 | +# 3. Re-import dependent resources that got removed (runner-manager function, IAM, scheduler) |
| 133 | +# Check `pulumi preview` for what needs importing |
| 134 | +``` |
| 135 | + |
| 136 | +### GPU runner startup fails (NVIDIA driver issues) |
| 137 | + |
| 138 | +The startup script handles kernel header mismatches by upgrading the kernel and rebooting once. If the runner still fails: |
| 139 | + |
| 140 | +```bash |
| 141 | +# SSH in and check |
| 142 | +gcloud compute ssh github-gpu-runner-1 --zone=us-central1-a --project=nomadkaraoke \ |
| 143 | + --command="nvidia-smi; dkms status; uname -r" |
| 144 | +``` |
| 145 | + |
| 146 | +See `karaoke-gen` memory file `project_gpu_runner_drivers.md` for known issues. |
| 147 | + |
| 148 | +### Webhook not firing |
| 149 | + |
| 150 | +Check the org-level webhook configuration: |
| 151 | +```bash |
| 152 | +gh api orgs/nomadkaraoke/hooks \ |
| 153 | + --jq '.[] | select(.events[] == "workflow_job") | {id, active, config: {url: .config.url}}' |
| 154 | +``` |
| 155 | + |
| 156 | +The webhook URL should point to: `https://us-central1-nomadkaraoke.cloudfunctions.net/github-runner-manager` |
| 157 | + |
| 158 | +## Cost |
| 159 | + |
| 160 | +| Scenario | Cost | |
| 161 | +|----------|------| |
| 162 | +| Per GPU runner hour | ~$0.54/hr (n1-standard-4 + T4) | |
| 163 | +| 3 runners × 15 min CI run | ~$0.41 | |
| 164 | +| Idle (scale to zero) | $0 | |
| 165 | +| Typical daily cost (5 PRs) | ~$2 | |
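
The figures in the table follow from the per-runner rate, as this short sketch shows. The function name `run_cost` is hypothetical, and the daily estimate assumes one 15-minute CI run per PR. Only the run time itself is counted; startup time and the idle period before shutdown are not modelled.

```python
HOURLY_RATE_PER_RUNNER = 0.54  # USD, n1-standard-4 + T4 (from the table)
NUM_GPU_RUNNERS = 3


def run_cost(minutes, runners=NUM_GPU_RUNNERS, rate=HOURLY_RATE_PER_RUNNER):
    """Cost in USD of keeping `runners` VMs up for `minutes` minutes."""
    return runners * rate * minutes / 60


if __name__ == "__main__":
    print(f"All runners, one hour: ${run_cost(60):.3f}")
    print(f"All runners, 15 min CI run: ${run_cost(15):.3f}")
    print(f"Five 15 min runs per day: ${5 * run_cost(15):.3f}")
```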