This document explains how the GPU-based integration test infrastructure works for this repo.
Integration tests require GPU hardware to run ML model inference. GPU VMs are expensive (~$1.62/hr for 3x T4), so they auto-scale to zero when idle. The system automatically starts runners when CI jobs need them and stops them after 15 minutes of inactivity.
GitHub webhook (workflow_job.queued)
│
▼
Cloud Function (github-runner-manager)
│
├── Job has "gpu" label? → Start GPU runners (3x n1-standard-4 + T4)
├── Job has "self-hosted" label? → Start CPU runners
└── Neither? → Ignore

Cloud Scheduler (every 15 min)
│
▼
Cloud Function (?action=check_idle)
│
└── No pending jobs + runner idle > 15 min? → Stop runner
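The two decisions in the diagram can be sketched in Python as follows. This is a minimal illustration and not the code in `main.py`: the function names `route_job` and `should_stop`, the returned action strings, and the argument shapes are assumptions. Only the label names and the 15-minute threshold come from the diagram.

```python
from datetime import datetime, timedelta, timezone

IDLE_TIMEOUT = timedelta(minutes=15)  # threshold taken from the diagram


def route_job(labels):
    """Map a queued job's labels to an action (hypothetical action names)."""
    wanted = {label.lower() for label in labels}
    # "gpu" is checked first because a job may carry both "gpu" and "self-hosted".
    if "gpu" in wanted:
        return "start_gpu_runners"
    if "self-hosted" in wanted:
        return "start_cpu_runners"
    return "ignore"


def should_stop(pending_jobs, last_active, now):
    """Stop only when nothing is queued and the runner has idled past the timeout."""
    return pending_jobs == 0 and (now - last_active) > IDLE_TIMEOUT
```

Checking `gpu` before `self-hosted` mirrors the order of the branches in the diagram, so a job that lists both labels is routed to the GPU runners.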
| Component | Location | Purpose |
|---|---|---|
| Cloud Function | karaoke-gen/infrastructure/functions/runner_manager/main.py | Starts/stops runner VMs based on demand |
| Pulumi module | karaoke-gen/infrastructure/modules/runner_manager.py | Deploys the function, scheduler, and IAM |
| GPU VM definitions | karaoke-gen/infrastructure/compute/github_runners.py | 3x n1-standard-4 with T4 GPU |
| GPU startup script | karaoke-gen/infrastructure/compute/startup_scripts/github_runner_gpu.sh | Installs NVIDIA drivers, Python, registers runner |
| Config | karaoke-gen/infrastructure/config.py | Runner count, labels, idle timeout |
| GitHub webhook | Org-level (nomadkaraoke) | Sends workflow_job events to Cloud Function |
- Count: 3 (configurable via `NUM_GPU_RUNNERS` in config.py)
- Machine type: n1-standard-4 (4 vCPU, 15GB RAM) + 1x NVIDIA T4
- Zone: us-central1-a
- Labels: `self-hosted, linux, x64, gcp, gpu`
- Startup time: ~15-20 min (NVIDIA driver install, Python build, model download)
- Model cache: ~14GB of ML models pre-downloaded to `/opt/audio-separator-models/`
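As a rough illustration, the settings listed above could be grouped in Python as shown below. The `GpuRunnerSpec` class and the `runner_names` helper are hypothetical and do not reflect the actual layout of config.py. Only `NUM_GPU_RUNNERS`, the listed values, and the `github-gpu-runner-N` naming pattern come from this document.

```python
from dataclasses import dataclass

NUM_GPU_RUNNERS = 3  # name from config.py; value from the list above


@dataclass(frozen=True)
class GpuRunnerSpec:
    """Hypothetical container for the GPU runner settings listed above."""
    machine_type: str = "n1-standard-4"
    gpu: str = "NVIDIA T4"
    gpu_count: int = 1
    zone: str = "us-central1-a"
    labels: tuple = ("self-hosted", "linux", "x64", "gcp", "gpu")
    model_cache_dir: str = "/opt/audio-separator-models/"


def runner_names(count=NUM_GPU_RUNNERS):
    """Return VM names following the github-gpu-runner-N pattern used in this document."""
    return [f"github-gpu-runner-{i}" for i in range(1, count + 1)]
```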
The Protect main ruleset (ID: 529535) requires these checks to pass before merge:
- `unit-tests`: from `run-unit-tests.yaml` (runs on GitHub-hosted runners)
- `ensemble-presets`: from `run-integration-tests.yaml` (runs on GPU runners)
- `core-models`: from `run-integration-tests.yaml` (runs on GPU runners)
- `stems-and-quality`: from `run-integration-tests.yaml` (runs on GPU runners)
IMPORTANT: If integration test job names change (e.g., splitting or renaming jobs), you MUST update the ruleset to match. The ruleset is configured at: https://github.com/nomadkaraoke/python-audio-separator/settings/rules/529535
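Because the required check names must match the workflow job names exactly, one option is to generate the ruleset body from a list of job names instead of editing JSON by hand. The sketch below is hypothetical (the function name `build_ruleset_payload` is an assumption). It reuses the ruleset fields and the integration ID 15368 that appear in this document's payload, and it has not been validated against the GitHub API.

```python
import json

INTEGRATION_ID = 15368  # value used in this document's ruleset payload


def build_ruleset_payload(check_names):
    """Return the ruleset body with one required status check per job name."""
    checks = [
        {"context": name, "integration_id": INTEGRATION_ID}
        for name in check_names
    ]
    return {
        "name": "Protect main",
        "enforcement": "active",
        "target": "branch",
        "conditions": {"ref_name": {"include": ["~DEFAULT_BRANCH"], "exclude": []}},
        "rules": [
            {"type": "deletion"},
            {"type": "pull_request", "parameters": {
                "required_approving_review_count": 0,
                "allowed_merge_methods": ["squash"],
            }},
            {"type": "required_status_checks", "parameters": {
                "required_status_checks": checks,
            }},
        ],
    }


if __name__ == "__main__":
    names = ["unit-tests", "ensemble-presets", "core-models", "stems-and-quality"]
    print(json.dumps(build_ruleset_payload(names), indent=2))
```

Saved as a script (the filename `ruleset_payload.py` is hypothetical), its output could be piped to `gh api repos/nomadkaraoke/python-audio-separator/rulesets/529535 --method PUT --input -`.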
To update via API:
gh api repos/nomadkaraoke/python-audio-separator/rulesets/529535 \
  --method PUT --input - <<'EOF'
{
  "name": "Protect main",
  "enforcement": "active",
  "target": "branch",
  "conditions": {"ref_name": {"include": ["~DEFAULT_BRANCH"], "exclude": []}},
  "rules": [
    {"type": "deletion"},
    {"type": "pull_request", "parameters": {
      "required_approving_review_count": 0,
      "allowed_merge_methods": ["squash"]
    }},
    {"type": "required_status_checks", "parameters": {
      "required_status_checks": [
        {"context": "unit-tests", "integration_id": 15368},
        {"context": "JOB_NAME_HERE", "integration_id": 15368}
      ]
    }}
  ]
}
EOF

Symptoms: PR checks show pending for `ensemble-presets`, `core-models`, `stems-and-quality`.
Diagnosis steps:
- Check if GPU runners are online:
  gh api orgs/nomadkaraoke/actions/runners \
    --jq '.runners[] | select(.labels[].name == "gpu") | {name, status, busy}'
- Check if GPU VMs exist:
  gcloud compute instances list --project=nomadkaraoke --filter="name~gpu"
- Check Cloud Function logs for webhook delivery:
  gcloud logging read 'resource.labels.service_name="github-runner-manager"' \
    --project=nomadkaraoke --limit=20 \
    --format="value(timestamp,textPayload,jsonPayload.message)"
- Check GPU runner startup logs (if VMs are RUNNING but GitHub shows offline):
  gcloud compute ssh github-gpu-runner-1 --zone=us-central1-a --project=nomadkaraoke \
    --command="tail -50 /var/log/github-runner-startup.log"
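The runner check can also be done in Python when the raw JSON from `gh api orgs/nomadkaraoke/actions/runners` (without `--jq`) is already at hand. The sketch below applies the same filter as the jq expression. The function names are assumptions, and it relies on the response having a top-level `runners` array whose entries carry `name`, `status`, `busy`, and a `labels` list of objects with a `name` field.

```python
def gpu_runner_summary(payload):
    """Return name/status/busy for every runner that has a "gpu" label."""
    summary = []
    for runner in payload.get("runners", []):
        label_names = {label["name"] for label in runner.get("labels", [])}
        if "gpu" in label_names:
            summary.append({
                "name": runner["name"],
                "status": runner["status"],
                "busy": runner["busy"],
            })
    return summary


def offline_gpu_runners(payload):
    """Names of GPU runners that GitHub does not report as online."""
    return [r["name"] for r in gpu_runner_summary(payload) if r["status"] != "online"]
```

An empty summary means no GPU runner is registered at all, which points to the VM and Cloud Function checks. A non-empty result from `offline_gpu_runners` points to the startup log check.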
If `gcloud compute instances list` shows no GPU runners but Pulumi state thinks they exist:
# 1. Remove stale state (from karaoke-gen/infrastructure/ dir)
pulumi state delete "urn:pulumi:prod::karaoke-gen-infrastructure::gcp:compute/instance:Instance::github-gpu-runner-1" --target-dependents --yes
pulumi state delete "urn:pulumi:prod::karaoke-gen-infrastructure::gcp:compute/instance:Instance::github-gpu-runner-2" --target-dependents --yes
pulumi state delete "urn:pulumi:prod::karaoke-gen-infrastructure::gcp:compute/instance:Instance::github-gpu-runner-3" --target-dependents --yes
# 2. Recreate
pulumi up --yes
# 3. Re-import dependent resources that got removed (runner-manager function, IAM, scheduler)
# Check `pulumi preview` for what needs importing

The startup script handles kernel header mismatches by upgrading the kernel and rebooting once. If the runner still fails:
# SSH in and check
gcloud compute ssh github-gpu-runner-1 --zone=us-central1-a --project=nomadkaraoke \
  --command="nvidia-smi; dkms status; uname -r"

See karaoke-gen memory file project_gpu_runner_drivers.md for known issues.
Check the org-level webhook configuration:
gh api orgs/nomadkaraoke/hooks \
  --jq '.[] | select(.events[] == "workflow_job") | {id, active, config: {url: .config.url}}'

The webhook URL should point to: https://us-central1-nomadkaraoke.cloudfunctions.net/github-runner-manager
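The same webhook check can be written as a small Python predicate over the JSON array returned by `gh api orgs/nomadkaraoke/hooks`. This is a sketch only: the function name `has_runner_webhook` is an assumption, and it assumes each hook object exposes `active`, `events`, and `config.url`, which are the fields the jq filter reads.

```python
EXPECTED_URL = "https://us-central1-nomadkaraoke.cloudfunctions.net/github-runner-manager"


def has_runner_webhook(hooks, expected_url=EXPECTED_URL):
    """True if an active hook sends workflow_job events to the expected URL."""
    for hook in hooks:
        subscribed = "workflow_job" in hook.get("events", [])
        url_matches = hook.get("config", {}).get("url") == expected_url
        if subscribed and url_matches and hook.get("active", False):
            return True
    return False
```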
| Scenario | Cost |
|---|---|
| Per GPU runner hour | ~$0.54/hr (n1-standard-4 + T4) |
| 3 runners × 15 min CI run | ~$0.41 |
| Idle (scale to zero) | $0 |
| Typical daily cost (5 PRs) | ~$2 |
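The figures in the table follow from the per-runner hourly rate. The sketch below reproduces the arithmetic, assuming each PR triggers one 15-minute run on all three runners and ignoring startup and idle-timeout minutes. The function names are illustrative.

```python
RATE_PER_RUNNER_HOUR = 0.54  # USD, n1-standard-4 + T4, from the table


def run_cost(runners, minutes, rate=RATE_PER_RUNNER_HOUR):
    """Cost in USD of keeping `runners` VMs up for `minutes`."""
    return runners * rate * minutes / 60


def daily_cost(prs, runners=3, minutes=15):
    """Cost in USD of `prs` CI runs per day, one run per PR (assumption)."""
    return prs * run_cost(runners, minutes)
```

With these assumptions, `run_cost(3, 60)` gives about 1.62, `run_cost(3, 15)` gives about 0.405 (the table's ~$0.41), and `daily_cost(5)` gives about 2.03 (the table's ~$2).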