CI GPU Runner Infrastructure

This document explains how the GPU-based integration test infrastructure works for this repo.

Overview

Integration tests require GPU hardware to run ML model inference. GPU VMs are expensive (~$1.62/hr for 3x T4), so they auto-scale to zero when idle. The system automatically starts runners when CI jobs need them and stops them after 15 minutes of inactivity.

Architecture

GitHub webhook (workflow_job.queued)
    │
    ▼
Cloud Function (github-runner-manager)
    │
    ├── Job has "gpu" label? → Start GPU runners (3x n1-standard-4 + T4)
    ├── Job has "self-hosted" label? → Start CPU runners
    └── Neither? → Ignore

Cloud Scheduler (every 15 min)
    │
    ▼
Cloud Function (?action=check_idle)
    │
    └── No pending jobs + runner idle > 15 min? → Stop runner

Components

Component	Location	Purpose
Cloud Function	`karaoke-gen/infrastructure/functions/runner_manager/main.py`	Starts/stops runner VMs based on demand
Pulumi module	`karaoke-gen/infrastructure/modules/runner_manager.py`	Deploys the function, scheduler, and IAM
GPU VM definitions	`karaoke-gen/infrastructure/compute/github_runners.py`	3x n1-standard-4 with T4 GPU
GPU startup script	`karaoke-gen/infrastructure/compute/startup_scripts/github_runner_gpu.sh`	Installs NVIDIA drivers, Python, registers runner
Config	`karaoke-gen/infrastructure/config.py`	Runner count, labels, idle timeout
GitHub webhook	Org-level (`nomadkaraoke`)	Sends `workflow_job` events to Cloud Function

GPU Runner VMs

Count: 3 (configurable via NUM_GPU_RUNNERS in config.py)
Machine type: n1-standard-4 (4 vCPU, 15GB RAM) + 1x NVIDIA T4
Zone: us-central1-a
Labels: self-hosted, linux, x64, gcp, gpu
Startup time: ~15-20 min (NVIDIA driver install, Python build, model download)
Model cache: ~14GB of ML models pre-downloaded to /opt/audio-separator-models/

Required GitHub Branch Protection Checks

The Protect main ruleset (ID: 529535) requires these checks to pass before merge:

unit-tests — from run-unit-tests.yaml (runs on GitHub-hosted runners)
ensemble-presets — from run-integration-tests.yaml (runs on GPU runners)
core-models — from run-integration-tests.yaml (runs on GPU runners)
stems-and-quality — from run-integration-tests.yaml (runs on GPU runners)

IMPORTANT: If integration test job names change (e.g., splitting or renaming jobs), you MUST update the ruleset to match. The ruleset is configured at: https://github.com/nomadkaraoke/python-audio-separator/settings/rules/529535

To update via API:

gh api repos/nomadkaraoke/python-audio-separator/rulesets/529535 \
  --method PUT --input - <<'EOF'
{
  "name": "Protect main",
  "enforcement": "active",
  "target": "branch",
  "conditions": {"ref_name": {"include": ["~DEFAULT_BRANCH"], "exclude": []}},
  "rules": [
    {"type": "deletion"},
    {"type": "pull_request", "parameters": {
      "required_approving_review_count": 0,
      "allowed_merge_methods": ["squash"]
    }},
    {"type": "required_status_checks", "parameters": {
      "required_status_checks": [
        {"context": "unit-tests", "integration_id": 15368},
        {"context": "JOB_NAME_HERE", "integration_id": 15368}
      ]
    }}
  ]
}
EOF

Troubleshooting

Integration tests stuck in "queued"

Symptoms: PR checks show pending for ensemble-presets, core-models, stems-and-quality.

Diagnosis steps:

Check if GPU runners are online:

gh api orgs/nomadkaraoke/actions/runners \
  --jq '.runners[] | select(.labels[].name == "gpu") | {name, status, busy}'

Check if GPU VMs exist:

gcloud compute instances list --project=nomadkaraoke --filter="name~gpu"

Check Cloud Function logs for webhook delivery:

gcloud logging read 'resource.labels.service_name="github-runner-manager"' \
  --project=nomadkaraoke --limit=20 \
  --format="value(timestamp,textPayload,jsonPayload.message)"

Check GPU runner startup logs (if VMs are RUNNING but GitHub shows offline):

gcloud compute ssh github-gpu-runner-1 --zone=us-central1-a --project=nomadkaraoke \
  --command="tail -50 /var/log/github-runner-startup.log"

GPU VMs don't exist

If gcloud compute instances list shows no GPU runners but Pulumi state thinks they exist:

# 1. Remove stale state (from karaoke-gen/infrastructure/ dir)
pulumi state delete "urn:pulumi:prod::karaoke-gen-infrastructure::gcp:compute/instance:Instance::github-gpu-runner-1" --target-dependents --yes
pulumi state delete "urn:pulumi:prod::karaoke-gen-infrastructure::gcp:compute/instance:Instance::github-gpu-runner-2" --target-dependents --yes
pulumi state delete "urn:pulumi:prod::karaoke-gen-infrastructure::gcp:compute/instance:Instance::github-gpu-runner-3" --target-dependents --yes

# 2. Recreate
pulumi up --yes

# 3. Re-import dependent resources that got removed (runner-manager function, IAM, scheduler)
# Check `pulumi preview` for what needs importing

GPU runner startup fails (NVIDIA driver issues)

The startup script handles kernel header mismatches by upgrading the kernel and rebooting once. If the runner still fails:

# SSH in and check
gcloud compute ssh github-gpu-runner-1 --zone=us-central1-a --project=nomadkaraoke \
  --command="nvidia-smi; dkms status; uname -r"

See karaoke-gen memory file project_gpu_runner_drivers.md for known issues.

Webhook not firing

Check the org-level webhook configuration:

gh api orgs/nomadkaraoke/hooks \
  --jq '.[] | select(.events[] == "workflow_job") | {id, active, config: {url: .config.url}}'

The webhook URL should point to: https://us-central1-nomadkaraoke.cloudfunctions.net/github-runner-manager

Cost

Scenario	Cost
Per GPU runner hour	~$0.54/hr (n1-standard-4 + T4)
3 runners × 15 min CI run	~$0.41
Idle (scale to zero)	$0
Typical daily cost (5 PRs)	~$2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CI GPU Runner Infrastructure

Overview

Architecture

Components

GPU Runner VMs

Required GitHub Branch Protection Checks

Troubleshooting

Integration tests stuck in "queued"

GPU VMs don't exist

GPU runner startup fails (NVIDIA driver issues)

Webhook not firing

Cost

Uh oh!

FilesExpand file tree

CI-GPU-RUNNERS.md

Latest commit

History

CI-GPU-RUNNERS.md

File metadata and controls

CI GPU Runner Infrastructure

Overview

Architecture

Components

GPU Runner VMs

Required GitHub Branch Protection Checks

Troubleshooting

Integration tests stuck in "queued"

GPU VMs don't exist

GPU runner startup fails (NVIDIA driver issues)

Webhook not firing

Cost