chore(ci): trigger GPU integration tests on ephemeral runners#286

Closed
beveradb wants to merge 1 commit into main from chore/sess-20260517-1431-trigger-gpu-ci


@beveradb
Collaborator

Summary

  • Exercises the ephemeral n1+T4 GPU runner path after the dispatcher fix landed in karaoke-gen#776, "fix(runner-dispatcher): omit TERMINATE for e2 families (unblocks ephemeral cutover)".
  • The n1 (GPU) family was not affected by the e2 bug fixed there: n1 supports onHostMaintenance=TERMINATE, which is required because attached GPUs can't live-migrate. It has not been exercised under the ephemeral dispatcher since cutover (2026-05-17T04:56Z), though, so this is the verification PR.
  • The comment touch in audio_separator/separator/__init__.py flips the audio_separator/** path filter to make dorny/paths-filter run all three integration test jobs.
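As a rough illustration of why a one-line comment is enough to trigger the jobs, the sketch below evaluates changed paths against the `audio_separator/**` pattern. It is an approximation under stated assumptions: the filter name `audio_separator` is hypothetical, and Python's `fnmatch` is only a stand-in for the glob matcher that dorny/paths-filter actually uses, so edge cases may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical filter table; only the "audio_separator/**" pattern comes from the PR.
FILTERS = {"audio_separator": ["audio_separator/**"]}


def filter_outputs(changed_files, filters=FILTERS):
    """Return {filter_name: True/False} depending on whether any changed file matches."""
    return {
        name: any(
            fnmatchcase(path, pattern)
            for path in changed_files
            for pattern in patterns
        )
        for name, patterns in filters.items()
    }


if __name__ == "__main__":
    print(filter_outputs(["audio_separator/separator/__init__.py"]))
    print(filter_outputs(["README.md"]))
```

A change to any file under `audio_separator/` sets the output to true, which is what the trivial comment relies on.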

Test plan

  • ensemble-presets, core-models, stems-and-quality all dispatch on freshly-created ephemeral n1+T4 VMs
  • Each job completes within ~15 min and verifies pre-cached models load from /opt/audio-separator-models
  • No long-lived ephemeral VMs after completion (orphan-cleanup tick within 30 min)
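The last check could be automated along the lines of the sketch below, which flags VMs that have outlived a 30-minute window. This is an assumption-laden illustration: the real orphan-cleanup criteria are not described in this PR, and `find_orphans` plus the `(name, created_at)` input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a VM older than the cleanup window counts as an orphan.
MAX_AGE = timedelta(minutes=30)


def find_orphans(vms, now, max_age=MAX_AGE):
    """Return names of VMs whose age exceeds max_age.

    vms is an iterable of (name, created_at) pairs with timezone-aware datetimes.
    """
    return [name for name, created_at in vms if now - created_at > max_age]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        ("gpu-runner-a", now - timedelta(minutes=45)),
        ("gpu-runner-b", now - timedelta(minutes=5)),
    ]
    print(find_orphans(sample, now))
```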

@coderabbitai ignore

🤖 Generated with Claude Code

Verifies the n1+T4 ephemeral runner path after the 2026-05-17 dispatcher
e2 fix (karaoke-gen#776). The GPU family was not affected by that bug
(n1 supports onHostMaintenance=TERMINATE, required for attached GPUs),
but has not been exercised under the ephemeral dispatcher since cutover.

Trivial comment in `audio_separator/separator/__init__.py` flips the
`audio_separator/**` path filter so the 3-job integration test suite
(ensemble-presets, core-models, stems-and-quality) runs on freshly
created GPU VMs. Safe to remove on next edit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@beveradb
Collaborator Author

Closing. This PR was used purely as a verification trigger for the ephemeral GPU runner path after nomadkaraoke/karaoke-gen#776. The rerun surfaced a known issue: the NVIDIA driver kernel module fails to load on fresh ephemeral GPU VMs (nvidia-persistenced fails because /dev/nvidia* don't exist). The root cause is the same as the legacy kernel-header mismatch, but the ephemeral image bakes the driver install in at image-build time instead of running it at every VM start. This will be addressed in a separate follow-up PR that adds a systemd boot service to ensure the driver loads via DKMS on every boot. The trivial trigger comment can be cleaned up with git revert if desired.
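For anyone reproducing the symptom, a minimal check is whether any `/dev/nvidia*` device nodes exist after boot. The sketch below is illustrative only: the function names are hypothetical, and it tests just for the presence of device nodes, not whether the driver is otherwise healthy.

```python
import glob
import os


def nvidia_device_nodes(dev_dir="/dev"):
    """List device nodes matching <dev_dir>/nvidia*, sorted for stable output."""
    return sorted(glob.glob(os.path.join(dev_dir, "nvidia*")))


def driver_nodes_present(dev_dir="/dev"):
    """True if at least one nvidia* node exists; False mirrors the failure described above."""
    return bool(nvidia_device_nodes(dev_dir))


if __name__ == "__main__":
    if driver_nodes_present():
        print("NVIDIA device nodes found:", nvidia_device_nodes())
    else:
        print("No /dev/nvidia* nodes; the kernel module is probably not loaded")
```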

@beveradb beveradb closed this May 18, 2026
beveradb added a commit to nomadkaraoke/karaoke-gen that referenced this pull request May 18, 2026
…meral cutover) (#776)

## Summary

Fixes three cutover bugs that left the ephemeral runner dispatcher
silently broken since 2026-05-17T04:56Z:

1. **`on_host_maintenance="TERMINATE"` set unconditionally** — e2
machine types reject TERMINATE unless preemptible, so 100% of
general/build VM creates failed. Fix: only set TERMINATE on GPU (n1)
families where it's required because attached GPUs can't live-migrate. 3
regression tests added in `test_ephemeral.py`.
2. **GPU `disk_size_gb=150` but image is 200GB** — GCE rejects boot
disks smaller than the source image. Fix: raise GPU disk to 200GB.
3. **Runner user has no passwordless sudo** — workflow steps like `sudo
apt-get install -y google-cloud-cli-firestore-emulator`
(backend-emulator-tests) and `sudo apt-get install -y ffmpeg`
(package-*, GPU integration tests) failed with "a terminal is required
to read the password". The legacy GHA runner VMs had NOPASSWD sudo. Fix:
add `/etc/sudoers.d/runner` in the image provision script.
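
Fixes 1 and 2 can be sketched as two small pure functions. This is a minimal illustration, not the dispatcher's actual code: `scheduling_for`, `boot_disk_size_gb`, and the `GPU_FAMILIES` set are hypothetical names, and the assumption that every n1 runner carries a GPU is taken from the description above.

```python
# Assumption: n1 is the only GPU-carrying family the dispatcher launches.
GPU_FAMILIES = {"n1"}


def machine_family(machine_type: str) -> str:
    """Extract the family prefix, e.g. 'n1' from 'n1-standard-4'."""
    return machine_type.split("-", 1)[0]


def scheduling_for(machine_type: str) -> dict:
    """Set on_host_maintenance=TERMINATE only for GPU families; omit it otherwise."""
    if machine_family(machine_type) in GPU_FAMILIES:
        return {"on_host_maintenance": "TERMINATE"}
    return {}


def boot_disk_size_gb(requested_gb: int, image_size_gb: int) -> int:
    """Never request a boot disk smaller than the source image."""
    return max(requested_gb, image_size_gb)


if __name__ == "__main__":
    print(scheduling_for("n1-standard-4"))
    print(scheduling_for("e2-standard-4"))
    print(boot_disk_size_gb(150, 200))
```

Omitting the key entirely for non-GPU families, rather than setting another value, leaves the platform default in place.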

Fixes 1 + 2 are dispatcher code (already applied via `pulumi up --target
...:runner-manager-source ...:runner-manager-function` from this
branch). Fix 3 requires a new image build (triggered:
build-runner-images.yml on this branch).

## Why the trivial file touches

`backend/main.py` and `karaoke_gen/__init__.py` carry one-line comments
to flip the `backend` and `package` `dorny/paths-filter` outputs so this
PR exercises **all five PR-triggered self-hosted jobs** end-to-end on
ephemeral runners. Verification PR (paired):
nomadkaraoke/python-audio-separator#286.

## Test plan

- [x] `pytest test_ephemeral.py` — 26 passed locally (3 new scheduling
tests)
- [x] Pulumi applied locally (both iterations, function update verified)
- [x] First ephemeral general VM create succeeded (4 RUNNING; previously
100% failure)
- [ ] Image rebuild completes for general+build+gpu with new sudo
provision
- [ ] PR CI passes with all 5 self-hosted jobs running on the new image
- [ ] python-audio-separator#286 GPU integration tests pass on the new
GPU image
- [ ] After merge: `deploy-backend` succeeds on ephemeral build runner;
prod /api/health/detailed reports new version
- [ ] Phase 4 decommission of 7 legacy VMs+disks (separate PR, ~$220/mo
saving)

## Context

- `karaoke-gen/docs/EPHEMERAL-GHA-RUNNERS.md`
- `docs/archive/2026-05-16-ephemeral-gha-runners-plan.md` (workspace)
- Memory: `project_cost_sprint_may2026.md`

@coderabbitai ignore

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>