ci: verify ephemeral GPU runner end-to-end #287

Closed
beveradb wants to merge 1 commit into main from feat/sess-20260518-0431-verify-ephemeral-gpu

@beveradb
Collaborator

Summary

End-to-end verification for karaoke-gen PR #781, the GPU image NVIDIA driver-load fix.

This PR is a single-line touch on audio_separator/separator/__init__.py whose only purpose is to trigger run-integration-tests.yaml's three self-hosted GPU jobs (ensemble-presets, core-models, stems-and-quality).

Expected behaviour

For each of the three jobs, the ephemeral GHA runner dispatcher path should go as follows:

  1. Create a fresh GCE VM from the gha-runner-gpu image family (latest is gha-runner-gpu-20260518-035713).
  2. Create it with Secure Boot OFF (per the enable_secure_boot=not family.has_gpu change in #781).
  3. On first boot, gha-gpu-modprobe.service rebuilds nvidia.ko via DKMS against the running kernel (so kernel skew between image build and runtime is now handled), loads nvidia, nvidia_uvm, and nvidia_drm with modprobe, and verifies the result with nvidia-smi.
  4. The "Verify GPU availability" step (nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv,noheader) succeeds.
  5. The actual integration tests run.
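
As a rough illustration of step 4, the query output can be parsed and checked for the expected GPU. This is a minimal sketch, not the workflow's actual step: the `GpuInfo`, `parse_gpu_lines`, and `verify_gpu` names are hypothetical, and the memory figure in the docstring is only an example of the output shape. The only thing taken from this PR is the `nvidia-smi` command line itself.

```python
import csv
import subprocess
from dataclasses import dataclass

# The exact query used by the "Verify GPU availability" step.
QUERY = [
    "nvidia-smi",
    "--query-gpu=driver_version,name,memory.total",
    "--format=csv,noheader",
]


@dataclass(frozen=True)
class GpuInfo:
    """One row of the query output (hypothetical helper type)."""
    driver_version: str
    name: str
    memory_total: str


def parse_gpu_lines(output: str) -> list[GpuInfo]:
    """Parse csv,noheader output such as '595.71.05, Tesla T4, 15360 MiB'."""
    lines = [line for line in output.splitlines() if line.strip()]
    gpus = []
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 3:
            raise ValueError(f"unexpected nvidia-smi row: {row!r}")
        gpus.append(GpuInfo(*(field.strip() for field in row)))
    return gpus


def verify_gpu(expected_name: str = "Tesla T4") -> list[GpuInfo]:
    """Run the query and fail loudly if the expected GPU is not reported."""
    # check=True raises CalledProcessError if nvidia-smi exits non-zero,
    # which is what happens when the driver did not load.
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    gpus = parse_gpu_lines(result.stdout)
    if not any(expected_name in gpu.name for gpu in gpus):
        raise RuntimeError(f"{expected_name} not found in: {gpus!r}")
    return gpus
```

Keeping the parsing separate from the subprocess call means the parser can be exercised on a machine without a GPU.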

Smoke result before this PR

A smoke VM created by hand from gha-runner-gpu-20260518-035713 (SB off) booted successfully, and nvidia-smi reported Tesla T4 / driver 595.71.05 / CUDA 13.2 both as root and as the runner user. Total boot-to-GPU-ready time: ~2 minutes.

Test plan

  • All three self-hosted GPU jobs pass (ensemble-presets, core-models, stems-and-quality)
  • Each job's "Verify GPU availability" step output shows the T4
  • No "Failed to query NVIDIA devices" or "Key was rejected by service" entries in the runner logs
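
The last check in the test plan could be automated along these lines. This is a sketch under assumptions: `find_failures` is a hypothetical helper, and how the runner logs are collected is not specified in this PR, so the function simply takes an iterable of lines. The two marker strings are the ones named in the test plan.

```python
from collections.abc import Iterable

# Failure signatures named in the test plan above.
FAILURE_MARKERS = (
    "Failed to query NVIDIA devices",
    "Key was rejected by service",
)


def find_failures(log_lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (1-based line number, marker) for every failure marker found."""
    hits = []
    for lineno, line in enumerate(log_lines, start=1):
        for marker in FAILURE_MARKERS:
            if marker in line:
                hits.append((lineno, marker))
    return hits
```

An empty result means neither signature appeared. With a log saved to a file, `find_failures(open(path, encoding="utf-8"))` would scan it line by line.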

@coderabbitai ignore

🤖 Generated with Claude Code

No-op touch on audio_separator/separator/__init__.py to fire the
run-integration-tests workflow's three self-hosted GPU jobs
(ensemble-presets, core-models, stems-and-quality).

This is the end-to-end verification step for karaoke-gen PR #781
(GPU image NVIDIA driver-load fix + secure-boot disabled for GPU VMs).
All three jobs should land on fresh ephemeral GPU VMs created by the
updated dispatcher, the gha-gpu-modprobe.service should rebuild the
NVIDIA module against the running kernel on first boot, and nvidia-smi
should report Tesla T4 at the "Verify GPU availability" step.

Smoke-test of the new image (`gha-runner-gpu-20260518-035713`) on
2026-05-18 already showed end-to-end success — this PR exercises it
through the real dispatcher path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@beveradb
Collaborator Author

All 3 GPU integration jobs green — verifies karaoke-gen PR #781 end-to-end. Closing per the same convention as #286 (verification trigger, not a real change).

@beveradb beveradb closed this May 18, 2026
@beveradb beveradb deleted the feat/sess-20260518-0431-verify-ephemeral-gpu branch May 18, 2026 04:50