Commit 88b55cb
CI: allow specifying custom driver versions in test matrix (#2176)
* CI: allow specifying custom driver versions in test matrix
Extends the DRIVER field in ci/test-matrix.yml beyond 'latest'/'earliest'
to accept an explicit version string (e.g. '580.65.06'). For Linux,
ci/tools/install_gpu_driver.sh (adapted from nv-gha-runners/vm-images
PR #256) swaps the driver in-job via nsenter when the row uses a custom
version; for Windows, ci/tools/install_gpu_driver.ps1 is split into
install + configure_driver_mode, with the install step gated on the
DRIVER value and the mode step always running.
The matrix row is routed to a 'latest' runner image when the DRIVER is
a custom version (the install scripts perform the swap themselves).
Container privileges on Linux (--privileged --pid=host) are added only
on rows with a custom DRIVER. Custom DRIVER + FLAVOR=wsl is rejected
eagerly in the compute-matrix step.
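The routing rules above can be sketched as a small shell helper. This is an illustration only: `resolve_driver` and its output format are hypothetical, and the version-string check is an assumption rather than what the compute-matrix step actually does.

```shell
# Hypothetical helper: decide how a matrix row's DRIVER value is handled.
# Prints "<runner-image> <install-mode>", or fails for unsupported rows.
resolve_driver() {
    local driver=$1
    local flavor=${2:-}
    case "$driver" in
        latest|earliest)
            # Stock rows: use the matching runner image, no in-job install.
            echo "$driver preinstalled"
            ;;
        ""|.*|*.|*[!0-9.]*)
            # Assumed validation: only digits and dots count as a version.
            echo "unrecognized DRIVER value: '$driver'" >&2
            return 1
            ;;
        *)
            # Custom version: reject WSL early, otherwise route to a
            # 'latest' image and let the install script swap the driver.
            if [ "$flavor" = "wsl" ]; then
                echo "custom DRIVER is not supported with FLAVOR=wsl" >&2
                return 1
            fi
            echo "latest custom"
            ;;
    esac
}
```

For example, `resolve_driver 580.65.06` prints `latest custom`, while `resolve_driver 580.65.06 wsl` fails with an error on stderr.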
Two existing nightly-numba-cuda rows exercise the new path:
- Linux amd64 / 13.3.0 / l4 -> 580.65.06
- Windows amd64 / 13.3.0 / l4 -> 610.47
Closes #293
Closes #1265
* CI: fix Linux driver nsenter re-exec, swap Windows version, enable ci.yml dispatch
- install_gpu_driver.sh: pipe the script body to the host-side bash via
stdin (bash -s < "$0") instead of re-execing "$0". The script lives
in the GH workspace mount (container-only), so the relative path
doesn't resolve after nsenter switches the mount namespace.
The < "$0" fd is opened before nsenter and survives the flip.
- test-matrix.yml: Windows nightly-numba-cuda row 610.47 -> 596.36
(610.47 isn't published on the CDN; install hit 404).
- ci.yml: add workflow_dispatch: trigger so the pipeline can be
re-run manually. The existing should-skip / detect-changes gates
already handle non-PR events.
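A minimal sketch of the stdin re-exec described for install_gpu_driver.sh above. The function names are hypothetical, and the nsenter flag set is an assumption (they are standard util-linux options, but the real script may use a different combination).

```shell
# host_exec stands in for entering the host's namespaces via PID 1.
# Assumed flags: target PID 1, then mount/UTS/IPC/net/PID namespaces.
host_exec() {
    nsenter -t 1 -m -u -i -n -p -- "$@"
}

# Run a script on the host by streaming its body over stdin.
# The calling shell opens "$script" before host_exec runs, so the path
# only has to resolve in the caller's mount namespace; the open file
# descriptor is inherited across the namespace switch.
run_script_on_host() {
    local script=$1
    shift
    host_exec bash -s -- "$@" < "$script"
}
```

Inside the script itself, the equivalent call would be `run_script_on_host "$0" "$@"`, guarded by some marker so the host-side copy does not re-enter.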
* CI: move 'Ensure GPU is working' after 'Install GPU driver' on Linux
So nvidia-smi validates the post-install driver state on custom-DRIVER
rows. Windows test-wheel + coverage already use Install -> Configure ->
Ensure; this brings the Linux test-wheel job into line.
* CI: flip two PR-matrix Linux rows to DRIVER=610.43.02
Exercises the custom-driver install path on every PR (not just nightly).
Both rows are amd64 / 13.3.0 / local-CTK, on l4 and rtxpro6000 -- both
in the 'open' kernel-module flavor (only Volta needs 'legacy').
* CI: restart nvidia-persistenced on Linux; poll nvidia-smi on Windows
Linux: After install_gpu_driver.sh stops nvidia-persistenced and the apt
purge removes the package, the .run installer reinstalls the systemd
service but leaves it stopped. cuda.core's test_persistence_mode_enabled
fails with NVML_ERROR_UNKNOWN on driver 610.43.02 when the daemon is
not running; explicitly start it again at the end of host_install().
Windows: configure_driver_mode.ps1's trailing 'Start-Sleep -Seconds 5'
is not enough on multi-GPU rows that are slow to come back up (observed:
2x H100 MCDM). Replace it with a poll-until-success loop on nvidia-smi
with a 60s deadline, matching the runner-team nvgha-driver.ps1 pattern.
Previously masked because every Windows row used to run the full
install pipeline; with custom-DRIVER plumbing, latest/earliest rows
skip the install and the cycle is no longer preceded by warm-up time.
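The Windows change itself is in PowerShell; the same poll-until-success pattern is sketched here in shell for illustration. `poll_until` is a hypothetical helper, and the 2-second interval in the usage line is an assumption (only the 60s deadline comes from the description above).

```shell
# Retry a command until it succeeds or the deadline passes.
# usage: poll_until <timeout-seconds> <interval-seconds> <command> [args...]
poll_until() {
    local timeout=$1
    local interval=$2
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    while :; do
        # Success ends the wait immediately instead of sleeping a fixed time.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Example: poll_until 60 2 nvidia-smi
```

Compared with a fixed sleep, this returns as soon as the command succeeds and fails loudly when the deadline is exceeded.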
* CI: re-enable persistence mode after Linux driver swap
Runner-latest L4 images come up with Persistence-M=On (set somewhere in
the runner team's image setup, not in cuda-python). Our .run install
leaves it Off, which breaks cuda.core's test_persistence_mode_enabled
on driver 610.43.02 -- the test sets device.is_persistence_mode_enabled
= False on a device that already reports False, and 610.43.02 returns
NVML_ERROR_UNKNOWN for that no-op set.
Restore the runner baseline by calling `nvidia-smi -pm 1` at the end of
host_install() (sets the kernel persistence flag directly via NVML).
Also daemon-reload + start nvidia-persistenced.service best-effort so
tools that look for the daemon find it; `set -x` around this trailing
block so the next run's log confirms which lines fired.
* CI: preserve SUID bit when refreshing container nvidia binaries
refresh_container_libs() used 'cp -f --remove-destination' (verbatim
from the runner team's nvgha-driver), which without -p/--preserve
strips the SUID/SGID bits on the destination. /usr/bin/nvidia-modprobe
ships 4755 and NVML's state-changing calls (e.g.
nvmlDeviceSetPersistenceMode) route through it; once SUID is gone the
container-side call returns NVML_ERROR_UNKNOWN, which is what cuda.core's
test_persistence_mode_enabled was hitting.
Add a stat diagnostic line at the end of refresh_container_libs() so
the next CI log records nvidia-modprobe's post-refresh mode.
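The cp behavior described above can be reproduced on a scratch file. This is a sketch assuming GNU coreutils (`cp --remove-destination`, `stat -c`); `suid_copy_demo` is a hypothetical name and the files are throwaway stand-ins, not the real nvidia-modprobe.

```shell
# Copy a setuid file with and without --preserve=mode and print the
# resulting permission bits of each copy.
suid_copy_demo() {
    local dir old_umask
    old_umask=$(umask)
    umask 022
    dir=$(mktemp -d) || return 1
    printf '#!/bin/sh\n' > "$dir/src"
    chmod 4755 "$dir/src"

    # Without --preserve, a newly created destination loses the
    # set-user-ID, set-group-ID, and sticky bits.
    cp -f --remove-destination "$dir/src" "$dir/plain"
    # With --preserve=mode, the full mode is carried over.
    cp -f --remove-destination --preserve=mode "$dir/src" "$dir/kept"

    echo "without --preserve: $(stat -c '%a' "$dir/plain")"
    echo "with --preserve=mode: $(stat -c '%a' "$dir/kept")"
    rm -rf "$dir"
    umask "$old_umask"
}
```

Under these assumptions the first copy reports 755 and the second 4755.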
* CI: exec nvidia-persistenced directly after Linux driver swap
The `--silent --no-questions` .run installer drops /usr/bin/nvidia-
persistenced but does not reliably install a usable systemd unit, so
`systemctl start nvidia-persistenced.service` was a no-op (verified
in CI logs: `+ true` after the start). With the daemon down, the
/run/nvidia-persistenced/socket bind-mounted into the test container
is stale, and NVML state-changing calls (e.g.
nvmlDeviceSetPersistenceMode) made by root inside the container
return NVML_ERROR_UNKNOWN -- which is what cuda.core's
test_persistence_mode_enabled has been failing on.
Verified on ComputeLab with the same driver (610.43.02), same GPU
arch (Ada L40S), root in container: with the daemon up, the SET call
returns NVML_SUCCESS; with the daemon down it returns UnknownError.
Fix: exec /usr/bin/nvidia-persistenced directly. The binary
self-daemonizes and creates the socket on its own. (Same latent gap
exists in nv-gha-runners/vm-images' nvgha-driver; will flag upstream.)
* CI: pass --user root to nvidia-persistenced after Linux driver swap
nvidia-persistenced defaults to `--user nvidia-persistenced`, which
our apt-purge of `nvidia-compute-utils-*` removed. Without that user
the daemon's setuid(3) post-fork fails and the process exits silently
-- the `nvidia-smi -pm 1` right after sees Persistence-M briefly On
(daemon held it), then it flips back to Off (daemon gone), and the
test container's NVML SET call later returns NVML_ERROR_UNKNOWN.
Pass --user root so the daemon doesn't depend on a user account that
the purge deleted. Also add a `pgrep nvidia-persistenced` + `ls -la
/run/nvidia-persistenced/` diagnostic so the next CI log proves the
daemon is alive when the test starts.
* CI: add fast-feedback probe-driver-swap job (workflow_dispatch only)
Allocates one L4 GPU + privileged container, runs install_gpu_driver.sh
with DRIVER=610.43.02, then drives nvmlDeviceSetPersistenceMode via
raw ctypes -- the exact NVML call that cuda.core's
test_persistence_mode_enabled exercises. Exits 1 on
NVML_ERROR_UNKNOWN so the smoke test fails loudly when the install
path leaves the daemon dead.
Total runtime ~5 min vs ~30 min for the full test matrix.
Triggered by workflow_dispatch only -- this is an opt-in debugging
job, not regular PR or nightly traffic.
* CI: drop workflow_dispatch gate on probe-driver-swap so it runs on every PR
* CI: stop refresh_container_libs from clobbering /run/nvidia-persistenced
refresh_container_libs() walks /proc/self/mountinfo for entries
containing 'nvidia' or 'libcuda'. /run/nvidia-persistenced/socket
matches that pattern and was being umount'd + cp'd over -- which
breaks the container's view of the daemon's IPC socket (the
container ends up with a 0-link unlinked socket inode instead of
the live host one). Without a working socket, NVML state-changing
calls inside the container return NVML_ERROR_UNKNOWN -- which is
exactly what cuda.core's test_persistence_mode_enabled was hitting.
Restrict the refresh to /usr/(bin|lib) so it only touches the
actual binaries + shared libraries that change version with the
driver swap. /dev/nvidia*, /proc/driver/nvidia, /run/nvidia-*,
/tmp/nvidia-mps are all left as the toolkit set them up.
Same latent gap exists in nv-gha-runners/vm-images' nvgha-driver;
their CUDA-runtime validation workload never queries the daemon
socket so they haven't surfaced it.
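One way the restricted filter could look, as a sketch only: `select_refresh_targets` is a hypothetical name, and the exact patterns are assumptions derived from the description above rather than the script's real code. Field 5 of each mountinfo line is the mount point.

```shell
# Read mountinfo-formatted lines on stdin and print only the mount points
# that are driver binaries or libraries under /usr/bin or /usr/lib*.
# Sockets, device nodes, and /proc entries are deliberately left alone.
select_refresh_targets() {
    awk '$5 ~ /^\/usr\/(bin|lib)/ && $5 ~ /nvidia|libcuda/ { print $5 }'
}

# Example: select_refresh_targets < /proc/self/mountinfo
```

With this filter, an entry such as /run/nvidia-persistenced/socket no longer matches even though its path contains 'nvidia'.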
* CI: take down nvidia-persistenced via pkill, not systemctl
The packaged nvidia-persistenced.service has
`RuntimeDirectory=nvidia-persistenced`, which makes systemd `unlink()`
/run/nvidia-persistenced/ when the unit stops. The container has that
directory bind-mounted from the host as of container-start time. When
systemd removes the inode and our subsequent
`/usr/bin/nvidia-persistenced --user root` call re-creates it, the
container's bind mount is stranded on the deleted inode -- its
/run/nvidia-persistenced/socket shows up with link count 0 and NVML
state-changing calls return NVML_ERROR_UNKNOWN.
`pkill -TERM nvidia-persistenced` sends SIGTERM directly to the
daemon, which exits cleanly without involving systemd's
RuntimeDirectory cleanup. The host dir keeps its inode across the
swap; the container's bind mount stays valid; the new daemon's
socket is visible to in-container NVML clients.
* CI: re-bind /run/nvidia-persistenced into container after driver swap
The container's bind mount of /run/nvidia-persistenced/ is taken at
container-start time and pinned to the host directory's then-current
inode. Across the install the host directory gets recreated under a
fresh inode (the daemon's shutdown + restart cycle replaces it), and
the container is stranded on the deleted inode -- socket file shows
up with link count 0 inside the container, NVML state-changing calls
return NVML_ERROR_UNKNOWN.
After refresh_container_libs, umount the stale bind, mkdir the local
mount point if missing, and re-bind from /proc/1/root/run/nvidia-
persistenced (the host's current view via the privileged container's
host-pid-ns access). CAP_SYS_ADMIN required, which custom-DRIVER rows
already grant via --privileged --pid=host.
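The re-bind sequence above can be sketched as follows. The helper names and the `DRY_RUN` switch are hypothetical additions for illustration; the real script presumably runs the commands directly, and doing so requires CAP_SYS_ADMIN.

```shell
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Re-attach the container's mount point to the host's current directory.
# With --pid=host, /proc/1/root is the host's root filesystem, so the
# source path resolves to whatever directory exists on the host right now.
rebind_persistenced_dir() {
    local target=${1:-/run/nvidia-persistenced}
    local source_dir="/proc/1/root$target"
    # Drop the stale bind; ignore failure if nothing is mounted there.
    run umount "$target" || true
    run mkdir -p "$target"
    run mount --bind "$source_dir" "$target"
}
```

Running it with `DRY_RUN=1` prints the three commands in order, which is a cheap way to review the sequence without privileges.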
* CI: drop install_gpu_driver.sh experiments that turned out non-load-bearing
- Revert `pkill -TERM nvidia-persistenced` to `systemctl stop`; pkill
alone didn't prevent the host dir's inode from flipping. The re-bind
of /run/nvidia-persistenced/ is what restores the container's view.
- Drop `nvidia-smi -pm 1`; the test exercises NVML's set call, which
succeeds once the daemon socket is reachable regardless of current
Persistence-M state.
- Trim `set -x` blocks and `pgrep`/`ls -la`/`stat` diagnostics that
served their purpose during debugging.
Keeps the load-bearing changes (nsenter bash -s, /usr/(bin|lib)
refresh filter, exec nvidia-persistenced --user root, the
/run/nvidia-persistenced re-bind, cp --preserve=mode) and brings the
diff against Justin's nvgha-driver back down to the strict minimum.
* Revert: remove the probe-driver-swap fast-feedback job
Added in a3f1573 for fast iteration on install_gpu_driver.sh; no
longer needed now that the script has stabilized.
* CI: address Mike's review comments on PR 2176
- ci.yml: `workflow_dispatch:` -> `workflow_dispatch: {}` so the empty
mapping reads as intentional rather than ambiguous YAML.
- test-wheel-linux.yml: declare `util-linux` in `Install dependencies`
instead of running a second apt-get inline; util-linux ships in
ubuntu:22.04 by default so this is mostly belt-and-suspenders, but
it removes the redundant apt-get call.
- install_gpu_driver.sh: drop `2>/dev/null` on `systemctl stop` so
real errors surface (`|| true` keeps the script non-fatal). The
redirect was inherited verbatim from nv-gha-runners/vm-images PR 256
with no specific need.
1 parent 788bbad commit 88b55cb
8 files changed
Lines changed: 359 additions & 51 deletions