Commit 88b55cb

CI: allow specifying custom driver versions in test matrix (#2176)
* CI: allow specifying custom driver versions in test matrix

Extends the DRIVER field in ci/test-matrix.yml beyond 'latest'/'earliest' to accept an explicit version string (e.g. '580.65.06'). For Linux, ci/tools/install_gpu_driver.sh (adapted from nv-gha-runners/vm-images PR #256) swaps the driver in-job via nsenter when the row uses a custom version; for Windows, ci/tools/install_gpu_driver.ps1 is split into install + configure_driver_mode, with the install step gated on the DRIVER value and the mode step always running.

The matrix row is routed to a 'latest' runner image when the DRIVER is a custom version (the install scripts perform the swap themselves). Container privileges on Linux (--privileged --pid=host) are added only on rows with a custom DRIVER. Custom DRIVER + FLAVOR=wsl is rejected eagerly in the compute-matrix step.

Two existing nightly-numba-cuda rows exercise the new path:
- Linux amd64 / 13.3.0 / l4 -> 580.65.06
- Windows amd64 / 13.3.0 / l4 -> 610.47

Closes #293
Closes #1265

* CI: fix Linux driver nsenter re-exec, swap Windows version, enable ci.yml dispatch

- install_gpu_driver.sh: pipe the script body to the host-side bash via stdin (bash -s < "$0") instead of re-execing "$0". The script lives in the GH workspace mount (container-only), so the relative path doesn't resolve after nsenter switches the mount namespace. The < "$0" fd is opened before nsenter and survives the flip.
- test-matrix.yml: Windows nightly-numba-cuda row 610.47 -> 596.36 (610.47 isn't published on the CDN; install hit 404).
- ci.yml: add workflow_dispatch: trigger so the pipeline can be re-run manually. The existing should-skip / detect-changes gates already handle non-PR events.

* CI: move 'Ensure GPU is working' after 'Install GPU driver' on Linux

So nvidia-smi validates the post-install driver state on custom-DRIVER rows. Windows test-wheel + coverage already use Install -> Configure -> Ensure; this brings the Linux test-wheel job into line.

* CI: flip two PR-matrix Linux rows to DRIVER=610.43.02

Exercises the custom-driver install path on every PR (not just nightly). Both rows are amd64 / 13.3.0 / local-CTK, on l4 and rtxpro6000 -- both in the 'open' kernel-module flavor (only Volta needs 'legacy').

* CI: restart nvidia-persistenced on Linux; poll nvidia-smi on Windows

Linux: After install_gpu_driver.sh stops nvidia-persistenced and the apt purge removes the package, the .run installer reinstalls the systemd service but leaves it stopped. cuda.core's test_persistence_mode_enabled fails with NVML_ERROR_UNKNOWN on driver 610.43.02 when the daemon is not running; explicitly start it again at the end of host_install().

Windows: configure_driver_mode.ps1's trailing 'Start-Sleep -Seconds 5' is not enough on slower-coming-back-up multi-GPU rows (observed: 2x H100 MCDM). Replace it with a poll-until-success loop on nvidia-smi with a 60s deadline, matching the runner-team nvgha-driver.ps1 pattern. Previously masked because every Windows row used to run the full install pipeline; with custom-DRIVER plumbing, latest/earliest rows skip the install and the cycle is no longer preceded by warm-up time.

* CI: re-enable persistence mode after Linux driver swap

Runner-latest L4 images come up with Persistence-M=On (set somewhere in the runner team's image setup, not in cuda-python). Our .run install leaves it Off, which breaks cuda.core's test_persistence_mode_enabled on driver 610.43.02 -- the test calls device.is_persistence_mode_enabled = False on a device that already reports False, and 610.43.02 returns NVML_ERROR_UNKNOWN for that no-op set.

Restore the runner baseline by calling `nvidia-smi -pm 1` at the end of host_install() (sets the kernel persistence flag directly via NVML). Also daemon-reload + start nvidia-persistenced.service best-effort so tools that look for the daemon find it; `set -x` around this trailing block so the next run's log confirms which lines fired.

* CI: preserve SUID bit when refreshing container nvidia binaries

refresh_container_libs() used 'cp -f --remove-destination' (verbatim from the runner team's nvgha-driver), which without -p/--preserve strips the SUID/SGID bits on the destination. /usr/bin/nvidia-modprobe ships 4755 and NVML's state-changing calls (e.g. nvmlDeviceSetPersistenceMode) route through it; once SUID is gone the container-side call returns NVML_ERROR_UNKNOWN, which is what cuda.core's test_persistence_mode_enabled was hitting.

Add a stat diagnostic line at the end of refresh_container_libs() so the next CI log records nvidia-modprobe's post-refresh mode.

* CI: exec nvidia-persistenced directly after Linux driver swap

The `--silent --no-questions` .run installer drops /usr/bin/nvidia-persistenced but does not reliably install a usable systemd unit, so `systemctl start nvidia-persistenced.service` was a no-op (verified in CI logs: `+ true` after the start). With the daemon down, the /run/nvidia-persistenced/socket bind-mounted into the test container is stale, and NVML state-changing calls (e.g. nvmlDeviceSetPersistenceMode) made by root inside the container return NVML_ERROR_UNKNOWN -- which is what cuda.core's test_persistence_mode_enabled has been failing on.

Verified on ComputeLab with the same driver (610.43.02), same GPU arch (Ada L40S), root in container: with the daemon up, the SET call returns NVML_SUCCESS; with the daemon down it returns UnknownError.

Fix: exec /usr/bin/nvidia-persistenced directly. The binary self-daemonizes and creates the socket on its own. (Same latent gap exists in nv-gha-runners/vm-images' nvgha-driver; will flag upstream.)

* CI: pass --user root to nvidia-persistenced after Linux driver swap

nvidia-persistenced defaults to `--user nvidia-persistenced`, which our apt-purge of `nvidia-compute-utils-*` removed. Without that user the daemon's setuid(3) post-fork fails and the process exits silently -- the `nvidia-smi -pm 1` right after sees Persistence-M briefly On (daemon held it), then it flips back to Off (daemon gone), and the test container's NVML SET call later returns NVML_ERROR_UNKNOWN.

Pass --user root so the daemon doesn't depend on a user account that the purge deleted. Also add a `pgrep nvidia-persistenced` + `ls -la /run/nvidia-persistenced/` diagnostic so the next CI log proves the daemon is alive when the test starts.

* CI: add fast-feedback probe-driver-swap job (workflow_dispatch only)

Allocates one L4 GPU + privileged container, runs install_gpu_driver.sh with DRIVER=610.43.02, then drives nvmlDeviceSetPersistenceMode via raw ctypes -- the exact NVML call that cuda.core's test_persistence_mode_enabled exercises. Exits 1 on NVML_ERROR_UNKNOWN so the smoke test fails loudly when the install path leaves the daemon dead. Total runtime ~5 min vs ~30 min for the full test matrix.

Triggered by workflow_dispatch only -- this is an opt-in debugging job, not regular PR or nightly traffic.

* CI: drop workflow_dispatch gate on probe-driver-swap so it runs on every PR

* CI: stop refresh_container_libs from clobbering /run/nvidia-persistenced

refresh_container_libs() walks /proc/self/mountinfo for entries containing 'nvidia' or 'libcuda'. /run/nvidia-persistenced/socket matches that pattern and was being umount'd + cp'd over -- which breaks the container's view of the daemon's IPC socket (the container ends up with a 0-link unlinked socket inode instead of the live host one). Without a working socket, NVML state-changing calls inside the container return NVML_ERROR_UNKNOWN -- which is exactly what cuda.core's test_persistence_mode_enabled was hitting.

Restrict the refresh to /usr/(bin|lib) so it only touches the actual binaries + shared libraries that change version with the driver swap. /dev/nvidia*, /proc/driver/nvidia, /run/nvidia-*, /tmp/nvidia-mps are all left as the toolkit set them up.

Same latent gap exists in nv-gha-runners/vm-images' nvgha-driver; their CUDA-runtime validation workload never queries the daemon socket so they haven't surfaced it.

* CI: take down nvidia-persistenced via pkill, not systemctl

The packaged nvidia-persistenced.service has `RuntimeDirectory=nvidia-persistenced`, which makes systemd `unlink()` /run/nvidia-persistenced/ when the unit stops. The container has that directory bind-mounted from the host as of container-start time. When systemd removes the inode and our subsequent `/usr/bin/nvidia-persistenced --user root` call re-creates it, the container's bind mount is stranded on the deleted inode -- its /run/nvidia-persistenced/socket shows up with link count 0 and NVML state-changing calls return NVML_ERROR_UNKNOWN.

`pkill -TERM nvidia-persistenced` sends SIGTERM directly to the daemon, which exits cleanly without involving systemd's RuntimeDirectory cleanup. The host dir keeps its inode across the swap; the container's bind mount stays valid; the new daemon's socket is visible to in-container NVML clients.

* CI: re-bind /run/nvidia-persistenced into container after driver swap

The container's bind mount of /run/nvidia-persistenced/ is taken at container-start time and pinned to the host directory's then-current inode. Across the install the host directory gets recreated under a fresh inode (the daemon's shutdown + restart cycle replaces it), and the container is stranded on the deleted inode -- socket file shows up with link count 0 inside the container, NVML state-changing calls return NVML_ERROR_UNKNOWN.

After refresh_container_libs, umount the stale bind, mkdir the local mount point if missing, and re-bind from /proc/1/root/run/nvidia-persistenced (the host's current view via the privileged container's host-pid-ns access). CAP_SYS_ADMIN required, which custom-DRIVER rows already grant via --privileged --pid=host.

* CI: drop install_gpu_driver.sh experiments that turned out non-load-bearing

- Revert `pkill -TERM nvidia-persistenced` to `systemctl stop`; pkill alone didn't prevent the host dir's inode from flipping, the re-bind of /run/nvidia-persistenced/ is what restores the container's view.
- Drop `nvidia-smi -pm 1`; the test exercises NVML's set call, which succeeds once the daemon socket is reachable regardless of current Persistence-M state.
- Trim `set -x` blocks and `pgrep`/`ls -la`/`stat` diagnostics that served their purpose during debugging.

Keeps the load-bearing changes (nsenter bash -s, /usr/(bin|lib) refresh filter, exec nvidia-persistenced --user root, the /run/nvidia-persistenced re-bind, cp --preserve=mode) and brings the diff against Justin's nvgha-driver back down to the strict minimum.

* Revert: remove the probe-driver-swap fast-feedback job

Added in a3f1573 for fast iteration on install_gpu_driver.sh; no longer needed now that the script has stabilized.

* CI: address Mike's review comments on PR 2176

- ci.yml: `workflow_dispatch:` -> `workflow_dispatch: {}` so the empty mapping reads as intentional rather than ambiguous YAML.
- test-wheel-linux.yml: declare `util-linux` in `Install dependencies` instead of running a second apt-get inline; util-linux ships in ubuntu:22.04 by default so this is mostly belt-and-suspenders, but it removes the redundant apt-get call.
- install_gpu_driver.sh: drop `2>/dev/null` on `systemctl stop` so real errors surface (`|| true` keeps the script non-fatal). The redirect was inherited verbatim from nv-gha-runners/vm-images PR 256 with no specific need.
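
The "link count 0" symptom from the re-bind commit above can be reproduced in miniature without a GPU or a container. The sketch below is only an analogy: it assumes that a file descriptor opened before a delete-and-recreate behaves like the container's bind mount in the one respect that matters here, namely that both stay pinned to the original inode. The file name `socket` and the helper `demo_stale_inode` are illustrative and do not appear in the CI scripts; the behavior shown is POSIX-specific.

```python
import os
import tempfile


def demo_stale_inode():
    """Show that a handle taken before delete+recreate stays on the old inode."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for /run/nvidia-persistenced/socket.
        path = os.path.join(tmp, "socket")

        # "Container start": take a handle on the path as it exists now.
        with open(path, "w") as f:
            f.write("old daemon")
        fd = os.open(path, os.O_RDONLY)
        try:
            # "Driver swap": the path is removed and recreated.
            os.unlink(path)
            with open(path, "w") as f:
                f.write("new daemon")

            pinned = os.fstat(fd)    # what the old handle still sees
            current = os.stat(path)  # what a fresh lookup sees
            return {
                "pinned_nlink": pinned.st_nlink,
                "same_inode": pinned.st_ino == current.st_ino,
                "pinned_content": os.read(fd, 64).decode(),
            }
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(demo_stale_inode())
```

The old handle keeps reading the old content and reports zero links, while a fresh lookup of the same path lands on a different inode. That is why re-acquiring the path (the re-bind) is the step that restores the container's view, rather than restarting the daemon alone.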
1 parent 788bbad commit 88b55cb

8 files changed

Lines changed: 359 additions & 51 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 0 deletions
Original file line number / Diff line number / Diff line change
@@ -24,6 +24,7 @@ on:
2424
schedule:
2525
# every 24 hours at midnight UTC
2626
- cron: "0 0 * * *"
27+
workflow_dispatch: {}
2728

2829
jobs:
2930
ci-vars:

.github/workflows/coverage.yml

Lines changed: 5 additions & 3 deletions
@@ -281,13 +281,15 @@ jobs:
281281
uses: nv-gha-runners/setup-proxy-cache@main
282282
continue-on-error: true
283283

284-
- name: Update driver
284+
# DRIVER above is 'latest' so install_gpu_driver.ps1 is intentionally
285+
# skipped (it errors on latest/earliest); configure_driver_mode.ps1
286+
# still runs to put the pre-installed driver into TCC mode.
287+
- name: Configure driver mode
285288
shell: powershell
286289
env:
287290
DRIVER_MODE: "TCC"
288-
GPU_TYPE: "a100"
289291
run: |
290-
ci/tools/install_gpu_driver.ps1
292+
ci/tools/configure_driver_mode.ps1
291293
292294
- name: Ensure GPU is working
293295
run: |

.github/workflows/test-wheel-linux.yml

Lines changed: 26 additions & 9 deletions
@@ -85,8 +85,13 @@ jobs:
8585
# Read base matrix from YAML file for the specific architecture
8686
TEST_MATRIX=$(yq -o json ".linux[\"${MATRIX_TYPE}\"] | map(select(.ARCH == \"${ARCH}\"))" ci/test-matrix.yml)
8787
88-
# Apply matrix filter and wrap in include structure
89-
MATRIX=$(echo "$TEST_MATRIX" | jq -c '${{ inputs.matrix_filter }} | if (. | length) > 0 then {include: .} else "Error: Empty matrix\n" | halt_error(1) end')
88+
# Apply matrix filter; reject custom DRIVER + FLAVOR=wsl (the
89+
# in-container driver swap doesn't work under WSL); add a
90+
# RUNNER_DRIVER field that maps any custom version back to
91+
# 'latest' (the install script swaps the driver itself, so we
92+
# need to land on the runner that ships with the most recent
93+
# pre-installed driver); wrap in include structure.
94+
MATRIX=$(echo "$TEST_MATRIX" | jq -c '${{ inputs.matrix_filter }} | if any(.[]; .DRIVER != "latest" and .DRIVER != "earliest" and .FLAVOR == "wsl") then "Error: custom DRIVER is not supported with FLAVOR=wsl\n" | halt_error(1) else . end | map(. + {RUNNER_DRIVER: (if .DRIVER == "latest" or .DRIVER == "earliest" then .DRIVER else "latest" end)}) | if (. | length) > 0 then {include: .} else "Error: Empty matrix\n" | halt_error(1) end')
9095
9196
echo "MATRIX=${MATRIX}" | tee --append "${GITHUB_OUTPUT}"
9297
@@ -101,23 +106,23 @@ jobs:
101106
strategy:
102107
fail-fast: false
103108
matrix: ${{ fromJSON(needs.compute-matrix.outputs.MATRIX) }}
104-
runs-on: "${{ matrix.FLAVOR || 'linux' }}-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.DRIVER }}-${{ matrix.GPU_COUNT }}"
109+
runs-on: "${{ matrix.FLAVOR || 'linux' }}-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.RUNNER_DRIVER }}-${{ matrix.GPU_COUNT }}"
105110
# TODO: remove continue-on-error once 3.15 is officially supported
106111
continue-on-error: ${{ startsWith(matrix.PY_VER, '3.15') }}
107112
# The build stage could fail but we want the CI to keep moving.
108113
if: ${{ github.repository_owner == 'nvidia' && !cancelled() }}
109114
# Our self-hosted runners require a container
110115
# TODO: use a different (nvidia?) container
111116
container:
112-
options: -u root --security-opt seccomp=unconfined --shm-size 16g
117+
# Custom-DRIVER rows need --privileged --pid=host so install_gpu_driver.sh
118+
# can nsenter to the host for the install + refresh the toolkit bind mounts
119+
# back inside the container. Stock options for latest/earliest rows.
120+
options: ${{ ((matrix.DRIVER == 'latest' || matrix.DRIVER == 'earliest') && '-u root --security-opt seccomp=unconfined --shm-size 16g') || '-u root --security-opt seccomp=unconfined --shm-size 16g --privileged --pid=host' }}
113121
image: ubuntu:22.04
114122
env:
115123
NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
116124
PIP_CACHE_DIR: "/tmp/pip-cache"
117125
steps:
118-
- name: Ensure GPU is working
119-
run: nvidia-smi
120-
121126
- name: Checkout ${{ github.event.repository.name }}
122127
uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
123128

@@ -129,10 +134,22 @@ jobs:
129134
uses: ./.github/actions/install_unix_deps
130135
continue-on-error: false
131136
with:
132-
# for artifact fetching, graphics libs, g++ required for cffi in example
133-
dependencies: "jq wget libgl1 libegl1 g++"
137+
# for artifact fetching, graphics libs, g++ required for cffi in
138+
# example; util-linux for `nsenter` (custom-DRIVER rows re-exec
139+
# install_gpu_driver.sh onto the host through nsenter)
140+
dependencies: "jq wget libgl1 libegl1 g++ util-linux"
134141
dependent_exes: "jq wget"
135142

143+
- name: Install GPU driver
144+
if: ${{ matrix.DRIVER != 'latest' && matrix.DRIVER != 'earliest' }}
145+
env:
146+
DRIVER: ${{ matrix.DRIVER }}
147+
GPU_TYPE: ${{ matrix.GPU }}
148+
run: ./ci/tools/install_gpu_driver.sh
149+
150+
- name: Ensure GPU is working
151+
run: nvidia-smi
152+
136153
- name: Set environment variables
137154
env:
138155
BUILD_CUDA_VER: ${{ inputs.build-ctk-ver }}
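
The routing done by the jq filter and the `runs-on` / `container.options` expressions in this workflow can be easier to follow as ordinary code. The following is a minimal Python sketch of the same decisions, not the workflow's implementation: the function names (`route_linux_matrix`, `runner_label`, `container_options`) are made up for illustration, and rows are assumed to be plain dicts shaped like the entries in ci/test-matrix.yml.

```python
PREINSTALLED = ("latest", "earliest")
BASE_OPTIONS = "-u root --security-opt seccomp=unconfined --shm-size 16g"


def is_custom(driver):
    """Anything other than 'latest'/'earliest' is treated as a custom version."""
    return driver not in PREINSTALLED


def route_linux_matrix(rows):
    """Mirror the jq filter: reject custom+wsl, add RUNNER_DRIVER, wrap in include."""
    if any(is_custom(r["DRIVER"]) and r.get("FLAVOR") == "wsl" for r in rows):
        raise ValueError("custom DRIVER is not supported with FLAVOR=wsl")
    if not rows:
        raise ValueError("Empty matrix")
    routed = []
    for r in rows:
        # Custom versions land on the 'latest' runner; the install script swaps the driver.
        runner_driver = "latest" if is_custom(r["DRIVER"]) else r["DRIVER"]
        routed.append({**r, "RUNNER_DRIVER": runner_driver})
    return {"include": routed}


def runner_label(row):
    """Build the runs-on label from RUNNER_DRIVER rather than DRIVER."""
    flavor = row.get("FLAVOR") or "linux"
    return "-".join(
        [flavor, row["ARCH"], "gpu", row["GPU"], row["RUNNER_DRIVER"], row["GPU_COUNT"]]
    )


def container_options(row):
    """Only custom-DRIVER rows get the extra privileges needed for nsenter."""
    if is_custom(row["DRIVER"]):
        return BASE_OPTIONS + " --privileged --pid=host"
    return BASE_OPTIONS


if __name__ == "__main__":
    rows = [{"ARCH": "amd64", "GPU": "l4", "GPU_COUNT": "1", "DRIVER": "610.43.02"}]
    row = route_linux_matrix(rows)["include"][0]
    print(runner_label(row))
    print(container_options(row))
```

Keeping `DRIVER` untouched and adding a separate `RUNNER_DRIVER` field means later steps can still read the requested version while the runner label only ever sees 'latest' or 'earliest'.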

.github/workflows/test-wheel-windows.yml

Lines changed: 15 additions & 5 deletions
@@ -81,8 +81,11 @@ jobs:
8181
# Read base matrix from YAML file for the specific architecture
8282
TEST_MATRIX=$(yq -o json ".windows[\"${MATRIX_TYPE}\"] | map(select(.ARCH == \"${ARCH}\"))" ci/test-matrix.yml)
8383
84-
# Apply matrix filter and wrap in include structure
85-
MATRIX=$(echo "$TEST_MATRIX" | jq -c '${{ inputs.matrix_filter }} | if (. | length) > 0 then {include: .} else "Error: Empty matrix\n" | halt_error(1) end')
84+
# Apply matrix filter; add a RUNNER_DRIVER field that maps any
85+
# custom DRIVER version back to 'latest' (install_gpu_driver.ps1
86+
# swaps the driver itself, so the runner must be the one that
87+
# ships the most recent pre-installed driver); wrap in include.
88+
MATRIX=$(echo "$TEST_MATRIX" | jq -c '${{ inputs.matrix_filter }} | map(. + {RUNNER_DRIVER: (if .DRIVER == "latest" or .DRIVER == "earliest" then .DRIVER else "latest" end)}) | if (. | length) > 0 then {include: .} else "Error: Empty matrix\n" | halt_error(1) end')
8689
8790
echo "MATRIX=${MATRIX}" | tee --append "${GITHUB_OUTPUT}"
8891
@@ -97,7 +100,7 @@ jobs:
97100
if: ${{ github.repository_owner == 'nvidia' && !cancelled() }}
98101
# TODO: remove continue-on-error once 3.15 is officially supported
99102
continue-on-error: ${{ startsWith(matrix.PY_VER, '3.15') }}
100-
runs-on: "windows-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.DRIVER }}-${{ matrix.GPU_COUNT }}"
103+
runs-on: "windows-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.RUNNER_DRIVER }}-${{ matrix.GPU_COUNT }}"
101104
steps:
102105
- name: Checkout ${{ github.event.repository.name }}
103106
uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
@@ -108,13 +111,20 @@ jobs:
108111
with:
109112
enable-apt: true
110113

111-
- name: Update driver
114+
- name: Install GPU driver
115+
if: ${{ matrix.DRIVER != 'latest' && matrix.DRIVER != 'earliest' }}
112116
env:
113-
DRIVER_MODE: ${{ matrix.DRIVER_MODE }}
117+
DRIVER: ${{ matrix.DRIVER }}
114118
GPU_TYPE: ${{ matrix.GPU }}
115119
run: |
116120
ci/tools/install_gpu_driver.ps1
117121
122+
- name: Configure driver mode
123+
env:
124+
DRIVER_MODE: ${{ matrix.DRIVER_MODE }}
125+
run: |
126+
ci/tools/configure_driver_mode.ps1
127+
118128
- name: Ensure GPU is working
119129
run: |
120130
nvidia-smi
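
The step ordering that this change establishes on both platforms can be summarized in a few lines. This is a hedged sketch of the gating only, using the step names from the workflows; the helper `planned_gpu_steps` is hypothetical and nothing in the repository calls it.

```python
def planned_gpu_steps(platform, driver):
    """Return the GPU-related steps a job would run, in order.

    platform: "linux" or "windows" (assumed labels for this sketch).
    driver:   the row's DRIVER value from the test matrix.
    """
    custom = driver not in ("latest", "earliest")
    steps = []
    if custom:
        # Gated on: DRIVER != 'latest' && DRIVER != 'earliest'
        steps.append("Install GPU driver")
    if platform == "windows":
        # Always runs on Windows, whether or not the install step ran.
        steps.append("Configure driver mode")
    steps.append("Ensure GPU is working")
    return steps


if __name__ == "__main__":
    print(planned_gpu_steps("windows", "596.36"))
    print(planned_gpu_steps("linux", "latest"))
```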

ci/test-matrix.yml

Lines changed: 13 additions & 4 deletions
@@ -13,7 +13,16 @@
1313
# Windows entries also include DRIVER_MODE.
1414
#
1515
# Notes:
16+
# - DRIVER accepts:
17+
# * 'latest' - use the runner's pre-installed latest driver (no install step)
18+
# * 'earliest' - use the runner's pre-installed earliest driver (no install step)
19+
# * a version string (e.g. '580.65.06')
20+
# - install that version via ci/tools/install_gpu_driver.sh (Linux)
21+
# or ci/tools/install_gpu_driver.ps1 (Windows) at the start of the
22+
# job. The matrix row is routed to the 'latest' runner image (the
23+
# install scripts swap the driver themselves).
1624
# - DRIVER: 'earliest' does not work with CUDA 12.9.1
25+
# - DRIVER: a custom version is not supported with FLAVOR=wsl on Linux.
1726

1827
linux:
1928
pull-request:
@@ -29,10 +38,10 @@ linux:
2938
- { ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
3039
- { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'v100', GPU_COUNT: '1', DRIVER: 'latest' }
3140
- { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.0.2', LOCAL_CTK: '1', GPU: 'rtxpro6000', GPU_COUNT: '1', DRIVER: 'latest' }
32-
- { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'rtxpro6000', GPU_COUNT: '1', DRIVER: 'latest' }
41+
- { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'rtxpro6000', GPU_COUNT: '1', DRIVER: '610.43.02' }
3342
- { ARCH: 'amd64', PY_VER: '3.14', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 't4', GPU_COUNT: '1', DRIVER: 'latest' }
3443
- { ARCH: 'amd64', PY_VER: '3.14', CUDA_VER: '13.0.2', LOCAL_CTK: '1', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
35-
- { ARCH: 'amd64', PY_VER: '3.14', CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
44+
- { ARCH: 'amd64', PY_VER: '3.14', CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'l4', GPU_COUNT: '1', DRIVER: '610.43.02' }
3645
- { ARCH: 'amd64', PY_VER: '3.14t', CUDA_VER: '12.9.1', LOCAL_CTK: '1', GPU: 't4', GPU_COUNT: '1', DRIVER: 'latest' }
3746
- { ARCH: 'amd64', PY_VER: '3.14t', CUDA_VER: '13.0.2', LOCAL_CTK: '1', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
3847
- { ARCH: 'amd64', PY_VER: '3.14t', CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
@@ -77,7 +86,7 @@ linux:
7786
- { MODE: 'nightly-pytorch', ARCH: 'arm64', PY_VER: '3.12', CUDA_VER: '13.0.2', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', TORCH_VER: '2.9.1', TORCH_CUDA: 'cu130' }
7887
# nightly-numba-cuda
7988
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
80-
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
89+
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: '580.65.06' }
8190
- { MODE: 'nightly-numba-cuda', ARCH: 'arm64', PY_VER: '3.12', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
8291
- { MODE: 'nightly-numba-cuda', ARCH: 'arm64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
8392
# nightly-standard (arm64 l4×2 — nightly-only per runner team request)
@@ -116,4 +125,4 @@ windows:
116125
- { MODE: 'nightly-pytorch', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.0.2', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', DRIVER_MODE: 'TCC', TORCH_VER: '2.9.1', TORCH_CUDA: 'cu130' }
117126
# nightly-numba-cuda
118127
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', DRIVER_MODE: 'TCC' }
119-
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', DRIVER_MODE: 'TCC' }
128+
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: '596.36', DRIVER_MODE: 'TCC' }
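
The three kinds of DRIVER value described in the notes above could be checked before a matrix is used. The sketch below is one possible validator, under a stated assumption: the workflows themselves treat anything other than 'latest'/'earliest' as a custom version, whereas this sketch additionally assumes version strings are two or three dot-separated groups of digits (as in '596.36' and '580.65.06'). The names `classify_driver` and `custom_driver_rows` are illustrative.

```python
import re

# Assumption for this sketch: versions look like '596.36' or '580.65.06'.
_VERSION_RE = re.compile(r"^\d+(\.\d+){1,2}$")


def classify_driver(value):
    """Return 'preinstalled' or 'custom'; raise on anything unrecognized."""
    if value in ("latest", "earliest"):
        return "preinstalled"
    if _VERSION_RE.match(value):
        return "custom"
    raise ValueError(f"unrecognized DRIVER value: {value!r}")


def custom_driver_rows(rows):
    """Pick out the matrix rows that will trigger an in-job driver install."""
    return [r for r in rows if classify_driver(r["DRIVER"]) == "custom"]


if __name__ == "__main__":
    rows = [
        {"GPU": "l4", "DRIVER": "latest"},
        {"GPU": "l4", "DRIVER": "580.65.06"},
        {"GPU": "l4", "DRIVER": "596.36"},
    ]
    for row in custom_driver_rows(rows):
        print(row["GPU"], row["DRIVER"])
```

A check like this would catch a typo such as 'lastest' early, which the looser "not latest or earliest" rule would otherwise route to the install script as if it were a version.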

ci/tools/configure_driver_mode.ps1

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
1+
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
#
3+
# SPDX-License-Identifier: Apache-2.0
4+
#
5+
# configure_driver_mode.ps1 -- set the NVIDIA driver mode on a Windows CI
6+
# runner and cycle the display devices so the new mode takes effect
7+
# without rebooting. Always runs (whether or not install_gpu_driver.ps1
8+
# just ran). When install_gpu_driver.ps1 has run, this single device
9+
# cycle also activates the freshly-installed driver.
10+
#
11+
# Inputs (env):
12+
# DRIVER_MODE One of WDDM, TCC, MCDM.
13+
14+
function Set-DriverMode {
15+
16+
# Map matrix DRIVER_MODE to nvidia-smi -fdm code.
17+
# This assumes we have the prior knowledge on which GPU can use which mode.
18+
$driver_mode = $env:DRIVER_MODE
19+
if ($driver_mode -eq "WDDM") {
20+
Write-Output "Setting driver mode to WDDM..."
21+
nvidia-smi -fdm 0
22+
} elseif ($driver_mode -eq "TCC") {
23+
Write-Output "Setting driver mode to TCC..."
24+
nvidia-smi -fdm 1
25+
} elseif ($driver_mode -eq "MCDM") {
26+
Write-Output "Setting driver mode to MCDM..."
27+
nvidia-smi -fdm 2
28+
} else {
29+
Write-Output "Unknown driver mode: $driver_mode"
30+
exit 1
31+
}
32+
33+
# Only restart NVIDIA display adapters, not other display devices (e.g. QEMU VGA)
34+
$nvidia_devices = Get-PnpDevice -Class Display -FriendlyName "NVIDIA*"
35+
foreach ($device in $nvidia_devices) {
36+
Write-Output "Restarting device: $($device.FriendlyName) ($($device.InstanceId))"
37+
pnputil /disable-device "$($device.InstanceId)"
38+
pnputil /enable-device "$($device.InstanceId)"
39+
}
40+
41+
# Poll nvidia-smi until NVML can initialize, or give up after ~60s.
42+
# A fixed sleep is not enough on slower-coming-back-up multi-GPU rows
43+
# (e.g. 2x H100 MCDM) where pnputil enable returns before NVML is
44+
# ready. Pattern borrowed from the runner-team `nvgha-driver.ps1`.
45+
Write-Output "Waiting for nvidia-smi/NVML to come back up after device cycle..."
46+
$deadline = (Get-Date).AddSeconds(60)
47+
do {
48+
Start-Sleep -Seconds 2
49+
& nvidia-smi.exe 2>&1 | Out-Null
50+
} while ($LASTEXITCODE -ne 0 -and (Get-Date) -lt $deadline)
51+
if ($LASTEXITCODE -ne 0) {
52+
Write-Error "nvidia-smi did not return cleanly within 60s of the device cycle"
53+
exit 1
54+
}
55+
}
56+
57+
# Run the functions
58+
Set-DriverMode
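
The poll-until-success loop at the end of `Set-DriverMode` follows a general pattern: sleep, probe, and stop on the first success or once a deadline has passed. The sketch below restates that pattern in Python for illustration. It is not part of the CI tooling; `wait_until_ready` and `nvidia_smi_ok` are hypothetical names, and the `sleep` and `clock` parameters are injected only so the loop can be exercised without real waiting.

```python
import subprocess
import time


def wait_until_ready(probe, timeout_s=60.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Sleep, then probe; repeat until the probe succeeds or the deadline passes.

    Returns True on success and False on timeout, like the do/while loop
    that checks $LASTEXITCODE against a 60-second deadline.
    """
    deadline = clock() + timeout_s
    while True:
        sleep(interval_s)
        if probe():
            return True
        if clock() >= deadline:
            return False


def nvidia_smi_ok():
    """Probe used on a real runner: True if nvidia-smi exits with status 0."""
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    ok = wait_until_ready(nvidia_smi_ok, timeout_s=10.0)
    print("GPU ready" if ok else "nvidia-smi did not return cleanly before the deadline")
```

Compared with a fixed sleep, the loop returns as soon as the probe succeeds on fast machines and still tolerates slow ones up to the deadline.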

ci/tools/install_gpu_driver.ps1

Lines changed: 21 additions & 30 deletions
@@ -1,13 +1,30 @@
11
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
#
33
# SPDX-License-Identifier: Apache-2.0
4+
#
5+
# install_gpu_driver.ps1 -- install a specific NVIDIA driver version on a
6+
# Windows CI runner. Driver-mode selection and the post-install device
7+
# power-cycle are the responsibility of configure_driver_mode.ps1, which
8+
# the workflow runs immediately after this script (or by itself when
9+
# DRIVER is 'latest'/'earliest' and the runner already brings up the
10+
# right driver).
11+
#
12+
# Inputs (env):
13+
# DRIVER Driver version, e.g. "610.47". Must NOT be 'latest' or
14+
# 'earliest' -- those are runner-pre-installed and the
15+
# workflow is expected to skip this script for them.
16+
# GPU_TYPE Lower-case GPU label from the matrix (e.g. "l4", "rtx4090").
17+
# Selects the data-center vs desktop installer variant.
418

519
# Install the driver
620
function Install-Driver {
721

8-
# Set the correct URL, filename, and arguments to the installer
9-
# This driver is picked to support Windows 11 & CUDA 13.0
10-
$version = '581.15'
22+
# Driver version is plumbed from the matrix via the DRIVER env var.
23+
$version = $env:DRIVER
24+
if (-not $version -or $version -eq 'latest' -or $version -eq 'earliest') {
25+
Write-Error "DRIVER env var must be a specific version string (e.g. '610.47'); got '$version'."
26+
exit 1
27+
}
1128

1229
# Get GPU type from environment variable
1330
$gpu_type = $env:GPU_TYPE
@@ -54,33 +71,7 @@ function Install-Driver {
5471
# Install the file with the specified path from earlier
5572
Write-Output 'Running the driver installer...'
5673
Start-Process -FilePath $filepath -ArgumentList $install_args -Wait
57-
Write-Output 'Done!'
58-
59-
# Handle driver mode configuration
60-
# This assumes we have the prior knowledge on which GPU can use which mode.
61-
$driver_mode = $env:DRIVER_MODE
62-
if ($driver_mode -eq "WDDM") {
63-
Write-Output "Setting driver mode to WDDM..."
64-
nvidia-smi -fdm 0
65-
} elseif ($driver_mode -eq "TCC") {
66-
Write-Output "Setting driver mode to TCC..."
67-
nvidia-smi -fdm 1
68-
} elseif ($driver_mode -eq "MCDM") {
69-
Write-Output "Setting driver mode to MCDM..."
70-
nvidia-smi -fdm 2
71-
} else {
72-
Write-Output "Unknown driver mode: $driver_mode"
73-
exit 1
74-
}
75-
# Only restart NVIDIA display adapters, not other display devices (e.g. QEMU VGA)
76-
$nvidia_devices = Get-PnpDevice -Class Display -FriendlyName "NVIDIA*"
77-
foreach ($device in $nvidia_devices) {
78-
Write-Output "Restarting device: $($device.FriendlyName) ($($device.InstanceId))"
79-
pnputil /disable-device "$($device.InstanceId)"
80-
pnputil /enable-device "$($device.InstanceId)"
81-
}
82-
# Give it a minute to settle:
83-
Start-Sleep -Seconds 5
74+
Write-Output 'Install complete; driver mode + device cycle handled by configure_driver_mode.ps1.'
8475
}
8576

8677
# Run the functions
