test(e2e): add GPU double-onboard E2E + migrate GPU jobs to NVKS runners (#2606) #2617
Conversation
…sistency (#2606)

Add `test-gpu-double-onboard.sh` that reproduces the exact scenario from issue #2553: re-onboard with `NEMOCLAW_PROVIDER=ollama` must not leave the proxy running with a different token than what is persisted to disk.

Phases:
1. First onboard via `install.sh --non-interactive` (Ollama provider)
2. Verify sandbox, proxy, token file, inference through sandbox
3. Second onboard (re-onboard) via `nemoclaw onboard --non-interactive`
4. Token consistency: persisted token must match running proxy (not 401)
5. Inference through sandbox must succeed after re-onboard

Migrate both `gpu-e2e` and the new `gpu-double-onboard-e2e` from the never-enabled self-hosted runner (sparky) to NVKS ephemeral GPU runners (`linux-amd64-gpu-l40g-latest-1`, L40G 48 GB VRAM). This:
- Removes the `GPU_E2E_ENABLED` gate (always runs now)
- Eliminates `needs: gpu-e2e` serialization (separate VMs)
- Drops timeout from 60 to 30 min (ephemeral = no stale state)
- Actually enables GPU E2E coverage in nightly for the first time

Closes #2606
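The token-consistency phase (step 4) boils down to replaying the persisted token against the running proxy and treating a 401 as divergence. A minimal sketch of that decision logic, where `token_check_verdict` and `PROXY_URL` are illustrative names, not the script's actual ones:

```shell
#!/usr/bin/env bash
# Sketch of the phase-4 check; the proxy URL is a stand-in, not the real one.
PROXY_URL="${PROXY_URL:-http://127.0.0.1:11435}"

# Map the proxy's HTTP status for the persisted token to a pass/fail verdict.
token_check_verdict() {
  case "$1" in
    401) echo "FAIL: persisted token rejected by running proxy (divergence)"; return 1 ;;
    2??) echo "PASS: persisted token accepted" ;;
    *)   echo "FAIL: unexpected proxy status $1"; return 1 ;;
  esac
}

# Real usage would look roughly like:
#   token=$(tr -d '[:space:]' < "$TOKEN_FILE")
#   status=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "Authorization: Bearer ${token}" "${PROXY_URL}/v1/models")
#   token_check_verdict "$status"
```

A 2xx from `/v1/models` with the on-disk token means the running proxy and the persisted token agree; anything else fails the phase.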
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough

Adds a new GPU double-onboard nightly job and test script for Ollama token-consistency checks; updates the existing GPU E2E job to run unconditionally.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as "GitHub Actions"
    participant Runner as "GPU Runner"
    participant Script as "test-gpu-double-onboard.sh"
    participant CLI as "nemoclaw CLI / Sandbox"
    participant Proxy as "Ollama Auth Proxy"
    participant Ollama as "Ollama Engine"
    GH->>Runner: start gpu-double-onboard-e2e
    Runner->>Script: execute test script
    Script->>CLI: onboard (first) -> create sandbox
    CLI->>Proxy: start auth-proxy (persist token)
    Proxy->>Ollama: forward requests
    Script->>Proxy: curl /v1/models (Authorization: Bearer <token>) => 200
    Script->>CLI: inference via sandbox -> proxy -> Ollama (expect success)
    Script->>CLI: re-onboard (recreate sandbox)
    Script->>Proxy: verify persisted token accepted (curl => 200)
    Script->>Proxy: verify invalid/absent tokens => 401
    Script->>CLI: inference again -> expect success
    Script->>Runner: on failure upload install/re-onboard/test artifacts
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Line 551: The workflow uses a custom runner label
"linux-amd64-gpu-l40g-latest-1" which actionlint will flag; either add an
actionlint config that whitelists this label or switch the workflow to a
standard GitHub runner. To fix, create a .actionlint.yaml at repo root with a
runners.allow list containing "linux-amd64-gpu-l40g-latest-1" (and any other
custom labels used, e.g., the one at the other occurrence), or update the
runs-on entries in the workflow to a standard label like "ubuntu-latest" so
actionlint no longer rejects the custom runner string.
In `@test/e2e/test-gpu-double-onboard.sh`:
- Around line 315-320: The test currently logs a substring of the bearer token
(using TOKEN_AFTER_FIRST and info "Token after first onboard..."), which leaks
secrets; change the logging to avoid printing token material by instead logging
only that the token file exists and the token length or a one-way digest (e.g.,
sha256) computed from TOKEN_AFTER_FIRST, and apply the same change to the
similar logging at lines referenced around TOKEN_AFTER_FIRST and the other
instance (the lines around 423-424). Ensure you reference TOKEN_FILE and
TOKEN_AFTER_FIRST when locating the code and replace the info message so it
outputs non-secret metadata (presence/length or digest) rather than any token
substring.
- Around line 67-71: The helper function skip() is defined but never used,
triggering ShellCheck SC2329; either remove the entire skip() function
(including the SKIP/TOTAL increments and printf) or add a call to skip "message"
where appropriate in the test flow so it is exercised; if you remove it, also
delete any unused SKIP/TOTAL variables or references; after the change ensure
the script has a proper shebang, is formatted with shfmt, is executable, and
passes ShellCheck validation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4cba2182-0a14-4515-a3b4-842116eefc78
📒 Files selected for processing (2)
.github/workflows/nightly-e2e.yaml
test/e2e/test-gpu-double-onboard.sh
```diff
-    runs-on: self-hosted
-    timeout-minutes: 60
     if: github.repository == 'NVIDIA/NemoClaw'
+    runs-on: linux-amd64-gpu-l40g-latest-1
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== runner label references =="
rg -n --hidden 'linux-amd64-gpu-l40g-latest-1' .
echo
echo "== actionlint config candidates =="
fd -HI '(^|/)(\.actionlint\.ya?ml|actionlint\.ya?ml)$' . | while read -r file; do
  echo "--- $file ---"
  sed -n '1,200p' "$file"
done
```

Repository: NVIDIA/NemoClaw
Length of output: 293
Add an actionlint config to allow custom runner labels, or switch to standard GitHub runners.
The custom runner label linux-amd64-gpu-l40g-latest-1 used at lines 551 and 599 will be flagged as invalid by actionlint because:
- It is not a standard GitHub Actions runner label
- No actionlint config exists in the repository to whitelist custom labels
Either create .actionlint.yaml at the repository root with a runner allow-list entry for this NVKS label, or use standard GitHub runners instead.
🧰 Tools
🪛 actionlint (1.7.12)
[error] 551-551: label "linux-amd64-gpu-l40g-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
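For the allow-list route, actionlint reads a config file at the repository root; a sketch of generating it, following actionlint's documented `self-hosted-runner` schema (the label list simply mirrors the one flagged above):

```shell
# Generate a minimal .actionlint.yaml that allow-lists the custom NVKS
# runner label so actionlint stops flagging it as unknown.
cat > .actionlint.yaml <<'EOF'
self-hosted-runner:
  labels:
    - linux-amd64-gpu-l40g-latest-1
EOF
```

Any other custom labels used in the workflows would be appended to the same `labels:` list.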
```bash
skip() {
  ((SKIP++))
  ((TOTAL++))
  printf '\033[33m SKIP: %s\033[0m\n' "$1"
}
```
Remove or use the dead skip() helper.
ShellCheck already reports SC2329 here, so this new script won't meet the repo's shell-script validation bar until the helper is either invoked or deleted.
♻️ Minimal cleanup

```diff
-skip() {
-  ((SKIP++))
-  ((TOTAL++))
-  printf '\033[33m SKIP: %s\033[0m\n' "$1"
-}
```

As per coding guidelines, `**/*.sh`: Shell scripts must pass ShellCheck validation (configured in .shellcheckrc), be formatted with shfmt, include shebangs, and be executable.
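Deletion is one option; the author ultimately keeps the helper for parity with `pass()`/`fail()` and silences the warning instead (see the follow-up comment later in the thread). A sketch of that route, with the counters included so the snippet stands alone:

```shell
#!/usr/bin/env bash
SKIP=0
TOTAL=0

# shellcheck disable=SC2329  # defined for parity with pass()/fail(); not yet invoked
skip() {
  ((SKIP++))
  ((TOTAL++))
  printf '\033[33m SKIP: %s\033[0m\n' "$1"
}
```

The `disable=SC2329` directive applies only to the function it annotates, so the rest of the script stays fully linted.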
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 67-71: This function is never invoked. Check usage (or ignored if invoked indirectly).
(SC2329)
```bash
# 4f: Record the first-onboard token for later comparison
TOKEN_AFTER_FIRST=""
if [ -f "$TOKEN_FILE" ]; then
  TOKEN_AFTER_FIRST=$(cat "$TOKEN_FILE" | tr -d '[:space:]')
  info "Token after first onboard: ${TOKEN_AFTER_FIRST:0:8}..."
fi
```
Don't print proxy token material into CI logs.
These info lines leak part of the bearer token into logs that get uploaded as failure artifacts. Please log only presence/length, or a one-way digest, instead of any token substring.
🔒 Safer logging

```diff
-info "Token after first onboard: ${TOKEN_AFTER_FIRST:0:8}..."
+info "Token after first onboard: present"
```
…
```diff
-info "Token after re-onboard: ${TOKEN_AFTER_SECOND:0:8}..."
+info "Token after re-onboard: present"
```

Also applies to: 423-424
Use linux-amd64-gpu-rtxpro6000-latest-1 runners which NemoClaw already has access to via nv-gpu-amd64-rtxpro6000-1gpu runner group, instead of L40G runners which are in limited supply.
Actionable comments posted: 1
♻️ Duplicate comments (1)

.github/workflows/nightly-e2e.yaml (1)

551-551: ⚠️ Potential issue | 🟠 Major

Custom GPU runner label will fail actionlint unless allow-listed.

Line 551 and Line 599 use a custom label (linux-amd64-gpu-rtxpro6000-latest-1), which actionlint flags as unknown by default. Add/update the .actionlint.yaml runner allow-list (or switch labels) so lint stays green.

Suggested config addition

```diff
+# .actionlint.yaml
+self-hosted-runner:
+  labels:
+    - linux-amd64-gpu-rtxpro6000-latest-1
```

```bash
#!/bin/bash
set -euo pipefail
echo "== custom GPU labels in workflow =="
rg -n 'runs-on:\s*linux-amd64-gpu-rtxpro6000-latest-1' .github/workflows/nightly-e2e.yaml
echo
echo "== actionlint config files =="
fd -HI '(^|/)(\.actionlint\.ya?ml|actionlint\.ya?ml)$' .
echo
for f in $(fd -HI '(^|/)(\.actionlint\.ya?ml|actionlint\.ya?ml)$'); do
  echo "--- $f ---"
  sed -n '1,200p' "$f"
done
```

Also applies to: 599-599
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Line 600: The workflow step for gpu-double-onboard-e2e currently sets
timeout-minutes: 30 which matches NEMOCLAW_E2E_DEFAULT_TIMEOUT=1800 in
test/e2e/test-gpu-double-onboard.sh and leaves no margin for checkout/GPU
setup/teardown; update the workflow's timeout-minutes for the
gpu-double-onboard-e2e job (the key named "timeout-minutes") to a larger value
(e.g., 45 or 60) so the job has headroom beyond the test script's 1800s timeout
and reduce flakes.
---
Duplicate comments:
In @.github/workflows/nightly-e2e.yaml:
- Line 551: The custom runner label "linux-amd64-gpu-rtxpro6000-latest-1" used
in runs-on in the nightly-e2e workflow is flagged by actionlint; update your
actionlint configuration to allow this label by adding it to the runner
allow-list in .actionlint.yaml (or replace the custom label with a standard
GitHub-hosted runner label). Locate the runs-on entries using the string
"linux-amd64-gpu-rtxpro6000-latest-1" in nightly-e2e.yaml and add that exact
label to the runners allow/whitelist section in .actionlint.yaml so actionlint
no longer reports it as unknown.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 51237bc3-5de9-46f4-a2d5-f6631f1e8293
📒 Files selected for processing (1)
.github/workflows/nightly-e2e.yaml
```yaml
  gpu-double-onboard-e2e:
    if: github.repository == 'NVIDIA/NemoClaw'
    runs-on: linux-amd64-gpu-rtxpro6000-latest-1
    timeout-minutes: 30
```
Increase timeout buffer for gpu-double-onboard-e2e to reduce flakes.
Line 600 sets timeout-minutes: 30, but test/e2e/test-gpu-double-onboard.sh sets NEMOCLAW_E2E_DEFAULT_TIMEOUT=1800 (same 30 minutes). With checkout + GPU checks + teardown, this leaves near-zero margin and can cause false timeouts.
Proposed adjustment

```diff
-    timeout-minutes: 30
+    timeout-minutes: 40
```
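The headroom argument is simple arithmetic: the script's own cap plus a setup/teardown budget must fit inside the job timeout. A sketch with an assumed 10-minute overhead budget (the 1800 s figure comes from the review; the budget is illustrative, not measured):

```shell
# Script-level timeout from the review (NEMOCLAW_E2E_DEFAULT_TIMEOUT), plus an
# assumed budget for checkout, GPU checks, and teardown; round up to minutes.
SCRIPT_TIMEOUT_S=1800
OVERHEAD_BUDGET_S=600   # illustrative: ~10 min for setup/teardown
MIN_JOB_TIMEOUT_MIN=$(( (SCRIPT_TIMEOUT_S + OVERHEAD_BUDGET_S + 59) / 60 ))
echo "timeout-minutes should be >= ${MIN_JOB_TIMEOUT_MIN}"
```

With these assumed numbers the floor works out to 40 minutes, matching the proposed adjustment.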
The `skip()` function is defined alongside `pass()`/`fail()` for completeness but is not yet invoked in this script. Added a shellcheck disable directive.
GPU E2E Results on RTX Pro 6000 Runners

Manually dispatched the nightly workflow from this branch (run #25057572379):
This is the exact scenario the test was designed to catch. The fix lives in #2555.
…ers (NVIDIA#2606) (NVIDIA#2617)

## Summary

Add `test-gpu-double-onboard.sh` that reproduces the exact scenario from NVIDIA#2553: re-onboard with `NEMOCLAW_PROVIDER=ollama` must not leave the proxy running with a different token than what's persisted to disk.

### What the test covers

The exact reproduction scenario from NVIDIA#2553:

1. Full non-interactive onboard with `NEMOCLAW_PROVIDER=ollama`
2. Verify proxy running, token persisted, inference works
3. Re-onboard (second onboard, same sandbox with `NEMOCLAW_RECREATE_SANDBOX=1`)
4. **Token consistency**: file token matches what the running proxy accepts (not 401)
5. Inference through sandbox still works after re-onboard

### NVKS runner migration

Both `gpu-e2e` (existing) and `gpu-double-onboard-e2e` (new) are moved from the never-enabled `self-hosted` runner to NVKS ephemeral GPU runners (`linux-amd64-gpu-l40g-latest-1` — L40G, 48 GB VRAM):

- **Removes `GPU_E2E_ENABLED` gate** — the variable was never set, so `gpu-e2e` has been silently skipped every nightly since it was added on March 25
- **Eliminates `needs: gpu-e2e` serialization** — separate ephemeral VMs, no shared state
- **Drops timeout from 60 to 30 min** — fresh VM each run, no stale state cleanup; auto-selected model is `qwen2.5:7b` (4.4 GB pull)
- **Enables GPU E2E coverage in nightly for the first time**

### Testing

The nightly workflow only triggers on `schedule` and `workflow_dispatch`. To validate before merge, run the workflow manually from this branch.

Closes NVIDIA#2606

### Related

- NVIDIA#2553 — The bug (ollama proxy token divergence)
- NVIDIA#2555 — The fix (restart proxy when token diverges)
- NVIDIA#839 — Original gpu-e2e addition (never enabled)

## Summary by CodeRabbit

- **Chores**
  - GPU nightly E2E workflow now runs unconditionally on ephemeral GPU runners with a shorter timeout and improved failure reporting that surfaces the new double-onboard job.
- **Tests**
  - Added a comprehensive GPU E2E that verifies double-onboard behavior, proxy token persistence and acceptance/rejection, end-to-end inference, and sandbox cleanup.