test(e2e): add GPU double-onboard E2E + migrate GPU jobs to NVKS runners (#2606) #2617
Conversation
…sistency (#2606)

Add `test-gpu-double-onboard.sh` that reproduces the exact scenario from issue #2553: re-onboard with `NEMOCLAW_PROVIDER=ollama` must not leave the proxy running with a different token than what is persisted to disk.

Phases:
1. First onboard via `install.sh --non-interactive` (Ollama provider)
2. Verify sandbox, proxy, token file, inference through sandbox
3. Second onboard (re-onboard) via `nemoclaw onboard --non-interactive`
4. Token consistency: persisted token must match running proxy (not 401)
5. Inference through sandbox must succeed after re-onboard

Migrate both `gpu-e2e` and the new `gpu-double-onboard-e2e` from the never-enabled self-hosted runner (sparky) to NVKS ephemeral GPU runners (`linux-amd64-gpu-l40g-latest-1`, L40G 48 GB VRAM). This:
- Removes the `GPU_E2E_ENABLED` gate (always runs now)
- Eliminates `needs: gpu-e2e` serialization (separate VMs)
- Drops timeout from 60 to 30 min (ephemeral = no stale state)
- Actually enables GPU E2E coverage in nightly for the first time

Closes #2606
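The token-consistency phase (step 4) boils down to replaying the persisted token against the running proxy and treating a 401 as divergence. A minimal sketch of that decision logic, where `token_check_verdict` and `PROXY_URL` are illustrative names, not the script's actual ones:

```shell
#!/usr/bin/env bash
# Sketch of the phase-4 check; the proxy URL is a stand-in, not the real one.
PROXY_URL="${PROXY_URL:-http://127.0.0.1:11435}"

# Map the proxy's HTTP status for the persisted token to a pass/fail verdict.
token_check_verdict() {
  case "$1" in
    401) echo "FAIL: persisted token rejected by running proxy (divergence)"; return 1 ;;
    2??) echo "PASS: persisted token accepted" ;;
    *)   echo "FAIL: unexpected proxy status $1"; return 1 ;;
  esac
}

# Real usage would look roughly like:
#   token=$(tr -d '[:space:]' < "$TOKEN_FILE")
#   status=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "Authorization: Bearer ${token}" "${PROXY_URL}/v1/models")
#   token_check_verdict "$status"
```

A 2xx from `/v1/models` with the on-disk token means the running proxy and the persisted token agree; anything else fails the phase.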
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough

Adds a new GPU double-onboard nightly job and test script for Ollama token-consistency checks; updates the existing GPU E2E job to run unconditionally.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as "GitHub Actions"
    participant Runner as "GPU Runner"
    participant Script as "test-gpu-double-onboard.sh"
    participant CLI as "nemoclaw CLI / Sandbox"
    participant Proxy as "Ollama Auth Proxy"
    participant Ollama as "Ollama Engine"
    GH->>Runner: start gpu-double-onboard-e2e
    Runner->>Script: execute test script
    Script->>CLI: onboard (first) -> create sandbox
    CLI->>Proxy: start auth-proxy (persist token)
    Proxy->>Ollama: forward requests
    Script->>Proxy: curl /v1/models (Authorization: Bearer <token>) => 200
    Script->>CLI: inference via sandbox -> proxy -> Ollama (expect success)
    Script->>CLI: re-onboard (recreate sandbox)
    Script->>Proxy: verify persisted token accepted (curl => 200)
    Script->>Proxy: verify invalid/absent tokens => 401
    Script->>CLI: inference again -> expect success
    Script->>Runner: on failure upload install/re-onboard/test artifacts
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Line 551: The workflow uses a custom runner label
"linux-amd64-gpu-l40g-latest-1" which actionlint will flag; either add an
actionlint config that whitelists this label or switch the workflow to a
standard GitHub runner. To fix, create a .actionlint.yaml at repo root with a
runners.allow list containing "linux-amd64-gpu-l40g-latest-1" (and any other
custom labels used, e.g., the one at the other occurrence), or update the
runs-on entries in the workflow to a standard label like "ubuntu-latest" so
actionlint no longer rejects the custom runner string.
In `@test/e2e/test-gpu-double-onboard.sh`:
- Around line 315-320: The test currently logs a substring of the bearer token
(using TOKEN_AFTER_FIRST and info "Token after first onboard..."), which leaks
secrets; change the logging to avoid printing token material by instead logging
only that the token file exists and the token length or a one-way digest (e.g.,
sha256) computed from TOKEN_AFTER_FIRST, and apply the same change to the
similar logging at lines referenced around TOKEN_AFTER_FIRST and the other
instance (the lines around 423-424). Ensure you reference TOKEN_FILE and
TOKEN_AFTER_FIRST when locating the code and replace the info message so it
outputs non-secret metadata (presence/length or digest) rather than any token
substring.
- Around line 67-71: The helper function skip() is defined but never used,
triggering ShellCheck SC2329; either remove the entire skip() function
(including the SKIP/TOTAL increments and printf) or add a call to skip "message"
where appropriate in the test flow so it is exercised; if you remove it, also
delete any unused SKIP/TOTAL variables or references; after the change ensure
the script has a proper shebang, is formatted with shfmt, is executable, and
passes ShellCheck validation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4cba2182-0a14-4515-a3b4-842116eefc78
📒 Files selected for processing (2)
.github/workflows/nightly-e2e.yaml
test/e2e/test-gpu-double-onboard.sh
```diff
-    runs-on: self-hosted
-    timeout-minutes: 60
     if: github.repository == 'NVIDIA/NemoClaw'
+    runs-on: linux-amd64-gpu-l40g-latest-1
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== runner label references =="
rg -n --hidden 'linux-amd64-gpu-l40g-latest-1' .
echo
echo "== actionlint config candidates =="
fd -HI '(^|/)(\.actionlint\.ya?ml|actionlint\.ya?ml)$' . | while read -r file; do
  echo "--- $file ---"
  sed -n '1,200p' "$file"
done
```

Repository: NVIDIA/NemoClaw
Length of output: 293
Add an actionlint config to allow custom runner labels, or switch to standard GitHub runners.
The custom runner label linux-amd64-gpu-l40g-latest-1 used at lines 551 and 599 will be flagged as invalid by actionlint because:
- It is not a standard GitHub Actions runner label
- No actionlint config exists in the repository to whitelist custom labels
Either create .actionlint.yaml at the repository root with a runner allow-list entry for this NVKS label, or use standard GitHub runners instead.
🧰 Tools
🪛 actionlint (1.7.12)
[error] 551-551: label "linux-amd64-gpu-l40g-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
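For the allow-list route, actionlint reads a config file at the repository root; a sketch of generating it, following actionlint's documented `self-hosted-runner` schema (the label list simply mirrors the one flagged above):

```shell
# Generate a minimal .actionlint.yaml that allow-lists the custom NVKS
# runner label so actionlint stops flagging it as unknown.
cat > .actionlint.yaml <<'EOF'
self-hosted-runner:
  labels:
    - linux-amd64-gpu-l40g-latest-1
EOF
```

Any other custom labels used in the workflows would be appended to the same `labels:` list.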
```bash
skip() {
  ((SKIP++))
  ((TOTAL++))
  printf '\033[33m SKIP: %s\033[0m\n' "$1"
}
```
Remove or use the dead skip() helper.
ShellCheck already reports SC2329 here, so this new script won't meet the repo's shell-script validation bar until the helper is either invoked or deleted.
♻️ Minimal cleanup

```diff
-skip() {
-  ((SKIP++))
-  ((TOTAL++))
-  printf '\033[33m SKIP: %s\033[0m\n' "$1"
-}
```

As per coding guidelines, `**/*.sh`: Shell scripts must pass ShellCheck validation (configured in .shellcheckrc), be formatted with shfmt, include shebangs, and be executable.
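Deletion is one option; the author ultimately keeps the helper for parity with `pass()`/`fail()` and silences the warning instead (see the follow-up comment later in the thread). A sketch of that route, with the counters included so the snippet stands alone:

```shell
#!/usr/bin/env bash
SKIP=0
TOTAL=0

# shellcheck disable=SC2329  # defined for parity with pass()/fail(); not yet invoked
skip() {
  ((SKIP++))
  ((TOTAL++))
  printf '\033[33m SKIP: %s\033[0m\n' "$1"
}
```

The `disable=SC2329` directive applies only to the function it annotates, so the rest of the script stays fully linted.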
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 67-71: This function is never invoked. Check usage (or ignored if invoked indirectly).
(SC2329)
```bash
# 4f: Record the first-onboard token for later comparison
TOKEN_AFTER_FIRST=""
if [ -f "$TOKEN_FILE" ]; then
  TOKEN_AFTER_FIRST=$(cat "$TOKEN_FILE" | tr -d '[:space:]')
  info "Token after first onboard: ${TOKEN_AFTER_FIRST:0:8}..."
fi
```
Don't print proxy token material into CI logs.
These info lines leak part of the bearer token into logs that get uploaded as failure artifacts. Please log only presence/length, or a one-way digest, instead of any token substring.
🔒 Safer logging

```diff
-info "Token after first onboard: ${TOKEN_AFTER_FIRST:0:8}..."
+info "Token after first onboard: present"
```
…
```diff
-info "Token after re-onboard: ${TOKEN_AFTER_SECOND:0:8}..."
+info "Token after re-onboard: present"
```

Also applies to: 423-424
Use linux-amd64-gpu-rtxpro6000-latest-1 runners which NemoClaw already has access to via nv-gpu-amd64-rtxpro6000-1gpu runner group, instead of L40G runners which are in limited supply.
Actionable comments posted: 1
♻️ Duplicate comments (1)

.github/workflows/nightly-e2e.yaml (1)

551-551: ⚠️ Potential issue | 🟠 Major

Custom GPU runner label will fail actionlint unless allow-listed.

Line 551 and Line 599 use a custom label (linux-amd64-gpu-rtxpro6000-latest-1), which actionlint flags as unknown by default. Add/update the .actionlint.yaml runner allow-list (or switch labels) so lint stays green.

Suggested config addition

```diff
+# .actionlint.yaml
+self-hosted-runner:
+  labels:
+    - linux-amd64-gpu-rtxpro6000-latest-1
```

```bash
#!/bin/bash
set -euo pipefail
echo "== custom GPU labels in workflow =="
rg -n 'runs-on:\s*linux-amd64-gpu-rtxpro6000-latest-1' .github/workflows/nightly-e2e.yaml
echo
echo "== actionlint config files =="
fd -HI '(^|/)(\.actionlint\.ya?ml|actionlint\.ya?ml)$' .
echo
for f in $(fd -HI '(^|/)(\.actionlint\.ya?ml|actionlint\.ya?ml)$'); do
  echo "--- $f ---"
  sed -n '1,200p' "$f"
done
```

Also applies to: 599-599
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Line 600: The workflow step for gpu-double-onboard-e2e currently sets
timeout-minutes: 30 which matches NEMOCLAW_E2E_DEFAULT_TIMEOUT=1800 in
test/e2e/test-gpu-double-onboard.sh and leaves no margin for checkout/GPU
setup/teardown; update the workflow's timeout-minutes for the
gpu-double-onboard-e2e job (the key named "timeout-minutes") to a larger value
(e.g., 45 or 60) so the job has headroom beyond the test script's 1800s timeout
and reduce flakes.
---
Duplicate comments:
In @.github/workflows/nightly-e2e.yaml:
- Line 551: The custom runner label "linux-amd64-gpu-rtxpro6000-latest-1" used
in runs-on in the nightly-e2e workflow is flagged by actionlint; update your
actionlint configuration to allow this label by adding it to the runner
allow-list in .actionlint.yaml (or replace the custom label with a standard
GitHub-hosted runner label). Locate the runs-on entries using the string
"linux-amd64-gpu-rtxpro6000-latest-1" in nightly-e2e.yaml and add that exact
label to the runners allow/whitelist section in .actionlint.yaml so actionlint
no longer reports it as unknown.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 51237bc3-5de9-46f4-a2d5-f6631f1e8293
📒 Files selected for processing (1)
.github/workflows/nightly-e2e.yaml
```yaml
  gpu-double-onboard-e2e:
    if: github.repository == 'NVIDIA/NemoClaw'
    runs-on: linux-amd64-gpu-rtxpro6000-latest-1
    timeout-minutes: 30
```
Increase timeout buffer for gpu-double-onboard-e2e to reduce flakes.
Line 600 sets timeout-minutes: 30, but test/e2e/test-gpu-double-onboard.sh sets NEMOCLAW_E2E_DEFAULT_TIMEOUT=1800 (same 30 minutes). With checkout + GPU checks + teardown, this leaves near-zero margin and can cause false timeouts.
Proposed adjustment

```diff
-    timeout-minutes: 30
+    timeout-minutes: 40
```
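The headroom argument is simple arithmetic: the script's own cap plus a setup/teardown budget must fit inside the job timeout. A sketch with an assumed 10-minute overhead budget (the 1800 s figure comes from the review; the budget is illustrative, not measured):

```shell
# Script-level timeout from the review (NEMOCLAW_E2E_DEFAULT_TIMEOUT), plus an
# assumed budget for checkout, GPU checks, and teardown; round up to minutes.
SCRIPT_TIMEOUT_S=1800
OVERHEAD_BUDGET_S=600   # illustrative: ~10 min for setup/teardown
MIN_JOB_TIMEOUT_MIN=$(( (SCRIPT_TIMEOUT_S + OVERHEAD_BUDGET_S + 59) / 60 ))
echo "timeout-minutes should be >= ${MIN_JOB_TIMEOUT_MIN}"
```

With these assumed numbers the floor works out to 40 minutes, matching the proposed adjustment.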
The `skip()` function is defined alongside `pass()`/`fail()` for completeness but is not yet invoked in this script. Added a shellcheck disable directive.
GPU E2E Results on RTX Pro 6000 Runners

Manually dispatched the nightly workflow from this branch (run #25057572379):
This is the exact scenario the test was designed to catch. The fix lives in #2555.
…ers (NVIDIA#2606) (NVIDIA#2617)

## Summary

Add `test-gpu-double-onboard.sh` that reproduces the exact scenario from NVIDIA#2553: re-onboard with `NEMOCLAW_PROVIDER=ollama` must not leave the proxy running with a different token than what's persisted to disk.

### What the test covers

The exact reproduction scenario from NVIDIA#2553:

1. Full non-interactive onboard with `NEMOCLAW_PROVIDER=ollama`
2. Verify proxy running, token persisted, inference works
3. Re-onboard (second onboard, same sandbox with `NEMOCLAW_RECREATE_SANDBOX=1`)
4. **Token consistency**: file token matches what the running proxy accepts (not 401)
5. Inference through sandbox still works after re-onboard

### NVKS runner migration

Both `gpu-e2e` (existing) and `gpu-double-onboard-e2e` (new) are moved from the never-enabled `self-hosted` runner to NVKS ephemeral GPU runners (`linux-amd64-gpu-l40g-latest-1` — L40G, 48 GB VRAM):

- **Removes `GPU_E2E_ENABLED` gate** — the variable was never set, so `gpu-e2e` has been silently skipped every nightly since it was added on March 25
- **Eliminates `needs: gpu-e2e` serialization** — separate ephemeral VMs, no shared state
- **Drops timeout from 60 to 30 min** — fresh VM each run, no stale state cleanup; auto-selected model is `qwen2.5:7b` (4.4 GB pull)
- **Enables GPU E2E coverage in nightly for the first time**

### Testing

The nightly workflow only triggers on `schedule` and `workflow_dispatch`. To validate before merge, run the workflow manually from this branch.

Closes NVIDIA#2606

### Related

- NVIDIA#2553 — The bug (ollama proxy token divergence)
- NVIDIA#2555 — The fix (restart proxy when token diverges)
- NVIDIA#839 — Original gpu-e2e addition (never enabled)

## Summary by CodeRabbit

- **Chores**
  - GPU nightly E2E workflow now runs unconditionally on ephemeral GPU runners with a shorter timeout and improved failure reporting that surfaces the new double-onboard job.
- **Tests**
  - Added a comprehensive GPU E2E that verifies double-onboard behavior, proxy token persistence and acceptance/rejection, end-to-end inference, and sandbox cleanup.