
test(e2e): add GPU double-onboard E2E + migrate GPU jobs to NVKS runners (#2606) #2617

Merged
jyaunches merged 3 commits into main from issue-2606-gpu-double-onboard-e2e on Apr 28, 2026

Conversation


@jyaunches jyaunches commented Apr 28, 2026

Summary

Add test-gpu-double-onboard.sh that reproduces the exact scenario from #2553: re-onboard with NEMOCLAW_PROVIDER=ollama must not leave the proxy running with a different token than what's persisted to disk.

What the test covers

The exact reproduction scenario from #2553:

  1. Full non-interactive onboard with NEMOCLAW_PROVIDER=ollama
  2. Verify proxy running, token persisted, inference works
  3. Re-onboard (second onboard, same sandbox with NEMOCLAW_RECREATE_SANDBOX=1)
  4. Token consistency: file token matches what the running proxy accepts (not 401); see the sketch after this list
  5. Inference through sandbox still works after re-onboard
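
A minimal sketch of the phase-4 consistency check, assuming an illustrative token path and proxy URL (the script's real TOKEN_FILE location and port may differ):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative placeholders -- the actual script defines its own paths.
TOKEN_FILE="${HOME}/.nemoclaw/ollama-proxy-token"
PROXY_URL="http://localhost:11435"

# Read the token persisted by the most recent onboard.
DISK_TOKEN=$(tr -d '[:space:]' <"$TOKEN_FILE")

# The running proxy must accept the persisted token (anything but 401)...
status=$(curl -s -o /dev/null -w '%{http_code}' \
  -H "Authorization: Bearer ${DISK_TOKEN}" "${PROXY_URL}/v1/models")
if [ "$status" = "401" ]; then
  echo "FAIL: token on disk rejected by running proxy (#2553 regression)" >&2
  exit 1
fi

# ...and must still reject a bogus token with 401.
status=$(curl -s -o /dev/null -w '%{http_code}' \
  -H 'Authorization: Bearer not-a-real-token' "${PROXY_URL}/v1/models")
[ "$status" = "401" ] || { echo "FAIL: proxy accepted an invalid token" >&2; exit 1; }
```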

NVKS runner migration

Both gpu-e2e (existing) and gpu-double-onboard-e2e (new) are moved from the never-enabled self-hosted runner to NVKS ephemeral GPU runners (linux-amd64-gpu-l40g-latest-1 — L40G, 48 GB VRAM):

  • Removes GPU_E2E_ENABLED gate — the variable was never set, so gpu-e2e has been silently skipped every nightly since it was added on March 25
  • Eliminates needs: gpu-e2e serialization — separate ephemeral VMs, no shared state
  • Drops timeout from 60 to 30 min — fresh VM each run, no stale state cleanup, auto-selected model is qwen2.5:7b (4.4 GB pull)
  • Enables GPU E2E coverage in nightly for the first time

Testing

The nightly workflow only triggers on schedule and workflow_dispatch. To validate before merge, run the workflow manually from this branch.
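
For example, with the GitHub CLI (workflow file name taken from this PR; any dispatch inputs it may take are omitted):

```bash
# Dispatch the nightly E2E workflow from the PR branch, then watch the run.
gh workflow run nightly-e2e.yaml --ref issue-2606-gpu-double-onboard-e2e
gh run list --workflow=nightly-e2e.yaml --limit 1
```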

Closes #2606

Related

  • #2553 — The bug (ollama proxy token divergence)
  • #2555 — The fix (restart proxy when token diverges)
  • #839 — Original gpu-e2e addition (never enabled)

Summary by CodeRabbit

  • Chores

    • GPU nightly E2E workflow now runs unconditionally on ephemeral GPU runners with a shorter timeout and improved failure reporting that surfaces the new double-onboard job.
  • Tests

    • Added a comprehensive GPU E2E that verifies double-onboard behavior, proxy token persistence and acceptance/rejection, end-to-end inference, and sandbox cleanup.

…sistency (#2606)

Add test-gpu-double-onboard.sh that reproduces the exact scenario from
issue #2553: re-onboard with NEMOCLAW_PROVIDER=ollama must not leave the
proxy running with a different token than what is persisted to disk.

Phases:
  1. First onboard via install.sh --non-interactive (Ollama provider)
  2. Verify sandbox, proxy, token file, inference through sandbox
  3. Second onboard (re-onboard) via nemoclaw onboard --non-interactive
  4. Token consistency: persisted token must match running proxy (not 401)
  5. Inference through sandbox must succeed after re-onboard

Migrate both gpu-e2e and the new gpu-double-onboard-e2e from the
never-enabled self-hosted runner (sparky) to NVKS ephemeral GPU runners
(linux-amd64-gpu-l40g-latest-1, L40G 48 GB VRAM). This:
  - Removes the GPU_E2E_ENABLED gate (always runs now)
  - Eliminates needs: gpu-e2e serialization (separate VMs)
  - Drops timeout from 60 to 30 min (ephemeral = no stale state)
  - Actually enables GPU E2E coverage in nightly for the first time

Closes #2606

coderabbitai Bot commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6b8262c7-6e32-4e1c-8084-8b1efa287197

📥 Commits

Reviewing files that changed from the base of the PR and between 048787b and b8e4080.

📒 Files selected for processing (1)
  • test/e2e/test-gpu-double-onboard.sh
✅ Files skipped from review due to trivial changes (1)
  • test/e2e/test-gpu-double-onboard.sh

📝 Walkthrough

Adds a new GPU double-onboard nightly job and test script for Ollama token-consistency checks; updates the existing GPU E2E job to run unconditionally for NVIDIA/NemoClaw on a specified NVKS GPU runner with a reduced 30-minute timeout; expands failure artifact uploads and includes the new job in failure notification dependencies.

Changes

Cohort / File(s): Summary

Workflow Configuration: .github/workflows/nightly-e2e.yaml
Updated gpu-e2e job to run when github.repository == 'NVIDIA/NemoClaw' (removed var gate), switched runs-on to linux-amd64-gpu-rtxpro6000-latest-1, reduced timeout-minutes 60→30. Added gpu-double-onboard-e2e job (new) that runs on the same GPU runner, sets Ollama re-onboard environment, runs test-gpu-double-onboard.sh, uploads three distinct failure artifacts, and is added to the failure-notification job's needs.

GPU Double-Onboard Test Script: test/e2e/test-gpu-double-onboard.sh
New ~570-line bash E2E script implementing phases: pre-cleanup, prerequisites (Docker/nvidia-smi/Ollama), first onboard with NEMOCLAW_PROVIDER=ollama, re-onboard with sandbox recreation, token-consistency checks (persisted token file, proxy accepts persisted token, rejects invalid tokens), inference validation, destroy/cleanup, and aggregated failure logging.
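
For illustration, the inference-validation step could reduce to a smoke check like this sketch (endpoint, token path, and sentinel prompt are assumptions; the script's actual request goes through the sandbox):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative placeholders matching the sketch conventions above.
TOKEN_FILE="${HOME}/.nemoclaw/ollama-proxy-token"
PROXY_URL="http://localhost:11435"
DISK_TOKEN=$(tr -d '[:space:]' <"$TOKEN_FILE")

# Ask the proxied model to echo a sentinel and assert it comes back.
resp=$(curl -s "${PROXY_URL}/v1/chat/completions" \
  -H "Authorization: Bearer ${DISK_TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen2.5:7b","messages":[{"role":"user","content":"Reply with exactly PONG"}]}')
echo "$resp" | grep -q 'PONG' || { echo "FAIL: inference check (expected PONG)" >&2; exit 1; }
```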

Sequence Diagram(s)

sequenceDiagram
  participant GH as "GitHub Actions"
  participant Runner as "GPU Runner"
  participant Script as "test-gpu-double-onboard.sh"
  participant CLI as "nemoclaw CLI / Sandbox"
  participant Proxy as "Ollama Auth Proxy"
  participant Ollama as "Ollama Engine"

  GH->>Runner: start gpu-double-onboard-e2e
  Runner->>Script: execute test script
  Script->>CLI: onboard (first) -> create sandbox
  CLI->>Proxy: start auth-proxy (persist token)
  Proxy->>Ollama: forward requests
  Script->>Proxy: curl /v1/models (Authorization: Bearer <token>) => 200
  Script->>CLI: inference via sandbox -> proxy -> Ollama (expect success)
  Script->>CLI: re-onboard (recreate sandbox)
  Script->>Proxy: verify persisted token accepted (curl => 200)
  Script->>Proxy: verify invalid/absent tokens => 401
  Script->>CLI: inference again -> expect success
  Script->>Runner: on failure upload install/re-onboard/test artifacts

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped in twice to check the key,
Tokens safe and sandboxes agree,
Proxy sings, models answer true,
Re-onboard done — the carrots chew,
Logs bundled up, a hopping review.

🚥 Pre-merge checks: 4 passed ✅ | 1 warning ❌

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 42.86%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes the main changes: adding a GPU double-onboard E2E test and migrating GPU jobs to NVKS runners.
  • Linked Issues Check ✅ Passed: The PR implements all core requirements: adds test-gpu-double-onboard.sh with the required phases, creates the gpu-double-onboard-e2e workflow job, handles runner configuration, and includes failure notifications.
  • Out of Scope Changes Check ✅ Passed: All changes align with linked issue #2606: the new GPU double-onboard E2E script, workflow job addition, and runner migration are all directly specified objectives.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4cba2182-0a14-4515-a3b4-842116eefc78

📥 Commits

Reviewing files that changed from the base of the PR and between 0b49851 and 174b274.

📒 Files selected for processing (2)
  • .github/workflows/nightly-e2e.yaml
  • test/e2e/test-gpu-double-onboard.sh

Comment thread: .github/workflows/nightly-e2e.yaml (Outdated)

-    runs-on: self-hosted
-    timeout-minutes: 60
+    if: github.repository == 'NVIDIA/NemoClaw'
+    runs-on: linux-amd64-gpu-l40g-latest-1
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== runner label references =="
rg -n --hidden 'linux-amd64-gpu-l40g-latest-1' .

echo
echo "== actionlint config candidates =="
fd -HI '(^|/)(\.actionlint\.ya?ml|actionlint\.ya?ml)$' . | while read -r file; do
  echo "--- $file ---"
  sed -n '1,200p' "$file"
done

Repository: NVIDIA/NemoClaw

Length of output: 293


Add an actionlint config to allow custom runner labels, or switch to standard GitHub runners.

The custom runner label linux-amd64-gpu-l40g-latest-1 used at lines 551 and 599 will be flagged as invalid by actionlint because:

  1. It is not a standard GitHub Actions runner label
  2. No actionlint config exists in the repository to whitelist custom labels

Either create .actionlint.yaml at the repository root with a runner allow-list entry for this NVKS label, or use standard GitHub runners instead.
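
A quick local reproduction, assuming actionlint is installed (its -config-file flag points the tool at a non-default config path such as the repo-root file suggested here):

```bash
# Reproduce the unknown-label error.
actionlint .github/workflows/nightly-e2e.yaml

# Allow-list the custom NVKS label, then re-run against the new config.
cat >.actionlint.yaml <<'EOF'
self-hosted-runner:
  labels:
    - linux-amd64-gpu-l40g-latest-1
EOF
actionlint -config-file .actionlint.yaml .github/workflows/nightly-e2e.yaml
```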

🧰 Tools
🪛 actionlint (1.7.12)

[error] 551-551: label "linux-amd64-gpu-l40g-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml at line 551, The workflow uses a custom
runner label "linux-amd64-gpu-l40g-latest-1" which actionlint will flag; either
add an actionlint config that whitelists this label or switch the workflow to a
standard GitHub runner. To fix, create a .actionlint.yaml at repo root with a
runners.allow list containing "linux-amd64-gpu-l40g-latest-1" (and any other
custom labels used, e.g., the one at the other occurrence), or update the
runs-on entries in the workflow to a standard label like "ubuntu-latest" so
actionlint no longer rejects the custom runner string.

Comment on lines +67 to +71
skip() {
  ((SKIP++))
  ((TOTAL++))
  printf '\033[33m  SKIP: %s\033[0m\n' "$1"
}

⚠️ Potential issue | 🟡 Minor

Remove or use the dead skip() helper.

ShellCheck already reports SC2329 here, so this new script won't meet the repo's shell-script validation bar until the helper is either invoked or deleted.

♻️ Minimal cleanup
-skip() {
-  ((SKIP++))
-  ((TOTAL++))
-  printf '\033[33m  SKIP: %s\033[0m\n' "$1"
-}

As per coding guidelines, **/*.sh: Shell scripts must pass ShellCheck validation (configured in .shellcheckrc), be formatted with shfmt, include shebangs, and be executable.
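
A local pass over those requirements might look like this (standard shellcheck/shfmt invocations; script path from this PR):

```bash
# Lint, check formatting, and confirm shebang plus executable bit.
shellcheck test/e2e/test-gpu-double-onboard.sh
shfmt -d test/e2e/test-gpu-double-onboard.sh   # prints a diff if formatting drifts
head -n 1 test/e2e/test-gpu-double-onboard.sh  # should print the shebang
test -x test/e2e/test-gpu-double-onboard.sh && echo "executable: yes"
```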

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 67-71: This function is never invoked. Check usage (or ignored if invoked indirectly).

(SC2329)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-gpu-double-onboard.sh` around lines 67 - 71, The helper
function skip() is defined but never used, triggering ShellCheck SC2329; either
remove the entire skip() function (including the SKIP/TOTAL increments and
printf) or add a call to skip "message" where appropriate in the test flow so it
is exercised; if you remove it, also delete any unused SKIP/TOTAL variables or
references; after the change ensure the script has a proper shebang, is
formatted with shfmt, is executable, and passes ShellCheck validation.

Comment on lines +315 to +320
# 4f: Record the first-onboard token for later comparison
TOKEN_AFTER_FIRST=""
if [ -f "$TOKEN_FILE" ]; then
  TOKEN_AFTER_FIRST=$(cat "$TOKEN_FILE" | tr -d '[:space:]')
  info "Token after first onboard: ${TOKEN_AFTER_FIRST:0:8}..."
fi

⚠️ Potential issue | 🟡 Minor

Don't print proxy token material into CI logs.

These info lines leak part of the bearer token into logs that get uploaded as failure artifacts. Please log only presence/length, or a one-way digest, instead of any token substring.

🔒 Safer logging

-  info "Token after first onboard: ${TOKEN_AFTER_FIRST:0:8}..."
+  info "Token after first onboard: present"

-  info "Token after re-onboard: ${TOKEN_AFTER_SECOND:0:8}..."
+  info "Token after re-onboard: present"

Also applies to: 423-424
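
A digest-based variant could look like the following sketch (assumes the script's own info helper and sha256sum being available on the runner):

```bash
# Log presence, length, and a truncated one-way digest -- never raw token bytes.
if [ -n "$TOKEN_AFTER_FIRST" ]; then
  digest=$(printf '%s' "$TOKEN_AFTER_FIRST" | sha256sum | awk '{print substr($1, 1, 12)}')
  info "Token after first onboard: present (len=${#TOKEN_AFTER_FIRST}, sha256=${digest})"
fi
```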

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-gpu-double-onboard.sh` around lines 315 - 320, The test
currently logs a substring of the bearer token (using TOKEN_AFTER_FIRST and info
"Token after first onboard..."), which leaks secrets; change the logging to
avoid printing token material by instead logging only that the token file exists
and the token length or a one-way digest (e.g., sha256) computed from
TOKEN_AFTER_FIRST, and apply the same change to the similar logging at lines
referenced around TOKEN_AFTER_FIRST and the other instance (the lines around
423-424). Ensure you reference TOKEN_FILE and TOKEN_AFTER_FIRST when locating
the code and replace the info message so it outputs non-secret metadata
(presence/length or digest) rather than any token substring.

Use linux-amd64-gpu-rtxpro6000-latest-1 runners which NemoClaw already
has access to via nv-gpu-amd64-rtxpro6000-1gpu runner group, instead of
L40G runners which are in limited supply.
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/nightly-e2e.yaml (1)

551-551: ⚠️ Potential issue | 🟠 Major

Custom GPU runner label will fail actionlint unless allow-listed.

Line 551 and Line 599 use a custom label (linux-amd64-gpu-rtxpro6000-latest-1), which actionlint flags as unknown by default. Add/update .actionlint.yaml runner allow-list (or switch labels) so lint stays green.

Suggested config addition

+ # .actionlint.yaml
+ self-hosted-runner:
+   labels:
+     - linux-amd64-gpu-rtxpro6000-latest-1

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== custom GPU labels in workflow =="
rg -n 'runs-on:\s*linux-amd64-gpu-rtxpro6000-latest-1' .github/workflows/nightly-e2e.yaml

echo
echo "== actionlint config files =="
fd -HI '(^|/)(\.actionlint\.ya?ml|actionlint\.ya?ml)$' .

echo
for f in $(fd -HI '(^|/)(\.actionlint\.ya?ml|actionlint\.ya?ml)$'); do
  echo "--- $f ---"
  sed -n '1,200p' "$f"
done

Also applies to: 599-599

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml at line 551, The custom runner label
"linux-amd64-gpu-rtxpro6000-latest-1" used in runs-on in the nightly-e2e
workflow is flagged by actionlint; update your actionlint configuration to allow
this label by adding it to the runner allow-list in .actionlint.yaml (or replace
the custom label with a standard GitHub-hosted runner label). Locate the runs-on
entries using the string "linux-amd64-gpu-rtxpro6000-latest-1" in
nightly-e2e.yaml and add that exact label to the runners allow/whitelist section
in .actionlint.yaml so actionlint no longer reports it as unknown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 51237bc3-5de9-46f4-a2d5-f6631f1e8293

📥 Commits

Reviewing files that changed from the base of the PR and between 174b274 and 048787b.

📒 Files selected for processing (1)
  • .github/workflows/nightly-e2e.yaml

Comment thread: .github/workflows/nightly-e2e.yaml

gpu-double-onboard-e2e:
  if: github.repository == 'NVIDIA/NemoClaw'
  runs-on: linux-amd64-gpu-rtxpro6000-latest-1
  timeout-minutes: 30

⚠️ Potential issue | 🟠 Major

Increase timeout buffer for gpu-double-onboard-e2e to reduce flakes.

Line 600 sets timeout-minutes: 30, but test/e2e/test-gpu-double-onboard.sh sets NEMOCLAW_E2E_DEFAULT_TIMEOUT=1800 (same 30 minutes). With checkout + GPU checks + teardown, this leaves near-zero margin and can cause false timeouts.

Proposed adjustment
-    timeout-minutes: 30
+    timeout-minutes: 40
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml at line 600, The workflow step for
gpu-double-onboard-e2e currently sets timeout-minutes: 30 which matches
NEMOCLAW_E2E_DEFAULT_TIMEOUT=1800 in test/e2e/test-gpu-double-onboard.sh and
leaves no margin for checkout/GPU setup/teardown; update the workflow's
timeout-minutes for the gpu-double-onboard-e2e job (the key named
"timeout-minutes") to a larger value (e.g., 45 or 60) so the job has headroom
beyond the test script's 1800s timeout and reduce flakes.

The skip() function is defined alongside pass()/fail() for completeness
but not yet invoked in this script. Add shellcheck disable directive.
@jyaunches jyaunches (Author) commented:

GPU E2E Results on RTX Pro 6000 Runners

Manually dispatched the nightly workflow from this branch (run #25057572379):

| Job | Result | Runner |
| --- | --- | --- |
| gpu-e2e | ✅ Passed | linux-amd64-gpu-rtxpro6000-latest-1 |
| gpu-double-onboard-e2e | ❌ 3 failures (22/25 passed) | linux-amd64-gpu-rtxpro6000-latest-1 |

The gpu-double-onboard-e2e failures are expected — the test is correctly detecting the token divergence bug from #2553:

FAIL: PROXY TOKEN DIVERGENCE DETECTED (#2553 regression)
FAIL: Token on disk does not match running proxy (status: 401)
FAIL: Sandbox inference after re-onboard: expected PONG, got:

This is the exact scenario the test was designed to catch. The fix lives in #2555 (fix/ollama-proxy-token-divergence), which hasn't merged to main yet. Once #2555 lands, this test should go green.

@jyaunches jyaunches merged commit 3a8853f into main Apr 28, 2026
18 checks passed
@wscurran wscurran added the CI/CD, NemoClaw CLI, enhancement: testing, E2E, and Provider: Ollama labels on Apr 28, 2026 (label descriptions are listed under Labels below).
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026

…ers (NVIDIA#2606) (NVIDIA#2617)
@jyaunches jyaunches mentioned this pull request May 8, 2026
7 tasks

Labels

  • CI/CD: Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions.
  • E2E: End-to-end testing — Brev infrastructure, test cases, nightly failures, and coverage gaps
  • enhancement: testing: Use this label to identify requests to improve NemoClaw test coverage.
  • NemoClaw CLI: Use this label to identify issues with the NemoClaw command-line interface (CLI).
  • Provider: Ollama: Use this label to identify issues with the local Ollama model integration.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test(e2e): add GPU double-onboard E2E test for Ollama proxy token consistency

3 participants