Commit eb15e55

ericksoa and cv authored
feat(onboard): use OpenShell Docker GPU sandboxes (#3001)
## Summary

- route Linux onboarding through the OpenShell Docker-driver gateway instead of the legacy k3s bootstrap path
- add sandbox GPU env/CLI controls, Docker/NVIDIA CDI preflight, GPU metadata, and direct-GPU policy/proof checks
- update the GPU e2e path to prove direct sandbox GPU access with `nvidia-smi`, `/proc` comm write, and `cuInit(0)`

## Tests

- `bash -n scripts/install-openshell.sh && bash -n test/e2e/test-gpu-e2e.sh`
- `npm run build:cli`
- `npm run typecheck:cli`
- `npx vitest run test/onboard.test.ts src/lib/onboard-command.test.ts test/install-openshell-version-check.test.ts src/lib/inventory-commands.test.ts`
- `npm run check:credential-env`
- `git diff --check`

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Sandbox GPU controls in onboarding/CLI (enable/disable + device) and OpenShell release-channel selection (stable/dev/auto).
* **Improvements**
  * Inventory/status now shows host GPU detection, sandbox GPU enabled/mode/device, and OpenShell driver/version.
  * Installer now downloads/verifies multiple OpenShell assets and reports installed OpenShell version.
  * Docker-driver support for privileged sandbox operations and gateway lifecycle handling.
  * Base-image resolution improved with GLIBC compatibility checks and safer fallback/build behavior.
* **Tests**
  * Added/updated GPU e2e checks and broad test coverage for onboarding, CLI, provisioning, and image-resolution.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
1 parent ee06654 commit eb15e55

62 files changed

Lines changed: 4001 additions & 545 deletions

.coderabbit.yaml

Lines changed: 17 additions & 1 deletion
@@ -188,10 +188,12 @@ reviews:
         - `hermes-discord-e2e` — Hermes Discord config schema + placeholder
           isolation
         - `hermes-slack-e2e` — Hermes Slack policy + Python placeholder egress
+        - `openshell-gateway-upgrade-e2e` — stale Linux Docker-driver gateway
+          process restart after OpenShell upgrade

         To run selectively:
         ```
-        gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=cloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e,messaging-compatible-endpoint-e2e,hermes-discord-e2e,hermes-slack-e2e
+        gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=cloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e,messaging-compatible-endpoint-e2e,hermes-discord-e2e,hermes-slack-e2e,openshell-gateway-upgrade-e2e
         ```

     - path: "src/nemoclaw.ts"
@@ -425,6 +427,20 @@ reviews:
     - path: "nemoclaw-blueprint/openclaw-plugins/kimi-inference-compat/**"
       instructions: *e2e-kimi-inference-compat

+    - path: "test/e2e/test-openshell-gateway-upgrade.sh"
+      instructions: |
+        This script validates the old OpenShell install upgrade guard for
+        Linux Docker-driver gateway processes.
+
+        **E2E test recommendation:**
+        - `openshell-gateway-upgrade-e2e` — stale gateway process restart after
+          OpenShell upgrade
+
+        To run selectively:
+        ```
+        gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=openshell-gateway-upgrade-e2e
+        ```
+
     - path: ".github/workflows/nightly-e2e.yaml"
       instructions: |
         This is the nightly E2E workflow definition. Changes here affect

.github/actions/resolve-hermes-base-image/action.yaml

Lines changed: 49 additions & 6 deletions
@@ -10,10 +10,53 @@ runs:
     - name: Resolve Hermes sandbox base image
       shell: bash
       run: |
-        if docker pull ghcr.io/nvidia/nemoclaw/hermes-sandbox-base:latest 2>/dev/null; then
-          echo "HERMES_BASE_IMAGE=ghcr.io/nvidia/nemoclaw/hermes-sandbox-base:latest" >> "$GITHUB_ENV"
-        else
-          echo "::warning::GHCR Hermes base image not available, building locally"
-          docker build -f agents/hermes/Dockerfile.base -t nemoclaw-hermes-base-local .
-          echo "HERMES_BASE_IMAGE=nemoclaw-hermes-base-local" >> "$GITHUB_ENV"
+        set -euo pipefail
+
+        image="ghcr.io/nvidia/nemoclaw/hermes-sandbox-base"
+        min_glibc="2.39"
+
+        glibc_version() {
+          docker run --rm --entrypoint /usr/bin/ldd "$1" --version 2>/dev/null \
+            | sed -nE 's/.*GLIBC ([0-9]+\.[0-9]+).*/\1/p; s/.* ([0-9]+\.[0-9]+)$/\1/p' \
+            | head -n 1
+        }
+
+        glibc_ok() {
+          local have="$1"
+          [[ -n "$have" ]] && [[ "$(printf '%s\n%s\n' "$min_glibc" "$have" | sort -V | head -n 1)" == "$min_glibc" ]]
+        }
+
+        try_image() {
+          local ref="$1" version
+          if ! docker pull "$ref" >/dev/null 2>&1; then
+            return 1
+          fi
+          version="$(glibc_version "$ref" || true)"
+          if ! glibc_ok "$version"; then
+            echo "::warning::Hermes sandbox base image ${ref} has glibc ${version:-unknown}; need >= ${min_glibc}"
+            return 1
+          fi
+          echo "HERMES_BASE_IMAGE=${ref}" >> "$GITHUB_ENV"
+          return 0
+        }
+
+        candidates=()
+        if [[ -n "${GITHUB_SHA:-}" ]]; then
+          candidates+=("${image}:${GITHUB_SHA:0:8}" "${image}:${GITHUB_SHA:0:7}")
+        fi
+        candidates+=("${image}:latest")
+
+        for ref in "${candidates[@]}"; do
+          if try_image "$ref"; then
+            exit 0
+          fi
+        done
+
+        echo "::warning::No compatible GHCR Hermes sandbox base image found, building locally"
+        docker build -f agents/hermes/Dockerfile.base -t nemoclaw-hermes-base-local .
+        version="$(glibc_version nemoclaw-hermes-base-local || true)"
+        if ! glibc_ok "$version"; then
+          echo "::error::Local Hermes sandbox base image has glibc ${version:-unknown}; need >= ${min_glibc}"
+          exit 1
         fi
+        echo "HERMES_BASE_IMAGE=nemoclaw-hermes-base-local" >> "$GITHUB_ENV"
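
The `glibc_ok` gate in the script above relies on `sort -V` to compare dotted versions: the minimum passes only if it sorts first (or ties) against the detected version. A minimal standalone sketch of that comparison follows, assuming a `sort` that supports `-V` (GNU coreutils does). The sample versions in the loop are illustrative and not taken from any real image.

```bash
#!/usr/bin/env bash
# Minimum glibc version, mirroring the value used in the action.
min_glibc="2.39"

# Succeeds when "$1" is non-empty and is at least min_glibc
# under version ordering (sort -V), so 2.4 counts as older than 2.39.
glibc_ok() {
  local have="$1"
  [[ -n "$have" ]] && [[ "$(printf '%s\n%s\n' "$min_glibc" "$have" | sort -V | head -n 1)" == "$min_glibc" ]]
}

# Illustrative inputs only: an older version, the minimum itself,
# a newer version, and an empty string (detection failed).
for v in 2.36 2.39 2.41 ""; do
  if glibc_ok "$v"; then
    echo "${v:-unknown}: compatible"
  else
    echo "${v:-unknown}: rejected"
  fi
done
```

An empty version is rejected rather than treated as compatible, which is why the action can safely call `glibc_version ... || true` and still fail closed.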

.github/actions/resolve-sandbox-base-image/action.yaml

Lines changed: 49 additions & 6 deletions
@@ -10,10 +10,53 @@ runs:
     - name: Resolve sandbox base image
       shell: bash
       run: |
-        if docker pull ghcr.io/nvidia/nemoclaw/sandbox-base:latest 2>/dev/null; then
-          echo "BASE_IMAGE=ghcr.io/nvidia/nemoclaw/sandbox-base:latest" >> "$GITHUB_ENV"
-        else
-          echo "::warning::GHCR base image not available, building locally"
-          docker build -f Dockerfile.base -t nemoclaw-sandbox-base-local .
-          echo "BASE_IMAGE=nemoclaw-sandbox-base-local" >> "$GITHUB_ENV"
+        set -euo pipefail
+
+        image="ghcr.io/nvidia/nemoclaw/sandbox-base"
+        min_glibc="2.39"
+
+        glibc_version() {
+          docker run --rm --entrypoint /usr/bin/ldd "$1" --version 2>/dev/null \
+            | sed -nE 's/.*GLIBC ([0-9]+\.[0-9]+).*/\1/p; s/.* ([0-9]+\.[0-9]+)$/\1/p' \
+            | head -n 1
+        }
+
+        glibc_ok() {
+          local have="$1"
+          [[ -n "$have" ]] && [[ "$(printf '%s\n%s\n' "$min_glibc" "$have" | sort -V | head -n 1)" == "$min_glibc" ]]
+        }
+
+        try_image() {
+          local ref="$1" version
+          if ! docker pull "$ref" >/dev/null 2>&1; then
+            return 1
+          fi
+          version="$(glibc_version "$ref" || true)"
+          if ! glibc_ok "$version"; then
+            echo "::warning::Sandbox base image ${ref} has glibc ${version:-unknown}; need >= ${min_glibc}"
+            return 1
+          fi
+          echo "BASE_IMAGE=${ref}" >> "$GITHUB_ENV"
+          return 0
+        }
+
+        candidates=()
+        if [[ -n "${GITHUB_SHA:-}" ]]; then
+          candidates+=("${image}:${GITHUB_SHA:0:8}" "${image}:${GITHUB_SHA:0:7}")
+        fi
+        candidates+=("${image}:latest")
+
+        for ref in "${candidates[@]}"; do
+          if try_image "$ref"; then
+            exit 0
+          fi
+        done
+
+        echo "::warning::No compatible GHCR sandbox base image found, building locally"
+        docker build -f Dockerfile.base -t nemoclaw-sandbox-base-local .
+        version="$(glibc_version nemoclaw-sandbox-base-local || true)"
+        if ! glibc_ok "$version"; then
+          echo "::error::Local sandbox base image has glibc ${version:-unknown}; need >= ${min_glibc}"
+          exit 1
         fi
+        echo "BASE_IMAGE=nemoclaw-sandbox-base-local" >> "$GITHUB_ENV"
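
The `glibc_version` helper above extracts a `major.minor` version from the first line of `ldd --version` with a two-part `sed` program: the first substitution targets distro-style banners containing `GLIBC x.y`, and the second falls back to a trailing `x.y` at the end of the line. The sketch below runs the same `sed` program against sample banner strings instead of a container. `parse_glibc_version` is a hypothetical name introduced here, and the banner strings are illustrative assumptions about what `ldd --version` may print.

```bash
#!/usr/bin/env bash
# Hypothetical helper: the same sed program as glibc_version in the action,
# but reading the ldd banner from stdin so it can be tried without Docker.
parse_glibc_version() {
  sed -nE 's/.*GLIBC ([0-9]+\.[0-9]+).*/\1/p; s/.* ([0-9]+\.[0-9]+)$/\1/p' \
    | head -n 1
}

# Illustrative banners (assumed formats, not captured from a real image).
printf '%s\n' 'ldd (Ubuntu GLIBC 2.39-0ubuntu8) 2.39' | parse_glibc_version
printf '%s\n' 'ldd (GNU libc) 2.36' | parse_glibc_version

# A line with no recognizable version yields empty output,
# which glibc_ok then treats as incompatible.
printf '%s\n' 'no version here' | parse_glibc_version
```

When the first substitution matches, the pattern space becomes just the version, so the fallback no longer matches and the version is printed once.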

.github/workflows/base-image.yaml

Lines changed: 6 additions & 2 deletions
@@ -65,10 +65,12 @@ jobs:
       - name: Extract metadata
         id: meta
         uses: docker/metadata-action@v6
+        env:
+          DOCKER_METADATA_SHORT_SHA_LENGTH: 8
         with:
           images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
           tags: |
-            type=raw,value=latest
+            type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
             type=sha,prefix=,format=short

       - name: Validate OpenClaw version input
@@ -113,10 +115,12 @@ jobs:
       - name: Extract metadata
         id: meta
         uses: docker/metadata-action@v6
+        env:
+          DOCKER_METADATA_SHORT_SHA_LENGTH: 8
         with:
           images: ${{ env.REGISTRY }}/nvidia/nemoclaw/hermes-sandbox-base
           tags: |
-            type=raw,value=latest
+            type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
             type=sha,prefix=,format=short

       - name: Build and push
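
This workflow now publishes SHA tags with an 8-character prefix, and the resolver actions earlier in this commit try the 8-character tag, then a 7-character tag, then `latest`. The 7-character candidate presumably covers images tagged with the metadata action's default short-SHA length, which is an assumption on my part and not stated in the commit. A minimal sketch of how that candidate list is derived follows. `build_candidates` and the sample SHA are hypothetical, introduced only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper that prints candidate image refs, most specific first.
# It mirrors the candidates=() logic in the resolver actions.
build_candidates() {
  local image="$1" sha="$2"
  if [[ -n "$sha" ]]; then
    # 8-character tag (matches DOCKER_METADATA_SHORT_SHA_LENGTH: 8),
    # then a 7-character tag as a fallback.
    printf '%s\n' "${image}:${sha:0:8}" "${image}:${sha:0:7}"
  fi
  printf '%s\n' "${image}:latest"
}

# Made-up commit SHA for illustration; in CI this would be $GITHUB_SHA.
build_candidates "ghcr.io/nvidia/nemoclaw/sandbox-base" \
  "0123456789abcdef0123456789abcdef01234567"
```

Because `latest` is now only pushed from `refs/heads/main`, trying the commit-specific tags first lets a branch build pick up its own base image when one exists.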

.github/workflows/docker-pin-check.yaml

Lines changed: 2 additions & 1 deletion
@@ -2,7 +2,7 @@
 # SPDX-License-Identifier: Apache-2.0
 #
 # Weekly check that the pinned Dockerfile base-image digest is still current.
-# Fails with an actionable message when a newer node:22-slim is available.
+# Fails with an actionable message when a newer node:22-trixie-slim is available.

 name: docker-pin-check

@@ -28,3 +28,4 @@ jobs:
         run: |
           bash scripts/update-docker-pin.sh --check
           DOCKERFILE=Dockerfile.base bash scripts/update-docker-pin.sh --check
+          DOCKERFILE=agents/hermes/Dockerfile.base bash scripts/update-docker-pin.sh --check

.github/workflows/nightly-e2e.yaml

Lines changed: 50 additions & 0 deletions
@@ -19,6 +19,9 @@
 # Discord + Slack coverage with cross-talk assertions. See issue #1903.
 # sandbox-survival-e2e Sandbox survival across gateway restarts (onboard, inference,
 # gateway stop/start, verify sandbox + workspace + inference).
+# openshell-gateway-upgrade-e2e
+# Validates stale Linux Docker-driver OpenShell gateway
+# processes are restarted after an OpenShell upgrade.
 # hermes-e2e Hermes Agent E2E — install → onboard --agent hermes → health
 # probe → live inference. Validates the multi-agent architecture.
 # hermes-discord-e2e Hermes Discord onboarding — validates the top-level Hermes
@@ -58,6 +61,7 @@ on:
           messaging-compatible-endpoint-e2e,
           kimi-inference-compat-e2e,
           token-rotation-e2e, sandbox-survival-e2e,
+          openshell-gateway-upgrade-e2e,
           issue-2478-crash-loop-recovery-e2e, hermes-e2e, hermes-discord-e2e,
           hermes-slack-e2e, sandbox-operations-e2e, inference-routing-e2e,
           network-policy-e2e, deployment-services-e2e, diagnostics-e2e,
@@ -1126,6 +1130,45 @@ jobs:
             /tmp/nemoclaw-e2e-upgrade-install.log
           if-no-files-found: ignore

+  # ── OpenShell gateway upgrade E2E ────────────────────────────
+  # Reproduces the old-install upgrade edge case for Linux Docker-driver
+  # gateways: a healthy gateway process with stale supervisor/runtime env must
+  # be restarted rather than reused after the current OpenShell install.
+  openshell-gateway-upgrade-e2e:
+    if: >-
+      github.repository == 'NVIDIA/NemoClaw' &&
+      (github.event_name != 'workflow_dispatch' ||
+      inputs.jobs == '' ||
+      contains(format(',{0},', inputs.jobs), ',openshell-gateway-upgrade-e2e,'))
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v6
+
+      - name: Setup Node
+        uses: actions/setup-node@v6
+        with:
+          node-version: "22"
+
+      - name: Run OpenShell gateway upgrade E2E test
+        env:
+          GITHUB_TOKEN: ${{ github.token }}
+          NEMOCLAW_NON_INTERACTIVE: "1"
+          NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE: "1"
+        run: bash test/e2e/test-openshell-gateway-upgrade.sh
+
+      - name: Upload gateway upgrade logs on failure
+        if: failure()
+        uses: actions/upload-artifact@v4
+        with:
+          name: openshell-gateway-upgrade-logs
+          path: |
+            /tmp/nemoclaw-e2e-openshell-gateway-upgrade.log
+            /tmp/nemoclaw-e2e-openshell-gateway-start.log
+            /tmp/nemoclaw-e2e-openshell-gateway-process.log
+          if-no-files-found: ignore
+
   # ── Hermes rebuild upgrade E2E ──────────────────────────────
   # Same upgrade scenario as OpenClaw but for Hermes Agent.
   rebuild-hermes-e2e:
@@ -1209,12 +1252,14 @@ jobs:
           NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}
           NEMOCLAW_NON_INTERACTIVE: "1"
           NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE: "1"
+          NEMOCLAW_SANDBOX_NAME: "e2e-double-install"
         run: bash install.sh --non-interactive --yes-i-accept-third-party-software
       - name: Run double onboard E2E test
         env:
           NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}
           NEMOCLAW_NON_INTERACTIVE: "1"
           NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE: "1"
+          NEMOCLAW_E2E_INSTALL_SANDBOX_NAME: "e2e-double-install"
         run: |
           [ -f "$HOME/.bashrc" ] && source "$HOME/.bashrc" 2>/dev/null || true
           export NVM_DIR="${NVM_DIR:-$HOME/.nvm}"
@@ -1246,12 +1291,14 @@ jobs:
           NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}
           NEMOCLAW_NON_INTERACTIVE: "1"
           NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE: "1"
+          NEMOCLAW_SANDBOX_NAME: "e2e-repair-install"
         run: bash install.sh --non-interactive --yes-i-accept-third-party-software
       - name: Run onboard repair E2E test
         env:
           NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}
           NEMOCLAW_NON_INTERACTIVE: "1"
           NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE: "1"
+          NEMOCLAW_E2E_INSTALL_SANDBOX_NAME: "e2e-repair-install"
         run: |
           [ -f "$HOME/.bashrc" ] && source "$HOME/.bashrc" 2>/dev/null || true
           export NVM_DIR="${NVM_DIR:-$HOME/.nvm}"
@@ -1689,6 +1736,7 @@ jobs:
         shields-config-e2e,
         rebuild-openclaw-e2e,
         upgrade-stale-sandbox-e2e,
+        openshell-gateway-upgrade-e2e,
         rebuild-hermes-e2e,
         rebuild-hermes-stale-base-e2e,
         double-onboard-e2e,
@@ -1776,6 +1824,7 @@ jobs:
         shields-config-e2e,
         rebuild-openclaw-e2e,
         upgrade-stale-sandbox-e2e,
+        openshell-gateway-upgrade-e2e,
         rebuild-hermes-e2e,
         rebuild-hermes-stale-base-e2e,
         double-onboard-e2e,
@@ -1911,6 +1960,7 @@ jobs:
         shields-config-e2e,
         rebuild-openclaw-e2e,
         upgrade-stale-sandbox-e2e,
+        openshell-gateway-upgrade-e2e,
         rebuild-hermes-e2e,
         rebuild-hermes-stale-base-e2e,
         double-onboard-e2e,
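
The `if:` condition on the `openshell-gateway-upgrade-e2e` job wraps both the `jobs` input and the job name in commas before calling `contains`, so a job only matches as a whole list entry and not as a substring of another job name. The same idea can be sketched in bash as follows. `job_selected` is a hypothetical helper, and unlike the GitHub Actions `contains` function this comparison is case-sensitive.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the comma-separated list "$1" is empty
# (meaning "run everything") or contains "$2" as a complete entry.
job_selected() {
  local jobs="$1" name="$2"
  [[ -z "$jobs" ]] && return 0
  # Wrapping both sides in commas prevents substring matches such as
  # "my-openshell-gateway-upgrade-e2e-v2" selecting this job.
  [[ ",${jobs}," == *",${name},"* ]]
}

target="openshell-gateway-upgrade-e2e"
for jobs in "" "cloud-e2e,${target}" "cloud-e2e" "my-${target}-v2"; do
  if job_selected "$jobs" "$target"; then
    echo "jobs='${jobs}': run"
  else
    echo "jobs='${jobs}': skip"
  fi
done
```

The workflow condition additionally runs the job for any event other than `workflow_dispatch`, which this sketch does not model.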

.github/workflows/pr-self-hosted.yaml

Lines changed: 4 additions & 18 deletions
@@ -43,15 +43,8 @@ jobs:
       - name: Checkout
         uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

-      - name: Pull base image from GHCR (fall back to local build)
-        run: |
-          if docker pull ghcr.io/nvidia/nemoclaw/sandbox-base:latest 2>/dev/null; then
-            echo "BASE_IMAGE=ghcr.io/nvidia/nemoclaw/sandbox-base:latest" >> "$GITHUB_ENV"
-          else
-            echo "::warning::GHCR base image not available, building locally"
-            docker build -f Dockerfile.base -t nemoclaw-sandbox-base-local .
-            echo "BASE_IMAGE=nemoclaw-sandbox-base-local" >> "$GITHUB_ENV"
-          fi
+      - name: Resolve sandbox base image
+        uses: ./.github/actions/resolve-sandbox-base-image

       - name: Build production image
         run: docker build --build-arg BASE_IMAGE=${{ env.BASE_IMAGE }} -t nemoclaw-production .
@@ -85,15 +78,8 @@ jobs:
       - name: Checkout
         uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

-      - name: Pull base image from GHCR (fall back to local build)
-        run: |
-          if docker pull ghcr.io/nvidia/nemoclaw/sandbox-base:latest 2>/dev/null; then
-            echo "BASE_IMAGE=ghcr.io/nvidia/nemoclaw/sandbox-base:latest" >> "$GITHUB_ENV"
-          else
-            echo "::warning::GHCR base image not available, building locally"
-            docker build -f Dockerfile.base -t nemoclaw-sandbox-base-local .
-            echo "BASE_IMAGE=nemoclaw-sandbox-base-local" >> "$GITHUB_ENV"
-          fi
+      - name: Resolve sandbox base image
+        uses: ./.github/actions/resolve-sandbox-base-image

       - name: Build production image on arm64
         run: docker build --build-arg BASE_IMAGE=${{ env.BASE_IMAGE }} -t nemoclaw-production-arm64 .

.github/workflows/wsl-e2e.yaml

Lines changed: 4 additions & 0 deletions
@@ -122,6 +122,10 @@ jobs:
           $script = @'
           set -euo pipefail
           export DEBIAN_FRONTEND=noninteractive
+          printf '%s\n' \
+            'Acquire::ForceIPv4 "true";' \
+            'Acquire::Retries "5";' \
+            >/etc/apt/apt.conf.d/99github-actions-network
           apt-get update
           apt-get install -y bash ca-certificates curl git jq lsb-release make python3 python3-pip rsync tar unzip xz-utils
           '@
