
fix(sandbox): tighten bounding-set caps #3328

Merged
ericksoa merged 2 commits into NVIDIA:main from Dongni-Yang:fix/sandbox-bounding-set-3280
May 11, 2026

Conversation

@Dongni-Yang
Contributor

@Dongni-Yang Dongni-Yang commented May 11, 2026

Summary

Partial fix for #3280 — drop the two caps that are cleanly droppable today, surface residuals when the drop step is silently skipped, and rewrite the cap test so it actually exercises the regression.

  • Append cap_sys_admin and cap_sys_ptrace to the capsh --drop list in scripts/lib/sandbox-init.sh. Verified live: in a permissive runtime (--cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE), pre-drop CapBnd=0xa82c25fb, post-drop CapBnd=0x1e9 — only load-bearing caps remain.
  • Replace the misleading "runtime already restricts capabilities" message on the CAP_SETPCAP-missing fallback with report_residual_capabilities(), which reads CapBnd: from /proc/self/status and names which of the 5 must-drop caps remain. Uses bash 64-bit arithmetic (no gawk-strtonum dependency).
  • Rewrite test/e2e-gateway-isolation.sh test 14 to inventory all 8 caps named in the issue against CapBnd. 5 are classified must-drop; 3 (CAP_FOWNER, CAP_SETUID, CAP_SETGID) are classified allowed because dropping them requires an entrypoint refactor (see below). The test container now starts with --cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE so the bounding set entering capsh matches the permissive runtime that triggered T6002104. Without this, docker's default bounding set already excludes those caps and the test would have been a no-op for the very regression we care about.
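
As a rough illustration of the CapBnd check described above (not the actual body of `report_residual_capabilities()`), the sketch below reads the `CapBnd:` line from a status file and tests a single bit with bash's 64-bit arithmetic. The helper names `read_capbnd` and `cap_bit_set` and the optional file argument are assumptions made for this example; the bit number for CAP_SYS_ADMIN (21) comes from `linux/capability.h`.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from sandbox-init.sh.

# Print the hex CapBnd value from a /proc-style status file (default: our own).
read_capbnd() {
  local status_file="${1:-/proc/self/status}" key value
  [ -r "$status_file" ] || return 1
  while read -r key value; do
    if [ "$key" = "CapBnd:" ]; then
      printf '%s\n' "$value"
      return 0
    fi
  done <"$status_file"
  return 1 # no CapBnd line found
}

# Succeed when capability number $2 is set in hex mask $1.
cap_bit_set() {
  local mask=$((16#${1#0x}))
  (( (mask >> $2) & 1 ))
}

# Example: is CAP_SYS_ADMIN (bit 21) still in this process's bounding set?
if capbnd=$(read_capbnd) && cap_bit_set "$capbnd" 21; then
  echo "CAP_SYS_ADMIN still present in CapBnd=0x$capbnd" >&2
fi
```

Passing the status file as an argument keeps the parser testable against fabricated input, and the `16#` prefix avoids any dependency on gawk's `strtonum`.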

Scope: why this is 5/8, not 8/8

The issue names eight caps that must be absent from the bounding set: CAP_SYS_ADMIN, CAP_NET_RAW, CAP_NET_BIND_SERVICE, CAP_SYS_PTRACE, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_SETUID, CAP_SETGID.

This PR fully drops 5 of them. The remaining 3 — CAP_FOWNER, CAP_SETUID, CAP_SETGID — can't be dropped without an entrypoint refactor:

  1. The entrypoint runs as root and uses gosu to step down to the sandbox/gateway user.
  2. gosu performs the setuid() syscall, which requires CAP_SETUID in the process's permitted set.
  3. Dropping CAP_SETUID from the bounding set causes the next exec to strip it from permitted — but the bounding set can only be modified with CAP_SETPCAP, which only root has.
  4. So the drop must happen before gosu (which breaks gosu) or after gosu (which is too late — we're no longer root and can't modify the bounding set).

The proper fix is to replace gosu with setpriv (setpriv --reuid=sandbox --regid=sandbox --bounding-set=-setuid,-setgid,-fowner -- $cmd), which performs the reuid and the bounding-set drop atomically in one process. Note that setpriv takes capability names without the cap_ prefix. setpriv is already present in the image, but the refactor affects 4 gosu call sites in scripts/nemoclaw-start.sh, the parallel entrypoint in agents/hermes/start.sh (same shared library), and the no-new-privileges special case at scripts/nemoclaw-start.sh:1490. That is a different shape of change than "tighten bounding set" and needs its own design and review cycle.

Residual-risk note. Exploiting these 3 caps from the sandbox user's bounding set would require either (a) execing a setuid-root binary that carries those file caps (the image ships none), or (b) the user creating a new one, which is blocked by Landlock + no-new-privs + lack of CAP_DAC_OVERRIDE on root-owned paths. So the residual is a defense-in-depth gap, not a directly exploitable one.

Follow-up. I'll file an issue titled "refactor(entrypoint): replace gosu with setpriv to drop CAP_FOWNER/SETUID/SETGID from sandbox bounding set (#3280 follow-up)" and link it from here so T6002104 can track the second half.

Test plan

  • Forward case (live container). Built nemoclaw-isolation-test = nemoclaw-production + new sandbox-init.sh. Ran test 14 logic with --cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE: all 5 must-drop caps ABSENT, all 3 load-bearing caps PRESENT, test passes.
  • Negative case (live container). Rebuilt the image with cap_sys_admin removed from the drop list. Test correctly fails with "CAP_SYS_ADMIN still present in CapBnd after capsh drop". CapBnd=0x2001e9 (bit 21 set) — exactly the T6002104 signature.
  • Synthetic bit-decode harness. Replayed the decode loop over fabricated CapBnd values (ideal post-drop, single-cap regressions, full-skip case) — classifier correctly identifies each failure mode by name.
  • npx vitest run test/sandbox-init.test.ts — 36/36 pass (no-capsh fall-through and NEMOCLAW_CAPS_DROPPED=1 short-circuit unchanged).
  • shfmt -d -i 2 -ci -bn clean on both touched files.
  • bash -n clean.
  • CI to run full e2e-gateway-isolation.sh against a freshly-built sandbox image.
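
The synthetic bit-decode harness mentioned in the test plan could look roughly like the sketch below. It is not the harness used for the PR: `classify_capbnd` and the `MUST_DROP` table are illustrative names, the sample values are the CapBnd figures quoted in this description, and the bit numbers are taken from `linux/capability.h`.

```bash
#!/usr/bin/env bash
# Sketch only: classify_capbnd and MUST_DROP are illustrative names.

# name:bit pairs for the five must-drop caps.
MUST_DROP="CAP_DAC_OVERRIDE:1 CAP_NET_BIND_SERVICE:10 CAP_NET_RAW:13 CAP_SYS_PTRACE:19 CAP_SYS_ADMIN:21"

# Print the must-drop caps still set in hex mask $1; return non-zero if any are.
classify_capbnd() {
  local mask=$((16#${1#0x})) entry name bit residual=""
  for entry in $MUST_DROP; do
    name=${entry%%:*}
    bit=${entry##*:}
    if (( (mask >> bit) & 1 )); then
      residual="${residual:+$residual }$name"
    fi
  done
  [ -z "$residual" ] && return 0
  printf '%s\n' "$residual"
  return 1
}

# Replay the decode over the ideal post-drop value, a single-cap regression,
# and the full-skip (pre-drop) value.
for sample in 1e9 2001e9 a82c25fb; do
  if out=$(classify_capbnd "$sample"); then
    echo "0x$sample: PASS"
  else
    echo "0x$sample: FAIL ($out)"
  fi
done
```

Reporting residual caps by name rather than as a raw mask is what lets a failure point straight at the missing entry in the drop list.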

Notes for review

  • Drop list scope is intentionally narrowed per maintainer guidance; further hardening (CAP_SYS_MODULE, CAP_SYS_RAWIO, CAP_BPF, CAP_PERFMON, CAP_SYS_BOOT, CAP_SYSLOG, CAP_NET_ADMIN, …) is a candidate follow-up beyond this PR's issue.
  • Each kept cap (cap_chown, cap_fowner, cap_setuid, cap_setgid, cap_kill) now has an inline justification with a pointer to its load-bearing call site, so a future contributor can audit them without rediscovering [SECURITY] Dockerfile does not explicitly drop Linux capabilities #797 and fix(entrypoint): relax config permissions before write after CAP_DAC_OVERRIDE drop #2659.
  • report_residual_capabilities() is new code, but it only runs on the existing CAP_SETPCAP-missing fallback path — same trigger as the previous one-line warning, just with structured diagnostics. The happy path is unchanged.

Closes #3280 (partial — see scope section above; full closure depends on the setpriv follow-up).

Signed-off-by: Dongni Yang <dongniy@nvidia.com>

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Security Improvements
    • Enhanced sandbox security by dropping additional dangerous system capabilities.
    • Improved security diagnostics to detect and report residual dangerous capabilities in the runtime environment.
    • Strengthened validation testing to comprehensively verify that dangerous capabilities are properly restricted during sandbox initialization.


…#3280)

Append cap_sys_admin and cap_sys_ptrace to the capsh --drop list so they
no longer remain in the bounding set after the entrypoint re-execs. The
historical drop list already covered cap_net_raw / cap_dac_override /
cap_net_bind_service, but T6002104 still observed them present — the
root cause is the CAP_SETPCAP-missing fallback silently skipping the
entire drop and inheriting the runtime defaults.

Replace the misleading "runtime already restricts capabilities" message
on that fallback path with report_residual_capabilities(), which reads
CapBnd from /proc/self/status and names which of the 5 must-drop caps
remain. Uses bash 64-bit arithmetic so it does not depend on gawk
strtonum.

Also enumerate the load-bearing kept caps (cap_chown/cap_fowner for
post-drop chown, cap_setuid/cap_setgid for gosu, cap_kill for sandbox→
gateway signaling) inline so a future contributor can audit why each
one stays.

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
…VIDIA#3280)

Rewrite e2e-gateway-isolation.sh test 14 to inventory every cap named
in issue NVIDIA#3280 (CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_NET_RAW,
CAP_NET_BIND_SERVICE, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_SETUID,
CAP_SETGID) against CapBnd from /proc/self/status. Each is classified
as must-drop or allowed-load-bearing; any must-drop cap still present
fails the test by name. The previous assertion only decoded bit 13
(CAP_NET_RAW) and would have passed unchanged for an incomplete drop
list or a silently skipped drop step.

Run the test container with `--cap-add CAP_SYS_ADMIN --cap-add
CAP_SYS_PTRACE` so the bounding set entering capsh matches the
permissive OpenShell runtime that triggered T6002104. Without this,
docker's default bounding set already excludes those caps and the
test would have been a no-op for the regression we care about.
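
The no-op risk can be checked with plain mask arithmetic. The sketch below starts from the pre-drop `CapBnd=0xa82c25fb` quoted in the PR description and clears the two `--cap-add` bits, assuming CAP_SYS_PTRACE is bit 19 and CAP_SYS_ADMIN is bit 21 (per `linux/capability.h`). The variable names are made up for the example.

```bash
#!/usr/bin/env bash
permissive=$((16#a82c25fb))        # pre-drop CapBnd with both --cap-add flags
added=$(( (1 << 19) | (1 << 21) )) # CAP_SYS_PTRACE | CAP_SYS_ADMIN
without_cap_add=$(( permissive & ~added ))

# Bounding set the container would start with if the flags were omitted.
printf 'without --cap-add: 0x%x\n' "$without_cap_add"

# 0 here means neither cap is present, so a "cap absent" assertion would
# pass even if the drop step never ran.
printf 'ptrace/admin bits present: %d\n' $(( (without_cap_add & added) != 0 ))
```

With those bits already clear at container start, an assertion on their absence cannot distinguish a working drop from a skipped one, which is why the test adds them back explicitly.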

Validated locally against a derived nemoclaw-isolation-test image:
  - drop list including cap_sys_admin,cap_sys_ptrace → PASS,
    CapBnd=0x1e9 (load-bearing caps only).
  - drop list with cap_sys_admin omitted → FAIL with
    "CAP_SYS_ADMIN still present in CapBnd after capsh drop",
    CapBnd=0x2001e9 (bit 21 set), exactly the T6002104 signature.

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

📝 Walkthrough

Walkthrough

The PR enhances the sandbox entrypoint's capability-dropping security by expanding the set of dangerous capabilities targeted for removal, introducing residual capability reporting when standard dropping mechanisms fail, and validating the dropping behavior through an upgraded E2E test that inspects the full bounding set.

Changes

Capability Dropping Enhancement

| Layer / File(s) | Summary |
|---|---|
| Documentation & Specification (scripts/lib/sandbox-init.sh) | Expanded capability-drop documentation now explicitly lists cap_sys_admin and cap_sys_ptrace among the dropped set and clarifies which capabilities remain for entrypoint function. |
| Capability Dropping Function (scripts/lib/sandbox-init.sh) | drop_capabilities() updates capsh --drop invocation to include additional dangerous capabilities and replaces its fallback path (when cap_setpcap is missing) to invoke the new report_residual_capabilities() diagnostic. |
| Residual Capability Reporting (scripts/lib/sandbox-init.sh) | New report_residual_capabilities() function logs an error, reads /proc/self/status to extract CapBnd, identifies which dangerous capabilities remain in the bounding set, and reports them (or logs if bounding set cannot be determined). |
| E2E Test Validation (test/e2e-gateway-isolation.sh) | Test 14 is replaced with a full bounding-set inventory: it extracts the --drop list from sandbox-init.sh, invokes capsh with the extracted capabilities, computes CapBnd bits, verifies that must-drop capabilities are absent, and permits a designated set of load-bearing capabilities. |

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 The sandbox grows strong, its caps now tight,
New checkpoints guard against the night.
Test and verify what stays, what's gone,
Security hardened, the path goes on. 🔐

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title 'fix(sandbox): tighten bounding-set caps' directly summarizes the main change: enhancing capability dropping in the sandbox initialization by tightening bounding-set restrictions. |


@Dongni-Yang Dongni-Yang added the v0.0.39 Release target label May 11, 2026
@Dongni-Yang Dongni-Yang changed the title from "fix(sandbox): tighten bounding-set caps for issue #3280" to "fix(sandbox): tighten bounding-set caps" May 11, 2026
@Dongni-Yang
Contributor Author

Follow-up PR opened: #3329 — replaces gosu with setpriv to drop the remaining 3 caps (CAP_FOWNER / CAP_SETUID / CAP_SETGID) from the sandbox-user bounding set, closing the rest of #3280.

#3329 is stacked on this PR's branch; it should be merged after this one lands. Live-validated end-to-end: with both PRs applied, the sandbox-user CapBnd becomes 0x100 (only CAP_SETPCAP remains; all 8 caps named in #3280 are absent).

Dongni-Yang added a commit to Dongni-Yang/NemoClaw that referenced this pull request May 11, 2026
…NVIDIA#3280)

Follow-up to NVIDIA#3328, which dropped 5/8 of the caps named in issue NVIDIA#3280
but left CAP_FOWNER, CAP_SETUID, and CAP_SETGID present in the sandbox-
user process's bounding set. Those three were blocked by gosu: gosu
needs CAP_SETUID in permitted to make its setuid() syscall, but the
bounding set can only be modified with CAP_SETPCAP (root-only). So
dropping CAP_SETUID before gosu would break the privilege transition,
and dropping it after would be too late because we are no longer root.

setpriv from util-linux solves this by performing reuid + bounding-set
drop atomically inside a single process: setuid first (still holds
CAP_SETUID), then strip the bounding set (still root long enough to
hold CAP_SETPCAP), then exec the target.

Add init_step_down_prefixes() to scripts/lib/sandbox-init.sh which
populates two bash arrays at source time:

  STEP_DOWN_PREFIX_SANDBOX  — step down to sandbox user
  STEP_DOWN_PREFIX_GATEWAY  — step down to gateway user

Each array expands to a setpriv invocation that drops cap_setuid /
cap_setgid / cap_fowner / cap_chown / cap_kill from the bounding set
during the reuid. If setpriv or CAP_SETPCAP is unavailable, the arrays
fall back to plain "gosu <user>" and a warning is logged so the
residual cap retention surfaces in the entrypoint log (matches the
design of report_residual_capabilities from NVIDIA#3328).

Notes:
* setpriv uses unprefixed cap names (per `setpriv --list`), unlike
  capsh which uses cap_*. The arrays use the setpriv format.
* --init-groups (NOT --clear-groups): the gateway user is a member of
  the sandbox group via `usermod -aG sandbox gateway` in
  Dockerfile.base, which is required to write the chmod 660
  /sandbox/.openclaw/openclaw.json (setgid'd config dir, see NVIDIA#2681).
  --clear-groups would strip that membership and break mutateConfigFile
  with EACCES. --init-groups matches gosu's setgroups+initgroups
  behaviour and restores exactly the groups defined in /etc/group for
  the target user.
* Plain array assignment (not `declare -ga`) at file scope: bash 3.2
  on macOS rejects `declare -g`, and bash 3.2+ treats file-scope
  assignment as global by default. Inside init_step_down_prefixes()
  the reassignment is unscoped, so it targets the same globals in
  both bash 3.2 and 4+.
* Shellcheck SC2034 disabled on the prefix arrays because they are
  consumed cross-file (by scripts/nemoclaw-start.sh and
  agents/hermes/start.sh).

Replace the seven gosu call sites across both entrypoints:

  scripts/nemoclaw-start.sh:
    line 795  — auto-pair (sandbox)
    line 1610 — write_auth_profile + harden_auth_profiles (sandbox)
    line 1614 — final exec to NEMOCLAW_CMD (sandbox)
    line 1720 — OpenClaw gateway (gateway)
  agents/hermes/start.sh:
    line 294  — Discord facade (gateway)
    line 586  — final exec to NEMOCLAW_CMD (sandbox)
    line 607  — Hermes gateway (gateway)

The non-root fallback path in nemoclaw-start.sh (lines 1488+) and the
no-new-privileges history comments at lines 138-139 / 1490-1493 are
unchanged — that path does not use a privilege-step-down tool at all.

Validated live: with --cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE
(simulating permissive OpenShell runtime), source sandbox-init.sh and
chain drop_capabilities + STEP_DOWN_PREFIX_SANDBOX → final sandbox-
user CapBnd=0x100 (only CAP_SETPCAP remains; all 8 issue-NVIDIA#3280 caps
absent). Negative path: removing -setuid from the setpriv drop list
correctly leaves CAP_SETUID present (bit 7), matching the regression
signature the test in the follow-up commit catches.
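
The CapBnd values quoted above follow from simple bit clearing. A minimal sketch, assuming the standard capability numbers from `linux/capability.h` (chown=0, fowner=3, kill=5, setgid=6, setuid=7, setpcap=8); `apply_drop` is a hypothetical helper written for this example, not part of the entrypoint:

```bash
#!/usr/bin/env bash
# Sketch: clear the listed capability numbers from a hex mask.
apply_drop() {
  local mask=$((16#${1#0x})) bit
  shift
  for bit in "$@"; do
    mask=$(( mask & ~(1 << bit) ))
  done
  printf '0x%x\n' "$mask"
}

stage1=1e9 # CapBnd after the capsh stage, as reported in this PR

# Stage 2 drop list: setuid=7 setgid=6 fowner=3 chown=0 kill=5
apply_drop "$stage1" 7 6 3 0 5

# Regression case: -setuid omitted from the setpriv drop list
apply_drop "$stage1" 6 3 0 5
```

The first call yields `0x100` (only bit 8, CAP_SETPCAP) and the second yields `0x180` (bit 7, CAP_SETUID, still set), matching the forward and negative results reported for the stacked change.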

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Dongni-Yang added a commit to Dongni-Yang/NemoClaw that referenced this pull request May 11, 2026
…VIDIA#3280)

Flip CAP_FOWNER / CAP_SETUID / CAP_SETGID in e2e-gateway-isolation.sh
test 14 from "allowed" (as documented in NVIDIA#3328) to "must-drop". The
preceding commit replaces gosu with setpriv so the three load-bearing
caps now drop atomically with reuid; the sandbox-user process should
have ALL eight caps named in issue NVIDIA#3280 absent from CapBnd.

Rewrite test 14 to exercise the full two-stage drop end-to-end:
source sandbox-init.sh, run drop_capabilities() (stage 1: capsh strips
the entrypoint-wide --drop list), then exec STEP_DOWN_PREFIX_SANDBOX
(stage 2: setpriv strips the load-bearing caps during reuid), then
capture CapBnd of the resulting sandbox-user process. The test
container is started with --cap-add CAP_SYS_ADMIN --cap-add
CAP_SYS_PTRACE so the bounding set entering the entrypoint resembles
the permissive OpenShell runtime that triggered T6002104 — otherwise
docker's default bounding set already excludes those caps and the
test would be a no-op for the bug condition.

Use grep ^CapBnd: + awk for extraction rather than a triple-quoted
awk script: the awk script's $2 would otherwise be expanded by bash
on the way through capsh re-exec, producing /^CapBnd:/{print } which
prints the whole line and breaks downstream parsing.
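
The quoting pitfall described above can be reproduced without capsh. The sketch below uses `bash -c` as a stand-in for the re-exec and a fabricated status line; the function names are illustrative, and the "fixed" variant is one possible way to keep `$2` intact, not necessarily the exact form used in the test.

```bash
#!/usr/bin/env bash
# Sketch of the quoting pitfall; bash -c stands in for the capsh re-exec.
sample_status() { printf 'CapBnd:\t%s\n' 0000000000000100; }

# Pitfall: the awk program sits inside double quotes, so the calling shell
# expands $2 (empty here) before the inner shell ever runs awk. awk then
# sees /^CapBnd:/{print } and prints the whole line.
extract_broken() {
  sample_status | bash -c "awk '/^CapBnd:/{print $2}'"
}

# One way around it: single-quote the inner command and escape the dollar
# sign so that $2 reaches awk intact.
extract_fixed() {
  sample_status | bash -c 'grep "^CapBnd:" | awk "{print \$2}"'
}

extract_broken # whole status line, label included
extract_fixed  # hex mask only
```

Splitting the match (`grep`) from the field selection (`awk`) keeps the awk program short enough that its single remaining `$2` is easy to protect.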

Add two unit tests in test/sandbox-init.test.ts for the new
init_step_down_prefixes() helper:
  - falls back to gosu when setpriv/capsh are unavailable
  - uses setpriv with the issue-3280 bounding-set drop when available

Update the existing snapshot-style test for Hermes start.sh's
start_discord_facade body to assert on the new STEP_DOWN_PREFIX_GATEWAY
invocation instead of the legacy gosu gateway sh -c.

Update nemoclaw-start.test.ts test scaffolding to initialise
STEP_DOWN_PREFIX_SANDBOX and STEP_DOWN_PREFIX_GATEWAY in the fallback
form (gosu sandbox / gosu gateway) inside both runLaunchBlock() and
runPreGatewaySetup(). The extracted launch and setup blocks reference
these arrays, and the test scaffolding doesn't source sandbox-init.sh,
so without an explicit initialisation `set -u` fails on the unbound
array and the stubbed gosu() never receives the call.

Validated locally with docker build + docker run --cap-add against a
test image overlaid with the new sandbox-init.sh:
  - Forward: CapBnd=0x100 (only CAP_SETPCAP), test PASS.
  - Regression (omit -setuid from setpriv drop): CapBnd=0x180, test
    correctly fails with "CAP_SETUID still present" by name.
Full npm test on this branch: same 67 failures as upstream/main
baseline (all pre-existing on main), +2 new passing tests for
init_step_down_prefixes — net zero regressions.

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Dongni-Yang added a commit to Dongni-Yang/NemoClaw that referenced this pull request May 11, 2026
…NVIDIA#3280)

Follow-up to NVIDIA#3328, which dropped 5/8 of the caps named in issue NVIDIA#3280
but left CAP_FOWNER, CAP_SETUID, and CAP_SETGID present in the sandbox-
user process's bounding set. Those three were blocked by gosu: gosu
needs CAP_SETUID in permitted to make its setuid() syscall, but the
bounding set can only be modified with CAP_SETPCAP (root-only). So
dropping CAP_SETUID before gosu would break the privilege transition,
and dropping it after would be too late because we are no longer root.

setpriv from util-linux solves this by performing reuid + bounding-set
drop atomically inside a single process: setuid first (still holds
CAP_SETUID), then strip the bounding set (still root long enough to
hold CAP_SETPCAP), then exec the target.

Add init_step_down_prefixes() to scripts/lib/sandbox-init.sh which
populates two bash arrays at source time:

  STEP_DOWN_PREFIX_SANDBOX  — step down to sandbox user
  STEP_DOWN_PREFIX_GATEWAY  — step down to gateway user

Each array expands to a setpriv invocation that drops cap_setuid /
cap_setgid / cap_fowner / cap_chown / cap_kill from the bounding set
during the reuid. If setpriv or CAP_SETPCAP is unavailable, the arrays
stay at the gosu fallback and a warning is logged so the residual cap
retention surfaces in the entrypoint log (matches the design of
report_residual_capabilities from NVIDIA#3328).

Notes:
* Arrays default to (gosu sandbox)/(gosu gateway) at file scope (NOT
  empty). This prevents a privesc regression if init_step_down_prefixes
  is ever skipped: an unset/empty array would expand to nothing and
  `exec "${ARR[@]}" "${NEMOCLAW_CMD[@]}"` would run the agent as root.
  init_step_down_prefixes() only upgrades to setpriv when available.
* setpriv uses unprefixed cap names (per `setpriv --list`), unlike
  capsh which uses cap_*. The arrays use the setpriv format.
* --init-groups (NOT --clear-groups): the gateway user is a member of
  the sandbox group via `usermod -aG sandbox gateway` in
  Dockerfile.base, which is required to write the chmod 660
  /sandbox/.openclaw/openclaw.json (setgid'd config dir, see NVIDIA#2681).
  --clear-groups would strip that membership and break mutateConfigFile
  with EACCES. --init-groups matches gosu's setgroups+initgroups
  behaviour and restores exactly the groups defined in /etc/group for
  the target user.
* Plain array assignment (not `declare -ga`) at file scope: bash 3.2
  on macOS rejects `declare -g`, and bash 3.2+ treats file-scope
  assignment as global by default. Inside init_step_down_prefixes()
  the reassignment is unscoped, so it targets the same globals in
  both bash 3.2 and 4+.
* Per-assignment shellcheck SC2034 disables: the prefix arrays are
  consumed cross-file (by scripts/nemoclaw-start.sh and
  agents/hermes/start.sh), which shellcheck cannot follow.
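
The empty-array hazard from the first note is easy to see in isolation. In this sketch `show_argv` stands in for `exec` and only prints the argv it would run; `EMPTY_PREFIX`, `SAFE_PREFIX`, and the `agent --serve` command are placeholders invented for the example.

```bash
#!/usr/bin/env bash
# Sketch: show_argv stands in for exec and prints the argv it would run.
show_argv() { printf '%s\n' "$*"; }

NEMOCLAW_CMD=(agent --serve) # placeholder command for the example

EMPTY_PREFIX=()            # what an uninitialised prefix would look like
SAFE_PREFIX=(gosu sandbox) # file-scope default described above

# An empty prefix contributes no words, so the command would run directly
# as whatever user the entrypoint is (root).
show_argv "${EMPTY_PREFIX[@]}" "${NEMOCLAW_CMD[@]}"

# With the gosu default, a step-down tool is always in front of the command.
show_argv "${SAFE_PREFIX[@]}" "${NEMOCLAW_CMD[@]}"
```

The first call prints only the command, the second prints it behind `gosu sandbox`; defaulting to the second form means a skipped initialiser degrades to the old behaviour instead of to root.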

Replace the seven gosu call sites across both entrypoints:

  scripts/nemoclaw-start.sh:
    line 795  — auto-pair (sandbox)
    line 1610 — write_auth_profile + harden_auth_profiles (sandbox)
    line 1614 — final exec to NEMOCLAW_CMD (sandbox)
    line 1720 — OpenClaw gateway (gateway)
  agents/hermes/start.sh:
    line 294  — Discord facade (gateway)
    line 586  — final exec to NEMOCLAW_CMD (sandbox)
    line 607  — Hermes gateway (gateway)

The non-root fallback path in nemoclaw-start.sh (lines 1488+) and the
no-new-privileges history comments at lines 138-139 / 1490-1493 are
unchanged — that path does not use a privilege-step-down tool at all.

Validated live: with --cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE
(simulating permissive OpenShell runtime), source sandbox-init.sh and
chain drop_capabilities + STEP_DOWN_PREFIX_SANDBOX → final sandbox-
user CapBnd=0x100 (only CAP_SETPCAP remains; all 8 issue-NVIDIA#3280 caps
absent). Negative path: removing -setuid from the setpriv drop list
correctly leaves CAP_SETUID present (bit 7), matching the regression
signature the test in the follow-up commit catches.

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Contributor

@ericksoa ericksoa left a comment

Reviewed the sandbox capability bounding-set hardening and expanded gateway-isolation coverage. This base layer is scoped to dropping the newly covered issue #3280 caps via capsh plus surfacing residual-cap diagnostics, with the full stacked nightly validated on #3329 head a5ea171.

@ericksoa ericksoa merged commit d607569 into NVIDIA:main May 11, 2026
25 checks passed
ericksoa added a commit that referenced this pull request May 11, 2026
## Summary

Follow-up to #3328, which dropped 5/8 of the caps named in issue #3280
and left CAP_FOWNER, CAP_SETUID, CAP_SETGID present because the
entrypoint's `gosu`-based privilege separation prevented dropping them
from the bounding set (gosu needs CAP_SETUID in *permitted* to do its
`setuid()` syscall, but the bounding set can only be modified with
CAP_SETPCAP, which only root holds — there's no point in the entrypoint
where the drop can happen without breaking either gosu or the privilege
transition).

This PR resolves the chicken-and-egg by replacing `gosu` with `setpriv`
(util-linux, already present in the image). `setpriv` does reuid +
bounding-set drop atomically inside a single process: setuid first
(still holds CAP_SETUID), then strip the bounding set (still root long
enough to hold CAP_SETPCAP), then exec — no `exec` between the setuid
and the bounding-set drop.

After this PR, **all 8 caps named in issue #3280 are absent from the
sandbox-user process's CapBnd** (verified live; see Test plan).

## Stacking note

Stacked on [PR #3328](#3328). The
full PR diff shows 4 commits:

| Commit | From PR |
|---|---|
| `fix(sandbox): tighten bounding-set caps and surface residuals` | #3328 |
| `test(sandbox): inventory dangerous-cap set in bounding-set assertion` | #3328 |
| **`fix(sandbox): replace gosu with setpriv to drop all bounding-set caps`** | **this PR (`05954bf49`)** |
| **`test(sandbox): require all 8 issue-3280 caps absent after step-down`** | **this PR (`a5ea1712c`)** |

Reviewers should focus on the **bottom two commits**. Merge **after**
#3328 lands; I'll rebase if #3328 changes during review.

## Changes

### `scripts/lib/sandbox-init.sh`
Add `init_step_down_prefixes()` and two file-scope arrays:

- `STEP_DOWN_PREFIX_SANDBOX` — defaults to `(gosu sandbox)`; upgraded by
`init_step_down_prefixes()` to `(setpriv --reuid=sandbox --regid=sandbox
--init-groups --bounding-set=-setuid,-setgid,-fowner,-chown,-kill --)`
when `setpriv` + `CAP_SETPCAP` are available.
- `STEP_DOWN_PREFIX_GATEWAY` — same shape, gateway user.

If `setpriv` is missing or `CAP_SETPCAP` is unavailable, the arrays stay
at the gosu fallback (matching the previous behavior) and a `[SECURITY
WARNING]` is logged so the residual cap retention surfaces in the
entrypoint log (matches `report_residual_capabilities()` from #3328).

**Implementation notes:**
- **File-scope default is `(gosu …)`, not `()`** — hardens against a
theoretical privesc regression: if `init_step_down_prefixes()` were ever
skipped by a future refactor, an empty array would expand to nothing,
and `exec "${STEP_DOWN_PREFIX_SANDBOX[@]}" "${NEMOCLAW_CMD[@]}"` would
run the agent **as root**. The gosu default makes the failure mode safe.
- **`--init-groups` (not `--clear-groups`)** — gateway is a member of
the sandbox group via `usermod -aG sandbox gateway` in
`Dockerfile.base:99`, required to write the chmod 660
`/sandbox/.openclaw/openclaw.json` (setgid'd config dir per #2681).
`--clear-groups` would strip that membership and break
`mutateConfigFile` with EACCES. `--init-groups` matches gosu's setgroups
+ initgroups behaviour. *(Addresses CodeRabbit comment.)*
- **Plain array assignment (no `declare -ga`)** — bash 3.2 on macOS
rejects `declare -g`, which would break macOS CI when any test sources
`sandbox-init.sh`. File-scope `ARR=()` is global by default in bash
3.2+; the function-internal reassignment without `local` targets the
same global. *(Addresses CodeRabbit comment.)*
- **`setpriv` uses unprefixed cap names** (per `setpriv --list`), unlike
`capsh` which uses `cap_*`. The arrays follow the setpriv convention.
- **Per-assignment `# shellcheck disable=SC2034`** — the prefix arrays
are consumed cross-file (by `scripts/nemoclaw-start.sh` and
`agents/hermes/start.sh`), which shellcheck cannot follow from
`sandbox-init.sh` alone.
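
The selection logic described in these notes can be sketched as follows. This is not the body of `init_step_down_prefixes()` from the PR: the `_sketch` variant takes the effective-capability mask as an argument (an assumption made so it can be exercised without root), and the warning text is illustrative. The setpriv arguments are the ones listed above.

```bash
#!/usr/bin/env bash
# Sketch only; not the body of init_step_down_prefixes() from the PR.

# Safe file-scope defaults: plain assignment, no `declare -g`.
STEP_DOWN_PREFIX_SANDBOX=(gosu sandbox)
STEP_DOWN_PREFIX_GATEWAY=(gosu gateway)

# Hypothetical variant: $1 is the effective-cap hex mask of the caller.
init_step_down_prefixes_sketch() {
  local cap_eff=$((16#${1#0x}))
  local drop="-setuid,-setgid,-fowner,-chown,-kill"
  # Bit 8 is CAP_SETPCAP, which is needed to modify the bounding set.
  if command -v setpriv >/dev/null 2>&1 && (( (cap_eff >> 8) & 1 )); then
    STEP_DOWN_PREFIX_SANDBOX=(setpriv --reuid=sandbox --regid=sandbox
      --init-groups "--bounding-set=$drop" --)
    STEP_DOWN_PREFIX_GATEWAY=(setpriv --reuid=gateway --regid=gateway
      --init-groups "--bounding-set=$drop" --)
  else
    echo "[SECURITY WARNING] setpriv or CAP_SETPCAP unavailable; keeping gosu step-down" >&2
  fi
}

# With no CAP_SETPCAP in the mask, the arrays stay at the gosu default.
init_step_down_prefixes_sketch 0
printf '%s\n' "${STEP_DOWN_PREFIX_SANDBOX[*]}"
```

Because the function only ever upgrades the arrays, call sites can expand `"${STEP_DOWN_PREFIX_SANDBOX[@]}"` unconditionally and still get a working step-down in either branch.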

### `scripts/nemoclaw-start.sh` (4 sites) and `agents/hermes/start.sh` (3 sites)
Replace all `gosu <user>` invocations with
`"${STEP_DOWN_PREFIX_<USER>[@]}"`:

| File | Line | Role |
|---|---|---|
| nemoclaw-start.sh | 795 | auto-pair (sandbox) |
| nemoclaw-start.sh | 1610 | write_auth_profile + harden_auth_profiles (sandbox) |
| nemoclaw-start.sh | 1614 | final exec to NEMOCLAW_CMD (sandbox) |
| nemoclaw-start.sh | 1720 | OpenClaw gateway (gateway) |
| hermes/start.sh | 294 | Discord facade (gateway) |
| hermes/start.sh | 586 | final exec to NEMOCLAW_CMD (sandbox) |
| hermes/start.sh | 607 | Hermes gateway (gateway) |

Non-root fallback path in `nemoclaw-start.sh` (lines 1488+) and the
no-new-privileges history comments at 138-139 / 1490-1493 are unchanged
— that path doesn't use a privilege-step-down tool at all.

### `test/e2e-gateway-isolation.sh`
Flip CAP_FOWNER / CAP_SETUID / CAP_SETGID in test 14 from `allowed` to
`must-drop`. Rewrite the test to exercise the full two-stage drop
end-to-end: source `sandbox-init.sh`, run `drop_capabilities()` (stage
1: capsh), then exec `STEP_DOWN_PREFIX_SANDBOX` (stage 2: setpriv), then
capture CapBnd.

### `test/sandbox-init.test.ts`
Two new unit tests for `init_step_down_prefixes()`:
- Falls back to gosu when setpriv/capsh are unavailable
- Uses setpriv with the issue-3280 bounding-set drop when available

Update the existing `start_discord_facade` snapshot test to expect the
new `STEP_DOWN_PREFIX_GATEWAY` invocation instead of the legacy `gosu
gateway sh -c`.

### `test/nemoclaw-start.test.ts`
Initialise `STEP_DOWN_PREFIX_SANDBOX=(gosu sandbox)` and
`STEP_DOWN_PREFIX_GATEWAY=(gosu gateway)` in the test scaffolding for
both `runLaunchBlock()` and `runPreGatewaySetup()`. The extracted launch
/ setup blocks reference these arrays, and the test scaffolding doesn't
source `sandbox-init.sh`, so without an explicit initialisation `set -u`
fails on the unbound array and the stubbed `gosu()` never receives the
call (this caused the `user=gateway` CI failure on the prior push).
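
The failure mode can be reproduced in isolation. In this sketch `launch_block` is a hypothetical stand-in for an extracted block, and the `gosu` function mimics a stub that records its user. The exact behaviour of an unbound array under `set -u` depends on the bash version (older releases abort with "unbound variable", while bash 4.4 and later expand it to nothing), but in both cases the stub is never reached.

```shell
#!/usr/bin/env bash
# Stub in the spirit of the test suite's stubbed gosu(): records the user.
gosu() { echo "gosu called: user=$1"; }

# Stand-in for an extracted launch block; runs in a subshell under `set -u`.
launch_block() (
  set -u
  "${STEP_DOWN_PREFIX_GATEWAY[@]}" true
)

# Without initialisation: either the subshell aborts, or the prefix
# expands to nothing and `true` runs directly. The stub is not called.
unset STEP_DOWN_PREFIX_GATEWAY
without_init=$(launch_block 2>&1) || true

# With the fallback form in place the stub receives the call.
STEP_DOWN_PREFIX_GATEWAY=(gosu gateway)
with_init=$(launch_block 2>&1)
echo "$with_init"   # prints: gosu called: user=gateway
```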

## Test plan

### Forward case (full production image, post-build)
Built `nemoclaw-3329-test` directly from this branch's `Dockerfile` (63
steps, no overlay). Ran the full two-stage drop end-to-end with
`--cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE` (worst-case
permissive runtime):

```
Stage 1 (root, post-capsh):   CapBnd=00000000000001e9
Stage 2 (sandbox, post-setpriv): uid=998(sandbox) gid=998(sandbox) groups=sandbox
                                 CapBnd=0000000000000100  → cap_setpcap only
Issue #3280 caps absent: cap_sys_admin / cap_sys_ptrace / cap_net_raw /
                         cap_net_bind_service / cap_dac_override /
                         cap_fowner / cap_setuid / cap_setgid  ✅ (8/8)
```

### Gateway path (full production image, post-build)
Same image, but invoking `STEP_DOWN_PREFIX_GATEWAY` instead:

```
uid=999(gateway) gid=999(gateway) groups=gateway sandbox   ← --init-groups OK
CapBnd=0000000000000100  → cap_setpcap only
/sandbox/.openclaw/openclaw.json (mode 660, sandbox:sandbox) writable by gateway ✅
```

This is the exact case CodeRabbit flagged: gateway must retain `sandbox`
group membership to write the chmod 660 setgid'd config (per #2681).
Confirmed.

### Negative case (live container)
Rebuilt with `-setuid` removed from the setpriv `--bounding-set` arg.
`CapBnd=0x180` (bit 7 set = CAP_SETUID). Test correctly fails with
"CAP_SETUID still present in sandbox-user CapBnd (issue #3280)" —
matches the regression signature this PR is designed to catch.

### Full regression baseline
`npm test` on this branch vs `upstream/main`:

| | Test files failed | Tests failed | Tests passed |
|---|---|---|---|
| `upstream/main` (baseline) | 22 | 67 | 3418 |
| this branch | 22 | 67 | 3420 |
| Δ | 0 | 0 | +2 |

Net: 2 new passing tests (the new `init_step_down_prefixes` cases), zero
new failures. All 67 baseline failures pre-date this PR (stale `dist/`,
unrelated TypeScript files).

### Targeted
- `npx vitest run test/{sandbox-init,nemoclaw-start,seccomp-guard,service-env}.test.ts` → **132/132 pass**.
- `bash -n` clean on all 4 touched shell files.
- `shfmt -d -i 2 -ci -bn` clean.

## Security review

| CWE | Status | Notes |
|---|---|---|
| CWE-269 Improper Privilege Management | ✅ no issue | Saved-UID=0 inert: CAP_SETUID gone from bounding set, can't be regained. |
| CWE-273 Improper Check for Dropped Privileges | ⚠️ no regression | Trusts setpriv. `exec` semantics → fail-closed on setpriv failure. E2E test 14 verifies in CI. |
| CWE-274 Improper Handling of Insufficient Privileges | ⚠️ documented trade-off | SETPCAP-missing fallback is fail-open-for-availability + fail-loud-for-posture (`[SECURITY WARNING]` to log). |
| CWE-367 TOCTOU | ✅ no issue | Check and use happen in same root process; CAP_SETPCAP preserved between them. |
| CWE-426 Untrusted Search Path | ✅ no issue | PATH locked at entrypoint top; init runs as root pre-stepdown. |
| CWE-732 Incorrect Permission Assignment | ✅ no issue | `--init-groups` preserves gateway's sandbox-group membership (chmod 660 config write still works). |
| CWE-77/78 Command Injection | ✅ no issue | All setpriv argv literals; array expansion does not word-split. |
| CWE-200/209/532 Information Exposure | ✅ no issue | Warnings contain only public cap names; log is root:600 (sandbox user can't read). |
| CWE-693 Protection Mechanism Failure | ✅ no issue | setpriv 2.38.1, no known CVEs affecting bounding-set ops. |

**Net assessment:** no new CWEs introduced. Sandbox-user CapBnd: 6
entries → 1 entry. Attack surface for setuid-root-binary cap regain:
reduced to empty.

## Risks and notes for review

- **setpriv vs gosu setuid semantics.** Both switch UID through the `setuid` family of syscalls.
`setpriv --reuid` sets ruid+euid but not saved UID (gosu uses
`setresuid` which sets all three). Saved-UID=0 is inert here because
using it requires CAP_SETUID in *permitted*, which is empty after the
bounding-set drop on `exec`.
- **No-new-privs interaction.** `setpriv` performs the setuid syscall as
root, which is unrestricted regardless of `no_new_privs`. Different
failure mode from gosu (documented at `nemoclaw-start.sh:138-139` and
`:1490-1493`). Worth verifying on Spark/arm64 in CI.
- **Defense-in-depth, not user-facing behaviour change.** The agent
shell continues to run as the sandbox user with the same supplementary
groups; the only observable difference is `cat /proc/self/status`
showing an empty CapBnd (apart from CAP_SETPCAP itself, which is
harmless in an unprivileged process).
- **Fallback warning is a log line, not an exit.** If a runtime lacks
setpriv or CAP_SETPCAP, the sandbox still boots (under the legacy gosu
path) but emits `[SECURITY WARNING]` so the residual surfaces in `docker
logs`.

## Review feedback addressed

1. **CodeRabbit: Bash 3.2 incompat (`declare -ga`)** → replaced with
plain array assignment.
2. **CodeRabbit: `--clear-groups` removes gateway from sandbox group** →
switched to `--init-groups`; verified live.
3. **Self-review: unset-array privesc regression risk** → file-scope
default initialised to `(gosu …)` instead of `()`;
`init_step_down_prefixes()` only upgrades.
4. **CI: `shellcheck SC2034`** → per-assignment `# shellcheck
disable=SC2034` with cross-file-consumption note.
5. **CI: `test/nemoclaw-start.test.ts:1201` `user=gateway`** →
scaffolding initialises `STEP_DOWN_PREFIX_*` in fallback form so the
stubbed gosu still receives the call.

Closes #3280.

Signed-off-by: Dongni Yang <dongniy@nvidia.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Security Improvements**
  * Enhanced sandbox isolation through improved removal of dangerous capabilities from restricted environments
  * Updated privilege separation mechanism with better fallback handling and more flexible configuration
  * Improved capability-dropping logic for comprehensive restriction of high-risk permissions

* **Tests**
  * Updated integration tests to verify capability restrictions work as expected

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3329)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
@miyoungc miyoungc mentioned this pull request May 12, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 12, 2026
## Summary
Refreshes the release-prep docs for v0.0.39 based on changes merged
since the Friday 4pm doc refresh. Updates the source docs, bumps the
docs version metadata, and regenerates the NemoClaw user skills from the
refreshed docs.

## Changes
- #3314 -> `docs/get-started/prerequisites.md`,
`docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`:
Documents installer Docker setup, Docker group activation, and retry
guidance.
- #3317 -> `docs/get-started/quickstart.md`,
`docs/reference/commands.md`: Documents the DGX Spark and DGX Station
express install prompt and `NEMOCLAW_NO_EXPRESS`.
- #3328 and #3329 -> `docs/security/best-practices.md`,
`docs/deployment/sandbox-hardening.md`: Updates sandbox capability
hardening docs for the stricter bounding-set and `setpriv` step-down
behavior.
- #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`:
Documents Windows-host Ollama relaunch behavior, NIM key passthrough,
early health-fail diagnostics, and mixed-GPU preflight detail.
- #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 ->
`docs/about/release-notes.md`: Adds the v0.0.39 release-prep section
while keeping the v0.0.38 release notes intact.
- Advances the release-prep docs metadata from v0.0.38 to v0.0.39.
- Regenerates `.agents/skills/nemoclaw-user-*` from the updated source
docs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes v0.0.39

* **New Features**
  * Host alias management commands for easier configuration
  * Sandbox GPU control options during onboarding
  * Update command with check and confirmation modes

* **Documentation**
  * Enhanced Linux installer guidance with Docker and group membership handling
  * Expanded troubleshooting for permission and connectivity issues
  * Improved capability-dropping security documentation
  * Updated inference model switching commands
  * Brev environment-specific troubleshooting

* **Improvements**
  * DGX Spark/Station express install flow
  * Windows Ollama relay and health-check enhancements
  * NVIDIA NIM preflight GPU reporting

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3375)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Labels

v0.0.39 Release target

Development

Successfully merging this pull request may close these issues.

[Nemoclaw] [All Platforms] Sandbox allows dangerous capabilities in bounding set despite empty effective set

2 participants