fix(entrypoint): relax config permissions before write after CAP_DAC_OVERRIDE drop by jyaunches · Pull Request #2659 · NVIDIA/NemoClaw

jyaunches · 2026-04-29T03:43:33Z

Summary

Fix the test/e2e/test-runtime-overrides.sh E2E crash at baseline config capture. After PR #917 drops CAP_DAC_OVERRIDE via capsh, root can no longer write to 444-mode config files. The three runtime override functions (apply_model_override, apply_cors_override, apply_slack_token_override) now temporarily relax permissions to 644 before writing and re-lock to 444 afterward, using CAP_FOWNER which is retained by design.

Related Issue

Fixes #2653

Changes

scripts/lib/sandbox-init.sh: Add relax_config_for_write() / lock_config_after_write() shared helpers that chmod 644/chmod 444 with symlink guards
scripts/nemoclaw-start.sh: Wrap python3 write + hash recomputation in all three override functions with the new helpers
scripts/nemoclaw-start.sh: Tighten apply_model_override() trigger guard — only fire when NEMOCLAW_MODEL_OVERRIDE or NEMOCLAW_INFERENCE_API_OVERRIDE is explicitly set (Dockerfile ENV defaults no longer trigger spurious override runs)
test/e2e/test-runtime-overrides.sh: Create timestamped log file matching CI artifact glob test-runtime-overrides-*.log, replace all 2>/dev/null with 2>>"$LOG_FILE"
test/e2e/test-runtime-overrides.sh: Update tests 3-5, 9-11, 14 to pair standalone parameters with NEMOCLAW_MODEL_OVERRIDE (matching the tightened trigger guard)

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

AI Disclosure

AI-assisted — tool: Claude Code (pi agent)

Signed-off-by: Julie Yaunches jyaunches@nvidia.com

Summary by CodeRabbit

Bug Fixes
- Overrides now only apply when explicit override flags are set, preventing unintended changes.
- Safer config writes: permissions are temporarily relaxed with strict safety checks (rejecting unsafe targets, skipping missing files), clear security errors on failure, recompute verification only on successful writes, and permissions are always restored; write/hash failures are surfaced.
Tests
- Improved test logging with timestamped artifacts and preserved stderr; runtime override tests updated to cover model-override scenarios.

…OVERRIDE drop (#2653) After PR #917 drops CAP_DAC_OVERRIDE via capsh, root can no longer write to 444-mode config files. The E2E test uses plain docker run (no no-new-privileges), so capsh succeeds and the capability is actually dropped — unlike production where OpenShell's no-new-privileges prevents the drop. Three entrypoint functions (apply_model_override, apply_cors_override, apply_slack_token_override) open openclaw.json for in-place writing, raising PermissionError (EACCES) under set -euo pipefail. Phase 1 — Permission fix: - Add relax_config_for_write() / lock_config_after_write() helpers in scripts/lib/sandbox-init.sh that chmod 644/444 with symlink guards - Wrap all three override functions with these helpers Phase 2 — Spurious trigger guard: - Tighten apply_model_override() entry guard to only fire when NEMOCLAW_MODEL_OVERRIDE or NEMOCLAW_INFERENCE_API_OVERRIDE is set - NEMOCLAW_CONTEXT_WINDOW, NEMOCLAW_MAX_TOKENS, NEMOCLAW_REASONING are promoted from Dockerfile ARGs to ENV and always present — they should only take effect alongside an explicit model override - Update E2E tests 3-5, 9-11, 14 to pair standalone parameters with NEMOCLAW_MODEL_OVERRIDE Phase 3 — Test error visibility: - Create timestamped log file matching CI artifact glob - Replace all 2>/dev/null with 2>>$LOG_FILE in test helpers - Append run_override_stderr output to main log file Fixes #2653 Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

coderabbitai · 2026-04-29T03:43:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c82375ce-1ca9-45f1-84b4-90134d03e70e

📥 Commits

Reviewing files that changed from the base of the PR and between 5965c0d and b8f4032.

📒 Files selected for processing (2)

scripts/nemoclaw-start.sh
test/nemoclaw-start.test.ts

📝 Walkthrough

Walkthrough

Adds two sandbox permission helper functions, uses them to safely relax and re-lock config file modes around runtime override writers, narrows model-override gating, and routes Docker/validation stderr into timestamped CI test logs.

Changes

Cohort / File(s)	Summary
Config Permission Management `scripts/lib/sandbox-init.sh`	Adds `relax_config_for_write()` and `lock_config_after_write()` to validate targets (reject symlinks), ignore missing regular files, emit `[SECURITY]` warnings on violations, and set modes to `644` before writes and `444` after.
Runtime Override Logic `scripts/nemoclaw-start.sh`	Restricts `apply_model_override` to run only when `NEMOCLAW_MODEL_OVERRIDE` or `NEMOCLAW_INFERENCE_API_OVERRIDE` are set. Wraps override writers (`apply_model_override`, `apply_cors_override`, `apply_slack_token_override`) with the permission helpers, captures writer and hash exit codes, recomputes SHA256 only on successful writes, and always re-locks; functions return non-zero when write/hash fail.
E2E Test Logging & Env Requirements `test/e2e/test-runtime-overrides.sh`	Creates timestamped `test-runtime-overrides-*.log` and appends Docker stderr and validation stderr to it for CI visibility. Adjusts tests to include `NEMOCLAW_MODEL_OVERRIDE=test` in validation/rejection scenarios and updates related descriptions.
Unit Test Guard Adjustment `test/nemoclaw-start.test.ts`	Narrows `apply_model_override()` guard tests to assert only `NEMOCLAW_MODEL_OVERRIDE` or `NEMOCLAW_INFERENCE_API_OVERRIDE` trigger the patch; adds negative checks that `NEMOCLAW_CONTEXT_WINDOW`, `NEMOCLAW_MAX_TOKENS`, and `NEMOCLAW_REASONING` are not referenced in the early-return guard.

Sequence Diagram(s)

sequenceDiagram
    participant Starter as "nemoclaw-start.sh"
    participant SandboxLib as "sandbox-init.sh helpers"
    participant Writer as "inline Python writer"
    participant FS as "filesystem (/sandbox/.openclaw/openclaw.json)"
    participant Hasher as "SHA256 recompute"

    Starter->>SandboxLib: relax_config_for_write(target file)
    SandboxLib-->>Starter: success / failure (stderr [SECURITY] on violations)
    alt relax succeeded
        Starter->>Writer: invoke writer (apply_*_override)
        Writer-->>Starter: exit status
        alt writer succeeded
            Starter->>Hasher: recompute SHA256 of FS file
            Hasher-->>Starter: hash value / status
        else writer failed
            Starter-->>Starter: record non-zero write status
        end
    end
    Starter->>SandboxLib: lock_config_after_write(target file)
    SandboxLib-->>Starter: success / failure (stderr [SECURITY] on violations)
    Starter-->>Caller: return captured status (0 or non-zero)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I nudged the locks and peeked inside,

I set them loose where writers bide,
No symlinks slipped, no secrets flown,
I sealed them back when work was done.
Logs purr softly — tidy trail, well-done!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: adding permission-relaxation helpers to fix a crash caused by dropping CAP_DAC_OVERRIDE capability, which is the core objective of the PR.
Docstring Coverage	✅ Passed	Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch issue-2653-runtime-overrides-e2e-crash

_{Review rate limit: 9/10 reviews remaining, refill in 6 minutes.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

scripts/lib/sandbox-init.sh (1)

170-170: Consider propagating chmod failures.

The chmod commands don't check their return status. If chmod fails (e.g., filesystem error, unexpected ownership), the function silently continues. While rare, this could lead to confusing behavior where the caller believes permissions were changed but they weren't.

♻️ Optional: propagate chmod failures

 relax_config_for_write() {
   local f
   for f in "$@"; do
     if [ -L "$f" ]; then
       printf '[SECURITY] Refusing to relax permissions — %s is a symlink\n' "$f" >&2
       return 1
     fi
     [ -f "$f" ] || continue
-    chmod 644 "$f"
+    if ! chmod 644 "$f"; then
+      printf '[SECURITY] Failed to relax permissions on %s\n' "$f" >&2
+      return 1
+    fi
   done
 }

 lock_config_after_write() {
   local f
   for f in "$@"; do
     if [ -L "$f" ]; then
       printf '[SECURITY] Refusing to lock permissions — %s is a symlink\n' "$f" >&2
       return 1
     fi
     [ -f "$f" ] || continue
-    chmod 444 "$f"
+    if ! chmod 444 "$f"; then
+      printf '[SECURITY] Failed to lock permissions on %s\n' "$f" >&2
+      return 1
+    fi
   done
 }

Also applies to: 182-182

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/lib/sandbox-init.sh` at line 170, The chmod invocations (e.g., the
line containing chmod 644 "$f" and the other chmod calls around the same area)
currently ignore failures; update the surrounding code to check chmod's exit
status and propagate errors instead of continuing silently — after each chmod,
if it fails emit an explanatory error to stderr via the function's logger and
return a non‑zero status (or exit non‑zero if the script is intended to
terminate), so callers can detect and handle permission-change failures.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 366-368: The relax_config_for_write call currently isn't paired
with a guaranteed re-lock; wrap the section that runs the Python override and
the subsequent hash recomputation so lock_config_after_write is always invoked
even on error by installing a trap immediately after relax_config_for_write
(similar to the pattern used around apply_model_override): set a trap to call
lock_config_after_write with the same arguments on EXIT (or ERR), run the Python
script and recompute_hash, then clear the trap once successful; apply the same
trap/clear pattern to the later relax_config_for_write block that surrounds the
388-390 operations so both blocks use lock_config_after_write as an error-safe
finally.
- Around line 276-278: After calling relax_config_for_write, ensure
lock_config_after_write is always invoked on failure or exit by installing a
trap (or explicit error handling) before running the Python script and hash
recomputation so that lock_config_after_write "$config_file" "$hash_file" runs
on ERR/EXIT; remove or disable the trap only after successful completion of the
Python script and the hash recomputation (and before the later chattr
+i/Landlock steps) to avoid leaving files at 644 if the Python script or hash
step fails; apply the same pattern around the second relax/lock pair (the
321–323 code path) so both relax_config_for_write/lock_config_after_write pairs
are guarded.
- Around line 447-449: The relax_config_for_write call that temporarily loosens
permissions for the Slack token override should be protected with a trap-based
cleanup so permissions are always re-locked even if the subsequent Python script
or hash recomputation fails: before calling relax_config_for_write (used for the
slack token override) register a trap that will call the inverse/relock routine
(undoing relax_config_for_write’s changes) on EXIT/ERR, then run the Python
script and hash recompute, and finally remove or replace the trap on successful
completion; apply the exact same trap-wrapping pattern to the other occurrence
of relax_config_for_write mentioned in the review.

---

Nitpick comments:
In `@scripts/lib/sandbox-init.sh`:
- Line 170: The chmod invocations (e.g., the line containing chmod 644 "$f" and
the other chmod calls around the same area) currently ignore failures; update
the surrounding code to check chmod's exit status and propagate errors instead
of continuing silently — after each chmod, if it fails emit an explanatory error
to stderr via the function's logger and return a non‑zero status (or exit
non‑zero if the script is intended to terminate), so callers can detect and
handle permission-change failures.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0684cdf1-e5dc-4f13-93b8-28f83dc966fc

📥 Commits

Reviewing files that changed from the base of the PR and between 9113dca and 956f773.

📒 Files selected for processing (3)

scripts/lib/sandbox-init.sh
scripts/nemoclaw-start.sh
test/e2e/test-runtime-overrides.sh

coderabbitai · 2026-04-29T03:46:42Z

+  # Relax 444 → 644 so writes succeed after CAP_DAC_OVERRIDE is dropped (#2653)
+  relax_config_for_write "$config_file" "$hash_file"
+


⚠️ Potential issue | 🟡 Minor

Same concern: Slack token override lacks error-safe re-locking.

The pattern repeats here. For defense-in-depth, consider the same trap-based approach to ensure permissions are re-locked even if the Python script or hash recomputation fails.

Also applies to: 484-486

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 447 - 449, The relax_config_for_write call that temporarily loosens permissions for the Slack token override should be protected with a trap-based cleanup so permissions are always re-locked even if the subsequent Python script or hash recomputation fails: before calling relax_config_for_write (used for the slack token override) register a trap that will call the inverse/relock routine (undoing relax_config_for_write’s changes) on EXIT/ERR, then run the Python script and hash recompute, and finally remove or replace the trap on successful completion; apply the exact same trap-wrapping pattern to the other occurrence of relax_config_for_write mentioned in the review.

…ailure Address CodeRabbit review feedback: 1. relax_config_for_write() / lock_config_after_write() now check chmod exit status and return 1 with a diagnostic message on failure. 2. All three override functions (apply_model_override, apply_cors_override, apply_slack_token_override) now capture write/hash failures via || _write_rc=$? and always call lock_config_after_write before returning — files are never left at 644 on error. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 320-324: The success message is printed even when sha256sum fails
because failure is swallowed by "|| _write_rc=$?" but printf is unconditional;
change the pattern so the printf only runs on successful sha256sum. Replace the
current construct (cd ... && sha256sum ... >"$hash_file") || _write_rc=$?;
printf ... with an if/then: if (cd /sandbox/.openclaw && sha256sum openclaw.json
>"$hash_file"); then printf '[SECURITY] Config hash recomputed after model
override\n' >&2; else _write_rc=$?; fi, and apply the same fix to the analogous
sites in apply_cors_override and apply_slack_token_override so the success log
is emitted only on true success.
- Around line 196-203: Update the test titled "triggers on any override env
var..." to match the new guard semantics: assert that setting either
NEMOCLAW_MODEL_OVERRIDE or NEMOCLAW_INFERENCE_API_OVERRIDE causes the config
patch to run, and remove the expectation that NEMOCLAW_CONTEXT_WINDOW,
NEMOCLAW_MAX_TOKENS, or NEMOCLAW_REASONING alone should trigger it; add/adjust
cases to explicitly verify (a) model override triggers, (b) inference API
override triggers, and (c) the other three vars by themselves do not trigger the
guard.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c60aebfb-2bd2-45af-bcd7-6ddc091dbba9

📥 Commits

Reviewing files that changed from the base of the PR and between 956f773 and 5965c0d.

📒 Files selected for processing (2)

scripts/lib/sandbox-init.sh
scripts/nemoclaw-start.sh

🚧 Files skipped from review as they are similar to previous changes (1)

scripts/lib/sandbox-init.sh

… logs Address second round of CodeRabbit review feedback: 1. Update unit test 'triggers on any override env var' to assert the new 2-variable guard (#2653 Phase 2): only NEMOCLAW_MODEL_OVERRIDE and NEMOCLAW_INFERENCE_API_OVERRIDE trigger the function; standalone NEMOCLAW_CONTEXT_WINDOW/MAX_TOKENS/REASONING do not. 2. Move 'hash recomputed' success log inside the sha256sum success path so it is not emitted on hash failure — applies to all three override functions. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

…it (#2700) ## Summary Fixes #2698 — `runtime-overrides-e2e` test 14 crashes because invalid override validation uses `return 1`, which under `set -euo pipefail` kills the container before CMD runs. ## Root Cause PR #2659 tightened the trigger guard for `apply_model_override()` but kept `return 1` for validation failures. When `NEMOCLAW_CONTEXT_WINDOW=notanumber` is rejected, `return 1` propagates through `set -e` and exits the entrypoint. The container dies before the test can read the config. ## Fix Change non-security-critical validation failures from `return 1` to `return 0`: - Invalid API type → skip, don't crash - Non-integer context window → skip, don't crash - Non-integer max tokens → skip, don't crash - Invalid reasoning boolean → skip, don't crash - Non-http CORS origin → skip, don't crash Security-critical rejections remain `return 1`: - Symlink on config/hash path (indicates tampering) - Control characters in model override (indicates injection) - Oversized model override (indicates injection) The security warning is still logged to stderr, but the container starts normally with config unchanged. ## Verification - [x] Unit tests pass (`test/nemoclaw-start.test.ts` — assertions use `toContain` on message strings which are preserved) - [ ] `runtime-overrides-e2e` passes (to be validated on sparky) ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Bug Fixes** * Improved startup resilience by gracefully handling invalid configuration parameters. The system now logs warnings and continues instead of aborting when misconfigured environment variables are detected.  Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

…OVERRIDE drop (NVIDIA#2659) ## Summary Fix the `test/e2e/test-runtime-overrides.sh` E2E crash at baseline config capture. After PR NVIDIA#917 drops `CAP_DAC_OVERRIDE` via `capsh`, root can no longer write to 444-mode config files. The three runtime override functions (`apply_model_override`, `apply_cors_override`, `apply_slack_token_override`) now temporarily relax permissions to 644 before writing and re-lock to 444 afterward, using `CAP_FOWNER` which is retained by design. ## Related Issue Fixes NVIDIA#2653 ## Changes - **`scripts/lib/sandbox-init.sh`**: Add `relax_config_for_write()` / `lock_config_after_write()` shared helpers that `chmod 644`/`chmod 444` with symlink guards - **`scripts/nemoclaw-start.sh`**: Wrap python3 write + hash recomputation in all three override functions with the new helpers - **`scripts/nemoclaw-start.sh`**: Tighten `apply_model_override()` trigger guard — only fire when `NEMOCLAW_MODEL_OVERRIDE` or `NEMOCLAW_INFERENCE_API_OVERRIDE` is explicitly set (Dockerfile ENV defaults no longer trigger spurious override runs) - **`test/e2e/test-runtime-overrides.sh`**: Create timestamped log file matching CI artifact glob `test-runtime-overrides-*.log`, replace all `2>/dev/null` with `2>>"$LOG_FILE"` - **`test/e2e/test-runtime-overrides.sh`**: Update tests 3-5, 9-11, 14 to pair standalone parameters with `NEMOCLAW_MODEL_OVERRIDE` (matching the tightened trigger guard) ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Bug Fixes** * Overrides now only apply when explicit override flags are set, preventing unintended changes. * Safer config writes: permissions are temporarily relaxed with strict safety checks (rejecting unsafe targets, skipping missing files), clear security errors on failure, recompute verification only on successful writes, and permissions are always restored; write/hash failures are surfaced. * **Tests** * Improved test logging with timestamped artifacts and preserved stderr; runtime override tests updated to cover model-override scenarios.  --------- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

…it (NVIDIA#2700) ## Summary Fixes NVIDIA#2698 — `runtime-overrides-e2e` test 14 crashes because invalid override validation uses `return 1`, which under `set -euo pipefail` kills the container before CMD runs. ## Root Cause PR NVIDIA#2659 tightened the trigger guard for `apply_model_override()` but kept `return 1` for validation failures. When `NEMOCLAW_CONTEXT_WINDOW=notanumber` is rejected, `return 1` propagates through `set -e` and exits the entrypoint. The container dies before the test can read the config. ## Fix Change non-security-critical validation failures from `return 1` to `return 0`: - Invalid API type → skip, don't crash - Non-integer context window → skip, don't crash - Non-integer max tokens → skip, don't crash - Invalid reasoning boolean → skip, don't crash - Non-http CORS origin → skip, don't crash Security-critical rejections remain `return 1`: - Symlink on config/hash path (indicates tampering) - Control characters in model override (indicates injection) - Oversized model override (indicates injection) The security warning is still logged to stderr, but the container starts normally with config unchanged. ## Verification - [x] Unit tests pass (`test/nemoclaw-start.test.ts` — assertions use `toContain` on message strings which are preserved) - [ ] `runtime-overrides-e2e` passes (to be validated on sparky) ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Bug Fixes** * Improved startup resilience by gracefully handling invalid configuration parameters. The system now logs warnings and continues instead of aborting when misconfigured environment variables are detected.  Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

## Summary Partial fix for #3280 — drop the two caps that are cleanly droppable today, surface residuals when the drop step is silently skipped, and rewrite the cap test so it actually exercises the regression. - Append `cap_sys_admin` and `cap_sys_ptrace` to the `capsh --drop` list in `scripts/lib/sandbox-init.sh`. Verified live: in a permissive runtime (`--cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE`), pre-drop `CapBnd=0xa82c25fb`, post-drop `CapBnd=0x1e9` — only load-bearing caps remain. - Replace the misleading "runtime already restricts capabilities" message on the `CAP_SETPCAP`-missing fallback with `report_residual_capabilities()`, which reads `CapBnd:` from `/proc/self/status` and names which of the 5 must-drop caps remain. Uses bash 64-bit arithmetic (no gawk-strtonum dependency). - Rewrite `test/e2e-gateway-isolation.sh` test 14 to inventory all 8 caps named in the issue against `CapBnd`. 5 are classified `must-drop`; 3 (`CAP_FOWNER`, `CAP_SETUID`, `CAP_SETGID`) are classified `allowed` because dropping them requires an entrypoint refactor (see below). The test container now starts with `--cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE` so the bounding set entering capsh matches the permissive runtime that triggered T6002104. Without this, docker's default bounding set already excludes those caps and the test would have been a no-op for the very regression we care about. ## Scope: why this is 5/8, not 8/8 The issue names eight caps that must be absent from the bounding set: `CAP_SYS_ADMIN`, `CAP_NET_RAW`, `CAP_NET_BIND_SERVICE`, `CAP_SYS_PTRACE`, `CAP_DAC_OVERRIDE`, `CAP_FOWNER`, `CAP_SETUID`, `CAP_SETGID`. This PR fully drops 5 of them. The remaining 3 — `CAP_FOWNER`, `CAP_SETUID`, `CAP_SETGID` — can't be dropped without an entrypoint refactor: 1. The entrypoint runs as root and uses `gosu` to step down to the sandbox/gateway user. 2. `gosu` performs the `setuid()` syscall, which requires `CAP_SETUID` in the process's permitted set. 3. Dropping `CAP_SETUID` from the bounding set causes the next `exec` to strip it from permitted — but the bounding set can only be modified with `CAP_SETPCAP`, which only root has. 4. So the drop must happen *before* gosu (which breaks gosu) or *after* gosu (which is too late — we're no longer root and can't modify the bounding set). The proper fix is to replace `gosu` with `setpriv` (`setpriv --reuid=sandbox --regid=sandbox --bounding-set=-cap_setuid,-cap_setgid,-cap_fowner -- $cmd`), which does reuid + bounding-set drop atomically in one process. `setpriv` is already present in the image, but the refactor affects 4 gosu call sites in `scripts/nemoclaw-start.sh` plus the parallel entrypoint in `agents/hermes/start.sh` (same shared library), plus the no-new-privileges special case at `scripts/nemoclaw-start.sh:1490`. That's a different shape of change than "tighten bounding set" and needs its own design and review cycle. **Residual-risk note.** The practical exploit path for these 3 caps from the sandbox user's bounding set requires (a) execing a setuid-root binary that carries those file caps, but the image ships none and (b) the user creating a new one, which is blocked by Landlock + `no-new-privs` + lack of `CAP_DAC_OVERRIDE` on root-owned paths. So the residual is a defense-in-depth gap, not a directly exploitable one. **Follow-up.** I'll file an issue titled "refactor(entrypoint): replace gosu with setpriv to drop CAP_FOWNER/SETUID/SETGID from sandbox bounding set (#3280 follow-up)" and link it from here so T6002104 can track the second half. ## Test plan - [x] **Forward case (live container).** Built `nemoclaw-isolation-test` = `nemoclaw-production` + new `sandbox-init.sh`. Ran test 14 logic with `--cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE`: all 5 must-drop caps `ABSENT`, all 3 load-bearing caps `PRESENT`, test passes. - [x] **Negative case (live container).** Rebuilt the image with `cap_sys_admin` removed from the drop list. Test correctly fails with "CAP_SYS_ADMIN still present in CapBnd after capsh drop". `CapBnd=0x2001e9` (bit 21 set) — exactly the T6002104 signature. - [x] **Synthetic bit-decode harness.** Replayed the decode loop over fabricated `CapBnd` values (ideal post-drop, single-cap regressions, full-skip case) — classifier correctly identifies each failure mode by name. - [x] `npx vitest run test/sandbox-init.test.ts` — 36/36 pass (no-capsh fall-through and `NEMOCLAW_CAPS_DROPPED=1` short-circuit unchanged). - [x] `shfmt -d -i 2 -ci -bn` clean on both touched files. - [x] `bash -n` clean. - [ ] CI to run full `e2e-gateway-isolation.sh` against a freshly-built sandbox image. ## Notes for review - Drop list scope is intentionally narrowed per maintainer guidance; further hardening (CAP_SYS_MODULE, CAP_SYS_RAWIO, CAP_BPF, CAP_PERFMON, CAP_SYS_BOOT, CAP_SYSLOG, CAP_NET_ADMIN, …) is a candidate follow-up beyond this PR's issue. - Each kept cap (cap_chown, cap_fowner, cap_setuid, cap_setgid, cap_kill) now has an inline justification with a pointer to its load-bearing call site, so a future contributor can audit them without rediscovering #797 and #2659. - `report_residual_capabilities()` is new code, but it only runs on the existing `CAP_SETPCAP`-missing fallback path — same trigger as the previous one-line warning, just with structured diagnostics. The happy path is unchanged. Closes #3280 *(partial — see scope section above; full closure depends on the setpriv follow-up)*. Signed-off-by: Dongni Yang <dongniy@nvidia.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Security Improvements** * Enhanced sandbox security by dropping additional dangerous system capabilities. * Improved security diagnostics to detect and report residual dangerous capabilities in the runtime environment. * Strengthened validation testing to comprehensively verify that dangerous capabilities are properly restricted during sandbox initialization. [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3328)  --------- Signed-off-by: Dongni Yang <dongniy@nvidia.com>

jyaunches added the security Something isn't secure label Apr 29, 2026

jyaunches self-assigned this Apr 29, 2026

jyaunches added the security Something isn't secure label Apr 29, 2026

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread scripts/nemoclaw-start.sh

Comment thread scripts/nemoclaw-start.sh

jyaunches added the v0.0.30 Release target label Apr 29, 2026

cv approved these changes Apr 29, 2026

View reviewed changes

cv merged commit 3739258 into main Apr 29, 2026
18 checks passed

This was referenced Apr 29, 2026

bug(e2e): runtime-overrides-e2e test 14 fails — entrypoint hard-exits on invalid override #2698

Closed

fix(entrypoint): gracefully skip invalid overrides instead of hard-exit #2700

Merged

This was referenced Apr 29, 2026

openclaw.json created as root with read-only permissions, making all config writes impossible from sandbox #719

Open

feat: support OPENCLAW_CONFIG_PATH to persist openclaw.json on PVC #1174

Open

Dongni-Yang mentioned this pull request May 11, 2026

fix(sandbox): tighten bounding-set caps #3328

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(entrypoint): relax config permissions before write after CAP_DAC_OVERRIDE drop#2659

fix(entrypoint): relax config permissions before write after CAP_DAC_OVERRIDE drop#2659
cv merged 3 commits into
mainfrom
issue-2653-runtime-overrides-e2e-crash

jyaunches commented Apr 29, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 29, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 29, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		# Relax 444 → 644 so writes succeed after CAP_DAC_OVERRIDE is dropped (#2653)
		relax_config_for_write "$config_file" "$hash_file"

Conversation

jyaunches commented Apr 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

AI Disclosure

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jyaunches commented Apr 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 29, 2026 •

edited

Loading