Skip to content

fix(entrypoint): relax config permissions before write after CAP_DAC_OVERRIDE drop#2659

Merged
cv merged 3 commits into
mainfrom
issue-2653-runtime-overrides-e2e-crash
Apr 29, 2026
Merged

fix(entrypoint): relax config permissions before write after CAP_DAC_OVERRIDE drop#2659
cv merged 3 commits into
mainfrom
issue-2653-runtime-overrides-e2e-crash

Conversation

@jyaunches
Copy link
Copy Markdown
Contributor

@jyaunches jyaunches commented Apr 29, 2026

Summary

Fix the test/e2e/test-runtime-overrides.sh E2E crash at baseline config capture. After PR #917 drops CAP_DAC_OVERRIDE via capsh, root can no longer write to 444-mode config files. The three runtime override functions (apply_model_override, apply_cors_override, apply_slack_token_override) now temporarily relax permissions to 644 before writing and re-lock to 444 afterward, using CAP_FOWNER which is retained by design.

Related Issue

Fixes #2653

Changes

  • scripts/lib/sandbox-init.sh: Add relax_config_for_write() / lock_config_after_write() shared helpers that chmod 644/chmod 444 with symlink guards
  • scripts/nemoclaw-start.sh: Wrap python3 write + hash recomputation in all three override functions with the new helpers
  • scripts/nemoclaw-start.sh: Tighten apply_model_override() trigger guard — only fire when NEMOCLAW_MODEL_OVERRIDE or NEMOCLAW_INFERENCE_API_OVERRIDE is explicitly set (Dockerfile ENV defaults no longer trigger spurious override runs)
  • test/e2e/test-runtime-overrides.sh: Create timestamped log file matching CI artifact glob test-runtime-overrides-*.log, replace all 2>/dev/null with 2>>"$LOG_FILE"
  • test/e2e/test-runtime-overrides.sh: Update tests 3-5, 9-11, 14 to pair standalone parameters with NEMOCLAW_MODEL_OVERRIDE (matching the tightened trigger guard)

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

AI Disclosure

  • AI-assisted — tool: Claude Code (pi agent)

Signed-off-by: Julie Yaunches jyaunches@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Overrides now only apply when explicit override flags are set, preventing unintended changes.
    • Safer config writes: permissions are temporarily relaxed with strict safety checks (rejecting unsafe targets, skipping missing files), clear security errors on failure, recompute verification only on successful writes, and permissions are always restored; write/hash failures are surfaced.
  • Tests

    • Improved test logging with timestamped artifacts and preserved stderr; runtime override tests updated to cover model-override scenarios.

…OVERRIDE drop (#2653)

After PR #917 drops CAP_DAC_OVERRIDE via capsh, root can no longer write
to 444-mode config files. The E2E test uses plain docker run (no
no-new-privileges), so capsh succeeds and the capability is actually
dropped — unlike production where OpenShell's no-new-privileges prevents
the drop.

Three entrypoint functions (apply_model_override, apply_cors_override,
apply_slack_token_override) open openclaw.json for in-place writing,
raising PermissionError (EACCES) under set -euo pipefail.

Phase 1 — Permission fix:
- Add relax_config_for_write() / lock_config_after_write() helpers in
  scripts/lib/sandbox-init.sh that chmod 644/444 with symlink guards
- Wrap all three override functions with these helpers

Phase 2 — Spurious trigger guard:
- Tighten apply_model_override() entry guard to only fire when
  NEMOCLAW_MODEL_OVERRIDE or NEMOCLAW_INFERENCE_API_OVERRIDE is set
- NEMOCLAW_CONTEXT_WINDOW, NEMOCLAW_MAX_TOKENS, NEMOCLAW_REASONING are
  promoted from Dockerfile ARGs to ENV and always present — they should
  only take effect alongside an explicit model override
- Update E2E tests 3-5, 9-11, 14 to pair standalone parameters with
  NEMOCLAW_MODEL_OVERRIDE

Phase 3 — Test error visibility:
- Create timestamped log file matching CI artifact glob
- Replace all 2>/dev/null with 2>>$LOG_FILE in test helpers
- Append run_override_stderr output to main log file

Fixes #2653

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
@jyaunches jyaunches added the security Something isn't secure label Apr 29, 2026
@jyaunches jyaunches self-assigned this Apr 29, 2026
@jyaunches jyaunches added the security Something isn't secure label Apr 29, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c82375ce-1ca9-45f1-84b4-90134d03e70e

📥 Commits

Reviewing files that changed from the base of the PR and between 5965c0d and b8f4032.

📒 Files selected for processing (2)
  • scripts/nemoclaw-start.sh
  • test/nemoclaw-start.test.ts

📝 Walkthrough

Walkthrough

Adds two sandbox permission helper functions, uses them to safely relax and re-lock config file modes around runtime override writers, narrows model-override gating, and routes Docker/validation stderr into timestamped CI test logs.

Changes

Cohort / File(s) Summary
Config Permission Management
scripts/lib/sandbox-init.sh
Adds relax_config_for_write() and lock_config_after_write() to validate targets (reject symlinks), ignore missing regular files, emit [SECURITY] warnings on violations, and set modes to 644 before writes and 444 after.
Runtime Override Logic
scripts/nemoclaw-start.sh
Restricts apply_model_override to run only when NEMOCLAW_MODEL_OVERRIDE or NEMOCLAW_INFERENCE_API_OVERRIDE are set. Wraps override writers (apply_model_override, apply_cors_override, apply_slack_token_override) with the permission helpers, captures writer and hash exit codes, recomputes SHA256 only on successful writes, and always re-locks; functions return non-zero when write/hash fail.
E2E Test Logging & Env Requirements
test/e2e/test-runtime-overrides.sh
Creates timestamped test-runtime-overrides-*.log and appends Docker stderr and validation stderr to it for CI visibility. Adjusts tests to include NEMOCLAW_MODEL_OVERRIDE=test in validation/rejection scenarios and updates related descriptions.
Unit Test Guard Adjustment
test/nemoclaw-start.test.ts
Narrows apply_model_override() guard tests to assert only NEMOCLAW_MODEL_OVERRIDE or NEMOCLAW_INFERENCE_API_OVERRIDE trigger the patch; adds negative checks that NEMOCLAW_CONTEXT_WINDOW, NEMOCLAW_MAX_TOKENS, and NEMOCLAW_REASONING are not referenced in the early-return guard.

Sequence Diagram(s)

sequenceDiagram
    participant Starter as "nemoclaw-start.sh"
    participant SandboxLib as "sandbox-init.sh helpers"
    participant Writer as "inline Python writer"
    participant FS as "filesystem (/sandbox/.openclaw/openclaw.json)"
    participant Hasher as "SHA256 recompute"

    Starter->>SandboxLib: relax_config_for_write(target file)
    SandboxLib-->>Starter: success / failure (stderr [SECURITY] on violations)
    alt relax succeeded
        Starter->>Writer: invoke writer (apply_*_override)
        Writer-->>Starter: exit status
        alt writer succeeded
            Starter->>Hasher: recompute SHA256 of FS file
            Hasher-->>Starter: hash value / status
        else writer failed
            Starter-->>Starter: record non-zero write status
        end
    end
    Starter->>SandboxLib: lock_config_after_write(target file)
    SandboxLib-->>Starter: success / failure (stderr [SECURITY] on violations)
    Starter-->>Caller: return captured status (0 or non-zero)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I nudged the locks and peeked inside,

I set them loose where writers bide,
No symlinks slipped, no secrets flown,
I sealed them back when work was done.
Logs purr softly — tidy trail, well-done!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding permission-relaxation helpers to fix a crash caused by dropping CAP_DAC_OVERRIDE capability, which is the core objective of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch issue-2653-runtime-overrides-e2e-crash

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
scripts/lib/sandbox-init.sh (1)

170-170: Consider propagating chmod failures.

The chmod commands don't check their return status. If chmod fails (e.g., filesystem error, unexpected ownership), the function silently continues. While rare, this could lead to confusing behavior where the caller believes permissions were changed but they weren't.

♻️ Optional: propagate chmod failures
 relax_config_for_write() {
   local f
   for f in "$@"; do
     if [ -L "$f" ]; then
       printf '[SECURITY] Refusing to relax permissions — %s is a symlink\n' "$f" >&2
       return 1
     fi
     [ -f "$f" ] || continue
-    chmod 644 "$f"
+    if ! chmod 644 "$f"; then
+      printf '[SECURITY] Failed to relax permissions on %s\n' "$f" >&2
+      return 1
+    fi
   done
 }

 lock_config_after_write() {
   local f
   for f in "$@"; do
     if [ -L "$f" ]; then
       printf '[SECURITY] Refusing to lock permissions — %s is a symlink\n' "$f" >&2
       return 1
     fi
     [ -f "$f" ] || continue
-    chmod 444 "$f"
+    if ! chmod 444 "$f"; then
+      printf '[SECURITY] Failed to lock permissions on %s\n' "$f" >&2
+      return 1
+    fi
   done
 }

Also applies to: 182-182

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/lib/sandbox-init.sh` at line 170, The chmod invocations (e.g., the
line containing chmod 644 "$f" and the other chmod calls around the same area)
currently ignore failures; update the surrounding code to check chmod's exit
status and propagate errors instead of continuing silently — after each chmod,
if it fails emit an explanatory error to stderr via the function's logger and
return a non‑zero status (or exit non‑zero if the script is intended to
terminate), so callers can detect and handle permission-change failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 366-368: The relax_config_for_write call currently isn't paired
with a guaranteed re-lock; wrap the section that runs the Python override and
the subsequent hash recomputation so lock_config_after_write is always invoked
even on error by installing a trap immediately after relax_config_for_write
(similar to the pattern used around apply_model_override): set a trap to call
lock_config_after_write with the same arguments on EXIT (or ERR), run the Python
script and recompute_hash, then clear the trap once successful; apply the same
trap/clear pattern to the later relax_config_for_write block that surrounds the
388-390 operations so both blocks use lock_config_after_write as an error-safe
finally.
- Around line 276-278: After calling relax_config_for_write, ensure
lock_config_after_write is always invoked on failure or exit by installing a
trap (or explicit error handling) before running the Python script and hash
recomputation so that lock_config_after_write "$config_file" "$hash_file" runs
on ERR/EXIT; remove or disable the trap only after successful completion of the
Python script and the hash recomputation (and before the later chattr
+i/Landlock steps) to avoid leaving files at 644 if the Python script or hash
step fails; apply the same pattern around the second relax/lock pair (the
321–323 code path) so both relax_config_for_write/lock_config_after_write pairs
are guarded.
- Around line 447-449: The relax_config_for_write call that temporarily loosens
permissions for the Slack token override should be protected with a trap-based
cleanup so permissions are always re-locked even if the subsequent Python script
or hash recomputation fails: before calling relax_config_for_write (used for the
slack token override) register a trap that will call the inverse/relock routine
(undoing relax_config_for_write’s changes) on EXIT/ERR, then run the Python
script and hash recompute, and finally remove or replace the trap on successful
completion; apply the exact same trap-wrapping pattern to the other occurrence
of relax_config_for_write mentioned in the review.

---

Nitpick comments:
In `@scripts/lib/sandbox-init.sh`:
- Line 170: The chmod invocations (e.g., the line containing chmod 644 "$f" and
the other chmod calls around the same area) currently ignore failures; update
the surrounding code to check chmod's exit status and propagate errors instead
of continuing silently — after each chmod, if it fails emit an explanatory error
to stderr via the function's logger and return a non‑zero status (or exit
non‑zero if the script is intended to terminate), so callers can detect and
handle permission-change failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0684cdf1-e5dc-4f13-93b8-28f83dc966fc

📥 Commits

Reviewing files that changed from the base of the PR and between 9113dca and 956f773.

📒 Files selected for processing (3)
  • scripts/lib/sandbox-init.sh
  • scripts/nemoclaw-start.sh
  • test/e2e/test-runtime-overrides.sh

Comment thread scripts/nemoclaw-start.sh Outdated
Comment thread scripts/nemoclaw-start.sh Outdated
Comment thread scripts/nemoclaw-start.sh Outdated
Comment on lines +447 to +449
# Relax 444 → 644 so writes succeed after CAP_DAC_OVERRIDE is dropped (#2653)
relax_config_for_write "$config_file" "$hash_file"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Same concern: Slack token override lacks error-safe re-locking.

The pattern repeats here. For defense-in-depth, consider the same trap-based approach to ensure permissions are re-locked even if the Python script or hash recomputation fails.

Also applies to: 484-486

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/nemoclaw-start.sh` around lines 447 - 449, The relax_config_for_write
call that temporarily loosens permissions for the Slack token override should be
protected with a trap-based cleanup so permissions are always re-locked even if
the subsequent Python script or hash recomputation fails: before calling
relax_config_for_write (used for the slack token override) register a trap that
will call the inverse/relock routine (undoing relax_config_for_write’s changes)
on EXIT/ERR, then run the Python script and hash recompute, and finally remove
or replace the trap on successful completion; apply the exact same trap-wrapping
pattern to the other occurrence of relax_config_for_write mentioned in the
review.

…ailure

Address CodeRabbit review feedback:

1. relax_config_for_write() / lock_config_after_write() now check chmod
   exit status and return 1 with a diagnostic message on failure.

2. All three override functions (apply_model_override, apply_cors_override,
   apply_slack_token_override) now capture write/hash failures via
   || _write_rc=$? and always call lock_config_after_write before
   returning — files are never left at 644 on error.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 320-324: The success message is printed even when sha256sum fails
because failure is swallowed by "|| _write_rc=$?" but printf is unconditional;
change the pattern so the printf only runs on successful sha256sum. Replace the
current construct (cd ... && sha256sum ... >"$hash_file") || _write_rc=$?;
printf ... with an if/then: if (cd /sandbox/.openclaw && sha256sum openclaw.json
>"$hash_file"); then printf '[SECURITY] Config hash recomputed after model
override\n' >&2; else _write_rc=$?; fi, and apply the same fix to the analogous
sites in apply_cors_override and apply_slack_token_override so the success log
is emitted only on true success.
- Around line 196-203: Update the test titled "triggers on any override env
var..." to match the new guard semantics: assert that setting either
NEMOCLAW_MODEL_OVERRIDE or NEMOCLAW_INFERENCE_API_OVERRIDE causes the config
patch to run, and remove the expectation that NEMOCLAW_CONTEXT_WINDOW,
NEMOCLAW_MAX_TOKENS, or NEMOCLAW_REASONING alone should trigger it; add/adjust
cases to explicitly verify (a) model override triggers, (b) inference API
override triggers, and (c) the other three vars by themselves do not trigger the
guard.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c60aebfb-2bd2-45af-bcd7-6ddc091dbba9

📥 Commits

Reviewing files that changed from the base of the PR and between 956f773 and 5965c0d.

📒 Files selected for processing (2)
  • scripts/lib/sandbox-init.sh
  • scripts/nemoclaw-start.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/lib/sandbox-init.sh

Comment thread scripts/nemoclaw-start.sh
Comment thread scripts/nemoclaw-start.sh
… logs

Address second round of CodeRabbit review feedback:

1. Update unit test 'triggers on any override env var' to assert the new
   2-variable guard (#2653 Phase 2): only NEMOCLAW_MODEL_OVERRIDE and
   NEMOCLAW_INFERENCE_API_OVERRIDE trigger the function; standalone
   NEMOCLAW_CONTEXT_WINDOW/MAX_TOKENS/REASONING do not.

2. Move 'hash recomputed' success log inside the sha256sum success path
   so it is not emitted on hash failure — applies to all three override
   functions.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
@jyaunches jyaunches added the v0.0.30 Release target label Apr 29, 2026
@cv cv merged commit 3739258 into main Apr 29, 2026
18 checks passed
cv pushed a commit that referenced this pull request Apr 29, 2026
…it (#2700)

## Summary

Fixes #2698 — `runtime-overrides-e2e` test 14 crashes because invalid
override validation uses `return 1`, which under `set -euo pipefail`
kills the container before CMD runs.

## Root Cause

PR #2659 tightened the trigger guard for `apply_model_override()` but
kept `return 1` for validation failures. When
`NEMOCLAW_CONTEXT_WINDOW=notanumber` is rejected, `return 1` propagates
through `set -e` and exits the entrypoint. The container dies before the
test can read the config.

## Fix

Change non-security-critical validation failures from `return 1` to
`return 0`:
- Invalid API type → skip, don't crash
- Non-integer context window → skip, don't crash
- Non-integer max tokens → skip, don't crash
- Invalid reasoning boolean → skip, don't crash
- Non-http CORS origin → skip, don't crash

Security-critical rejections remain `return 1`:
- Symlink on config/hash path (indicates tampering)
- Control characters in model override (indicates injection)
- Oversized model override (indicates injection)

The security warning is still logged to stderr, but the container starts
normally with config unchanged.

## Verification
- [x] Unit tests pass (`test/nemoclaw-start.test.ts` — assertions use
`toContain` on message strings which are preserved)
- [ ] `runtime-overrides-e2e` passes (to be validated on sparky)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code (pi agent)

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved startup resilience by gracefully handling invalid
configuration parameters. The system now logs warnings and continues
instead of aborting when misconfigured environment variables are
detected.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…OVERRIDE drop (NVIDIA#2659)

## Summary

Fix the `test/e2e/test-runtime-overrides.sh` E2E crash at baseline
config capture. After PR NVIDIA#917 drops `CAP_DAC_OVERRIDE` via `capsh`, root
can no longer write to 444-mode config files. The three runtime override
functions (`apply_model_override`, `apply_cors_override`,
`apply_slack_token_override`) now temporarily relax permissions to 644
before writing and re-lock to 444 afterward, using `CAP_FOWNER` which is
retained by design.

## Related Issue

Fixes NVIDIA#2653

## Changes

- **`scripts/lib/sandbox-init.sh`**: Add `relax_config_for_write()` /
`lock_config_after_write()` shared helpers that `chmod 644`/`chmod 444`
with symlink guards
- **`scripts/nemoclaw-start.sh`**: Wrap python3 write + hash
recomputation in all three override functions with the new helpers
- **`scripts/nemoclaw-start.sh`**: Tighten `apply_model_override()`
trigger guard — only fire when `NEMOCLAW_MODEL_OVERRIDE` or
`NEMOCLAW_INFERENCE_API_OVERRIDE` is explicitly set (Dockerfile ENV
defaults no longer trigger spurious override runs)
- **`test/e2e/test-runtime-overrides.sh`**: Create timestamped log file
matching CI artifact glob `test-runtime-overrides-*.log`, replace all
`2>/dev/null` with `2>>"$LOG_FILE"`
- **`test/e2e/test-runtime-overrides.sh`**: Update tests 3-5, 9-11, 14
to pair standalone parameters with `NEMOCLAW_MODEL_OVERRIDE` (matching
the tightened trigger guard)

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code (pi agent)

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Overrides now only apply when explicit override flags are set,
preventing unintended changes.
* Safer config writes: permissions are temporarily relaxed with strict
safety checks (rejecting unsafe targets, skipping missing files), clear
security errors on failure, recompute verification only on successful
writes, and permissions are always restored; write/hash failures are
surfaced.

* **Tests**
* Improved test logging with timestamped artifacts and preserved stderr;
runtime override tests updated to cover model-override scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…it (NVIDIA#2700)

## Summary

Fixes NVIDIA#2698 — `runtime-overrides-e2e` test 14 crashes because invalid
override validation uses `return 1`, which under `set -euo pipefail`
kills the container before CMD runs.

## Root Cause

PR NVIDIA#2659 tightened the trigger guard for `apply_model_override()` but
kept `return 1` for validation failures. When
`NEMOCLAW_CONTEXT_WINDOW=notanumber` is rejected, `return 1` propagates
through `set -e` and exits the entrypoint. The container dies before the
test can read the config.

## Fix

Change non-security-critical validation failures from `return 1` to
`return 0`:
- Invalid API type → skip, don't crash
- Non-integer context window → skip, don't crash
- Non-integer max tokens → skip, don't crash
- Invalid reasoning boolean → skip, don't crash
- Non-http CORS origin → skip, don't crash

Security-critical rejections remain `return 1`:
- Symlink on config/hash path (indicates tampering)
- Control characters in model override (indicates injection)
- Oversized model override (indicates injection)

The security warning is still logged to stderr, but the container starts
normally with config unchanged.

## Verification
- [x] Unit tests pass (`test/nemoclaw-start.test.ts` — assertions use
`toContain` on message strings which are preserved)
- [ ] `runtime-overrides-e2e` passes (to be validated on sparky)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code (pi agent)

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved startup resilience by gracefully handling invalid
configuration parameters. The system now logs warnings and continues
instead of aborting when misconfigured environment variables are
detected.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
ericksoa pushed a commit that referenced this pull request May 11, 2026
## Summary

Partial fix for #3280 — drop the two caps that are cleanly droppable
today, surface residuals when the drop step is silently skipped, and
rewrite the cap test so it actually exercises the regression.

- Append `cap_sys_admin` and `cap_sys_ptrace` to the `capsh --drop` list
in `scripts/lib/sandbox-init.sh`. Verified live: in a permissive runtime
(`--cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE`), pre-drop
`CapBnd=0xa82c25fb`, post-drop `CapBnd=0x1e9` — only load-bearing caps
remain.
- Replace the misleading "runtime already restricts capabilities"
message on the `CAP_SETPCAP`-missing fallback with
`report_residual_capabilities()`, which reads `CapBnd:` from
`/proc/self/status` and names which of the 5 must-drop caps remain. Uses
bash 64-bit arithmetic (no gawk-strtonum dependency).
- Rewrite `test/e2e-gateway-isolation.sh` test 14 to inventory all 8
caps named in the issue against `CapBnd`. 5 are classified `must-drop`;
3 (`CAP_FOWNER`, `CAP_SETUID`, `CAP_SETGID`) are classified `allowed`
because dropping them requires an entrypoint refactor (see below). The
test container now starts with `--cap-add CAP_SYS_ADMIN --cap-add
CAP_SYS_PTRACE` so the bounding set entering capsh matches the
permissive runtime that triggered T6002104. Without this, docker's
default bounding set already excludes those caps and the test would have
been a no-op for the very regression we care about.

## Scope: why this is 5/8, not 8/8

The issue names eight caps that must be absent from the bounding set:
`CAP_SYS_ADMIN`, `CAP_NET_RAW`, `CAP_NET_BIND_SERVICE`,
`CAP_SYS_PTRACE`, `CAP_DAC_OVERRIDE`, `CAP_FOWNER`, `CAP_SETUID`,
`CAP_SETGID`.

This PR fully drops 5 of them. The remaining 3 — `CAP_FOWNER`,
`CAP_SETUID`, `CAP_SETGID` — can't be dropped without an entrypoint
refactor:

1. The entrypoint runs as root and uses `gosu` to step down to the
sandbox/gateway user.
2. `gosu` performs the `setuid()` syscall, which requires `CAP_SETUID`
in the process's permitted set.
3. Dropping `CAP_SETUID` from the bounding set causes the next `exec` to
strip it from permitted — but the bounding set can only be modified with
`CAP_SETPCAP`, which only root has.
4. So the drop must happen *before* gosu (which breaks gosu) or *after*
gosu (which is too late — we're no longer root and can't modify the
bounding set).

The proper fix is to replace `gosu` with `setpriv` (`setpriv
--reuid=sandbox --regid=sandbox
--bounding-set=-cap_setuid,-cap_setgid,-cap_fowner -- $cmd`), which does
reuid + bounding-set drop atomically in one process. `setpriv` is
already present in the image, but the refactor affects 4 gosu call sites
in `scripts/nemoclaw-start.sh` plus the parallel entrypoint in
`agents/hermes/start.sh` (same shared library), plus the
no-new-privileges special case at `scripts/nemoclaw-start.sh:1490`.
That's a different shape of change than "tighten bounding set" and needs
its own design and review cycle.

**Residual-risk note.** The practical exploit path for these 3 caps from
the sandbox user's bounding set requires (a) execing a setuid-root
binary that carries those file caps, but the image ships none and (b)
the user creating a new one, which is blocked by Landlock +
`no-new-privs` + lack of `CAP_DAC_OVERRIDE` on root-owned paths. So the
residual is a defense-in-depth gap, not a directly exploitable one.

**Follow-up.** I'll file an issue titled "refactor(entrypoint): replace
gosu with setpriv to drop CAP_FOWNER/SETUID/SETGID from sandbox bounding
set (#3280 follow-up)" and link it from here so T6002104 can track the
second half.

## Test plan

- [x] **Forward case (live container).** Built `nemoclaw-isolation-test`
= `nemoclaw-production` + new `sandbox-init.sh`. Ran test 14 logic with
`--cap-add CAP_SYS_ADMIN --cap-add CAP_SYS_PTRACE`: all 5 must-drop caps
`ABSENT`, all 3 load-bearing caps `PRESENT`, test passes.
- [x] **Negative case (live container).** Rebuilt the image with
`cap_sys_admin` removed from the drop list. Test correctly fails with
"CAP_SYS_ADMIN still present in CapBnd after capsh drop".
`CapBnd=0x2001e9` (bit 21 set) — exactly the T6002104 signature.
- [x] **Synthetic bit-decode harness.** Replayed the decode loop over
fabricated `CapBnd` values (ideal post-drop, single-cap regressions,
full-skip case) — classifier correctly identifies each failure mode by
name.
- [x] `npx vitest run test/sandbox-init.test.ts` — 36/36 pass (no-capsh
fall-through and `NEMOCLAW_CAPS_DROPPED=1` short-circuit unchanged).
- [x] `shfmt -d -i 2 -ci -bn` clean on both touched files.
- [x] `bash -n` clean.
- [ ] CI to run full `e2e-gateway-isolation.sh` against a freshly-built
sandbox image.

## Notes for review

- Drop list scope is intentionally narrowed per maintainer guidance;
further hardening (CAP_SYS_MODULE, CAP_SYS_RAWIO, CAP_BPF, CAP_PERFMON,
CAP_SYS_BOOT, CAP_SYSLOG, CAP_NET_ADMIN, …) is a candidate follow-up
beyond this PR's issue.
- Each kept cap (cap_chown, cap_fowner, cap_setuid, cap_setgid,
cap_kill) now has an inline justification with a pointer to its
load-bearing call site, so a future contributor can audit them without
rediscovering #797 and #2659.
- `report_residual_capabilities()` is new code, but it only runs on the
existing `CAP_SETPCAP`-missing fallback path — same trigger as the
previous one-line warning, just with structured diagnostics. The happy
path is unchanged.

Closes #3280 *(partial — see scope section above; full closure depends
on the setpriv follow-up)*.

Signed-off-by: Dongni Yang <dongniy@nvidia.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Security Improvements**
* Enhanced sandbox security by dropping additional dangerous system
capabilities.
* Improved security diagnostics to detect and report residual dangerous
capabilities in the runtime environment.
* Strengthened validation testing to comprehensively verify that
dangerous capabilities are properly restricted during sandbox
initialization.

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3328)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

security Something isn't secure v0.0.30 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(e2e): runtime-overrides-e2e crashes immediately at baseline config capture

3 participants