Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions packages/envd/internal/port/forward.go
Original file line number Diff line number Diff line change
Expand Up @@ -143,6 +143,7 @@ func (f *Forwarder) startPortForwarding(ctx context.Context, p *PortToForward) {

cgroupFD, ok := f.cgroupManager.GetFileDescriptor(cgroups.ProcessTypeSocat)

// socat keeps envd's SCHED_FIFO + Nice=-20 by design.
cmd.SysProcAttr = &syscall.SysProcAttr{
Setpgid: true,
}
Expand Down
13 changes: 7 additions & 6 deletions packages/envd/internal/services/process/handler/handler.go
Original file line number Diff line number Diff line change
Expand Up @@ -96,13 +96,14 @@ func New(
// User command string for logging (without the internal wrapper details).
userCmd := strings.Join(append([]string{req.GetProcess().GetCmd()}, req.GetProcess().GetArgs()...), " ")

// Wrap the command in a shell that sets the OOM score and nice value before exec-ing the actual command.
// This eliminates the race window where grandchildren could inherit the parent's protected OOM score (-1000)
// or high CPU priority (nice -20) before the post-start calls had a chance to correct them.
// nice(1) applies a relative adjustment, so we compute the delta from the current (inherited) nice to the target.
// Reset oom_score_adj, drop SCHED_FIFO via chrt, drop the SYS_NICE
// ambient cap, then apply nice.
niceDelta := defaultNice - currentNice()
oomWrapperScript := fmt.Sprintf(`echo %d > /proc/$$/oom_score_adj && exec /usr/bin/nice -n %d "${@}"`, defaultOomScore, niceDelta)
wrapperArgs := append([]string{"-c", oomWrapperScript, "--", req.GetProcess().GetCmd()}, req.GetProcess().GetArgs()...)
wrapperScript := fmt.Sprintf(
`echo %d > /proc/$$/oom_score_adj && exec /usr/bin/chrt --other 0 /usr/bin/setpriv --ambient-caps -all -- /usr/bin/nice -n %d "${@}"`,
defaultOomScore, niceDelta,
)
Comment on lines +102 to +105

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: the wrapper's setpriv --ambient-caps -all only clears the ambient set, not the inheritable set. Go's AmbientCaps machinery raises CAP_SYS_NICE in both ambient and inheritable before PR_CAP_AMBIENT_RAISE (see Go's syscall/exec_linux.go forkAndExecInChild1), and the inheritable bit survives execve. So the user command ends up with CAP_SYS_NICE still in P(inh), contradicting the comment's stated intent ("drop the SYS_NICE ambient cap"). Trivial fix: append --inh-caps -all to the setpriv invocation.

Extended reasoning...

What's leaking

The wrapper script on handler.go:102-105 does:

exec /usr/bin/chrt --other 0 /usr/bin/setpriv --ambient-caps -all -- /usr/bin/nice -n N USER_CMD

The stated intent (from the diff comment) is to "drop the SYS_NICE ambient cap, then apply nice". But after this chain runs, the final user process actually inherits CAP_SYS_NICE in its inheritable capability set.

Why

When SysProcAttr.AmbientCaps is non-empty, Go does not just raise ambient. Looking at /usr/local/go/src/syscall/exec_linux.go forkAndExecInChild1 (lines 519–524):

// Add the c capability to the permitted and inheritable capability mask,
// otherwise we will not be able to add it to the ambient capability mask.
caps.data[capToIndex(c)].permitted   |= capToMask(c)
caps.data[capToIndex(c)].inheritable |= capToMask(c)

It then calls capset() and only afterwards PR_CAP_AMBIENT_RAISE. That's required by the kernel — you can't raise ambient unless the cap is already in both permitted and inheritable. So by the time /bin/sh exec's, P(perm) and P(inh) both contain CAP_SYS_NICE.

Across execve (capabilities(7)): P'(inheritable) = P(inheritable) — the inheritable set is preserved, only permitted/effective get recomputed against file caps.

setpriv --ambient-caps -all per its man page issues PR_CAP_AMBIENT_LOWER for each capability; it does not call capset() and so does not touch the inheritable set. So after the entire sh → chrt → setpriv → nice → user_cmd chain, the user process still has CAP_SYS_NICE in P(inh).

Step-by-step proof

  1. envd has CAP_SYS_NICE in P(perm) (runs as root, full bounding set).
  2. envd forks /bin/sh with AmbientCaps=[CAP_SYS_NICE]. Go raises permitted+inheritable, then ambient. After execve /bin/sh has: P(perm)={SYS_NICE}, P(inh)={SYS_NICE}, P(amb)={SYS_NICE}.
  3. /bin/sh exec's chrt — caps unchanged through execve except ambient bits get promoted into permitted+effective per ambient rules. Still P(inh)={SYS_NICE}.
  4. chrt exec's setpriv --ambient-caps -all. setpriv clears the ambient set → P(amb)={}. But P(inh) is untouched → P(inh)={SYS_NICE}.
  5. setpriv exec's nice, then user_cmd. Across each execve, P(inh) is preserved. Final state for the user command: P(inh)={SYS_NICE}, P(amb)={}, P(perm)/P(eff) computed from F(inh)/F(perm) of the user binary.

Verifiable from a shell inside the sandbox with grep ^Cap /proc/self/statusCapInh will be 0000000000800000 (bit 23 = CAP_SYS_NICE) instead of 0000000000000000.

Impact (why this is a nit, not a real bug)

For ordinary binaries with no file capabilities, P'(permitted) = (F(perm) & bset) | (F(inh) & P(inh)) | P(amb) yields zero — F(inh) is empty, so the inheritable leak grants nothing. Exploitation requires a binary with CAP_SYS_NICE+ie file caps, and:

  • The sandbox user can't install one (they lack CAP_SETFCAP).
  • No standard Linux distribution ships one.

So in practice the leak is inert. I'm filing this as a nit rather than normal: it's a defense-in-depth refinement that matches the comment's stated intent, not a user-visible escalation path. Acknowledging the refutation: yes, the refuter is correct that exploitability is essentially zero in the default sandbox, which is why I'm not arguing this blocks the merge.

Fix

-/usr/bin/setpriv --ambient-caps -all -- /usr/bin/nice -n %d "${@}"
+/usr/bin/setpriv --ambient-caps -all --inh-caps -all -- /usr/bin/nice -n %d "${@}"

Optionally also clear CAP_SYS_NICE from the bounding set (--bounding-set -all or scoped) for a tighter posture. With --inh-caps -all, P(inh) on the final user process is empty, matching the wrapper's stated intent.

wrapperArgs := append([]string{"-c", wrapperScript, "--", req.GetProcess().GetCmd()}, req.GetProcess().GetArgs()...)
Comment on lines +99 to +106

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 envd version in packages/envd/pkg/version.go is still 0.5.24 but this PR introduces three behavioral envd changes (chrt+setpriv wrapper for user processes, CAP_SYS_NICE ambient cap, SCHED_FIFO scheduling for envd itself). CLAUDE.md:133 and packages/envd/README.md both explicitly require a version bump on every behavioral change. Bump to 0.5.25.

Extended reasoning...

The convention

CLAUDE.md:133 states explicitly: "Version in pkg/version.go must be bumped on every behavioral change (not comments/docs-only changes)". packages/envd/README.md echoes the same rule. Recent git history shows this is followed strictly — every recent envd PR bumped the version: #2683 (0.5.23 → 0.5.24, NFS umount), #2680 (0.5.22 → 0.5.23), #2676 (0.5.21 → 0.5.22).

Behavioral changes in this PR

This PR contains three clearly non-docs changes to envd runtime behavior:

  1. handler.go:99-103 — user process wrapper now exec's /usr/bin/chrt --other 0 /usr/bin/setpriv --ambient-caps -all -- /usr/bin/nice instead of just /usr/bin/nice.
  2. handler_caps_linux.go (new) — adds CAP_SYS_NICE to user processes' AmbientCaps.
  3. envd.service.tpl:20-21 — adds CPUSchedulingPolicy=fifo and CPUSchedulingPriority=1 to the envd systemd unit.

None of these are comment/doc-only. Yet packages/envd/pkg/version.go still reads const Version = "0.5.24", unchanged from before this PR.

Why this matters semantically (not just procedurally)

The version constant is consumed by version-based feature detection in packages/shared/pkg/utils/version.go (IsGTEVersion, CheckEnvdVersionForSnapshot, MinEnvdVersionForSnapshot) and the orchestrator pins behavior to envd version via -envd-version. Skipping the bump means callers cannot distinguish the new wrapper/scheduling behavior from the previous build, so any feature-gated logic that ought to key on these changes will silently misclassify rootfs images carrying this code.

Proof / step-by-step

  1. cat packages/envd/pkg/version.goconst Version = "0.5.24".
  2. git log --oneline packages/envd/pkg/version.go | head shows monotonic bumps tied to every prior envd-touching PR (0.5.19 → 0.5.24 in five PRs).
  3. git diff main -- packages/envd packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tpl shows the three behavioral hunks above.
  4. git diff main -- packages/envd/pkg/version.go is empty → convention violated.

Fix

Bump packages/envd/pkg/version.go to const Version = "0.5.25" (or higher) so the new wrapper/scheduling behavior is identifiable from the version constant. Trivial one-line change.

cmd := exec.CommandContext(ctx, "/bin/sh", wrapperArgs...)

uid, gid, err := permissions.GetUserIdUints(user)
Expand Down
Loading