feat(envd): run envd at SCHED_FIFO 1, reset user processes via wrapper#2684
feat(envd): run envd at SCHED_FIFO 1, reset user processes via wrapper#2684ValentaTomas wants to merge 7 commits into
Conversation
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 27c8fa0. Bugbot is set up for automated code reviews on this repo. Configure here. |
❌ 6 Tests Failed:
View the full list of 8 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Code Review
The AmbientCaps field in syscall.SysProcAttr requires a slice of uintptr, so unix.CAP_SYS_NICE must be cast to uintptr to prevent a compilation error. The systemd unit also requires AmbientCapabilities=CAP_SYS_NICE to ensure the capability is preserved across user ID changes, as the permitted set is otherwise cleared before ambient capabilities are set.
|
Superseded by #2700 (move envd into a dedicated network namespace), which addresses the root cause. |
Mirrors envd's existing Nice=-20 CPU priority with a real-time scheduling class so envd preempts customer SCHED_OTHER work during pause/resume storms. Uses the lowest RT priority (1) so envd cannot starve kernel threads or other RT services; the default kernel.sched_rt_runtime_us throttle (95% per 1s) caps total RT bandwidth so a hypothetical envd busy-loop cannot DoS the system. User-spawned processes are reset to SCHED_OTHER (nice 0, no ambient caps) via the existing /bin/sh wrapper: CAP_SYS_NICE is passed through setuid via SysProcAttr.AmbientCaps so chrt(1) can drop the RT policy, then setpriv(1) strips the ambient cap so the user command cannot re-raise itself. socat intentionally inherits SCHED_FIFO + Nice=-20: port forwarding is infrastructure-critical and dropping connections under load is much worse than the small RT budget cost.
c00ceb7 to
1acfcc4
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1acfcc4e93
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Reset oom_score_adj, drop SCHED_FIFO via chrt, drop the SYS_NICE | ||
| // ambient cap, then apply nice. | ||
| niceDelta := defaultNice - currentNice() | ||
| oomWrapperScript := fmt.Sprintf(`echo %d > /proc/$$/oom_score_adj && exec /usr/bin/nice -n %d "${@}"`, defaultOomScore, niceDelta) | ||
| wrapperArgs := append([]string{"-c", oomWrapperScript, "--", req.GetProcess().GetCmd()}, req.GetProcess().GetArgs()...) | ||
| wrapperScript := fmt.Sprintf( | ||
| `echo %d > /proc/$$/oom_score_adj && exec /usr/bin/chrt --other 0 /usr/bin/setpriv --ambient-caps -all -- /usr/bin/nice -n %d "${@}"`, | ||
| defaultOomScore, niceDelta, | ||
| ) | ||
| wrapperArgs := append([]string{"-c", wrapperScript, "--", req.GetProcess().GetCmd()}, req.GetProcess().GetArgs()...) |
There was a problem hiding this comment.
🟡 envd version in packages/envd/pkg/version.go is still 0.5.24 but this PR introduces three behavioral envd changes (chrt+setpriv wrapper for user processes, CAP_SYS_NICE ambient cap, SCHED_FIFO scheduling for envd itself). CLAUDE.md:133 and packages/envd/README.md both explicitly require a version bump on every behavioral change. Bump to 0.5.25.
Extended reasoning...
The convention
CLAUDE.md:133 states explicitly: "Version in pkg/version.go must be bumped on every behavioral change (not comments/docs-only changes)". packages/envd/README.md echoes the same rule. Recent git history shows this is followed strictly — every recent envd PR bumped the version: #2683 (0.5.23 → 0.5.24, NFS umount), #2680 (0.5.22 → 0.5.23), #2676 (0.5.21 → 0.5.22).
Behavioral changes in this PR
This PR contains three clearly non-docs changes to envd runtime behavior:
handler.go:99-103— user process wrapper now exec's/usr/bin/chrt --other 0 /usr/bin/setpriv --ambient-caps -all -- /usr/bin/niceinstead of just/usr/bin/nice.handler_caps_linux.go(new) — addsCAP_SYS_NICEto user processes'AmbientCaps.envd.service.tpl:20-21— addsCPUSchedulingPolicy=fifoandCPUSchedulingPriority=1to the envd systemd unit.
None of these are comment/doc-only. Yet packages/envd/pkg/version.go still reads const Version = "0.5.24", unchanged from before this PR.
Why this matters semantically (not just procedurally)
The version constant is consumed by version-based feature detection in packages/shared/pkg/utils/version.go (IsGTEVersion, CheckEnvdVersionForSnapshot, MinEnvdVersionForSnapshot) and the orchestrator pins behavior to envd version via -envd-version. Skipping the bump means callers cannot distinguish the new wrapper/scheduling behavior from the previous build, so any feature-gated logic that ought to key on these changes will silently misclassify rootfs images carrying this code.
Proof / step-by-step
cat packages/envd/pkg/version.go→const Version = "0.5.24".git log --oneline packages/envd/pkg/version.go | headshows monotonic bumps tied to every prior envd-touching PR (0.5.19 → 0.5.24 in five PRs).git diff main -- packages/envd packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tplshows the three behavioral hunks above.git diff main -- packages/envd/pkg/version.gois empty → convention violated.
Fix
Bump packages/envd/pkg/version.go to const Version = "0.5.25" (or higher) so the new wrapper/scheduling behavior is identifiable from the version constant. Trivial one-line change.
| wrapperScript := fmt.Sprintf( | ||
| `echo %d > /proc/$$/oom_score_adj && exec /usr/bin/chrt --other 0 /usr/bin/setpriv --ambient-caps -all -- /usr/bin/nice -n %d "${@}"`, | ||
| defaultOomScore, niceDelta, | ||
| ) |
There was a problem hiding this comment.
🟡 Nit: the wrapper's setpriv --ambient-caps -all only clears the ambient set, not the inheritable set. Go's AmbientCaps machinery raises CAP_SYS_NICE in both ambient and inheritable before PR_CAP_AMBIENT_RAISE (see Go's syscall/exec_linux.go forkAndExecInChild1), and the inheritable bit survives execve. So the user command ends up with CAP_SYS_NICE still in P(inh), contradicting the comment's stated intent ("drop the SYS_NICE ambient cap"). Trivial fix: append --inh-caps -all to the setpriv invocation.
Extended reasoning...
What's leaking
The wrapper script on handler.go:102-105 does:
exec /usr/bin/chrt --other 0 /usr/bin/setpriv --ambient-caps -all -- /usr/bin/nice -n N USER_CMD
The stated intent (from the diff comment) is to "drop the SYS_NICE ambient cap, then apply nice". But after this chain runs, the final user process actually inherits CAP_SYS_NICE in its inheritable capability set.
Why
When SysProcAttr.AmbientCaps is non-empty, Go does not just raise ambient. Looking at /usr/local/go/src/syscall/exec_linux.go forkAndExecInChild1 (lines 519–524):
// Add the c capability to the permitted and inheritable capability mask,
// otherwise we will not be able to add it to the ambient capability mask.
caps.data[capToIndex(c)].permitted |= capToMask(c)
caps.data[capToIndex(c)].inheritable |= capToMask(c)It then calls capset() and only afterwards PR_CAP_AMBIENT_RAISE. That's required by the kernel — you can't raise ambient unless the cap is already in both permitted and inheritable. So by the time /bin/sh exec's, P(perm) and P(inh) both contain CAP_SYS_NICE.
Across execve (capabilities(7)): P'(inheritable) = P(inheritable) — the inheritable set is preserved, only permitted/effective get recomputed against file caps.
setpriv --ambient-caps -all per its man page issues PR_CAP_AMBIENT_LOWER for each capability; it does not call capset() and so does not touch the inheritable set. So after the entire sh → chrt → setpriv → nice → user_cmd chain, the user process still has CAP_SYS_NICE in P(inh).
Step-by-step proof
- envd has CAP_SYS_NICE in P(perm) (runs as root, full bounding set).
- envd forks
/bin/shwithAmbientCaps=[CAP_SYS_NICE]. Go raises permitted+inheritable, then ambient. After execve/bin/shhas: P(perm)={SYS_NICE}, P(inh)={SYS_NICE}, P(amb)={SYS_NICE}. /bin/shexec'schrt— caps unchanged through execve except ambient bits get promoted into permitted+effective per ambient rules. Still P(inh)={SYS_NICE}.chrtexec'ssetpriv --ambient-caps -all. setpriv clears the ambient set → P(amb)={}. But P(inh) is untouched → P(inh)={SYS_NICE}.- setpriv exec's
nice, then user_cmd. Across each execve, P(inh) is preserved. Final state for the user command: P(inh)={SYS_NICE}, P(amb)={}, P(perm)/P(eff) computed from F(inh)/F(perm) of the user binary.
Verifiable from a shell inside the sandbox with grep ^Cap /proc/self/status — CapInh will be 0000000000800000 (bit 23 = CAP_SYS_NICE) instead of 0000000000000000.
Impact (why this is a nit, not a real bug)
For ordinary binaries with no file capabilities, P'(permitted) = (F(perm) & bset) | (F(inh) & P(inh)) | P(amb) yields zero — F(inh) is empty, so the inheritable leak grants nothing. Exploitation requires a binary with CAP_SYS_NICE+ie file caps, and:
- The sandbox user can't install one (they lack
CAP_SETFCAP). - No standard Linux distribution ships one.
So in practice the leak is inert. I'm filing this as a nit rather than normal: it's a defense-in-depth refinement that matches the comment's stated intent, not a user-visible escalation path. Acknowledging the refutation: yes, the refuter is correct that exploitability is essentially zero in the default sandbox, which is why I'm not arguing this blocks the merge.
Fix
-/usr/bin/setpriv --ambient-caps -all -- /usr/bin/nice -n %d "${@}"
+/usr/bin/setpriv --ambient-caps -all --inh-caps -all -- /usr/bin/nice -n %d "${@}"Optionally also clear CAP_SYS_NICE from the bounding set (--bounding-set -all or scoped) for a tighter posture. With --inh-caps -all, P(inh) on the final user process is empty, matching the wrapper's stated intent.
|
Closing — not following up. SCHED_FIFO/AmbientCapabilities approach turned out to be CI-flaky and the real leverage in this area is #2688 (cgroup freeze) + the customer-validated A11 (Async io_engine), not SCHED_FIFO. May revisit if there's evidence envd is CPU-starved even with #2688 frozen-customer. |
envd at SCHED_FIFO priority 1; user processes reset to SCHED_OTHER via wrapper + AmbientCaps[CAP_SYS_NICE] + setpriv. socat keeps RT intentionally.