docs: add OSEP-0013 Isolated Execution API #989
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid CRI nested user namespace dependency. Use setpriv for real UID isolation instead of bwrap --unshare-user + --uid/--gid mapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b08d1fe9a8
| 8. extra_writable segment: --bind per item | ||
| 9. Env segment: --clearenv (deny mode) then --setenv allowlist / passthrough minus blacklist | ||
| 10. --seccomp <fd> | ||
| 11. -- setpriv --reuid=<n> --regid=<n> --init-groups <user cmd> |
Clear groups for arbitrary numeric UIDs
This launch sequence contradicts the earlier requirement that callers can choose numeric UIDs without `/etc/passwd` entries: `setpriv --init-groups` requires the target user to be resolvable, so sessions for allocated UIDs that are not in the image passwd database fail before the user command starts. I checked `setpriv --help` for the relevant option (`--init-groups`: initialize supplementary groups) and verified the failure mode with `setpriv --reuid=123456 --regid=123456 --init-groups true` (`uid 123456 not found`); use `--clear-groups`/explicit groups or require passwd provisioning instead.
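A minimal sketch of the suggested fallback, assuming the launcher can check the passwd database before building the final segment. The helper names `uid_in_passwd` and `setpriv_argv` are hypothetical; only the `setpriv` flags come from the comment above.

```python
import pwd
from typing import Callable, List


def uid_in_passwd(uid: int) -> bool:
    """Return True if the UID resolves through the passwd database."""
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        return False


def setpriv_argv(uid: int, gid: int, user_cmd: List[str],
                 resolvable: Callable[[int], bool] = uid_in_passwd) -> List[str]:
    """Build the final setpriv segment of the launch command."""
    # --init-groups needs a passwd entry for the target UID;
    # --clear-groups works for arbitrary numeric UIDs.
    groups_flag = "--init-groups" if resolvable(uid) else "--clear-groups"
    return ["setpriv", f"--reuid={uid}", f"--regid={gid}", groups_flag, *user_cmd]
```

With this shape, an allocated UID such as 123456 that is absent from the image gets `--clear-groups`, while a provisioned user keeps its supplementary groups.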
| mount -t overlay overlay \ | ||
| -o lowerdir=<workspace>,upperdir=<upper>,workdir=<work> \ | ||
| /tmp/merged-<id> | ||
| rsync -aHAX --delete /tmp/merged-<id>/ <workspace>/ |
Avoid writing to lowerdir while mounted
In overlay sessions with persistence, this commit path mounts `<workspace>` as the overlay `lowerdir` and then rsyncs the merged view back into that same `lowerdir` before unmounting it. OverlayFS does not allow changing an underlying lower/upper tree while it is part of a mounted overlay, so commits that update or delete lower files can produce undefined or stale results; apply the upper after unmounting, or copy to a separate destination and swap it into place.
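The "copy to a separate destination and swap" option could look roughly like the sketch below. The function name, the `.staging`/`.old` sibling paths, and the `unmount` callback are assumptions for illustration. `shutil.copytree` also does not preserve ownership, hard links, or xattrs the way `rsync -aHAX` does, so a real commit path would need an equivalent copy step.

```python
import os
import shutil
from typing import Callable


def commit_via_staging(merged: str, workspace: str,
                       unmount: Callable[[], None]) -> None:
    """Copy the merged view aside, unmount, then swap it in as the workspace."""
    workspace = workspace.rstrip("/")
    staging = workspace + ".staging"  # hypothetical sibling paths
    backup = workspace + ".old"
    # 1. Read the merged view while the overlay is still mounted. The
    #    destination is outside lowerdir/upperdir, so no layer is modified.
    shutil.copytree(merged, staging, symlinks=True)
    # 2. Tear the overlay down before touching the old lowerdir.
    unmount()
    # 3. Swap the trees. This assumes the workspace is a plain directory
    #    (not itself a mount point) on the same filesystem as its siblings.
    os.rename(workspace, backup)
    os.rename(staging, workspace)
    shutil.rmtree(backup)
```

The old lower tree is only renamed and removed after the overlay is gone, so nothing under a mounted layer is ever written.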
| | Read (info/download) | Check upper first; whiteout → 404; upper miss → fall through lower | | ||
| | Search | Walk upper + lower, merge/deduplicate, skip whiteout paths | | ||
| | Write (upload/replace) | Write to upper path; `os.Chown(path, session.uid, session.gid)` | | ||
| | Delete | Create whiteout in upper (char device 0,0); directory: opaque xattr | |
Whiteout deleted lower directories
For overlay mode, using only an opaque xattr for directory deletion leaves an upper directory at that path; it hides lower children but the merged view still contains the directory as empty. When a caller deletes a lower-only directory, subsequent info/search and commit can still see or recreate that directory instead of treating it as removed, so directory deletion needs a whiteout for the directory path and opaque markers should be reserved for hiding lower children during copy-up/mkdir semantics.
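One way to express the delete rule described here is a small planner that returns the upper-layer operations for a path. This is a sketch only: `plan_delete` and the action names are hypothetical, and the caller is assumed to know whether the path exists in each layer.

```python
from typing import List, Tuple

Action = Tuple[str, str]  # (operation, path relative to the workspace)


def plan_delete(path: str, in_upper: bool, in_lower: bool) -> List[Action]:
    """Return the upper-layer operations that delete `path` from the merged view."""
    actions: List[Action] = []
    if in_upper:
        # Drop whatever the upper layer holds (a file or a whole directory tree).
        actions.append(("remove_upper", path))
    if in_lower:
        # A whiteout at the path itself hides the lower file or directory.
        # An opaque xattr alone would leave an empty directory visible.
        actions.append(("create_whiteout", path))
    return actions
```

Files and directories take the same branch, so a lower-only directory ends up hidden by a whiteout rather than replaced by an empty upper directory.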
| - execd must run as root or with `CAP_SYS_ADMIN` (required to create mount/PID | ||
| namespaces without user namespace). |
Require setuid/chown capabilities too
Allowing execd to run non-root with only `CAP_SYS_ADMIN` underprovisions the isolated path: bwrap may be able to create mount/PID namespaces, but `setpriv --reuid/--regid` for caller-selected UIDs and filesystem proxy `os.Chown` operations also require the corresponding UID/GID/chown privileges. In a Pod spec following this guidance, sessions with non-default uid/gid or overlay uploads will fail with permission errors, so the infrastructure requirement should include root or the full capability set needed for setuid/setgid/chown.
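A startup check along these lines could surface the gap early. The required set below (`CAP_CHOWN`, `CAP_SETGID`, `CAP_SETUID`, `CAP_SYS_ADMIN`) is an assumption drawn from this comment, not a confirmed list for execd; the bit numbers are the standard Linux capability numbers, and the function names are hypothetical.

```python
from typing import Dict, List

# Capability bit numbers as defined in <linux/capability.h>.
# The choice of which capabilities to require is an assumption.
REQUIRED_CAPS: Dict[str, int] = {
    "CAP_CHOWN": 0,
    "CAP_SETGID": 6,
    "CAP_SETUID": 7,
    "CAP_SYS_ADMIN": 21,
}


def missing_caps(cap_eff: int) -> List[str]:
    """Return the required capabilities absent from an effective-set bitmask."""
    return [name for name, bit in REQUIRED_CAPS.items()
            if not (cap_eff >> bit) & 1]


def effective_caps(status_path: str = "/proc/self/status") -> int:
    """Read the CapEff bitmask of the current process."""
    with open(status_path) as fh:
        for line in fh:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise RuntimeError("CapEff not found in " + status_path)
```

Calling `missing_caps(effective_caps())` at startup would let execd refuse isolated sessions, or report them as unavailable, when only `CAP_SYS_ADMIN` was granted.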
| execd runs probes at startup in order: | ||
| 1. `bwrap --version` — binary check | ||
| 2. `bwrap --ro-bind / / --unshare-pid -- true` — smoke test |
Probe overlay support before enabling sessions
These startup probes never exercise the `--overlay-src`/`--overlay` flags even though `strict` defaults to overlay mode. Bubblewrap overlay flags are only available in newer non-setuid builds, so an older embedded binary or an accidentally setuid build can pass this smoke test and make `/v1/isolated/capabilities` report `available = true`, but default session creation then fails when it tries to build the workspace overlay; add an overlay-specific probe and reflect failures in capabilities/defaults.
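A probe sequence with an extra overlay step might be sketched as follows. The first two commands are the ones quoted in the diff; the third is an assumed overlay probe that passes `--overlay-src` plus `--overlay RWSRC WORKDIR DEST` as described in the bubblewrap manual. The scratch directories, the `probe` function, and the result keys are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Sequence

Runner = Callable[[Sequence[str]], bool]


def run_ok(argv: Sequence[str]) -> bool:
    """Run a command quietly and report whether it exited with status 0."""
    try:
        done = subprocess.run(argv, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=10)
        return done.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def probe(lower: str, upper: str, work: str,
          run: Runner = run_ok) -> Dict[str, bool]:
    """Run the startup probes in order, stopping at the first failure."""
    base = ["bwrap", "--ro-bind", "/", "/", "--unshare-pid"]
    # Assumed overlay probe: mount `lower` with a writable upper over itself.
    overlay_args = ["--overlay-src", lower, "--overlay", upper, work, lower]
    binary = run(["bwrap", "--version"])
    smoke = binary and run(base + ["--", "true"])
    overlay = smoke and run(base + overlay_args + ["--", "true"])
    return {"available": smoke, "overlay": overlay}
```

Reporting `overlay` separately lets the capabilities endpoint keep `available = true` while steering defaults away from overlay mode when the flag is unsupported.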
| | run (within session) | Read | | ||
| | diff (option A) | Read | |
Snapshot diff against concurrent runs
Because both `run` and `diff` hold only read locks, a caller can request `GET .../diff` while the session command is actively mutating the same upper directory that the tar stream is walking. In that scenario the exported artifact is not a consistent point-in-time diff and can miss files, include partial writes, or fail during traversal, so `diff` should take an exclusive lock or snapshot/copy the upper before streaming.
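The snapshot variant can be sketched as below, assuming a per-session exclusive lock that `run` would also have to respect. `session_lock` and `snapshot_diff` are hypothetical names, and `shutil.copytree` only handles regular files, directories, and symlinks, so overlay whiteouts (character devices) would need separate handling in a real implementation.

```python
import io
import os
import shutil
import tarfile
import tempfile
import threading

session_lock = threading.Lock()  # stands in for the session's exclusive lock


def snapshot_diff(upper: str) -> bytes:
    """Return a tar archive built from a point-in-time copy of `upper`."""
    with tempfile.TemporaryDirectory() as tmp:
        snap = os.path.join(tmp, "upper")
        # Hold the exclusive lock only while copying, so runs are blocked
        # for the duration of the copy rather than the whole download.
        with session_lock:
            shutil.copytree(upper, snap, symlinks=True)
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            tar.add(snap, arcname=".")
        return buf.getvalue()
```

Writes that land in the upper directory after the copy finishes do not affect the archive, which is the consistency property the comment asks for.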
| if reset_upper_after: | ||
| rm -rf <upper>/* <work>/* |
Reset dotfiles in upper and work dirs
This reset command only expands non-hidden entries, so files such as `<upper>/.env`, hidden directories, and overlay metadata-style dot entries survive `reset_upper_after`. After a commit that claims to clear the upper/work trees, later runs or diffs can still observe those stale changes; use a deletion method that removes all children without relying on `*` glob expansion.
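A glob-free reset could iterate the directory entries directly, which includes dot entries. This is a sketch with a hypothetical helper name, `clear_dir`; it keeps the directory itself in place so the upper and work paths stay valid.

```python
import os
import shutil


def clear_dir(path: str) -> None:
    """Remove every child of `path`, including dot entries, but keep `path`."""
    with os.scandir(path) as it:
        entries = list(it)  # materialize first, then delete
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            # Regular files, symlinks, and special files such as whiteouts.
            os.unlink(entry.path)
```

Calling `clear_dir` on both the upper and the work directory would then replace the `rm -rf <upper>/* <work>/*` step.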
| | Search | Walk upper + lower, merge/deduplicate, skip whiteout paths | | ||
| | Write (upload/replace) | Write to upper path; `os.Chown(path, session.uid, session.gid)` | | ||
| | Delete | Create whiteout in upper (char device 0,0); directory: opaque xattr | | ||
| | Move | Source: create whiteout; destination: write to upper | |
Copy lower sources when moving overlay files
For a move where the source exists only in the lower workspace, creating a source whiteout is not enough to preserve the data at the destination; there is no upper file to rename, so "destination: write to upper" needs to copy the lower file or directory contents first. Without that copy-up step, filesystem proxy `mv` can turn a lower-only rename into deletion or an empty placeholder while hiding the original source.
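The copy-up step could be sketched like this. `overlay_move` and the injected `make_whiteout` callback are hypothetical (creating a real whiteout needs `mknod` privileges), and the sketch assumes the source is not already whited out and is not a directory present in both layers, which would need a recursive merge.

```python
import os
import shutil
from typing import Callable


def overlay_move(upper: str, lower: str, src: str, dst: str,
                 make_whiteout: Callable[[str], None]) -> None:
    """Move `src` to `dst` (workspace-relative) by editing the upper layer only."""
    up_src = os.path.join(upper, src)
    low_src = os.path.join(lower, src)
    up_dst = os.path.join(upper, dst)
    in_lower = os.path.lexists(low_src)
    os.makedirs(os.path.dirname(up_dst), exist_ok=True)
    if os.path.lexists(up_src):
        # The upper layer already holds the data: a plain rename is enough.
        os.rename(up_src, up_dst)
    elif in_lower:
        # Copy-up: materialize the lower content at the destination first.
        if os.path.isdir(low_src) and not os.path.islink(low_src):
            shutil.copytree(low_src, up_dst, symlinks=True)
        else:
            shutil.copy2(low_src, up_dst, follow_symlinks=False)
    else:
        raise FileNotFoundError(src)
    if in_lower:
        # Hide the lower source only after the destination holds the data.
        make_whiteout(up_src)
```

Ordering the whiteout last means a failed copy leaves the original source visible instead of turning the move into a deletion.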
| | Delete | Create whiteout in upper (char device 0,0); directory: opaque xattr | | ||
| | Move | Source: create whiteout; destination: write to upper | | ||
| | Permissions (chmod/chown) | Upper has file → direct; lower only → copy-up to upper first | | ||
| | mkdir | Create in upper; existing lower directory → create opaque marker | |
Preserve contents on mkdir of lower directories
The existing `/directories` API has `mkdir -p` semantics, so creating a directory that already exists in the lower workspace should not hide its current contents. Marking that path opaque makes the merged view show an empty upper directory instead, causing existing lower files to disappear from subsequent search/download and potentially from commit output.
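To illustrate the expected behavior, the sketch below creates the directory in the upper layer without any opaque marker and lists it as a non-opaque overlay would. Both helper names are hypothetical, and the listing ignores whiteouts for brevity.

```python
import os
from typing import List


def overlay_mkdir(upper: str, rel: str) -> None:
    """mkdir -p in the upper layer without marking the directory opaque."""
    os.makedirs(os.path.join(upper, rel), exist_ok=True)


def merged_listdir(upper: str, lower: str, rel: str) -> List[str]:
    """List `rel` as a non-opaque overlay directory: the union of both layers."""
    names = set()
    for layer in (upper, lower):
        path = os.path.join(layer, rel)
        if os.path.isdir(path):
            names.update(os.listdir(path))
    return sorted(names)
```

Because the upper directory carries no opaque marker, files that already exist under the same path in the lower workspace stay visible after the call.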
Summary
Design Highlights
- `/v1/isolated/*` API prefix, existing endpoints unchanged
- `strict`/`balanced`) with per-session overrides
- `setpriv` instead of user namespace mapping: stronger UID isolation, no CRI dependency
Test plan
🤖 Generated with Claude Code