Commit 21bda5e

feat(sandbox): spike Landlock non-docker sandbox PoC (MCP-34.1)
Spike gating MCP-34 (non-docker isolation mode). Proves an unprivileged, non-Docker confinement primitive for stdio MCP servers on snap-docker Ubuntu 24.04, where Docker fails under AppArmor (MCPX_DOCKER_SNAP_APPARMOR, GH #71).

- internal/sandbox: Landlock LSM fs allowlist via raw x/sys/unix syscalls (no new dependency) + setrlimit, ABI best-effort down-masking, fail-closed cross-platform stub.
- Enforcement test re-execs a confined child: denies a path outside the allowlist, permits the RW allowlist, preserves stdin/stdout JSON-RPC framing. Runs in CI on ubuntu-latest (= Ubuntu 24.04).
- Recommendation doc resolves plan decisions D2 (Landlock + rlimits + best-effort uid/gid; userns/bwrap deprioritized) and D3 (scanner degradation).

Related MCP-3232, MCP-34
1 parent 3317e9d commit 21bda5e

6 files changed

Lines changed: 663 additions & 0 deletions

docs/development/sandbox-spike-mcp-34.md

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
# Spike: non-Docker sandbox mechanism for snap-docker Ubuntu 24.04 (MCP-3232)

**Status:** recommendation · gates [MCP-34](../../) (Non-Docker isolation mode) · resolves design decisions **D2** and **D3** in the [MCP-34 plan](../../).
**Author:** BackendEngineer · **PoC:** `internal/sandbox/` (this branch).

## TL;DR

Use the **Linux Landlock LSM** (kernel 5.13+) for the writable-scope filesystem
allowlist, plus **`setrlimit`** for resource caps, plus **best-effort
`SysProcAttr.Credential{Uid,Gid}`** for uid/gid drop. **Deprioritize user
namespaces / bubblewrap**: they are blocked by default on the exact hosts we
target. This matches the plan's D2 assumption, and the spike confirms it with a
working PoC.

The PoC (`internal/sandbox`) proves the load-bearing claim from a Go process:
Landlock **denies a path outside the allowlist, permits the allowlisted
read-write subtree, and preserves raw stdin/stdout JSON-RPC framing**, all
without user namespaces, so it is unaffected by
`kernel.apparmor_restrict_unprivileged_userns=1`.

## Why Docker fails on these hosts (reproduction target)

On Ubuntu hosts where Docker is installed via **snap**, AppArmor's profile
transition conflicts with the security flags the scanner sandbox requires
(`--security-opt no-new-privileges` + a pinned AppArmor profile), so in-container
commands fail with *operation not permitted*
(`MCPX_DOCKER_SNAP_APPARMOR`, GH #71; the related systemd/snap-confine variant is
already detected by `cmd/mcpproxy/doctor_env_snapdocker.go`, repo issue #457).
The workarounds available today (remove snap docker, disable the scanner, or
disable isolation) are all adoption blockers. We need an isolation path that
does **not** depend on Docker or on a primitive that AppArmor blocks.

## Candidate mechanisms compared

| Mechanism | Unprivileged? | Blocked by Ubuntu 24.04 AppArmor userns restriction? | FS write-allowlist | rlimits | uid/gid drop | Verdict |
|---|---|---|---|---|---|---|
| **Landlock LSM** (5.13+) | ✅ yes | **no** (needs no userns) | ✅ path-beneath allowlist | n/a (pair with setrlimit) | ❌ no (orthogonal) | **Chosen** |
| **user namespaces / bubblewrap** | ✅ yes (in principle) | ⚠️ **yes by default**: `apparmor_restrict_unprivileged_userns=1` blocks `unshare(CLONE_NEWUSER)` unless a per-binary AppArmor profile grants `userns` | ✅ via bind mounts | | ✅ (maps uid in the ns) | **Deprioritized** |
| **`setpriv` + `setrlimit` only** | ✅ yes | ❌ no | ❌ none | | ❌ no (needs CAP_SETUID) | **Floor / fallback** |

### Landlock: chosen (resolves D2)

- **Unprivileged and userns-free.** Landlock confines the calling thread/process
  via three syscalls (`landlock_create_ruleset`, `landlock_add_rule`,
  `landlock_restrict_self`); it requires **no** user or mount namespace, so the
  Ubuntu 23.10+/24.04 `apparmor_restrict_unprivileged_userns=1` default, which
  is exactly what breaks bubblewrap on our target hosts, **does not apply**.
  Confirmed by Chromium's and Ubuntu's own guidance (sources below).
- **Inherited across `exec`.** A Landlock domain is preserved across `execve`
  and applied to every descendant; a child can only *further* restrict itself,
  never escape. That makes the integration a tiny **re-exec wrapper**: lock the
  OS thread, `Apply()` the ruleset, then `exec` the untrusted `npx`/`uvx`
  command. The proxy keeps owning the child's raw stdin/stdout pipes (D1: native
  launcher, not `process-compose`).
- **Best-effort across kernels.** `landlock_create_ruleset(NULL,0,VERSION)`
  reports the supported ABI; we mask the handled access rights down to that ABI
  so the same binary degrades cleanly from 6.10 (ABI 5) to 5.13 (ABI 1).
  Ubuntu 24.04 ships kernel 6.8 → **ABI 4** (adds TCP bind/connect; FS rights
  fully covered). See `handledAccessFS` in `internal/sandbox/sandbox_linux.go`.
- **No new dependency.** `golang.org/x/sys/unix v0.46` (already a direct
  dependency) ships the `SYS_LANDLOCK_*` numbers and the `LandlockRulesetAttr` /
  `LandlockPathBeneathAttr` types. The PoC calls the raw syscalls, which
  satisfies the repo's "avoid new dependencies" rule. (For the full build, the
  maintained `github.com/landlock-lsm/go-landlock` library, which also solves
  Go's multi-thread `restrict_self` caveat, is a reasonable alternative; the
  re-exec wrapper sidesteps that caveat by `exec`-ing a single-threaded image
  immediately after `Apply`.)

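The ABI down-masking described above can be sketched as follows. This is a minimal illustration, not the PoC's `handledAccessFS`: the function name `maskHandledAccessFS` and the locally declared constants are assumptions made so the example compiles with the standard library alone. The bit positions mirror the kernel's `LANDLOCK_ACCESS_FS_*` values, and the masking matters because the kernel rejects a ruleset that declares rights it does not know.

```go
package main

import "fmt"

// Bit values mirror LANDLOCK_ACCESS_FS_* from the kernel UAPI header.
// They are declared locally (an assumption of this sketch) so the example
// does not depend on golang.org/x/sys/unix.
const (
	accessFSABI1     uint64 = 1<<13 - 1 // bits 0..12: EXECUTE through MAKE_SYM
	accessFSRefer    uint64 = 1 << 13   // added in ABI 2
	accessFSTruncate uint64 = 1 << 14   // added in ABI 3
	accessFSIoctlDev uint64 = 1 << 15   // added in ABI 5
)

// maskHandledAccessFS (hypothetical name) returns the filesystem rights a
// ruleset may declare as handled on a kernel reporting the given ABI.
func maskHandledAccessFS(abi int) uint64 {
	if abi < 1 {
		return 0 // Landlock unavailable: nothing can be handled.
	}
	mask := accessFSABI1
	if abi >= 2 {
		mask |= accessFSRefer
	}
	if abi >= 3 {
		mask |= accessFSTruncate
	}
	// ABI 4 adds only TCP bind/connect rights, so the FS mask is unchanged.
	if abi >= 5 {
		mask |= accessFSIoctlDev
	}
	return mask
}

func main() {
	for _, abi := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("ABI %d -> handled FS rights %#x\n", abi, maskHandledAccessFS(abi))
	}
}
```

Keeping the mask a pure function of the ABI number makes it unit-testable on any platform, which is presumably why the PoC covers it with a dedicated test.
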
### user namespaces / bubblewrap: deprioritized (confirms D2)

Bubblewrap builds its sandbox with `unshare(CLONE_NEWUSER)`. Ubuntu 23.10+
sets `kernel.apparmor_restrict_unprivileged_userns=1` by default, which **blocks
unprivileged userns creation unless the program has an AppArmor profile granting
the `userns` permission**. bubblewrap ships such a profile in recent Ubuntu, but
a *custom Go binary* spawning userns would be denied on 24.04 out of the box,
i.e. the userns-first design risks being blocked on the very hosts we target.
This is the same failure class that breaks Docker-snap; choosing it would trade
one AppArmor block for another. Deprioritized.

### setpriv + setrlimit only: the floor

No filesystem allowlist at all, only resource caps and (with privilege)
capability/uid drop. Useful as a graceful fallback when Landlock is unavailable
(kernel < 5.13 or LSM disabled), but it does **not** meet the "writable-scope
allowlist" exit criterion on its own. The PoC applies `setrlimit` independently
of Landlock so this floor is always available.

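As a rough illustration of the rlimit floor, the sketch below applies a list of limits with the standard library's `syscall.Setrlimit` (Linux/Unix only). The `rlimit` struct copies the field layout of the PoC's `Rlimit` type; `applyRlimits` and `currentNoFile` are hypothetical helpers invented for this example and may differ from what the PoC does internally.

```go
package main

import (
	"fmt"
	"syscall"
)

// rlimit mirrors the shape of the PoC's Rlimit type (Resource, Cur, Max).
type rlimit struct {
	Resource int
	Cur, Max uint64
}

// applyRlimits (hypothetical helper) applies each limit via setrlimit(2) and
// returns how many were set before the first failure, if any.
func applyRlimits(limits []rlimit) (int, error) {
	set := 0
	for _, l := range limits {
		lim := syscall.Rlimit{Cur: l.Cur, Max: l.Max}
		if err := syscall.Setrlimit(l.Resource, &lim); err != nil {
			return set, fmt.Errorf("setrlimit(%d): %w", l.Resource, err)
		}
		set++
	}
	return set, nil
}

// currentNoFile (hypothetical helper) reads the current RLIMIT_NOFILE values
// so the demo can re-apply them without changing anything.
func currentNoFile() (rlimit, error) {
	var cur syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &cur); err != nil {
		return rlimit{}, err
	}
	return rlimit{Resource: syscall.RLIMIT_NOFILE, Cur: cur.Cur, Max: cur.Max}, nil
}

func main() {
	lim, err := currentNoFile()
	if err != nil {
		fmt.Println("getrlimit:", err)
		return
	}
	n, err := applyRlimits([]rlimit{lim})
	fmt.Println("rlimits set:", n, "err:", err)
}
```

Because rlimits are inherited across `fork` and `exec`, setting them in the wrapper before the `exec` is enough for them to bind the untrusted server and its descendants.
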
## Honest limits (must be documented; D2 caveat)

- **No uid/gid separation without privilege.** Landlock restricts *paths*, not
  *identity*. The confined process runs as the **same uid** as mcpproxy; it can
  still touch anything that uid owns *within the allowlist*. Real uid/gid drop
  needs `SysProcAttr.Credential{Uid,Gid}`, which requires root / `CAP_SETUID`
  (server edition under systemd, not the unprivileged desktop case). **Do not
  overclaim Docker parity on uid/gid** for the unprivileged desktop case: set
  it best-effort and surface the limitation.
- **Filesystem + (on ABI 4+) TCP only.** Landlock does not restrict arbitrary
  syscalls (that is seccomp), nor does it provide PID/IPC/network namespaces. A
  confined process can still see `/proc`, signal same-uid processes, and open
  arbitrary network sockets (from kernel 6.7 / ABI 4, Landlock can restrict TCP
  bind/connect only). Pair with seccomp + `setrlimit(RLIMIT_NPROC)` for
  defense-in-depth in a later iteration; out of scope for this spike.
- **Allowlist must include the loader + interpreter.** `exec`-ing `npx`/`uvx`
  needs read+execute on the binary, its `node`/`python` runtime, and the shared
  libraries (`/usr`, `/lib`, `/lib64`, …). The launcher must compute and grant
  these RO paths or the child fails to start. (The PoC test grants a generous
  system RO set to demonstrate this.)

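The "best-effort uid/gid" rule above could look roughly like this. It is a sketch under stated assumptions, not code from the PoC: `dropCredential` is a hypothetical helper, the root check is a simplification (a non-root process holding `CAP_SETUID`/`CAP_SETGID` would also qualify), and uid/gid 65534 is only the conventional `nobody` account on many distributions.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// dropCredential (hypothetical helper) returns the credential to request for
// the child, or nil plus an explanatory note when the caller is unprivileged.
// Simplification: only euid 0 is treated as privileged.
func dropCredential(euid int, uid, gid uint32) (*syscall.Credential, string) {
	if euid != 0 {
		return nil, "uid/gid drop skipped: not root, child runs with the proxy's uid"
	}
	return &syscall.Credential{Uid: uid, Gid: gid}, ""
}

func main() {
	cmd := exec.Command("id")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr

	// 65534 is "nobody" on many distributions (an assumption of this demo).
	cred, note := dropCredential(os.Geteuid(), 65534, 65534)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true, Credential: cred}
	if note != "" {
		fmt.Println(note) // surface the limitation instead of hiding it
	}
	if err := cmd.Run(); err != nil {
		fmt.Println("run:", err)
	}
}
```

Returning the note alongside the credential keeps the limitation visible to whatever reporting path the launcher uses, in line with the "do not overclaim" guidance.
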
## What the PoC proves vs. what still needs the host

**Proven by `internal/sandbox` (runs in CI on `ubuntu-latest` = Ubuntu 24.04;
see `.github/workflows/unit-tests.yml`, `go test -race ./...`):**

- `TestLandlockEnforcesFilesystemAllowlist`: re-execs a confined child that
  (1) echoes stdin→stdout (JSON-RPC framing survives), (2) reads+writes inside
  the RW allowlist, (3) is **denied** a secret path outside it. Exit-code
  assertions; skips gracefully if the kernel lacks Landlock.
- `TestHandledAccessFSMasksByABI`: ABI down-masking is correct.
- Cross-platform stub (`sandbox_other.go`) keeps macOS/Windows building with a
  documented no-op / fail-closed `ErrUnsupported`.

**Still requires a real snap-docker Ubuntu 24.04 host (deferred to MCP-34
child issues #3/#4, where the spawn branch lands):**

- End-to-end launch of an actual `npx` and `uvx` MCP server under the wrapper
  (the PoC proves the *primitive* + passthrough; the server-specific RO
  allowlist tuning is launcher work).
- Reproducing the `MCPX_DOCKER_SNAP_APPARMOR` Docker failure side-by-side to
  show the Landlock path succeeds where Docker-snap fails. (By construction
  Landlock is unaffected by the AppArmor userns restriction, so it is expected
  to work; this is the empirical confirmation step.)

## Recommendation for the D3 scanner question

The scanner *plugin* runtime is Docker-based (Spec 039) and is the broken path
on snap-docker hosts. **Recommend D3 option (b): clean, surfaced degradation.**
Run isolated stdio servers under the Landlock `sandbox` launcher, and when
`isolation.mode: sandbox` is active on a host where the Docker scanner cannot
run, **skip the Docker scanner pre-flight and surface a health-degraded warning**
(via the unified `health` field + a `doctor` check, mirroring
`doctor_env_snapdocker.go`). A native non-Docker scanner path (option a) is a
larger effort and can follow once the sandbox launcher exists; degradation
unblocks adoption now and is testable on the snap-docker host. Final call sits
with the scanner child issue (MCP-34 #4).

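One possible shape for the option (b) decision is sketched below. Every name here (`scannerDecision`, `decideScanner`, the warning text) is hypothetical, and the behaviour outside `sandbox` mode is an assumption: this sketch simply leaves today's pre-flight in place there.

```go
package main

import "fmt"

// scannerDecision (hypothetical type) is what the pre-flight step would act on.
type scannerDecision struct {
	RunDockerScanner bool
	HealthWarning    string // empty when nothing is degraded
}

// decideScanner (hypothetical helper) encodes D3 option (b): in sandbox mode,
// when the Docker scanner cannot run, skip it and surface a warning rather
// than failing the server launch.
func decideScanner(isolationMode string, dockerScannerUsable bool) scannerDecision {
	if dockerScannerUsable {
		return scannerDecision{RunDockerScanner: true}
	}
	if isolationMode == "sandbox" {
		return scannerDecision{
			HealthWarning: "security scanner skipped: Docker scanner unavailable on this host (isolation.mode=sandbox)",
		}
	}
	// Assumption: other modes keep the existing behaviour and still attempt
	// the Docker scanner pre-flight.
	return scannerDecision{RunDockerScanner: true}
}

func main() {
	fmt.Printf("%+v\n", decideScanner("sandbox", false))
	fmt.Printf("%+v\n", decideScanner("docker", true))
}
```
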
## Proposed integration shape (for MCP-34 #2/#3, not built here)

- Config: `isolation.mode: "docker" | "sandbox" | "none"` (global + per-server),
  back-compat-mapped from today's `Enabled`/`DockerIsolation`. New
  `config.Config`/`ServerConfig` fields ⇒ register in
  `TestSaveServerSyncFieldCoverage` `expectedFields` and run `make swagger`
  (a gotcha from prior work).
- Spawn: a fourth branch in `connectStdio` / `buildLauncherCmd` alongside the
  existing docker-isolation / user-`docker run` / shell-wrap branches. On Linux
  with `mode: sandbox`, route through a `mcpproxy sandbox-exec`-style re-exec
  wrapper that calls `sandbox.Apply(spec)` then `exec`s the resolved command;
  reuse the existing `SysProcAttr{Setpgid:true}` process-group cleanup
  (`process_unix.go`). macOS/Windows = documented no-op → effective `none`.
- `Spec` is already shaped for this: `ReadOnlyPaths` (loader/runtime/binary),
  `ReadWritePaths` (working dir, cache, `/tmp` scope), `Rlimits`, `BestEffort`.

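The platform rule in the spawn bullet (sandbox on Linux, effective `none` elsewhere) might be resolved with a small helper like the one below. `effectiveMode` is a hypothetical name; the three mode strings come from the proposal above, and rejecting unknown values is an assumption of this sketch. The back-compat mapping from `Enabled`/`DockerIsolation` is deliberately left out because its semantics are not specified here.

```go
package main

import (
	"fmt"
	"runtime"
)

// effectiveMode (hypothetical helper) resolves the configured isolation.mode
// against the host OS: "sandbox" is only honoured on Linux and degrades to
// "none" elsewhere, matching the documented no-op on macOS/Windows.
func effectiveMode(mode, goos string) (string, error) {
	switch mode {
	case "docker", "none":
		return mode, nil
	case "sandbox":
		if goos != "linux" {
			return "none", nil
		}
		return "sandbox", nil
	default:
		// Assumption: unknown values are rejected rather than silently ignored.
		return "", fmt.Errorf("unknown isolation.mode %q", mode)
	}
}

func main() {
	mode, err := effectiveMode("sandbox", runtime.GOOS)
	fmt.Println("effective mode:", mode, "err:", err)
}
```

Passing `goos` as a parameter instead of reading `runtime.GOOS` inside the helper keeps the degradation path testable from any CI platform.
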
## Sources

- Linux kernel: Landlock (no userns required; ABI versions):
  https://docs.kernel.org/userspace-api/landlock.html
- Ubuntu: Restricted unprivileged user namespaces (default `=1` since 23.10):
  https://ubuntu.com/blog/ubuntu-23-10-restricted-unprivileged-user-namespaces
- Chromium docs: AppArmor userns restrictions vs. Landlock fallback (Landlock
  works where bwrap/userns is blocked):
  https://chromium.googlesource.com/chromium/src/+/main/docs/security/apparmor-userns-restrictions.md
- bubblewrap blocked on Ubuntu 24.04 by AppArmor userns restriction:
  https://github.com/microsoft/vscode/issues/316046
- go-landlock (library + multi-thread `restrict_self` caveat + `landlock-restrict`
  re-exec example): https://github.com/landlock-lsm/go-landlock
- Repo: `golang.org/x/sys/unix v0.46` Landlock primitives (`go.mod`);
  snap-docker detection `cmd/mcpproxy/doctor_env_snapdocker.go` (issue #457);
  MCP-34 plan decisions D1–D3 (Paperclip plan doc).

internal/sandbox/sandbox.go

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
// Package sandbox is a spike (MCP-3232) proving an *unprivileged*, non-Docker
// confinement primitive for stdio MCP servers on hosts where Docker is
// unavailable or broken, notably Ubuntu 24.04 with snap-installed Docker under
// AppArmor, where the deb's systemd hardening fights snap-confine (see
// internal repo issue #457 / cmd/mcpproxy/doctor_env_snapdocker.go).
//
// The mechanism chosen by the spike is the Linux Landlock LSM (kernel 5.13+).
// Unlike user-namespace / bubblewrap sandboxes, Landlock does NOT require
// unprivileged user namespaces and is therefore NOT blocked by
// `kernel.apparmor_restrict_unprivileged_userns=1`, which Ubuntu 23.10+/24.04
// enable by default. See docs/development/sandbox-spike-mcp-34.md for the full
// recommendation and the honest limits (no uid/gid separation without
// privilege, filesystem-allowlist + rlimits only).
//
// This package is intentionally minimal: it confines the *current* process and,
// because Landlock domains are inherited across execve, every child it then
// execs (the npx/uvx server and its descendants). The intended integration is a
// tiny re-exec wrapper that calls Apply and then execs the untrusted command;
// the package test exercises exactly that shape.
package sandbox

import "errors"

// ErrUnsupported is returned by Apply on platforms or kernels that do not
// provide the requested confinement primitive (e.g. non-Linux, or a kernel
// without Landlock). Callers that want fail-open behaviour set Spec.BestEffort.
var ErrUnsupported = errors.New("sandbox: confinement primitive unavailable on this platform/kernel")

// Spec describes the confinement to apply to the current process before it
// execs an untrusted stdio MCP server. The zero value applies no filesystem
// restriction (only rlimits, if any).
type Spec struct {
	// ReadOnlyPaths are filesystem subtrees the confined process may read and
	// execute. Anything not covered by a ReadOnlyPaths/ReadWritePaths entry is
	// denied. Missing paths are skipped (best-effort) and noted in the Report.
	ReadOnlyPaths []string

	// ReadWritePaths are subtrees the confined process may read, execute, and
	// write (create/delete/truncate within).
	ReadWritePaths []string

	// Rlimits are resource limits applied via setrlimit before confinement.
	Rlimits []Rlimit

	// BestEffort, when true, downgrades "primitive unavailable" from an error
	// to a no-op recorded in Report.LandlockNote, mirroring go-landlock's
	// BestEffort semantics. When false (the default, fail-closed), Apply
	// returns ErrUnsupported if Landlock cannot be enforced: a security
	// boundary should fail closed rather than silently run unconfined.
	BestEffort bool
}

// Rlimit is a single setrlimit request. Resource is one of the unix.RLIMIT_*
// constants (e.g. RLIMIT_AS, RLIMIT_NOFILE, RLIMIT_NPROC, RLIMIT_CPU).
type Rlimit struct {
	Resource int
	Cur      uint64
	Max      uint64
}

// Report records what Apply actually enforced, so callers can log an honest
// account of the confinement (important because Apply is best-effort across
// kernels with different Landlock ABI levels).
type Report struct {
	// LandlockABI is the kernel's reported Landlock ABI version that was
	// enforced (>=1), 0 if Landlock was not requested, or -1 if Landlock is
	// unavailable on this kernel/platform.
	LandlockABI int
	// LandlockNote is a human-readable note (e.g. why Landlock was skipped, or
	// which paths were missing).
	LandlockNote string
	// RlimitsSet is the number of rlimits successfully applied.
	RlimitsSet int
	// NoNewPrivs reports whether PR_SET_NO_NEW_PRIVS was set (always true when
	// Landlock is enforced; Landlock requires it).
	NoNewPrivs bool
}

// wantsLandlock reports whether the spec asks for any filesystem confinement.
func (s Spec) wantsLandlock() bool {
	return len(s.ReadOnlyPaths) > 0 || len(s.ReadWritePaths) > 0
}
