Retry shim start without userns on clone failure#150
Retry shim start without userns on clone failure#150dmcgowan wants to merge 1 commit intocontainerd:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR makes shim startup more resilient on Linux by retrying Start() without user/mount namespace clone flags when the initial start fails after enabling user namespaces, allowing the shim to degrade to no mount isolation instead of failing the container start.
Changes:
- Change
cloneMntNsto return a boolean indicating whether userns/mntns clone flags were applied. - On Unix, retry shim
cmd.Start()without namespace isolation when the first start fails andcloneMntNshad enabled user namespaces. - Update non-Linux stub implementation to match the new
cloneMntNssignature.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/shim/manager/mount_other.go | Update non-Linux cloneMntNs stub to return bool (always false). |
| internal/shim/manager/mount_linux.go | Return whether userns/mntns clone flags were applied so callers can distinguish the path taken. |
| internal/shim/manager/manager_unix.go | Retry shim start without namespace isolation when a userns-enabled start fails; add logging for the fallback. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[0].f) | ||
| if opts.Debug && len(sockets) > 1 { | ||
| cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[1].f) |
There was a problem hiding this comment.
The retry path rebuilds cmd.ExtraFiles by indexing into sockets (always appending sockets[0] and conditionally sockets[1]). This is tightly coupled to the current socket creation logic and will silently break if additional sockets are ever added (or ordering changes). Consider rebuilding ExtraFiles by iterating over the existing sockets slice (or factoring the socket-attachment logic into a helper) so the retry stays correct as the surrounding code evolves.
| cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[0].f) | |
| if opts.Debug && len(sockets) > 1 { | |
| cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[1].f) | |
| for _, socket := range sockets { | |
| cmd.ExtraFiles = append(cmd.ExtraFiles, socket.f) |
| userns := cloneMntNs(ctx, cmd) | ||
|
|
||
| if err := cmd.Start(); err != nil { | ||
| return params, err | ||
| if !userns { | ||
| return params, err |
There was a problem hiding this comment.
The new fallback behavior (retrying cmd.Start() without namespace isolation when a userns-enabled start fails) isn’t covered by current tests. Consider adding an integration test that forces the first cmd.Start() to fail after cloneMntNs sets userns flags and asserts the retry succeeds and stdout remains parseable JSON.
ccfb76d to
c010af3
Compare
cloneMntNs sets CLONE_NEWUSER|CLONE_NEWNS on the child, but clone can fail for reasons the proactive AppArmor sysctl check cannot detect — seccomp filters, other LSM policies, or EACCES when inherited socket fds cross the user namespace boundary after exec triggers capability recomputation. Return whether namespace flags were set so the caller can distinguish a namespace-related Start failure from an unrelated one. On failure, rebuild the command without clone flags and retry, degrading gracefully to no mount isolation rather than failing the container start entirely. Signed-off-by: Derek McGowan <derek@mcg.dev>
c010af3 to
f68d962
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // unprivileged user namespaces), the function logs a warning and the shim | ||
| // will run without mount isolation. |
There was a problem hiding this comment.
The header comment says the function "logs a warning" when namespace creation is not possible, but the implementation explicitly avoids logging (and now returns false). Please update the comment to match actual behavior (e.g., "skips mount isolation" / "returns false") to avoid misleading future readers.
| // unprivileged user namespaces), the function logs a warning and the shim | |
| // will run without mount isolation. | |
| // unprivileged user namespaces), the shim runs without mount isolation | |
| // and this function returns false. |
| if err := cmd.Start(); err != nil { | ||
| return params, err | ||
| if !userns { | ||
| return params, err | ||
| } | ||
| // clone(CLONE_NEWUSER) can fail for reasons not covered by the | ||
| // proactive AppArmor check — e.g. seccomp filters, LSM policies, | ||
| // or EACCES from the child's capability recomputation when | ||
| // inherited socket fds cross the user namespace boundary after | ||
| // exec. Retry without namespace isolation rather than failing | ||
| // the container start. | ||
| // | ||
| // Note: we cannot log here — during "start" the logger output | ||
| // goes to stderr which containerd captures as part of the | ||
| // bootstrap response (CombinedOutput), corrupting the JSON. | ||
| cmd, err = newCommand(ctx, id, opts.Address, opts.TTRPCAddress, opts.Debug) | ||
| if err != nil { | ||
| return params, err | ||
| } | ||
| cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[0].f) | ||
| if opts.Debug && len(sockets) > 1 { | ||
| cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[1].f) | ||
| } | ||
| if err := cmd.Start(); err != nil { | ||
| return params, err | ||
| } |
There was a problem hiding this comment.
If the retry without userns also fails, the returned error currently only reflects the second failure, losing the original clone-related error which is often the key diagnostic. Preserve the first Start() error and wrap/aggregate it with the retry error so users can see both causes.
| if err := cmd.Start(); err != nil { | ||
| return params, err | ||
| if !userns { | ||
| return params, err | ||
| } | ||
| // clone(CLONE_NEWUSER) can fail for reasons not covered by the | ||
| // proactive AppArmor check — e.g. seccomp filters, LSM policies, | ||
| // or EACCES from the child's capability recomputation when | ||
| // inherited socket fds cross the user namespace boundary after | ||
| // exec. Retry without namespace isolation rather than failing | ||
| // the container start. |
There was a problem hiding this comment.
New fallback behavior (retrying Start() without user+mount namespaces after a clone failure) is not covered by existing integration tests. Please add a test that forces clone(CLONE_NEWUSER) to fail (e.g., via a seccomp filter/LSM simulation in test harness if available) and asserts Start still succeeds and returns valid bootstrap params.
cloneMntNs sets CLONE_NEWUSER|CLONE_NEWNS on the child, but clone can fail for reasons the proactive AppArmor sysctl check cannot detect — seccomp filters, other LSM policies, or EACCES when inherited socket fds cross the user namespace boundary after exec triggers capability recomputation.
Return whether namespace flags were set so the caller can distinguish a namespace-related Start failure from an unrelated one. On failure, rebuild the command without clone flags and retry, degrading gracefully to no mount isolation rather than failing the container start entirely.