Skip to content

Retry shim start without userns on clone failure#150

Draft
dmcgowan wants to merge 1 commit intocontainerd:mainfrom
dmcgowan:retry-on-userns-failure
Draft

Retry shim start without userns on clone failure#150
dmcgowan wants to merge 1 commit intocontainerd:mainfrom
dmcgowan:retry-on-userns-failure

Conversation

@dmcgowan
Copy link
Copy Markdown
Member

@dmcgowan dmcgowan commented Apr 7, 2026

cloneMntNs sets CLONE_NEWUSER|CLONE_NEWNS on the child, but clone can fail for reasons the proactive AppArmor sysctl check cannot detect — seccomp filters, other LSM policies, or EACCES when inherited socket fds cross the user namespace boundary after exec triggers capability recomputation.

Return whether namespace flags were set so the caller can distinguish a namespace-related Start failure from an unrelated one. On failure, rebuild the command without clone flags and retry, degrading gracefully to no mount isolation rather than failing the container start entirely.

Copilot AI review requested due to automatic review settings April 7, 2026 23:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes shim startup more resilient on Linux by retrying Start() without user/mount namespace clone flags when the initial start fails after enabling user namespaces, allowing the shim to degrade to no mount isolation instead of failing the container start.

Changes:

  • Change cloneMntNs to return a boolean indicating whether userns/mntns clone flags were applied.
  • On Unix, retry shim cmd.Start() without namespace isolation when the first start fails and cloneMntNs had enabled user namespaces.
  • Update non-Linux stub implementation to match the new cloneMntNs signature.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
internal/shim/manager/mount_other.go Update non-Linux cloneMntNs stub to return bool (always false).
internal/shim/manager/mount_linux.go Return whether userns/mntns clone flags were applied so callers can distinguish the path taken.
internal/shim/manager/manager_unix.go Retry shim start without namespace isolation when a userns-enabled start fails; add logging for the fallback.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +204 to +206
cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[0].f)
if opts.Debug && len(sockets) > 1 {
cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[1].f)
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry path rebuilds cmd.ExtraFiles by indexing into sockets (always appending sockets[0] and conditionally sockets[1]). This is tightly coupled to the current socket creation logic and will silently break if additional sockets are ever added (or ordering changes). Consider rebuilding ExtraFiles by iterating over the existing sockets slice (or factoring the socket-attachment logic into a helper) so the retry stays correct as the surrounding code evolves.

Suggested change
cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[0].f)
if opts.Debug && len(sockets) > 1 {
cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[1].f)
for _, socket := range sockets {
cmd.ExtraFiles = append(cmd.ExtraFiles, socket.f)

Copilot uses AI. Check for mistakes.
Comment on lines +187 to +191
userns := cloneMntNs(ctx, cmd)

if err := cmd.Start(); err != nil {
return params, err
if !userns {
return params, err
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new fallback behavior (retrying cmd.Start() without namespace isolation when a userns-enabled start fails) isn’t covered by current tests. Consider adding an integration test that forces the first cmd.Start() to fail after cloneMntNs sets userns flags and asserts the retry succeeds and stdout remains parseable JSON.

Copilot uses AI. Check for mistakes.
@dmcgowan dmcgowan force-pushed the retry-on-userns-failure branch from ccfb76d to c010af3 Compare April 8, 2026 23:21
cloneMntNs sets CLONE_NEWUSER|CLONE_NEWNS on the child, but clone can
fail for reasons the proactive AppArmor sysctl check cannot detect —
seccomp filters, other LSM policies, or EACCES when inherited socket
fds cross the user namespace boundary after exec triggers capability
recomputation.

Return whether namespace flags were set so the caller can distinguish a
namespace-related Start failure from an unrelated one. On failure, rebuild
the command without clone flags and retry, degrading gracefully to no
mount isolation rather than failing the container start entirely.

Signed-off-by: Derek McGowan <derek@mcg.dev>
Copilot AI review requested due to automatic review settings April 11, 2026 06:28
@dmcgowan dmcgowan force-pushed the retry-on-userns-failure branch from c010af3 to f68d962 Compare April 11, 2026 06:28
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 55 to 56
// unprivileged user namespaces), the function logs a warning and the shim
// will run without mount isolation.
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The header comment says the function "logs a warning" when namespace creation is not possible, but the implementation explicitly avoids logging (and now returns false). Please update the comment to match actual behavior (e.g., "skips mount isolation" / "returns false") to avoid misleading future readers.

Suggested change
// unprivileged user namespaces), the function logs a warning and the shim
// will run without mount isolation.
// unprivileged user namespaces), the shim runs without mount isolation
// and this function returns false.

Copilot uses AI. Check for mistakes.
Comment on lines 188 to +212
if err := cmd.Start(); err != nil {
return params, err
if !userns {
return params, err
}
// clone(CLONE_NEWUSER) can fail for reasons not covered by the
// proactive AppArmor check — e.g. seccomp filters, LSM policies,
// or EACCES from the child's capability recomputation when
// inherited socket fds cross the user namespace boundary after
// exec. Retry without namespace isolation rather than failing
// the container start.
//
// Note: we cannot log here — during "start" the logger output
// goes to stderr which containerd captures as part of the
// bootstrap response (CombinedOutput), corrupting the JSON.
cmd, err = newCommand(ctx, id, opts.Address, opts.TTRPCAddress, opts.Debug)
if err != nil {
return params, err
}
cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[0].f)
if opts.Debug && len(sockets) > 1 {
cmd.ExtraFiles = append(cmd.ExtraFiles, sockets[1].f)
}
if err := cmd.Start(); err != nil {
return params, err
}
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the retry without userns also fails, the returned error currently only reflects the second failure, losing the original clone-related error which is often the key diagnostic. Preserve the first Start() error and wrap/aggregate it with the retry error so users can see both causes.

Copilot uses AI. Check for mistakes.
Comment on lines 188 to +197
if err := cmd.Start(); err != nil {
return params, err
if !userns {
return params, err
}
// clone(CLONE_NEWUSER) can fail for reasons not covered by the
// proactive AppArmor check — e.g. seccomp filters, LSM policies,
// or EACCES from the child's capability recomputation when
// inherited socket fds cross the user namespace boundary after
// exec. Retry without namespace isolation rather than failing
// the container start.
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New fallback behavior (retrying Start() without user+mount namespaces after a clone failure) is not covered by existing integration tests. Please add a test that forces clone(CLONE_NEWUSER) to fail (e.g., via a seccomp filter/LSM simulation in test harness if available) and asserts Start still succeeds and returns valid bootstrap params.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants