Skip to content

fix(agent): keep SSH sessions alive in unprivileged LXC containers#6458

Open
otavio wants to merge 4 commits into
masterfrom
fix/agent-unprivileged-lxc-shell
Open

fix(agent): keep SSH sessions alive in unprivileged LXC containers#6458
otavio wants to merge 4 commits into
masterfrom
fix/agent-unprivileged-lxc-shell

Conversation

@otavio

@otavio otavio commented Jun 9, 2026

Copy link
Copy Markdown
Member

What

Stops SSH sessions from dropping the moment authentication completes when the
ShellHub agent runs inside an unprivileged Proxmox LXC container. Two unrelated
root causes in the agent's host-mode session handling are addressed.

Why

Closes #4634

Users running the agent in an unprivileged LXC reported sessions disconnecting
immediately after auth, with inappropriate ioctl for device and
nsenter: reassociate to namespace 'ns/time' failed: Operation not permitted
in the agent logs, plus a server-side use of closed network connection.
Investigation found two distinct bugs, one per agent mode.

Changes

  • agent (native mode), sessioner.go: startPty/initPty failures (ENOTTY
    in unprivileged LXC) were logged but execution continued, dereferencing the
    nil *os.File returned on failure — a nil-pointer panic that crashed the
    entire agent process. Shell() and Exec() now return the wrapped error
    after session.Exit(1), before any use of the nil file. A cmd.ProcessState
    nil-guard was added to the adjacent Exec() exit path, which had the same
    crash shape when cmd.Start() fails.

  • agent (Docker mode), command_docker.go: the nsenter wrapper always
    added -T to join the host time namespace whenever /proc/1/ns/time existed.
    In unprivileged LXC the mapped root lacks CAP_SYS_TIME, so the join fails
    with EPERM and nsenter aborts before exec'ing the shell. The join is now
    best-effort: a 500ms-timeout probe determines joinability once, results are
    memoized (definitive denials cached, transient/misconfig retried), and -T is
    dropped when not permitted. The time namespace only virtualizes
    CLOCK_MONOTONIC/BOOTTIME (not wall-clock) and Docker shares the host's by
    default, so dropping it has no practical effect on a shell.

  • agent/Dockerfile, cmd/true: the probe runs /bin/true as its payload,
    but the production image is FROM scratch. A tiny static CGO_ENABLED=0 Go
    true is compiled and copied in, with a build-time RUN ["/bin/true"]
    smoke-check so a broken payload fails the image build rather than production.

Testing

Unit tests cover both fixes (run the Docker-mode tests with -tags docker):

go test ./agent/server/modes/host/...
go test -tags docker ./agent/server/modes/host/...
  • Native: TestShell_StartPtyError / TestExec_InitPtyError assert no panic and
    a wrapped error on PTY failure; TestExec_NonPty_SucceedingCommand guards the
    non-PTY path (constrained to !docker, since the docker wrapper changes the
    exit semantics).
  • Docker: TestNsenterArgs, TestProbeTimeNS, TestTimeNSMemoizer,
    TestNsenterCommandWrapper cover flag assembly, probe exit-code
    classification, and the tri-state memoizer (cache-once vs. retry).

@otavio

otavio commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

/review

@shellhub-io shellhub-io deleted a comment from github-actions Bot Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude finished @otavio's task in 18m 53s —— View job


Code Review Complete

Reviewed 9 files across code quality, security, testing, Go patterns, and architecture. 2 inline comments posted (1 medium, 1 low).

Summary

The two core bug fixes are sound:

  • Native mode (sessioner.go): the early-return guard on startPtyFn/initPtyFn failure correctly prevents the nil-pointer dereference, and the cmd.ProcessState nil-guard in Exec() is correct.
  • Docker mode (command_docker.go): the tri-state memoizer and best-effort time-namespace probe are well-designed. The sync.Mutex usage is correct, map iteration order doesn't matter for nsenter flags, and error classification is solid.
  • Dockerfile / cmd/true: clean addition — static binary, correct cross-compilation flags, exec-form RUN for scratch image.
  • Tests: good coverage of the new behavior with proper function-variable seams.
  • Security: no issues — all nsenter/command arguments are hardcoded literals, no user input flows into command construction.
  • Architecture: no interface changes, no shared-package modifications, no cross-repo impact on cloud/.

To request another review round, comment /review.

Comment thread agent/server/modes/host/command/command_docker.go Outdated
Comment thread agent/server/modes/host/sessioner_test.go
@otavio otavio force-pushed the fix/agent-unprivileged-lxc-shell branch from 7989ac1 to fca8e9e Compare June 10, 2026 12:44
@otavio

otavio commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

/review

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude finished @otavio's task in 40s —— View job


Code Review Complete

Reviewed 9 files across code quality, security, testing, Go patterns, and architecture — no issues found in the changed code. The fixes are sound:

  • Native mode (sessioner.go): Early-return guards on startPtyFn/initPtyFn failure correctly prevent nil-pointer dereferences, and the cmd.ProcessState nil-guard in Exec() is correct.
  • Docker mode (command_docker.go): The tri-state memoizer and best-effort time-namespace probe are well-designed. The sync.Mutex pattern is correct (no races), error classification is solid, and the os.ErrNotExist fix (fca8e9e) properly handles absolute-path binary lookup.
  • Dockerfile / cmd/true: Clean — static binary, correct cross-compilation flags, exec-form RUN for scratch image.
  • Tests: Good coverage of the new behavior with proper function-variable seams and t.Cleanup restoration.
  • Security: No issues — all nsenter/command arguments are hardcoded literals, no user input flows into command construction.
  • Architecture: No exported interface changes, no pkg/ modifications, no cross-repo impact on cloud/.

Additional note

Heredoc() (lines 151–189) has the same two nil-pointer risks that this PR fixes in Shell() and Exec(): cmd.ProcessState.ExitCode() without a nil-guard (line 189), and a kill-goroutine spawned before cmd.Start() (lines 151–154). These are pre-existing and outside this PR's diff, but since the PR is specifically fixing this class of bug, it may be worth patching Heredoc() for consistency in a follow-up.


To request another review round, comment /review.

@otavio otavio force-pushed the fix/agent-unprivileged-lxc-shell branch from 8bc5a28 to 7b8752c Compare June 10, 2026 16:08
@gustavosbarreto

gustavosbarreto commented Jun 14, 2026

Copy link
Copy Markdown
Member

The docker wrapper builds an nsenter -t 1 ... command that drops the session into PID 1's namespaces, and this PR makes the -T (time namespace) join conditional behind a runtime probe. I think the join can be dropped outright instead.

The rationale already says it: "Docker shares the host's [time namespace] by default, so dropping it has no practical effect on a shell." Docker keeps containers in the host time namespace (CLONE_NEWTIME isn't in the containerd/runc default set, and there's no flag to opt in), so nsenter -t 1 -T just re-enters the namespace the process is already in. That's a no-op on every Docker setup, not only in unprivileged LXC. The agent doesn't touch the time namespace anywhere either (CLONE_NEWTIME, ns/time, CLOCK_MONOTONIC, boottime all grep empty under agent/).

So rather than probing for joinability at runtime, I'd drop the entry in command_docker.go:

-		"time":   "-T",

That takes probeTimeNSJoinable and the memoizer out of the picture, and removes the need for cmd/true and the scratch-image RUN ["/bin/true"] smoke-check with it, since there's no probe payload left to run. The LXC fix stays and gets sharper: that -T join is exactly what EPERMs and aborts nsenter before the shell execs, so not passing it is the fix.

Tested on the docker-mode agent: builds clean, SSH session comes up, and the nsenter invocation is -m -u -i -n -p -C with no -T.

otavio added 4 commits June 14, 2026 21:37
Two distinct bugs caused SSH sessions to drop immediately when the agent
runs inside an unprivileged Proxmox LXC container:

- Native mode: a failed PTY allocation (ENOTTY) was only logged, then the
  nil *os.File was dereferenced, panicking and crashing the whole agent.
  Shell() and Exec() now return the error gracefully after Exit(1).

- Docker mode: nsenter always joined the host time namespace (-T), which
  fails with EPERM in unprivileged LXC and aborts the shell before it
  starts. The join is now best-effort -- probed once and dropped when not
  permitted. The time namespace has no practical effect on a shell.

A static /bin/true is added to the scratch image as the probe payload.

Fixes: #4634
gofumpt alignment in getWrappedCommand, and move the fakeGosshConn test
helper into the !docker test file where its only user lives so it is not
flagged as unused when building with -tags docker.
Verify the behaviour the test documents but did not check: a failed
startPty must call session.Exit once with code 1.
In native mode the agent switches credentials via SysProcAttr.Credential
when running as root, which makes the Go runtime call setgroups() during
cmd.Start() — even for an empty group list. In an unprivileged user
namespace /proc/self/setgroups is "deny", so setgroups fails with EPERM
and every native SSH session dies right after auth with a cryptic error.

Detect the denied policy up front (euid 0 AND /proc/self/setgroups ==
"deny") via a new command.CheckCredentialSwitch seam and refuse the
session with a clear error and Exit(1) before launching the shell, rather
than dropping supplementary groups silently. Docker mode is unaffected
(it switches privileges via setpriv/nsenter) and gets a no-op.

Also harden Heredoc against two latent nil-deref panics on a cmd.Start()
failure (nil ProcessState, and the kill-goroutine racing on a nil
Process), and add a non-prescriptive PTY-allocation hint to the Shell and
Exec logs on ENOTTY.
@otavio otavio force-pushed the fix/agent-unprivileged-lxc-shell branch from 7b8752c to 6e0427c Compare June 15, 2026 00:50
@otavio

otavio commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

Agreed — consolidated into the change rather than kept as a follow-up commit, to keep the history clean. Dropped the -T join outright: probeTimeNSJoinable, the memoizer, cmd/true, and the Dockerfile RUN ["/bin/true"] smoke-check are all gone, folded into 4d5bf85. The nsenter invocation is now -m -u -i -n -p -C with no -T. Builds clean and tests pass under -tags docker.

Note: the branch was force-pushed to fold this in.

@otavio

otavio commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

/review

@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown

Claude finished @otavio's task in 6m 57s —— View job


Code Review Complete

Reviewed 10 files across code quality, security, testing, Go patterns, and architecture — no issues found in the changed code.

  • Native mode (sessioner.go): Early-return guards on startPtyFn/initPtyFn failure correctly prevent nil-pointer dereferences. The refuseIfCredentialSwitchDenied pre-flight check is properly applied to Shell(), Heredoc(), and Exec(). The cmd.ProcessState nil-guard in Exec() is correct. The kill-goroutine in Heredoc() is correctly moved after cmd.Start().
  • Docker mode (command_docker.go): Clean simplification — dropping -T outright instead of probing. The nsenterArgs helper and statFn seam are well-factored. All nsenter/command arguments remain hardcoded literals with no user input in command construction.
  • Credential switch (command_native.go): CheckCredentialSwitch() / setgroupsDenied() correctly read the kernel-controlled /proc/self/setgroups with proper error classification (os.ErrNotExist vs other errors). The geteuidFn seam short-circuits correctly for non-root.
  • Tests: Thorough coverage of the new behavior — PTY failure paths, credential switch denial for all three session types, Heredoc start failure, non-PTY happy path. Compile-time interface assertions for both build tags. Proper seam variable save/restore via t.Cleanup.
  • Security: No issues — no user input flows into command construction, credential checks use kernel-controlled files.
  • Architecture: No exported interface changes, no pkg/ modifications, no cross-repo impact on cloud/.

Additional note

The Heredoc() method (line 239) accesses cmd.ProcessState.ExitCode() without the nil-guard that was added to Exec() (lines 374–377). This is safe because cmd.Start() succeeds before reaching that point (early-return guard at line 206–211), and Go guarantees ProcessState is populated after cmd.Wait(). However, for defensive consistency with Exec(), a nil-guard could be added in a follow-up.


To request another review round, comment /review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: SSH Connection not Working

2 participants