fix(agent): keep SSH sessions alive in unprivileged LXC containers#6458
fix(agent): keep SSH sessions alive in unprivileged LXC containers#6458otavio wants to merge 4 commits into
Conversation
|
/review |
|
Claude finished @otavio's task in 18m 53s —— View job Code Review CompleteReviewed 9 files across code quality, security, testing, Go patterns, and architecture. 2 inline comments posted (1 medium, 1 low). SummaryThe two core bug fixes are sound:
To request another review round, comment |
7989ac1 to
fca8e9e
Compare
|
/review |
|
Claude finished @otavio's task in 40s —— View job Code Review CompleteReviewed 9 files across code quality, security, testing, Go patterns, and architecture — no issues found in the changed code. The fixes are sound:
Additional note
To request another review round, comment |
8bc5a28 to
7b8752c
Compare
|
The docker wrapper builds an The rationale already says it: "Docker shares the host's [time namespace] by default, so dropping it has no practical effect on a shell." Docker keeps containers in the host time namespace ( So rather than probing for joinability at runtime, I'd drop the entry in - "time": "-T",That takes Tested on the docker-mode agent: builds clean, SSH session comes up, and the |
Two distinct bugs caused SSH sessions to drop immediately when the agent runs inside an unprivileged Proxmox LXC container: - Native mode: a failed PTY allocation (ENOTTY) was only logged, then the nil *os.File was dereferenced, panicking and crashing the whole agent. Shell() and Exec() now return the error gracefully after Exit(1). - Docker mode: nsenter always joined the host time namespace (-T), which fails with EPERM in unprivileged LXC and aborts the shell before it starts. The join is now best-effort -- probed once and dropped when not permitted. The time namespace has no practical effect on a shell. A static /bin/true is added to the scratch image as the probe payload. Fixes: #4634
gofumpt alignment in getWrappedCommand, and move the fakeGosshConn test helper into the !docker test file where its only user lives so it is not flagged as unused when building with -tags docker.
Verify the behaviour the test documents but did not check: a failed startPty must call session.Exit once with code 1.
In native mode the agent switches credentials via SysProcAttr.Credential when running as root, which makes the Go runtime call setgroups() during cmd.Start() — even for an empty group list. In an unprivileged user namespace /proc/self/setgroups is "deny", so setgroups fails with EPERM and every native SSH session dies right after auth with a cryptic error. Detect the denied policy up front (euid 0 AND /proc/self/setgroups == "deny") via a new command.CheckCredentialSwitch seam and refuse the session with a clear error and Exit(1) before launching the shell, rather than dropping supplementary groups silently. Docker mode is unaffected (it switches privileges via setpriv/nsenter) and gets a no-op. Also harden Heredoc against two latent nil-deref panics on a cmd.Start() failure (nil ProcessState, and the kill-goroutine racing on a nil Process), and add a non-prescriptive PTY-allocation hint to the Shell and Exec logs on ENOTTY.
7b8752c to
6e0427c
Compare
|
Agreed — consolidated into the change rather than kept as a follow-up commit, to keep the history clean. Dropped the Note: the branch was force-pushed to fold this in. |
|
/review |
|
Claude finished @otavio's task in 6m 57s —— View job Code Review CompleteReviewed 10 files across code quality, security, testing, Go patterns, and architecture — no issues found in the changed code.
Additional noteThe To request another review round, comment |
What
Stops SSH sessions from dropping the moment authentication completes when the
ShellHub agent runs inside an unprivileged Proxmox LXC container. Two unrelated
root causes in the agent's host-mode session handling are addressed.
Why
Closes #4634
Users running the agent in an unprivileged LXC reported sessions disconnecting
immediately after auth, with
inappropriate ioctl for deviceandnsenter: reassociate to namespace 'ns/time' failed: Operation not permittedin the agent logs, plus a server-side
use of closed network connection.Investigation found two distinct bugs, one per agent mode.
Changes
agent (native mode),
sessioner.go:startPty/initPtyfailures (ENOTTYin unprivileged LXC) were logged but execution continued, dereferencing the
nil
*os.Filereturned on failure — a nil-pointer panic that crashed theentire agent process.
Shell()andExec()now return the wrapped errorafter
session.Exit(1), before any use of the nil file. Acmd.ProcessStatenil-guard was added to the adjacent
Exec()exit path, which had the samecrash shape when
cmd.Start()fails.agent (Docker mode),
command_docker.go: thensenterwrapper alwaysadded
-Tto join the host time namespace whenever/proc/1/ns/timeexisted.In unprivileged LXC the mapped root lacks
CAP_SYS_TIME, so the join failswith EPERM and
nsenteraborts before exec'ing the shell. The join is nowbest-effort: a 500ms-timeout probe determines joinability once, results are
memoized (definitive denials cached, transient/misconfig retried), and
-Tisdropped when not permitted. The time namespace only virtualizes
CLOCK_MONOTONIC/BOOTTIME(not wall-clock) and Docker shares the host's bydefault, so dropping it has no practical effect on a shell.
agent/Dockerfile,
cmd/true: the probe runs/bin/trueas its payload,but the production image is
FROM scratch. A tiny staticCGO_ENABLED=0Gotrueis compiled and copied in, with a build-timeRUN ["/bin/true"]smoke-check so a broken payload fails the image build rather than production.
Testing
Unit tests cover both fixes (run the Docker-mode tests with
-tags docker):TestShell_StartPtyError/TestExec_InitPtyErrorassert no panic anda wrapped error on PTY failure;
TestExec_NonPty_SucceedingCommandguards thenon-PTY path (constrained to
!docker, since the docker wrapper changes theexit semantics).
TestNsenterArgs,TestProbeTimeNS,TestTimeNSMemoizer,TestNsenterCommandWrappercover flag assembly, probe exit-codeclassification, and the tri-state memoizer (cache-once vs. retry).