Commit a934f8c

experimental/ssh: surface connect failures instead of hanging (#5456)
## Why

Originated from a customer case: `databricks ssh connect` to a dedicated cluster whose Docker container image was missing an OpenSSH **server** (`/usr/sbin/sshd`). The failure surfaced terribly: either a generic `server metadata error / metadata.json doesn't exist`, or the client just **hung** (the local `ssh` waited on its 360s `ConnectTimeout`). The root cause was buried in the cluster's job-run logs. This PR improves the diagnostics for `ssh connect` failures.

## What

1. **Surface bootstrap job-run errors.** When the SSH server bootstrap job reaches a terminal/failed state, fetch the run's state message, notebook error/trace, and run-page URL and show them, both when the task terminates before reaching RUNNING and when it dies after, during metadata polling. (`experimental/ssh/internal/client/client.go`)
2. **Guard against hangs when the server is up but the handshake never completes.** If the container image has no `sshd`, the server can't launch `/usr/sbin/sshd` on connect and **holds the websocket open**, so both proxy loops block forever. The client now runs the proxy loops in the background and aborts after a handshake timeout (no server response) with an actionable hint, and also exits promptly when the server *does* close the connection. (`experimental/ssh/internal/proxy/client.go`)
3. **openssh-server hint** when `ssh` exits with its connection-failure code (255). (`spawnSSHClient`)

## Tests

- `client_internal_test.go`: failed-run message formatting (state message + trace + run URL), truncation, terminal-state detection (SDK mocks).
- `proxy/client_server_test.go`: fast exit when the server closes the connection; abort on the handshake timeout when the server sends nothing.

All `experimental/ssh/...` tests pass; lint clean.

## Status / follow-ups (WIP)

- The missing-`sshd` path still incurs a ~30s handshake-timeout wait before failing. The cleaner fix is a **server-side pre-flight `sshd` check** (fail the bootstrap job immediately with a clear message), tracked separately; that would turn this case into an instant, clear job failure handled by improvement #1.
- The handshake timeout (30s) is conservative and currently a package constant; could be shortened or made configurable.
- The proxy error and the outer 255 hint are slightly redundant; may consolidate.

This pull request and its description were written by Isaac.
1 parent 7e85efd commit a934f8c

6 files changed

Lines changed: 553 additions & 29 deletions

experimental/ssh/FAILURE_MODES.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
+# Reproducing `databricks ssh connect` failure modes
+
+This guide documents container/cluster misconfigurations that make `databricks ssh connect`
+fail, how to reproduce each one, the symptom the user sees, and where the real error lives. It
+is primarily a testing aid for the SSH feature's error-handling paths.
+
+For the connection flow and architecture, see [README.md](./README.md).
+
+## Background: where failures surface
+
+The bootstrap is a **Python notebook job** that starts `databricks ssh server` on the cluster.
+The server publishes its port to the workspace (`metadata.json`), the client reads it, prints
+`Connected!`, and spawns `ssh`. The SSH daemon (`/usr/sbin/sshd`) is launched **lazily, per
+client connection** (see `internal/server/sshd.go` and `internal/proxy/server.go`). Because of
+this ordering, different misconfigurations fail at different stages:
+
+| Stage | Needs | Failure mode if missing |
+| --- | --- | --- |
+| Bootstrap job runs | a working Databricks **Python** runtime in the image | [Mode 2](#mode-2-container-cant-run-the-python-bootstrap) |
+| Per-connection SSH | **`/usr/sbin/sshd`** (OpenSSH server) in the image | [Mode 1](#mode-1-container-missing-the-openssh-server-sshd) |
+
+## Prerequisites
+
+- A workspace with **Databricks Container Services** (custom Docker images) enabled.
+- Permission to create a **dedicated (single-user)** cluster.
+- A dev build of the CLI. See the *Development* section of [README.md](./README.md):
+  ```shell
+  ./task build snapshot-release
+  ./cli ssh connect --cluster=<cluster-id> --releases-dir=./dist --debug
+  ```
+- A container registry the workspace can pull from (e.g. a public Docker Hub repo) to host the
+  test images below. Build them for the cluster's architecture (`linux/amd64` on most clouds):
+  ```shell
+  docker buildx build --platform linux/amd64 -t <namespace>/<image>:<tag> --push .
+  ```
+
+The cluster specs below use a single-node dedicated cluster. Adjust `node_type_id` and
+`spark_version` for your cloud and DBR version:
+
+```json
+{
+  "cluster_name": "ssh-failure-repro",
+  "spark_version": "16.4.x-scala2.12",
+  "node_type_id": "<your-cloud-node-type>",
+  "num_workers": 0,
+  "data_security_mode": "SINGLE_USER",
+  "single_user_name": "<you@example.com>",
+  "spark_conf": { "spark.databricks.cluster.profile": "singleNode", "spark.master": "local[*, 4]" },
+  "custom_tags": { "ResourceClass": "SingleNode" },
+  "autotermination_minutes": 60,
+  "docker_image": { "url": "<namespace>/<image>:<tag>" }
+}
+```
+
+Create it with `databricks clusters create --json @cluster.json --no-wait` and wait for the
+`RUNNING` state (a custom-container pull can take several minutes).
+
+## Mode 1: container missing the OpenSSH server (`sshd`)
+
+A notebook-capable image that does **not** ship `openssh-server`. Build it by removing the SSH
+server from an image that otherwise works:
+
+```dockerfile
+FROM databricksruntime/standard:16.4-LTS
+RUN (apt-get remove -y openssh-server || true) \
+    && rm -f /usr/sbin/sshd /usr/bin/sshd
+```
+
+Create a cluster on this image, then:
+
+```shell
+./cli ssh connect --cluster=<cluster-id> --releases-dir=./dist
+```
+
+**Symptom.** The bootstrap job succeeds and publishes metadata, so the client prints
+`Connected!`, and then the connection drops. The server can't launch `/usr/sbin/sshd` for the
+incoming connection and holds the websocket open, so historically the `ssh` client **hung**
+until its `ConnectTimeout`. The real error,
+`failed to start SSHD process: ... /usr/sbin/sshd: no such file or directory`, is only written
+to the bootstrap job's **stdout logs** while the job is still `RUNNING`; it is never a failed
+job state.
+
+**With the error-handling improvements** the client aborts after a handshake timeout (no SSH
+banner from the server) with an actionable hint to install `openssh-server`, and exits
+promptly instead of hanging.
+
+**Fix.** Install `openssh-server` in the image (`apt-get install -y openssh-server`).
+
+## Mode 2: container can't run the Python bootstrap
+
+A bare/minimal base that lacks a working Databricks Python runtime. The simplest example is
+`databricksruntime/rbase:16.4-LTS` used directly as the cluster image (it is an R *base* layer;
+notably it has no functioning `/databricks/python` notebook-execution environment).
+
+Create a cluster on `databricksruntime/rbase:16.4-LTS`, then:
+
+```shell
+./cli ssh connect --cluster=<cluster-id> --releases-dir=./dist
+```
+
+**Symptom.** The bootstrap is a Python notebook, but the image can't execute notebook commands,
+so the job fails with `Could not reach driver of cluster <id>`. The SSH server never starts and
+never publishes metadata, so the client fails with
+`server metadata error / ... metadata.json doesn't exist` **before** the `sshd` step is ever
+reached. (A trivial `print(...)` notebook job submitted to the same cluster fails the same way,
+which is a quick way to confirm the image, not the SSH feature, is at fault.)
+
+**With the error-handling improvements** the client fetches the failed run's state message,
+notebook error/trace, and run-page URL and shows them instead of the generic metadata error.
+
+**Fix.** Build on a notebook-capable base (e.g. `databricksruntime/standard:...`) or otherwise
+provide a working Databricks Python environment, in addition to `openssh-server`.
+
+## Working control
+
+`databricksruntime/standard:16.4-LTS` ships **both** a working Python runtime **and** `sshd`,
+so `ssh connect` to a cluster on it succeeds end to end. Use it as a baseline to confirm your
+workspace, cluster spec, and dev build are healthy before reproducing a failure mode.
+
+## Inspecting the bootstrap job logs
+
+`ssh connect` prints `Job submitted successfully with run ID: <id>`. Inspect it with:
+
+```shell
+databricks jobs get-run <id>                  # open run_page_url in the UI
+databricks jobs get-run-output <task-run-id>  # task-run-id = .tasks[0].run_id of the run
+```
+
+Caveat: for a **running** server task, `get-run-output`'s `logs`/`error` are not populated:
+the `sshd` error from [Mode 1](#mode-1-container-missing-the-openssh-server-sshd) lives in the
+live notebook cell stdout / driver logs, not the Jobs run-output API. A failed run from
+[Mode 2](#mode-2-container-cant-run-the-python-bootstrap) does populate the run's state message
+and error.
+
+## Reproducing locally, without a workspace
+
+The proxy-layer behaviors have unit tests that don't need a cluster:
+
+- `internal/proxy/client_server_test.go`
+  - `TestClientExitsWhenServerCommandFails`: server can't launch its command and closes the
+    connection; the client exits promptly.
+  - `TestClientTimesOutWhenServerSendsNothing`: server holds the connection open and sends
+    nothing (the Mode 1 shape); the client aborts on the handshake timeout.
+- `internal/client/client_internal_test.go`: formatting of a failed bootstrap run's error
+  (state message, error trace, run-page URL) using SDK mocks.
+
+```shell
+go test ./experimental/ssh/...
+```

experimental/ssh/README.md

Lines changed: 3 additions & 0 deletions
@@ -23,6 +23,9 @@ databricks ssh connect --cluster=id
 ./cli ssh connect --cluster=<id> --releases-dir=./dist --debug # or modify ssh config accordingly
 ```
 
+To reproduce and test the known `ssh connect` failure modes (container missing `sshd`, or a
+container that can't run the Python bootstrap), see [FAILURE_MODES.md](./FAILURE_MODES.md).
+
 ## Design
 
 High level:

experimental/ssh/internal/client/client.go

Lines changed: 123 additions & 12 deletions
@@ -489,16 +489,19 @@ func getServerMetadata(ctx context.Context, client *databricks.WorkspaceClient,
 	return wsMetadata.Port, string(bodyBytes), effectiveClusterID, nil
 }
 
-func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient, version, secretScopeName string, opts ClientOptions) error {
+// submitSSHTunnelJob submits the bootstrap job and waits for the SSH server task to start.
+// It returns the job run ID (when known) so callers can fetch and surface the run's error
+// details if the server never comes up.
+func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient, version, secretScopeName string, opts ClientOptions) (int64, error) {
 	sessionID := opts.SessionIdentifier()
 	contentDir, err := sshWorkspace.GetWorkspaceContentDir(ctx, client, version, sessionID)
 	if err != nil {
-		return fmt.Errorf("failed to get workspace content directory: %w", err)
+		return 0, fmt.Errorf("failed to get workspace content directory: %w", err)
 	}
 
 	err = client.Workspace.MkdirsByPath(ctx, contentDir) //nolint:staticcheck // Deprecated in SDK v0.127.0. Migration to WorkspaceHierarchyService tracked separately.
 	if err != nil {
-		return fmt.Errorf("failed to create directory in the remote workspace: %w", err)
+		return 0, fmt.Errorf("failed to create directory in the remote workspace: %w", err)
 	}
 
 	sshTunnelJobName := "ssh-server-bootstrap-" + sessionID
@@ -514,7 +517,7 @@ func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient,
 		Overwrite: true,
 	})
 	if err != nil {
-		return fmt.Errorf("failed to create ssh-tunnel notebook: %w", err)
+		return 0, fmt.Errorf("failed to create ssh-tunnel notebook: %w", err)
 	}
 
 	baseParams := map[string]string{
@@ -569,12 +572,13 @@ func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient,
 
 	waiter, err := client.Jobs.Submit(ctx, submitRequest)
 	if err != nil {
-		return fmt.Errorf("failed to submit job: %w", err)
+		return 0, fmt.Errorf("failed to submit job: %w", err)
 	}
 
 	cmdio.LogString(ctx, fmt.Sprintf("Job submitted successfully with run ID: %d", waiter.RunId))
 
-	return waitForJobToStart(ctx, client, waiter.RunId, opts.TaskStartupTimeout)
+	// Return the run ID even on error so callers can fetch the run's failure details.
+	return waiter.RunId, waitForJobToStart(ctx, client, waiter.RunId, opts.TaskStartupTimeout)
 }
 
 func spawnSSHClient(ctx context.Context, userName, privateKeyPath string, serverPort int, clusterID string, opts ClientOptions) error {
@@ -610,7 +614,18 @@ func spawnSSHClient(ctx context.Context, userName, privateKeyPath string, server
 	sshCmd.Stdout = os.Stdout
 	sshCmd.Stderr = os.Stderr
 
-	return sshCmd.Run()
+	err = sshCmd.Run()
+	// ssh reserves exit code 255 for its own connection-level failures (a remote command's exit
+	// code is passed through as-is, 0-254). The most common cause here is the cluster's container
+	// image missing an OpenSSH server, so the server can't launch sshd once we connect; the
+	// connection then drops right after "Connected!". Surface an actionable hint rather than
+	// leaving the user with ssh's opaque "Connection closed" message.
+	if exitErr, ok := errors.AsType[*exec.ExitError](err); ok && exitErr.ExitCode() == 255 {
+		cmdio.LogString(ctx, cmdio.Yellow(ctx, "The SSH connection closed unexpectedly. If it dropped right after connecting, "+
+			"the cluster's container image is likely missing an OpenSSH server: ensure 'openssh-server' "+
+			"is installed (it provides /usr/sbin/sshd), then check the SSH server job run logs."))
+	}
+	return err
 }
 
 func runSSHProxy(ctx context.Context, client *databricks.WorkspaceClient, serverPort int, clusterID string, opts ClientOptions) error {
@@ -691,9 +706,10 @@ func waitForJobToStart(ctx context.Context, client *databricks.WorkspaceClient,
 			return sshTask, nil
 		}
 
-		// Check for terminal failure states
+		// Check for terminal failure states. Surface the run's actual error (e.g. a notebook
+		// traceback or "Could not reach driver") instead of a generic message.
 		if currentState == jobs.RunLifecycleStateV2StateTerminated {
-			return nil, retries.Halt(errors.New("task terminated before reaching running state"))
+			return nil, retries.Halt(fmt.Errorf("ssh server bootstrap job failed:\n%s", describeRunFailure(ctx, client, runID)))
 		}
 
 		// Continue polling for other states
@@ -703,6 +719,94 @@ func waitForJobToStart(ctx context.Context, client *databricks.WorkspaceClient,
 	return err
 }
 
+// maxRunFailureTraceBytes bounds how much of a failed run's error trace we print to the
+// terminal; the full output is always available via the run page URL.
+const maxRunFailureTraceBytes = 2000
+
+// describeRunFailure fetches a failed bootstrap run's error details and formats them for the
+// terminal. It is best-effort: any API error is folded into the returned text rather than
+// propagated, so callers can always embed the result in their own error.
+func describeRunFailure(ctx context.Context, client *databricks.WorkspaceClient, runID int64) string {
+	if runID == 0 {
+		return " (no job run ID available)"
+	}
+
+	run, err := client.Jobs.GetRun(ctx, jobs.GetRunRequest{RunId: runID})
+	if err != nil {
+		return fmt.Sprintf(" could not fetch job run %d: %v", runID, err)
+	}
+
+	var b strings.Builder
+
+	// Locate the SSH server task to read its termination reason and per-task run output.
+	var sshTask *jobs.RunTask
+	for i := range run.Tasks {
+		if run.Tasks[i].TaskKey == sshServerTaskKey {
+			sshTask = &run.Tasks[i]
+			break
+		}
+	}
+
+	if sshTask != nil && sshTask.Status != nil && sshTask.Status.TerminationDetails != nil {
+		if msg := strings.TrimSpace(sshTask.Status.TerminationDetails.Message); msg != "" {
+			fmt.Fprintf(&b, " %s\n", msg)
+		}
+	}
+
+	// The notebook error/traceback carries the real cause (e.g. a Python exception).
+	outputRunID := runID
+	if sshTask != nil && sshTask.RunId != 0 {
+		outputRunID = sshTask.RunId
+	}
+	if output, err := client.Jobs.GetRunOutput(ctx, jobs.GetRunOutputRequest{RunId: outputRunID}); err == nil && output != nil {
+		if e := strings.TrimSpace(output.Error); e != "" {
+			fmt.Fprintf(&b, " %s\n", e)
+		}
+		if trace := strings.TrimSpace(output.ErrorTrace); trace != "" {
+			fmt.Fprintf(&b, "%s\n", truncateTail(trace, maxRunFailureTraceBytes))
+		}
+	}
+
+	if run.RunPageUrl != "" {
+		fmt.Fprintf(&b, " See the full job logs: %s", run.RunPageUrl)
+	}
+
+	if b.Len() == 0 {
+		return fmt.Sprintf(" job run %d failed; see run details in the workspace", runID)
+	}
+	return strings.TrimRight(b.String(), "\n")
+}
+
+// runFailureIfTerminated reports whether the bootstrap run has reached a terminal state (so the
+// SSH server will never come up), returning a formatted failure description when it has.
+func runFailureIfTerminated(ctx context.Context, client *databricks.WorkspaceClient, runID int64) (string, bool) {
+	if runID == 0 {
+		return "", false
+	}
+	run, err := client.Jobs.GetRun(ctx, jobs.GetRunRequest{RunId: runID})
+	if err != nil {
+		return "", false
+	}
+	for i := range run.Tasks {
+		if run.Tasks[i].TaskKey != sshServerTaskKey {
+			continue
+		}
+		if run.Tasks[i].Status != nil && run.Tasks[i].Status.State == jobs.RunLifecycleStateV2StateTerminated {
+			return describeRunFailure(ctx, client, runID), true
+		}
+		return "", false
+	}
+	return "", false
+}
+
+// truncateTail returns the last maxBytes of s, marking the cut when truncated.
+func truncateTail(s string, maxBytes int) string {
+	if len(s) <= maxBytes {
+		return s
+	}
+	return " ...\n" + s[len(s)-maxBytes:]
+}
+
 func ensureSSHServerIsRunning(ctx context.Context, client *databricks.WorkspaceClient, version, secretScopeName string, opts ClientOptions) (string, int, string, error) {
 	sessionID := opts.SessionIdentifier()
 	// For dedicated clusters, use clusterID; for serverless, it will be read from metadata
@@ -712,7 +816,7 @@ func ensureSSHServerIsRunning(ctx context.Context, client *databricks.WorkspaceC
 	if errors.Is(err, errServerMetadata) {
 		cmdio.LogString(ctx, "Starting SSH server...")
 
-		err := submitSSHTunnelJob(ctx, client, version, secretScopeName, opts)
+		runID, err := submitSSHTunnelJob(ctx, client, version, secretScopeName, opts)
 		if err != nil {
 			return "", 0, "", fmt.Errorf("failed to submit and start ssh server job: %w", err)
 		}
@@ -729,10 +833,17 @@ func ensureSSHServerIsRunning(ctx context.Context, client *databricks.WorkspaceC
 			if err == nil {
 				cmdio.LogString(ctx, "Health check successful, starting ssh WebSocket connection...")
 				break
-			} else if retries < maxRetries-1 {
+			}
+			// The metadata never appears if the bootstrap job dies after reaching RUNNING.
+			// Surface the job's actual error instead of waiting out the full timeout with a
+			// generic "metadata.json doesn't exist" message.
+			if failure, terminated := runFailureIfTerminated(ctx, client, runID); terminated {
+				return "", 0, "", fmt.Errorf("ssh server bootstrap job failed:\n%s", failure)
+			}
+			if retries < maxRetries-1 {
 				time.Sleep(2 * time.Second)
 			} else {
-				return "", 0, "", fmt.Errorf("failed to start the ssh server: %w", err)
+				return "", 0, "", fmt.Errorf("failed to start the ssh server: %w\n%s", err, describeRunFailure(ctx, client, runID))
 			}
 		}
 	} else if err != nil {
