databricks
diff --git a/‎experimental/ssh/FAILURE_MODES.md‎
Lines changed: 149 additions & 0 deletions b/‎experimental/ssh/FAILURE_MODES.md‎
Lines changed: 149 additions & 0 deletions
diff --git a/‎experimental/ssh/README.md‎
Lines changed: 3 additions & 0 deletions b/‎experimental/ssh/README.md‎
Lines changed: 3 additions & 0 deletions
diff --git a/‎experimental/ssh/internal/client/client.go‎
Lines changed: 123 additions & 12 deletions b/‎experimental/ssh/internal/client/client.go‎
Lines changed: 123 additions & 12 deletions
@@ -0,0 +1,149 @@
+# Reproducing `databricks ssh connect` failure modes
+
+This guide documents container/cluster misconfigurations that make `databricks ssh connect`
+fail, how to reproduce each one, the symptom the user sees, and where the real error lives. It
+is primarily a testing aid for the SSH feature's error-handling paths.
+
+For the connection flow and architecture, see [README.md](./README.md).
+
+## Background: where failures surface
+
+The bootstrap is a **Python notebook job** that starts `databricks ssh server` on the cluster.
+The server publishes its port to the workspace (`metadata.json`), the client reads it, prints
+`Connected!`, and spawns `ssh`. The SSH daemon (`/usr/sbin/sshd`) is launched **lazily, per
+client connection** (see `internal/server/sshd.go` and `internal/proxy/server.go`). Because of
+this ordering, different misconfigurations fail at different stages:
+
+| Stage | Needs | Failure mode if missing |
+| --- | --- | --- |
+| Bootstrap job runs | a working Databricks **Python** runtime in the image | [Mode 2](#mode-2-container-cant-run-the-python-bootstrap) |
+| Per-connection SSH | **`/usr/sbin/sshd`** (OpenSSH server) in the image | [Mode 1](#mode-1-container-missing-the-openssh-server-sshd) |
+
+## Prerequisites
+
+- A workspace with **Databricks Container Services** (custom Docker images) enabled.
+- Permission to create a **dedicated (single-user)** cluster.
+- A dev build of the CLI. See the *Development* section of [README.md](./README.md):
+  ```shell
+  ./task build snapshot-release
+  ./cli ssh connect --cluster=<cluster-id> --releases-dir=./dist --debug
+  ```
+- A container registry the workspace can pull from (e.g. a public Docker Hub repo) to host the
+  test images below. Build them for the cluster's architecture (`linux/amd64` on most clouds):
+  ```shell
+  docker buildx build --platform linux/amd64 -t <namespace>/<image>:<tag> --push .
+  ```
+
+The cluster specs below use a single-node dedicated cluster. Adjust `node_type_id` and
+`spark_version` for your cloud and DBR version:
+
+```json
+{
+  "cluster_name": "ssh-failure-repro",
+  "spark_version": "16.4.x-scala2.12",
+  "node_type_id": "<your-cloud-node-type>",
+  "num_workers": 0,
+  "data_security_mode": "SINGLE_USER",
+  "single_user_name": "<you@example.com>",
+  "spark_conf": { "spark.databricks.cluster.profile": "singleNode", "spark.master": "local[*, 4]" },
+  "custom_tags": { "ResourceClass": "SingleNode" },
+  "autotermination_minutes": 60,
+  "docker_image": { "url": "<namespace>/<image>:<tag>" }
+}
+```
+
+Create it with `databricks clusters create --json @cluster.json --no-wait` and wait for the
+`RUNNING` state (a custom-container pull can take several minutes).
+
+## Mode 1: container missing the OpenSSH server (`sshd`)
+
+A notebook-capable image that does **not** ship `openssh-server`. Build it by removing the SSH
+server from an image that otherwise works:
+
+```dockerfile
+FROM databricksruntime/standard:16.4-LTS
+RUN (apt-get remove -y openssh-server || true) \
+    && rm -f /usr/sbin/sshd /usr/bin/sshd
+```
+
+Create a cluster on this image, then:
+
+```shell
+./cli ssh connect --cluster=<cluster-id> --releases-dir=./dist
+```
+
+**Symptom.** The bootstrap job succeeds and publishes metadata, so the client prints
+`Connected!` — and then the connection drops. The server can't launch `/usr/sbin/sshd` for the
+incoming connection and holds the websocket open, so historically the `ssh` client **hung**
+until its `ConnectTimeout`. The real error,
+`failed to start SSHD process: ... /usr/sbin/sshd: no such file or directory`, is only written
+to the bootstrap job's **stdout logs** while the job is still `RUNNING` — it is never a failed
+job state.
+
+**With the error-handling improvements** the client aborts after a handshake timeout (no SSH
+banner from the server) with an actionable hint to install `openssh-server`, and exits
+promptly instead of hanging.
+
+**Fix.** Install `openssh-server` in the image (`apt-get install -y openssh-server`).
+
+## Mode 2: container can't run the Python bootstrap
+
+A bare/minimal base that lacks a working Databricks Python runtime. The simplest example is
+`databricksruntime/rbase:16.4-LTS` used directly as the cluster image (it is an R *base* layer;
+notably it has no functioning `/databricks/python` notebook-execution environment).
+
+Create a cluster on `databricksruntime/rbase:16.4-LTS`, then:
+
+```shell
+./cli ssh connect --cluster=<cluster-id> --releases-dir=./dist
+```
+
+**Symptom.** The bootstrap is a Python notebook, but the image can't execute notebook commands,
+so the job fails with `Could not reach driver of cluster <id>`. The SSH server never starts and
+never publishes metadata, so the client fails with
+`server metadata error / ... metadata.json doesn't exist` — **before** the `sshd` step is ever
+reached. (A trivial `print(...)` notebook job submitted to the same cluster fails the same way,
+which is a quick way to confirm the image, not the SSH feature, is at fault.)
+
+**With the error-handling improvements** the client fetches the failed run's state message,
+notebook error/trace, and run-page URL and shows them instead of the generic metadata error.
+
+**Fix.** Build on a notebook-capable base (e.g. `databricksruntime/standard:...`) or otherwise
+provide a working Databricks Python environment, in addition to `openssh-server`.
+
+## Working control
+
+`databricksruntime/standard:16.4-LTS` ships **both** a working Python runtime **and** `sshd`,
+so `ssh connect` to a cluster on it succeeds end to end. Use it as a baseline to confirm your
+workspace, cluster spec, and dev build are healthy before reproducing a failure mode.
+
+## Inspecting the bootstrap job logs
+
+`ssh connect` prints `Job submitted successfully with run ID: <id>`. Inspect it with:
+
+```shell
+databricks jobs get-run <id>              # open run_page_url in the UI
+databricks jobs get-run-output <task-run-id>   # task-run-id = .tasks[0].run_id of the run
+```
+
+Caveat: for a **running** server task, `get-run-output`'s `logs`/`error` are not populated —
+the `sshd` error from [Mode 1](#mode-1-container-missing-the-openssh-server-sshd) lives in the
+live notebook cell stdout / driver logs, not the Jobs run-output API. A failed run from
+[Mode 2](#mode-2-container-cant-run-the-python-bootstrap) does populate the run's state message
+and error.
+
+## Reproducing locally, without a workspace
+
+The proxy-layer behaviors have unit tests that don't need a cluster:
+
+- `internal/proxy/client_server_test.go`
+  - `TestClientExitsWhenServerCommandFails` — server can't launch its command and closes the
+    connection; the client exits promptly.
+  - `TestClientTimesOutWhenServerSendsNothing` — server holds the connection open and sends
+    nothing (the Mode 1 shape); the client aborts on the handshake timeout.
+- `internal/client/client_internal_test.go` — formatting of a failed bootstrap run's error
+  (state message, error trace, run-page URL) using SDK mocks.
+
+```shell
+go test ./experimental/ssh/...
+```
@@ -23,6 +23,9 @@ databricks ssh connect --cluster=id
 ./cli ssh connect --cluster=<id> --releases-dir=./dist --debug # or modify ssh config accordingly
 ```
 
+To reproduce and test the known `ssh connect` failure modes (container missing `sshd`, or a
+container that can't run the Python bootstrap), see [FAILURE_MODES.md](./FAILURE_MODES.md).
+
 ## Design
 
 High level:
 
@@ -489,16 +489,19 @@ func getServerMetadata(ctx context.Context, client *databricks.WorkspaceClient,
 	return wsMetadata.Port, string(bodyBytes), effectiveClusterID, nil
 }
 
-func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient, version, secretScopeName string, opts ClientOptions) error {
+// submitSSHTunnelJob submits the bootstrap job and waits for the SSH server task to start.
+// It returns the job run ID (when known) so callers can fetch and surface the run's error
+// details if the server never comes up.
+func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient, version, secretScopeName string, opts ClientOptions) (int64, error) {
 	sessionID := opts.SessionIdentifier()
 	contentDir, err := sshWorkspace.GetWorkspaceContentDir(ctx, client, version, sessionID)
 	if err != nil {
-		return fmt.Errorf("failed to get workspace content directory: %w", err)
+		return 0, fmt.Errorf("failed to get workspace content directory: %w", err)
 	}
 
 	err = client.Workspace.MkdirsByPath(ctx, contentDir) //nolint:staticcheck // Deprecated in SDK v0.127.0. Migration to WorkspaceHierarchyService tracked separately.
 	if err != nil {
-		return fmt.Errorf("failed to create directory in the remote workspace: %w", err)
+		return 0, fmt.Errorf("failed to create directory in the remote workspace: %w", err)
 	}
 
 	sshTunnelJobName := "ssh-server-bootstrap-" + sessionID
@@ -514,7 +517,7 @@ func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient,
 		Overwrite: true,
 	})
 	if err != nil {
-		return fmt.Errorf("failed to create ssh-tunnel notebook: %w", err)
+		return 0, fmt.Errorf("failed to create ssh-tunnel notebook: %w", err)
 	}
 
 	baseParams := map[string]string{
@@ -569,12 +572,13 @@ func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient,
 
 	waiter, err := client.Jobs.Submit(ctx, submitRequest)
 	if err != nil {
-		return fmt.Errorf("failed to submit job: %w", err)
+		return 0, fmt.Errorf("failed to submit job: %w", err)
 	}
 
 	cmdio.LogString(ctx, fmt.Sprintf("Job submitted successfully with run ID: %d", waiter.RunId))
 
-	return waitForJobToStart(ctx, client, waiter.RunId, opts.TaskStartupTimeout)
+	// Return the run ID even on error so callers can fetch the run's failure details.
+	return waiter.RunId, waitForJobToStart(ctx, client, waiter.RunId, opts.TaskStartupTimeout)
 }
 
 func spawnSSHClient(ctx context.Context, userName, privateKeyPath string, serverPort int, clusterID string, opts ClientOptions) error {
@@ -610,7 +614,18 @@ func spawnSSHClient(ctx context.Context, userName, privateKeyPath string, server
 	sshCmd.Stdout = os.Stdout
 	sshCmd.Stderr = os.Stderr
 
-	return sshCmd.Run()
+	err = sshCmd.Run()
+	// ssh reserves exit code 255 for its own connection-level failures (a remote command's exit
+	// code is passed through as-is, 0-254). The most common cause here is the cluster's container
+	// image missing an OpenSSH server, so the server can't launch sshd once we connect — the
+	// connection then drops right after "Connected!". Surface an actionable hint rather than
+	// leaving the user with ssh's opaque "Connection closed" message.
+	if exitErr, ok := errors.AsType[*exec.ExitError](err); ok && exitErr.ExitCode() == 255 {
+		cmdio.LogString(ctx, cmdio.Yellow(ctx, "The SSH connection closed unexpectedly. If it dropped right after connecting, "+
+			"the cluster's container image is likely missing an OpenSSH server: ensure 'openssh-server' "+
+			"is installed (it provides /usr/sbin/sshd), then check the SSH server job run logs."))
+	}
+	return err
 }
 
 func runSSHProxy(ctx context.Context, client *databricks.WorkspaceClient, serverPort int, clusterID string, opts ClientOptions) error {
@@ -691,9 +706,10 @@ func waitForJobToStart(ctx context.Context, client *databricks.WorkspaceClient,
 			return sshTask, nil
 		}
 
-		// Check for terminal failure states
+		// Check for terminal failure states. Surface the run's actual error (e.g. a notebook
+		// traceback or "Could not reach driver") instead of a generic message.
 		if currentState == jobs.RunLifecycleStateV2StateTerminated {
-			return nil, retries.Halt(errors.New("task terminated before reaching running state"))
+			return nil, retries.Halt(fmt.Errorf("ssh server bootstrap job failed:\n%s", describeRunFailure(ctx, client, runID)))
 		}
 
 		// Continue polling for other states
@@ -703,6 +719,94 @@ func waitForJobToStart(ctx context.Context, client *databricks.WorkspaceClient,
 	return err
 }
 
+// maxRunFailureTraceBytes bounds how much of a failed run's error trace we print to the
+// terminal; the full output is always available via the run page URL.
+const maxRunFailureTraceBytes = 2000
+
+// describeRunFailure fetches a failed bootstrap run's error details and formats them for the
+// terminal. It is best-effort: any API error is folded into the returned text rather than
+// propagated, so callers can always embed the result in their own error.
+func describeRunFailure(ctx context.Context, client *databricks.WorkspaceClient, runID int64) string {
+	if runID == 0 {
+		return "  (no job run ID available)"
+	}
+
+	run, err := client.Jobs.GetRun(ctx, jobs.GetRunRequest{RunId: runID})
+	if err != nil {
+		return fmt.Sprintf("  could not fetch job run %d: %v", runID, err)
+	}
+
+	var b strings.Builder
+
+	// Locate the SSH server task to read its termination reason and per-task run output.
+	var sshTask *jobs.RunTask
+	for i := range run.Tasks {
+		if run.Tasks[i].TaskKey == sshServerTaskKey {
+			sshTask = &run.Tasks[i]
+			break
+		}
+	}
+
+	if sshTask != nil && sshTask.Status != nil && sshTask.Status.TerminationDetails != nil {
+		if msg := strings.TrimSpace(sshTask.Status.TerminationDetails.Message); msg != "" {
+			fmt.Fprintf(&b, "  %s\n", msg)
+		}
+	}
+
+	// The notebook error/traceback carries the real cause (e.g. a Python exception).
+	outputRunID := runID
+	if sshTask != nil && sshTask.RunId != 0 {
+		outputRunID = sshTask.RunId
+	}
+	if output, err := client.Jobs.GetRunOutput(ctx, jobs.GetRunOutputRequest{RunId: outputRunID}); err == nil && output != nil {
+		if e := strings.TrimSpace(output.Error); e != "" {
+			fmt.Fprintf(&b, "  %s\n", e)
+		}
+		if trace := strings.TrimSpace(output.ErrorTrace); trace != "" {
+			fmt.Fprintf(&b, "%s\n", truncateTail(trace, maxRunFailureTraceBytes))
+		}
+	}
+
+	if run.RunPageUrl != "" {
+		fmt.Fprintf(&b, "  See the full job logs: %s", run.RunPageUrl)
+	}
+
+	if b.Len() == 0 {
+		return fmt.Sprintf("  job run %d failed; see run details in the workspace", runID)
+	}
+	return strings.TrimRight(b.String(), "\n")
+}
+
+// runFailureIfTerminated reports whether the bootstrap run has reached a terminal state (so the
+// SSH server will never come up), returning a formatted failure description when it has.
+func runFailureIfTerminated(ctx context.Context, client *databricks.WorkspaceClient, runID int64) (string, bool) {
+	if runID == 0 {
+		return "", false
+	}
+	run, err := client.Jobs.GetRun(ctx, jobs.GetRunRequest{RunId: runID})
+	if err != nil {
+		return "", false
+	}
+	for i := range run.Tasks {
+		if run.Tasks[i].TaskKey != sshServerTaskKey {
+			continue
+		}
+		if run.Tasks[i].Status != nil && run.Tasks[i].Status.State == jobs.RunLifecycleStateV2StateTerminated {
+			return describeRunFailure(ctx, client, runID), true
+		}
+		return "", false
+	}
+	return "", false
+}
+
+// truncateTail returns the last maxBytes of s, marking the cut when truncated.
+func truncateTail(s string, maxBytes int) string {
+	if len(s) <= maxBytes {
+		return s
+	}
+	return "  ...\n" + s[len(s)-maxBytes:]
+}
+
 func ensureSSHServerIsRunning(ctx context.Context, client *databricks.WorkspaceClient, version, secretScopeName string, opts ClientOptions) (string, int, string, error) {
 	sessionID := opts.SessionIdentifier()
 	// For dedicated clusters, use clusterID; for serverless, it will be read from metadata
@@ -712,7 +816,7 @@ func ensureSSHServerIsRunning(ctx context.Context, client *databricks.WorkspaceC
 	if errors.Is(err, errServerMetadata) {
 		cmdio.LogString(ctx, "Starting SSH server...")
 
-		err := submitSSHTunnelJob(ctx, client, version, secretScopeName, opts)
+		runID, err := submitSSHTunnelJob(ctx, client, version, secretScopeName, opts)
 		if err != nil {
 			return "", 0, "", fmt.Errorf("failed to submit and start ssh server job: %w", err)
 		}
@@ -729,10 +833,17 @@ func ensureSSHServerIsRunning(ctx context.Context, client *databricks.WorkspaceC
 			if err == nil {
 				cmdio.LogString(ctx, "Health check successful, starting ssh WebSocket connection...")
 				break
-			} else if retries < maxRetries-1 {
+			}
+			// The metadata never appears if the bootstrap job dies after reaching RUNNING.
+			// Surface the job's actual error instead of waiting out the full timeout with a
+			// generic "metadata.json doesn't exist" message.
+			if failure, terminated := runFailureIfTerminated(ctx, client, runID); terminated {
+				return "", 0, "", fmt.Errorf("ssh server bootstrap job failed:\n%s", failure)
+			}
+			if retries < maxRetries-1 {
 				time.Sleep(2 * time.Second)
 			} else {
-				return "", 0, "", fmt.Errorf("failed to start the ssh server: %w", err)
+				return "", 0, "", fmt.Errorf("failed to start the ssh server: %w\n%s", err, describeRunFailure(ctx, client, runID))
 			}
 		}
 	} else if err != nil {