experimental/ssh: wait full startup timeout for server health check#5807
experimental/ssh: wait full startup timeout for server health check#5807TanishqDatabricks wants to merge 1 commit into
Conversation
The post-RUNNING health check against the driver-proxy /metadata endpoint used a fixed 30 x 2s = 60s window. After the bootstrap task reaches RUNNING the driver proxy still answers 503 until the container's HTTP endpoint is reachable; with a custom --base-environment that install happens post-RUNNING, so warmup routinely outlasts 60s and `ssh connect` fails with a driver-proxy 503 even though the server comes up fine. Poll with retries.Poll for the same accelerator-aware budget already used for the task to start (10m CPU / 45m GPU), backing off between attempts instead of hammering the proxy every 2s during a long wait. Extract formatServerStartError so a terminated bootstrap job returns its failure trace once, while only a timeout appends run diagnostics. Co-authored-by: Isaac
cc2f502 to
b99d10c
Compare
Waiting for approvalBased on git history, these people are best suited to review:
Eligible reviewers: Suggestions based on git history. See OWNERS for ownership rules. |
Integration test reportCommit: b99d10c
23 interesting tests: 13 SKIP, 10 RECOVERED
Top 8 slowest tests (at least 2 minutes):
|
|
Closing this — superseded by upcoming work that rejects unsupported serverless environments up front. Investigation showed the health-check timeout wasn't the actual fix: |
Changes
databricks ssh connecthealth-checks the bootstrap SSH server after its job task reachesRUNNINGby polling the driver-proxy/metadataendpoint. That poll used a hardcoded30 × 2s = 60swindow.The problem: after the task reaches
RUNNING, the driver proxy still answers/metadatawith 503 until the container's HTTP endpoint is actually reachable. With a custom--base-environment, that environment install happens after the task isRUNNING, so warmup routinely outlasts 60s — andssh connectfails withserver is not ok, status code 503even though the server comes up fine moments later. The 503 originates in the serverless driver-proxy layer, not our server binary (serveMetadataonly ever returns 200/500).This replaces the fixed loop with
retries.Pollbounded by the same accelerator-aware budget already used to wait for the task to start (TaskStartupTimeout: 10 min CPU / 45 min GPU). The SDK poller backs off between attempts (linear, capped ~10s, jittered) instead of hammering the proxy every 2s during a long install, and it still succeeds immediately on the first attempt when the server is already warm.formatServerStartErroris extracted so the two failure modes stay distinct: a halted poll (bootstrap job terminated) already carries the job's failure trace and is returned as-is, while a timeout appendsdescribeRunFailurediagnostics. This preserves the prior behavior and avoids printing the failure trace twice.Why
Serverless SSH connections with a custom base environment (introduced in #5706) frequently fail on first connect with a driver-proxy 503, because installing the base environment pushes health-readiness past the fixed 60s window. Sizing the health-check wait to match the provisioning wait the user already opted into fixes this without a new flag.
Tests
TestFormatServerStartErrorTimeoutAppendsRunFailureandTestFormatServerStartErrorHaltReturnedAsIscover both error-routing branches (timeout appends run diagnostics; halt returns as-is with the trace exactly once).go test ./experimental/ssh/...passes.TestAccept/sshacceptance suite passes:ok github.com/databricks/cli/acceptance 481.255s.Local run timing
Measured locally (Apple Silicon). Note the acceptance figure includes compiling the full CLI binary on a cold build cache:
go mod download(deps, warm module cache)go test ./experimental/ssh/...(unit, compile + run)TestAccept/ssh(acceptance, incl. full CLI compile)The bulk of the acceptance wall-clock is the one-time CLI/dependency compilation on a cold
go-buildcache; the ssh test execution itself is a small fraction of that.This pull request and its description were written by Isaac.