Skip to content

Add telemetry event for SSH tunnel connections#4881

Merged
anton-107 merged 4 commits into
mainfrom
antonnek/ssh-tunnel-telemetry2
Jun 2, 2026
Merged

Add telemetry event for SSH tunnel connections#4881
anton-107 merged 4 commits into
mainfrom
antonnek/ssh-tunnel-telemetry2

Conversation

@anton-107

@anton-107 anton-107 commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

Changes

Add a SshTunnelEvent telemetry event that fires once per ssh connect / ssh proxy / IDE-mode invocation in experimental/ssh/internal/client/client.go.

Captured fields:

  • compute_type — dedicated cluster or serverless
  • accelerator_type — GPU accelerator for serverless (empty for dedicated)
  • ide_type — IDE that initiated the connection (e.g. vscode, cursor), empty for raw SSH / proxy
  • client_modeSSH_CLIENT, PROXY, or IDE
  • is_reconnect — true when invoked with ServerMetadata (a follow-up connection re-using an existing server)
  • auto_start_cluster — whether --auto-start-cluster was set
  • server_start_time_ms — time spent inside ensureSSHServerIsRunning; 0 on reconnect
  • is_success — set to true only after the connection is fully established

New files:

  • libs/telemetry/protos/ssh_tunnel.go — event struct and the SshTunnelComputeType / SshTunnelClientMode string-enum types, following the same pattern as BundleMode in enum.go.
  • Wires the event into DatabricksCliLog in libs/telemetry/protos/frontend_log.go.

Emission is done from a defer registered before the first early-return path inside Run, so the event fires on every exit (cluster-state failure, binary-upload failure, metadata-parse failure, success) rather than only the original two explicit call sites.

Why

We have no visibility into how the experimental SSH tunnel feature is being used. We want to answer basic questions before promoting it out of experimental:

  • Dedicated vs serverless split
  • IDE breakdown (raw SSH client vs proxy vs IDE-initiated, and which IDEs)
  • Success rate and where failures cluster
  • How long server startup takes on serverless (cold start signal)

The defer pattern is deliberate: an event emitted only on the happy path would under-report failures and bias all of the above metrics.

Tests

  • task test-exp-ssh passes on linux/macos/windows (all green in CI).
  • task lint clean.
  • Verified locally that the event is recorded on the in-memory telemetry logger for both dedicated and serverless invocations, including failure paths (cluster-state check failure, binary upload failure).

No new test files: this repo has no existing tests asserting on telemetry event payloads — BundleInitEvent and BundleDeployEvent are emitted untested as well (bundle/phases/telemetry.go, libs/template/writer.go). Adding a fixture-based test only for SshTunnelEvent would be inconsistent with the rest of the codebase; that work should land as a separate PR if we want it.

No NEXT_CHANGELOG.md entry: the feature is under experimental/ and the change is internal telemetry, not a user-visible behavior change.

anton-107 added 2 commits May 8, 2026 15:58
Move telemetry event emission to a deferred closure so it fires on
all exit paths (success, server-start failure, binary-upload failure,
cluster-state check failure, etc.), not only on the two explicit call
sites that existed before.

Co-authored-by: Isaac
@anton-107 anton-107 force-pushed the antonnek/ssh-tunnel-telemetry2 branch from 6b048a9 to 0d5be27 Compare May 8, 2026 14:26
@anton-107 anton-107 temporarily deployed to test-trigger-is May 8, 2026 14:27 — with GitHub Actions Inactive
@anton-107 anton-107 temporarily deployed to test-trigger-is May 8, 2026 14:27 — with GitHub Actions Inactive
@anton-107 anton-107 temporarily deployed to test-trigger-is May 8, 2026 14:38 — with GitHub Actions Inactive
@anton-107 anton-107 temporarily deployed to test-trigger-is May 8, 2026 14:38 — with GitHub Actions Inactive
@anton-107

Copy link
Copy Markdown
Contributor Author

Rebased onto current main and resolved the conflict in client.go (new serverless CPU error handling block from main).

Additionally, changed telemetry emission from two explicit call sites to a defer-based pattern so the SshTunnelEvent fires on every invocation path — including early failures (cluster state check, binary upload, metadata parsing) that previously skipped telemetry. Also fixed a gofmt alignment issue in ssh_tunnel.go.

Verified locally that the telemetry event fires on invocation for both dedicated and serverless modes.

Ready for re-review @andrewnester @denik @pietern @shreyas-goenka.

@anton-107 anton-107 requested a review from simonfaltum May 8, 2026 14:59
@anton-107 anton-107 requested a review from rclarey June 2, 2026 07:31

@rclarey rclarey left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one point about clarity of the startup time ms

Comment on lines +321 to 322
serverStartTime := time.Now()
userName, serverPort, clusterID, err = ensureSSHServerIsRunning(ctx, client, version, secretScopeName, opts)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to differentiate between

  1. submitting a job for the SSH server and waiting for it to start
  2. checking the server metadata.json and confirming the server is already running (so skipping submitting a new job)

We can probably assume that reported times below a certain threshold are all (2), but if we specifically want to track serverless compute cold start then I think it's better to clearly distinguish these two cases somehow

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will evolve this telemetry along the way - right now it is good enough for the initial metrics collection

@anton-107 anton-107 enabled auto-merge June 2, 2026 16:12
@anton-107 anton-107 temporarily deployed to test-trigger-is June 2, 2026 16:12 — with GitHub Actions Inactive
@anton-107 anton-107 temporarily deployed to test-trigger-is June 2, 2026 16:12 — with GitHub Actions Inactive
@anton-107 anton-107 added this pull request to the merge queue Jun 2, 2026
@eng-dev-ecosystem-bot

Copy link
Copy Markdown
Collaborator

Commit: 4fd8116

Run: 26832629765

Merged via the queue into main with commit b512028 Jun 2, 2026
25 checks passed
@anton-107 anton-107 deleted the antonnek/ssh-tunnel-telemetry2 branch June 2, 2026 17:05
@eng-dev-ecosystem-bot

Copy link
Copy Markdown
Collaborator

Commit: b512028

Run: 26835521351

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants