Add telemetry event for SSH tunnel connections#4881
Conversation
Move telemetry event emission to a deferred closure so it fires on all exit paths (success, server-start failure, binary-upload failure, cluster-state check failure, etc.), not only on the two explicit call sites that existed before. Co-authored-by: Isaac
6b048a9 to
0d5be27
Compare
Co-authored-by: Isaac
|
Rebased onto current main and resolved the conflict in Additionally, changed telemetry emission from two explicit call sites to a Verified locally that the telemetry event fires on invocation for both dedicated and serverless modes. Ready for re-review @andrewnester @denik @pietern @shreyas-goenka. |
rclarey
left a comment
There was a problem hiding this comment.
LGTM, just one point about clarity of the startup time ms
| serverStartTime := time.Now() | ||
| userName, serverPort, clusterID, err = ensureSSHServerIsRunning(ctx, client, version, secretScopeName, opts) |
There was a problem hiding this comment.
Do we want to differentiate between
- submitting a job for the SSH server and waiting for it to start
- checking the server metadata.json and confirming the server is already running (so skipping submitting a new job)
We can probably assume that reported times below a certain threshold are all (2), but if we specifically want to track serverless compute cold start then I think it's better to clearly distinguish these two cases somehow
There was a problem hiding this comment.
we will evolve this telemetry along the way - right now it is good enough for the initial metrics collection
|
Commit: 4fd8116 |
|
Commit: b512028 |
Changes
Add a
SshTunnelEventtelemetry event that fires once perssh connect/ssh proxy/ IDE-mode invocation inexperimental/ssh/internal/client/client.go.Captured fields:
compute_type— dedicated cluster or serverlessaccelerator_type— GPU accelerator for serverless (empty for dedicated)ide_type— IDE that initiated the connection (e.g.vscode,cursor), empty for raw SSH / proxyclient_mode—SSH_CLIENT,PROXY, orIDEis_reconnect— true when invoked withServerMetadata(a follow-up connection re-using an existing server)auto_start_cluster— whether--auto-start-clusterwas setserver_start_time_ms— time spent insideensureSSHServerIsRunning; 0 on reconnectis_success— set to true only after the connection is fully establishedNew files:
libs/telemetry/protos/ssh_tunnel.go— event struct and theSshTunnelComputeType/SshTunnelClientModestring-enum types, following the same pattern asBundleModeinenum.go.DatabricksCliLoginlibs/telemetry/protos/frontend_log.go.Emission is done from a
deferregistered before the first early-return path insideRun, so the event fires on every exit (cluster-state failure, binary-upload failure, metadata-parse failure, success) rather than only the original two explicit call sites.Why
We have no visibility into how the experimental SSH tunnel feature is being used. We want to answer basic questions before promoting it out of experimental:
The
deferpattern is deliberate: an event emitted only on the happy path would under-report failures and bias all of the above metrics.Tests
task test-exp-sshpasses on linux/macos/windows (all green in CI).task lintclean.No new test files: this repo has no existing tests asserting on telemetry event payloads —
BundleInitEventandBundleDeployEventare emitted untested as well (bundle/phases/telemetry.go,libs/template/writer.go). Adding a fixture-based test only forSshTunnelEventwould be inconsistent with the rest of the codebase; that work should land as a separate PR if we want it.No
NEXT_CHANGELOG.mdentry: the feature is underexperimental/and the change is internal telemetry, not a user-visible behavior change.