
Implement SSH connection pool for runner instances #3936

Merged: r4victor merged 26 commits into master from issue_3920_instances_ssh_pool on Jun 8, 2026

r4victor commented Jun 5, 2026

Part of #3920
Closes #2933

Add InstanceConnectionPool/InstanceConnection classes that allow reusing SSH connections to runner instances for shim and runner API port forwarding. Previously, the dstack server had to re-establish SSH connections for every request, which increased CPU load and slowed down processing. The runner_ssh_tunnel() decorator is updated to use the pool, so its callers are mostly unchanged.
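To illustrate the general idea, here is a minimal sketch of a per-instance connection pool with a decorator on top. It is not the code from this PR: `FakeConnection`, `ConnectionPool`, `with_pooled_connection`, and `healthcheck` are hypothetical names, and no real SSH connection is opened. It only shows how cached connections could be reused across calls and re-created once they are no longer usable.

```python
import functools
import threading
from typing import Callable, Dict


class FakeConnection:
    """Hypothetical stand-in for an SSH connection with port forwarding."""

    opened = 0  # counts how many connections were established

    def __init__(self, instance_id: str) -> None:
        self.instance_id = instance_id
        self.alive = True
        FakeConnection.opened += 1

    def close(self) -> None:
        self.alive = False


class ConnectionPool:
    """Keeps at most one live connection per instance and reuses it."""

    def __init__(self, factory: Callable[[str], FakeConnection]) -> None:
        self._factory = factory
        self._connections: Dict[str, FakeConnection] = {}
        self._lock = threading.Lock()

    def get(self, instance_id: str) -> FakeConnection:
        with self._lock:
            conn = self._connections.get(instance_id)
            if conn is None or not conn.alive:
                # Open a connection only when no usable one is cached.
                conn = self._factory(instance_id)
                self._connections[instance_id] = conn
            return conn

    def drop(self, instance_id: str) -> None:
        with self._lock:
            conn = self._connections.pop(instance_id, None)
        if conn is not None:
            conn.close()


pool = ConnectionPool(FakeConnection)


def with_pooled_connection(func):
    """Decorator that passes a pooled connection as the first argument."""

    @functools.wraps(func)
    def wrapper(instance_id: str, *args, **kwargs):
        return func(pool.get(instance_id), *args, **kwargs)

    return wrapper


@with_pooled_connection
def healthcheck(conn: FakeConnection) -> str:
    # A real client would call the shim or runner API over the forwarded port.
    return f"ok:{conn.instance_id}"
```

Calling `healthcheck("i-1")` twice opens a single connection in this sketch. After `pool.drop("i-1")`, the next call opens a new one.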

Impact

The run startup time (#3920) on a provisioned instance with a pulled image went from ~7s to 1-2s, since it was mostly limited by SSH connection re-creation:

[2026-06-05 14:32:29] [👤admin] [run blue-dog-1] Run submitted. Status: SUBMITTED
[2026-06-05 14:32:29] [job blue-dog-1-0-0] Job created on run submission. Status: SUBMITTED
[2026-06-05 14:32:29] [instance cloud-fleet-0] Instance status changed IDLE -> BUSY
[2026-06-05 14:32:29] [job blue-dog-1-0-0, instance cloud-fleet-0] Job assigned to instance. Instance blocks: 1/1 busy
[2026-06-05 14:32:29] [job blue-dog-1-0-0] Job status changed SUBMITTED -> PROVISIONING
[2026-06-05 14:32:30] [job blue-dog-1-0-0] Job status changed PROVISIONING -> PULLING
[2026-06-05 14:32:31] [job blue-dog-1-0-0] Job status changed PULLING -> RUNNING
[2026-06-05 14:32:31] [run blue-dog-1] Run status changed SUBMITTED -> RUNNING

Also, CPU utilization on the dstack server machine no longer spikes from constantly opening many SSH connections to many instances (#2933).

Notes and implementation details

  • The pool is unbounded. One active instance is expected to add ~2-10MB of RAM usage. The pool is disabled by default and can be enabled with DSTACK_SERVER_SSH_POOL_ENABLED. The plan is to test the pool in the next release, then enable it by default and document how to opt out if RAM usage is a concern.
  • The pool is not currently intended for arbitrary port forwarding, only for shim and runner ports. It's not used to forward service ports for probes or router-worker communication. This can probably be generalized later.
  • The pool is not used for container-based backends. (Connections from dstack-server to the runner's sshd are expected to be short-lived, since the inactivity_duration feature distinguishes user connections from server connections by their duration.)
  • The pool is incompatible with multiple dstack server processes on one host with the same DSTACK_SERVER_DIR. It's expected that the pool is disabled in such setups. (It's already kinda half-working with gateway connections.)
  • Dropped all params from runner_ssh_tunnel(), including retries=3. Retries seem to be legacy here and are no longer needed after "Introduce JOB_DISCONNECTED_RETRY_TIMEOUT" (#2627), which gives a 2m timeout before a running job is kicked from an unreachable instance. Added and documented the DSTACK_SERVER_SSH_CONNECT_TIMEOUT env var for increasing the default ConnectTimeout if server-instance latency is always >3s.
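As a rough illustration of the two environment variables mentioned in the notes above, the sketch below reads them and builds the matching OpenSSH option. It is not the server's settings code. The `read_ssh_settings` and `ssh_options` helpers, the accepted truthy values, and the 3-second default are all assumptions made for this example.

```python
import os
from typing import Dict, List, Mapping, Union

# Assumption: which strings count as "enabled" is not specified in this PR.
_TRUTHY = {"1", "true", "yes", "on"}


def read_ssh_settings(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Union[bool, int]]:
    """Read pool-related settings from the environment (hypothetical helper)."""
    raw_enabled = env.get("DSTACK_SERVER_SSH_POOL_ENABLED", "")
    pool_enabled = raw_enabled.strip().lower() in _TRUTHY  # off by default
    # Assumption: a default of 3 seconds is used here for illustration only.
    connect_timeout = int(env.get("DSTACK_SERVER_SSH_CONNECT_TIMEOUT", "3"))
    return {"pool_enabled": pool_enabled, "connect_timeout": connect_timeout}


def ssh_options(settings: Mapping[str, Union[bool, int]]) -> List[str]:
    """Build ssh CLI options; ConnectTimeout is a standard OpenSSH option."""
    return ["-o", f"ConnectTimeout={settings['connect_timeout']}"]
```

With `DSTACK_SERVER_SSH_POOL_ENABLED=true` and `DSTACK_SERVER_SSH_CONNECT_TIMEOUT=10`, this sketch reports the pool as enabled and passes `ConnectTimeout=10` to ssh.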

r4victor changed the title from "Implement SSH pool for runner instances" to "Implement SSH connection pool for runner instances" on Jun 5, 2026
r4victor requested review from jvstme and un-def on June 5, 2026 09:55
r4victor merged commit 1203e3e into master on Jun 8, 2026
25 checks passed
r4victor deleted the issue_3920_instances_ssh_pool branch on June 8, 2026 10:26