Problem
When a spot instance is reclaimed while a job is assigned to it, the runner keeps retrying SSH connections to the dead instance. With the default 10-minute dial timeout and 3 retries, the job hangs for ~30 minutes before it fails with runner_system_failure.
The [runners.autoscaler.connector_config] section supports a timeout field that controls how long the runner waits for an SSH connection, but the module doesn't expose it.
Current template
[runners.autoscaler.connector_config]
username = "${connector_config_user}"
use_external_addr = false
Requested change
Expose timeout (and ideally use_static_credentials) in the runner_worker_docker_autoscaler variable:
variable "runner_worker_docker_autoscaler" {
  type = object({
    ...
    connector_config_user    = optional(string, "ec2-user")
    connector_config_timeout = optional(string, "") # e.g. "2m"
    ...
  })
}
Template would become:
[runners.autoscaler.connector_config]
username = "${connector_config_user}"
use_external_addr = false
%{ if connector_config_timeout != "" ~}
timeout = "${connector_config_timeout}"
%{ endif ~}
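For illustration, a caller could then set the new field roughly as follows. This is a minimal sketch that assumes the change is implemented as proposed: connector_config_timeout is the hypothetical attribute requested here and does not exist in module version 9.5.0, and the other attributes of the object are omitted.

```hcl
# Sketch only: connector_config_timeout is the attribute proposed in this
# issue, not an existing module input. Other attributes are omitted.
runner_worker_docker_autoscaler = {
  connector_config_user    = "ec2-user"
  connector_config_timeout = "2m" # SSH dial timeout per attempt
}
```

Leaving the field at its default of "" would keep timeout out of the rendered config.toml, so existing deployments would keep the runner's built-in default.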
Impact
With timeout = "2m", a spot reclaim would fail the job in ~6 minutes (2 min × 3 retries) instead of ~30 minutes. Combined with retry: { max: 2, when: [runner_system_failure] } in CI config, jobs would recover in ~7 minutes total instead of failing the entire pipeline after 30 minutes.
Environment
- Module version: 9.5.0
- Executor: docker-autoscaler
- Instance types: spot, mixed pool with price-capacity-optimized allocation