Problem
When mixed_instances_policy changes in runner_worker_docker_autoscaler_asg (e.g. changing instance types from c6a.4xlarge to c6a.8xlarge), AWS triggers an instance refresh on the worker ASG that terminates existing instances immediately. The fleeting plugin on the manager is not notified, so jobs running on those workers fail with runner_system_failure (SSH connection timeout).
The upgrade_strategy = "rolling" option exists but does not coordinate with the GitLab runner's job awareness — AWS simply terminates instances without checking if they have active jobs.
Current behavior
- User changes instance types in
runner_worker_docker_autoscaler_asg.types
terraform apply updates the launch template and mixed_instances_policy
- AWS instance refresh terminates old workers that no longer match
- In-flight jobs on those workers fail with SSH timeout /
runner_system_failure
- Failures continue for ~30 minutes until all old workers are replaced
Expected behavior
Workers with running jobs should be drained gracefully before termination — either by waiting for jobs to complete or by signaling the fleeting plugin to stop scheduling new work on those instances.
Suggested approach
Add an ASG lifecycle hook (autoscaling:EC2_INSTANCE_TERMINATING) on the worker ASG — similar to the existing terminate-agent-hook on the manager ASG. The hook could:
- Query the runner manager to check if the instance has active jobs (via the fleeting plugin API or by checking active containers on the instance)
- Send heartbeats to extend the termination timeout while jobs are still running
- Allow termination once the instance is idle
Alternatively, coordinate with the fleeting plugin's max_use_count / idle timeout so that instances scheduled for termination are marked as "do not schedule new jobs" and drain naturally.
Workaround
Pause the runner in GitLab Admin before applying, wait for in-flight jobs to finish, then apply and unpause. Or apply during off-hours when no workers are running.
Environment
- Module version: 9.5.0
- Executor: docker-autoscaler
upgrade_strategy: rolling (default)
Problem
When
mixed_instances_policychanges inrunner_worker_docker_autoscaler_asg(e.g. changing instance types fromc6a.4xlargetoc6a.8xlarge), AWS triggers an instance refresh on the worker ASG that terminates existing instances immediately. The fleeting plugin on the manager is not notified, so jobs running on those workers fail withrunner_system_failure(SSH connection timeout).The
upgrade_strategy = "rolling"option exists but does not coordinate with the GitLab runner's job awareness — AWS simply terminates instances without checking if they have active jobs.Current behavior
runner_worker_docker_autoscaler_asg.typesterraform applyupdates the launch template and mixed_instances_policyrunner_system_failureExpected behavior
Workers with running jobs should be drained gracefully before termination — either by waiting for jobs to complete or by signaling the fleeting plugin to stop scheduling new work on those instances.
Suggested approach
Add an ASG lifecycle hook (
autoscaling:EC2_INSTANCE_TERMINATING) on the worker ASG — similar to the existingterminate-agent-hookon the manager ASG. The hook could:Alternatively, coordinate with the fleeting plugin's
max_use_count/ idle timeout so that instances scheduled for termination are marked as "do not schedule new jobs" and drain naturally.Workaround
Pause the runner in GitLab Admin before applying, wait for in-flight jobs to finish, then apply and unpause. Or apply during off-hours when no workers are running.
Environment
upgrade_strategy: rolling (default)