Worker ASG instance refresh terminates instances with running jobs #1375

@lorenzstorm1

Problem

When mixed_instances_policy changes in runner_worker_docker_autoscaler_asg (e.g. changing instance types from c6a.4xlarge to c6a.8xlarge), AWS triggers an instance refresh on the worker ASG that terminates existing instances immediately. The fleeting plugin on the manager is not notified, so jobs running on those workers fail with runner_system_failure (SSH connection timeout).

The upgrade_strategy = "rolling" option exists, but it is not aware of the jobs the GitLab runner has in flight: AWS simply terminates instances without checking whether they have active jobs.

Current behavior

  1. User changes instance types in runner_worker_docker_autoscaler_asg.types
  2. terraform apply updates the launch template and mixed_instances_policy
  3. AWS instance refresh terminates old workers that no longer match
  4. In-flight jobs on those workers fail with SSH timeout / runner_system_failure
  5. Failures continue for ~30 minutes until all old workers are replaced

Expected behavior

Workers with running jobs should be drained gracefully before termination — either by waiting for jobs to complete or by signaling the fleeting plugin to stop scheduling new work on those instances.

Suggested approach

Add an ASG lifecycle hook (autoscaling:EC2_INSTANCE_TERMINATING) on the worker ASG — similar to the existing terminate-agent-hook on the manager ASG. The hook could:

  1. Query the runner manager to check if the instance has active jobs (via the fleeting plugin API or by checking active containers on the instance)
  2. Send heartbeats to extend the termination timeout while jobs are still running
  3. Allow termination once the instance is idle
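
The three hook steps above might look roughly like this in Python. This is only a sketch, not code from the module: `has_active_jobs` is a hypothetical callback (how to detect jobs, via the fleeting plugin or by listing containers, is still an open question), `autoscaling` is assumed to be a boto3 Auto Scaling client, and the function and parameter names are illustrative.

```python
import time
from typing import Callable


def drain_before_termination(
    autoscaling,  # assumed: boto3.client("autoscaling") or an object with the same two methods
    asg_name: str,
    hook_name: str,
    instance_id: str,
    has_active_jobs: Callable[[str], bool],  # hypothetical job check (step 1)
    heartbeat_interval_s: float = 60.0,
    max_wait_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Hold a terminating worker until it is idle, then release it.

    Returns "idle" if the jobs finished, "timed_out" if max_wait_s was reached.
    """
    waited = 0.0
    while has_active_jobs(instance_id):
        if waited >= max_wait_s:
            outcome = "timed_out"
            break
        # Step 2: extend the lifecycle hook timeout while jobs are still running.
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=asg_name,
            InstanceId=instance_id,
        )
        sleep(heartbeat_interval_s)
        waited += heartbeat_interval_s
    else:
        # Loop ended without break: the instance reported no active jobs.
        outcome = "idle"

    # Step 3: tell the ASG it may proceed with the termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
    return outcome
```

The heartbeat interval would need to stay below the hook's heartbeat timeout, and `max_wait_s` is there because AWS itself limits how long an instance can be held in the wait state.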

Alternatively, coordinate with the fleeting plugin's max_use_count / idle timeout so that instances scheduled for termination are marked as "do not schedule new jobs" and drain naturally.

Workaround

Pause the runner in GitLab Admin before applying, wait for in-flight jobs to finish, then apply and unpause. Alternatively, apply during off-hours when no workers are running.
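
For anyone who wants to script the pause-and-wait part of this workaround, here is a rough sketch against the GitLab REST API (`PUT /runners/:id` with `paused`, and `GET /runners/:id/jobs?status=running`). The base URL, runner ID, and token are placeholders, and the helper names are made up for illustration. The job list is paginated, so the count is only used as an "anything still running?" check.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable


def _send(req: urllib.request.Request) -> bytes:
    """Default transport: perform the HTTP request and return the body."""
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def set_runner_paused(base_url, runner_id, token, paused: bool, send=_send) -> None:
    """Pause or unpause a runner via PUT /api/v4/runners/:id."""
    body = urllib.parse.urlencode({"paused": "true" if paused else "false"}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/v4/runners/{runner_id}",
        data=body,
        method="PUT",
        headers={"PRIVATE-TOKEN": token},
    )
    send(req)


def running_job_count(base_url, runner_id, token, send=_send) -> int:
    """Number of running jobs on the first page of the runner's job list."""
    req = urllib.request.Request(
        f"{base_url}/api/v4/runners/{runner_id}/jobs?status=running",
        headers={"PRIVATE-TOKEN": token},
    )
    return len(json.loads(send(req)))


def pause_and_wait(
    base_url,
    runner_id,
    token,
    poll_s: float = 15.0,
    send=_send,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Pause the runner, then block until it reports no running jobs."""
    set_runner_paused(base_url, runner_id, token, True, send)
    while running_job_count(base_url, runner_id, token, send) > 0:
        sleep(poll_s)
```

After `pause_and_wait(...)` returns, run terraform apply and then call `set_runner_paused(..., paused=False)` to resume scheduling.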

Environment

  • Module version: 9.5.0
  • Executor: docker-autoscaler
  • upgrade_strategy: rolling (default)
