Support graceful agent draining during stack updates to reduce rollover time #1744

@jasonwbarnett

Is your feature request related to a problem? Please describe.

When updating a deployed stack (e.g., changing the AMI or any parameter that triggers an ASG replacement), the old ASG instances can take a very long time to terminate because they're actively running Buildkite jobs. The ASG can't scale down until those jobs complete, which creates a long, unpredictable rollover window.

This is particularly painful for teams managing custom AMIs, where stack updates are a regular occurrence: the update cannot complete until long-running jobs on the old ASG finish, and the only way to speed up the transition is to kill those jobs.

Describe the solution you'd like

Before terminating old ASG instances, the stack should proactively tell the Buildkite control plane to stop dispatching new jobs to the agents on those instances. Running jobs would then drain naturally while no new work lands on soon-to-be-terminated instances, significantly reducing the rollover window.

Ideally this would be built into the stack's update lifecycle so it happens automatically during any ASG replacement.

Possible approaches

  • ASG lifecycle hook (Terminating:Wait): Hook into the termination lifecycle, call the Buildkite API to stop the agent from accepting new jobs, wait for running jobs to complete, then signal the lifecycle action to proceed
  • Integration with buildkite-agent graceful shutdown: Send SIGTERM or use disconnect-after-job during the termination lifecycle to let the agent finish its current job and exit cleanly
  • Pre-update queue pause: A step that uses the Buildkite API to pause agents in the target queue before initiating the stack update, then unpause after the new ASG is healthy
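
To make the lifecycle-hook approach concrete, here is a minimal Python sketch of the drain loop such a hook could run on each terminating instance. It is not how the stack works today. The function name and all four callables are hypothetical placeholders: in a real hook they would presumably wrap a Buildkite API call (or a SIGTERM to the agent) and the boto3 Auto Scaling client's `record_lifecycle_action_heartbeat` and `complete_lifecycle_action` methods.

```python
import time
from typing import Callable


def drain_before_termination(
    stop_accepting_jobs: Callable[[], None],        # hypothetical: tell Buildkite to stop dispatching
    has_running_job: Callable[[], bool],            # hypothetical: is the agent still busy?
    record_heartbeat: Callable[[], None],           # hypothetical: extend the Terminating:Wait timeout
    complete_lifecycle_action: Callable[[], None],  # hypothetical: let the ASG terminate the instance
    poll_seconds: float = 30.0,
    max_wait_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Drain one instance. Returns True if the running job finished before the deadline."""
    # Step 1: stop new work from landing on this instance.
    stop_accepting_jobs()

    # Step 2: wait for the current job, keeping the lifecycle hook alive.
    waited = 0.0
    while has_running_job() and waited < max_wait_seconds:
        record_heartbeat()
        sleep(poll_seconds)
        waited += poll_seconds

    drained = not has_running_job()

    # Step 3: release the instance whether or not the job finished,
    # so a stuck job cannot block the rollover forever.
    complete_lifecycle_action()
    return drained
```

Capping the wait with `max_wait_seconds` is a design choice in this sketch: it bounds the rollover window at the cost of possibly interrupting a very long job.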

Describe alternatives you've considered

  • Manually waiting for a quiet period before deploying stack updates (doesn't scale, error-prone)
  • Accepting the long rollover time (frustrating, blocks infrastructure changes)
  • Running a separate script to drain agents via the Buildkite API before triggering the update (works but is manual and not integrated into the stack lifecycle)
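
For reference, the separate drain script mentioned in the last item boils down to something like the Python sketch below. It assumes the Buildkite REST API's agent "stop" endpoint with `force` set to false (which, as I understand it, lets the agent finish its current job before disconnecting) and assumes agents expose their queue as a `queue=<name>` entry in `meta_data`. The helper names are made up for illustration, and listing agents and sending the requests are left out.

```python
import json
import urllib.request

API_ROOT = "https://api.buildkite.com/v2"


def agents_in_queue(agents: list[dict], queue: str) -> list[dict]:
    """Filter agent records (as returned by the list-agents endpoint) by queue tag.

    Assumption: the queue appears as "queue=<name>" in each agent's meta_data.
    Agents without an explicit queue tag are not matched by this sketch.
    """
    tag = f"queue={queue}"
    return [a for a in agents if tag in (a.get("meta_data") or [])]


def build_stop_request(org_slug: str, agent_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a graceful stop request for one agent."""
    return urllib.request.Request(
        url=f"{API_ROOT}/organizations/{org_slug}/agents/{agent_id}/stop",
        # force=False is assumed to mean "finish the current job, then stop".
        data=json.dumps({"force": False}).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Each request would then be sent with `urllib.request.urlopen(request)` before triggering the stack update, which is the manual step this feature request asks to fold into the stack lifecycle.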
