Is your feature request related to a problem? Please describe.
When updating a deployed stack (e.g., changing the AMI or any parameter that triggers an ASG replacement), the old ASG instances can take a very long time to terminate because they're actively running Buildkite jobs. The ASG can't scale down until those jobs complete, which creates a long, unpredictable rollover window.
This is particularly painful for teams managing custom AMIs, where stack updates are a regular occurrence: you end up waiting for long-running jobs to finish on the old ASG before the update completes, with no way to speed up the transition short of killing jobs.
Describe the solution you'd like
Before terminating old ASG instances, the stack should proactively signal agents to stop accepting new jobs in the Buildkite control plane. This would let currently running jobs drain naturally without new work being dispatched to soon-to-be-terminated instances, significantly reducing the rollover window.
Ideally this would be built into the stack's update lifecycle so it happens automatically during any ASG replacement.
Possible approaches
- ASG lifecycle hook (Terminating:Wait): Hook into the termination lifecycle, call the Buildkite API to stop the agent from accepting new jobs, wait for running jobs to complete, then signal the lifecycle action to proceed
- Integration with buildkite-agent graceful shutdown: Send SIGTERM or use disconnect-after-job during the termination lifecycle to let the agent finish its current job and exit cleanly
- Pre-update queue pause: A step that uses the Buildkite API to pause agents in the target queue before initiating the stack update, then unpause after the new ASG is healthy
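To make the lifecycle hook approach more concrete, here is a rough sketch of the drain loop it implies. It is only an illustration under assumptions, not a proposed implementation: `drain_before_terminate` and its callable parameters are hypothetical names, and how the agent is actually told to stop (API call, SIGTERM, or something else) is left to the `stop_agent` callable. The boto3 calls named in the comments are the standard Auto Scaling lifecycle calls, wired up with placeholder names.

```python
import time
from typing import Callable


def drain_before_terminate(
    stop_agent: Callable[[], None],
    job_running: Callable[[], bool],
    heartbeat: Callable[[], None],
    complete: Callable[[str], None],
    poll_seconds: float = 30.0,
    max_wait_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Drain one instance while it sits in Terminating:Wait.

    Returns True if the running job finished, False if the wait timed out.
    All callables are hypothetical hooks supplied by the caller.
    """
    stop_agent()  # ask the agent to accept no new jobs and finish its current one
    waited = 0.0
    busy = job_running()
    while busy and waited < max_wait_seconds:
        heartbeat()  # keep the lifecycle hook from timing out while the job runs
        sleep(poll_seconds)
        waited += poll_seconds
        busy = job_running()
    # For a terminating instance, CONTINUE lets termination (and any other
    # lifecycle hooks) proceed; the instance is terminated either way.
    complete("CONTINUE")
    return not busy


# Possible wiring with boto3 (hook, group and instance_id are placeholders):
#   asg = boto3.client("autoscaling")
#   heartbeat = lambda: asg.record_lifecycle_action_heartbeat(
#       LifecycleHookName=hook, AutoScalingGroupName=group, InstanceId=instance_id)
#   complete = lambda result: asg.complete_lifecycle_action(
#       LifecycleHookName=hook, AutoScalingGroupName=group,
#       InstanceId=instance_id, LifecycleActionResult=result)
```

Passing the side effects in as callables keeps the loop independent of where it runs (on the instance itself, or in a Lambda triggered by the lifecycle event), which is one of the open design questions here.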
Describe alternatives you've considered
- Manually waiting for a quiet period before deploying stack updates (doesn't scale, error-prone)
- Accepting the long rollover time (frustrating, blocks infrastructure changes)
- Running a separate script to drain agents via the Buildkite API before triggering the update (works but is manual and not integrated into the stack lifecycle)
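For context, a drain script of the kind described in the last bullet might look roughly like the sketch below. This is not the script actually in use. It assumes the Buildkite REST API's agent records expose tags as `meta_data` strings such as `queue=builders`, that agents with no queue tag belong to the `default` queue, and that the agent stop endpoint accepts `force: false` for a graceful stop; the function names are made up for illustration.

```python
import json
import urllib.request

API = "https://api.buildkite.com/v2"


def agents_in_queue(agents: list[dict], queue: str) -> list[dict]:
    """Filter agent records (as returned by a list-agents call) by queue tag."""
    selected = []
    for agent in agents:
        tags = agent.get("meta_data") or []
        queues = [t.split("=", 1)[1] for t in tags if t.startswith("queue=")]
        # Assumption: an agent with no queue tag belongs to the "default" queue.
        if queue in (queues or ["default"]):
            selected.append(agent)
    return selected


def graceful_stop_request(org: str, agent_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a stop request that lets the current job finish."""
    return urllib.request.Request(
        f"{API}/organizations/{org}/agents/{agent_id}/stop",
        data=json.dumps({"force": False}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )
```

Each request would then be sent with `urllib.request.urlopen(req)` for every agent returned by `agents_in_queue`, before triggering the stack update. Having the stack do the equivalent automatically is what this request is asking for.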
Related issues