Support graceful agent draining during stack updates to reduce rollover time

**Is your feature request related to a problem? Please describe.**

When updating a deployed stack (e.g., changing the AMI or any parameter that triggers an ASG replacement), the old ASG instances can take a very long time to terminate because they're actively running Buildkite jobs. The ASG can't scale down until those jobs complete, which creates a long, unpredictable rollover window.

This is particularly painful for teams managing custom AMIs, where stack updates are a regular occurrence: you wait for long-running jobs on the old ASG to finish before the update completes, and the only way to speed up the transition is to kill jobs.

**Describe the solution you'd like**

Before terminating old ASG instances, the stack should proactively tell the Buildkite control plane to stop dispatching new jobs to the agents on those instances. Running jobs would then drain naturally while no new work lands on soon-to-be-terminated instances, significantly reducing the rollover window.

Ideally this would be built into the stack's update lifecycle so it happens automatically during any ASG replacement.

**Possible approaches**

- **ASG lifecycle hook (Terminating:Wait)**: Hook into the termination lifecycle, call the Buildkite API to stop the agent from accepting new jobs, wait for running jobs to complete, then signal the lifecycle action to proceed
- **Integration with `buildkite-agent` graceful shutdown**: Send `SIGTERM` to the agent, or rely on the `disconnect-after-job` agent option, during the termination lifecycle so the agent finishes its current job and exits cleanly
- **Pre-update queue pause**: A step that uses the Buildkite API to pause agents in the target queue before initiating the stack update, then unpause after the new ASG is healthy
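
To make the lifecycle hook approach concrete, here is a minimal Python sketch of the wait loop, under some assumptions. `asg_client` is assumed to be a boto3 Auto Scaling client (only `record_lifecycle_action_heartbeat` and `complete_lifecycle_action` are called on it). `stop_accepting_jobs` and `job_running` are hypothetical callbacks standing in for whichever mechanism is chosen (Buildkite API call, `SIGTERM`, and so on), and the timing defaults are placeholders. This is not how the stack works today.

```python
import time
from typing import Callable


def drain_then_continue(
    asg_client,
    hook_name: str,
    asg_name: str,
    instance_id: str,
    stop_accepting_jobs: Callable[[], None],  # hypothetical: tells the agent to take no new work
    job_running: Callable[[], bool],          # hypothetical: True while a job is still in progress
    poll_seconds: float = 30.0,
    max_wait_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Drain one instance held in Terminating:Wait, then release it.

    Returns True if the agent went idle before the deadline, False on timeout.
    """
    stop_accepting_jobs()
    waited = 0.0
    drained = not job_running()
    while not drained and waited < max_wait_seconds:
        # Heartbeat so the hook does not time out while the job finishes.
        asg_client.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=asg_name,
            InstanceId=instance_id,
        )
        sleep(poll_seconds)
        waited += poll_seconds
        drained = not job_running()
    # Release the instance so the ASG can finish terminating it.
    asg_client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
    return drained
```

On timeout the sketch still completes the lifecycle action, so a stuck job cannot block the rollover indefinitely. Whether that trade-off is acceptable would probably need to be configurable.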

**Describe alternatives you've considered**

- Manually waiting for a quiet period before deploying stack updates (doesn't scale, error-prone)
- Accepting the long rollover time (frustrating, blocks infrastructure changes)
- Running a separate script to drain agents via the Buildkite API before triggering the update (works but is manual and not integrated into the stack lifecycle)
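
For reference, the separate drain script from the last bullet looks roughly like the Python sketch below. The endpoint paths, the `meta_data` field holding `queue=<name>` tags, and the `{"force": false}` body are my reading of the Buildkite REST API and should be checked against the current docs. The organization slug, queue name, and token are placeholders, and pagination of the agent listing is ignored for brevity.

```python
import json
import urllib.request

API = "https://api.buildkite.com/v2"  # assumed REST API base URL


def agents_in_queue(agents: list[dict], queue: str) -> list[dict]:
    """Filter an agent listing to those tagged with queue=<queue>."""
    tag = f"queue={queue}"
    return [a for a in agents if tag in (a.get("meta_data") or [])]


def stop_request(org: str, agent_id: str, token: str) -> urllib.request.Request:
    """Build a graceful stop request; force=false lets the current job finish."""
    return urllib.request.Request(
        url=f"{API}/organizations/{org}/agents/{agent_id}/stop",
        data=json.dumps({"force": False}).encode(),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def drain_queue(org: str, queue: str, token: str,
                send=urllib.request.urlopen) -> list[str]:
    """Gracefully stop every agent in a queue; returns the stopped agent IDs."""
    listing = urllib.request.Request(
        f"{API}/organizations/{org}/agents",  # first page only in this sketch
        headers={"Authorization": f"Bearer {token}"},
    )
    with send(listing) as resp:
        agents = json.load(resp)
    stopped = []
    for agent in agents_in_queue(agents, queue):
        with send(stop_request(org, agent["id"], token)):
            pass
        stopped.append(agent["id"])
    return stopped
```

Running something like `drain_queue("my-org", "default", token)` just before triggering the update is the manual step this request would like the stack to take over.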

**Related issues**

- #1010 — Zero downtime deployments for stack updates (broader scope, same underlying pain)
- #972 — Instance cordoning (similar concept of removing instances from dispatch without terminating)
- #764 — ASG replacement behavior during updates

Support graceful agent draining during stack updates to reduce rollover time #1744
