fix(EN-4232): batch TerminateInstances to reduce EC2 bucket-1 quota pressure by karouf · Pull Request #4 · doctolib-lab/terraform-aws-github-runner

Renaud Martinet (karouf) · 2026-06-19T16:29:55Z

Description

EC2 API bucket-1 (Fleet mutations: CreateFleet + RunInstances + TerminateInstances + DeleteNetworkInterfaces) in cicd-prod is at p99=552 calls/min against a 540/min sustained limit. TerminateInstances accounts for ~40% of bucket-1 and was being called once per runner in scale-down. This change collapses N per-runner calls into 1 batched EC2 call per lambda invocation.

Changes:

aws/runners.ts: add terminateRunners(ids[]) — chunks into batches of 100 (AWS max 1000/call). TerminateInstances is all-or-nothing, so on a batch error it retries each id individually; the bad id surfaces at size 1 and is logged while every good id still terminates. Replaces the old per-runner terminateRunner, which is removed.
scale-runners/scale-down.ts: deregisterRunner (formerly removeRunner) returns the de-registered instanceId (or undefined) instead of calling EC2 directly. evaluateAndRemoveRunners and terminateOrphan collect IDs across their loops, then flush with a single terminateRunners(toTerminate) call. Since termination is deferred to end-of-loop, per-iteration failures are isolated so one bad runner can't abandon the whole batch: the orphan loop body is wrapped in try/catch, and getOrCreateOctokit was moved inside deregisterRunner's try (honoring its never-throws / returns-undefined contract).
scale-runners/scale-up.ts: failed-creation cleanup migrated from terminateRunner(id) to terminateRunners([id]).
Tests: mock assertions use flatMap over all call arrays to be call-order-agnostic; checkTerminated asserts exact-set equality of terminated ids (catches duplicate/spurious calls); new terminateRunners describe covers empty list, single batch, >100 chunking, and per-id retry on error.
scripts/check-terminate-quota.sh: before/after metric snapshot script using pup (temporary; removed in a follow-up).

Expected outcome: TerminateInstances drops from ~N calls/invocation to 2 calls/invocation (one for orphan cleanup, one for idle scale-down). At burst scale-down this cuts dozens of calls to 2.

Context

Jira ticket: EN-4232

Rollout:

Merge → upload runners.zip to s3://doctolib-gh-runners-lambdas-staging/github-runner/runners.zip
terraform apply staging 04_github_runner
Run doctolib/ci-sandbox load-generator workflow (runner-scale-load.yml, separate PR) to verify batch termination in logs
Run scripts/check-terminate-quota.sh --from=2h before vs after to confirm call reduction
Promote: fast-forward cicd-prod → cicd-staging, rebuild zip, commit to terraform-infra prod lambdas

…ressure EC2 bucket-1 (Fleet mutations) was hitting p99=552 calls/min vs 540 limit. TerminateInstances was called once per runner; now all runners in a scale-down invocation are terminated in a single batched EC2 call (up to 100 per batch). - Add `terminateRunners(ids[])` with chunked batching (TERMINATE_BATCH_SIZE=100) and recursive bisect on error to isolate bad IDs while terminating all good ones - Keep `terminateRunner(id)` as back-compat wrapper (scale-up cleanup unchanged) - `removeRunner` returns instanceId instead of calling terminate directly - `evaluateAndRemoveRunners` + `terminateOrphan` collect IDs then call once - Add `scripts/check-terminate-quota.sh` for before/after metric snapshots Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ch error Recursive bisection was O(log n) extra EC2 calls and required a 6-line comment to explain. A simple loop that retries each id individually on batch failure is equivalent in behaviour (bad id isolated at size 1, all good ids still terminate) and obvious at a glance. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Single-id callers have been migrated to terminateRunners; the thin wrapper and its dedicated test are no longer needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…unners The back-compat wrapper was removed in the previous commit but the call site in createRunners() was missed. Wraps the single instanceId in an array to match the new batch signature. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Deferred batching means a throw inside the loop abandons the entire toTerminate array. Wrap each loop iteration in try/catch so a failing lastChanceCheckOrphanRunner call is logged and skipped, and the batch flush always runs for the runners that were successfully evaluated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

getOrCreateOctokit was called before the try block, so an auth failure escaped removeRunner entirely instead of returning undefined. This broke removeRunner's documented contract (returns undefined on any failure) and caused evaluateAndRemoveRunners to abort its loop, abandoning the entire toTerminate batch for subsequent runners. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The migration from terminateRunner to terminateRunners weakened checkTerminated from toHaveBeenCalledTimes(n) + per-id check to per-id toContain — duplicate or spurious termination calls would pass undetected. Restore the equivalent guarantee with sorted-array equality. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Renaud Martinet (karouf) · 2026-06-23T13:01:21Z

Will be removed after testing in staging

The function only de-registers the GitHub runner; EC2 termination is now batched by the caller. The old name implied full removal (de-register + EC2 terminate), which is no longer accurate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

deregisterRunner only de-registers; the queueing (push to toTerminate) happens in evaluateAndRemoveRunners. Log each step where it actually occurs: de-registration stays in deregisterRunner, queueing log moves to the push site. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…te-quota.sh Printed 'Window: --from=2h' instead of 'Window: 2h'. Strip the flag prefix with bash parameter expansion when echoing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This PR reduces EC2 API bucket-1 quota pressure by batching TerminateInstances calls, turning per-runner termination during scale-down into batched termination per lambda invocation while keeping GitHub de-registration behavior intact.

Changes:

Add terminateRunners(instanceIds) in the AWS runners module with chunking and per-ID fallback on batch failure.
Refactor scale-down to de-register runners first, collect instance IDs, then perform batched termination for both idle scale-down and orphan cleanup.
Update/add tests to be order-agnostic across batched calls and validate exact terminated-id sets; add a temporary quota snapshot script.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
scripts/check-terminate-quota.sh	Adds a CLI helper to snapshot EC2 bucket-1 usage before/after rollout.
lambdas/functions/control-plane/src/scale-runners/scale-up.ts	Migrates failed-creation cleanup from single terminate to batched API wrapper.
lambdas/functions/control-plane/src/scale-runners/scale-down.ts	Defers EC2 termination and batches instance termination for idle/orphan cleanup.
lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts	Updates assertions to work with batched termination calls (and set-based checks).
lambdas/functions/control-plane/src/aws/runners.ts	Introduces `terminateRunners` with chunking and per-id retry on batch error.
lambdas/functions/control-plane/src/aws/runners.test.ts	Adds/updates tests covering empty list, chunking, and per-id retry behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ions expected checkTerminated returned early when idsToTerminate was empty, so tests expecting zero terminations never verified terminateRunners stayed uncalled. Remove the guard: empty expected set now asserts allCalledIds == [], catching any spurious termination call. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ting 'Terminating all non busy runners.' logged before deregisterRunner, before any EC2 call. Reword to 'Runner X exceeds idle limit, de-registering.' so the log matches what happens at that line. EC2 termination is logged later when the deferred batch is flushed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

'Terminating orphan runner X' logged when the code only pushes to toTerminate. Actual EC2 termination happens after the loop. Reword to 'Queuing orphan runner X for termination.' to match what the line actually does. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Renaud Martinet (karouf) changed the title ~~fix(scale-down): batch TerminateInstances to reduce EC2 bucket-1 quota pressure~~ fix(EN-4232): batch TerminateInstances to reduce EC2 bucket-1 quota pressure Jun 22, 2026

Renaud Martinet (karouf) marked this pull request as draft June 22, 2026 15:52

Renaud Martinet (karouf) and others added 6 commits June 23, 2026 11:21

refactor: remove terminateRunner back-compat wrapper

16dc1f9

Single-id callers have been migrated to terminateRunners; the thin wrapper and its dedicated test are no longer needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Renaud Martinet (karouf) commented Jun 23, 2026

View reviewed changes

Comment thread scripts/check-terminate-quota.sh

Renaud Martinet (karouf) Jun 23, 2026

Copy link
Copy Markdown

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be removed after testing in staging

Renaud Martinet (karouf) and others added 3 commits June 23, 2026 15:07

fix(scripts): strip --from= prefix from Window label in check-termina…

3cb32a0

…te-quota.sh Printed 'Window: --from=2h' instead of 'Window: 2h'. Strip the flag prefix with bash parameter expansion when echoing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Renaud Martinet (karouf) marked this pull request as ready for review June 23, 2026 13:21

Renaud Martinet (karouf) requested a review from Copilot June 23, 2026 13:21

Renaud Martinet (karouf) marked this pull request as draft June 23, 2026 13:21

Copilot started reviewing on behalf of Renaud Martinet (karouf) June 23, 2026 13:21 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts

Comment thread lambdas/functions/control-plane/src/scale-runners/scale-down.ts Outdated

Comment thread lambdas/functions/control-plane/src/scale-runners/scale-down.ts Outdated

Renaud Martinet (karouf) and others added 3 commits June 23, 2026 15:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(EN-4232): batch TerminateInstances to reduce EC2 bucket-1 quota pressure#4

fix(EN-4232): batch TerminateInstances to reduce EC2 bucket-1 quota pressure#4
Renaud Martinet (karouf) wants to merge 13 commits into
cicd-stagingfrom
fix/batch-terminate-instances

Renaud Martinet (karouf) commented Jun 19, 2026 •

edited

Loading

Uh oh!

Renaud Martinet (karouf) Jun 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Renaud Martinet (karouf) commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Context

Uh oh!

Renaud Martinet (karouf) Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Renaud Martinet (karouf) commented Jun 19, 2026 •

edited

Loading