fix(EN-4232): batch TerminateInstances to reduce EC2 bucket-1 quota pressure#4
Draft
Renaud Martinet (karouf) wants to merge 13 commits into
Draft
fix(EN-4232): batch TerminateInstances to reduce EC2 bucket-1 quota pressure#4Renaud Martinet (karouf) wants to merge 13 commits into
Renaud Martinet (karouf) wants to merge 13 commits into
Conversation
…ressure EC2 bucket-1 (Fleet mutations) was hitting p99=552 calls/min vs 540 limit. TerminateInstances was called once per runner; now all runners in a scale-down invocation are terminated in a single batched EC2 call (up to 100 per batch). - Add `terminateRunners(ids[])` with chunked batching (TERMINATE_BATCH_SIZE=100) and recursive bisect on error to isolate bad IDs while terminating all good ones - Keep `terminateRunner(id)` as back-compat wrapper (scale-up cleanup unchanged) - `removeRunner` returns instanceId instead of calling terminate directly - `evaluateAndRemoveRunners` + `terminateOrphan` collect IDs then call once - Add `scripts/check-terminate-quota.sh` for before/after metric snapshots Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ch error Recursive bisection was O(log n) extra EC2 calls and required a 6-line comment to explain. A simple loop that retries each id individually on batch failure is equivalent in behaviour (bad id isolated at size 1, all good ids still terminate) and obvious at a glance. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Single-id callers have been migrated to terminateRunners; the thin wrapper and its dedicated test are no longer needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unners The back-compat wrapper was removed in the previous commit but the call site in createRunners() was missed. Wraps the single instanceId in an array to match the new batch signature. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deferred batching means a throw inside the loop abandons the entire toTerminate array. Wrap each loop iteration in try/catch so a failing lastChanceCheckOrphanRunner call is logged and skipped, and the batch flush always runs for the runners that were successfully evaluated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
getOrCreateOctokit was called before the try block, so an auth failure escaped removeRunner entirely instead of returning undefined. This broke removeRunner's documented contract (returns undefined on any failure) and caused evaluateAndRemoveRunners to abort its loop, abandoning the entire toTerminate batch for subsequent runners. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The migration from terminateRunner to terminateRunners weakened checkTerminated from toHaveBeenCalledTimes(n) + per-id check to per-id toContain — duplicate or spurious termination calls would pass undetected. Restore the equivalent guarantee with sorted-array equality. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Will be removed after testing in staging
The function only de-registers the GitHub runner; EC2 termination is now batched by the caller. The old name implied full removal (de-register + EC2 terminate), which is no longer accurate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deregisterRunner only de-registers; the queueing (push to toTerminate) happens in evaluateAndRemoveRunners. Log each step where it actually occurs: de-registration stays in deregisterRunner, queueing log moves to the push site. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…te-quota.sh Printed 'Window: --from=2h' instead of 'Window: 2h'. Strip the flag prefix with bash parameter expansion when echoing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR reduces EC2 API bucket-1 quota pressure by batching TerminateInstances calls, turning per-runner termination during scale-down into batched termination per lambda invocation while keeping GitHub de-registration behavior intact.
Changes:
- Add
terminateRunners(instanceIds)in the AWS runners module with chunking and per-ID fallback on batch failure. - Refactor scale-down to de-register runners first, collect instance IDs, then perform batched termination for both idle scale-down and orphan cleanup.
- Update/add tests to be order-agnostic across batched calls and validate exact terminated-id sets; add a temporary quota snapshot script.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/check-terminate-quota.sh | Adds a CLI helper to snapshot EC2 bucket-1 usage before/after rollout. |
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Migrates failed-creation cleanup from single terminate to batched API wrapper. |
| lambdas/functions/control-plane/src/scale-runners/scale-down.ts | Defers EC2 termination and batches instance termination for idle/orphan cleanup. |
| lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts | Updates assertions to work with batched termination calls (and set-based checks). |
| lambdas/functions/control-plane/src/aws/runners.ts | Introduces terminateRunners with chunking and per-id retry on batch error. |
| lambdas/functions/control-plane/src/aws/runners.test.ts | Adds/updates tests covering empty list, chunking, and per-id retry behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ions expected checkTerminated returned early when idsToTerminate was empty, so tests expecting zero terminations never verified terminateRunners stayed uncalled. Remove the guard: empty expected set now asserts allCalledIds == [], catching any spurious termination call. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ting 'Terminating all non busy runners.' logged before deregisterRunner, before any EC2 call. Reword to 'Runner X exceeds idle limit, de-registering.' so the log matches what happens at that line. EC2 termination is logged later when the deferred batch is flushed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
'Terminating orphan runner X' logged when the code only pushes to toTerminate. Actual EC2 termination happens after the loop. Reword to 'Queuing orphan runner X for termination.' to match what the line actually does. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
EC2 API bucket-1 (Fleet mutations: CreateFleet + RunInstances + TerminateInstances + DeleteNetworkInterfaces) in cicd-prod is at p99=552 calls/min against a 540/min sustained limit. TerminateInstances accounts for ~40% of bucket-1 and was being called once per runner in scale-down. This change collapses N per-runner calls into 1 batched EC2 call per lambda invocation.
Changes:
aws/runners.ts: addterminateRunners(ids[])— chunks into batches of 100 (AWS max 1000/call).TerminateInstancesis all-or-nothing, so on a batch error it retries each id individually; the bad id surfaces at size 1 and is logged while every good id still terminates. Replaces the old per-runnerterminateRunner, which is removed.scale-runners/scale-down.ts:deregisterRunner(formerlyremoveRunner) returns the de-registeredinstanceId(orundefined) instead of calling EC2 directly.evaluateAndRemoveRunnersandterminateOrphancollect IDs across their loops, then flush with a singleterminateRunners(toTerminate)call. Since termination is deferred to end-of-loop, per-iteration failures are isolated so one bad runner can't abandon the whole batch: the orphan loop body is wrapped in try/catch, andgetOrCreateOctokitwas moved insidederegisterRunner's try (honoring its never-throws / returns-undefinedcontract).scale-runners/scale-up.ts: failed-creation cleanup migrated fromterminateRunner(id)toterminateRunners([id]).flatMapover all call arrays to be call-order-agnostic;checkTerminatedasserts exact-set equality of terminated ids (catches duplicate/spurious calls); newterminateRunnersdescribe covers empty list, single batch, >100 chunking, and per-id retry on error.scripts/check-terminate-quota.sh: before/after metric snapshot script usingpup(temporary; removed in a follow-up).Expected outcome: TerminateInstances drops from ~N calls/invocation to 2 calls/invocation (one for orphan cleanup, one for idle scale-down). At burst scale-down this cuts dozens of calls to 2.
Context
Jira ticket: EN-4232
Rollout:
runners.ziptos3://doctolib-gh-runners-lambdas-staging/github-runner/runners.zipterraform applystaging04_github_runnerdoctolib/ci-sandboxload-generator workflow (runner-scale-load.yml, separate PR) to verify batch termination in logsscripts/check-terminate-quota.sh --from=2hbefore vs after to confirm call reductioncicd-prod→cicd-staging, rebuild zip, commit to terraform-infra prod lambdas