Skip to content

fix(EN-4232): batch TerminateInstances to reduce EC2 bucket-1 quota pressure#4

Draft
Renaud Martinet (karouf) wants to merge 13 commits into
cicd-stagingfrom
fix/batch-terminate-instances
Draft

fix(EN-4232): batch TerminateInstances to reduce EC2 bucket-1 quota pressure#4
Renaud Martinet (karouf) wants to merge 13 commits into
cicd-stagingfrom
fix/batch-terminate-instances

Conversation

@karouf

@karouf Renaud Martinet (karouf) commented Jun 19, 2026

Copy link
Copy Markdown

Description

EC2 API bucket-1 (Fleet mutations: CreateFleet + RunInstances + TerminateInstances + DeleteNetworkInterfaces) in cicd-prod is at p99=552 calls/min against a 540/min sustained limit. TerminateInstances accounts for ~40% of bucket-1 and was being called once per runner in scale-down. This change collapses N per-runner calls into 1 batched EC2 call per lambda invocation.

Changes:

  • aws/runners.ts: add terminateRunners(ids[]) — chunks into batches of 100 (AWS max 1000/call). TerminateInstances is all-or-nothing, so on a batch error it retries each id individually; the bad id surfaces at size 1 and is logged while every good id still terminates. Replaces the old per-runner terminateRunner, which is removed.
  • scale-runners/scale-down.ts: deregisterRunner (formerly removeRunner) returns the de-registered instanceId (or undefined) instead of calling EC2 directly. evaluateAndRemoveRunners and terminateOrphan collect IDs across their loops, then flush with a single terminateRunners(toTerminate) call. Since termination is deferred to end-of-loop, per-iteration failures are isolated so one bad runner can't abandon the whole batch: the orphan loop body is wrapped in try/catch, and getOrCreateOctokit was moved inside deregisterRunner's try (honoring its never-throws / returns-undefined contract).
  • scale-runners/scale-up.ts: failed-creation cleanup migrated from terminateRunner(id) to terminateRunners([id]).
  • Tests: mock assertions use flatMap over all call arrays to be call-order-agnostic; checkTerminated asserts exact-set equality of terminated ids (catches duplicate/spurious calls); new terminateRunners describe covers empty list, single batch, >100 chunking, and per-id retry on error.
  • scripts/check-terminate-quota.sh: before/after metric snapshot script using pup (temporary; removed in a follow-up).

Expected outcome: TerminateInstances drops from ~N calls/invocation to 2 calls/invocation (one for orphan cleanup, one for idle scale-down). At burst scale-down this cuts dozens of calls to 2.

Context

Jira ticket: EN-4232

Rollout:

  1. Merge → upload runners.zip to s3://doctolib-gh-runners-lambdas-staging/github-runner/runners.zip
  2. terraform apply staging 04_github_runner
  3. Run doctolib/ci-sandbox load-generator workflow (runner-scale-load.yml, separate PR) to verify batch termination in logs
  4. Run scripts/check-terminate-quota.sh --from=2h before vs after to confirm call reduction
  5. Promote: fast-forward cicd-prodcicd-staging, rebuild zip, commit to terraform-infra prod lambdas

…ressure

EC2 bucket-1 (Fleet mutations) was hitting p99=552 calls/min vs 540 limit.
TerminateInstances was called once per runner; now all runners in a scale-down
invocation are terminated in a single batched EC2 call (up to 100 per batch).

- Add `terminateRunners(ids[])` with chunked batching (TERMINATE_BATCH_SIZE=100)
  and recursive bisect on error to isolate bad IDs while terminating all good ones
- Keep `terminateRunner(id)` as back-compat wrapper (scale-up cleanup unchanged)
- `removeRunner` returns instanceId instead of calling terminate directly
- `evaluateAndRemoveRunners` + `terminateOrphan` collect IDs then call once
- Add `scripts/check-terminate-quota.sh` for before/after metric snapshots

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@karouf Renaud Martinet (karouf) changed the title fix(scale-down): batch TerminateInstances to reduce EC2 bucket-1 quota pressure fix(EN-4232): batch TerminateInstances to reduce EC2 bucket-1 quota pressure Jun 22, 2026
@karouf Renaud Martinet (karouf) marked this pull request as draft June 22, 2026 15:52
Renaud Martinet (karouf) and others added 6 commits June 23, 2026 11:21
…ch error

Recursive bisection was O(log n) extra EC2 calls and required a 6-line
comment to explain. A simple loop that retries each id individually on
batch failure is equivalent in behaviour (bad id isolated at size 1,
all good ids still terminate) and obvious at a glance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Single-id callers have been migrated to terminateRunners; the thin
wrapper and its dedicated test are no longer needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unners

The back-compat wrapper was removed in the previous commit but the call
site in createRunners() was missed. Wraps the single instanceId in an
array to match the new batch signature.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deferred batching means a throw inside the loop abandons the entire
toTerminate array. Wrap each loop iteration in try/catch so a failing
lastChanceCheckOrphanRunner call is logged and skipped, and the batch
flush always runs for the runners that were successfully evaluated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
getOrCreateOctokit was called before the try block, so an auth failure
escaped removeRunner entirely instead of returning undefined. This broke
removeRunner's documented contract (returns undefined on any failure) and
caused evaluateAndRemoveRunners to abort its loop, abandoning the entire
toTerminate batch for subsequent runners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The migration from terminateRunner to terminateRunners weakened
checkTerminated from toHaveBeenCalledTimes(n) + per-id check to
per-id toContain — duplicate or spurious termination calls would pass
undetected. Restore the equivalent guarantee with sorted-array equality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be removed after testing in staging

Renaud Martinet (karouf) and others added 3 commits June 23, 2026 15:07
The function only de-registers the GitHub runner; EC2 termination is now
batched by the caller. The old name implied full removal (de-register + EC2
terminate), which is no longer accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deregisterRunner only de-registers; the queueing (push to toTerminate)
happens in evaluateAndRemoveRunners. Log each step where it actually occurs:
de-registration stays in deregisterRunner, queueing log moves to the push site.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…te-quota.sh

Printed 'Window: --from=2h' instead of 'Window: 2h'. Strip the flag prefix
with bash parameter expansion when echoing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces EC2 API bucket-1 quota pressure by batching TerminateInstances calls, turning per-runner termination during scale-down into batched termination per lambda invocation while keeping GitHub de-registration behavior intact.

Changes:

  • Add terminateRunners(instanceIds) in the AWS runners module with chunking and per-ID fallback on batch failure.
  • Refactor scale-down to de-register runners first, collect instance IDs, then perform batched termination for both idle scale-down and orphan cleanup.
  • Update/add tests to be order-agnostic across batched calls and validate exact terminated-id sets; add a temporary quota snapshot script.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
scripts/check-terminate-quota.sh Adds a CLI helper to snapshot EC2 bucket-1 usage before/after rollout.
lambdas/functions/control-plane/src/scale-runners/scale-up.ts Migrates failed-creation cleanup from single terminate to batched API wrapper.
lambdas/functions/control-plane/src/scale-runners/scale-down.ts Defers EC2 termination and batches instance termination for idle/orphan cleanup.
lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts Updates assertions to work with batched termination calls (and set-based checks).
lambdas/functions/control-plane/src/aws/runners.ts Introduces terminateRunners with chunking and per-id retry on batch error.
lambdas/functions/control-plane/src/aws/runners.test.ts Adds/updates tests covering empty list, chunking, and per-id retry behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lambdas/functions/control-plane/src/scale-runners/scale-down.ts Outdated
Comment thread lambdas/functions/control-plane/src/scale-runners/scale-down.ts Outdated
Renaud Martinet (karouf) and others added 3 commits June 23, 2026 15:35
…ions expected

checkTerminated returned early when idsToTerminate was empty, so tests
expecting zero terminations never verified terminateRunners stayed uncalled.
Remove the guard: empty expected set now asserts allCalledIds == [], catching
any spurious termination call.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ting

'Terminating all non busy runners.' logged before deregisterRunner, before
any EC2 call. Reword to 'Runner X exceeds idle limit, de-registering.' so
the log matches what happens at that line. EC2 termination is logged later
when the deferred batch is flushed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
'Terminating orphan runner X' logged when the code only pushes to toTerminate.
Actual EC2 termination happens after the loop. Reword to 'Queuing orphan
runner X for termination.' to match what the line actually does.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants