Add retries to calls to the k8s api#338
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a generic retry layer in front of the Kubernetes client APIs (CoreV1Api, BatchV1Api, AuthorizationV1Api) to absorb transient apiserver/network failures, plus idempotency handling on the affected create/delete helpers so a retry that lands after a successful first call doesn't surface as a hard error.
Changes:
- New
withRetryClientProxy that retries on a fixed allow-list of HTTP statuses (408/429/500/502/503/504) and network error codes via thecausechain, with exponential backoff + jitter andRetry-Afterhonored (capped at 30s) for 429. createJobPod,createContainerStepPod,createDockerSecret, andcreateSecretForEnvsnow treat 409 as success (reading back the existing object where applicable);deletePodanddeleteSecretswallow 404.- New unit tests for
isRetryableErrorandretryAfterDelaycovering status codes, cause-chain traversal, andRetry-Afterparsing edge cases.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| packages/k8s/src/k8s/index.ts | Adds retry constants, isRetryableError / retryAfterDelay / describeError helpers, the withRetryClient Proxy, and 409/404 handling on the four create + two delete helpers. |
| packages/k8s/tests/k8s-retry-test.ts | New Jest suite exercising isRetryableError (status codes, network codes, cause chain, edge inputs) and retryAfterDelay (header casing, missing/invalid/zero/negative values, 30s cap). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| // Cap the delay to 30 seconds | ||
| const maxDelaySeconds = 30; |
| ? retryAfterDelay(err, attempt) | ||
| : retryDelay(attempt) | ||
| core.warning( | ||
| `K8s API call ${name} failed (${describeError(err)}), retrying in ${Math.round(delay)}ms (attempt ${attempt + 1}/${MAX_RETRIES})` |
| function withRetryClient<T extends object>(client: T): T { | ||
| const callWithRetry = async ( | ||
| fn: (...args: unknown[]) => unknown, | ||
| name: string, | ||
| args: unknown[] | ||
| ): Promise<unknown> => { | ||
| for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) { | ||
| try { | ||
| return await fn(...args) | ||
| } catch (err) { | ||
| if (!isRetryableError(err) || attempt === MAX_RETRIES) { | ||
| throw err | ||
| } | ||
| const delay = | ||
| err instanceof k8s.ApiException && err.code === 429 | ||
| ? retryAfterDelay(err, attempt) | ||
| : retryDelay(attempt) | ||
| core.warning( | ||
| `K8s API call ${name} failed (${describeError(err)}), retrying in ${Math.round(delay)}ms (attempt ${attempt + 1}/${MAX_RETRIES})` | ||
| ) | ||
| await sleep(delay) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return new Proxy(client, { | ||
| get(target, prop, receiver) { | ||
| const value = Reflect.get(target, prop, receiver) | ||
| if (typeof value !== 'function') { | ||
| return value | ||
| } | ||
| return async (...args: unknown[]) => | ||
| callWithRetry(value.bind(target), String(prop), args) | ||
| } | ||
| }) | ||
| } |
| try { | ||
| await k8sApi.createNamespacedSecret({ | ||
| namespace: namespace(), | ||
| body: secret | ||
| }) | ||
| } catch (err) { | ||
| if (!(err instanceof k8s.ApiException && err.code === 409)) { | ||
| throw err | ||
| } | ||
| } | ||
|
|
||
| return secretName |
There was a problem hiding this comment.
Fixed, the code no longer swallows the 409. replaceNamespacedSecret/patch is not a viable fallback here, because the secret is created with immutable = true (line 824). The code comment at 794–795 calls this out. So read-back-and-compare was the correct of the two options offered. If a mismatch is found an error is thrown.
| import * as k8s from '@kubernetes/client-node' | ||
| import { isRetryableError, retryAfterDelay } from '../src/k8s' |
Summary
Wraps the k8s API clients (
CoreV1Api,BatchV1Api,AuthorizationV1Api) in a retry Proxy that handles transient apiserver failures:ECONNREFUSED,ECONNRESET,ETIMEDOUT,ENETUNREACH,EAI_AGAIN,ENOTFOUND) with exponential backoff + jitter, capped at 4 attempts total. HonorsRetry-Afteron 429 (capped at 30s).createPod,createJob,createDockerSecret, andcreateSecretForEnvshandle 409 by reading back the existing resource;deletePodanddeleteSecretswallow 404. Covers the lost-response-after-success case the retry wrapper can itself produce.Why
Under load we see intermittent
ECONNRESET/ 503 against the kube apiserver duringprepareJob, which currently fail the entire workflow. With this change those fail-fast on transient errors become short retry loops. No change to success-path behavior.Test plan
tests/k8s-retry-test.tsforisRetryableError(status codes, network codes via cause chain, precedence, edge inputs) andretryAfterDelay(header formats, 30s cap, fallbacks).npx jest tests/k8s-retry-test.tspasses.constants-test.tsandk8s-utils-test.tsstill pass.npm run buildclean;npx tsc --noEmitclean.Known limitations
The
createPod409 handler returns the existing pod unconditionally. In ARC, the pod name is derived from the ephemeral runner pod name, so a stale pod with a mismatched spec is unlikely — 409 is almost always the lost-response case. If the operational signal shows otherwise, we'll add an ownership-label check as a follow-up.