Add retries to calls to the k8s api by murphpdx · Pull Request #338 · actions/runner-container-hooks

murphpdx · 2026-04-20T22:32:04Z

Summary

Wraps the k8s API clients (CoreV1Api, BatchV1Api, AuthorizationV1Api) in a retry Proxy that handles transient apiserver failures:

Retries on HTTP 408/429/500/502/503/504 and network errors (ECONNREFUSED, ECONNRESET, ETIMEDOUT, ENETUNREACH, EAI_AGAIN, ENOTFOUND) with exponential backoff + jitter, capped at 4 attempts total. Honors Retry-After on 429 (capped at 30s).
Makes create/delete helpers idempotent: createPod, createJob, createDockerSecret, and createSecretForEnvs handle 409 by reading back the existing resource; deletePod and deleteSecret swallow 404. Covers the lost-response-after-success case the retry wrapper can itself produce.
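
The classification described above can be sketched roughly as follows. This is an illustration only, not the PR's actual isRetryableError: the name isRetryableSketch, the MAX_CAUSE_DEPTH limit, and the precedence rule (the first numeric HTTP status found in the chain decides the outcome) are assumptions.

```typescript
// Hypothetical sketch; the PR's real helper lives in packages/k8s/src/k8s/index.ts.
const RETRYABLE_STATUS = new Set([408, 429, 500, 502, 503, 504])
const RETRYABLE_NETWORK_CODES = new Set([
  'ECONNREFUSED',
  'ECONNRESET',
  'ETIMEDOUT',
  'ENETUNREACH',
  'EAI_AGAIN',
  'ENOTFOUND'
])
const MAX_CAUSE_DEPTH = 5 // assumption: bound the walk so a cyclic cause chain cannot loop forever

function isRetryableSketch(err: unknown): boolean {
  let current: unknown = err
  for (let depth = 0; depth < MAX_CAUSE_DEPTH && current != null; depth++) {
    if (typeof current !== 'object') {
      return false
    }
    const { code, cause } = current as { code?: unknown; cause?: unknown }
    // Assumption: a numeric code is an HTTP status and decides the result immediately.
    if (typeof code === 'number') {
      return RETRYABLE_STATUS.has(code)
    }
    // A string code is treated as a Node.js network error code.
    if (typeof code === 'string' && RETRYABLE_NETWORK_CODES.has(code)) {
      return true
    }
    current = cause
  }
  return false
}
```

Walking the cause chain matters because fetch-style clients often wrap the socket error (for example ECONNRESET) inside a generic outer error.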

Why

Under load we see intermittent ECONNRESET / 503 errors against the kube apiserver during prepareJob, which currently fail the entire workflow. With this change, transient errors that used to fail fast are retried in a short loop instead. No change to success-path behavior.

Test plan

Unit tests: 34 new tests in tests/k8s-retry-test.ts for isRetryableError (status codes, network codes via cause chain, precedence, edge inputs) and retryAfterDelay (header formats, 30s cap, fallbacks). npx jest tests/k8s-retry-test.ts passes.
Existing constants-test.ts and k8s-utils-test.ts still pass.
npm run build clean; npx tsc --noEmit clean.
Cluster test: deploy to a test runner and run workflow.

Known limitations

The createPod 409 handler returns the existing pod unconditionally. In ARC, the pod name is derived from the ephemeral runner pod name, so a stale pod with a mismatched spec is unlikely — 409 is almost always the lost-response case. If the operational signal shows otherwise, we'll add an ownership-label check as a follow-up.
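
One possible shape for the ownership-label follow-up is sketched below. Nothing here is in the PR: the label key OWNER_LABEL, the PodLike type, and both function names are hypothetical, and a real implementation would use the client library's pod type.

```typescript
// Hypothetical minimal pod shape; only the fields this check needs.
interface PodLike {
  metadata?: { name?: string; labels?: Record<string, string> }
}

// Assumption: the hook would stamp pods it creates with a label like this one.
const OWNER_LABEL = 'example.com/created-by-runner'

function isOwnedByRunner(pod: PodLike, runnerName: string): boolean {
  return pod.metadata?.labels?.[OWNER_LABEL] === runnerName
}

// On a 409, reuse the existing pod only if its label matches the current runner.
function reuseExistingPodOrThrow<T extends PodLike>(existing: T, runnerName: string): T {
  if (!isOwnedByRunner(existing, runnerName)) {
    throw new Error(
      `Pod ${existing.metadata?.name ?? '<unnamed>'} already exists but is not owned by ${runnerName}`
    )
  }
  return existing
}
```

With a check like this, a 409 caused by a stale pod would surface as an error instead of being returned as if it were the pod just created.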

Copilot

Pull request overview

Adds a generic retry layer in front of the Kubernetes client APIs (CoreV1Api, BatchV1Api, AuthorizationV1Api) to absorb transient apiserver/network failures, plus idempotency handling on the affected create/delete helpers so a retry that lands after a successful first call doesn't surface as a hard error.

Changes:

New withRetryClient Proxy that retries on a fixed allow-list of HTTP statuses (408/429/500/502/503/504) and network error codes via the cause chain, with exponential backoff + jitter and Retry-After honored (capped at 30s) for 429.
createJobPod, createContainerStepPod, createDockerSecret, and createSecretForEnvs now treat 409 as success (reading back the existing object where applicable); deletePod and deleteSecret swallow 404.
New unit tests for isRetryableError and retryAfterDelay covering status codes, cause-chain traversal, and Retry-After parsing edge cases.
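
For readers unfamiliar with the backoff scheme, here is a minimal sketch of exponential backoff with jitter. The PR does not state its base delay, ceiling, or jitter variant, so BASE_DELAY_MS, MAX_DELAY_MS, the "full jitter" formula, and the name backoffDelayMs are all assumptions.

```typescript
const BASE_DELAY_MS = 500 // assumption: not taken from the PR
const MAX_DELAY_MS = 8000 // assumption: not taken from the PR

// attempt is zero-based; random is injectable so the function can be tested deterministically.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  // Double the window on every attempt, up to a fixed ceiling.
  const windowMs = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, attempt))
  // "Full jitter": pick a point uniformly inside the window so that many
  // clients retrying at once do not hit the apiserver in lockstep.
  return random() * windowMs
}
```

Injecting the random source is a small design choice that keeps tests free of timing flakiness.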

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| packages/k8s/src/k8s/index.ts | Adds retry constants, `isRetryableError` / `retryAfterDelay` / `describeError` helpers, the `withRetryClient` Proxy, and 409/404 handling on the four create + two delete helpers. |
| packages/k8s/tests/k8s-retry-test.ts | New Jest suite exercising `isRetryableError` (status codes, network codes, cause chain, edge inputs) and `retryAfterDelay` (header casing, missing/invalid/zero/negative values, 30s cap). |


+  }
+
+  // Cap the delay to 30 seconds
+  const maxDelaySeconds = 30;
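
The hunk above is only a fragment of retryAfterDelay. A rough, self-contained sketch of Retry-After handling with a 30-second cap might look like the following. The function name retryAfterMs, its signature, and the choice to fall back for zero, negative, or unparseable values are assumptions inferred from the test descriptions, not the PR's code.

```typescript
const MAX_RETRY_AFTER_MS = 30 * 1000 // the 30s cap mentioned in the PR

// headers: response headers with arbitrary key casing.
// fallbackMs: delay to use when the header is missing or unusable.
function retryAfterMs(
  headers: Record<string, string> | undefined,
  fallbackMs: number,
  nowMs: number = Date.now()
): number {
  if (!headers) {
    return fallbackMs
  }
  // Header names are case-insensitive, so look the key up without assuming its casing.
  const key = Object.keys(headers).find(k => k.toLowerCase() === 'retry-after')
  if (key === undefined) {
    return fallbackMs
  }
  const raw = headers[key].trim()
  let delayMs: number
  if (/^[+-]?\d+(\.\d+)?$/.test(raw)) {
    // Numeric form: a number of seconds.
    delayMs = Number(raw) * 1000
  } else {
    // Otherwise try the HTTP-date form.
    const dateMs = Date.parse(raw)
    if (Number.isNaN(dateMs)) {
      return fallbackMs
    }
    delayMs = dateMs - nowMs
  }
  if (delayMs <= 0) {
    return fallbackMs
  }
  return Math.min(delayMs, MAX_RETRY_AFTER_MS)
}
```

The cap keeps a misbehaving or overloaded server from stalling a job for minutes on a single retry.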



+function withRetryClient<T extends object>(client: T): T {
+  const callWithRetry = async (
+    fn: (...args: unknown[]) => unknown,
+    name: string,
+    args: unknown[]
+  ): Promise<unknown> => {
+    for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
+      try {
+        return await fn(...args)
+      } catch (err) {
+        if (!isRetryableError(err) || attempt === MAX_RETRIES) {
+          throw err
+        }
+        const delay =
+          err instanceof k8s.ApiException && err.code === 429
+            ? retryAfterDelay(err, attempt)
+            : retryDelay(attempt)
+        core.warning(
+          `K8s API call ${name} failed (${describeError(err)}), retrying in ${Math.round(delay)}ms (attempt ${attempt + 1}/${MAX_RETRIES})`
+        )
+        await sleep(delay)
+      }
+    }
+  }
+
+  return new Proxy(client, {
+    get(target, prop, receiver) {
+      const value = Reflect.get(target, prop, receiver)
+      if (typeof value !== 'function') {
+        return value
+      }
+      return async (...args: unknown[]) =>
+        callWithRetry(value.bind(target), String(prop), args)
+    }
+  })
+}
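
To see how a Proxy of this shape behaves from the caller's side, here is a simplified stand-alone sketch. FlakyClient, withRetrySketch, and the injected isRetryable predicate are hypothetical stand-ins, and the backoff sleep is left out for brevity.

```typescript
// Simplified variant of the wrapper: same Proxy shape, no sleep between attempts.
function withRetrySketch<T extends object>(
  client: T,
  isRetryable: (err: unknown) => boolean,
  maxRetries = 3
): T {
  return new Proxy(client, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver)
      if (typeof value !== 'function') {
        return value // plain properties pass through untouched
      }
      return async (...args: unknown[]): Promise<unknown> => {
        for (let attempt = 0; ; attempt++) {
          try {
            return await value.apply(target, args)
          } catch (err) {
            if (!isRetryable(err) || attempt >= maxRetries) {
              throw err
            }
          }
        }
      }
    }
  })
}

// Hypothetical client that fails twice with a 503 and then succeeds.
class FlakyClient {
  calls = 0
  basePath = 'https://example.invalid'
  async readPod(name: string): Promise<string> {
    this.calls++
    if (this.calls < 3) {
      throw Object.assign(new Error('service unavailable'), { code: 503 })
    }
    return `pod/${name}`
  }
  describe(): string {
    return 'flaky'
  }
}

const wrapped = withRetrySketch(
  new FlakyClient(),
  err => (err as { code?: unknown } | null)?.code === 503
)
```

Two properties of this pattern are worth keeping in mind: every function-valued property is wrapped, so a synchronous method such as describe() returns a Promise at runtime even though its static type says string, and a fresh wrapper function is created on each property access.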


murphpdx · 2026-06-16T21:57:05Z

+  try {
+    await k8sApi.createNamespacedSecret({
+      namespace: namespace(),
+      body: secret
+    })
+  } catch (err) {
+    if (!(err instanceof k8s.ApiException && err.code === 409)) {
+      throw err
+    }
+  }
+
  return secretName


Fixed: the code no longer swallows the 409. replaceNamespacedSecret/patch is not a viable fallback here because the secret is created with immutable = true (line 824); the code comment at 794–795 calls this out. So read-back-and-compare was the correct choice of the two options offered. If a mismatch is found, an error is thrown.
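
The read-back-and-compare approach described in this reply could look roughly like the sketch below. The SecretStore interface, secretDataMatches, and createSecretIdempotent are hypothetical names for illustration; the actual change calls the Kubernetes client directly.

```typescript
type SecretData = Record<string, string>

// True only when both maps have exactly the same keys and values.
function secretDataMatches(existing: SecretData | undefined, desired: SecretData): boolean {
  const existingData = existing ?? {}
  const desiredKeys = Object.keys(desired)
  if (Object.keys(existingData).length !== desiredKeys.length) {
    return false
  }
  return desiredKeys.every(
    key =>
      Object.prototype.hasOwnProperty.call(existingData, key) &&
      existingData[key] === desired[key]
  )
}

// Hypothetical abstraction over the create and read calls.
interface SecretStore {
  create(name: string, data: SecretData): Promise<void>
  read(name: string): Promise<SecretData | undefined>
}

async function createSecretIdempotent(
  store: SecretStore,
  name: string,
  data: SecretData
): Promise<string> {
  try {
    await store.create(name, data)
  } catch (err) {
    if ((err as { code?: unknown } | null)?.code !== 409) {
      throw err
    }
    // 409: the secret already exists. Accept it only if it holds the data we
    // intended to write, which is the lost-response-after-success case.
    const existing = await store.read(name)
    if (!secretDataMatches(existing, data)) {
      throw new Error(`Secret ${name} already exists with different data`)
    }
  }
  return name
}
```

Because the secret is immutable, comparing and failing on mismatch is the only safe option; the existing object cannot be updated in place.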

+import * as k8s from '@kubernetes/client-node'
+import { isRetryableError, retryAfterDelay } from '../src/k8s'



murphpdx added 2 commits April 20, 2026 11:29

Retry k8s endpoints

cdb3e24

Add unit tests

3119f84

murphpdx requested review from a team and nikola-jokic as code owners April 20, 2026 22:32

nikola-jokic requested a review from Copilot May 14, 2026 14:46

Copilot started reviewing on behalf of nikola-jokic May 14, 2026 14:47 View session

Copilot AI reviewed May 14, 2026

View reviewed changes

murphpdx added 5 commits May 22, 2026 12:24

Merge branch 'main' of github.com:actions/runner-container-hooks into…

4d2051b

… retry-k8s-api

Verify secret data on retry-induced 409s; add retry-wrapper tests

34f9afa

Npm format

a3ce115

lint fix

571e8d4

Remove .idea directory

b97b61e

murphpdx wants to merge 7 commits into actions:main from hydrolix:retry-k8s-api

2 participants
