Skip to content

Drain tar writer before destroy in execCpFromPod to prevent silent workspace truncation#378

Draft
jeanschmidt wants to merge 1 commit into
actions:mainfrom
jeanschmidt:jeanschmidt/upstream-pr04-defer-writer-destroy
Draft

Drain tar writer before destroy in execCpFromPod to prevent silent workspace truncation#378
jeanschmidt wants to merge 1 commit into
actions:mainfrom
jeanschmidt:jeanschmidt/upstream-pr04-defer-writer-destroy

Conversation

@jeanschmidt

Copy link
Copy Markdown
Contributor

Drain tar writer before destroy in execCpFromPod to prevent silent workspace truncation

Impact: every Kubernetes hook job that copies files out of the job container via execCpFromPod (workspace export, hash verification staging, any consumer of cp from pod)
Risk: medium

What

On the success path of execCpFromPod, await the tar extraction stream's finished event (bounded by a timeout) before destroying the writer, instead of destroying it immediately when the Kubernetes exec status callback fires.

Why

The Kubernetes Exec status callback fires when the remote tar process exits, but the WebSocket carrying the tar bytes may still have data in flight that the local tar-fs extractor has not yet written to disk. Destroying the writer at that moment silently truncates the extracted workspace — the operation reports success while files on disk are missing or partial. This has been observed in production: cp "succeeds" but downstream steps fail with missing-file errors that do not point back at the copy step.

How

  • Success path now awaits stream.promises.finished(writerStream, { signal: AbortSignal.timeout(timeoutMs) }) before destroying. This blocks until tar-fs has actually closed its file descriptors.
  • The wait is bounded so a malformed or stuck stream cannot hang the hook indefinitely. If the bound trips, a core.warning is logged and the writer is destroyed in a finally — the operation is allowed to complete rather than triggering another retry, since the bytes that did arrive are already on disk and re-running risks compounding the problem.
  • The failure path is unchanged in spirit: writer is destroyed immediately and the error rethrown so the existing 30-attempt retry loop kicks in. Awaiting finished on a writer whose upstream WebSocket reported Failure can hang against a half-open connection, so the drain is deliberately skipped there.
  • The timeout is exposed via ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS (default 60000). Configuration is wired through a named constant + accessor (ENV_TAR_DRAIN_TIMEOUT_MS, tarDrainTimeoutMs()) in packages/k8s/src/k8s/utils.ts, matching the existing useKubeScheduler pattern. Invalid values (NaN, zero, negative, empty) fall back to the default.

Changes

  • packages/k8s/src/k8s/index.ts: wrap the kc exec promise in try/catch; failure path destroys + rethrows; success path awaits stream.finished with AbortSignal.timeout, logs a warning on timeout, destroys the writer in finally.
  • packages/k8s/src/k8s/utils.ts: add ENV_TAR_DRAIN_TIMEOUT_MS, DEFAULT_TAR_DRAIN_TIMEOUT_MS = 60000, and tarDrainTimeoutMs() accessor.
  • packages/k8s/README.md: new Configuration section documenting ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS.
  • packages/k8s/tests/k8s-utils-test.ts: unit tests for the accessor (default, valid override, NaN, zero/negative, empty).
  • packages/k8s/tests/exec-cp-from-pod-drain-test.ts: new integration tests covering (1) success with a late end() is honored without warning, (2) success with a writer that never ends trips the timeout, logs the warning, destroys, and still resolves, (3) failure status destroys immediately and never waits on drain.
  • packages/k8s/tests/error-serialization-test.ts: add destroy: jest.fn() to the tar.extract mock so the unconditional destroy on the production path does not crash an unrelated test.

Notes

  • AbortSignal.timeout requires Node 17.3+. The CI matrix runs Node 20+, and package.json does not pin a lower engines.node, so this is compatible with the supported runtimes; flagging it in case the project decides to formalize the minimum.
  • The default behavior changes: the success path now blocks (up to 60s) on tar drain where it previously returned immediately. For well-formed streams this is a wait on the order of milliseconds. For pathological streams the worst-case latency added per cp is the configured timeout.
  • Operators who need to tune this (very large workspaces over slow links, or to harden against an upstream stream bug) set ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS on the runner container.

- Await `stream.finished(writerStream)` on the success path before destroying, bounded by `AbortSignal.timeout`, so tar-fs finishes writing extracted files to disk
- Keep the failure path's immediate destroy + rethrow so the existing 30-attempt retry loop is unchanged
- Add `ENV_TAR_DRAIN_TIMEOUT_MS` constant, `DEFAULT_TAR_DRAIN_TIMEOUT_MS = 60000`, and `tarDrainTimeoutMs()` accessor following the existing `useKubeScheduler` pattern
- Document `ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS` in `packages/k8s/README.md` under a new Configuration section
- Add accessor unit tests (default, valid override, NaN, zero/negative, empty) and three integration tests for `execCpFromPod` covering late-end success, drain timeout, and failure path
- Add `destroy: jest.fn()` to the tar.extract mock in `error-serialization-test.ts` so the new unconditional destroy does not crash an unrelated test

Notes

The kc exec status callback fires when the remote tar process exits, but the WebSocket may still have tar bytes in flight that the local tar-fs extractor has not yet flushed to disk. Destroying the writer at that point silently truncates the extracted workspace. The success path now awaits `stream.promises.finished(writerStream, { signal: AbortSignal.timeout(timeoutMs) })`, so the destroy happens only after tar-fs has closed its file descriptors. The timeout bounds a malformed-stream hang; if it fires, a `core.warning` is emitted and the (possibly incomplete) extract is finalized rather than retrying. The failure path is intentionally not drained — awaiting on a writer whose upstream kc channel reported Failure can hang against a half-open WebSocket.
Copilot AI review requested due to automatic review settings June 12, 2026 14:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a bounded “tar drain” phase to execCpFromPod to avoid returning before tar-fs has finished writing extracted bytes, and introduces a configurable timeout via env var.

Changes:

  • Add tarDrainTimeoutMs() + ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS with a 60s default.
  • Update execCpFromPod to await stream.promises.finished() (bounded by timeout) and warn on timeout.
  • Add/extend Jest tests covering timeout parsing and the tar drain success/timeout/failure paths; document the new env var.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
packages/k8s/src/k8s/utils.ts Introduces env var + default and helper to parse the tar drain timeout.
packages/k8s/src/k8s/index.ts Waits for tar extraction stream to drain (with timeout) and destroys stream on both success/failure paths.
packages/k8s/tests/k8s-utils-test.ts Adds unit tests for timeout parsing behavior.
packages/k8s/tests/exec-cp-from-pod-drain-test.ts Adds focused tests to validate drain behavior, timeout warning, and failure handling.
packages/k8s/tests/error-serialization-test.ts Updates tar-fs mock to include destroy() for new stream lifecycle behavior.
packages/k8s/README.md Documents ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS configuration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +601 to 610
async status => {
if (errStream.size()) {
reject(
new Error(
`Error from cpFromPod - details: \n ${errStream.getContentsAsString()}`
)
)
)
}
resolve(status)
}
Comment on lines +601 to 610
async status => {
if (errStream.size()) {
reject(
new Error(
`Error from cpFromPod - details: \n ${errStream.getContentsAsString()}`
)
)
)
}
resolve(status)
}
Comment on lines +275 to +285
export function tarDrainTimeoutMs(): number {
const raw = process.env[ENV_TAR_DRAIN_TIMEOUT_MS]
if (raw === undefined || raw === '') {
return DEFAULT_TAR_DRAIN_TIMEOUT_MS
}
const parsed = parseInt(raw, 10)
if (Number.isNaN(parsed) || parsed <= 0) {
return DEFAULT_TAR_DRAIN_TIMEOUT_MS
}
return parsed
}
Comment thread packages/k8s/README.md

## Configuration

These environment variables are read on the runner container of the runner pod. All knobs are optional; defaults preserve the historical behavior of the hooks.
@jeanschmidt jeanschmidt marked this pull request as draft June 12, 2026 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants