Drain tar writer before destroy in execCpFromPod to prevent silent workspace truncation#378
Draft
jeanschmidt wants to merge 1 commit into
Draft
Conversation
- Await `stream.finished(writerStream)` on the success path before destroying, bounded by `AbortSignal.timeout`, so tar-fs finishes writing extracted files to disk
- Keep the failure path's immediate destroy + rethrow so the existing 30-attempt retry loop is unchanged
- Add `ENV_TAR_DRAIN_TIMEOUT_MS` constant, `DEFAULT_TAR_DRAIN_TIMEOUT_MS = 60000`, and `tarDrainTimeoutMs()` accessor following the existing `useKubeScheduler` pattern
- Document `ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS` in `packages/k8s/README.md` under a new Configuration section
- Add accessor unit tests (default, valid override, NaN, zero/negative, empty) and three integration tests for `execCpFromPod` covering late-end success, drain timeout, and failure path
- Add `destroy: jest.fn()` to the tar.extract mock in `error-serialization-test.ts` so the new unconditional destroy does not crash an unrelated test
Notes
The kc exec status callback fires when the remote tar process exits, but the WebSocket may still have tar bytes in flight that the local tar-fs extractor has not yet flushed to disk. Destroying the writer at that point silently truncates the extracted workspace. The success path now awaits `stream.promises.finished(writerStream, { signal: AbortSignal.timeout(timeoutMs) })`, so the destroy happens only after tar-fs has closed its file descriptors. The timeout bounds a malformed-stream hang; if it fires, a `core.warning` is emitted and the (possibly incomplete) extract is finalized rather than retrying. The failure path is intentionally not drained — awaiting on a writer whose upstream kc channel reported Failure can hang against a half-open WebSocket.
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a bounded “tar drain” phase to execCpFromPod to avoid returning before tar-fs has finished writing extracted bytes, and introduces a configurable timeout via env var.
Changes:
- Add
tarDrainTimeoutMs()+ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MSwith a 60s default. - Update
execCpFromPodto awaitstream.promises.finished()(bounded by timeout) and warn on timeout. - Add/extend Jest tests covering timeout parsing and the tar drain success/timeout/failure paths; document the new env var.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/k8s/src/k8s/utils.ts | Introduces env var + default and helper to parse the tar drain timeout. |
| packages/k8s/src/k8s/index.ts | Waits for tar extraction stream to drain (with timeout) and destroys stream on both success/failure paths. |
| packages/k8s/tests/k8s-utils-test.ts | Adds unit tests for timeout parsing behavior. |
| packages/k8s/tests/exec-cp-from-pod-drain-test.ts | Adds focused tests to validate drain behavior, timeout warning, and failure handling. |
| packages/k8s/tests/error-serialization-test.ts | Updates tar-fs mock to include destroy() for new stream lifecycle behavior. |
| packages/k8s/README.md | Documents ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS configuration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+601
to
610
| async status => { | ||
| if (errStream.size()) { | ||
| reject( | ||
| new Error( | ||
| `Error from cpFromPod - details: \n ${errStream.getContentsAsString()}` | ||
| ) | ||
| ) | ||
| ) | ||
| } | ||
| resolve(status) | ||
| } |
Comment on lines
+601
to
610
| async status => { | ||
| if (errStream.size()) { | ||
| reject( | ||
| new Error( | ||
| `Error from cpFromPod - details: \n ${errStream.getContentsAsString()}` | ||
| ) | ||
| ) | ||
| ) | ||
| } | ||
| resolve(status) | ||
| } |
Comment on lines
+275
to
+285
| export function tarDrainTimeoutMs(): number { | ||
| const raw = process.env[ENV_TAR_DRAIN_TIMEOUT_MS] | ||
| if (raw === undefined || raw === '') { | ||
| return DEFAULT_TAR_DRAIN_TIMEOUT_MS | ||
| } | ||
| const parsed = parseInt(raw, 10) | ||
| if (Number.isNaN(parsed) || parsed <= 0) { | ||
| return DEFAULT_TAR_DRAIN_TIMEOUT_MS | ||
| } | ||
| return parsed | ||
| } |
|
|
||
| ## Configuration | ||
|
|
||
| These environment variables are read on the runner container of the runner pod. All knobs are optional; defaults preserve the historical behavior of the hooks. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Drain tar writer before destroy in execCpFromPod to prevent silent workspace truncation
Impact: every Kubernetes hook job that copies files out of the job container via
execCpFromPod(workspace export, hash verification staging, any consumer ofcpfrom pod)Risk: medium
What
On the success path of
execCpFromPod, await the tar extraction stream'sfinishedevent (bounded by a timeout) before destroying the writer, instead of destroying it immediately when the Kubernetes exec status callback fires.Why
The Kubernetes
Execstatus callback fires when the remotetarprocess exits, but the WebSocket carrying the tar bytes may still have data in flight that the localtar-fsextractor has not yet written to disk. Destroying the writer at that moment silently truncates the extracted workspace — the operation reports success while files on disk are missing or partial. This has been observed in production:cp"succeeds" but downstream steps fail with missing-file errors that do not point back at the copy step.How
stream.promises.finished(writerStream, { signal: AbortSignal.timeout(timeoutMs) })before destroying. This blocks untiltar-fshas actually closed its file descriptors.core.warningis logged and the writer is destroyed in afinally— the operation is allowed to complete rather than triggering another retry, since the bytes that did arrive are already on disk and re-running risks compounding the problem.finishedon a writer whose upstream WebSocket reportedFailurecan hang against a half-open connection, so the drain is deliberately skipped there.ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS(default60000). Configuration is wired through a named constant + accessor (ENV_TAR_DRAIN_TIMEOUT_MS,tarDrainTimeoutMs()) inpackages/k8s/src/k8s/utils.ts, matching the existinguseKubeSchedulerpattern. Invalid values (NaN, zero, negative, empty) fall back to the default.Changes
packages/k8s/src/k8s/index.ts: wrap the kcexecpromise intry/catch; failure path destroys + rethrows; success path awaitsstream.finishedwithAbortSignal.timeout, logs a warning on timeout, destroys the writer infinally.packages/k8s/src/k8s/utils.ts: addENV_TAR_DRAIN_TIMEOUT_MS,DEFAULT_TAR_DRAIN_TIMEOUT_MS = 60000, andtarDrainTimeoutMs()accessor.packages/k8s/README.md: new Configuration section documentingACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MS.packages/k8s/tests/k8s-utils-test.ts: unit tests for the accessor (default, valid override, NaN, zero/negative, empty).packages/k8s/tests/exec-cp-from-pod-drain-test.ts: new integration tests covering (1) success with a lateend()is honored without warning, (2) success with a writer that never ends trips the timeout, logs the warning, destroys, and still resolves, (3) failure status destroys immediately and never waits on drain.packages/k8s/tests/error-serialization-test.ts: adddestroy: jest.fn()to thetar.extractmock so the unconditional destroy on the production path does not crash an unrelated test.Notes
AbortSignal.timeoutrequires Node 17.3+. The CI matrix runs Node 20+, andpackage.jsondoes not pin a lowerengines.node, so this is compatible with the supported runtimes; flagging it in case the project decides to formalize the minimum.cpis the configured timeout.ACTIONS_RUNNER_TAR_DRAIN_TIMEOUT_MSon the runner container.