Self-heal pods wedged by transient gcsfuse-sidecar metadata-server timeouts by morgan-wowk · Pull Request #286 · TangleML/tangle

morgan-wowk · 2026-06-24T20:36:14Z

Problem

GKE-injected gke-gcsfuse-sidecar pods can wedge in CreateContainerConfigError when the sidecar's bucket-access-check pre-flight times out resolving Workload Identity via a degraded metadata server (sidecar exits 255). The GCS volume never mounts, the main container never starts, and the orchestrator polls the pod until the run-level timeout cancels it with no logs — burning a whole run (e.g. a daily eval) on a transient platform fault. There is no execution-level auto-retry today.

Approach: re-queue in place

When the wedge is detected on a running execution, the orchestrator:

Terminates the wedged pod (best-effort delete).
Marks its ContainerExecution SYSTEM_ERROR — this records the failed attempt for forensics and removes it from cache-reuse candidates (reuse only considers PENDING/RUNNING/SUCCEEDED).
Re-queues the same ExecutionNode within the same run, so the canonical launch_container_task path builds a fresh, correct pod (usually landing on a healthier node).

Capped at 2 retries, tracked on the node's extra_data. Beyond the cap the node is failed (SYSTEM_ERROR) and its downstream skipped, so the run fails fast instead of hanging until its timeout.
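The retry cap above can be sketched as a small decision function. This is a minimal illustration, not the PR's code: the `Node` stand-in, the `handle_transient_failure` name, and both `extra_data` key names are hypothetical, since the description only says the count is tracked on the node's extra_data.

```python
from dataclasses import dataclass, field

MAX_TRANSIENT_RETRIES = 2  # cap stated in the PR description

# Hypothetical key names; the PR only says the count lives on extra_data.
RETRY_COUNT_KEY = "transient_infra_retry_count"
REASON_KEY = "transient_infra_failure_reason"


@dataclass
class Node:
    """Stand-in for an ExecutionNode; field names are illustrative."""
    status: str = "RUNNING"
    extra_data: dict = field(default_factory=dict)


def handle_transient_failure(node: Node, reason: str) -> str:
    """Re-queue the node while under the cap, otherwise fail it."""
    retries = node.extra_data.get(RETRY_COUNT_KEY, 0)
    if retries >= MAX_TRANSIENT_RETRIES:
        # Cap exhausted: fail fast instead of waiting for the run timeout.
        node.status = "SYSTEM_ERROR"
        return node.status
    # Record the attempt and hand the node back to the normal launch path.
    node.extra_data[RETRY_COUNT_KEY] = retries + 1
    node.extra_data[REASON_KEY] = reason
    node.status = "QUEUED"
    return node.status
```

With this shape, the first two wedges re-queue the node and the third marks it SYSTEM_ERROR. Terminating the pod and skipping downstream nodes are left out of the sketch.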

Same run, same node, same provenance — no surprise sibling run, no pod-spec surgery.

Why this design

Three alternatives were rejected:

skipCSIBucketAccessCheck=true (oasis-backend #405) — proven ineffective on driver v1.21.24-gke.5 in staging (wrong knob; it skips the CSI-node check, not the sidecar check that hits the metadata server).
Pod-spec surgery (recreate the pod by stripping server-assigned fields + webhook-injected artifacts) — too brittle / version-coupled.
Resubmitting a whole new run — breaks provenance: runs come from many sources (tangle deploy, UI, tangent shell); a surprise run nobody triggered is worse for observability.

Changes

launchers/interfaces.py — replace the no-op try_self_heal() hook with a detection predicate transient_infra_failure_reason() -> str | None (default None, so other launchers are unaffected).
launchers/kubernetes_launchers.py — implement the predicate for the gcsfuse-sidecar wedge (sidecar exit 255 + transient reason + "bucket access check" message, guarded by the main container not having started). Detection-only; no pod mutation.
orchestrator_sql.py — dispatch to _handle_transient_infra_failure(...) from the running-execution path; terminate + SYSTEM_ERROR + re-queue with the 2-retry cap.
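A rough sketch of what the detection predicate might look like, written against a plain dict shaped like a Kubernetes pod `status` object rather than the real launcher types. The `main` container name, the `TRANSIENT_MARKERS` strings, and the choice to read the transient signal from the termination message are assumptions for illustration; the PR does not list the exact strings it matches.

```python
from typing import Optional

SIDECAR_NAME = "gke-gcsfuse-sidecar"
# Assumed markers of a transient fault; not taken from the PR.
TRANSIENT_MARKERS = ("deadline exceeded", "timeout", "unavailable")


def transient_infra_failure_reason(
    pod_status: dict, main_container: str = "main"
) -> Optional[str]:
    """Return a reason string for the sidecar wedge signature, else None."""
    statuses = (pod_status.get("containerStatuses") or []) + (
        pod_status.get("initContainerStatuses") or []
    )
    by_name = {s.get("name"): s for s in statuses}

    # Guard: if the main container ever started, this is not the wedge.
    main_state = by_name.get(main_container, {}).get("state") or {}
    if "running" in main_state or "terminated" in main_state:
        return None

    sidecar = by_name.get(SIDECAR_NAME)
    if sidecar is None:
        return None
    terminated = (sidecar.get("state") or {}).get("terminated")
    if not terminated or terminated.get("exitCode") != 255:
        return None

    message = (terminated.get("message") or "").lower()
    if "bucket access check" not in message:
        return None
    if not any(marker in message for marker in TRANSIENT_MARKERS):
        return None
    return "gcsfuse sidecar bucket access check failed transiently (exit 255)"
```

The function only inspects status and never mutates the pod, which matches the detection-only role described above. Returning None by default keeps every non-matching pod on the existing code path.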

Model / persistence / UX

A node points to one ContainerExecution (FK); on re-queue the FK is repointed to the fresh execution at relaunch. The wedged execution persists as an unreferenced SYSTEM_ERROR row (forensics). No logs lost — a wedged pod never produced container logs.
The UI resolves a node → its current container_execution, so after relaunch the tile follows the fresh pod and ends green. The retry is auditable via the node's auto container_execution_status_history and the standalone SYSTEM_ERROR row.
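The repointing behaviour can be illustrated with an in-memory model. This is a simplified sketch, not the project's SQL schema: the dataclasses, the `executions` store, and the `launch` helper are hypothetical stand-ins for the real tables and launch path.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional

_next_id = itertools.count(1)


@dataclass
class ContainerExecution:
    id: int
    status: str


@dataclass
class ExecutionNode:
    # Stand-in for the foreign key to the node's current execution.
    container_execution_id: Optional[int] = None


# Stand-in for the container-execution table.
executions: Dict[int, ContainerExecution] = {}


def launch(node: ExecutionNode) -> ContainerExecution:
    """Create a fresh execution and repoint the node at it."""
    execution = ContainerExecution(id=next(_next_id), status="PENDING")
    executions[execution.id] = execution
    node.container_execution_id = execution.id
    return execution


node = ExecutionNode()
first = launch(node)
first.status = "SYSTEM_ERROR"  # the wedged attempt is recorded, not deleted
second = launch(node)          # relaunch repoints the node to the fresh row
```

After the relaunch, `node.container_execution_id` refers to the second execution, while the first stays in `executions` as an unreferenced SYSTEM_ERROR row, which is the forensic record described above.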

Tests

Launcher: transient_infra_failure_reason() returns a reason for the wedge signature and None otherwise (running/terminated main, clean sidecar exit, unrelated message, no sidecar).
Orchestrator (in-memory DB, end-to-end): a wedged attempt becomes SYSTEM_ERROR, the node returns to QUEUED with retry count 1 and the reason recorded; exceeding the cap flips the node to SYSTEM_ERROR and skips downstream.

Verified: pytest, black, import smoke all green.

morgan-wowk · 2026-06-24T20:36:27Z

Self-heal pods wedged by transient gcsfuse-sidecar metadata-server timeouts #286 👈 (View in Graphite)
master

This stack of pull requests is managed by Graphite. Learn more about stacking.

…meouts

GKE-injected gke-gcsfuse-sidecar pods can wedge in CreateContainerConfigError when the sidecar's bucket-access-check pre-flight times out resolving Workload Identity via a degraded metadata server (exit 255). The GCS volume never mounts, the main container never starts, and the orchestrator polls the pod until the run-level timeout cancels it with no logs, burning a whole run on a transient platform fault.

Detect that signature and self-heal by relaunching the task in place:

- launchers/interfaces.py: replace the no-op try_self_heal() hook with a detection predicate transient_infra_failure_reason() -> str | None (default None, so other launchers are unaffected).
- kubernetes_launchers.py: implement the predicate for the gcsfuse-sidecar wedge (sidecar exit 255 + transient reason + bucket-access-check message, guarded by the main container not having started). No pod-spec surgery.
- orchestrator_sql.py: when a running execution reports a transient infra failure, terminate the wedged pod, mark its ContainerExecution SYSTEM_ERROR (which records the failed attempt and excludes it from cache reuse), and re-queue the same ExecutionNode within the same run so the canonical launch path builds a fresh pod. Capped at 2 retries (tracked on the node's extra_data); beyond that the node fails and downstream is skipped so the run fails fast.

Same run, same node, same provenance: no new run, no pod surgery.

Tests cover the launcher detection signature and the orchestrator re-queue / retry-cap behaviour end to end against an in-memory DB.

github-code-quality Bot found potential problems Jun 24, 2026

Comment thread cloud_pipelines_backend/launchers/kubernetes_launchers.py Fixed

morgan-wowk force-pushed the self-heal-wedged-gcsfuse-sidecar branch from 279723d to 13dd31c Compare June 24, 2026 21:08

morgan-wowk marked this pull request as ready for review June 24, 2026 21:16

morgan-wowk requested a review from Ark-kun as a code owner June 24, 2026 21:16

morgan-wowk wants to merge 1 commit into master from self-heal-wedged-gcsfuse-sidecar