Self-heal pods wedged by transient gcsfuse-sidecar metadata-server timeouts#286

Open
morgan-wowk wants to merge 1 commit into master from self-heal-wedged-gcsfuse-sidecar

@morgan-wowk morgan-wowk commented Jun 24, 2026

Collaborator

Problem

Pods carrying the GKE-injected gke-gcsfuse-sidecar container can wedge in CreateContainerConfigError when the sidecar's bucket-access-check pre-flight times out resolving Workload Identity through a degraded metadata server (the sidecar exits 255). The GCS volume never mounts, the main container never starts, and the orchestrator keeps polling the pod until the run-level timeout cancels it with no logs. A transient platform fault therefore burns a whole run (e.g. a daily eval). There is no execution-level auto-retry today.

Approach: re-queue in place

When the wedge is detected on a running execution, the orchestrator:

  1. Terminates the wedged pod (best-effort delete).
  2. Marks its ContainerExecution SYSTEM_ERROR — this records the failed attempt for forensics and removes it from cache-reuse candidates (reuse only considers PENDING/RUNNING/SUCCEEDED).
  3. Re-queues the same ExecutionNode within the same run, so the canonical launch_container_task path builds a fresh, correct pod (usually landing on a healthier node).

Capped at 2 retries, tracked on the node's extra_data. Beyond the cap the node is failed (SYSTEM_ERROR) and its downstream nodes are skipped, so the run fails fast instead of hanging until its timeout.
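The steps and retry cap above can be sketched roughly as follows. This is an illustrative outline, not the code in orchestrator_sql.py: the `Status`, `ContainerExecution`, and `ExecutionNode` classes are simplified stand-ins, and the `extra_data` key names and the `terminate_pod` callback are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum

MAX_TRANSIENT_RETRIES = 2  # the cap stated in this PR
RETRY_COUNT_KEY = "transient_infra_retry_count"  # hypothetical extra_data key
REASON_KEY = "transient_infra_failure_reason"  # hypothetical extra_data key


class Status(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SYSTEM_ERROR = "SYSTEM_ERROR"


@dataclass
class ContainerExecution:  # simplified stand-in for the real model
    status: Status = Status.RUNNING


@dataclass
class ExecutionNode:  # simplified stand-in for the real model
    status: Status = Status.RUNNING
    extra_data: dict = field(default_factory=dict)


def handle_transient_infra_failure(node, execution, reason, terminate_pod):
    """Return True if the node was re-queued, False if the cap was exceeded."""
    # Step 1: best-effort delete; a failed delete must not block recovery.
    try:
        terminate_pod()
    except Exception:
        pass
    # Step 2: record the failed attempt and exclude it from cache reuse.
    execution.status = Status.SYSTEM_ERROR
    retries = node.extra_data.get(RETRY_COUNT_KEY, 0)
    if retries >= MAX_TRANSIENT_RETRIES:
        # Cap exceeded: fail the node so the run fails fast.
        node.status = Status.SYSTEM_ERROR
        return False
    # Step 3: re-queue the same node; the normal launch path builds a new pod.
    node.extra_data[RETRY_COUNT_KEY] = retries + 1
    node.extra_data[REASON_KEY] = reason
    node.status = Status.QUEUED
    return True
```

With a cap of 2, the first two wedges re-queue the node and the third fails it. Skipping downstream nodes is left out of the sketch.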

Same run, same node, same provenance — no surprise sibling run, no pod-spec surgery.

Why this design

Three alternatives were rejected:

  • skipCSIBucketAccessCheck=true (oasis-backend #405) — proven ineffective on driver v1.21.24-gke.5 in staging (wrong knob; it skips the CSI-node check, not the sidecar check that hits the metadata server).
  • Pod-spec surgery (recreate the pod by stripping server-assigned fields + webhook-injected artifacts) — too brittle / version-coupled.
  • Resubmitting a whole new run — breaks provenance: runs come from many sources (tangle deploy, UI, tangent shell); a surprise run nobody triggered is worse for observability.

Changes

  • launchers/interfaces.py — replace the no-op try_self_heal() hook with a detection predicate transient_infra_failure_reason() -> str | None (default None, so other launchers are unaffected).
  • launchers/kubernetes_launchers.py — implement the predicate for the gcsfuse-sidecar wedge (sidecar exit 255 + transient reason + "bucket access check" message, guarded by the main container not having started). Detection-only; no pod mutation.
  • orchestrator_sql.py — dispatch to _handle_transient_infra_failure(...) from the running-execution path; terminate + SYSTEM_ERROR + re-queue with the 2-retry cap.
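For illustration, the detection predicate might look something like the sketch below. It works on a plain dict shaped like a Kubernetes pod `status` rather than the client objects the launcher actually uses. The main container name, the `TRANSIENT_MARKERS` substrings, and the returned reason string are assumptions; the PR does not spell out the exact transient reasons it matches.

```python
from typing import Optional

SIDECAR_NAME = "gke-gcsfuse-sidecar"
# Hypothetical markers for a "transient" failure; the real list is not given here.
TRANSIENT_MARKERS = ("deadline exceeded", "timeout", "timed out")


def transient_infra_failure_reason(
    pod_status: dict, main_name: str = "main"
) -> Optional[str]:
    """Return a reason string for the gcsfuse-sidecar wedge, else None."""
    # The sidecar may be reported as an init container or a regular one.
    statuses = list(pod_status.get("initContainerStatuses") or [])
    statuses += list(pod_status.get("containerStatuses") or [])
    by_name = {s["name"]: s for s in statuses}

    # Guard: only a pod whose main container never started can be wedged.
    main_state = by_name.get(main_name, {}).get("state") or {}
    if "running" in main_state or "terminated" in main_state:
        return None

    sidecar = by_name.get(SIDECAR_NAME)
    if sidecar is None:
        return None
    terminated = (sidecar.get("state") or {}).get("terminated") or (
        sidecar.get("lastState") or {}
    ).get("terminated")
    if not terminated or terminated.get("exitCode") != 255:
        return None

    message = (terminated.get("message") or "").lower()
    if "bucket access check" not in message:
        return None
    if not any(marker in message for marker in TRANSIENT_MARKERS):
        return None
    return "gcsfuse sidecar bucket access check failed transiently (exit 255)"
```

The function only reads the status and never mutates the pod, which matches the detection-only intent; launchers that keep the default of returning None never trigger the re-queue path.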

Model / persistence / UX

  • A node points to one ContainerExecution (FK); on re-queue the FK is repointed to the fresh execution at relaunch. The wedged execution persists as an unreferenced SYSTEM_ERROR row (forensics). No logs lost — a wedged pod never produced container logs.
  • The UI resolves a node → its current container_execution, so after relaunch the tile follows the fresh pod and ends green. The retry is auditable via the node's auto container_execution_status_history and the standalone SYSTEM_ERROR row.
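The persistence behaviour described above can be pictured with a small in-memory model. This is a sketch under assumptions, not the real schema: the `executions` dict stands in for the ContainerExecution table, and `launch` and `unreferenced_execution_ids` are hypothetical helper names.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContainerExecution:  # simplified stand-in for a table row
    id: int
    status: str


@dataclass
class ExecutionNode:  # holds a single FK to its current execution
    container_execution_id: Optional[int] = None


executions = {}  # id -> ContainerExecution, stands in for the table
_next_id = itertools.count(1)


def launch(node: ExecutionNode) -> ContainerExecution:
    """Create a fresh execution and repoint the node's FK to it."""
    execution = ContainerExecution(id=next(_next_id), status="RUNNING")
    executions[execution.id] = execution
    node.container_execution_id = execution.id
    return execution


def unreferenced_execution_ids(node: ExecutionNode) -> list:
    """Executions that no longer back the node (kept for forensics)."""
    return [i for i in executions if i != node.container_execution_id]


node = ExecutionNode()
first = launch(node)           # the attempt that wedges
first.status = "SYSTEM_ERROR"  # recorded as a failed attempt
second = launch(node)          # relaunch repoints the FK
second.status = "SUCCEEDED"
```

After the relaunch, anything that resolves the node to its current execution sees only the second attempt, while the first one stays behind as a standalone SYSTEM_ERROR row.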

Tests

  • Launcher: transient_infra_failure_reason() returns a reason for the wedge signature and None otherwise (running/terminated main, clean sidecar exit, unrelated message, no sidecar).
  • Orchestrator (in-memory DB, end-to-end): a wedged attempt becomes SYSTEM_ERROR, the node returns to QUEUED with retry count 1 and the reason recorded; exceeding the cap flips the node to SYSTEM_ERROR and skips downstream.

Verified: pytest, black, import smoke all green.


Comment thread cloud_pipelines_backend/launchers/kubernetes_launchers.py Fixed
…meouts

GKE-injected gke-gcsfuse-sidecar pods can wedge in CreateContainerConfigError
when the sidecar's bucket-access-check pre-flight times out resolving Workload
Identity via a degraded metadata server (exit 255). The GCS volume never mounts,
the main container never starts, and the orchestrator polls the pod until the
run-level timeout cancels it with no logs — burning a whole run on a transient
platform fault.

Detect that signature and self-heal by relaunching the task in place:

- launchers/interfaces.py: replace the no-op try_self_heal() hook with a
  detection predicate transient_infra_failure_reason() -> str | None (default
  None, so other launchers are unaffected).
- kubernetes_launchers.py: implement the predicate for the gcsfuse-sidecar wedge
  (sidecar exit 255 + transient reason + bucket-access-check message, guarded by
  the main container not having started). No pod-spec surgery.
- orchestrator_sql.py: when a running execution reports a transient infra
  failure, terminate the wedged pod, mark its ContainerExecution SYSTEM_ERROR
  (which records the failed attempt and excludes it from cache reuse), and
  re-queue the same ExecutionNode within the same run so the canonical launch
  path builds a fresh pod. Capped at 2 retries (tracked on the node's
  extra_data); beyond that the node fails and downstream is skipped so the run
  fails fast. Same run, same node, same provenance — no new run, no pod surgery.

Tests cover the launcher detection signature and the orchestrator re-queue /
retry-cap behaviour end to end against an in-memory DB.
@morgan-wowk morgan-wowk force-pushed the self-heal-wedged-gcsfuse-sidecar branch from 279723d to 13dd31c Compare June 24, 2026 21:08
@morgan-wowk morgan-wowk marked this pull request as ready for review June 24, 2026 21:16
@morgan-wowk morgan-wowk requested a review from Ark-kun as a code owner June 24, 2026 21:16