
fix(deploy): mark broken-image deploys failed on ProgressDeadlineExceeded#101

Merged
mastermanas805 merged 1 commit into master from fix/deploy-progress-deadline-failed
Jun 8, 2026


@mastermanas805


Problem (found live, 2026-06-08)

6 instant-deploy-* pods were stuck in CreateContainerError (failed to generate spec: no command specified) for 13+ min, retrying the image pull 46×. The pulled image was 474 bytes — an empty image with no CMD/ENTRYPOINT. The deploys' DB rows were stuck at deploying forever.

Root cause: deploymentStatusFromK8s only checked DeploymentReplicaFailure + replica counts. A pod that is created but whose container can't start has UnavailableReplicas>0 and no DeploymentReplicaFailure condition → mapped to deploying. The build Job succeeded (it produced an image, just a broken one), so the Job-failed override (#65) never fired. The failure-autopsy is gated on newStatus==failed, so it never ran → no autopsy event, no deploy.failed audit, no failure email. This is the runtime twin of the build-side silent-deploy-failure fix (rule 27).

Confirmed on the live Deployment: Available=False (MinimumReplicasUnavailable), Progressing=False (ProgressDeadlineExceeded), container waiting.reason=CreateContainerError.

Fix

  • deploymentStatusFromK8s: Progressing=False + reason=ProgressDeadlineExceeded with no available replica → failed. Checked after the healthy branch, so a partially-failed redeploy whose previous ReplicaSet still serves stays healthy. Kept in sync with the api's deploymentStatus.
  • extractPodFailure: classify CreateContainerError / CreateContainerConfigError / RunContainerError → new StartFailed reason with a precise hint (was Unknown). Once the status flips to failed, the autopsy now emits the right reason + the deploy.failed audit → the user gets a failure email with an actionable cause.
  • New metric instant_deploy_runtime_failed_detected_total{reason} — the runtime twin of instant_deploy_job_failed_detected_total. Alert + dashboard tile + catalog row land in the infra repo (rule 25).
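For reviewers who want the branch ordering spelled out, here is a minimal sketch of the status mapping described above. It is not the code in this PR: `Condition`, `DeployState`, `hasCondition`, and `statusFromState` are hypothetical stand-ins for the Kubernetes Deployment status types and for deploymentStatusFromK8s. Placing the ReplicaFailure check first and treating "at least one available replica" as healthy are assumptions made for illustration.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Deployment condition.
type Condition struct {
	Type   string // e.g. "Available", "Progressing", "ReplicaFailure"
	Status string // "True", "False", or "Unknown"
	Reason string // e.g. "ProgressDeadlineExceeded"
}

// DeployState holds only the Deployment status fields this sketch needs.
type DeployState struct {
	AvailableReplicas int32
	Conditions        []Condition
}

// hasCondition reports whether a condition with the given type and status is
// present. An empty reason matches any reason.
func hasCondition(s DeployState, typ, status, reason string) bool {
	for _, c := range s.Conditions {
		if c.Type == typ && c.Status == status && (reason == "" || c.Reason == reason) {
			return true
		}
	}
	return false
}

// statusFromState maps a Deployment state to "failed", "healthy", or "deploying".
func statusFromState(s DeployState) string {
	// Assumption: the pre-existing ReplicaFailure check runs first.
	if hasCondition(s, "ReplicaFailure", "True", "") {
		return "failed"
	}
	// Healthy branch (assumed rule): any available replica means the deploy
	// is still serving, e.g. from the previous ReplicaSet during a redeploy.
	if s.AvailableReplicas > 0 {
		return "healthy"
	}
	// New branch, checked after healthy: the rollout gave up and nothing serves.
	if hasCondition(s, "Progressing", "False", "ProgressDeadlineExceeded") {
		return "failed"
	}
	return "deploying"
}

func main() {
	deadline := Condition{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"}

	// Broken image: deadline exceeded and no replica available.
	fmt.Println(statusFromState(DeployState{Conditions: []Condition{deadline}}))
	// Partially failed redeploy: the old ReplicaSet still serves.
	fmt.Println(statusFromState(DeployState{AvailableReplicas: 1, Conditions: []Condition{deadline}}))
	// Rollout still progressing with nothing available yet.
	fmt.Println(statusFromState(DeployState{Conditions: []Condition{
		{Type: "Progressing", Status: "True", Reason: "ReplicaSetUpdated"},
	}}))
}
```

The ordering is the point of the sketch: because the healthy check precedes the deadline check, a redeploy that times out while the old pods keep serving is not reported as failed.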

Tests

  • TestDeploymentStatusFromK8s_Matrix: +3 cases (progress-deadline→failed; healthy-wins-when-available; Progressing=True stays deploying).
  • TestExtractPodFailure_StartFailed: all three create/run-container-error reasons → StartFailed.
  • TestComputeNewStatus_ProgressDeadlineExceeded_FailsAndCountsMetric + _ReplicaFailure_DoesNotCountRuntimeMetric: status + metric attribution.
  • make gate green locally.
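The shape of the StartFailed check can be sketched as a small table-driven loop. This is an illustration under assumptions, not the test in this PR: `classifyWaitingReason` is a hypothetical helper covering only the new branch of extractPodFailure, and the constant names are invented for the example. Reasons the real function already classifies are out of scope here and simply fall through to Unknown.

```go
package main

import "fmt"

// Hypothetical reason labels; the real constant names may differ.
const (
	ReasonStartFailed = "StartFailed"
	ReasonUnknown     = "Unknown"
)

// classifyWaitingReason maps a container's waiting.reason to a failure reason.
// Only the three create/run-container errors named in this PR are handled.
func classifyWaitingReason(waitingReason string) string {
	switch waitingReason {
	case "CreateContainerError", "CreateContainerConfigError", "RunContainerError":
		return ReasonStartFailed
	default:
		return ReasonUnknown
	}
}

func main() {
	cases := []struct {
		waitingReason string
		want          string
	}{
		{"CreateContainerError", ReasonStartFailed},
		{"CreateContainerConfigError", ReasonStartFailed},
		{"RunContainerError", ReasonStartFailed},
		{"SomeUnrecognizedReason", ReasonUnknown},
	}
	for _, tc := range cases {
		got := classifyWaitingReason(tc.waitingReason)
		fmt.Printf("%-28s got=%-12s ok=%v\n", tc.waitingReason, got, got == tc.want)
	}
}
```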

Pairs with api PR (same title) and infra observability PR.

🤖 Generated with Claude Code

…eded

A deploy whose built image cannot start (CreateContainerError "no command specified" from an empty image, ImagePullBackOff, CrashLoopBackOff) was reported "deploying" forever: deploymentStatusFromK8s only checked DeploymentReplicaFailure + replica counts, so a created-but-unstartable pod (UnavailableReplicas>0) mapped to "deploying". The failure-autopsy is gated on newStatus==failed, so it never fired — no autopsy event, no deploy.failed audit, no failure email. This is the runtime twin of the build-Job-failed override.

- deploymentStatusFromK8s: Progressing=False/ProgressDeadlineExceeded with no available replica -> failed (checked after the healthy branch so a partially-failed redeploy whose old ReplicaSet still serves stays healthy). Kept in sync with the api's deploymentStatus.

- extractPodFailure: classify CreateContainerError / CreateContainerConfigError / RunContainerError -> new StartFailed reason (precise hint instead of Unknown).

- new metric instant_deploy_runtime_failed_detected_total{reason} (twin of instant_deploy_job_failed_detected_total); alert + dashboard tile + catalog row land in the infra repo (rule 25).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 4eb1207 into master Jun 8, 2026
12 checks passed
@mastermanas805 mastermanas805 deleted the fix/deploy-progress-deadline-failed branch June 8, 2026 16:58