fix(deploy): mark broken-image deploys failed on ProgressDeadlineExceeded (#101)
A deploy whose built image cannot start (CreateContainerError "no command specified" from an empty image, ImagePullBackOff, CrashLoopBackOff) was reported "deploying" forever: deploymentStatusFromK8s only checked DeploymentReplicaFailure + replica counts, so a created-but-unstartable pod (UnavailableReplicas>0) mapped to "deploying". The failure-autopsy is gated on newStatus==failed, so it never fired — no autopsy event, no deploy.failed audit, no failure email. This is the runtime twin of the build-Job-failed override.
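To make the gap concrete, here is a minimal sketch of a status mapping that includes the ProgressDeadlineExceeded branch. It is not the repository's deploymentStatusFromK8s: the `Condition` and `DeploymentState` types and the `statusFromState` function are hypothetical stand-ins, and both the healthy test (any available replica) and the ordering of the ReplicaFailure check are assumptions. Only the condition type, status, and reason strings follow the Kubernetes Deployment API.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes Deployment condition.
type Condition struct {
	Type   string // e.g. "Progressing", "ReplicaFailure"
	Status string // "True", "False" or "Unknown"
	Reason string // e.g. "ProgressDeadlineExceeded"
}

// DeploymentState is a hypothetical, trimmed view of a Deployment's status.
type DeploymentState struct {
	AvailableReplicas int32
	Conditions        []Condition
}

// statusFromState is an illustrative mapping only (assumed logic).
func statusFromState(d DeploymentState) string {
	// Healthy branch first: if anything still serves (for example an old
	// ReplicaSet during a partially failed redeploy), report healthy.
	if d.AvailableReplicas > 0 {
		return "healthy"
	}
	for _, c := range d.Conditions {
		// Check described as already present before the fix.
		if c.Type == "ReplicaFailure" && c.Status == "True" {
			return "failed"
		}
		// Branch added by the fix: the rollout stalled and nothing is available.
		if c.Type == "Progressing" && c.Status == "False" &&
			c.Reason == "ProgressDeadlineExceeded" {
			return "failed"
		}
	}
	// Without the branch above, an unstartable pod would land here forever.
	return "deploying"
}

func main() {
	stalled := Condition{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"}
	fmt.Println(statusFromState(DeploymentState{Conditions: []Condition{stalled}}))
	fmt.Println(statusFromState(DeploymentState{AvailableReplicas: 1, Conditions: []Condition{stalled}}))
	fmt.Println(statusFromState(DeploymentState{}))
}
```

Under these assumptions the three calls print failed, healthy, and deploying in that order. The middle case is the one the ordering protects: a stalled rollout that still has an available replica stays healthy.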
- deploymentStatusFromK8s: Progressing=False/ProgressDeadlineExceeded with no available replica -> failed (checked after the healthy branch so a partially-failed redeploy whose old ReplicaSet still serves stays healthy). Kept in sync with the api's deploymentStatus.
- extractPodFailure: classify CreateContainerError / CreateContainerConfigError / RunContainerError -> new StartFailed reason (precise hint instead of Unknown).
- new metric instant_deploy_runtime_failed_detected_total{reason} (twin of instant_deploy_job_failed_detected_total); alert + dashboard tile + catalog row land in the infra repo (rule 25).
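The extractPodFailure change in the list above can be sketched as a lookup on the container's waiting reason. This is an assumed shape, not the actual function: `FailureReason`, `classifyWaitingReason`, and the constant names are hypothetical, and only the three Kubernetes waiting reasons and the StartFailed / Unknown labels come from the commit message. Reasons the real code already handles (such as ImagePullBackOff or CrashLoopBackOff) are left out because their labels are not given here.

```go
package main

import "fmt"

// FailureReason is a hypothetical type for the classified pod failure.
type FailureReason string

const (
	// ReasonStartFailed mirrors the "StartFailed" reason named in the commit.
	ReasonStartFailed FailureReason = "StartFailed"
	// ReasonUnknown mirrors the previous fallback, "Unknown".
	ReasonUnknown FailureReason = "Unknown"
)

// classifyWaitingReason maps a container's waiting reason to a failure reason.
// Only the start-failure cases from the commit message are modelled.
func classifyWaitingReason(waiting string) FailureReason {
	switch waiting {
	case "CreateContainerError", "CreateContainerConfigError", "RunContainerError":
		// The container was created or configured but could not be started.
		return ReasonStartFailed
	default:
		// Everything not modelled in this sketch falls through to Unknown.
		return ReasonUnknown
	}
}

func main() {
	for _, r := range []string{"CreateContainerError", "RunContainerError", "SomethingElse"} {
		fmt.Printf("%s -> %s\n", r, classifyWaitingReason(r))
	}
}
```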
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
internal/metrics/metrics.go
28 additions & 0 deletions
@@ -665,6 +665,34 @@ var (
Help: "Kaniko build Jobs detected in Failed state by deploy_status_reconcile (silent-deploy-failure fix, 2026-05-30). Labelled by Job Failed-condition reason.",
Help: "Runtime Deployments detected as failed-to-progress by deploy_status_reconcile (ProgressDeadlineExceeded with no available replica — broken-image silent-failure fix, 2026-06-08). Labelled by detection reason.",