fix(deploy): mark broken-image deploys failed on ProgressDeadlineExceeded#101
Merged
A deploy whose built image cannot start (CreateContainerError "no command specified" from an empty image, ImagePullBackOff, CrashLoopBackOff) was reported "deploying" forever: deploymentStatusFromK8s only checked DeploymentReplicaFailure + replica counts, so a created-but-unstartable pod (UnavailableReplicas>0) mapped to "deploying". The failure-autopsy is gated on newStatus==failed, so it never fired — no autopsy event, no deploy.failed audit, no failure email. This is the runtime twin of the build-Job-failed override.
- deploymentStatusFromK8s: Progressing=False/ProgressDeadlineExceeded with no available replica -> failed (checked after the healthy branch so a partially-failed redeploy whose old ReplicaSet still serves stays healthy). Kept in sync with the api's deploymentStatus.
- extractPodFailure: classify CreateContainerError / CreateContainerConfigError / RunContainerError -> new StartFailed reason (precise hint instead of Unknown).
- new metric instant_deploy_runtime_failed_detected_total{reason} (twin of instant_deploy_job_failed_detected_total); alert + dashboard tile + catalog row land in the infra repo (rule 25).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem (found live, 2026-06-08)
6 instant-deploy-* pods were stuck in CreateContainerError (failed to generate spec: no command specified) for 13+ min, retrying the image pull 46×. The pulled image was 474 bytes: an empty image with no CMD/ENTRYPOINT. The deploys' DB rows were stuck at deploying forever.
Root cause:
deploymentStatusFromK8s only checked DeploymentReplicaFailure + replica counts. A pod that is created but whose container can't start has UnavailableReplicas>0 and no DeploymentReplicaFailure condition → mapped to deploying. The build Job succeeded (it produced an image, just a broken one), so the Job-failed override (#65) never fired. The failure-autopsy is gated on newStatus==failed, so it never ran → no autopsy event, no deploy.failed audit, no failure email. This is the runtime twin of the build-side silent-deploy-failure fix (rule 27).
Confirmed on the live Deployment:
Available=False (MinimumReplicasUnavailable), Progressing=False (ProgressDeadlineExceeded), container waiting.reason=CreateContainerError.
Fix
- deploymentStatusFromK8s: Progressing=False + reason=ProgressDeadlineExceeded with no available replica → failed. Checked after the healthy branch, so a partially-failed redeploy whose previous ReplicaSet still serves stays healthy. Kept in sync with the api's deploymentStatus.
- extractPodFailure: classify CreateContainerError / CreateContainerConfigError / RunContainerError → new StartFailed reason with a precise hint (was Unknown). Once the status flips to failed, the autopsy now emits the right reason + the deploy.failed audit → the user gets a failure email with an actionable cause.
- instant_deploy_runtime_failed_detected_total{reason}: the runtime twin of instant_deploy_job_failed_detected_total. Alert + dashboard tile + catalog row land in the infra repo (rule 25).
Tests
- TestDeploymentStatusFromK8s_Matrix: +3 cases (progress-deadline → failed; healthy-wins-when-available; Progressing=True stays deploying).
- TestExtractPodFailure_StartFailed: all three create/run-container-error reasons → StartFailed.
- TestComputeNewStatus_ProgressDeadlineExceeded_FailsAndCountsMetric + _ReplicaFailure_DoesNotCountRuntimeMetric: status + metric attribution.
- make gate green locally.
Pairs with api PR (same title) and infra observability PR.
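For reviewers who want the branch ordering at a glance, here is a minimal, self-contained sketch of the status mapping and the start-failure classification described under Fix. It is not the code in this PR: the types Condition and DeploymentState and the functions hasCondition, statusFromState and classifyWaitingReason are hypothetical stand-ins, "healthy" is assumed to mean at least one available replica, and the position of the ReplicaFailure check relative to the healthy branch is also an assumption. The real deploymentStatusFromK8s and extractPodFailure may differ in all of these details.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes Deployment condition.
// Hypothetical type for this sketch, not the project's real type.
type Condition struct {
	Type   string // e.g. "Available", "Progressing", "ReplicaFailure"
	Status string // "True", "False" or "Unknown"
	Reason string // e.g. "ProgressDeadlineExceeded"
}

// DeploymentState holds only the fields this sketch reads (hypothetical).
type DeploymentState struct {
	AvailableReplicas int32
	Conditions        []Condition
}

// hasCondition reports whether a condition with the given type and status
// exists. An empty reason matches any reason.
func hasCondition(conds []Condition, typ, status, reason string) bool {
	for _, c := range conds {
		if c.Type == typ && c.Status == status && (reason == "" || c.Reason == reason) {
			return true
		}
	}
	return false
}

// statusFromState sketches the ordering described in the PR text.
func statusFromState(d DeploymentState) string {
	// Pre-existing check. Assumption: it runs before the healthy branch.
	if hasCondition(d.Conditions, "ReplicaFailure", "True", "") {
		return "failed"
	}
	// Healthy branch comes before the new check, so a redeploy whose old
	// ReplicaSet still serves traffic stays healthy.
	// Assumption: "healthy" means at least one available replica.
	if d.AvailableReplicas > 0 {
		return "healthy"
	}
	// New check: the rollout stalled and nothing is serving.
	if hasCondition(d.Conditions, "Progressing", "False", "ProgressDeadlineExceeded") {
		return "failed"
	}
	return "deploying"
}

// classifyWaitingReason maps the three container start errors named in the
// PR to StartFailed. Other reasons are out of scope for this sketch and
// fall through to Unknown; the real function classifies more of them.
func classifyWaitingReason(reason string) string {
	switch reason {
	case "CreateContainerError", "CreateContainerConfigError", "RunContainerError":
		return "StartFailed"
	default:
		return "Unknown"
	}
}

func main() {
	// The condition state confirmed on the live Deployment.
	stuck := DeploymentState{
		AvailableReplicas: 0,
		Conditions: []Condition{
			{Type: "Available", Status: "False", Reason: "MinimumReplicasUnavailable"},
			{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"},
		},
	}
	fmt.Println(statusFromState(stuck), classifyWaitingReason("CreateContainerError"))
}
```

Fed the condition state from the Problem section, the sketch prints `failed StartFailed`. Keeping the deadline check after the healthy branch is what stops a stalled rollout from flipping a still-serving deploy to failed.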
🤖 Generated with Claude Code