Problem
A restart OpsRequest can be marked Failed when a recreated Pod temporarily reports a kubelet waiting message but later becomes Ready.
In the preserved ES case, the Pod event was:
Error: failed to sync configmap cache: timed out waiting for the condition
The Pod later completed init/main startup and the ES cluster was healthy, but the OpsRequest progress had already been finalized as Failed.
Suspected controller path
IsPodFailedAndTimedOut() treats any container Waiting state with a non-empty message as a failed container after PodContainerFailedTimeout.
That transient failure condition is propagated through InstanceSet InstanceFailure and then Ops progress finalization.
Expected behavior
Recoverable kubelet startup wait states should not be treated as terminal instance failures. Ops progress should only fail for terminal waiting reasons or unrecovered pod failures.
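One way to express the expected behavior is to gate the failure decision on the waiting reason rather than on the mere presence of a message. The sketch below is only an illustration under stated assumptions: the shouldFailInstance helper and the terminalWaitingReasons set are hypothetical, and which reasons count as terminal is a policy choice this issue does not settle. The reason strings themselves are ones the kubelet reports.

```go
package main

import "fmt"

// terminalWaitingReasons is an illustrative, assumed set of waiting reasons
// that are unlikely to clear without user intervention.
var terminalWaitingReasons = map[string]bool{
	"ErrImagePull":     true,
	"ImagePullBackOff": true,
	"InvalidImageName": true,
}

// shouldFailInstance (hypothetical helper) reports a failure only when the Pod
// has not become Ready, the timeout has elapsed, and the reason is terminal.
func shouldFailInstance(reason string, timedOut, podReady bool) bool {
	if podReady {
		return false // a recovered Pod is never a failure
	}
	return timedOut && terminalWaitingReasons[reason]
}

func main() {
	// Recoverable startup wait: not a failure even after the timeout.
	fmt.Println(shouldFailInstance("PodInitializing", true, false))
	// Terminal reason that outlived the timeout: failure.
	fmt.Println(shouldFailInstance("InvalidImageName", true, false))
	// Pod recovered: not a failure, whatever the earlier reason was.
	fmt.Println(shouldFailInstance("InvalidImageName", true, true))
}
```

An allowlist of terminal reasons errs toward waiting longer on unknown reasons. A denylist of known-recoverable reasons would err the other way, so the choice depends on which mistake is cheaper for Ops progress.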