Transient pod waiting state can fail restart OpsRequest #10300

@weicao

Description

Problem

A restart OpsRequest can be marked Failed when a recreated Pod temporarily reports a kubelet waiting message but later becomes Ready.

In the preserved ES case, the Pod reported this event:

Error: failed to sync configmap cache: timed out waiting for the condition

The Pod later completed init and main container startup and the ES cluster was healthy, but by then the OpsRequest progress had already been finalized as Failed.

Suspected controller path

IsPodFailedAndTimedOut() treats any container Waiting state with a non-empty message as a failed container after PodContainerFailedTimeout.

That transient failure is then propagated through InstanceSet InstanceFailure into Ops progress finalization.
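
To make the suspected path concrete, here is a minimal Go model of the check as described above. It is not the actual KubeBlocks code: `containerWaiting`, `failedAndTimedOut`, and the `failedTimeout` value are hypothetical stand-ins, and the `CreateContainerConfigError` reason is an assumption, since only the event message was captured.

```go
package main

import (
	"fmt"
	"time"
)

// containerWaiting is a hypothetical, trimmed-down stand-in for the
// per-container Waiting state that Kubernetes reports (reason + message).
type containerWaiting struct {
	Reason  string
	Message string
}

// failedTimeout stands in for PodContainerFailedTimeout; the value is assumed.
const failedTimeout = 10 * time.Second

// failedAndTimedOut models the suspected behavior: any Waiting state with a
// non-empty message counts as failed once it has lasted past the timeout.
// The Reason field is never inspected, which is the suspected problem.
func failedAndTimedOut(w *containerWaiting, waitingFor time.Duration) bool {
	if w == nil || w.Message == "" {
		return false
	}
	return waitingFor > failedTimeout
}

func main() {
	w := &containerWaiting{
		// Assumed reason; the issue only records the message.
		Reason:  "CreateContainerConfigError",
		Message: "failed to sync configmap cache: timed out waiting for the condition",
	}
	// A recoverable wait is flagged as failed purely because it has a message.
	fmt.Println(failedAndTimedOut(w, 2*failedTimeout))
}
```

In this model the Pod's later recovery cannot matter, because the verdict is reached as soon as the timeout elapses.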

Expected behavior

Recoverable kubelet startup wait states should not be treated as terminal instance failures. Ops progress should only fail for terminal waiting reasons or unrecovered pod failures.
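
One possible shape for that rule, sketched in Go. Everything here is hypothetical: `waitingState`, `isTerminalWaiting`, and `shouldFailProgress` are illustrative names, and the set of reasons treated as terminal is an assumption for discussion, not something this issue has settled.

```go
package main

import "fmt"

// waitingState is a hypothetical stand-in for a container's Waiting state.
type waitingState struct {
	Reason  string
	Message string
}

// terminalWaitingReasons is an assumed, illustrative set of kubelet waiting
// reasons to treat as terminal. Which reasons belong here is open for debate.
var terminalWaitingReasons = map[string]bool{
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
	"InvalidImageName": true,
	"CrashLoopBackOff": true,
}

// isTerminalWaiting reports whether the waiting reason is in the assumed set.
func isTerminalWaiting(w *waitingState) bool {
	return w != nil && terminalWaitingReasons[w.Reason]
}

// shouldFailProgress fails the instance only when the Pod has not recovered:
// a Ready Pod never fails, a Pod in a failed phase always does, and a waiting
// container fails only for a terminal reason after the timeout has elapsed.
func shouldFailProgress(w *waitingState, podReady, podFailed, timedOut bool) bool {
	if podReady {
		return false
	}
	if podFailed {
		return true
	}
	return timedOut && isTerminalWaiting(w)
}

func main() {
	transient := &waitingState{
		Reason:  "CreateContainerConfigError", // assumed reason for the ES case
		Message: "failed to sync configmap cache: timed out waiting for the condition",
	}
	fmt.Println(shouldFailProgress(transient, false, false, true))
	fmt.Println(shouldFailProgress(&waitingState{Reason: "ImagePullBackOff"}, false, false, true))
}
```

Keying on the reason instead of the message means a transient wait like the one above is left alone, while the readiness check prevents a recovered Pod from being reported as failed.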
