feat: automatic retries when host is stuck in PollingBiosSetup state #1846

@sarchinnvidia

Is this a new feature, an enhancement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem this feature solves

An ingested node was stuck in state HostInitializing/PollingBiosSetup for ~17h+ (SLA is 30 min). I reset the BMC with Manager.Reset GracefulRestart, which did not help. The eventual fix was clicking Machine Setup in the UI and doing a Force Restart of the node, which allowed the machine to move to the next state. There should be automatic retries that do exactly this instead of requiring manual intervention.

Feature Description

From an operator perspective, I would like automatic retries when machines are stuck in state HostInitializing/PollingBiosSetup so that manual intervention is not needed.

Describe your ideal solution

The "fix" in this situation consisted only of staging settings and restarting the node. Automatic retries, triggered once a machine has been out of SLA for a certain amount of time, could take both actions, so I would not need to intervene.
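
To make the request concrete, here is a minimal sketch of the decision logic I have in mind. It is not based on the controller's actual code: `Action`, `RetryPolicy`, `next_action`, the retry cap of 3, and the escalation step are all hypothetical names and values chosen for illustration. Only the state name and the 30 min SLA come from this issue.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum

# State name as reported in this issue.
STUCK_STATE = "HostInitializing/PollingBiosSetup"


class Action(Enum):
    """Hypothetical actions a state handler could take."""
    WAIT = "wait"                # within SLA (or not in the stuck state): keep polling
    RETRY_SETUP = "retry_setup"  # re-stage settings (Machine Setup), then force restart
    ESCALATE = "escalate"        # retries exhausted: alert an operator


@dataclass(frozen=True)
class RetryPolicy:
    sla: timedelta = timedelta(minutes=30)  # SLA quoted in this issue
    max_retries: int = 3                    # assumption, not from the issue


def next_action(state, time_since_entry_or_last_retry, retries_done,
                policy=RetryPolicy()):
    """Decide what to do for a machine, given how long it has been waiting."""
    if state != STUCK_STATE:
        return Action.WAIT
    if time_since_entry_or_last_retry <= policy.sla:
        return Action.WAIT
    if retries_done >= policy.max_retries:
        return Action.ESCALATE
    return Action.RETRY_SETUP
```

With this policy, the node from this report (17h in the state, no retries yet) would get `Action.RETRY_SETUP`, and manual intervention would only be requested after the retry cap is reached.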

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow NCX Infra Controller's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request

Metadata

Assignees

Labels

feature: Feature (deprecated - use issue type, but it's needed for reporting now)

Projects

Status

Backlog

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
