OpenShift Cluster Bare Metal Node Addition Failure in OCI #82

@nikhisin3001

Current Issue Overview

Customers can launch OpenShift clusters successfully. However, when they try to add bare metal nodes to an OpenShift cluster in OCI, the installation flow fails during node boot. The failure occurs after the bare metal node is launched and becomes visible in the Red Hat console for installation, but before the installation can complete. This has disrupted customer workflows and also affected our bare metal autoscaling use case, which is why it became a high-priority issue.

What changed

The issue appears to be tied to a change introduced in openshift/assisted-service PR #8837 (openshift/assisted-service#8837). That change affected the early boot flow used during bare metal node installation for an OpenShift cluster and exposed a problem in our OCI environment.

How it affected us

The issue blocked bare metal node addition for OpenShift clusters in OCI. Customers could create an OpenShift cluster, but adding bare metal nodes to it did not complete. Because bare metal autoscaling for OpenShift clusters depends on the same node addition flow, it was affected as well, which increased the urgency of the issue.

Why it only affected OCI

This issue only affected OCI because OCI exposed a unique NIC configuration in this flow. One NIC was configured with DHCP and another was not, which created the exact condition needed for the problem to appear during OpenShift cluster node installation.

Technical root cause

The root cause was a race condition between network configuration and volume mounting during boot. Both operations had 90-second timeouts. If the network was not fully ready before the mount step needed it, the node boot flow for the OpenShift cluster failed.
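The timing relationship described above can be pictured with a small model. This is only an illustrative sketch: the `BootTimeline` fields, the function names, and the assumption that the mount step waits up to its full timeout for the network are hypothetical and are not taken from assisted-service or from the OCI boot flow. Only the 90-second value comes from this issue.

```python
from dataclasses import dataclass
from typing import Optional

# Both operations had 90-second timeouts (value taken from the issue text).
TIMEOUT_S = 90.0


@dataclass
class BootTimeline:
    """Hypothetical timeline of one node boot, in seconds since boot start."""
    network_start: float     # when network configuration begins
    network_duration: float  # how long network configuration takes
    mount_start: float       # when the mount step begins waiting


def network_ready_at(t: BootTimeline) -> Optional[float]:
    """Return when the network is ready, or None if its own timeout expires."""
    if t.network_duration > TIMEOUT_S:
        return None
    return t.network_start + t.network_duration


def mount_succeeds(t: BootTimeline) -> bool:
    """Assumed rule: the mount works only if the network is ready
    strictly before the mount step's own deadline."""
    ready = network_ready_at(t)
    if ready is None:
        return False
    return ready < t.mount_start + TIMEOUT_S


# Fast network: ready at 5 s, well before the mount deadline at 90 s.
fast = BootTimeline(network_start=0.0, network_duration=5.0, mount_start=0.0)

# Slow network: it uses its whole 90 s budget, so it is ready exactly when
# the mount deadline expires. Equal timeouts leave no margin.
slow = BootTimeline(network_start=0.0, network_duration=90.0, mount_start=0.0)

print(mount_succeeds(fast))
print(mount_succeeds(slow))
```

In this model, two steps that start together with identical timeouts have no safety margin: if network configuration consumes its full budget, the mount step has already run out of time. Whether the real boot flow matches this picture exactly is an assumption.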

Temporary mitigation

The temporary mitigation is a workaround applied during the installation flow for bare metal nodes in an OpenShift cluster on OCI. After the bare metal node is launched and appears in the Red Hat console for installation, the workaround script must be run before starting the installation. Once the script finishes successfully, the installation can proceed. The temporary fix is documented in Adrien’s workaround gist (https://gist.github.com/adriengentil/8f25cd2c9a9618ba81ed779cf0ec2864).

Current status and long-term fix

The customer is currently unblocked through the temporary fix for existing clusters. A long-term fix is under development and is expected next week. Once that fix is delivered and rolled out in the next Red Hat image, the related bare metal autoscaling issue for OpenShift clusters should also be resolved.

References

openshift/assisted-service PR #8837 (openshift/assisted-service#8837)
Adrien’s workaround gist (https://gist.github.com/adriengentil/8f25cd2c9a9618ba81ed779cf0ec2864)
