Current Issue Overview
Customers can launch OpenShift clusters successfully. However, when they try to add bare metal nodes to an OpenShift cluster in OCI, the installation flow fails during node boot. The failure occurs after the bare metal node is launched and becomes visible in the Red Hat console for installation, but before the installation can complete. This has disrupted customer workflows and our bare metal autoscaling use case, which is why it became a high-priority issue.
What changed
The issue appears to be tied to a change introduced in openshift/assisted-service PR #8837 (openshift/assisted-service#8837). That change affected the early boot flow used during bare metal node installation for an OpenShift cluster and exposed a problem in our OCI environment.
How it affected us
The issue blocked bare metal node addition for OpenShift clusters in OCI. Customers could create an OpenShift cluster, but they could not complete the installation when adding bare metal nodes. It also broke our bare metal autoscaling use case for OpenShift clusters, which increased the urgency of the issue.
Why it only affected OCI
This issue only affected OCI because OCI exposed a unique NIC configuration in this flow. One NIC was configured with DHCP and another was not, which created the exact condition needed for the problem to appear during OpenShift cluster node installation.
Technical root cause
The root cause was a race condition between network configuration and volume mounting during boot. Both operations had 90-second timeouts. If the network was not fully ready before the mount step needed it, the node boot flow for the OpenShift cluster failed.
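The timing relationship behind this race can be sketched as a toy model. This is only an illustration under stated assumptions, not the actual boot logic: the names BootTimeline, network_ready_at_s, mount_started_at_s, and mount_succeeds are hypothetical, and the only value taken from this write-up is the 90-second timeout.

```python
from dataclasses import dataclass

# Both operations had 90-second timeouts, per the root cause above.
TIMEOUT_S = 90.0


@dataclass
class BootTimeline:
    """Hypothetical timestamps, in seconds since the start of boot."""
    network_ready_at_s: float  # assumed moment the network becomes usable
    mount_started_at_s: float  # assumed moment the mount step starts waiting


def mount_succeeds(timeline: BootTimeline, mount_timeout_s: float = TIMEOUT_S) -> bool:
    """Return True only if the network is ready before the mount deadline."""
    mount_deadline_s = timeline.mount_started_at_s + mount_timeout_s
    return timeline.network_ready_at_s < mount_deadline_s


# Network comes up quickly: the mount step finds it ready in time.
print(mount_succeeds(BootTimeline(network_ready_at_s=5.0, mount_started_at_s=0.0)))

# Network configuration uses its whole 90-second window while the mount
# step has been waiting since boot: the mount deadline expires first.
print(mount_succeeds(BootTimeline(network_ready_at_s=90.0, mount_started_at_s=0.0)))
```

Because the two timeouts are equal, the model leaves no margin: if network configuration runs to the end of its window, the mount step that started at the same time cannot succeed.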
Temporary mitigation
The temporary workaround is applied during the installation flow for bare metal nodes in an OpenShift cluster on OCI. After the bare metal node is launched and appears in the Red Hat console for installation, the workaround script must be run before starting the installation. Once the script finishes successfully, the installation can proceed. The temporary fix is documented in Adrien’s workaround gist (https://gist.github.com/adriengentil/8f25cd2c9a9618ba81ed779cf0ec2864).
Current status and long-term fix
The customer is currently unblocked through the temporary fix for existing clusters. A long-term fix is under development and is expected next week. Once that fix is delivered and rolled out in the next Red Hat image, the related bare metal autoscaling issue for OpenShift clusters should also be resolved.
References
openshift/assisted-service PR #8837 (openshift/assisted-service#8837)
Adrien’s workaround gist (https://gist.github.com/adriengentil/8f25cd2c9a9618ba81ed779cf0ec2864)