DNM: Reproduce placement http endpoint race #1457

Closed
gibizer wants to merge 1 commit into
openstack-k8s-operators:main from
gibizer:placement-http-cell0-conductor
gibizer commented May 27, 2025

Contributor

With the delay introduced into the placement TLS reconciliation, we can reproduce the following sequence of events:

  • The PlacementAPI CR is created with the non-tlse endpoints and is actively reconciled by the placement-operator.
  • The Nova CR is created and is actively reconciled by the nova-operator.
  • placement-operator deploys the service and exposes it in keystone via a KeystoneEndpoint CR with the http URL.
  • nova-operator deploys nova-cell0-conductor, and that service creates a placement client that discovers the http URL for placement.
  • openstack-operator finally updates the PlacementAPI with the tlse config, and therefore placement-operator updates the KeystoneEndpoint CR with the https endpoints. This does not trigger a restart of the nova-cell0-conductor deployment, as that only depends on KeystoneEndpoint/keystone.
  • Two EDPM compute nodes are deployed, a nova instance is created on one of them, and the instance is then requested to be migrated to the other node.
  • nova-cell0-conductor-0 uses its placement client to get the instance allocations from placement. The client uses the http URL and fails, as placement now only speaks https.
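
The sequence above can be sketched as a toy model in plain shell. This is only an illustration under stated assumptions: no real service is contacted, and `start_client`, `server_accepts`, and the variables are hypothetical stand-ins, not nova or placement code. It shows why a client that discovers the URL once at startup keeps failing after the catalog entry switches to https.

```shell
# Toy model of the race; nothing here talks to a real cluster.

# Hypothetical stand-in for the placement entry in the keystone catalog.
catalog_url="http://placement-internal.openstack.svc:8778"
tls_enabled=false

# Hypothetical stand-in for a long-running service that discovers the
# endpoint once at startup and then reuses the cached value.
start_client() {
    cached_url="$catalog_url"
}

# Hypothetical stand-in for the server side: once TLS is enabled,
# only https requests are accepted.
server_accepts() {
    url="$1"
    if [ "$tls_enabled" = "true" ]; then
        case "$url" in
            https://*) return 0 ;;
            *) return 1 ;;
        esac
    fi
    return 0
}

# 1. The conductor starts while the catalog still advertises http.
start_client

# 2. The TLS config lands later: the catalog and the server switch to https,
#    but nothing restarts the client, so its cached URL is now stale.
catalog_url="https://placement-internal.openstack.svc:8778"
tls_enabled=true

# 3. The first request made with the cached URL is rejected.
if server_accepts "$cached_url"; then
    result="ok"
else
    result="400 Bad Request"
fi
echo "cached=$cached_url catalog=$catalog_url result=$result"
```

In this model the request is rejected until `start_client` runs again, which mirrors the observation that the conductor deployment is not restarted when the placement KeystoneEndpoint changes.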

while true ; do date ; oc get pod | grep nova ; sleep 5 ; done
...
Tue May 27 10:54:25 AM CEST 2025
nova-api-0                                   0/2     ContainerCreating   0          1s
nova-api-8967-account-create-nqf8c           0/1     Completed           0          34s
nova-api-db-create-cj4ls                     0/1     Completed           0          44s
nova-cell0-cell-mapping-r4mcr                1/1     Running             0          1s
nova-cell0-conductor-0                       1/1     Running             0          12s
nova-cell0-conductor-db-sync-htqxh           0/1     Completed           0          29s
nova-cell0-db-create-7pqqx                   0/1     Completed           0          44s
nova-cell0-e647-account-create-f8nz4         0/1     Completed           0          34s
nova-cell1-9d60-account-create-r47q8         0/1     Completed           0          34s
nova-cell1-conductor-db-sync-sr2qp           1/1     Running             0          1s
nova-cell1-db-create-7kl84                   0/1     Completed           0          44s
nova-cell1-novncproxy-0                      0/1     ContainerCreating   0          1s
nova-metadata-0                              0/2     ContainerCreating   0          1s
nova-scheduler-0                             0/1     ContainerCreating   0          1s

while true ; do date ; oc get KeystoneEndpoint/placement -o yaml | yq ".spec.endpoints" ; sleep 5 ; done
...
Tue May 27 10:54:21 AM CEST 2025
internal: http://placement-internal.openstack.svc:8778
public: http://placement-public.openstack.svc:8778
Tue May 27 10:54:26 AM CEST 2025
internal: http://placement-internal.openstack.svc:8778
public: http://placement-public.openstack.svc:8778
Tue May 27 10:54:31 AM CEST 2025
internal: http://placement-internal.openstack.svc:8778
public: http://placement-public.openstack.svc:8778
Tue May 27 10:54:36 AM CEST 2025
internal: http://placement-internal.openstack.svc:8778
public: http://placement-public.openstack.svc:8778
Tue May 27 10:54:42 AM CEST 2025
internal: http://placement-internal.openstack.svc:8778
public: http://placement-public.openstack.svc:8778
Tue May 27 10:54:47 AM CEST 2025
internal: https://placement-internal.openstack.svc:8778
public: https://placement-public-openstack.apps-crc.testing
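
The polling output above shows the endpoints flipping from http to https at 10:54:47, after nova-cell0-conductor-0 was already running. A small helper can make that check scriptable. This is a sketch only: `all_https` is a hypothetical function name, and it assumes the endpoint listing has the same `name: URL` shape as the output above.

```shell
# Hypothetical helper: reads an endpoint listing on stdin and succeeds
# only if it is non-empty and contains no plain-http URL.
all_https() {
    listing="$(cat)"
    # Treat empty input as "not ready" so a missing CR is not reported as fine.
    [ -n "$listing" ] || return 1
    # "https://" does not contain the substring "http://", so this grep
    # only matches plain-http URLs.
    ! printf '%s\n' "$listing" | grep -q 'http://'
}

# Possible use with the same commands as the polling loop above:
#   oc get KeystoneEndpoint/placement -o yaml | yq ".spec.endpoints" | all_https \
#       && echo "placement endpoints are https"
```

Such a check only reports the state of the catalog. It cannot tell whether an already running service still holds the older http URL.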

❯ openstack server migrate test_0 --wait

oc logs nova-cell0-conductor-0
...
2025-05-27 09:15:26.911 1 WARNING nova.scheduler.utils [None req-180d4caa-8cdb-4e64-99ce-b1fae03e7ccf 1225f304b197465c9b7861edb324e225 bb30f64d028c4b93816803c9f9476c36 - - default default] Failed to compute_task_migrate_server: Failed to retrieve allocations for consumer af38a891-59d9-4aa1-9568-0d57a31a8b37: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
Reason: You're speaking plain HTTP to an SSL-enabled server port.<br />
 Instead use the HTTPS scheme to access this URL, please.<br />
</p>
</body></html>
: nova.exception.ConsumerAllocationRetrievalFailed: Failed to retrieve allocations for consumer af38a891-59d9-4aa1-9568-0d57a31a8b37: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
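
The log excerpt above carries a distinctive message from the TLS-enabled server. A minimal sketch for spotting that symptom in a log stream follows. `has_plain_http_error` is a hypothetical helper name, and the match string is taken verbatim from the excerpt.

```shell
# Hypothetical helper: succeeds if the log text on stdin contains the
# "plain HTTP to an SSL-enabled port" rejection seen in the excerpt above.
has_plain_http_error() {
    grep -q 'speaking plain HTTP to an SSL-enabled server port'
}

# Possible use with the command from the reproduction:
#   oc logs nova-cell0-conductor-0 | has_plain_http_error \
#       && echo "conductor is still using an http placement URL"
```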

openshift-ci Bot requested review from olliewalsh and slagle May 27, 2025 09:28

gibizer commented May 27, 2025

Contributor Author

/hold not intended to merge this

openshift-ci Bot commented May 27, 2025

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gibizer
Once this PR has been reviewed and has the lgtm label, please assign frenzyfriday for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/2916fae854964768ad9e5ebb696c4e47

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 24m 59s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 10m 34s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 41m 29s
✔️ adoption-standalone-to-crc-ceph-provider SUCCESS in 3h 10m 25s
openstack-operator-tempest-multinode RETRY_LIMIT in 3m 05s

stuggi commented Jun 25, 2025

Contributor

closing this as the fixes merged

stuggi closed this Jun 25, 2025