Wipe agent token on inspection completion to fix cleaning failures #434

Open: elfosardo wants to merge 1 commit into openshift:release-4.15 from elfosardo:fix-token-wipe-4.15

@elfosardo

The backport of f734efb ("Set node alive when inspection finished") causes a regression in 4.15, where ironic-inspector is still used. The patch makes nodes appear fast-trackable after inspection even though the agent is dead and boot media is detached, so the AgentConnectionFailed handler parks nodes in clean wait indefinitely.

This happens because 4.15 uses ironic-inspector, which powers off the node after inspection. The f734efb fix was designed for 4.16 where built-in agent inspection keeps IPA running across the inspection-to-cleaning transition.

Revert 0feaa17 and instead fix the root cause: the agent_secret_token persists from inspection into cleaning. When IPA reboots for cleaning, it cannot get a new token because Ironic refuses to generate one when one already exists, causing repeated lookup loops without heartbeats until clean_callback_timeout fires.

The fix adds wipe_token_and_url() when inspection finishes successfully, both in the synchronous path (inspect_hardware) and the ironic-inspector callback path (continue_inspection). This follows the same pattern already used on inspection start and abort.
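
To illustrate the failure mode described above, here is a minimal toy model in Python. It is not the Ironic code: `Node`, `lookup`, `heartbeat`, `finish_inspection` and `simulate` are hypothetical stand-ins, and the body of `wipe_token_and_url` is an assumption about what the real helper clears. It only shows why a token left over from inspection blocks the cleaning ramdisk, and why wiping it when inspection finishes lets the next lookup succeed.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Hypothetical stand-in for a node's driver_internal_info.
    driver_internal_info: dict = field(default_factory=dict)


def lookup(node: Node) -> Optional[str]:
    """Hand out a token only if none is recorded yet (assumed behaviour)."""
    if node.driver_internal_info.get("agent_secret_token"):
        return None  # a token already exists, so no new one is generated
    token = secrets.token_urlsafe(16)
    node.driver_internal_info["agent_secret_token"] = token
    return token


def heartbeat(node: Node, token: Optional[str]) -> bool:
    """Accept a heartbeat only when the agent presents the stored token."""
    stored = node.driver_internal_info.get("agent_secret_token")
    return token is not None and token == stored


def wipe_token_and_url(node: Node) -> None:
    """Toy version: forget the agent token and URL."""
    node.driver_internal_info.pop("agent_secret_token", None)
    node.driver_internal_info.pop("agent_url", None)


def finish_inspection(node: Node, wipe: bool) -> None:
    """End of inspection; the fix corresponds to wipe=True."""
    if wipe:
        wipe_token_and_url(node)


def simulate(wipe: bool) -> bool:
    """Return True if the cleaning ramdisk manages to heartbeat."""
    node = Node()
    inspection_token = lookup(node)    # inspection ramdisk gets a token
    assert inspection_token is not None
    finish_inspection(node, wipe)      # node is powered off afterwards
    cleaning_token = lookup(node)      # freshly booted ramdisk for cleaning
    return heartbeat(node, cleaning_token)


if __name__ == "__main__":
    print("without wipe:", simulate(wipe=False))
    print("with wipe:   ", simulate(wipe=True))
```

In this sketch the run without the wipe never produces a valid heartbeat, which corresponds to the lookup loop that ends only when clean_callback_timeout fires.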

@elfosardo (Author)

/hold
testing first

openshift-ci Bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jun 25, 2026
openshift-ci Bot requested review from derekhiggins and zaneb on June 25, 2026 09:42
@openshift-ci

openshift-ci Bot commented Jun 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elfosardo


openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 25, 2026
elfosardo force-pushed the fix-token-wipe-4.15 branch from 829b319 to 0bcb667 on June 25, 2026 13:32
@openshift-ci

openshift-ci Bot commented Jun 25, 2026

@elfosardo: all tests passed!