fix(linstor): pre-flight check destination is a LINSTOR satellite before live migration#13077
fix(linstor): pre-flight check destination is a LINSTOR satellite before live migration#13077jmsperu wants to merge 1 commit intoapache:mainfrom
Conversation
…ore live migration
LinstorDataMotionStrategy.copyAsync would call createResource on the
destination pool's controller without first verifying that the
destination KVM host is registered as a LINSTOR satellite there. Two
failure modes:
1. The resource group's auto-placement filter happens to match a
different node (a registered satellite that is NOT the migration
destination), and the resource is silently created on the wrong
node. The subsequent migrate then fails because the destination
KVM host has no DRBD device for the resource.
2. The auto-placement filter has no candidates and the LINSTOR API
returns an opaque error. The operator has to correlate the
migration failure with an unrelated controller log entry to
understand what happened.
This change adds verifyDestinationIsLinstorSatellite() called at the
top of copyAsync. For each LINSTOR-typed destination pool it:
- fetches the controller's node list via LinstorUtil.getLinstorNodeNames
- throws CloudRuntimeException with a clear actionable message
(lists known satellites) if destHost.getName() is missing from
that list
- silently skips on transient controller errors so a network blip
against the controller doesn't block an otherwise valid migration
Non-LINSTOR destination pools in the volumeDataStoreMap are skipped
(mixed-storage migrations are unaffected).
There was a problem hiding this comment.
Pull request overview
Adds a pre-flight validation to LINSTOR live migration to ensure the destination KVM host is registered as a LINSTOR satellite on the relevant controller(s), so migrations fail fast with an actionable error instead of failing later via LINSTOR auto-placement.
Changes:
- Introduces
verifyDestinationIsLinstorSatellite(...)to query LINSTOR node lists per destination pool/controller. - Calls the pre-flight check at the start of
copyAsync(...)to abort early with a clearCloudRuntimeExceptionwhen the destination host is not a satellite.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } catch (ApiException apiEx) { | ||
| // Don't block migration on a transient controller hiccup — log and let the | ||
| // downstream resource creation handle the real failure. | ||
| logger.warn("LINSTOR pre-flight check could not contact controller {}: {}; " + | ||
| "letting downstream resource creation proceed", | ||
| destStoragePool.getHostAddress(), apiEx.getBestMessage()); | ||
| return; | ||
| } |
There was a problem hiding this comment.
The catch (ApiException) path currently logs and then returns, which stops checking any additional destination pools/controllers in volumeDataStoreMap. To match the method Javadoc/PR description (“best-effort” per pool), this should usually continue so that other pools can still be validated and can fail fast if the destination host is not a satellite there.
| // Pre-flight: verify the destination KVM host is registered as a satellite on the | ||
| // LINSTOR controller backing each destination pool. Without this check, resource | ||
| // creation falls through to the resource-group's auto-placement filters and may | ||
| // either silently place the resource on the wrong node or fail with an opaque | ||
| // auto-place error from the LINSTOR API. Failing fast here gives operators a clear | ||
| // actionable message instead of having to correlate the live-migration failure with | ||
| // an unrelated LINSTOR controller log entry. | ||
| verifyDestinationIsLinstorSatellite(volumeDataStoreMap, destHost); | ||
|
|
There was a problem hiding this comment.
verifyDestinationIsLinstorSatellite(...) is invoked before the method’s try/catch/finally. If it throws CloudRuntimeException (the intended failure-fast path), copyAsync will exit without reaching the finally block and callback.complete(...) will never be called, potentially leaving the migration task hanging. Move this pre-flight call inside the existing try so failures are handled consistently and the callback is always completed.
| if (nodes == null) { | ||
| logger.warn("LINSTOR controller {} returned null node list; skipping pre-flight", | ||
| destStoragePool.getHostAddress()); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Inside the loop, the method returns when nodes == null. That exits the entire pre-flight check and skips verifying any remaining destination pools/volumes. If the intent is best-effort per pool, this should continue to the next entry (or better: de-duplicate pools/controllers and keep checking the rest).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #13077 +/- ##
============================================
- Coverage 18.02% 18.02% -0.01%
+ Complexity 16621 16620 -1
============================================
Files 6029 6029
Lines 542184 542210 +26
Branches 66451 66455 +4
============================================
- Hits 97740 97737 -3
- Misses 433428 433457 +29
Partials 11016 11016
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
LinstorDataMotionStrategy.copyAsync(added in #12830 / Jan 29 2026) callsLinstorUtil.createResourceon the destination pool's controller without first verifying the destination KVM host is registered as a LINSTOR satellite there. Two failure modes follow:We've hit (1) on host-5 → host-6 migrations where host-6 was not registered as a LINSTOR satellite. The migration logs were unhelpful.
Change
Adds
verifyDestinationIsLinstorSatellite()called at the top ofcopyAsync. For each LINSTOR-typed pool involumeDataStoreMap:LinstorUtil.getLinstorNodeNamesdestHost.getName()is missing from the list, throwsCloudRuntimeExceptionwith a clear actionable message that lists the known satellites and tells the operator to eitherlinstor node createor pick a different destinationNon-LINSTOR destination pools are skipped (mixed-storage migrations are unaffected).
Test plan