Skip to content

server: fix volume detach operation when no vm host#7526

Merged
DaanHoogland merged 3 commits intoapache:4.18from
shapeblue:fix-detvol-vmnohost
May 25, 2023
Merged

server: fix volume detach operation when no vm host#7526
DaanHoogland merged 3 commits intoapache:4.18from
shapeblue:fix-detvol-vmnohost

Conversation

@shwstppr
Copy link
Copy Markdown
Contributor

Description

Fixes #7525

CloudStack may not keep the VM associated to any host when there is offline migration across cluster or pod. In such cases, CloudStack should be able to send the AttachCommand or DetachCommand to any available host in the current VM cluster.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

CloudStack may not keep the VM associated to any host when there is offline migration across cluster or pod.
In such cases, CloudStack should be able to send the AttachCommand or DetachCommand to any available host in the current VM cluster.

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
@codecov
Copy link
Copy Markdown

codecov bot commented May 12, 2023

Codecov Report

Merging #7526 (d4f7f2a) into main (a0eb0aa) will increase coverage by 0.02%.
The diff coverage is 50.00%.

@@             Coverage Diff              @@
##               main    #7526      +/-   ##
============================================
+ Coverage     12.93%   12.96%   +0.02%     
- Complexity     8941     8992      +51     
============================================
  Files          2716     2728      +12     
  Lines        256108   256640     +532     
  Branches      39939    40021      +82     
============================================
+ Hits          33127    33266     +139     
- Misses       218822   219193     +371     
- Partials       4159     4181      +22     
Impacted Files Coverage Δ
...n/java/com/cloud/storage/VolumeApiServiceImpl.java 15.03% <50.00%> (+0.34%) ⬆️

... and 40 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
@shwstppr shwstppr marked this pull request as ready for review May 12, 2023 11:44
@shwstppr
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@shwstppr a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6070

@yadvr yadvr changed the base branch from main to 4.18 May 12, 2023 15:19
@yadvr yadvr added this to the 4.18.1.0 milestone May 12, 2023
@yadvr
Copy link
Copy Markdown
Member

yadvr commented May 12, 2023

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@rohityadavcloud a [LL] Jenkins job has been kicked to build packages. It will be bundled with

SystemVM template(s). I'll keep you posted as I make progress.

@yadvr
Copy link
Copy Markdown
Member

yadvr commented May 12, 2023

cc @andrijapanicsb @rajujith @alexandremattioli @NuxRo have we enabled London lab?

@blueorangutan
Copy link
Copy Markdown

Packaging result [LL]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6032

@shwstppr
Copy link
Copy Markdown
Contributor Author

@blueorangutan test matrix

@blueorangutan
Copy link
Copy Markdown

@shwstppr a [LL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

Trillian Build Failed (tid-6513)

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-6515)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42360 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7526-t6515-kvm-centos7.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_redundant_vpc_site2site_vpn Failure 516.56 test_vpc_vpn.py
test_01_vpc_site2site_vpn_multiple_options Error 1155.31 test_vpc_vpn.py
test_01_vpc_site2site_vpn Error 454.33 test_vpc_vpn.py

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-6514)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 112540 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7526-t6514-vmware-67u3.zip
Smoke tests completed. 104 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_deploy_vm_on_specific_host Error 18.52 test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster Error 3602.33 test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod Error 2.21 test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster Error 1.21 test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod Error 4.22 test_vm_deployment_planner.py
test_09_expunge_vm Failure 424.51 test_vm_life_cycle.py
test_01_redundant_vpc_site2site_vpn Failure 3609.12 test_vpc_vpn.py
test_01_redundant_vpc_site2site_vpn Error 3609.20 test_vpc_vpn.py
test_01_vpc_site2site_vpn_multiple_options Failure 3609.74 test_vpc_vpn.py
test_01_vpc_site2site_vpn_multiple_options Error 3609.84 test_vpc_vpn.py
test_01_vpc_remote_access_vpn Failure 3608.51 test_vpc_vpn.py
test_01_vpc_site2site_vpn Failure 1807.13 test_vpc_vpn.py
test_01_cancel_host_maintenace_with_no_migration_jobs Error 1806.97 test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs Error 1973.97 test_host_maintenance.py
test_03_cancel_host_maintenace_with_migration_jobs_failure Error 1912.55 test_host_maintenance.py

@weizhouapache
Copy link
Copy Markdown
Member

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@weizhouapache a [LL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [LL]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6043

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6097

@shwstppr
Copy link
Copy Markdown
Contributor Author

@blueorangutan test matrix

@blueorangutan
Copy link
Copy Markdown

@shwstppr a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-6546)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 39457 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7526-t6546-xenserver-71.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-6548)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44820 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7526-t6548-kvm-centos7.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_vpc_site2site_vpn_multiple_options Failure 408.14 test_vpc_vpn.py

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-6547)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 45632 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7526-t6547-vmware-67u3.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@DaanHoogland
Copy link
Copy Markdown
Contributor

tested according to #7525 (comment) and verified no extra disks attached to the VM.

Copy link
Copy Markdown
Member

@kiranchavala kiranchavala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwstppr @DaanHoogland

The following scenario is failing

  1. Deploy an environment with 2 clusters (Cluster A , Cluster B) in a single pod

  2. Deploy a vm with a data disk in cluster A

  3. Stop the vm from CS, the host id will be set to null

  4. Migrate the vm to cluster B via Vcenter

Screenshot 2023-05-25 at 7 55 48 PM
  1. Try to detach the data volume to Vm >> failure

It works fine if the migration of the vm is done by CS

  1. Deploy a environment with 2 clusters (Cluster A , Cluster B) in a single pod

  2. Deploy a vm with a data disk in cluster A

  3. Stop the vm in CS

  4. Migrate the vm to cluster B via CS

  5. Try to detach data volume to Vm >> Success

@DaanHoogland
Copy link
Copy Markdown
Contributor

@shwstppr is this something we can/must support?

1. Deploy an environment with 2 clusters (Cluster A , Cluster B) in a single pod

2. Deploy a vm with a data disk in cluster A

3. Stop the vm from CS, the host id will be set to null

4. Migrate the vm to cluster B via Vcenter

5. Try to detach the data volume to Vm >> failure

@shwstppr
Copy link
Copy Markdown
Contributor Author

@DaanHoogland to support some extensive changes would be required in the VMware plugin. And even then I don't think we can cover all cases of such out-of-band operations. So, in my opinion, this can be an improvement request but I don't think there is a major use-case for this.

@DaanHoogland
Copy link
Copy Markdown
Contributor

@kiranchavala , if you feel this scenario should be supported as well, please open a new issue. I think it is out of scope for this fix and a stretch to ask in general; If people do things to cloud resources out of band we cannot guarantee recovery in all circumstances.

@kiranchavala
Copy link
Copy Markdown
Member

Thanks @DaanHoogland I agree with @shwstppr view that we cannot cover all cases of out-of-band operations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Offline VM migration in VMware with disk detach causes VM snapshots issues

6 participants