Skip to content

snapshot: don't schedule next snapshot job for a removed volume#8735

Merged
JoaoJandre merged 2 commits into
apache:4.18from
shapeblue:snapshot-http-503
Mar 19, 2024
Merged

snapshot: don't schedule next snapshot job for a removed volume#8735
JoaoJandre merged 2 commits into
apache:4.18from
shapeblue:snapshot-http-503

Conversation

@yadvr

@yadvr yadvr commented Mar 1, 2024

Copy link
Copy Markdown
Member

When management server starts, it starts the snapshot scheduler. In case there is a volume snapshot policy which exists for a volume which does not exist, it can cause SQL constraint issue and cause the management server to break from starting its various components and cause HTTP 503 error.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

When management server starts, it starts the snapshot scheduler. In case
there is a volume snapshot policy which exists for a volume which does
not exist, it can cause SQL constraint issue and cause the management
server to break from starting its various components and cause HTTP 503
error.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@yadvr

yadvr commented Mar 1, 2024

Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@DaanHoogland DaanHoogland left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@weizhouapache

Copy link
Copy Markdown
Member

what if the volume is removed manually after snapshot policy is created ?

it may be better to add constraint to the table snapshot_policy, to prevent the volume to be removed manually, if there are snapshot policies linked to it.

@codecov

codecov Bot commented Mar 1, 2024

Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 13.16%. Comparing base (9bd359a) to head (5ceddcd).
Report is 6 commits behind head on 4.18.

Files Patch % Lines
.../cloud/storage/snapshot/SnapshotSchedulerImpl.java 0.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.18    #8735      +/-   ##
============================================
- Coverage     13.16%   13.16%   -0.01%     
- Complexity     9203     9205       +2     
============================================
  Files          2724     2724              
  Lines        258137   258153      +16     
  Branches      40235    40236       +1     
============================================
- Hits          33989    33987       -2     
- Misses       219841   219860      +19     
+ Partials       4307     4306       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@blueorangutan

Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8831

@yadvr yadvr marked this pull request as draft March 1, 2024 17:10
@DaanHoogland

Copy link
Copy Markdown
Contributor

what if the volume is removed manually after snapshot policy is created ?

it may be better to add constraint to the table snapshot_policy, to prevent the volume to be removed manually, if there are snapshot policies linked to it.

I agree, but this code is still good and anybody could use it in older versions (backport it) I don't think it hurts in any way even with the constraint.

If that works a cascading delete would be usefull in this case ;)

@weizhouapache

Copy link
Copy Markdown
Member

what if the volume is removed manually after snapshot policy is created ?
it may be better to add constraint to the table snapshot_policy, to prevent the volume to be removed manually, if there are snapshot policies linked to it.

I agree, but this code is still good and anybody could use it in older versions (backport it) I don't think it hurts in any way even with the constraint.

If that works a cascading delete would be usefull in this case ;)

@DaanHoogland
yes, with PR, the mgmt server will be started without any issue , even if volume is removed manually from database.

@rohityadavcloud
code lgtm

@weizhouapache weizhouapache left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@weizhouapache

Copy link
Copy Markdown
Member

@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@rajujith

rajujith commented Mar 4, 2024

Copy link
Copy Markdown
Contributor

what if the volume is removed manually after snapshot policy is created ?

it may be better to add constraint to the table snapshot_policy, to prevent the volume to be removed manually, if there are snapshot policies linked to it.

Makes sense to add it.

@blueorangutan

Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8841

@yadvr yadvr marked this pull request as ready for review March 13, 2024 17:18
@yadvr yadvr requested a review from JoaoJandre March 13, 2024 17:19
@yadvr

yadvr commented Mar 13, 2024

Copy link
Copy Markdown
Member Author

@weizhouapache I like your idea, could you propose that as a separate PR.

For now I've added changes to remove the schedule when volume is missing. Requesting re-review cc @weizhouapache @DaanHoogland @sureshanaparti @JoaoJandre

@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@weizhouapache weizhouapache left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@blueorangutan

Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8928

@weizhouapache

Copy link
Copy Markdown
Member

@blueorangutan test

@blueorangutan

Copy link
Copy Markdown

@weizhouapache a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Copy link
Copy Markdown

[SF] Trillian test result (tid-9467)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 39891 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8735-t9467-kvm-centos7.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_08_arping_in_ssvm Failure 5.18 test_diagnostics.py

@JoaoJandre JoaoJandre left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLGTM

@DaanHoogland DaanHoogland left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@yadvr

yadvr commented Mar 18, 2024

Copy link
Copy Markdown
Member Author

@JoaoJandre since you're the RM who has called for a freeze, could you merge this?

@JoaoJandre

Copy link
Copy Markdown
Contributor

I tried to test this manually but was not able to reproduce the issue, @rohityadavcloud could you share the steps to reproduce this?

@yadvr

yadvr commented Mar 19, 2024

Copy link
Copy Markdown
Member Author

This is an edge case not easy to reproduce but found in a prod customer env @JoaoJandre. To reproduce you can delete a volume from the db manually which has a snapshot policy.

@JoaoJandre

Copy link
Copy Markdown
Contributor

This is an edge case not easy to reproduce but found in a prod customer env @JoaoJandre. To reproduce you can delete a volume from the db manually which has a snapshot policy.

@rohityadavcloud I was able to test it by deleting the volume on the DB while the MGMT was down. This really is an edge case, but the PR does not introduce any regressions and is adding a layer of safety to the snapshot policy process, so I don't see any issues in merging.

@JoaoJandre JoaoJandre merged commit 720407b into apache:4.18 Mar 19, 2024
@yadvr yadvr deleted the snapshot-http-503 branch March 20, 2024 10:18
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Mar 21, 2024
…he#8735)

* snapshot: don't schedule next snapshot job for a removed volume

When management server starts, it starts the snapshot scheduler. In case
there is a volume snapshot policy which exists for a volume which does
not exist, it can cause SQL constraint issue and cause the management
server to break from starting its various components and cause HTTP 503
error.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

* remove schedule on missing volume

---------

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants