Handle locks better in the pantry.#1949
Closed
leftwo wants to merge 2 commits into
Closed
Conversation
Improve the lock handling around pantry actions that require outside activity to make progress. We were holding a lock and then activating or deactivating a volume, and this can take unknown amount of time. Now we take a lock and leave a state to indicate someone is working on the activation but give back the lock so other actions can continue. A new PantryEntryState enum tracks whether a volume is Attaching (construct/activate in progress) or Attached (ready for use). This prevents duplicate concurrent attach attempts while allowing other volumes or other actions to proceed independently. entry_get() now returns 503 for Attaching volumes so callers retry. This might require some updates on the Nexus/Omicron side. detach() rejects Attaching volumes until attach completes, but this was the same behavior as before, but just because we held the lock for the entire attach. More unit tests added, as well as a specific pantry test, but I'm not sure if this tools/test_pantry_hang.sh is worth keeping or not.
Contributor
Author
|
Fix for #1945 |
Contributor
|
I disagree with having GET for entry return a 503 during activation, it seems wrong to me. Previously the endpoint was returning a 200 and the caller would look at the return struct to see what the activation status is - a 503 could mean anything. I agree with the shape of the changes here, in that we hold the entries lock for as short a time as we can, and would like to chat next week: I think we can get to the same spot without the new enum. |
Contributor
Author
|
This PR obsoleted by #1952 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Improve the lock handling around pantry actions that require outside activity to make progress. We were holding a lock and then activating or deactivating a volume, and this can take unknown amount of time.
Now we take a lock and leave a state to indicate someone is working on the activation but give back the lock so other actions can continue. A new PantryEntryState enum tracks whether a volume is Attaching (construct/activate in progress) or Attached (ready for use). This prevents duplicate concurrent attach attempts while allowing other volumes or other actions to proceed independently.
entry_get() now returns 503 for Attaching volumes so callers retry.
This might require some updates on the Nexus/Omicron side.
detach() rejects Attaching volumes until attach completes, but this was the same behavior as before, but just because we held the lock for the entire attach.
More unit tests added, as well as a specific pantry test, but I'm not sure if this tools/test_pantry_hang.sh is worth keeping or not. I think growing the pantest into a more functional/stress test of the pantry might be a better way to go. I also might consider putting something into the CI tests for the pantry, maybe this crude script is good enough?