Fix cancellation safety of LruCachingStore::write_batch (using cache invalidation)#6051
Closed
ma2bd wants to merge 4 commits into
Closed
Fix cancellation safety of LruCachingStore::write_batch (using cache invalidation)#6051ma2bd wants to merge 4 commits into
ma2bd wants to merge 4 commits into
Conversation
The previous implementation wrote to the DB first, then updated the LRU cache. If the future was cancelled between the two (e.g. by the RollbackGuard introduced in linera-io#5790), the DB would have the new data but the cache would retain stale entries. Subsequent reads would hit the stale cache, and the next save would overwrite the DB with old data, causing silent data loss. Fix: invalidate cache entries BEFORE writing to the DB, then repopulate after success. If cancelled at any point after invalidation, subsequent reads go directly to the DB and see the correct state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
afck
approved these changes
Apr 18, 2026
Contributor
afck
left a comment
There was a problem hiding this comment.
Good catch!
(I approve, but would be good if @MathieuDutSik could also take a look.)
deuszx
reviewed
Apr 18, 2026
Contributor
Author
This seems nontrivial. We could have unit tests for the new methods though. |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Author
|
After fixing some issues, the PR is way more complicated, and perhaps not as interesting any more compared to the obvious quick fix. |
Contributor
Author
|
Closing in favor of #6056 |
ma2bd
added a commit
that referenced
this pull request
Apr 19, 2026
…ing task::spawn) (#6056) ## Motivation The `LruCachingStore::write_batch` method was not cancellation-safe. It wrote to the DB first, then updated the LRU cache. If the caller's future was cancelled between the two steps (e.g. by a gRPC timeout or runtime shutdown), the DB would have the new data but the cache would retain stale entries. The subsequent `RollbackGuard` would then reset the in-memory view state, and the next `save()` would overwrite the DB with old data -- causing silent data loss. This was identified as the likely root cause of the missing outbox bucket data on validator 4 (`The front bucket is always loaded` panic). ## Proposal Wrap the DB write and cache update in a `tokio::task::spawn` so they run to completion even if the caller's future is cancelled. On web targets, the task runs inline since cancellation safety is not a concern there (no `RollbackGuard` / concurrent chain workers). Adds `Clone + Send + 'static` bounds on the inner store type parameter, which are already satisfied by all stores in the chain (ScyllaDB, RocksDB, journaling, value-splitting). ## Test Plan CI ## Links - Alternative to #6051 (invalidate-first approach) - Related: #5790 (Remove chain actors -- introduced `RollbackGuard`) - Related: #6015 (Fix `BucketQueueView::delete_front` load failure handling) - Related: #6046 (Fix storage cache reads dropping Arc before use) - [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ma2bd
added a commit
that referenced
this pull request
Apr 19, 2026
## Motivation Fixes #6052. Previously only journal resolution failures poisoned the chain worker. Other write errors (e.g. possible ScyllaDB timeouts) left the worker loaded with potentially stale state. Stacked on #6051. ## Proposal - **Poison on any save failure**: `ChainWorkerState::save()` now always sets `poisoned = true` and returns `WorkerError::PoisonedWorker` on any error, triggering immediate eviction and a fresh reload on the next request. - **Keep `must_reload_view`** for metrics and as way to document that journaling really requires a notion of view poisoning. ## Test Plan CI ## Links - Fixes #6052 - Stacked on #6051 - [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ma2bd
added a commit
that referenced
this pull request
Apr 19, 2026
## Motivation Fixes #6052. Previously only journal resolution failures poisoned the chain worker. Other write errors (e.g. possible ScyllaDB timeouts) left the worker loaded with potentially stale state. Stacked on #6051. ## Proposal - **Poison on any save failure**: `ChainWorkerState::save()` now always sets `poisoned = true` and returns `WorkerError::PoisonedWorker` on any error, triggering immediate eviction and a fresh reload on the next request. - **Keep `must_reload_view`** for metrics and as way to document that journaling really requires a notion of view poisoning. ## Test Plan CI ## Links - Fixes #6052 - Stacked on #6051 - [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ma2bd
added a commit
that referenced
this pull request
Apr 19, 2026
Fixes #6052. Previously only journal resolution failures poisoned the chain worker. Other write errors (e.g. possible ScyllaDB timeouts) left the worker loaded with potentially stale state. Stacked on #6051. - **Poison on any save failure**: `ChainWorkerState::save()` now always sets `poisoned = true` and returns `WorkerError::PoisonedWorker` on any error, triggering immediate eviction and a fresh reload on the next request. - **Keep `must_reload_view`** for metrics and as way to document that journaling really requires a notion of view poisoning. CI - Fixes #6052 - Stacked on #6051 - [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
afck
pushed a commit
to afck/linera-protocol
that referenced
this pull request
Apr 20, 2026
…ing task::spawn) (linera-io#6056) ## Motivation The `LruCachingStore::write_batch` method was not cancellation-safe. It wrote to the DB first, then updated the LRU cache. If the caller's future was cancelled between the two steps (e.g. by a gRPC timeout or runtime shutdown), the DB would have the new data but the cache would retain stale entries. The subsequent `RollbackGuard` would then reset the in-memory view state, and the next `save()` would overwrite the DB with old data -- causing silent data loss. This was identified as the likely root cause of the missing outbox bucket data on validator 4 (`The front bucket is always loaded` panic). ## Proposal Wrap the DB write and cache update in a `tokio::task::spawn` so they run to completion even if the caller's future is cancelled. On web targets, the task runs inline since cancellation safety is not a concern there (no `RollbackGuard` / concurrent chain workers). Adds `Clone + Send + 'static` bounds on the inner store type parameter, which are already satisfied by all stores in the chain (ScyllaDB, RocksDB, journaling, value-splitting). ## Test Plan CI ## Links - Alternative to linera-io#6051 (invalidate-first approach) - Related: linera-io#5790 (Remove chain actors -- introduced `RollbackGuard`) - Related: linera-io#6015 (Fix `BucketQueueView::delete_front` load failure handling) - Related: linera-io#6046 (Fix storage cache reads dropping Arc before use) - [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
afck
pushed a commit
to afck/linera-protocol
that referenced
this pull request
Apr 20, 2026
## Motivation Fixes linera-io#6052. Previously only journal resolution failures poisoned the chain worker. Other write errors (e.g. possible ScyllaDB timeouts) left the worker loaded with potentially stale state. Stacked on linera-io#6051. ## Proposal - **Poison on any save failure**: `ChainWorkerState::save()` now always sets `poisoned = true` and returns `WorkerError::PoisonedWorker` on any error, triggering immediate eviction and a fresh reload on the next request. - **Keep `must_reload_view`** for metrics and as way to document that journaling really requires a notion of view poisoning. ## Test Plan CI ## Links - Fixes linera-io#6052 - Stacked on linera-io#6051 - [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The
LruCachingStore::write_batchmethod was not cancellation-safe. It wrote to the DB first, then updated the LRU cache. If the future was cancelled between the two steps (e.g. by theRollbackGuardintroduced in #5790), the DB would have the new data but the cache would retain stale entries. Subsequent reads would hit the stale cache, and the nextsave()would overwrite the DB with old data -- causing silent data loss.This was identified as the likely root cause of the missing outbox bucket data on validator 4 (
The front bucket is always loadedpanic).Proposal
Invalidate cache entries BEFORE writing to the DB, then repopulate the cache after success. If cancelled at any point after invalidation, subsequent reads bypass the cache and go directly to the DB, seeing the correct state.
New
LruPrefixCachemethods:invalidate_key/invalidate_prefix-- remove entries without recording negative results (DoesNotExist). Used for pre-write invalidation.record_key_deletion_unchecked/record_prefix_deletion_unchecked-- record deletion outcomes. Precondition: the correspondinginvalidate_*was already called.delete_key/delete_prefix-- convenience methods that call both.Note:
write_batch. #4954 to fix LRU cache discrepancies in case of errors, but we didn't think much about this corner case regarding cancelation safety.poisonthe chain state on everywrite_batcherror (Review impacts of potential error-throwing successful DB writes #6052).Test Plan
Links
RollbackGuardand cancellation)BucketQueueView::delete_frontload failure handling #6015 (FixBucketQueueView::delete_frontload failure handling)