Skip to content

Fix cancellation safety of LruCachingStore::write_batch (using cache invalidation)#6051

Closed
ma2bd wants to merge 4 commits into
linera-io:testnet_conwayfrom
ma2bd:fix_lru_caching
Closed

Fix cancellation safety of LruCachingStore::write_batch (using cache invalidation)#6051
ma2bd wants to merge 4 commits into
linera-io:testnet_conwayfrom
ma2bd:fix_lru_caching

Conversation

@ma2bd
Copy link
Copy Markdown
Contributor

@ma2bd ma2bd commented Apr 17, 2026

Motivation

The LruCachingStore::write_batch method was not cancellation-safe. It wrote to the DB first, then updated the LRU cache. If the future was cancelled between the two steps (e.g. by the RollbackGuard introduced in #5790), the DB would have the new data but the cache would retain stale entries. Subsequent reads would hit the stale cache, and the next save() would overwrite the DB with old data -- causing silent data loss.

This was identified as the likely root cause of the missing outbox bucket data on validator 4 (The front bucket is always loaded panic).

Proposal

Invalidate cache entries BEFORE writing to the DB, then repopulate the cache after success. If cancelled at any point after invalidation, subsequent reads bypass the cache and go directly to the DB, seeing the correct state.

New LruPrefixCache methods:

  • invalidate_key / invalidate_prefix -- remove entries without recording negative results (DoesNotExist). Used for pre-write invalidation.
  • record_key_deletion_unchecked / record_prefix_deletion_unchecked -- record deletion outcomes. Precondition: the corresponding invalidate_* was already called.
  • delete_key / delete_prefix -- convenience methods that call both.

Note:

Test Plan

  • CI

Links

ma2bd and others added 2 commits April 17, 2026 18:28
The previous implementation wrote to the DB first, then updated the LRU
cache. If the future was cancelled between the two (e.g. by the
RollbackGuard introduced in linera-io#5790), the DB would have the new data but
the cache would retain stale entries. Subsequent reads would hit the
stale cache, and the next save would overwrite the DB with old data,
causing silent data loss.

Fix: invalidate cache entries BEFORE writing to the DB, then repopulate
after success. If cancelled at any point after invalidation, subsequent
reads go directly to the DB and see the correct state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ma2bd ma2bd marked this pull request as draft April 18, 2026 00:20
@ma2bd ma2bd marked this pull request as ready for review April 18, 2026 00:22
Copy link
Copy Markdown
Contributor

@afck afck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

(I approve, but would be good if @MathieuDutSik could also take a look.)

Copy link
Copy Markdown
Contributor

@deuszx deuszx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for this?

  1. Populate cache and db
  2. try inserting but cancel before completing
  3. confirm that cache is empty and db has the new value

@ma2bd
Copy link
Copy Markdown
Contributor Author

ma2bd commented Apr 18, 2026

Can we add a test for this?

  1. Populate cache and db
  2. try inserting but cancel before completing
  3. confirm that cache is empty and db has the new value

This seems nontrivial. We could have unit tests for the new methods though.

@ma2bd ma2bd force-pushed the fix_lru_caching branch from c43bde5 to 7535e39 Compare April 18, 2026 22:53
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ma2bd ma2bd force-pushed the fix_lru_caching branch from 7535e39 to 685f1a4 Compare April 18, 2026 22:58
@ma2bd ma2bd marked this pull request as draft April 18, 2026 22:59
@ma2bd
Copy link
Copy Markdown
Contributor Author

ma2bd commented Apr 18, 2026

After fixing some issues, the PR is way more complicated, and perhaps not as interesting any more compared to the obvious quick fix.

@ma2bd ma2bd changed the title Fix cancellation safety of LruCachingStore::write_batch Fix cancellation safety of LruCachingStore::write_batch (using cache invalidation) Apr 18, 2026
@ma2bd
Copy link
Copy Markdown
Contributor Author

ma2bd commented Apr 18, 2026

Closing in favor of #6056

@ma2bd ma2bd closed this Apr 18, 2026
ma2bd added a commit that referenced this pull request Apr 19, 2026
…ing task::spawn) (#6056)

## Motivation

The `LruCachingStore::write_batch` method was not cancellation-safe. It
wrote to the DB first, then updated the LRU cache. If the caller's
future was cancelled between the two steps (e.g. by a gRPC timeout or
runtime shutdown), the DB would have the new data but the cache would
retain stale entries. The subsequent `RollbackGuard` would then reset
the in-memory view state, and the next `save()` would overwrite the DB
with old data -- causing silent data loss.

This was identified as the likely root cause of the missing outbox
bucket data on validator 4 (`The front bucket is always loaded` panic).

## Proposal

Wrap the DB write and cache update in a `tokio::task::spawn` so they run
to completion even if the caller's future is cancelled. On web targets,
the task runs inline since cancellation safety is not a concern there
(no `RollbackGuard` / concurrent chain workers).

Adds `Clone + Send + 'static` bounds on the inner store type parameter,
which are already satisfied by all stores in the chain (ScyllaDB,
RocksDB, journaling, value-splitting).

## Test Plan

CI

## Links

- Alternative to #6051 (invalidate-first approach)
- Related: #5790 (Remove chain actors -- introduced `RollbackGuard`)
- Related: #6015 (Fix `BucketQueueView::delete_front` load failure
handling)
- Related: #6046 (Fix storage cache reads dropping Arc before use)
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ma2bd added a commit that referenced this pull request Apr 19, 2026
## Motivation

Fixes #6052.

Previously only journal resolution failures poisoned the chain worker.
Other write errors (e.g. possible ScyllaDB timeouts) left the worker
loaded with potentially stale state.

Stacked on #6051.

## Proposal

- **Poison on any save failure**: `ChainWorkerState::save()` now always
sets `poisoned = true` and returns `WorkerError::PoisonedWorker` on any
error, triggering immediate eviction and a fresh reload on the next
request.

- **Keep `must_reload_view`** for metrics and as way to document that
journaling really requires a notion of view poisoning.

## Test Plan

CI

## Links

- Fixes #6052
- Stacked on #6051
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ma2bd added a commit that referenced this pull request Apr 19, 2026
## Motivation

Fixes #6052.

Previously only journal resolution failures poisoned the chain worker.
Other write errors (e.g. possible ScyllaDB timeouts) left the worker
loaded with potentially stale state.

Stacked on #6051.

## Proposal

- **Poison on any save failure**: `ChainWorkerState::save()` now always
sets `poisoned = true` and returns `WorkerError::PoisonedWorker` on any
error, triggering immediate eviction and a fresh reload on the next
request.

- **Keep `must_reload_view`** for metrics and as way to document that
journaling really requires a notion of view poisoning.

## Test Plan

CI

## Links

- Fixes #6052
- Stacked on #6051
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ma2bd added a commit that referenced this pull request Apr 19, 2026
Fixes #6052.

Previously only journal resolution failures poisoned the chain worker.
Other write errors (e.g. possible ScyllaDB timeouts) left the worker
loaded with potentially stale state.

Stacked on #6051.

- **Poison on any save failure**: `ChainWorkerState::save()` now always
sets `poisoned = true` and returns `WorkerError::PoisonedWorker` on any
error, triggering immediate eviction and a fresh reload on the next
request.

- **Keep `must_reload_view`** for metrics and as way to document that
journaling really requires a notion of view poisoning.

CI

- Fixes #6052
- Stacked on #6051
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
afck pushed a commit to afck/linera-protocol that referenced this pull request Apr 20, 2026
…ing task::spawn) (linera-io#6056)

## Motivation

The `LruCachingStore::write_batch` method was not cancellation-safe. It
wrote to the DB first, then updated the LRU cache. If the caller's
future was cancelled between the two steps (e.g. by a gRPC timeout or
runtime shutdown), the DB would have the new data but the cache would
retain stale entries. The subsequent `RollbackGuard` would then reset
the in-memory view state, and the next `save()` would overwrite the DB
with old data -- causing silent data loss.

This was identified as the likely root cause of the missing outbox
bucket data on validator 4 (`The front bucket is always loaded` panic).

## Proposal

Wrap the DB write and cache update in a `tokio::task::spawn` so they run
to completion even if the caller's future is cancelled. On web targets,
the task runs inline since cancellation safety is not a concern there
(no `RollbackGuard` / concurrent chain workers).

Adds `Clone + Send + 'static` bounds on the inner store type parameter,
which are already satisfied by all stores in the chain (ScyllaDB,
RocksDB, journaling, value-splitting).

## Test Plan

CI

## Links

- Alternative to linera-io#6051 (invalidate-first approach)
- Related: linera-io#5790 (Remove chain actors -- introduced `RollbackGuard`)
- Related: linera-io#6015 (Fix `BucketQueueView::delete_front` load failure
handling)
- Related: linera-io#6046 (Fix storage cache reads dropping Arc before use)
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
afck pushed a commit to afck/linera-protocol that referenced this pull request Apr 20, 2026
## Motivation

Fixes linera-io#6052.

Previously only journal resolution failures poisoned the chain worker.
Other write errors (e.g. possible ScyllaDB timeouts) left the worker
loaded with potentially stale state.

Stacked on linera-io#6051.

## Proposal

- **Poison on any save failure**: `ChainWorkerState::save()` now always
sets `poisoned = true` and returns `WorkerError::PoisonedWorker` on any
error, triggering immediate eviction and a fresh reload on the next
request.

- **Keep `must_reload_view`** for metrics and as way to document that
journaling really requires a notion of view poisoning.

## Test Plan

CI

## Links

- Fixes linera-io#6052
- Stacked on linera-io#6051
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants