fix(merge): bound dangling block cache growth #11426
Conversation
@claude review
Claude finished @LukaszRozmej's task in 5m 11s — PR Review: fix(merge): bound dangling block cache growth
Summary

The PR adds a bounded dangling block cache.

Findings

Medium — Pruning strategy may sever the beacon sync ancestry chain

A safer strategy would prune the highest-numbered block (furthest from the anchor), or track insertion order to evict the oldest-inserted entry.

Medium —
`private bool TryRemoveOldestBlock()`
Medium — Pruning strategy evicts the wrong end of the chain for beacon sync
TryRemoveOldestBlock finds the block with the lowest Number and evicts it. During beacon sync, blocks arrive tip-first (highest number first). TryInsertDanglingBlock walks parent pointers downward from the new block to find a known anchor. Evicting the lowest-numbered block removes the entry closest to the anchor — exactly the block that completes the chain walk.
Consider evicting the highest-numbered block instead (furthest from the sync anchor), or tracking insertion order via a parallel Queue<Hash256AsKey> and evicting the oldest-inserted entry regardless of block number.
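That tip-side eviction rule can be sketched self-contained as follows. This is illustrative only: `CachedBlock`, the `string` keys, and `DanglingBlockCache` are stand-ins (the real code uses `Hash256AsKey`/`Block`); only `FinalizedHash`/`HeadBlockHash` mirror names from this PR.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Stand-in types for illustration only.
public record CachedBlock(long Number);

public class DanglingBlockCache
{
    public ConcurrentDictionary<string, CachedBlock> BlockCache { get; } = new();
    public string? FinalizedHash { get; set; }
    public string? HeadBlockHash { get; set; }

    // Evict the highest-numbered (tip-side) unprotected entry so the
    // anchor-proximal blocks needed by the chain walk survive.
    public bool TryRemoveHighestNumberedUnprotectedBlock()
    {
        string? victim = null;
        long maxNumber = long.MinValue;

        foreach (KeyValuePair<string, CachedBlock> kvp in BlockCache)
        {
            // Never evict the finalized or head block.
            if (kvp.Key == FinalizedHash || kvp.Key == HeadBlockHash) continue;

            if (kvp.Value.Number > maxNumber)
            {
                maxNumber = kvp.Value.Number;
                victim = kvp.Key;
            }
        }

        return victim is not null && BlockCache.TryRemove(victim, out _);
    }
}
```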
`public void Clear() => BlockCache.Clear();`

`private void PruneIfNeeded()`
Medium — Non-atomic prune loop can bail early under concurrent access
If two threads both call TryAddBlock concurrently, both enter PruneIfNeeded. Both may identify the same oldestHash during their scan. The first TryRemove succeeds; the second returns false, causing this loop to exit early while the count is still over the limit.
In practice the Engine API is serialized, so this is low risk today. Consider documenting the threading expectation or guarding with a lock / Interlocked.CompareExchange on a prune-in-progress flag.
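A prune-in-progress guard along those lines might look like this. Every name here (`PruneCoordinator`, the two delegates) is hypothetical, not from the PR; it only demonstrates the `Interlocked.CompareExchange` flag pattern suggested above.

```csharp
using System;
using System.Threading;

// Illustrative sketch: one thread claims the prune; concurrent callers
// skip instead of racing on the same victim and bailing out early.
public class PruneCoordinator
{
    private int _pruneInProgress; // 0 = idle, 1 = pruning

    public void PruneIfNeeded(Func<bool> overLimit, Func<bool> tryEvictOne)
    {
        // Claim the prune; if another thread holds it, bail out.
        if (Interlocked.CompareExchange(ref _pruneInProgress, 1, 0) != 0)
        {
            return;
        }
        try
        {
            // Keep evicting until under the limit or nothing evictable remains.
            while (overLimit() && tryEvictOne()) { }
        }
        finally
        {
            Volatile.Write(ref _pruneInProgress, 0);
        }
    }
}
```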
In `IBlockCacheService`:

`public ConcurrentDictionary<Hash256AsKey, Block> BlockCache { get; }`
Medium — BlockCache exposes mutable ConcurrentDictionary, allowing callers to bypass TryAddBlock
Any caller can still do BlockCache.TryAdd(hash, block) or BlockCache[key] = block, bypassing the bounding logic. The PR routes all current insert sites through TryAddBlock (good), but the interface offers no enforcement.
Options:
- Change to `IReadOnlyDictionary<Hash256AsKey, Block>` for reads (all current external usages are reads or removes, not inserts).
- Or add a `/// <remarks>Writes must go through <see cref="TryAddBlock"/>.</remarks>` to document the contract.
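A minimal sketch of the first option, assuming stand-in `string` keys and a `CachedBlock` record in place of `Hash256AsKey`/`Block`:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Stand-in value type for illustration only.
public record CachedBlock(long Number);

public class BlockCacheService
{
    private readonly ConcurrentDictionary<string, CachedBlock> _blockCache = new();

    // ConcurrentDictionary already implements IReadOnlyDictionary, so the
    // read-only view costs nothing: callers can read and enumerate, but the
    // type system stops them from inserting around the bounding logic.
    public IReadOnlyDictionary<string, CachedBlock> BlockCache => _blockCache;

    public bool TryAddBlock(string hash, CachedBlock block)
    {
        bool added = _blockCache.TryAdd(hash, block);
        // ...bounding/pruning would run here in the real service...
        return added;
    }
}
```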
`[Test]`
`public void block_cache_service_prunes_oldest_unprotected_blocks()`
Low — Tests for BlockCacheService are misplaced in InvalidChainTrackerTest
These two new tests exercise BlockCacheService pruning behavior, not InvalidChainTracker. They should live in a dedicated BlockCacheServiceTest.cs to keep the test class focused and easy to discover.
Also missing: a test for the all-protected overflow scenario — when every cached block is either FinalizedHash or HeadBlockHash, pruning silently bails and the cache exceeds its limit. A test documenting that behavior would prevent surprises.
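The all-protected overflow behaviour can be pinned down against a minimal stand-in cache. `BoundedBlockCache` and `CachedBlock` below are illustrative stand-ins; the real test would use `Build.A.Block` and the service's actual constructor.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Stand-in types for illustration only.
public record CachedBlock(long Number);

public class BoundedBlockCache
{
    private readonly int _maxCachedBlocks;
    public ConcurrentDictionary<string, CachedBlock> BlockCache { get; } = new();
    public string? FinalizedHash { get; set; }
    public string? HeadBlockHash { get; set; }

    public BoundedBlockCache(int maxCachedBlocks) => _maxCachedBlocks = maxCachedBlocks;

    public bool TryAddBlock(string hash, CachedBlock block)
    {
        bool added = BlockCache.TryAdd(hash, block);
        while (BlockCache.Count > _maxCachedBlocks && TryEvictOne()) { }
        return added;
    }

    private bool TryEvictOne()
    {
        foreach (KeyValuePair<string, CachedBlock> kvp in BlockCache)
        {
            if (kvp.Key == FinalizedHash || kvp.Key == HeadBlockHash) continue;
            return BlockCache.TryRemove(kvp.Key, out _);
        }
        // Every entry is protected: silently bail, cache stays over limit.
        return false;
    }
}
```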
@LukaszRozmej Updated, please take another look, thanks for your review!
@claude re-review
Claude finished @LukaszRozmej's task in 3m 32s — Re-review: fix(merge): bound dangling block cache growth
Summary

All Medium findings from the original review have been resolved across 4 follow-up commits. The implementation is correct and well-tested. One minor informational note remains.

Previous findings — status
Remaining note (informational)

Low — If the newly added block has the highest number among all unprotected entries, the prune loop selects and removes it on the same call, leaving it absent from the cache while the method returns true. No current callers check the return value.

Verdict

All Medium issues are resolved. The fix is sound: the bounded cache correctly evicts the highest-numbered (tip-side) blocks to preserve the anchor-proximal blocks that complete the chain walk, all writes go through the bounded API, concurrency is properly serialized, and the edge cases are covered by tests. Mergeable. ✅
LukaszRozmej left a comment
@claude post suggestions how to de-dup tests in src/Nethermind/Nethermind.Merge.Plugin.Test/BlockCacheServiceTest.cs
Could we just use LruCache?
```
public bool TryAddBlock(Block block)
{
    Hash256 blockHash = block.Hash ?? block.CalculateHash();
    lock (_pruneLock)
```
lock + ConcurrentDictionary seems redundant
`return added;`
Claude finished @LukaszRozmej's task in 2m 56s — Suggestions: test de-dup + LruCache
1. Could we just use LruCache?

| Requirement | Current implementation | LruCache |
|---|---|---|
| Never evict protected keys (FinalizedHash / HeadBlockHash) | ✅ Explicit skip in `TryGetHighestNumberedUnprotectedBlock` | ❌ No pinning concept |
| Evict highest-numbered block (furthest from sync anchor) | ✅ Max-scan by `block.Number` | ❌ Evicts LRU regardless of block number |
Forcing LruCache to work here would require adding both a pin-key API and reversing the eviction ordering — at which point you've re-implemented what's already here. The current bespoke approach is the right fit.
2. lock + ConcurrentDictionary — redundancy (line 53 comment)
The reviewer is correct. _pruneLock already serializes all mutating paths (TryAddBlock, TryRemoveBlock, Clear), so ConcurrentDictionary's internal per-bucket locking is pure overhead on every write. However, there is one catch before switching to a plain Dictionary:
BlockCache is exposed as IReadOnlyDictionary<Hash256AsKey, Block> and external callers iterate it unsynchronized. The most notable example is UnsafeStartingSyncPivotUpdater.cs:73:
```
foreach (Block block in _blockCacheService.BlockCache.Values)
```

A plain Dictionary would not be safe to iterate while another thread holds _pruneLock and mutates it. ConcurrentDictionary supports thread-safe enumeration for free (the enumerator can run concurrently with writes, though it is not a point-in-time snapshot).
Two viable paths:
Path A — remove ConcurrentDictionary, expose a snapshot method (cleaner long-term):
```
private readonly Dictionary<Hash256AsKey, Block> _blockCache = new();

// New interface method replacing raw BlockCache.Values iteration:
public IReadOnlyList<Block> GetCachedBlocks()
{
    lock (_pruneLock) { return [.._blockCache.Values]; }
}
```

All `BlockCache.Values` / `BlockCache.TryGetValue` callsites then go through the interface. `BlockCache` can be removed from `IBlockCacheService` or narrowed further. This is the right direction but is a bigger diff.
Path B — keep ConcurrentDictionary, drop _pruneLock, accept non-atomic add+prune (pragmatic):
Since Engine API calls are already externally serialized, the add+prune race is theoretical rather than practical. Removing _pruneLock and relying solely on ConcurrentDictionary would eliminate the redundancy with minimal risk. The while loop's TryRemove returning false on a concurrent miss is harmless — the next TryAddBlock will prune again.
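Path B might be sketched roughly as follows. Stand-in types again (`string` keys, `CachedBlock`, `LockFreeBoundedCache` are all illustrative, not the PR's exact code); the point is that the prune loop simply stops on a lost race and retries on the next add.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Stand-in types for illustration only.
public record CachedBlock(long Number);

public class LockFreeBoundedCache
{
    private readonly int _maxCachedBlocks;
    public ConcurrentDictionary<string, CachedBlock> BlockCache { get; } = new();
    public string? FinalizedHash { get; set; }
    public string? HeadBlockHash { get; set; }

    public LockFreeBoundedCache(int maxCachedBlocks) => _maxCachedBlocks = maxCachedBlocks;

    public bool TryAddBlock(string hash, CachedBlock block)
    {
        bool added = BlockCache.TryAdd(hash, block);

        // No lock: a TryRemove that loses a race just ends this pass;
        // the next TryAddBlock call prunes again.
        while (BlockCache.Count > _maxCachedBlocks)
        {
            string? victim = null;
            long maxNumber = long.MinValue;
            foreach (KeyValuePair<string, CachedBlock> kvp in BlockCache)
            {
                if (kvp.Key == FinalizedHash || kvp.Key == HeadBlockHash) continue;
                if (kvp.Value.Number > maxNumber) { maxNumber = kvp.Value.Number; victim = kvp.Key; }
            }
            if (victim is null || !BlockCache.TryRemove(victim, out _)) break;
        }

        return added;
    }
}
```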
3. Test de-duplication in BlockCacheServiceTest.cs
The three tests share a boilerplate pattern — create a block, immediately compute its hash:
```
Block block1 = Build.A.Block.WithNumber(1).TestObject;
Hash256 block1Hash = block1.GetOrCalculateHash();
// ...repeated 2 more times
```

Suggestion — extract a factory helper (cuts 3 lines per test):
```
private static (Block block, Hash256 hash) MakeBlock(long number)
{
    Block b = Build.A.Block.WithNumber(number).TestObject;
    return (b, b.GetOrCalculateHash());
}
```

Tests 1 and 2 then become:
```
var (b1, h1) = MakeBlock(1);
var (b2, h2) = MakeBlock(2);
var (b3, h3) = MakeBlock(3);
```

Tests 1 and 2 also share the same structural skeleton (capacity=2, add 3 blocks, assert count=2, assert 3 membership results). They could be collapsed into a single `[TestCaseSource]` test, but the setup differs enough (protected hashes) that the parameterization would likely reduce readability. The factory helper alone is the right trade-off.
Summary of recommended actions:

| Item | Recommendation |
|---|---|
| LruCache | Keep current implementation; LruCache doesn't support pinned keys or number-based eviction |
| lock + ConcurrentDictionary | Path A: switch to Dictionary + GetCachedBlocks() snapshot; or Path B: drop _pruneLock, keep ConcurrentDictionary — either removes the redundancy |
| Test de-dup | Extract `MakeBlock(long)` helper to eliminate the repeated block+hash creation lines |
Path B please

sure, will do.

@LukaszRozmej Done.

@claude re-review
Claude finished @LukaszRozmej's task in 2m 36s — Re-review: fix(merge): bound dangling block cache growth
Summary

The three post-approval commits correctly implement the final round of changes requested by LukaszRozmej. The implementation is clean and ready.

Recent commits reviewed
LukaszRozmej's 12:37 inline comments — verified resolved

The two inline comments ("if added here" / "can be removed here — should return `false`") are resolved.

All-callers check — no mutable access remains
| Caller | Access |
|---|---|
| StartingSyncPivotUpdater.cs:135 | TryGetValue |
| UnsafeStartingSyncPivotUpdater.cs:73 | .Values iteration (safe on ConcurrentDictionary) |
| ForkchoiceUpdatedHandler.cs:122 | TryGetValue |
| NewPayloadHandler.cs:474 | TryGetValue |
| InvalidChainTracker.cs:105 | TryGetValue |
IBlockCacheService.BlockCache is typed as IReadOnlyDictionary<Hash256AsKey, Block>, so the type system prevents callers from bypassing TryAddBlock at compile time.
Informational note
TryAddBlock returns true in the all-protected overflow case — when every cached block is protected (FinalizedHash / HeadBlockHash), pruning bails and the method returns true even though the count exceeded _maxCachedBlocks. This is the correct safe behaviour (never evict the chain tip or finalized block), and the preserves_protected_blocks_when_all_candidates_are_protected test documents it explicitly.
Verdict: Mergeable. ✅ All Medium findings from previous rounds are resolved. Path B is implemented correctly, return value semantics are accurate, and all remaining BlockCache accesses are read-only.
```
/// Initializes a block cache with the default merge sync bound.
/// </summary>
public BlockCacheService()
    : this((int)(Reorganization.MaxDepth * 2 + 16))
```
@LukaszRozmej isn't this going to make cache less effective when we support longer reorgs soon? Where is this number coming from?
Also, how frequently can we have an incoming chain from CL bigger than 144 blocks? I think this number needs more thought for all the chains we support — maybe for Ethereum mainnet it is fine.
Maybe change approach and evict by fork? Maybe make it configurable per chain?
@claude any ideas?
Changes
- `TryAddBlock` API.

Types of changes
What types of changes does your code introduce?
Testing
Requires testing
If yes, did you write tests?
Notes on testing
- `InvalidChainTrackerTest`
- `git diff --check`
- `TryAddBlock`
- .NET build/test intentionally skipped by user request.
Requires documentation update
Requires explanation in Release Notes