
fix(merge): bound dangling block cache growth #11426

Open
JayeTurn wants to merge 9 commits into NethermindEth:master from JayeTurn:fix/merge-block-cache-bound

Conversation

@JayeTurn
Contributor

Changes

  • Route Merge dangling block cache inserts through a bounded TryAddBlock API.
  • Prune the oldest unprotected cached blocks when the cache exceeds its configured limit.
  • Preserve cached head/finalized blocks during pruning and remove dangling blocks after inserting them into the block tree.
  • Add regression coverage for cache pruning behavior.

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

Notes on testing

  • Added pruning coverage in InvalidChainTrackerTest.
  • Ran git diff --check.
  • Verified production cache insertion sites route through TryAddBlock.
  • .NET build/test intentionally skipped by user request.

Documentation

Requires documentation update

  • Yes
  • No

Requires explanation in Release Notes

  • Yes
  • No

@LukaszRozmej
Member

@claude review

@claude
Contributor

claude Bot commented Apr 30, 2026

Claude finished @LukaszRozmej's task in 5m 11s


PR Review: fix(merge): bound dangling block cache growth

  • Read relevant rule files
  • Read changed files
  • Analyze correctness and edge cases
  • Check performance implications
  • Post findings

Summary

The PR adds a bounded TryAddBlock API to BlockCacheService with LRU-by-block-number pruning to prevent unbounded memory growth from dangling blocks during beacon sync. The direction is correct and the existing insert sites are properly routed through the new API. However, there are several issues that need addressing before merge.


Findings

Medium — Pruning strategy may sever the beacon sync ancestry chain

TryRemoveOldestBlock evicts the block with the lowest block number. During beacon sync, blocks arrive tip-first (highest number first) and TryInsertDanglingBlock walks downward through parent pointers in the cache looking for a known anchor. Evicting the lowest-numbered block removes the entry closest to the sync anchor — exactly the block that completes the chain walk. If that block is pruned, the walk from the tip fails at the gap, all subsequent blocks re-enter the cache as unanchored, and they will not be connected until the pruned block is re-requested.

A safer strategy would prune the highest-numbered block (furthest from the anchor), or track insertion order to evict the oldest-inserted entry.
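An illustrative sketch of the evict-highest alternative, assuming the PR's cache field and protected-hash properties (the helper name and member access are guesses, not the actual code):

```csharp
// Hypothetical helper: evict the highest-numbered unprotected block
// (furthest from the sync anchor). Field and property names are assumed.
private bool TryRemoveHighestUnprotectedBlock()
{
    Hash256AsKey? candidate = null;
    long highest = long.MinValue;

    foreach (KeyValuePair<Hash256AsKey, Block> entry in _blockCache)
    {
        if (entry.Key.Value == FinalizedHash || entry.Key.Value == HeadBlockHash)
        {
            continue; // never evict protected entries
        }

        if (entry.Value.Number > highest)
        {
            highest = entry.Value.Number;
            candidate = entry.Key;
        }
    }

    return candidate is { } key && _blockCache.TryRemove(key, out _);
}
```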


Medium — BlockCache property is still publicly writable, bypassing bounds

IBlockCacheService.BlockCache exposes the raw ConcurrentDictionary<Hash256AsKey, Block>, allowing any caller to call BlockCache.TryAdd(...) or BlockCache[key] = value without going through TryAddBlock. The invariant is easy to violate accidentally in future code. Consider either:

  • Narrowing the interface to IReadOnlyDictionary<Hash256AsKey, Block> for reads, or
  • Keeping the existing dict but adding a code comment on the interface that writes must go through TryAddBlock.

Current callers are all fixed, so this is a design concern rather than an active bug.
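A sketch of the first option, narrowing the read surface while keeping a single bounded write path (the exact member shapes are assumptions for illustration):

```csharp
// Sketch: the interface exposes only a read-only view plus the bounded API.
public interface IBlockCacheService
{
    // Lookups and enumeration only; inserts must go through TryAddBlock.
    IReadOnlyDictionary<Hash256AsKey, Block> BlockCache { get; }

    /// <summary>Adds a block, pruning over-limit unprotected entries.</summary>
    bool TryAddBlock(Block block);

    void Clear();
}
```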

Medium — PruneIfNeeded bails silently when all cached blocks are protected

If every block in the cache is protected (FinalizedHash or HeadBlockHash), TryRemoveOldestBlock returns false and PruneIfNeeded exits the while loop without reducing the count below _maxCachedBlocks. This is the correct choice to avoid an infinite loop, but there is no test for this scenario and no log warning. An adversary who can set the FinalizedHash/HeadBlockHash to blocks it keeps replaying could keep the cache permanently above the limit (though in practice those hashes come from forkchoiceUpdated, not untrusted data).

A test covering this edge case (all-protected overflow) would document the behavior.
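A hypothetical NUnit/FluentAssertions test along those lines (the constructor signature and property names are assumptions, not the PR's code):

```csharp
[Test]
public void keeps_protected_blocks_even_when_over_the_limit()
{
    // Capacity of one, but both cached blocks are protected.
    BlockCacheService cache = new(maxCachedBlocks: 1); // ctor shape assumed

    Block finalized = Build.A.Block.WithNumber(1).TestObject;
    Block head = Build.A.Block.WithNumber(2).TestObject;
    cache.FinalizedHash = finalized.GetOrCalculateHash();
    cache.HeadBlockHash = head.GetOrCalculateHash();

    cache.TryAddBlock(finalized);
    cache.TryAddBlock(head); // exceeds the limit; pruning should bail

    cache.BlockCache.Count.Should().Be(2); // documents the over-limit state
}
```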


Medium — TryRemoveOldestBlock can spuriously abort pruning under concurrent access

The prune loop is not atomic: TryAdd + PruneIfNeeded are two separate operations on the ConcurrentDictionary. If two threads call TryAddBlock concurrently, both may succeed TryAdd, both enter PruneIfNeeded, both find the same oldestHash during their respective scans, the first TryRemove succeeds, and the second returns false — which causes PruneIfNeeded to return even though the count is still above the limit. The Engine API is effectively serialized today, so this is low risk in practice, but the invariant is fragile for future callers.

Low — Misplaced tests

Tests for BlockCacheService pruning behavior are added to InvalidChainTrackerTest.cs and its namespace/class. They should live in a dedicated test class (e.g. BlockCacheServiceTest.cs in the same directory) to follow the single-responsibility convention of the rest of the test suite.

Low — O(n) full scan per prune invocation

TryRemoveOldestBlock iterates the entire BlockCache dictionary on every pruning call. At the default cap of Reorganization.MaxDepth * 2 + 16 = 144 this is negligible, but the formula and its rationale should be documented (a single code comment would suffice).
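The suggested rationale comment might look roughly like this (wording is illustrative, not taken from the PR):

```csharp
// Default cap: Reorganization.MaxDepth * 2 + 16 = 144 entries. At this
// size a full O(n) scan per prune is cheaper than maintaining a second
// index keyed by block number.
private readonly int _maxCachedBlocks;
```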

Low — Missing XML documentation on the new public API member

IBlockCacheService.TryAddBlock is a public interface member. Per the coding style rules, public APIs should carry <summary> / <returns> documentation. The current implementation has no doc comment.
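The doc comment could look roughly like this (wording illustrative; the return description must track whatever semantics the implementation settles on):

```csharp
/// <summary>
/// Adds <paramref name="block"/> to the dangling-block cache, pruning
/// unprotected entries once the configured limit is exceeded.
/// </summary>
/// <returns><c>true</c> if the block is cached after the call; otherwise <c>false</c>.</returns>
bool TryAddBlock(Block block);
```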


Minor nit

BeaconSync.StopSyncing previously called _blockCacheService.BlockCache.Clear() directly; changing it to _blockCacheService.Clear() is the right cleanup and is correct.


Verdict: The fix addresses a real resource-exhaustion bug. The core mechanism (bound + prune) is sound, but the pruning strategy (evict lowest-numbered block) is likely backwards for the main beacon sync use case and warrants rethinking before this ships.

}

private bool TryRemoveOldestBlock()
{
Contributor


Medium — Pruning strategy evicts the wrong end of the chain for beacon sync

TryRemoveOldestBlock finds the block with the lowest Number and evicts it. During beacon sync, blocks arrive tip-first (highest number first). TryInsertDanglingBlock walks parent pointers downward from the new block to find a known anchor. Evicting the lowest-numbered block removes the entry closest to the anchor — exactly the block that completes the chain walk.

Consider evicting the highest-numbered block instead (furthest from the sync anchor), or tracking insertion order via a parallel Queue<Hash256AsKey> and evicting the oldest-inserted entry regardless of block number.
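The insertion-order variant could be sketched like this (a hypothetical shape; the Hash256AsKey member access and field names are assumptions):

```csharp
// Hypothetical insertion-order eviction: a parallel queue records arrival
// order; pruning evicts the oldest-inserted entry that is still cached
// and not protected. Bounded to one pass over the queue.
private readonly Queue<Hash256AsKey> _insertionOrder = new();

private bool TryRemoveOldestInserted()
{
    int remaining = _insertionOrder.Count;
    while (remaining-- > 0 && _insertionOrder.TryDequeue(out Hash256AsKey candidate))
    {
        if (candidate.Value == FinalizedHash || candidate.Value == HeadBlockHash)
        {
            _insertionOrder.Enqueue(candidate); // keep protected entries queued
            continue;
        }

        if (_blockCache.TryRemove(candidate, out _))
        {
            return true; // evicted the oldest-inserted unprotected block
        }
        // entry was already removed elsewhere; try the next-oldest
    }

    return false; // everything still cached is protected
}
```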

public void Clear() => BlockCache.Clear();

private void PruneIfNeeded()
{
Contributor


Medium — Non-atomic prune loop can bail early under concurrent access

If two threads both call TryAddBlock concurrently, both enter PruneIfNeeded. Both may identify the same oldestHash during their scan. The first TryRemove succeeds; the second returns false, causing this loop to exit early while the count is still over the limit.

In practice the Engine API is serialized, so this is low risk today. Consider documenting the threading expectation or guarding with a lock / Interlocked.CompareExchange on a prune-in-progress flag.
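One way to realize the prune-in-progress flag mentioned above, sketched under the assumption that _blockCache and _maxCachedBlocks exist as in the PR:

```csharp
// Sketch of a prune-in-progress flag using Interlocked; illustrative only.
// A thread that loses the race skips pruning; the winner's loop re-reads
// Count, so the skipped insert is still accounted for.
private int _pruning; // 0 = idle, 1 = prune in progress

private void PruneIfNeeded()
{
    if (Interlocked.CompareExchange(ref _pruning, 1, 0) != 0)
    {
        return; // another thread is pruning and will observe the new count
    }

    try
    {
        while (_blockCache.Count > _maxCachedBlocks && TryRemoveOldestBlock())
        {
        }
    }
    finally
    {
        Volatile.Write(ref _pruning, 0);
    }
}
```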

@@ -12,4 +12,6 @@ public interface IBlockCacheService
public ConcurrentDictionary<Hash256AsKey, Block> BlockCache { get; }
Contributor


Medium — BlockCache exposes mutable ConcurrentDictionary, allowing callers to bypass TryAddBlock

Any caller can still do BlockCache.TryAdd(hash, block) or BlockCache[key] = block, bypassing the bounding logic. The PR routes all current insert sites through TryAddBlock (good), but the interface offers no enforcement.

Options:

  • Change to IReadOnlyDictionary<Hash256AsKey, Block> for reads (all current external usages are reads or removes, not inserts).
  • Or add a /// <remarks>Writes must go through <see cref="TryAddBlock"/>.</remarks> to document the contract.

}

[Test]
public void block_cache_service_prunes_oldest_unprotected_blocks()
Contributor


Low — Tests for BlockCacheService are misplaced in InvalidChainTrackerTest

These two new tests exercise BlockCacheService pruning behavior, not InvalidChainTracker. They should live in a dedicated BlockCacheServiceTest.cs to keep the test class focused and easy to discover.

Also missing: a test for the all-protected overflow scenario — when every cached block is either FinalizedHash or HeadBlockHash, pruning silently bails and the cache exceeds its limit. A test documenting that behavior would prevent surprises.

@github-actions Bot added the sync label Apr 30, 2026
@JayeTurn
Contributor Author

@LukaszRozmej Updated, please take another look, thanks for your review!

@LukaszRozmej
Member

@claude re-review

@claude
Contributor

claude Bot commented Apr 30, 2026

Claude finished @LukaszRozmej's task in 3m 32s


Re-review: fix(merge): bound dangling block cache growth

  • Read relevant rule files
  • Read changed files
  • Analyze correctness and edge cases
  • Check previous findings were addressed
  • Post findings

Summary

All Medium findings from the original review have been resolved across 4 follow-up commits. The implementation is correct and well-tested. One minor informational note remains.


Previous findings — status

1. Medium — Pruning strategy evicted the lowest-numbered block, severing the beacon sync chain walk. ✅ Fixed — 68d2b3e changed eviction to the highest-numbered unprotected block.
2. Medium — BlockCache property exposed a mutable ConcurrentDictionary, allowing writes to bypass the bound. ✅ Fixed — 79867ea narrowed it to IReadOnlyDictionary<Hash256AsKey, Block>; all direct BlockCache[k] = v writes replaced with TryAddBlock.
3. Medium — All-protected overflow scenario untested (prune silently bails). ✅ Addressed — preserves_protected_blocks_when_all_candidates_are_protected test added in BlockCacheServiceTest.cs.
4. Medium — Non-atomic add+prune under concurrent access could leave the count above the limit. ✅ Fixed — 019f1eb added a Lock _pruneLock serializing TryAddBlock, TryRemoveBlock, and Clear.
5. Low — Pruning tests misplaced in InvalidChainTrackerTest. ✅ Fixed — dedicated BlockCacheServiceTest.cs added.
6. Low — O(n) scan rationale undocumented. ✅ Addressed — comment added explaining the small bound makes a second index unnecessary.
7. Low — Missing XML docs on TryAddBlock. ✅ Fixed — <summary> and <returns> docs added to interface and implementation.

Remaining note (informational)

Low — TryAddBlock returns true when the added block is immediately self-evicted

If the newly added block has the highest number among all unprotected entries, the prune loop selects and removes it on the same call, leaving it absent from the cache while the method returns true. The prunes_highest_unprotected_block test already documents this: TryAddBlock(block3).Should().BeTrue() yet ContainsKey(block3Hash).Should().BeFalse().

No current callers check the return value (NewPayloadHandler lines 174 and 481 both discard it), so this is not a correctness issue. It is worth noting for future callers.


Verdict

All Medium issues are resolved. The fix is sound: the bounded cache correctly evicts the highest-numbered (tip-side) blocks to preserve the anchor-proximal blocks that complete the chain walk, all writes go through the bounded API, concurrency is properly serialized, and the edge cases are covered by tests.

Mergeable.

Member

@LukaszRozmej left a comment


@claude post suggestions how to de-dup tests in src/Nethermind/Nethermind.Merge.Plugin.Test/BlockCacheServiceTest.cs

Could we just use LruCache?

public bool TryAddBlock(Block block)
{
Hash256 blockHash = block.Hash ?? block.CalculateHash();
lock (_pruneLock)
Member


lock + ConcurrentDictionary seems redundant

}
}

return added;
Member


yep could be just removed


@claude
Contributor

claude Bot commented Apr 30, 2026

Claude finished @LukaszRozmej's task in 2m 56s


Suggestions: test de-dup + LruCache viability


1. Can we use LruCache?

No — it doesn't fit the business logic here. The existing LruCache<TKey, TValue> (Nethermind.Core/Caching/LruCache.cs) evicts the least-recently-accessed entry via a doubly-linked list. This conflicts with two requirements:

  • Never evict protected keys (FinalizedHash / HeadBlockHash): the current implementation skips them explicitly in TryGetHighestNumberedUnprotectedBlock; LruCache has no pinning concept.
  • Evict the highest-numbered block (furthest from the sync anchor): the current implementation does a max-scan by block.Number; LruCache evicts the least-recently-used entry regardless of block number.

Forcing LruCache to work here would require adding both a pin-key API and reversing the eviction ordering — at which point you've re-implemented what's already here. The current bespoke approach is the right fit.


2. lock + ConcurrentDictionary — redundancy (line 53 comment)

The reviewer is correct. _pruneLock already serializes all mutating paths (TryAddBlock, TryRemoveBlock, Clear), so ConcurrentDictionary's internal per-bucket locking is pure overhead on every write. However, there is one catch before switching to a plain Dictionary:

BlockCache is exposed as IReadOnlyDictionary<Hash256AsKey, Block> and external callers iterate it unsynchronized. The most notable example is UnsafeStartingSyncPivotUpdater.cs:73:

foreach (Block block in _blockCacheService.BlockCache.Values)

A plain Dictionary would not be safe to iterate while another thread holds _pruneLock and mutates it. ConcurrentDictionary provides snapshot-consistent iteration for free.

Two viable paths:

Path A — remove ConcurrentDictionary, expose a snapshot method (cleaner long-term):

private readonly Dictionary<Hash256AsKey, Block> _blockCache = new();

// New interface method replacing raw BlockCache.Values iteration:
public IReadOnlyList<Block> GetCachedBlocks()
{
    lock (_pruneLock) { return [.._blockCache.Values]; }
}

All BlockCache.Values / BlockCache.TryGetValue callsites then go through the interface. BlockCache can be removed from IBlockCacheService or narrowed further. This is the right direction but is a bigger diff.

Path B — keep ConcurrentDictionary, drop _pruneLock, accept non-atomic add+prune (pragmatic):
Since Engine API calls are already externally serialized, the add+prune race is theoretical rather than practical. Removing _pruneLock and relying solely on ConcurrentDictionary would eliminate the redundancy with minimal risk. The while loop's TryRemove returning false on a concurrent miss is harmless — the next TryAddBlock will prune again.
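Under Path B, TryAddBlock might reduce to something like this (a sketch only; the eviction helper name and the TryAdd-failure return semantics are assumptions):

```csharp
// Path B sketch: no lock, ConcurrentDictionary only. A TryRemove that loses
// a race just ends this prune pass; the next TryAddBlock prunes again.
public bool TryAddBlock(Block block)
{
    Hash256 blockHash = block.Hash ?? block.CalculateHash();
    if (!_blockCache.TryAdd(blockHash, block))
    {
        return false; // already cached
    }

    while (_blockCache.Count > _maxCachedBlocks && TryRemoveHighestUnprotectedBlock())
    {
    }

    // False when the new block itself was the highest unprotected entry
    // and got self-evicted by the loop above.
    return _blockCache.ContainsKey(blockHash);
}
```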


3. Test de-duplication in BlockCacheServiceTest.cs

The three tests share a boilerplate pattern — create a block, immediately compute its hash:

Block block1 = Build.A.Block.WithNumber(1).TestObject;
Hash256 block1Hash = block1.GetOrCalculateHash();
// ...repeated 2 more times

Suggestion — extract a factory helper (cuts 3 lines per test):

private static (Block block, Hash256 hash) MakeBlock(long number)
{
    Block b = Build.A.Block.WithNumber(number).TestObject;
    return (b, b.GetOrCalculateHash());
}

Tests 1 and 2 then become:

var (b1, h1) = MakeBlock(1);
var (b2, h2) = MakeBlock(2);
var (b3, h3) = MakeBlock(3);

Tests 1 and 2 also share the same structural skeleton (capacity=2, add 3 blocks, assert count=2, assert 3 membership results). They could be collapsed into a single [TestCaseSource] test, but the setup differs enough (protected hashes) that the parameterization would likely reduce readability. The factory helper alone is the right trade-off.


Summary of recommended actions:

  • LruCache: keep the current implementation; LruCache doesn't support pinned keys or number-based eviction.
  • lock + ConcurrentDictionary: either Path A (switch to Dictionary plus a GetCachedBlocks() snapshot) or Path B (drop _pruneLock, keep ConcurrentDictionary) removes the redundancy.
  • Test de-dup: extract a MakeBlock(long) helper to eliminate the repeated block-plus-hash creation lines.

@LukaszRozmej
Member

Path B please

@JayeTurn
Contributor Author

sure, will do.

@JayeTurn
Contributor Author

@LukaszRozmej Done.

@LukaszRozmej
Member

@claude re-review

@claude
Contributor

claude Bot commented May 1, 2026

Claude finished @LukaszRozmej's task in 2m 36s


Re-review: fix(merge): bound dangling block cache growth

  • Read rule files and gather context
  • Read current state of changed files
  • Analyze recent commits (Path B + return-value fixes)
  • Check LukaszRozmej's 12:37 inline review comments were addressed
  • Post findings

Summary

The three post-approval commits correctly implement the final round of changes requested by LukaszRozmej. The implementation is clean and ready.


Recent commits reviewed

  • 13caefd — Remove _pruneLock (Path B). ✅ Correct — ConcurrentDictionary retained for safe concurrent iteration.
  • d358e02 — Track blockRemainsCached; return false when the added block is immediately self-evicted. ✅ Correct.
  • 595626a — Refactor: early return on TryAdd failure; rename added → blockRemainsCached for clarity. ✅ Correct.

LukaszRozmej's 12:37 inline comments — verified resolved

The two inline comments ("if added here" / "can be removed here — should return false then") asked for an early return when TryAdd fails, and for the added variable to be replaced by a blockRemainsCached flag that is set to false only when the newly added block gets pruned. Both changes are now in place (BlockCacheService.cs:51-53 and 62-65).


All-callers check — no mutable BlockCache access remaining

Every remaining reference to BlockCache in non-test production code is read-only:

  • StartingSyncPivotUpdater.cs:135 — TryGetValue
  • UnsafeStartingSyncPivotUpdater.cs:73 — .Values iteration (safe on ConcurrentDictionary)
  • ForkchoiceUpdatedHandler.cs:122 — TryGetValue
  • NewPayloadHandler.cs:474 — TryGetValue
  • InvalidChainTracker.cs:105 — TryGetValue

IBlockCacheService.BlockCache is typed as IReadOnlyDictionary<Hash256AsKey, Block>, so the type system prevents callers from bypassing TryAddBlock at compile time.


Informational note

TryAddBlock returns true in the all-protected overflow case — when every cached block is protected (FinalizedHash / HeadBlockHash), pruning bails and the method returns true even though the count exceeded _maxCachedBlocks. This is the correct safe behaviour (never evict the chain tip or finalized block), and the preserves_protected_blocks_when_all_candidates_are_protected test documents it explicitly.


Verdict: Mergeable. ✅ All Medium findings from previous rounds are resolved. Path B is implemented correctly, return value semantics are accurate, and all remaining BlockCache accesses are read-only.

/// Initializes a block cache with the default merge sync bound.
/// </summary>
public BlockCacheService()
: this((int)(Reorganization.MaxDepth * 2 + 16))
Contributor

@svlachakis commented May 2, 2026


@LukaszRozmej isn't this going to make cache less effective when we support longer reorgs soon? Where is this number coming from?

Also how frequently can we have incoming chain from CL bigger than 144 blocks? I think this number needs more thought for all the chains we support - maybe for ethereum mainnet is fine.

Maybe change approach and evict by fork? Maybe make it configurable per chain?

@claude any ideas?

