
fix(merge): bound dangling block cache growth #11426

Open
JayeTurn wants to merge 9 commits into NethermindEth:master from JayeTurn:fix/merge-block-cache-bound

Conversation

@JayeTurn
Contributor

Changes

  • Route Merge dangling block cache inserts through a bounded TryAddBlock API.
  • Prune the oldest unprotected cached blocks when the cache exceeds its configured limit.
  • Preserve cached head/finalized blocks during pruning and remove dangling blocks after inserting them into the block tree.
  • Add regression coverage for cache pruning behavior.

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

Notes on testing

  • Added pruning coverage in InvalidChainTrackerTest.
  • Ran git diff --check.
  • Verified production cache insertion sites route through TryAddBlock.
  • .NET build/test intentionally skipped by user request.

Documentation

Requires documentation update

  • Yes
  • No

Requires explanation in Release Notes

  • Yes
  • No

@LukaszRozmej
Member

@claude review

@claude
Contributor

claude Bot commented Apr 30, 2026

Claude finished @LukaszRozmej's task in 5m 11s


PR Review: fix(merge): bound dangling block cache growth

  • Read relevant rule files
  • Read changed files
  • Analyze correctness and edge cases
  • Check performance implications
  • Post findings

Summary

The PR adds a bounded TryAddBlock API to BlockCacheService with LRU-by-block-number pruning to prevent unbounded memory growth from dangling blocks during beacon sync. The direction is correct and the existing insert sites are properly routed through the new API. However, there are several issues that need addressing before merge.


Findings

Medium — Pruning strategy may sever the beacon sync ancestry chain

TryRemoveOldestBlock evicts the block with the lowest block number. During beacon sync, blocks arrive tip-first (highest number first) and TryInsertDanglingBlock walks downward through parent pointers in the cache looking for a known anchor. Evicting the lowest-numbered block removes the entry closest to the sync anchor — exactly the block that completes the chain walk. If that block is pruned, the walk from the tip fails at the gap, all subsequent blocks re-enter the cache as unanchored, and they will not be connected until the pruned block is re-requested.

A safer strategy would prune the highest-numbered block (furthest from the anchor), or track insertion order to evict the oldest-inserted entry.
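An illustrative sketch of the evict-highest alternative, assuming the PR's cache field and protected-hash properties (the helper name and member access are guesses, not the actual code):

```csharp
// Hypothetical helper: evict the highest-numbered unprotected block
// (furthest from the sync anchor). Field and property names are assumed.
private bool TryRemoveHighestUnprotectedBlock()
{
    Hash256AsKey? candidate = null;
    long highest = long.MinValue;

    foreach (KeyValuePair<Hash256AsKey, Block> entry in _blockCache)
    {
        if (entry.Key.Value == FinalizedHash || entry.Key.Value == HeadBlockHash)
        {
            continue; // never evict protected entries
        }

        if (entry.Value.Number > highest)
        {
            highest = entry.Value.Number;
            candidate = entry.Key;
        }
    }

    return candidate is { } key && _blockCache.TryRemove(key, out _);
}
```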


Medium — BlockCache property is still publicly writable, bypassing bounds

IBlockCacheService.BlockCache exposes the raw ConcurrentDictionary<Hash256AsKey, Block>, allowing any caller to call BlockCache.TryAdd(...) or BlockCache[key] = value without going through TryAddBlock. The invariant is easy to violate accidentally in future code. Consider either:

  • Narrowing the interface to IReadOnlyDictionary<Hash256AsKey, Block> for reads, or
  • Keeping the existing dict but adding a code comment on the interface that writes must go through TryAddBlock.

Current callers are all fixed, so this is a design concern rather than an active bug.
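A sketch of the first option, narrowing the read surface while keeping a single bounded write path (the exact member shapes are assumptions for illustration):

```csharp
// Sketch: the interface exposes only a read-only view plus the bounded API.
public interface IBlockCacheService
{
    // Lookups and enumeration only; inserts must go through TryAddBlock.
    IReadOnlyDictionary<Hash256AsKey, Block> BlockCache { get; }

    /// <summary>Adds a block, pruning over-limit unprotected entries.</summary>
    bool TryAddBlock(Block block);

    void Clear();
}
```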

Medium — PruneIfNeeded bails silently when all cached blocks are protected

If every block in the cache is protected (FinalizedHash or HeadBlockHash), TryRemoveOldestBlock returns false and PruneIfNeeded exits the while loop without reducing the count below _maxCachedBlocks. This is the correct choice to avoid an infinite loop, but there is no test for this scenario and no log warning. An adversary who can set the FinalizedHash/HeadBlockHash to blocks it keeps replaying could keep the cache permanently above the limit (though in practice those hashes come from forkchoiceUpdated, not untrusted data).

A test covering this edge case (all-protected overflow) would document the behavior.
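A hypothetical NUnit/FluentAssertions test along those lines (the constructor signature and property names are assumptions, not the PR's code):

```csharp
[Test]
public void keeps_protected_blocks_even_when_over_the_limit()
{
    // Capacity of one, but both cached blocks are protected.
    BlockCacheService cache = new(maxCachedBlocks: 1); // ctor shape assumed

    Block finalized = Build.A.Block.WithNumber(1).TestObject;
    Block head = Build.A.Block.WithNumber(2).TestObject;
    cache.FinalizedHash = finalized.GetOrCalculateHash();
    cache.HeadBlockHash = head.GetOrCalculateHash();

    cache.TryAddBlock(finalized);
    cache.TryAddBlock(head); // exceeds the limit; pruning should bail

    cache.BlockCache.Count.Should().Be(2); // documents the over-limit state
}
```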


Medium — TryRemoveOldestBlock can spuriously abort pruning under concurrent access

The prune loop is not atomic: TryAdd + PruneIfNeeded are two separate operations on the ConcurrentDictionary. If two threads call TryAddBlock concurrently, both may succeed TryAdd, both enter PruneIfNeeded, both find the same oldestHash during their respective scans, the first TryRemove succeeds, and the second returns false — which causes PruneIfNeeded to return even though the count is still above the limit. The Engine API is effectively serialized today, so this is low risk in practice, but the invariant is fragile for future callers.

Low — Misplaced tests

Tests for BlockCacheService pruning behavior are added to InvalidChainTrackerTest.cs and its namespace/class. They should live in a dedicated test class (e.g. BlockCacheServiceTest.cs in the same directory) to follow the single-responsibility convention of the rest of the test suite.

Low — O(n) full scan per prune invocation

TryRemoveOldestBlock iterates the entire BlockCache dictionary on every pruning call. At the default cap of Reorganization.MaxDepth * 2 + 16 = 144 this is negligible, but the formula and its rationale should be documented (a single code comment would suffice).
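The suggested rationale comment might look roughly like this (wording is illustrative, not taken from the PR):

```csharp
// Default cap: Reorganization.MaxDepth * 2 + 16 = 144 entries. At this
// size a full O(n) scan per prune is cheaper than maintaining a second
// index keyed by block number.
private readonly int _maxCachedBlocks;
```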

Low — Missing XML documentation on the new public API member

IBlockCacheService.TryAddBlock is a public interface member. Per the coding style rules, public APIs should carry <summary> / <returns> documentation. The current implementation has no doc comment.
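The doc comment could look roughly like this (wording illustrative; the return description must track whatever semantics the implementation settles on):

```csharp
/// <summary>
/// Adds <paramref name="block"/> to the dangling-block cache, pruning
/// unprotected entries once the configured limit is exceeded.
/// </summary>
/// <returns><c>true</c> if the block is cached after the call; otherwise <c>false</c>.</returns>
bool TryAddBlock(Block block);
```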


Minor nit

BeaconSync.StopSyncing previously called _blockCacheService.BlockCache.Clear() directly; changing it to _blockCacheService.Clear() is the right cleanup and is correct.


Verdict: The fix addresses a real resource-exhaustion bug. The core mechanism (bound + prune) is sound, but the pruning strategy (evict lowest-numbered block) is likely backwards for the main beacon sync use case and warrants rethinking before this ships.

}

private bool TryRemoveOldestBlock()
{
Contributor


Medium — Pruning strategy evicts the wrong end of the chain for beacon sync

TryRemoveOldestBlock finds the block with the lowest Number and evicts it. During beacon sync, blocks arrive tip-first (highest number first). TryInsertDanglingBlock walks parent pointers downward from the new block to find a known anchor. Evicting the lowest-numbered block removes the entry closest to the anchor — exactly the block that completes the chain walk.

Consider evicting the highest-numbered block instead (furthest from the sync anchor), or tracking insertion order via a parallel Queue<Hash256AsKey> and evicting the oldest-inserted entry regardless of block number.
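The insertion-order variant could be sketched like this (a hypothetical shape; the Hash256AsKey member access and field names are assumptions):

```csharp
// Hypothetical insertion-order eviction: a parallel queue records arrival
// order; pruning evicts the oldest-inserted entry that is still cached
// and not protected. Bounded to one pass over the queue.
private readonly Queue<Hash256AsKey> _insertionOrder = new();

private bool TryRemoveOldestInserted()
{
    int remaining = _insertionOrder.Count;
    while (remaining-- > 0 && _insertionOrder.TryDequeue(out Hash256AsKey candidate))
    {
        if (candidate.Value == FinalizedHash || candidate.Value == HeadBlockHash)
        {
            _insertionOrder.Enqueue(candidate); // keep protected entries queued
            continue;
        }

        if (_blockCache.TryRemove(candidate, out _))
        {
            return true; // evicted the oldest-inserted unprotected block
        }
        // entry was already removed elsewhere; try the next-oldest
    }

    return false; // everything still cached is protected
}
```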

public void Clear() => BlockCache.Clear();

private void PruneIfNeeded()
{
Contributor


Medium — Non-atomic prune loop can bail early under concurrent access

If two threads both call TryAddBlock concurrently, both enter PruneIfNeeded. Both may identify the same oldestHash during their scan. The first TryRemove succeeds; the second returns false, causing this loop to exit early while the count is still over the limit.

In practice the Engine API is serialized, so this is low risk today. Consider documenting the threading expectation or guarding with a lock / Interlocked.CompareExchange on a prune-in-progress flag.
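One way to realize the prune-in-progress flag mentioned above, sketched under the assumption that _blockCache and _maxCachedBlocks exist as in the PR:

```csharp
// Sketch of a prune-in-progress flag using Interlocked; illustrative only.
// A thread that loses the race skips pruning; the winner's loop re-reads
// Count, so the skipped insert is still accounted for.
private int _pruning; // 0 = idle, 1 = prune in progress

private void PruneIfNeeded()
{
    if (Interlocked.CompareExchange(ref _pruning, 1, 0) != 0)
    {
        return; // another thread is pruning and will observe the new count
    }

    try
    {
        while (_blockCache.Count > _maxCachedBlocks && TryRemoveOldestBlock())
        {
        }
    }
    finally
    {
        Volatile.Write(ref _pruning, 0);
    }
}
```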

@@ -12,4 +12,6 @@ public interface IBlockCacheService
public ConcurrentDictionary<Hash256AsKey, Block> BlockCache { get; }
Contributor


Medium — BlockCache exposes mutable ConcurrentDictionary, allowing callers to bypass TryAddBlock

Any caller can still do BlockCache.TryAdd(hash, block) or BlockCache[key] = block, bypassing the bounding logic. The PR routes all current insert sites through TryAddBlock (good), but the interface offers no enforcement.

Options:

  • Change to IReadOnlyDictionary<Hash256AsKey, Block> for reads (all current external usages are reads or removes, not inserts).
  • Or add a /// <remarks>Writes must go through <see cref="TryAddBlock"/>.</remarks> to document the contract.

}

[Test]
public void block_cache_service_prunes_oldest_unprotected_blocks()
Contributor


Low — Tests for BlockCacheService are misplaced in InvalidChainTrackerTest

These two new tests exercise BlockCacheService pruning behavior, not InvalidChainTracker. They should live in a dedicated BlockCacheServiceTest.cs to keep the test class focused and easy to discover.

Also missing: a test for the all-protected overflow scenario — when every cached block is either FinalizedHash or HeadBlockHash, pruning silently bails and the cache exceeds its limit. A test documenting that behavior would prevent surprises.

@github-actions Bot added the sync label Apr 30, 2026
@JayeTurn
Contributor Author

@LukaszRozmej Updated, please take another look, thanks for your review!

@LukaszRozmej
Member

@claude re-review

@claude
Contributor

claude Bot commented Apr 30, 2026

Claude finished @LukaszRozmej's task in 3m 32s


Re-review: fix(merge): bound dangling block cache growth

  • Read relevant rule files
  • Read changed files
  • Analyze correctness and edge cases
  • Check previous findings were addressed
  • Post findings

Summary

All Medium findings from the original review have been resolved across 4 follow-up commits. The implementation is correct and well-tested. One minor informational note remains.


Previous findings — status

1. Medium — Pruning strategy evicted the lowest-numbered block, severing the beacon sync chain walk. ✅ Fixed — 68d2b3e changed eviction to the highest-numbered unprotected block.
2. Medium — BlockCache property exposed a mutable ConcurrentDictionary, allowing writes to bypass the bound. ✅ Fixed — 79867ea narrowed it to IReadOnlyDictionary<Hash256AsKey, Block>; all direct BlockCache[k] = v writes replaced with TryAddBlock.
3. Medium — All-protected overflow scenario untested (prune silently bails). ✅ Addressed — preserves_protected_blocks_when_all_candidates_are_protected test added in BlockCacheServiceTest.cs.
4. Medium — Non-atomic add+prune under concurrent access could leave the count above the limit. ✅ Fixed — 019f1eb added a Lock _pruneLock serializing TryAddBlock, TryRemoveBlock, and Clear.
5. Low — Pruning tests misplaced in InvalidChainTrackerTest. ✅ Fixed — dedicated BlockCacheServiceTest.cs added.
6. Low — O(n) scan rationale undocumented. ✅ Addressed — comment added explaining the small bound makes a second index unnecessary.
7. Low — Missing XML docs on TryAddBlock. ✅ Fixed — <summary> and <returns> docs added to interface and implementation.

Remaining note (informational)

Low — TryAddBlock returns true when the added block is immediately self-evicted

If the newly added block has the highest number among all unprotected entries, the prune loop selects and removes it on the same call, leaving it absent from the cache while the method returns true. The prunes_highest_unprotected_block test already documents this: TryAddBlock(block3).Should().BeTrue() yet ContainsKey(block3Hash).Should().BeFalse().

No current callers check the return value (NewPayloadHandler lines 174 and 481 both discard it), so this is not a correctness issue. It is worth noting for future callers.


Verdict

All Medium issues are resolved. The fix is sound: the bounded cache correctly evicts the highest-numbered (tip-side) blocks to preserve the anchor-proximal blocks that complete the chain walk, all writes go through the bounded API, concurrency is properly serialized, and the edge cases are covered by tests.

Mergeable.

Member

@LukaszRozmej left a comment


@claude post suggestions how to de-dup tests in src/Nethermind/Nethermind.Merge.Plugin.Test/BlockCacheServiceTest.cs

Could we just use LruCache?

public bool TryAddBlock(Block block)
{
Hash256 blockHash = block.Hash ?? block.CalculateHash();
lock (_pruneLock)
Member


lock + ConcurrentDictionary seems redundant

}
}

return added;
Member


yep could be just removed


@claude
Contributor

claude Bot commented Apr 30, 2026

Claude finished @LukaszRozmej's task in 2m 56s


Suggestions: test de-dup + LruCache viability


1. Can we use LruCache?

No — it doesn't fit the business logic here. The existing LruCache<TKey, TValue> (Nethermind.Core/Caching/LruCache.cs) evicts the least-recently-accessed entry via a doubly-linked list. This conflicts with two requirements:

  • Never evict protected keys (FinalizedHash / HeadBlockHash): the current implementation skips them explicitly in TryGetHighestNumberedUnprotectedBlock; LruCache has no pinning concept.
  • Evict the highest-numbered block (furthest from the sync anchor): the current implementation does a max-scan by block.Number; LruCache evicts the least-recently-used entry regardless of block number.

Forcing LruCache to work here would require adding both a pin-key API and reversing the eviction ordering — at which point you've re-implemented what's already here. The current bespoke approach is the right fit.


2. lock + ConcurrentDictionary — redundancy (line 53 comment)

The reviewer is correct. _pruneLock already serializes all mutating paths (TryAddBlock, TryRemoveBlock, Clear), so ConcurrentDictionary's internal per-bucket locking is pure overhead on every write. However, there is one catch before switching to a plain Dictionary:

BlockCache is exposed as IReadOnlyDictionary<Hash256AsKey, Block> and external callers iterate it unsynchronized. The most notable example is UnsafeStartingSyncPivotUpdater.cs:73:

foreach (Block block in _blockCacheService.BlockCache.Values)

A plain Dictionary would not be safe to iterate while another thread holds _pruneLock and mutates it. ConcurrentDictionary provides snapshot-consistent iteration for free.

Two viable paths:

Path A — remove ConcurrentDictionary, expose a snapshot method (cleaner long-term):

private readonly Dictionary<Hash256AsKey, Block> _blockCache = new();

// New interface method replacing raw BlockCache.Values iteration:
public IReadOnlyList<Block> GetCachedBlocks()
{
    lock (_pruneLock) { return [.._blockCache.Values]; }
}

All BlockCache.Values / BlockCache.TryGetValue callsites then go through the interface. BlockCache can be removed from IBlockCacheService or narrowed further. This is the right direction but is a bigger diff.

Path B — keep ConcurrentDictionary, drop _pruneLock, accept non-atomic add+prune (pragmatic):
Since Engine API calls are already externally serialized, the add+prune race is theoretical rather than practical. Removing _pruneLock and relying solely on ConcurrentDictionary would eliminate the redundancy with minimal risk. The while loop's TryRemove returning false on a concurrent miss is harmless — the next TryAddBlock will prune again.
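Under Path B, TryAddBlock might reduce to something like this (a sketch only; the eviction helper name and the TryAdd-failure return semantics are assumptions):

```csharp
// Path B sketch: no lock, ConcurrentDictionary only. A TryRemove that loses
// a race just ends this prune pass; the next TryAddBlock prunes again.
public bool TryAddBlock(Block block)
{
    Hash256 blockHash = block.Hash ?? block.CalculateHash();
    if (!_blockCache.TryAdd(blockHash, block))
    {
        return false; // already cached
    }

    while (_blockCache.Count > _maxCachedBlocks && TryRemoveHighestUnprotectedBlock())
    {
    }

    // False when the new block itself was the highest unprotected entry
    // and got self-evicted by the loop above.
    return _blockCache.ContainsKey(blockHash);
}
```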


3. Test de-duplication in BlockCacheServiceTest.cs

The three tests share a boilerplate pattern — create a block, immediately compute its hash:

Block block1 = Build.A.Block.WithNumber(1).TestObject;
Hash256 block1Hash = block1.GetOrCalculateHash();
// ...repeated 2 more times

Suggestion — extract a factory helper (cuts 3 lines per test):

private static (Block block, Hash256 hash) MakeBlock(long number)
{
    Block b = Build.A.Block.WithNumber(number).TestObject;
    return (b, b.GetOrCalculateHash());
}

Tests 1 and 2 then become:

var (b1, h1) = MakeBlock(1);
var (b2, h2) = MakeBlock(2);
var (b3, h3) = MakeBlock(3);

Tests 1 and 2 also share the same structural skeleton (capacity=2, add 3 blocks, assert count=2, assert 3 membership results). They could be collapsed into a single [TestCaseSource] test, but the setup differs enough (protected hashes) that the parameterization would likely reduce readability. The factory helper alone is the right trade-off.


Summary of recommended actions:

  • LruCache: keep the current implementation; LruCache doesn't support pinned keys or number-based eviction.
  • lock + ConcurrentDictionary: either Path A (switch to Dictionary plus a GetCachedBlocks() snapshot) or Path B (drop _pruneLock, keep ConcurrentDictionary) removes the redundancy.
  • Test de-dup: extract a MakeBlock(long) helper to eliminate the repeated block-plus-hash creation lines.

@LukaszRozmej
Member

Path B please

@JayeTurn
Contributor Author

sure, will do.

@JayeTurn
Contributor Author

@LukaszRozmej Done.

@LukaszRozmej
Member

@claude re-review

@claude
Contributor

claude Bot commented May 1, 2026

Claude finished @LukaszRozmej's task in 2m 36s


Re-review: fix(merge): bound dangling block cache growth

  • Read rule files and gather context
  • Read current state of changed files
  • Analyze recent commits (Path B + return-value fixes)
  • Check LukaszRozmej's 12:37 inline review comments were addressed
  • Post findings

Summary

The three post-approval commits correctly implement the final round of changes requested by LukaszRozmej. The implementation is clean and ready.


Recent commits reviewed

  • 13caefd — Remove _pruneLock (Path B). ✅ Correct — ConcurrentDictionary retained for safe concurrent iteration.
  • d358e02 — Track blockRemainsCached; return false when the added block is immediately self-evicted. ✅ Correct.
  • 595626a — Refactor: early return on TryAdd failure; rename added → blockRemainsCached for clarity. ✅ Correct.

LukaszRozmej's 12:37 inline comments — verified resolved

The two inline comments ("if added here" / "can be removed here — should return false then") asked for an early return when TryAdd fails, and for the added variable to be replaced by a blockRemainsCached flag that is set to false only when the newly added block gets pruned. Both changes are now in place (BlockCacheService.cs:51-53 and 62-65).


All-callers check — no mutable BlockCache access remaining

Every remaining reference to BlockCache in non-test production code is read-only:

  • StartingSyncPivotUpdater.cs:135 — TryGetValue
  • UnsafeStartingSyncPivotUpdater.cs:73 — .Values iteration (safe on ConcurrentDictionary)
  • ForkchoiceUpdatedHandler.cs:122 — TryGetValue
  • NewPayloadHandler.cs:474 — TryGetValue
  • InvalidChainTracker.cs:105 — TryGetValue

IBlockCacheService.BlockCache is typed as IReadOnlyDictionary<Hash256AsKey, Block>, so the type system prevents callers from bypassing TryAddBlock at compile time.


Informational note

TryAddBlock returns true in the all-protected overflow case — when every cached block is protected (FinalizedHash / HeadBlockHash), pruning bails and the method returns true even though the count exceeded _maxCachedBlocks. This is the correct safe behaviour (never evict the chain tip or finalized block), and the preserves_protected_blocks_when_all_candidates_are_protected test documents it explicitly.


Verdict: Mergeable. ✅ All Medium findings from previous rounds are resolved. Path B is implemented correctly, return value semantics are accurate, and all remaining BlockCache accesses are read-only.

/// Initializes a block cache with the default merge sync bound.
/// </summary>
public BlockCacheService()
: this((int)(Reorganization.MaxDepth * 2 + 16))
Contributor

@svlachakis commented May 2, 2026


@LukaszRozmej isn't this going to make cache less effective when we support longer reorgs soon? Where is this number coming from?

Also how frequently can we have incoming chain from CL bigger than 144 blocks? I think this number needs more thought for all the chains we support - maybe for ethereum mainnet is fine.

Maybe change approach and evict by fork? Maybe make it configurable per chain?

@claude any ideas?

