fix(snap-sync): three Windows-flake fixes for E2ESyncTests.SnapSync#11443
LukaszRozmej wants to merge 6 commits
Conversation
test(snap-sync): add Windows-flake stress reproducer

Adds a 30-iteration stress reproducer for the Windows-only E2ESyncTests.SnapSync flake observed in CI (run #25139586200, job #73686059011 on PR #11422). Each iteration is its own NUnit case so a single failure does not terminate the rest, no [Retry] so every flake surfaces, and [Explicit] so it does not run in normal CI. Scoped to DbMode.Flat where the flake has been observed.

Local reproduction on Windows (~1m runtime, 60 iterations across the two Flat fixtures): 9-19 failures with both observed CI patterns — "Verification failed: ... N missing in flat" and "TrieNodeException: Failed to load key X from root hash Y" during post-snap-sync block processing.

Run: `dotnet test --project Nethermind.Synchronization.Test -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
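For orientation, a minimal sketch of the reproducer shape this commit describes — one NUnit case per iteration via a `[Range]` parameter, `[Explicit]`, and deliberately no `[Retry]`. The class name, helper, and timeout here are illustrative assumptions, not the actual test code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using NUnit.Framework;

public partial class E2ESyncTests
{
    // Each [Range] value becomes its own NUnit case, so one flaky failure
    // doesn't terminate the remaining iterations; [Explicit] keeps the
    // reproducer out of normal CI runs.
    [Test]
    [Explicit("Stress reproducer — run manually; see PR #11443 for context.")]
    public async Task SnapSync_StressRepro([Range(0, 29)] int iteration)
    {
        using CancellationTokenSource cts = new(TimeSpan.FromMinutes(2)); // assumed timeout
        await RunSnapSyncOnce(cts.Token); // shared helper with the same body as SnapSync()
    }
}
```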
Claude finished @LukaszRozmej's task in 2m 49s

Code Review
Summary

This is a well-motivated diagnostic PR.

Findings

Medium — Code duplication violates DRY and repo test rules
```csharp
private async Task RunSnapSyncClientOnce(CancellationToken cancellationToken)
{
PrivateKey clientKey = TestItem.PrivateKeyD;
await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
{
SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
syncConfig.FastSync = true;
syncConfig.SnapSync = true;
await SetPivot(syncConfig, cancellationToken);
INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
networkConfig.P2PPort = AllocatePort();
networkConfig.FilterPeersByRecentIp = false;
networkConfig.FilterDiscoveryNodesByRecentIp = false;
});
await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationToken);
}
```

Then both tests call this helper.

Low — Verbose multi-line comment block violates coding style

The 7-line comment block above the test violates the one-line-max rule; trim it to a single line:

```csharp
// Stress reproducer — run manually; see PR #11443 for context.
```

Positive notes
```csharp
PrivateKey clientKey = TestItem.PrivateKeyD;
await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
{
    SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
    syncConfig.FastSync = true;
    syncConfig.SnapSync = true;

    await SetPivot(syncConfig, cancellationTokenSource.Token);

    INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
    networkConfig.P2PPort = AllocatePort();
    networkConfig.FilterPeersByRecentIp = false;
    networkConfig.FilterDiscoveryNodesByRecentIp = false;
});

await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationTokenSource.Token);
}
```
The body of this test is a verbatim copy of SnapSync(). Per the repo's test-infrastructure rules, duplicating test method bodies is not allowed — extract the shared logic into a private helper (e.g. RunSnapSyncClientOnce(CancellationToken)) and have both methods call it.
```csharp
// Stress reproducer for the Windows-only SnapSync flake observed in CI:
// - run #25139586200 / job #73686059011 (PR #11422)
// Re-runs the same logic as SnapSync() many times, each iteration as its own NUnit case
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).
```
This 7-line comment block violates the coding-style rule: "Don't write multi-paragraph docstrings or multi-line comment blocks — one short line max." The run IDs, invocation instructions, and rationale belong in the PR description (where they already are). Trim to a single line, e.g.:
```csharp
// Stress reproducer — run manually; see PR #11443 for context.
```
SnapshotableMemColumnsDb.CreateSnapshot iterates columns in a loop and captures each column's snapshot independently. InMemoryColumnWriteBatch.Dispose iterates the per-column batches and disposes each independently. Without a global guard, a CreateSnapshot call concurrent with an in-flight writeBatch.Dispose can capture some columns AFTER the new writes and others BEFORE — producing a cross-column-inconsistent reader view.

This race only matters in the test backend (RocksDB has atomic cross-CF snapshots), but it manifested concretely in E2ESyncTests.SnapSync on Windows: post-snap-sync block processing leases a fresh persistenceReader while the persistence pipeline is committing, gets a snapshot whose Accounts column is updated but whose StateNodes column is not, then walks the trie at the new state root and throws TrieNodeException for a node that "should" be there.

Add a ReaderWriterLockSlim around CreateSnapshot (read) and writeBatch.Dispose (write) so multi-column commits are atomic w.r.t. snapshot creation. Multiple snapshots can still proceed concurrently — the only contention is the rare overlap of a snapshot creation with a multi-column commit.

This eliminates the TrieNodeException failure mode in the stress reproducer (`SnapSync_StressRepro`). The "missing in flat" mode is a separate, deterministic bug (the same ~9 addresses miss reliably) that appears unaffected by this fix and needs follow-up investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
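As a hedged illustration of that locking shape — simplified stand-in types, not the real `SnapshotableMemColumnsDb` / `InMemoryColumnWriteBatch` surfaces:

```csharp
using System.Collections.Generic;
using System.Threading;

// One ReaderWriterLockSlim makes multi-column commits atomic with respect to
// snapshot creation, while any number of snapshots can still be taken
// concurrently on the read side.
public class SnapshotableColumnsDbSketch
{
    private readonly ReaderWriterLockSlim _snapshotLock = new();
    private readonly Dictionary<string, Dictionary<string, byte[]>> _columns = new();

    public Dictionary<string, Dictionary<string, byte[]>> CreateSnapshot()
    {
        _snapshotLock.EnterReadLock();
        try
        {
            // The per-column copies are only consistent as a set because a
            // multi-column commit cannot interleave with this loop.
            Dictionary<string, Dictionary<string, byte[]>> snapshot = new();
            foreach ((string column, Dictionary<string, byte[]> data) in _columns)
                snapshot[column] = new Dictionary<string, byte[]>(data);
            return snapshot;
        }
        finally { _snapshotLock.ExitReadLock(); }
    }

    // Stands in for the write batch's Dispose: all columns land atomically.
    public void CommitMultiColumnBatch(IEnumerable<(string Column, string Key, byte[] Value)> writes)
    {
        _snapshotLock.EnterWriteLock();
        try
        {
            foreach ((string column, string key, byte[] value) in writes)
            {
                if (!_columns.TryGetValue(column, out Dictionary<string, byte[]>? data))
                    _columns[column] = data = new Dictionary<string, byte[]>();
                data[key] = value;
            }
        }
        finally { _snapshotLock.ExitWriteLock(); }
    }
}
```

The read/write split preserves the property the commit calls out: snapshot creation stays concurrent with other snapshot creation and only contends with the rare multi-column commit.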
Mode A diagnostic — phase-1 SnapSync trees not disposed when verifier runs

Added targeted instrumentation to tree creation and disposal. In iter 10 of the traced run: 27 trees created, 24 disposed, 3 stuck (id=499 among them).

Why a delay-based fix doesn't work:
| Approach | Result |
|---|---|
| `Task.Delay(1s)` before FlushCache | 12-15 fail (similar baseline) |
| `Task.Delay(10s)` before FlushCache | 13 fail (similar baseline) |
Even 10s isn't enough — some disposals appear to take much longer or are genuinely stuck on something. A delay-based fix isn't sufficient.
Proposed real fix (next step, not in this PR)
Add an in-flight counter to FlatSnapTrieFactory:

- Increment on `CreateStateTree`/`CreateStorageTree`.
- Decrement when the returned tree's `Dispose()` completes.
- Expose `WaitForDrained(timeout)`.
- Call `WaitForDrained` in `FlatTreeSyncStore.FinalizeSync` before returning.
That would block the sync→post-sync transition until every phase-1 writeBatch is committed, eliminating the race. Production also benefits: a node that restarts mid-snap-sync currently risks losing uncommitted phase-1 writes.
…izeSync

Phase-1 ISnapTrees produced by FlatSnapTrieFactory each hold a per-tree IPersistence.IWriteBatch. The writeBatch's writes only become visible to subsequent IPersistence readers after the tree's Dispose() returns. When FlatTreeSyncStore.FinalizeSync ran without waiting for outstanding tree disposals, callers that observed "sync finished" could read state that was missing accounts whose enclosing tree hadn't fully disposed yet.

Track in-flight ISnapTrees on FlatSnapTrieFactory (incremented on Create*Tree, decremented in tree Dispose via a back-reference) and have FlatTreeSyncStore.FinalizeSync drain that counter (with a 30s timeout) before flushing. This closes the gap where a snap-sync writeBatch was queued but not yet committed when the verifier ran.

Reduces E2ESyncTests.SnapSync_StressRepro Windows flake rate from ~13-18/60 to ~8-15/60. Doesn't fully eliminate Mode A — the remaining failures involve a deterministic set of ~9 paths that appear unaffected by the drain (likely a separate issue in how state-nodes-sync writes flat entries for some specific topology, requiring further work).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
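A sketch of the drain mechanism using the names from the commit message (the real factory threads the decrement through a back-reference each tree holds):

```csharp
using System;
using System.Threading;

public class FlatSnapTrieFactorySketch
{
    private int _inFlightTrees;

    // Called from CreateStateTree / CreateStorageTree before a tree is handed out.
    public void OnTreeCreated() => Interlocked.Increment(ref _inFlightTrees);

    // Called from the tree's Dispose() once its write batch has committed.
    public void OnTreeDisposed() => Interlocked.Decrement(ref _inFlightTrees);

    // Spin-wait until every handed-out tree has disposed; false on timeout.
    public bool WaitForInFlightTreesDrained(TimeSpan timeout)
    {
        long deadline = Environment.TickCount64 + (long)timeout.TotalMilliseconds;
        while (Volatile.Read(ref _inFlightTrees) > 0)
        {
            if (Environment.TickCount64 >= deadline) return false;
            Thread.Sleep(10); // coarse poll; tree disposal latency dominates
        }
        return true;
    }
}
```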
Update — added drain fix for phase-1 in-flight trees
| Configuration | Failures | Failure type |
|---|---|---|
| Master baseline | 9–19 | Mix of Mode A + Mode B |
| + cross-column snapshot fix | 13–18 | All Mode A |
| + drain fix (this) | 8–15 | All Mode A |
Modest but real improvement. The drain doesn't fully eliminate Mode A.
What's still unfixed
The remaining Mode A failures hit the same deterministic 9 paths as before — the drain doesn't change the failure signature, it just makes failure less frequent. In iterations where phase-1 (state ranges) doesn't progress at all (logs show Phase 1: ( 0.00 %) then immediately finished), all work happens via phase-2 state-nodes-sync (FlatTreeSyncStore.SaveNode), and the drain catches no trees. These iterations still fail with the same 9 missing accounts.
So the remaining bug is in phase-2 state-nodes-sync's flat-entry path, not phase-1. The 9 deterministic addresses suggest a specific trie topology (something about how those leaves are reached) that causes WriteAccountFlatEntries to miss them, or some other structural issue in FlatTreeSyncStore.SaveNode.
Investigation handed off to whoever picks this up next: the data points to either (a) leaves that are inline in some node type that WriteAccountFlatEntries doesn't handle, or (b) a specific code path in the Branch deletion-range computation that's incorrect for those 9 addresses' positions in the trie.
The fast-path `if (_initialized) return;` lets a concurrent caller skip EnsureDatabaseCleared and proceed to write accounts while the very first caller is still inside persistence.Clear() — iterating GetAllKeys() and queueing Removes for every existing key. When that batch finally disposes, those Removes overwrite the just-written accounts, leaving the flat DB missing entries that the trie still references. The verifier then reports "Account in trie not found in flat" for whichever accounts landed in the window.

Stress-running E2ESyncTests.SnapSync on Windows reproduced this in roughly 13–18 of 60 iterations; setting _initialized after Clear() drops failures to 0/300.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
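A sketch of the corrected publication order — the same double-checked pattern as the snippet quoted below, but with the flag published only after the clear finishes (`IPersistenceSketch` is a stand-in):

```csharp
using System.Threading;

public interface IPersistenceSketch { void Clear(); }

public class FlatSnapTrieFactorySketch
{
    private readonly Lock _lock = new();
    private volatile bool _initialized; // volatile: safe publication to fast-path readers
    private readonly IPersistenceSketch _persistence;

    public FlatSnapTrieFactorySketch(IPersistenceSketch persistence) => _persistence = persistence;

    private void EnsureDatabaseCleared()
    {
        if (_initialized) return; // fast path is only ever true once Clear() has completed
        using (_lock.EnterScope())
        {
            if (_initialized) return;
            _persistence.Clear(); // finish wiping the flat DB first...
            _initialized = true;  // ...then publish
        }
    }
}
```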
Root cause found and fixed in 73b160f. The remaining "missing in flat" failures came from a race in `FlatSnapTrieFactory.EnsureDatabaseCleared`:

```csharp
if (_initialized) return;
using (_lock.EnterScope())
{
    if (_initialized) return;
    _initialized = true;   // ← published BEFORE Clear() ran
    persistence.Clear();   // ← still iterating GetAllKeys() / queueing Removes
}
```

Sequence that wipes accounts:
1. T1 enters `EnsureDatabaseCleared`, sets `_initialized = true`, then starts `persistence.Clear()`, which iterates `GetAllKeys()` per column and queues a `Remove` for every key into one columns batch.
2. T2 hits the fast path `if (_initialized) return;`, sees `true`, and skips Clear entirely.
3. T2 starts writing accounts via `SetAccount`; those write batches commit normally.
4. T1's still-running `Clear()` sees the keys T2 just committed and queues Removes for them; when its batch disposes, those Removes wipe T2's freshly-written accounts.

How I confirmed it: instrumented every source-side write and sink-side Remove; 507,072 of the 509,598 sink Removes came from `ClearAllColumns`.

Fix: publish `_initialized` only after `Clear()` returns (and mark it volatile).

Verified: 5 stress runs × 60 tests = 300 invocations, 0 failures. Previously 8–15 failures per 60-test run on Windows. This was the last bug. Combined effect of the three commits in this PR:
Marking the PR ready for review.
Claude finished @LukaszRozmej's task in 4m 11s

Code Review
Summary

All three bug fixes are correct. The two findings from the previous review remain unaddressed. One prior Medium (code duplication) must be fixed before merge.

Findings

Medium — Code duplication still present (prior finding, unaddressed)
Low — Verbose comment block (prior finding, unaddressed)

The 7-line comment at lines 431–437 still violates the coding-style one-line-max rule. The run IDs and invocation instructions already live in the PR description.

Low — `WaitForInFlightTreesDrained` bool return silently discarded
```csharp
});

await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationTokenSource.Token);
}
```
Still a verbatim copy of SnapSync() (lines 412–428). Per coding-style.md ("extract repeated blocks (5+ lines)") and test-infrastructure.md ("Do not duplicate test methods"), extract the shared body into a private helper and have both methods call it:
```csharp
private async Task RunSnapSyncOnce(CancellationToken cancellationToken)
{
PrivateKey clientKey = TestItem.PrivateKeyD;
await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
{
SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
syncConfig.FastSync = true;
syncConfig.SnapSync = true;
await SetPivot(syncConfig, cancellationToken);
INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
networkConfig.P2PPort = AllocatePort();
networkConfig.FilterPeersByRecentIp = false;
networkConfig.FilterDiscoveryNodesByRecentIp = false;
});
await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationToken);
}
```

Then both tests become:
```csharp
await RunSnapSyncOnce(cancellationTokenSource.Token);
```

```csharp
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).
```
7-line comment block violates the one-line-max rule in coding-style.md. The run IDs, invocation instructions, and rationale live in the PR description and in the [Explicit] message. Trim to a single line:
```csharp
// Stress reproducer for SnapSync Windows flake — run manually; see PR #11443 for context.
```
```csharp
// aren't yet committed to the underlying persistence. Without this drain a caller that
// observes "sync finished" can read state that's missing accounts from late-disposed
// trees (manifests as E2ESyncTests.SnapSync "Verification failed: N missing in flat").
snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30));
```
The bool return is silently discarded. On a 30-second drain timeout the warning is logged inside WaitForInFlightTreesDrained, but FinalizeSync proceeds to persistence.Flush() without any signal at this layer. Add a site-local log so finalization visibility is complete:
```csharp
if (!snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30)))
    if (_logger.IsWarn) _logger.Warn("FinalizeSync: in-flight ISnapTree drain timed out — flat state may be incomplete");
```
@claude re-review
Claude finished @LukaszRozmej's task in 3m 9s

Code Review (re-review)
Summary

All three production bug fixes are correct and well-implemented. Three prior findings remain unaddressed (one Medium, two Low), and one new Low was identified. The Medium blocks merge.

Findings

Medium — Code duplication (3rd occurrence, unaddressed)
Low — Verbose comment block (3rd occurrence, unaddressed)

The 7-line comment at lines 431–437 still violates the one-line-max rule.

Low — `WaitForInFlightTreesDrained` bool return discarded (2nd occurrence, unaddressed)
```csharp
});

await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationTokenSource.Token);
}
```
Medium (third occurrence — unaddressed): Lines 447–462 are still a verbatim copy of SnapSync() lines 412–428. coding-style.md requires extracting repeated blocks of 5+ lines into shared methods. Extract to a private helper:
```csharp
private async Task RunSnapSyncOnce(CancellationToken cancellationToken)
{
PrivateKey clientKey = TestItem.PrivateKeyD;
await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
{
SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
syncConfig.FastSync = true;
syncConfig.SnapSync = true;
await SetPivot(syncConfig, cancellationToken);
INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
networkConfig.P2PPort = AllocatePort();
networkConfig.FilterPeersByRecentIp = false;
networkConfig.FilterDiscoveryNodesByRecentIp = false;
});
await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationToken);
}
```

Both SnapSync() and SnapSync_StressRepro() then become a one-liner call to this helper.
```csharp
// Stress reproducer for the Windows-only SnapSync flake observed in CI:
// - run #25139586200 / job #73686059011 (PR #11422)
// Re-runs the same logic as SnapSync() many times, each iteration as its own NUnit case
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).
```
Low (third occurrence — unaddressed): This 7-line block still violates coding-style.md ("Code comments explain why, not what"). The [Explicit] message (line 438) already covers the essential intent for IDE readers. Run IDs and invocation instructions belong in the PR description (where they already are). Trim to one line:
```csharp
// Stress reproducer for the SnapSync Windows flake — run manually; see PR #11443 for context.
```
```csharp
// aren't yet committed to the underlying persistence. Without this drain a caller that
// observes "sync finished" can read state that's missing accounts from late-disposed
// trees (manifests as E2ESyncTests.SnapSync "Verification failed: N missing in flat").
snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30));
```
Low (second occurrence — unaddressed): The bool return from WaitForInFlightTreesDrained is silently discarded. The timeout warning is logged inside the callee, but FinalizeSync proceeds to persistence.Flush() with no signal at this layer. Add a site-local log so finalization-level visibility is complete:
```csharp
if (!snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30)))
    if (_logger.IsWarn) _logger.Warn("FinalizeSync: in-flight ISnapTree drain timed out — flat state may be incomplete");
```
```diff
  IPersistence.IPersistenceReader reader = persistence.CreateReader(ReaderFlags.Sync);
  IPersistence.IWriteBatch writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
- return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager);
+ OnTreeCreated();
+ return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);
```
Low (new): OnTreeCreated() is called at line 66 before new FlatSnapStateTree(...) finishes constructing. If the constructor throws (e.g. new StateTree(...) fails), OnTreeDisposed() is never called — the in-flight counter is permanently incremented, causing WaitForInFlightTreesDrained to spin until its 30-second timeout on every subsequent FinalizeSync call.
The fix: call OnTreeCreated() inside the FlatSnapStateTree constructor (where _ownerFactory is already available), and have the finally block in Dispose pair with it. This way the increment is always matched by the decrement, even if the caller-side code throws after creation.
Same issue exists in CreateStorageTree / FlatSnapStorageTree.
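A sketch of the constructor-side pairing this suggests, reduced to the counter mechanics (stub types, not the real classes):

```csharp
using System;

public class SnapTreeFactoryStub
{
    public void OnTreeCreated() { /* Interlocked.Increment(ref _inFlightTrees) */ }
    public void OnTreeDisposed() { /* Interlocked.Decrement(ref _inFlightTrees) */ }
}

public sealed class FlatSnapStateTreeSketch : IDisposable
{
    private readonly SnapTreeFactoryStub _ownerFactory;
    private bool _disposed;

    public FlatSnapStateTreeSketch(SnapTreeFactoryStub ownerFactory)
    {
        _ownerFactory = ownerFactory;
        // Increment as the constructor's last statement: if anything earlier
        // throws, no increment has happened, so nothing can leak.
        _ownerFactory.OnTreeCreated();
    }

    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        try
        {
            // ... commit the per-tree write batch ...
        }
        finally
        {
            _ownerFactory.OnTreeDisposed(); // always pairs the constructor's increment
        }
    }
}
```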
…rden tree-counter on throw

- Extract `RunSnapSyncOnce(CancellationToken)` and call from both `SnapSync()` and `SnapSync_StressRepro(int)` (was a verbatim copy of the test body).
- Trim the stress-repro comment block to a single line (coding-style: one-line max).
- Update `FlatTreeSyncStoreTests` constructor call for the new `FlatSnapTrieFactory` parameter on `FlatTreeSyncStore` (build was failing).
- Drop unused `using System;` / `using System.Threading;` (ImplicitUsings is on for this project; lint flagged them).
- Log a site-local warning in `FinalizeSync` when `WaitForInFlightTreesDrained` returns false, instead of silently discarding the bool — finalization-level visibility was missing.
- In `FlatSnapTrieFactory.CreateStateTree` / `CreateStorageTree`, pair every `OnTreeCreated()` with cleanup if the subsequent constructor throws; otherwise the in-flight counter leaks and `WaitForInFlightTreesDrained` would spin to its 30-second timeout on every `FinalizeSync`. Also dispose the per-tree reader and write batch on the throw path so they don't leak either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@claude re-review
Claude finished @LukaszRozmej's task in 2m 30s

Code Review (re-review)
Summary

All four prior findings are resolved.

Prior findings — all resolved ✅
Positive notes
@claude will this cause performance regression?
Claude finished @LukaszRozmej's task in 1m 0s

Performance regression analysis
Short answer: No meaningful regression on any production path. Each fix touches a different scope; here's the breakdown:
| Fix | Affected scope | Overhead per tree | Total overhead |
|---|---|---|---|
| Cross-column lock | Test backend only | 0 in production | None |
| `Interlocked` counter | Snap sync tree lifetime | ~10 ns | ~10 ms over full sync |
| `volatile` read | Snap sync tree creation | ~0.7 ns delta | Sub-millisecond over full sync |
None of the changes touch the normal block-processing path (RocksDB, state reads/writes, EVM execution). The overhead is confined to snap sync tree creation/disposal, which is a one-time background operation, not a latency-sensitive path.
@claude does this also fix some potential edge-case production bugs?
Claude finished @LukaszRozmej's task in 2m 22s

Production bug analysis
Yes — two of the three fixes address real production bugs, not just test artifacts. Here's the breakdown:
| Fix | Affects production? | Impact if triggered |
|---|---|---|
| Fix 1 — cross-column lock | No — test backend only | N/A |
| Fix 2 — drain in-flight trees | Yes — FlatDB snap sync finalization | Flat state incomplete after sync; block replay errors |
| Fix 3 — `_initialized` DCL | Yes — FlatDB snap sync first-batch race | Accounts silently erased from flat; transactions fail post-sync |
Both Fix 2 and Fix 3 are latent production correctness bugs for any FlatDB node that performs snap sync. Fix 3 is harder to trigger on Linux (scheduler dependent) but carries a more severe consequence (silent data loss). Fix 2 is a simpler ordering gap that the drain closes cleanly. Both are worth having regardless of the test flake.
EXPB Benchmark Comparison — superblocks and realblocks scenarios (Client Processing (SSE), K6 TTFB).
```csharp
/// Spin-wait until all trees handed out by this factory have been disposed. Returns true
/// if drained within <paramref name="timeout"/>, false on timeout.
/// </summary>
public bool WaitForInFlightTreesDrained(System.TimeSpan timeout)
```
```csharp
public bool WaitForInFlightTreesDrained(TimeSpan timeout)
```
```csharp
IPersistence.IWriteBatch writeBatch;
try
{
    writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
}
catch
{
    reader.Dispose();
    throw;
}
OnTreeCreated();
try
{
    return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);
}
catch
{
    // Constructor failed: pair the OnTreeCreated above with OnTreeDisposed and release
    // the per-tree resources ourselves, since no FlatSnapStateTree.Dispose will run.
    OnTreeDisposed();
    writeBatch.Dispose();
    reader.Dispose();
    throw;
}
```
Alternative with single try/catch?
```csharp
IPersistence.IPersistenceReader? reader = null;
IPersistence.IWriteBatch? writeBatch = null;
bool treeTracked = false;
try
{
    reader = persistence.CreateReader(ReaderFlags.Sync);
    writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
    OnTreeCreated();
    treeTracked = true;
    return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);
}
catch
{
    if (treeTracked) OnTreeDisposed();
    writeBatch?.Dispose();
    reader?.Dispose();
    throw;
}
```
@claude what execution paths / operations could end up waiting for up to 30 seconds in `WaitForInFlightTreesDrained`?
Claude finished @alexb5dh's task in 1m 45s

Performance impact of the 30-second drain in `WaitForInFlightTreesDrained`
- Trim verbose comment on `_atomicityLock` in `SnapshotableMemColumnsDb` to a single line.
- Use plain `TimeSpan` (not `System.TimeSpan`) in `WaitForInFlightTreesDrained`; re-add `using System;`.
- Drop the duplicate warning at the `FinalizeSync` call site — `WaitForInFlightTreesDrained` already logs the timeout from inside; remove the now-unused `_logger` field.
- Refactor the `OnTreeCreated` / cleanup-on-throw blocks in `CreateStateTree` / `CreateStorageTree` to a single try/catch with `treeTracked` flag and nullable refs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexb5dh feedback addressed in 91db203:
@asdacap re your CHANGES_REQUESTED — opened #11472 stacked on
…Runner.Run (#11472)

* test(snap-sync): add Windows-flake stress reproducer

Adds a 30-iteration stress reproducer for the Windows-only E2ESyncTests.SnapSync flake observed in CI (run #25139586200, job #73686059011 on PR #11422). Each iteration is its own NUnit case so a single failure does not terminate the rest, no [Retry] so every flake surfaces, and [Explicit] so it does not run in normal CI. Scoped to DbMode.Flat where the flake has been observed.

Local reproduction on Windows (~1m runtime, 60 iterations across the two Flat fixtures): 9-19 failures with both observed CI patterns — "Verification failed: ... N missing in flat" and "TrieNodeException: Failed to load key X from root hash Y" during post-snap-sync block processing.

Run: `dotnet test --project Nethermind.Synchronization.Test -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run

Replaces the lazy-init lock + `_initialized` flag in `FlatSnapTrieFactory` (which raced against concurrent `CreateStateTree`/`CreateStorageTree` calls and caused `Clear()` to wipe freshly-written accounts) with explicit sequential lifecycle hooks driven by `SnapSyncRunner.Run`:

- `ISnapTrieFactory.EnsureInitialize()` (default no-op) — called once before any tree is created. `FlatSnapTrieFactory` clears the flat DB here.
- `ISnapTrieFactory.FinalizeSync()` (default no-op) — called once after the dispatcher has finished. `FlatSnapTrieFactory` flushes the persistence here.
- `SnapSyncRunner.Run` brackets `dispatcher.Run` with both calls.

Because both halves run from a single sequential entry point — and the new `SimpleDispatcher.Run` only returns once every batch handler (and therefore every `ISnapTree.Dispose`) has completed — the implementation needs no internal locking and the `_initialized` flag goes away.

This is the in-tree equivalent of the master-branch fix in #11443; the runner-driven shape suggested by @asdacap is cleaner because it doesn't depend on the factory guessing when "the first call" happens.

`PatriciaSnapTrieFactory` and the test factories pick up the empty default implementations of both new methods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): extract RunSnapSyncOnce shared helper

`SnapSync` and `SnapSync_StressRepro` had verbatim-copied bodies; extract a private `RunSnapSyncOnce(CancellationToken)` helper and have both call it. Trim the multi-line stress-repro comment to a single line per coding-style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): SnapSyncRunner lifecycle ordering test + cancel/throw cases

Adds an internal-visible Func<CancellationToken, Task> ctor on SnapSyncRunner so the lifecycle can be exercised without standing up a real SimpleDispatcher, and three NSubstitute-based tests:

- happy path: EnsureInitialize → dispatcher → FinalizeSync
- dispatcher throws: FinalizeSync still runs
- dispatcher cancelled: FinalizeSync still runs

Trim the doc comments on the new ISnapTrieFactory hooks to a single line per the repo's coding-style rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(snap-sync): make SnapSyncRunner Func ctor public

The Func<CancellationToken, Task> overload was internal so the test could reach it via InternalsVisibleTo. Make it public — it's a legitimate way to construct the runner, the surface is small, and DI still resolves the SimpleDispatcher overload because Func<CancellationToken, Task> isn't a registered service.

Promotes the public ctor pattern to use a primary constructor (matches the rest of the codebase) and drops the assembly-level InternalsVisibleTo that was added solely for this class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): collapse SnapSyncRunner ordering tests into a single TestCase

Three near-identical tests (happy / throws / cancels) folded into one parameterized method with a DispatcherOutcome enum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): switch expression for SnapSyncRunner dispatcher outcome

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
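A minimal sketch of the runner-driven lifecycle these commits describe — `EnsureInitialize` before the dispatcher, `FinalizeSync` in a finally so it also runs on throw or cancellation (signatures assumed, not the actual Nethermind code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public interface ISnapTrieFactory
{
    void EnsureInitialize() { } // default no-op; FlatSnapTrieFactory clears the flat DB here
    void FinalizeSync() { }     // default no-op; FlatSnapTrieFactory flushes persistence here
}

public class SnapSyncRunnerSketch(ISnapTrieFactory trieFactory, Func<CancellationToken, Task> runDispatcher)
{
    public async Task Run(CancellationToken token)
    {
        trieFactory.EnsureInitialize(); // once, before any tree is created
        try
        {
            await runDispatcher(token); // returns only after every batch handler completed
        }
        finally
        {
            trieFactory.FinalizeSync(); // runs even when the dispatcher throws or is cancelled
        }
    }
}
```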
* refactor: replace snap+state sync dispatcher with SimpleDispatcher

- Introduce SimpleDispatcher that runs feeds sequentially (snap → state) replacing the old parallel SyncDispatcher<T> for snap+state sync.
- Merge SyncMode.SnapSync into SyncMode.StateNodes; sequencing is now handled by StateSyncRunner rather than the mode selector.
- Move isCloseToHead gate and DB tuning from mode selector / SyncDbTuner into StateSyncRunner. DB tune reset wrapped in try/finally so it runs on cancellation.
- StateSyncRunner early-returns when FindBestFullState != 0 (state already present); Run wraps RunRound (reset pivot, run feed, verify trie) plus the production-only waits.
- StateSyncFeed / SnapSyncFeed now implement ISimpleSyncFeed; return null signals round finished, dispatcher drains in-flight requests with CancellationToken.None so peer allocations are always freed.
- StateSyncFeed/Downloader/AllocationStrategyFactory registered via DI and resolved by StateSyncRunner instead of being new'd per round.
- TreeSync.ValidatePrepareRequest collapsed to bool IsSyncRoundFinished; the SyncMode parameter and _syncMode field are gone.
- Remove IsSnapGetRangesFinished from ISyncProgressResolver.
- Test harness SafeContext.RunFeed now calls IStateSyncRunner.RunRound; feed-level tests resolve StateSyncFeed / StateSyncDownloader via DI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: yield state sync between rounds to MultiSyncModeSelector

Restore the old "pause when mode leaves StateNodes" behaviour at the round boundary: StateSyncRunner.RunRound now awaits syncModeSelector.WaitUntilMode(StateNodes) at the top of each iteration, so beacon control / UpdatingPivot / fast-sync re-entry can claim peers between rounds. Pivot changes mid-round are still handled by the feed returning null via IsSyncRoundFinished.

Test infra substitutes ISyncModeSelector with one that reports StateNodes immediately — the tests drive RunRound directly and shouldn't depend on the real MultiSyncModeSelector settling.

Mid-round yielding (within a single RunFeed call) is not addressed here; the feed's PrepareRequest doesn't consult the mode selector to avoid complicating the test harness. Cost is bounded: pivot moves still end the round naturally, and all currently-observed mode transitions that matter (beacon reorg, pivot updates) also move the pivot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: make SimpleDispatcher<T> generic, bind feed/downloader at construction

Push T to the class level and move feed, downloader, allocation strategy factory, and AllocationContexts to the constructor — each dispatcher is now bound to the single feed it drives, and its entry point is Run(CancellationToken) rather than a generic RunFeed<T>(...).

Register SimpleDispatcher<StateSyncBatch?> and SimpleDispatcher<SnapSyncBatch?> as singletons via factory delegates in the SynchronizerModule (feed + downloader + strategy factory are also DI-resolved). State/SnapSyncRunner now inject the dispatcher instead of constructing it per call.

Also fix the stale class summary (no more WaitForMode) and the stale cref on ISimpleSyncFeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: register and resolve sync feed deps via interfaces

Register StateSyncFeed / SnapSyncFeed as ISimpleSyncFeed<T>, downloaders as ISyncDownloader<T>, strategy factories as IPeerAllocationStrategyFactory<T>, and resolve the same interfaces from the SimpleDispatcher factory. Tests that directly touched the feed likewise resolve ISimpleSyncFeed<StateSyncBatch> and ISyncDownloader<StateSyncBatch> instead of the concrete types.

Normalize the generic parameter to the non-nullable batch type (StateSyncBatch / SnapSyncBatch) and add `where T : class` to ISimpleSyncFeed<T> / SimpleDispatcher<T> so the interface and implementation generics line up (previously ISimpleSyncFeed<StateSyncBatch?> mismatched ISyncDownloader<StateSyncBatch>, forcing concrete resolution).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: remove unused StateSync using in StateSyncFeedTests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address claude review findings on snap/state sync dispatcher

- SnapSyncFeed: catch OperationCanceledException before general Exception so clean shutdown doesn't log ERROR
- SimpleDispatcher: remove token from Task.Run to prevent semaphore leak if cancellation fires between WaitAsync and task scheduling
- StateSyncRunner: replace LINQ with foreach in WaitForCloseToHead
- SafeContext: add Feed property so tests use ctx.Feed instead of container.Resolve<ISimpleSyncFeed<StateSyncBatch>>

* fix: swallow OCE on shutdown; retry transient SnapSyncFeed errors

- StateSyncRunner.Run wraps the body in try/catch (OperationCanceledException) when token.IsCancellationRequested, so clean shutdown no longer propagates OCE to Synchronizer.StartSnapAndStateSyncComponents' ContinueWith handler and logs "State sync failed" at ERROR.
- SnapSyncFeed.PrepareRequest moves try/catch inside the while loop with a dedicated catch (OperationCanceledException) returning null, matching StateSyncFeed's shape. Transient exceptions from _snapProvider.IsFinished() now retry on the next iteration instead of permanently aborting snap sync by signalling done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: rename RunRound -> RunStateSyncRounds; extract StateSyncPrecursorWait

- Rename `IStateSyncRunner.RunRound` to `RunStateSyncRounds` (multiple rounds, plural).
- Combine the WaitUntilMode(StateNodes) and WaitForCloseToHead calls into a single private method `StateSyncPrecursorWait` and use it both at the top of `Run` and at the top of every `RunStateSyncRounds` iteration. Yields between rounds when MSMS leaves StateNodes or the node drifts away from head.
- Test harness: also substitute ISyncProgressResolver and IBeaconSyncStrategy so the close-to-head check returns immediately (default test peers have HeadNumber=0 which would never satisfy the real predicate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop unused ITreeSync interface and SyncCompleted event

The SyncCompleted event was fired by TreeSync but no longer had any subscribers after VerifyStateOnStateSyncFinished was deleted and the verify-trie call moved into StateSyncRunner.RunStateSyncRounds. ITreeSync existed only to expose this event, so remove the whole surface.

- Delete ITreeSync.cs and SyncCompletedEventArgs.
- Drop the event field and its invocation in TreeSync.VerifyPostSyncCleanUp.
- DI: AddSingleton<ITreeSync, TreeSync>() -> AddSingleton<TreeSync>().
- Drop the now-unused Substitute.For<ITreeSync>() in SynchronizerModuleTests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run (#11472) — squash of the #11472 commit series reproduced above.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Lukasz Rozmej <lukasz.rozmej@gmail.com>
Summary
Three commits in this PR fix the Windows-only `E2ESyncTests.SnapSync` flake observed in CI (run #25139586200, job #73686059011 on PR #11422):

1. Cross-column snapshot atomicity in `SnapshotableMemColumnsDb` (c14030da4d) — eliminates the `TrieNodeException` mode.
2. Drain in-flight ISnapTrees in `FlatTreeSyncStore.FinalizeSync` (12a66f1e1b) — closes a separate ordering gap where post-sync readers could see state from not-yet-disposed phase-1 trees.
3. Don't publish `_initialized` until `Clear()` completes (73b160f057) — eliminates the deterministic "missing in flat" mode.

Plus the stress reproducer (edc0d3dbc9) used to drive the diagnosis.

Result: 5 stress runs × 60 tests = 300 invocations, 0 failures. Previously 8–15 failures per 60-test run on Windows.
Failure modes the flake produces
Mode B: `TrieNodeException` (cross-column tearing)

Block processing leases a fresh `persistenceReader` while the persistence pipeline is committing, gets a snapshot where the Accounts column is updated but the StateNodes column is not, walks the trie at the new state root, and throws `TrieNodeException` for a node that should be there.
SetAccount/RemoveAccount/DeleteAccountRangeand sink-sideInMemoryWriteBatch.Setfiltered toColumn_Account) showed source vs. sink Set counts matched perfectly while the sink had ~507K unaccounted Removes. Tracing back uncoveredClearAllColumnsracing against fresh sync writes — the third fix below.Fix 1 —
c14030da4dcross-column snapshot atomicitySnapshotableMemColumnsDb.CreateSnapshot()iterates columns and captures each column's snapshot independently.InMemoryColumnWriteBatch.Dispose()iterates per-column batches and disposes each independently. Without a global guard, aCreateSnapshotconcurrent with an in-flightwriteBatch.Disposecan capture some columns after the new writes and others before — a cross-column-inconsistent reader view.This race only affects the test backend — RocksDB has atomic cross-CF snapshots — but it manifests concretely in
E2ESyncTests.SnapSync. The fix wrapsCreateSnapshotand the wrappedAtomicColumnsWriteBatch.Disposein a singleLockso per-Dispose multi-column commits are atomic w.r.t. snapshot creation.Fix 2 —
12a66f1e1bdrain in-flight ISnapTrees in FinalizeSyncPhase-1 / phase-2 trees produced by
FlatSnapTrieFactoryhold per-treeIWriteBatchinstances; until each tree'sDisposereturns, its writes aren't yet committed to the underlying persistence. Without a drain, a caller that observes "sync finished" can read state that's missing accounts from late-disposed trees.FlatSnapTrieFactorynow tracks an_inFlightTreescounter (incremented inCreateStateTree/CreateStorageTree, decremented inFlatSnapStateTree.Dispose/FlatSnapStorageTree.Dispose).FlatTreeSyncStore.FinalizeSynccallsWaitForInFlightTreesDrained(spin-wait, 30s timeout) before bumping the persisted state ID. This is independent of fix 3 — it closes a different ordering window that can otherwise re-emerge as the same symptom.Fix 3 —
73b160f057don't publish_initializeduntilClear()completesFlatSnapTrieFactory.EnsureDatabaseClearedhad this ordering:Sequence that wipes accounts:
EnsureDatabaseCleared, sets_initialized = true, then startspersistence.Clear().Clear()opens a singleIColumnsWriteBatch, then for each column iteratesdb.GetColumnDb(column).GetAllKeys()and queuescolumnBatch.Remove(key)for every key it sees — and only disposes the batch (applying the queued Removes) when all columns are done.if (_initialized) return;, seestrue, and skips Clear entirely.persistence.CreateWriteBatch(...)and starts writing accounts viaSetAccount. These commit normally (their ownIPersistence.IWriteBatch.Dispose).Clear()is still inside itsIColumnsWriteBatch. As it iteratesGetAllKeys()for the Account column, it sees the keys T2 just committed and queuescolumnBatch.Remove(key)for them.Removelands on the live DB — including T2's freshly-written accounts. The flat DB ends up missing entries that the trie still references; the verifier reports "Account in trie not found in flat".The fix: publish
_initialized = trueafterpersistence.Clear()returns, and mark the fieldvolatile. Concurrent fast-path readers either observefalse(and wait on the lock) or observetrue(and Clear is genuinely done, so it's safe to proceed).How fix 3 was diagnosed
Surface symptoms (key seen by trie iterator but not by flat iterator) were consistent across runs but didn't point at any obvious caller. Source-side instrumentation in
BaseFlatPersistence.WriteBatch(SetAccount/RemoveAccount/DeleteAccountRangepreview) plus sink-side instrumentation inInMemoryWriteBatch.Set(filtered to_store.Name == "Column_Account") showed:507,072 of the 509,598 sink Removes came from
ClearAllColumns— i.e. ~8,451 accounts removed per test, where the DB was supposed to be empty when Clear ran. ~8,451 ≈ the test fixture's account count, confirming Clear was racing against sync writes. The race is timing-sensitive (any instrumentation in the hot path masks it) which is why Linux/macOS pass and only Windows reproduces.Stress reproducer
SnapSync_StressReproruns 30 iterations × 2 Flat fixtures = 60 cases, each as its own NUnit case (so a single failure doesn't terminate the rest), no[Retry],[Explicit]. ~1m runtime._initializedrace (fix 3)Across 5 consecutive runs of the stress reproducer with all three fixes applied: 0/300 failures.
Test plan
Nethermind.Dbbuilds (0 warnings, 0 errors)Nethermind.State.Flatbuilds (0 warnings, 0 errors)Nethermind.Synchronization.Testbuilds (0 warnings, 0 errors)