
fix(snap-sync): three Windows-flake fixes for E2ESyncTests.SnapSync #11443

Closed
LukaszRozmej wants to merge 6 commits into master from fix/e2e-snapsync-windows-flake-repro

Conversation

@LukaszRozmej
Member

@LukaszRozmej LukaszRozmej commented Apr 30, 2026

Summary

Three commits in this PR fix the Windows-only E2ESyncTests.SnapSync flake observed in CI (run #25139586200, job #73686059011 on PR #11422):

  1. fix(db): cross-column snapshot atomicity in SnapshotableMemColumnsDb (c14030da4d) — eliminates the TrieNodeException mode.
  2. fix(snap-sync): drain in-flight ISnapTrees in FlatTreeSyncStore.FinalizeSync (12a66f1e1b) — closes a separate ordering gap where post-sync readers could see state from not-yet-disposed phase-1 trees.
  3. fix(snap-sync): don't publish _initialized until Clear() completes (73b160f057) — eliminates the deterministic "missing in flat" mode.

Plus the stress reproducer (edc0d3dbc9) used to drive the diagnosis.

Result: 5 stress runs × 60 tests = 300 invocations, 0 failures. Previously 8–15 failures per 60-test run on Windows.

Failure modes the flake produces

Mode B: TrieNodeException (cross-column tearing)

Block processing leases a fresh persistenceReader while the persistence pipeline is committing, gets a snapshot where the Accounts column is updated but the StateNodes column is not, walks the trie at the new state root, and throws TrieNodeException for a node that should be there.

Mode A: Verification failed: N missing in flat

Trie-side iteration finds account leaves that the flat-side iterator can't see. The same ~9 hashed addresses were missed reliably across runs, all in verifier partitions 0–1 (first byte ≤ 0x3f). Counter-based diagnosis (instrumenting both source-side SetAccount/RemoveAccount/DeleteAccountRange and sink-side InMemoryWriteBatch.Set filtered to Column_Account) showed that source vs. sink Set counts matched perfectly while the sink had ~507K unaccounted Removes. Tracing those back uncovered ClearAllColumns racing against fresh sync writes — the third fix below.

Fix 1 — c14030da4d cross-column snapshot atomicity

SnapshotableMemColumnsDb.CreateSnapshot() iterates columns and captures each column's snapshot independently. InMemoryColumnWriteBatch.Dispose() iterates per-column batches and disposes each independently. Without a global guard, a CreateSnapshot concurrent with an in-flight writeBatch.Dispose can capture some columns after the new writes and others before — a cross-column-inconsistent reader view.

This race only affects the test backend — RocksDB has atomic cross-CF snapshots — but it manifests concretely in E2ESyncTests.SnapSync. The fix wraps CreateSnapshot and the wrapped AtomicColumnsWriteBatch.Dispose in a single Lock so per-Dispose multi-column commits are atomic w.r.t. snapshot creation.
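
As an illustration of the locking shape only (the dictionary-backed columns, class name, and member names below are stand-ins, not the actual Nethermind.Db types):

using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Minimal sketch of the cross-column guard, NOT the actual Nethermind.Db implementation:
// columns are plain dictionaries and a "snapshot" is a copy, just to show the locking shape.
class SnapshotableColumnsDbSketch
{
    private readonly Dictionary<string, Dictionary<string, byte[]>> _columns = new();
    private readonly Lock _atomicityLock = new();   // System.Threading.Lock (net9.0+)

    // Reader view: every column is captured inside one critical section, so a concurrent
    // multi-column commit is either fully visible or not visible at all.
    public Dictionary<string, Dictionary<string, byte[]>> CreateSnapshot()
    {
        using (_atomicityLock.EnterScope())
        {
            return _columns.ToDictionary(kv => kv.Key, kv => new Dictionary<string, byte[]>(kv.Value));
        }
    }

    // Commit path: the whole multi-column batch is applied under the same lock,
    // mirroring what the wrapped AtomicColumnsWriteBatch.Dispose does in the fix.
    public void CommitBatch(IEnumerable<(string Column, string Key, byte[] Value)> writes)
    {
        using (_atomicityLock.EnterScope())
        {
            foreach ((string column, string key, byte[] value) in writes)
            {
                if (!_columns.TryGetValue(column, out Dictionary<string, byte[]> columnData))
                {
                    columnData = new Dictionary<string, byte[]>();
                    _columns[column] = columnData;
                }
                columnData[key] = value;
            }
        }
    }
}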

Fix 2 — 12a66f1e1b drain in-flight ISnapTrees in FinalizeSync

Phase-1 / phase-2 trees produced by FlatSnapTrieFactory hold per-tree IWriteBatch instances; until each tree's Dispose returns, its writes aren't yet committed to the underlying persistence. Without a drain, a caller that observes "sync finished" can read state that's missing accounts from late-disposed trees.

FlatSnapTrieFactory now tracks an _inFlightTrees counter (incremented in CreateStateTree/CreateStorageTree, decremented in FlatSnapStateTree.Dispose/FlatSnapStorageTree.Dispose). FlatTreeSyncStore.FinalizeSync calls WaitForInFlightTreesDrained (spin-wait, 30s timeout) before bumping the persisted state ID. This is independent of fix 3 — it closes a different ordering window that can otherwise re-emerge as the same symptom.
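
A rough sketch of that counter and drain (the method names follow the PR description; everything else, including the class itself, is illustrative rather than the real FlatSnapTrieFactory code):

using System;
using System.Diagnostics;
using System.Threading;

// Illustrative tracker only; in the PR these members live on FlatSnapTrieFactory.
class InFlightTreeTrackerSketch
{
    private long _inFlightTrees;

    public void OnTreeCreated() => Interlocked.Increment(ref _inFlightTrees);    // CreateStateTree / CreateStorageTree

    public void OnTreeDisposed() => Interlocked.Decrement(ref _inFlightTrees);   // tree Dispose, finally-guarded

    // Spin-wait until every tree handed out has been disposed, or the timeout elapses.
    public bool WaitForInFlightTreesDrained(TimeSpan timeout)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        SpinWait spinner = new();
        while (Interlocked.Read(ref _inFlightTrees) > 0)
        {
            if (stopwatch.Elapsed > timeout) return false;   // caller decides how loudly to complain
            spinner.SpinOnce();                              // backs off to Yield/Sleep(1) on its own
        }
        return true;
    }
}

In the normal case the counter is already zero when FinalizeSync reaches the drain, so the loop exits on its first check.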

Fix 3 — 73b160f057 don't publish _initialized until Clear() completes

FlatSnapTrieFactory.EnsureDatabaseCleared had this ordering:

if (_initialized) return;
using (_lock.EnterScope())
{
    if (_initialized) return;
    _initialized = true;          // published BEFORE Clear() ran
    persistence.Clear();           // still iterating + queueing Removes
}

Sequence that wipes accounts:

  1. Thread T1 enters EnsureDatabaseCleared, sets _initialized = true, then starts persistence.Clear(). Clear() opens a single IColumnsWriteBatch, then for each column iterates db.GetColumnDb(column).GetAllKeys() and queues columnBatch.Remove(key) for every key it sees — and only disposes the batch (applying the queued Removes) when all columns are done.
  2. Thread T2 hits the unlocked fast-path if (_initialized) return;, sees true, and skips Clear entirely.
  3. T2 calls persistence.CreateWriteBatch(...) and starts writing accounts via SetAccount. These commit normally (their own IPersistence.IWriteBatch.Dispose).
  4. T1's Clear() is still inside its IColumnsWriteBatch. As it iterates GetAllKeys() for the Account column, it sees the keys T2 just committed and queues columnBatch.Remove(key) for them.
  5. T1's batch finally disposes. Every queued Remove lands on the live DB — including T2's freshly-written accounts. The flat DB ends up missing entries that the trie still references; the verifier reports "Account in trie not found in flat".

The fix: publish _initialized = true after persistence.Clear() returns, and mark the field volatile. Concurrent fast-path readers either observe false (and wait on the lock) or observe true (and Clear is genuinely done, so it's safe to proceed).
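
For contrast with the broken snippet above, a sketch of the fixed ordering (field, lock, and method names follow that snippet; the parameter shape and the rest of the class are assumed or elided):

// Sketch of the corrected ordering; _lock is the same lock as in the snippet above.
private volatile bool _initialized;

private void EnsureDatabaseCleared(IPersistence persistence)
{
    if (_initialized) return;              // fast path: only ever true once Clear() has finished
    using (_lock.EnterScope())
    {
        if (_initialized) return;
        persistence.Clear();               // all queued Removes are applied before we publish
        _initialized = true;               // volatile write: published AFTER Clear() completes
    }
}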

How fix 3 was diagnosed

Surface symptoms (key seen by trie iterator but not by flat iterator) were consistent across runs but didn't point at any obvious caller. Source-side instrumentation in BaseFlatPersistence.WriteBatch (SetAccount / RemoveAccount / DeleteAccountRange preview) plus sink-side instrumentation in InMemoryWriteBatch.Set (filtered to _store.Name == "Column_Account") showed:

setBaseFlat = 1,134,377   setIM = 1,134,377    ← all sets accounted
removeBaseFlat = 1,974    DeleteAccountRangePreview = 552
removeIM = 509,598
clearAllColumnsCalls = 60   clearAccountRemoves = 507,072    ← gap closes here

507,072 of the 509,598 sink Removes came from ClearAllColumns — i.e. ~8,451 accounts removed per test, where the DB was supposed to be empty when Clear ran. ~8,451 ≈ the test fixture's account count, confirming Clear was racing against sync writes. The race is timing-sensitive (any instrumentation in the hot path masks it) which is why Linux/macOS pass and only Windows reproduces.
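
For illustration, the counters behind those numbers amount to something like this throwaway helper (the hook points and names are stand-ins for the temporary instrumentation described above, not code that ships in the PR):

using System.Threading;

// Throwaway diagnostic counters, bumped from temporary hooks in the paths named above:
// source side (SetAccount / RemoveAccount / DeleteAccountRange) vs. sink side
// (InMemoryWriteBatch.Set / Remove filtered to the account column, plus ClearAllColumns).
static class FlakeCounters
{
    public static long SetBaseFlat, RemoveBaseFlat, DeleteRangePreview;
    public static long SetIM, RemoveIM, ClearAllColumnsCalls, ClearAccountRemoves;

    public static void Bump(ref long counter) => Interlocked.Increment(ref counter);

    // Unsynchronized reads are fine here: Dump is only called once, after the run.
    public static string Dump() =>
        $"setBaseFlat={SetBaseFlat} setIM={SetIM} removeBaseFlat={RemoveBaseFlat} " +
        $"removeIM={RemoveIM} clearAllColumnsCalls={ClearAllColumnsCalls} clearAccountRemoves={ClearAccountRemoves}";
}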

Stress reproducer

SnapSync_StressRepro runs 30 iterations × 2 Flat fixtures = 60 cases, each as its own NUnit case (so a single failure doesn't terminate the rest), with no [Retry] and marked [Explicit]. Runtime is roughly one minute.

dotnet test --project src/Nethermind/Nethermind.Synchronization.Test/Nethermind.Synchronization.Test.csproj \
  -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"
Configuration                   Local Windows results (per 60-test run)
Baseline (no fix)               18/60 fail; ~12 Mode A + ~6 Mode B
+ cross-column fix (fix 1)      ~18/60 fail; 18 Mode A + 0 Mode B
+ ISnapTree drain (fix 2)       8–15/60 fail; all Mode A
+ _initialized race (fix 3)     0/60

Across 5 consecutive runs of the stress reproducer with all three fixes applied: 0/300 failures.

Test plan

  • Nethermind.Db builds (0 warnings, 0 errors)
  • Nethermind.State.Flat builds (0 warnings, 0 errors)
  • Nethermind.Synchronization.Test builds (0 warnings, 0 errors)
  • Stress reproducer: 0/300 failures across 5 consecutive 60-test runs on Windows

Adds a 30-iteration stress reproducer for the Windows-only
E2ESyncTests.SnapSync flake observed in CI (run #25139586200,
job #73686059011 on PR #11422).

Each iteration is its own NUnit case so a single failure does not
terminate the rest, no [Retry] so every flake surfaces, and
[Explicit] so it does not run in normal CI. Scoped to DbMode.Flat
where the flake has been observed.

Local reproduction on Windows (~1m runtime, 60 iterations across the
two Flat fixtures): 9-19 failures with both observed CI patterns —
"Verification failed: ... N missing in flat" and
"TrieNodeException: Failed to load key X from root hash Y" during
post-snap-sync block processing.

Run: dotnet test --project Nethermind.Synchronization.Test \
  -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@claude
Contributor

claude Bot commented Apr 30, 2026

Claude finished @LukaszRozmej's task in 2m 49s —— View job


Code Review

  • Gather context and read changed files
  • Read repo rules (coding-style, test-infrastructure)
  • Analyze the reproducer test code
  • Review for correctness, edge cases, and adherence to guidelines
  • Post final review

Summary

This is a well-motivated diagnostic PR — [Explicit], no [Retry], per-iteration isolation via [TestCaseSource], and a clear Assert.Ignore guard for non-Flat modes. The design choices for a flake reproducer are sound. Two findings to address before merge.


Findings

Medium — Code duplication violates DRY and repo test rules

SnapSync_StressRepro (lines 440–463) is a verbatim copy of SnapSync() (lines 406–429). The test-infrastructure rules explicitly prohibit duplicating test methods. The sync logic should be extracted into a shared helper so both tests call it:

private async Task RunSnapSyncClientOnce(CancellationToken cancellationToken)
{
    PrivateKey clientKey = TestItem.PrivateKeyD;
    await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
    {
        SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
        syncConfig.FastSync = true;
        syncConfig.SnapSync = true;
        await SetPivot(syncConfig, cancellationToken);
        INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
        networkConfig.P2PPort = AllocatePort();
        networkConfig.FilterPeersByRecentIp = false;
        networkConfig.FilterDiscoveryNodesByRecentIp = false;
    });
    await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationToken);
}

Then SnapSync() and SnapSync_StressRepro() each become a one-liner call. Fix this →

Low — Verbose multi-line comment block violates coding style

The 7-line comment block above SnapSync_StressRepro (lines 431–437) violates the coding-style rule: "Don't write multi-paragraph docstrings or multi-line comment blocks — one short line max." The run ID, invocation instructions, and rationale already live in the PR description (the right place). The [Explicit] attribute message covers the essential "what" for a reader in the IDE. The comment should be trimmed to a single line, e.g.:

// Stress reproducer — run manually; see PR #11443 for context.

Positive notes

  • [Explicit] correctly prevents accidental CI execution — good safeguard.
  • Intentionally omitting [Retry] is the right call: every flake should surface.
  • Per-iteration isolation via [TestCaseSource] (so one failure doesn't abort the run) is correct.
  • The Assert.Ignore guard is clean — it covers all non-Flat modes (including Hash) without needing a separate guard.
  • _ = iteration is the correct idiom here: using _ as the parameter name would show _ in NUnit test names for all cases, losing the iteration index.
  • StressIterationCount = 30 × 2 Flat fixtures = 60 active cases — matches the PR description.

Comment on lines +447 to +463
PrivateKey clientKey = TestItem.PrivateKeyD;
await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
{
    SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
    syncConfig.FastSync = true;
    syncConfig.SnapSync = true;

    await SetPivot(syncConfig, cancellationTokenSource.Token);

    INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
    networkConfig.P2PPort = AllocatePort();
    networkConfig.FilterPeersByRecentIp = false;
    networkConfig.FilterDiscoveryNodesByRecentIp = false;
});

await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationTokenSource.Token);
}

The body of this test is a verbatim copy of SnapSync(). Per the repo's test-infrastructure rules, duplicating test method bodies is not allowed — extract the shared logic into a private helper (e.g. RunSnapSyncClientOnce(CancellationToken)) and have both methods call it.

Comment on lines +431 to +437
// Stress reproducer for the Windows-only SnapSync flake observed in CI:
// - run #25139586200 / job #73686059011 (PR #11422)
// Re-runs the same logic as SnapSync() many times, each iteration as its own NUnit case
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).

This 7-line comment block violates the coding-style rule: "Don't write multi-paragraph docstrings or multi-line comment blocks — one short line max." The run IDs, invocation instructions, and rationale belong in the PR description (where they already are). Trim to a single line, e.g.:

Suggested change
// Stress reproducer for the Windows-only SnapSync flake observed in CI:
// - run #25139586200 / job #73686059011 (PR #11422)
// Re-runs the same logic as SnapSync() many times, each iteration as its own NUnit case
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).
// Stress reproducer — run manually; see PR #11443 for context.

@LukaszRozmej LukaszRozmej changed the title from "test(snap-sync): stress reproducer for Windows-only E2ESyncTests.SnapSync flake" to "fix(db): cross-column snapshot atomicity + Windows-flake stress reproducer" on Apr 30, 2026
SnapshotableMemColumnsDb.CreateSnapshot iterates columns in a loop and
captures each column's snapshot independently. InMemoryColumnWriteBatch
.Dispose iterates the per-column batches and disposes each
independently. Without a global guard, a CreateSnapshot call concurrent
with an in-flight writeBatch.Dispose can capture some columns AFTER the
new writes and others BEFORE — producing a cross-column-inconsistent
reader view.

This race only matters in the test backend (RocksDB has atomic
cross-CF snapshots), but it manifested concretely in
E2ESyncTests.SnapSync on Windows: post-snap-sync block processing leases
a fresh persistenceReader while the persistence pipeline is committing,
gets a snapshot whose Accounts column is updated but whose StateNodes
column is not, then walks the trie at the new state root and throws
TrieNodeException for a node that "should" be there.

Add a ReaderWriterLockSlim around CreateSnapshot (read) and
writeBatch.Dispose (write) so multi-column commits are atomic w.r.t.
snapshot creation. Multiple snapshots can still proceed concurrently —
the only contention is the rare overlap of a snapshot creation with a
multi-column commit.

This eliminates the TrieNodeException failure mode in the stress
reproducer (`SnapSync_StressRepro`). The "missing in flat" mode is a
separate, deterministic bug (the same ~9 addresses miss reliably) that
appears unaffected by this fix and needs follow-up investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukaszRozmej LukaszRozmej force-pushed the fix/e2e-snapsync-windows-flake-repro branch from 0f09e4d to c14030d on April 30, 2026 19:18
@LukaszRozmej
Member Author

Mode A diagnostic — phase-1 SnapSync trees not disposed when verifier runs

Added targeted instrumentation to BaseFlatPersistence.WriteBatch.SetAccount / DeleteAccountRange, SnapshotableMemDb.Set, and FlatSnapStateTree constructor / Commit / Dispose. Ran the stress reproducer and traced events for the deterministically-missing 9 paths. Concrete finding:

Some FlatSnapStateTree instances start Dispose() but never finish before the verifier runs, so their queued BaseFlatPersistence.WriteBatch.SetAccount writes never reach the underlying SnapshotableMemDb. Concrete example from one failing iteration (sequence numbers from event stream):

[04349] tid=707 FlatSnapStateTree CREATED id=499
[04358] tid=707 FlatSnapStateTree COMMIT_START id=499 upperBound=3E40FAF1
[04377] tid=707 FlatSnapStateTree COMMIT_END id=499
[04378] tid=707 FlatSnapStateTree DISPOSE_START id=499
... no DISPOSE_END for id=499 in the dump ...
[04386] tid=31 FlatSnapStateTree DISPOSE_START id=501
[04387] tid=31 FlatSnapStateTree DISPOSE_END id=501       ← later tree disposes fine

Tree id=499 started disposing but never finished within the iteration's window. The dump fires when VerifyTrie throws, so by that point the writes from tree 499 are queued in its IWriteBatch but never reached the underlying column DB.

In iter 10 of the traced run: 27 trees created, 24 disposed, 3 stuck (id=499 in Dispose, id=502 / 503 mid-COMMIT).

Why IsSnapGetRangesFinished doesn't catch this

Expected: _activeAccountRequests is decremented in SnapProvider.AddAccountRange AFTER the inner SnapProviderHelper.AddAccountRange returns (which, via using ISnapTree, only returns once the tree's Dispose completes). So in theory phase 1 can't transition until all in-flight trees have finished disposing.

Observed: phase 1 transitions and SyncUntilFinished returns while tree disposals are still in flight. Either the dispose path has a hot reference held longer than its method scope, or the counter / mode-change semantics admit a window I haven't pinned down. The data is unambiguous though — by the time VerifyTrie runs, the writes that should be in flat are still queued on a not-yet-disposed IWriteBatch.

Tested mitigations

Approach                             Result
Task.Delay(1s) before FlushCache     12–15 fail (similar to baseline)
Task.Delay(10s) before FlushCache    13 fail (similar to baseline)

Even 10s isn't enough — some disposals appear to take much longer or are genuinely stuck on something. A delay-based fix isn't sufficient.

Proposed real fix (next step, not in this PR)

Add an in-flight counter to FlatSnapTrieFactory:

  • Increment on CreateStateTree / CreateStorageTree.
  • Decrement when the returned tree's Dispose() completes.
  • Expose WaitForDrained(timeout).
  • Call WaitForDrained in FlatTreeSyncStore.FinalizeSync before returning.

That would block the sync→post-sync transition until every phase-1 writeBatch is committed, eliminating the race. Production also benefits: a node that restarts mid-snap-sync currently risks losing uncommitted phase-1 writes.

fix(snap-sync): drain in-flight ISnapTrees in FlatTreeSyncStore.FinalizeSync

Phase-1 ISnapTrees produced by FlatSnapTrieFactory each hold a per-tree
IPersistence.IWriteBatch. The writeBatch's writes only become visible to
subsequent IPersistence readers after the tree's Dispose() returns. When
FlatTreeSyncStore.FinalizeSync ran without waiting for outstanding tree
disposals, callers that observed "sync finished" could read state that
was missing accounts whose enclosing tree hadn't fully disposed yet.

Track in-flight ISnapTrees on FlatSnapTrieFactory (incremented on
Create*Tree, decremented in tree Dispose via a back-reference) and have
FlatTreeSyncStore.FinalizeSync drain that counter (with a 30s timeout)
before flushing. This closes the gap where a snap-sync writeBatch was
queued but not yet committed when the verifier ran.

Reduces E2ESyncTests.SnapSync_StressRepro Windows flake rate from
~13-18/60 to ~8-15/60. Doesn't fully eliminate Mode A — the remaining
failures involve a deterministic set of ~9 paths that appear unaffected
by the drain (likely a separate issue in how state-nodes-sync writes
flat entries for some specific topology, requiring further work).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukaszRozmej
Member Author

Update — added drain fix for phase-1 in-flight ISnapTrees

Pushed 12a66f1e1b: FlatSnapTrieFactory now tracks in-flight ISnapTree instances; FlatTreeSyncStore.FinalizeSync drains them (30s timeout) before flushing. This eliminates the case where phase-1 trees with queued IPersistence.IWriteBatch writes hadn't finished disposing when the verifier ran.

Stress-test results (60 iterations × 2 Flat fixtures, no [Retry])

Configuration                   Failures   Failure type
Master baseline                 9–19       Mix of Mode A + Mode B
+ cross-column snapshot fix     13–18      All Mode A
+ drain fix (this)              8–15       All Mode A

Modest but real improvement. The drain doesn't fully eliminate Mode A.

What's still unfixed

The remaining Mode A failures hit the same deterministic 9 paths as before — the drain doesn't change the failure signature, it just makes failure less frequent. In iterations where phase-1 (state ranges) doesn't progress at all (logs show Phase 1: ( 0.00 %) then immediately finished), all work happens via phase-2 state-nodes-sync (FlatTreeSyncStore.SaveNode), and the drain catches no trees. These iterations still fail with the same 9 missing accounts.

So the remaining bug is in phase-2 state-nodes-sync's flat-entry path, not phase-1. The 9 deterministic addresses suggest a specific trie topology (something about how those leaves are reached) that causes WriteAccountFlatEntries to miss them, or some other structural issue in FlatTreeSyncStore.SaveNode.

Investigation handed off to whoever picks this up next: the data points to either (a) leaves that are inline in some node type that WriteAccountFlatEntries doesn't handle, or (b) a specific code path in the Branch deletion-range computation that's incorrect for those 9 addresses' positions in the trie.

The fast-path `if (_initialized) return;` lets a concurrent caller skip
EnsureDatabaseCleared and proceed to write accounts while the very first
caller is still inside persistence.Clear() — iterating GetAllKeys() and
queueing Removes for every existing key. When that batch finally
disposes, those Removes overwrite the just-written accounts, leaving the
flat DB missing entries that the trie still references. The verifier
then reports "Account in trie not found in flat" for whichever accounts
landed in the window. Stress-running E2ESyncTests.SnapSync on Windows
reproduced this in roughly 13–18 of 60 iterations; setting
_initialized after Clear() drops failures to 0/300.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukaszRozmej
Member Author

Root cause found and fixed in 73b160f.

The remaining "missing in flat" failures came from a race in FlatSnapTrieFactory.EnsureDatabaseCleared:

if (_initialized) return;
using (_lock.EnterScope())
{
    if (_initialized) return;
    _initialized = true;          // ← published BEFORE Clear() ran
    persistence.Clear();           // ← still iterating GetAllKeys() / queueing Removes
}

Sequence that wipes accounts:

  1. Thread T1 enters EnsureDatabaseCleared, sets _initialized = true, starts persistence.Clear() (which calls db.GetColumnDb(column).GetAllKeys() and queues columnBatch.Remove(key) for every key found in each column).
  2. Thread T2 hits the fast path if (_initialized) return; — it sees true and skips Clear.
  3. T2 calls persistence.CreateWriteBatch(...) and starts writing accounts via SetAccount. These commit.
  4. T1's Clear() is still inside the same using IColumnsWriteBatch ... batch. Its GetAllKeys() either already saw T2's keys, or sees them after the next loop iteration; either way columnBatch.Remove(key) is queued.
  5. T1's batch disposes. Every queued Remove is applied to the live DB — including T2's freshly-written accounts.

How I confirmed it: instrumented every IWriteBatch operation that lands on Column_Account and counted source-side calls (SetAccount / RemoveAccount / DeleteAccountRange) vs sink-side calls on the underlying InMemoryWriteBatch. The numbers across 60 iterations:

setBaseFlat = 1,134,377   setIM = 1,134,377   ✓ all sets accounted
removeBaseFlat = 1,974    DeleteAccountRangePreview = 552
removeIM = 509,598
clearAllColumnsCalls = 60   clearAccountRemoves = 507,072    ← gap closes here

507,072 of the 509,598 sink Removes came from ClearAllColumns — i.e. ~8,451 accounts removed per test, where the DB was supposed to be empty when Clear ran. That ~8,451 matches the test fixture's account count, confirming Clear was racing against sync writes.

Fix: publish _initialized = true after persistence.Clear() returns, and mark the field volatile so the publish has release semantics. Now T2's fast-path read of _initialized = true strictly happens-after T1's Clear, so T2 either skips because Clear is genuinely done, or waits on the lock.

Verified: 5 stress runs × 60 tests = 300 invocations, 0 failures. Previously 8–15 failures per 60-test run on Windows.

This was the last bug. Combined effect of the three commits in this PR:

  • c14030da4d — eliminated Mode B (TrieNodeException, cross-column snapshot tearing)
  • 12a66f1e1b — drained phase-1 trees before FinalizeSync (still useful; protects against a separate ordering issue)
  • 73b160f057 — eliminated Mode A (this race)

Marking the PR ready for review.

@LukaszRozmej LukaszRozmej marked this pull request as ready for review May 1, 2026 14:27
@claude
Contributor

claude Bot commented May 1, 2026

Claude finished @LukaszRozmej's task in 4m 11s —— View job


Code Review

  • Gather context from PR description and prior comments
  • Read changed files
  • Read repo rules (coding-style, robustness, test-infrastructure)
  • Analyze each fix for correctness and thread-safety
  • Check prior review findings — addressed?
  • Post final review

Summary

All three bug fixes are correct. The cross-column lock fix (SnapshotableMemColumnsDb) and the DCL _initialized fix (FlatSnapTrieFactory) are well-implemented. The drain-counter infrastructure is sound. The root-cause investigation in the PR comments (particularly the _initialized race evidence with ~507K "extra" removes) is excellent.

The two findings from the previous review remain unaddressed. One prior Medium (code duplication) must be fixed before merge.


Findings

Medium — Code duplication still present (prior finding, unaddressed)

SnapSync_StressRepro (lines 447–462) is still a verbatim copy of SnapSync() (lines 412–428). This violates both coding-style.md ("extract repeated blocks (5+ lines) into shared methods") and test-infrastructure.md ("Do not duplicate test methods"). Extract to a shared helper: Fix this →

Low — Verbose comment block (prior finding, unaddressed)

The 7-line comment at lines 431–437 still violates the coding-style one-line-max rule. The run IDs and invocation instructions already live in the PR description. The [Explicit] attribute message covers the essential "why" for IDE readers. Trim to one line, e.g.:

    // Stress reproducer for SnapSync Windows flake — run manually; see PR #11443 for context.

Low — WaitForInFlightTreesDrained return value silently discarded

FlatTreeSyncStore.FinalizeSync line 246 discards the bool return from WaitForInFlightTreesDrained. On timeout, the warning is logged inside the callee, but FinalizeSync proceeds to persistence.Flush() — potentially flushing a flat state that is missing writes from trees still in flight. For the 30-second timeout with only SpinWait between checks this is unlikely in practice, but the discard is a silent contract violation. At minimum, log or throw at the call-site when the drain times out, so it's visible at the finalization layer:

if (!snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30)))
    if (_logger.IsWarn) _logger.Warn("FinalizeSync: drain timed out — flat state may be incomplete");

Fix this →


Positive notes

  • _initialized DCL fix is correct. Moving _initialized = true after persistence.Clear() inside the lock, combined with volatile, gives proper release/acquire ordering. The old code set _initialized = true before Clear() completed, allowing a concurrent thread to race in and have its writes wiped by the still-running Clear() batch.
  • SnapshotableMemColumnsDb lock choice is correct. Using System.Threading.Lock instead of ReaderWriterLockSlim (as the PR description originally described) is simpler and equally correct — this is a test backend, so the serialization between concurrent snapshots is acceptable. The using Lock.Scope pattern in both CreateSnapshot and AtomicColumnsWriteBatch.Dispose is idiomatic for net10.0.
  • _inFlightTrees counter is correctly implemented. Interlocked.Increment/Decrement paired with SpinWait in WaitForInFlightTreesDrained is a standard and correct pattern. The finally block in FlatSnapStateTree.Dispose / FlatSnapStorageTree.Dispose ensures OnTreeDisposed() is called even on exception.
  • [Explicit] on the stress reproducer correctly prevents accidental CI execution.
  • No [Retry] on the stress reproducer is the right call — every occurrence should surface.

});

await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationTokenSource.Token);
}

Still a verbatim copy of SnapSync() (lines 412–428). Per coding-style.md ("extract repeated blocks (5+ lines)") and test-infrastructure.md ("Do not duplicate test methods"), extract the shared body into a private helper and have both methods call it:

private async Task RunSnapSyncOnce(CancellationToken cancellationToken)
{
    PrivateKey clientKey = TestItem.PrivateKeyD;
    await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
    {
        SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
        syncConfig.FastSync = true;
        syncConfig.SnapSync = true;
        await SetPivot(syncConfig, cancellationToken);
        INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
        networkConfig.P2PPort = AllocatePort();
        networkConfig.FilterPeersByRecentIp = false;
        networkConfig.FilterDiscoveryNodesByRecentIp = false;
    });
    await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationToken);
}

Then both tests become:

await RunSnapSyncOnce(cancellationTokenSource.Token);

// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).

7-line comment block violates the one-line-max rule in coding-style.md. The run IDs, invocation instructions, and rationale live in the PR description and in the [Explicit] message. Trim to a single line:

Suggested change
// Runs only on Flat dbMode (where the flake has been observed).
// Stress reproducer for SnapSync Windows flake — run manually; see PR #11443 for context.

// aren't yet committed to the underlying persistence. Without this drain a caller that
// observes "sync finished" can read state that's missing accounts from late-disposed
// trees (manifests as E2ESyncTests.SnapSync "Verification failed: N missing in flat").
snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30));

The bool return is silently discarded. On a 30-second drain timeout the warning is logged inside WaitForInFlightTreesDrained, but FinalizeSync proceeds to persistence.Flush() without any signal at this layer. Add a site-local log so finalization visibility is complete:

Suggested change
snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30));
if (!snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30)))
    if (_logger.IsWarn) _logger.Warn("FinalizeSync: in-flight ISnapTree drain timed out — flat state may be incomplete");

@LukaszRozmej LukaszRozmej changed the title from "fix(db): cross-column snapshot atomicity + Windows-flake stress reproducer" to "fix(snap-sync): three Windows-flake fixes for E2ESyncTests.SnapSync" on May 1, 2026
@LukaszRozmej
Member Author

@claude re-review

@claude
Contributor

claude Bot commented May 1, 2026

Claude finished @LukaszRozmej's task in 3m 9s —— View job


Code Review (re-review)

  • Read changed files
  • Read repo rules (coding-style, robustness, test-infrastructure)
  • Check prior findings — addressed?
  • Analyze new commits for correctness
  • Post inline comments
  • Post final summary

Summary

All three production bug fixes are correct and well-implemented. The cross-column lock fix, the DCL _initialized reordering, and the drain counter are each sound. The root-cause analysis and verification methodology are excellent.

Three prior findings remain unaddressed (one Medium, two Low), and one new Low was identified. The Medium blocks merge.


Findings

Medium — Code duplication (3rd occurrence, unaddressed)

SnapSync_StressRepro (lines 447–462) is a verbatim copy of SnapSync() (lines 412–428). coding-style.md: "extract repeated blocks (5+ lines) into shared methods." This has been flagged in both previous reviews with a concrete helper signature. See inline →

Low — Verbose comment block (3rd occurrence, unaddressed)

The 7-line comment at lines 431–437 still violates the one-line-max rule. The [Explicit] attribute message on line 438 already carries the essential intent for IDE readers; run IDs and invocation instructions belong in the PR description. See inline →

Low — WaitForInFlightTreesDrained return silently discarded (2nd occurrence, unaddressed)

FlatTreeSyncStore.FinalizeSync:246 discards the bool result. On timeout, the callee logs a warning but FinalizeSync continues to persistence.Flush() — finalization-layer visibility is missing. See inline →

Low — OnTreeCreated() called before tree constructor completes (new)

In FlatSnapTrieFactory.CreateStateTree / CreateStorageTree, OnTreeCreated() is called at line 66/76 before new FlatSnapStateTree/FlatSnapStorageTree(...) finishes. If the constructor throws, OnTreeDisposed() is never called, permanently leaking the in-flight counter — every subsequent FinalizeSync would spin for 30 seconds. Moving OnTreeCreated() into each tree's constructor (paired with the existing finally-guarded OnTreeDisposed() in Dispose) eliminates the leak. See inline →


Positive notes

  • _initialized DCL is correct. Moving _initialized = true after persistence.Clear(), combined with volatile, gives proper release/acquire ordering — a thread reading _initialized == true outside the lock is guaranteed to see all effects of Clear().
  • SnapshotableMemColumnsDb lock is correct. using Lock.Scope in both CreateSnapshot and AtomicColumnsWriteBatch.Dispose makes multi-column commits atomic w.r.t. snapshot creation without over-complicating the API.
  • _inFlightTrees counter is correctly implemented. Interlocked.Increment/Decrement paired with SpinWait is a standard and correct pattern. The finally blocks in FlatSnapStateTree.Dispose / FlatSnapStorageTree.Dispose ensure OnTreeDisposed() is called even on exception.
  • [Explicit] with message on the stress reproducer correctly prevents accidental CI execution, and the attribute message now carries the essential "why" for IDE readers.
  • No [Retry] on the stress reproducer is the right call — every failure should surface.

});

await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationTokenSource.Token);
}

Medium (third occurrence — unaddressed): Lines 447–462 are still a verbatim copy of SnapSync() lines 412–428. coding-style.md requires extracting repeated blocks of 5+ lines into shared methods. Extract to a private helper:

private async Task RunSnapSyncOnce(CancellationToken cancellationToken)
{
    PrivateKey clientKey = TestItem.PrivateKeyD;
    await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
    {
        SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
        syncConfig.FastSync = true;
        syncConfig.SnapSync = true;
        await SetPivot(syncConfig, cancellationToken);
        INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
        networkConfig.P2PPort = AllocatePort();
        networkConfig.FilterPeersByRecentIp = false;
        networkConfig.FilterDiscoveryNodesByRecentIp = false;
    });
    await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationToken);
}

Both SnapSync() and SnapSync_StressRepro() then become a one-liner call to this helper.

Comment on lines +431 to +437
// Stress reproducer for the Windows-only SnapSync flake observed in CI:
// - run #25139586200 / job #73686059011 (PR #11422)
// Re-runs the same logic as SnapSync() many times, each iteration as its own NUnit case
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).

Low (third occurrence — unaddressed): This 7-line block still violates coding-style.md ("Code comments explain why, not what"). The [Explicit] message (line 438) already covers the essential intent for IDE readers. Run IDs and invocation instructions belong in the PR description (where they already are). Trim to one line:

Suggested change
// Stress reproducer for the Windows-only SnapSync flake observed in CI:
// - run #25139586200 / job #73686059011 (PR #11422)
// Re-runs the same logic as SnapSync() many times, each iteration as its own NUnit case
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).
// Stress reproducer for the SnapSync Windows flake — run manually; see PR #11443 for context.

// aren't yet committed to the underlying persistence. Without this drain a caller that
// observes "sync finished" can read state that's missing accounts from late-disposed
// trees (manifests as E2ESyncTests.SnapSync "Verification failed: N missing in flat").
snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30));

Low (second occurrence — unaddressed): The bool return from WaitForInFlightTreesDrained is silently discarded. The timeout warning is logged inside the callee, but FinalizeSync proceeds to persistence.Flush() with no signal at this layer. Add a site-local log so finalization-level visibility is complete:

Suggested change
snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30));
if (!snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30)))
    if (_logger.IsWarn) _logger.Warn("FinalizeSync: in-flight ISnapTree drain timed out — flat state may be incomplete");

Comment on lines +63 to +67

IPersistence.IPersistenceReader reader = persistence.CreateReader(ReaderFlags.Sync);
IPersistence.IWriteBatch writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager);
OnTreeCreated();
return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);

Low (new): OnTreeCreated() is called at line 66 before new FlatSnapStateTree(...) finishes constructing. If the constructor throws (e.g. new StateTree(...) fails), OnTreeDisposed() is never called — the in-flight counter is permanently incremented, causing WaitForInFlightTreesDrained to spin until its 30-second timeout on every subsequent FinalizeSync call.

The fix: call OnTreeCreated() inside the FlatSnapStateTree constructor (where _ownerFactory is already available), and have the finally block in Dispose pair with it. This way the increment is always matched by the decrement, even if the caller-side code throws after creation.

Same issue exists in CreateStorageTree / FlatSnapStorageTree.

…rden tree-counter on throw

- Extract `RunSnapSyncOnce(CancellationToken)` and call from both `SnapSync()` and
  `SnapSync_StressRepro(int)` (was a verbatim copy of the test body).
- Trim the stress-repro comment block to a single line (coding-style: one-line max).
- Update `FlatTreeSyncStoreTests` constructor call for the new `FlatSnapTrieFactory`
  parameter on `FlatTreeSyncStore` (build was failing).
- Drop unused `using System;` / `using System.Threading;` (ImplicitUsings is on for
  this project; lint flagged them).
- Log a site-local warning in `FinalizeSync` when `WaitForInFlightTreesDrained`
  returns false, instead of silently discarding the bool — finalization-level
  visibility was missing.
- In `FlatSnapTrieFactory.CreateStateTree` / `CreateStorageTree`, pair every
  `OnTreeCreated()` with cleanup if the subsequent constructor throws; otherwise
  the in-flight counter leaks and `WaitForInFlightTreesDrained` would spin to its
  30-second timeout on every `FinalizeSync`. Also dispose the per-tree reader and
  write batch on the throw path so they don't leak either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukaszRozmej
Member Author

@claude re-review

@claude
Contributor

claude Bot commented May 1, 2026

Claude finished @LukaszRozmej's task in 2m 30s —— View job


Code Review (re-review)

  • Read changed files and latest commit (41e858d)
  • Read repo rules (coding-style, robustness, test-infrastructure)
  • Check prior findings — all addressed
  • Analyze new commits for correctness and new issues
  • Post final summary

Summary

All four prior findings are resolved in 41e858d. No new findings. The PR is clean.


Prior findings — all resolved ✅

1. Medium: Code duplication — SnapSync_StressRepro was a verbatim copy of SnapSync. ✅ Extracted RunSnapSyncOnce(CancellationToken); both tests now call the helper.
2. Low: Verbose 7-line comment block violated one-line-max rule. ✅ Trimmed to a single line at E2ESyncTests.cs:414.
3. Low: WaitForInFlightTreesDrained return value silently discarded in FinalizeSync. ✅ FinalizeSync now logs a site-level warning when the drain times out (FlatTreeSyncStore.cs:247–248).
4. Low: OnTreeCreated() called before constructor completes, leaking the counter on a constructor throw. ✅ Both CreateStateTree and CreateStorageTree now wrap the constructor in try/catch; on throw, OnTreeDisposed() is called and the per-tree reader/writeBatch are disposed before rethrowing.

Positive notes

  • _initialized DCL is correct. Moving _initialized = true after persistence.Clear(), combined with volatile, gives proper release/acquire ordering — a concurrent thread reading _initialized == true outside the lock is guaranteed to see all effects of Clear().
  • SnapshotableMemColumnsDb lock is correct. using Lock.Scope in both CreateSnapshot and AtomicColumnsWriteBatch.Dispose makes multi-column commits atomic w.r.t. snapshot creation without over-complicating the API.
  • _inFlightTrees counter is robustly implemented. Interlocked.Increment/Decrement with Interlocked.Read in the spin-wait is the correct pattern. The finally blocks in FlatSnapStateTree.Dispose / FlatSnapStorageTree.Dispose ensure OnTreeDisposed() is called even on exception; the new constructor-failure path in FlatSnapTrieFactory ensures OnTreeCreated() is always paired.
  • [Explicit] with message correctly prevents accidental CI execution.
  • No [Retry] on the stress reproducer is the right call.
  • Build fix in FlatTreeSyncStoreTests is minimal and correct — adds FlatSnapTrieFactory to the FlatTreeSyncStore constructor call.

@LukaszRozmej
Member Author

@claude will this cause performance regression?

@claude
Contributor

claude Bot commented May 1, 2026

Claude finished @LukaszRozmej's task in 1m 0s —— View job


Performance regression analysis

  • Gather context
  • Read changed files
  • Analyze each fix for performance impact
  • Post findings

Short answer: No meaningful regression on any production path.

Each fix touches a different scope; here's the breakdown:


Fix 1 — SnapshotableMemColumnsDb lock (c14030d)

SnapshotableMemColumnsDb is the in-memory test backend only. Production nodes use RocksDB, which already has native atomic cross-CF snapshots and is not touched by this change. The new Lock _atomicityLock serializes CreateSnapshot against AtomicColumnsWriteBatch.Dispose, but since this class is instantiated exclusively in tests, production block processing sees zero overhead.


Fix 2 — _inFlightTrees Interlocked counter (12a66f1)

Two new Interlocked operations:

  • Interlocked.Increment in OnTreeCreated() — one call per snap range batch (tree creation)
  • Interlocked.Decrement in OnTreeDisposed() — one call per tree disposal

During snap sync, trees are created at the rate of one per AddAccountRange/AddStorageRange call — potentially hundreds of thousands across a full sync, but spread over minutes. An Interlocked.Increment/Decrement is ~5–10 ns on x64; even at 1 M trees, that's ~10 ms total added to a sync that takes hours. Completely negligible.

WaitForInFlightTreesDrained is called exactly once in FinalizeSync. In the normal case (all trees disposed before finalization is triggered) it exits in a single iteration — no spin, no sleep. Even in the pathological case it just waits for real work to finish, which is required for correctness. The 30-second timeout exists for fault detection, not as expected run time.


Fix 3 — volatile bool _initialized reorder (73b160f)

The field is volatile, which prevents the compiler from caching the read in a register. In the fast path (after the one-time Clear() completes), every EnsureDatabaseCleared call is a single volatile load + branch-not-taken. A volatile read costs ~1 ns versus ~0.3 ns for a non-volatile read — a difference far below any measurement threshold at the call rate of "once per tree creation."


Summary table

Fix                    Affected scope              Overhead per tree   Total overhead
Cross-column lock      Test backend only           0 in production     None
Interlocked counter    Snap sync tree lifetime     ~10 ns              ~10 ms over full sync
volatile read          Snap sync tree creation     ~0.7 ns delta       Sub-millisecond over full sync

None of the changes touch the normal block-processing path (RocksDB, state reads/writes, EVM execution). The overhead is confined to snap sync tree creation/disposal, which is a one-time background operation, not a latency-sensitive path.

@LukaszRozmej
Member Author

@claude does this also fix some potential edge case production bugs?

@claude
Contributor

claude Bot commented May 2, 2026

Claude finished @LukaszRozmej's task in 2m 22s —— View job


Production bug analysis

  • Gather context
  • Read production-code changes
  • Assess production impact of each fix
  • Post findings

Yes — two of the three fixes address real production bugs, not just test artifacts. Here's the breakdown:


Fix 1 — SnapshotableMemColumnsDb cross-column atomicity (c14030d) — test-backend only

SnapshotableMemColumnsDb is only instantiated in tests and in PseudoNethermindModule (the test DI module). Production nodes use RocksDbPersistence via FlatWorldStateModule, which wires RocksDbPersistence or FlatInTriePersistence — both backed by RocksDB, which already has atomic cross-column-family snapshots natively. No production node is affected by this fix.


Fix 2 — Drain in-flight ISnapTrees in FinalizeSync (12a66f1) — real production bug

FlatTreeSyncStore.FinalizeSync is production code called by the synchronizer when snap sync completes. Each ISnapTree holds a live IPersistence.IWriteBatch; the writes only reach the underlying persistence when Dispose() is called on the tree. Before this fix, FinalizeSync could return — and the state ID could be bumped and persistence.Flush() called — while trees from recently-processed batches were still mid-Dispose().

Production consequence: A production FlatDB node doing snap sync could declare "sync complete" while some account and storage trie node writes were still queued in non-yet-disposed IWriteBatch instances. Block processing that immediately follows snap sync would then operate on incomplete flat state — account or storage entries missing from the flat DB that the trie still referenced. This could manifest as TrieNodeException, unexpected empty accounts, or incorrect storage reads during the first blocks replayed post-sync.

The 30-second spin-wait is a safety net; in normal operation the drain completes immediately because all trees have already been disposed before FinalizeSync is reached. The overhead is negligible.


Fix 3 — _initialized DCL race in FlatSnapTrieFactory (73b160f) — real production bug, the more serious one

FlatSnapTrieFactory is registered as a singleton (AddSingleton<ISnapTrieFactory, FlatSnapTrieFactory>() in FlatWorldStateModule). SnapSyncFeed is IsMultiFeed = true, meaning HandleResponse is invoked concurrently from multiple peer-handling threads. Each HandleResponse → AddAccountRange / AddStorageRange → SnapProviderHelper → factory.CreateStateTree() / factory.CreateStorageTree() → EnsureDatabaseCleared().

Before this fix, the double-checked lock had this ordering:

_initialized = true;      // published BEFORE Clear() ran
persistence.Clear();       // still queueing Removes for every key it finds

In production, two threads can both be processing their first snap sync responses concurrently. Thread T1 sets _initialized = true and starts Clear(); Thread T2 reads _initialized == true at the fast path, bypasses the lock, and starts writing accounts via SetAccount. T1's Clear() batch is still open and is iterating GetAllKeys() — it sees T2's freshly-committed keys and queues Remove for them. When T1's batch disposes, those accounts are erased from the live DB.

Production consequence: Accounts written during the race window are permanently erased from flat state. The trie still references them, so any attempt to execute a transaction or call that touches those accounts after snap sync will fail (missing flat entry). On Linux this race is less likely due to scheduler behavior (matching the CI test pass rate), but it is not impossible — it is a correctness bug on all platforms under the right timing.

The fix is minimal and correct: _initialized = true now publishes after persistence.Clear() returns inside the lock, and the volatile qualifier gives the release semantics needed for the fast-path read outside the lock to be safe.


Summary

Fix                              Affects production?                        Impact if triggered
Fix 1 — cross-column lock        No — test backend only                     N/A
Fix 2 — drain in-flight trees    Yes — FlatDB snap sync finalization        Flat state incomplete after sync; block replay errors
Fix 3 — _initialized DCL         Yes — FlatDB snap sync first-batch race    Accounts silently erased from flat; transactions fail post-sync

Both Fix 2 and Fix 3 are latent production correctness bugs for any FlatDB node that performs snap sync. Fix 3 is harder to trigger on Linux (scheduler dependent) but carries a more severe consequence (silent data loss). Fix 2 is a simpler ordering gap that the drain closes cleanly. Both are worth having regardless of the test flake.

@github-actions
Contributor

github-actions Bot commented May 2, 2026

EXPB Benchmark Comparison

Run: View workflow run

superblocks

Scenario: nethermind-flat-superblocks-fix-e2e-snapsync-windows-flake-repro-delay0s

Client Processing (SSE)

Metric        PR        Master (cached)   Delta
AVG (ms)      2493.02   1321.84           +88.60%
MEDIAN (ms)   2281.8    1212.7            +88.16%
P90 (ms)      3352.8    1818.5            +84.37%
P95 (ms)      3984.1    2108.3            +88.97%
P99 (ms)      8690.4    6212.5            +39.89%
MIN (ms)      1674.5    830.4             +101.65%
MAX (ms)      8690.4    6212.5            +39.89%

K6 TTFB

Metric        PR        Master (cached)   Delta
AVG (ms)      2709.87   1560.83           +73.62%
MEDIAN (ms)   2434.24   1414.38           +72.11%
P90 (ms)      3519.42   2037.87           +72.70%
P95 (ms)      4056.16   2552.08           +58.94%
P99 (ms)      8144.33   3587.07           +127.05%
MIN (ms)      1806.95   954.79            +89.25%
MAX (ms)      9149.99   6866.15           +33.26%

realblocks

Scenario: nethermind-flat-realblocks-fix-e2e-snapsync-windows-flake-repro-delay0s

Client Processing (SSE)

Metric        PR       Master (cached)   Delta
AVG (ms)      57.62    36.90             +56.15%
MEDIAN (ms)   52.3     30.7              +70.36%
P90 (ms)      92.6     65.9              +40.52%
P95 (ms)      108.1    79.1              +36.66%
P99 (ms)      181.0    134.5             +34.57%
MIN (ms)      0.5      0.5               +0.00%
MAX (ms)      320.1    240.0             +33.38%

K6 TTFB

Metric        PR       Master (cached)   Delta
AVG (ms)      61.82    42.79             +44.47%
MEDIAN (ms)   56.32    36.12             +55.92%
P90 (ms)      97.61    75.05             +30.06%
P95 (ms)      114.06   86.84             +31.35%
P99 (ms)      179.22   136.32            +31.47%
MIN (ms)      2.21     1.16              +90.52%
MAX (ms)      326.77   245.07            +33.34%

Comment thread src/Nethermind/Nethermind.Db/SnapshotableMemColumnsDb.cs Outdated
/// Spin-wait until all trees handed out by this factory have been disposed. Returns true
/// if drained within <paramref name="timeout"/>, false on timeout.
/// </summary>
public bool WaitForInFlightTreesDrained(System.TimeSpan timeout)

Suggested change
public bool WaitForInFlightTreesDrained(System.TimeSpan timeout)
public bool WaitForInFlightTreesDrained(TimeSpan timeout)

Comment thread src/Nethermind/Nethermind.State.Flat/Sync/FlatTreeSyncStore.cs Outdated
Comment on lines +64 to +87
IPersistence.IWriteBatch writeBatch;
try
{
    writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
}
catch
{
    reader.Dispose();
    throw;
}
OnTreeCreated();
try
{
    return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);
}
catch
{
    // Constructor failed: pair the OnTreeCreated above with OnTreeDisposed and release
    // the per-tree resources ourselves, since no FlatSnapStateTree.Dispose will run.
    OnTreeDisposed();
    writeBatch.Dispose();
    reader.Dispose();
    throw;
}

Alternative with single try/catch?

Suggested change
IPersistence.IWriteBatch writeBatch;
try
{
    writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
}
catch
{
    reader.Dispose();
    throw;
}
OnTreeCreated();
try
{
    return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);
}
catch
{
    // Constructor failed: pair the OnTreeCreated above with OnTreeDisposed and release
    // the per-tree resources ourselves, since no FlatSnapStateTree.Dispose will run.
    OnTreeDisposed();
    writeBatch.Dispose();
    reader.Dispose();
    throw;
}
IPersistence.IPersistenceReader? reader = null;
IPersistence.IWriteBatch? writeBatch = null;
bool treeTracked = false;
try
{
    reader = persistence.CreateReader(ReaderFlags.Sync);
    writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
    OnTreeCreated();
    treeTracked = true;
    return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);
}
catch
{
    if (treeTracked) OnTreeDisposed();
    writeBatch?.Dispose();
    reader?.Dispose();
    throw;
}

@alexb5dh
Contributor

alexb5dh commented May 2, 2026

@claude which execution paths / operations can the wait of up to 30 seconds in FinalizeSync/WaitForInFlightTreesDrained affect? Is there a risk of it noticeably affecting performance?

@claude
Contributor

claude Bot commented May 2, 2026

Claude finished @alexb5dh's task in 1m 45s.


Performance impact of WaitForInFlightTreesDrained

  • Gather context
  • Read FlatTreeSyncStore.cs and FlatSnapTrieFactory.cs
  • Trace FinalizeSync call chain
  • Analyze performance impact
  • Post findings

Short answer: no meaningful performance impact in the normal case; at worst a short spin at end-of-snap-sync on a single thread.


What calls FinalizeSync and when

FinalizeSync is called from TreeSync.SaveNode (TreeSync.cs:726) exactly once, when the root node of the state trie is saved — i.e., the very last step of snap sync phase 2 (state-nodes download). At this point:

  • The node is still in sync mode — block processing, EVM, and JSON-RPC are not yet running.
  • Snap phase 1 (account/storage ranges) is already complete, because IsSnapGetRangesFinished must be true before phase 2 proceeds to root-node finalization.
  • Crucially, FinalizeSync is called outside the _stateDbLock.EnterWriteLock() scope — the lock is acquired and released inside the NodeDataType.State branch of the switch, and the if (syncItem.IsRoot) check is after the switch statement. So WaitForInFlightTreesDrained does not hold any lock while it spins.

What is blocked during the spin

Only the one thread that processed the root node response is blocked in the spin. Concurrently:

  • Other HandleResponse calls on other peer-handling threads continue normally (they use a separate _syncStateLock.EnterReadLock() at line 151, which is not held by the spinning thread).
  • _rootSaved stays 0 until FinalizeSync returns, so ValidatePrepareRequest won't flip the feed to dormant yet — but with the root already saved, there are no meaningful new requests pending anyway.
  • _codeDb.Flush() and Interlocked.Exchange(ref _rootSaved, 1) are deferred until FinalizeSync returns.

Normal-case duration of the wait

In the normal flow, by the time SaveNode processes the root, all phase-1 ISnapTree instances have already been disposed (their Dispose was called synchronously at the end of each SnapProviderHelper.AddAccountRange call). So _inFlightTrees is already 0 when WaitForInFlightTreesDrained is called, and the while-loop exits on the very first iteration — effectively free.

The 30-second ceiling only matters if a tree Dispose is genuinely stuck (e.g. waiting on I/O, holding a blocked write batch). In that case, the spin degrades gracefully: SpinWait.SpinOnce() transitions from CPU-level spins (~10 iterations) to Thread.Yield() and then Thread.Sleep(1) — ~1ms ticks, negligible CPU, the thread is essentially parked. If the full 30s elapsed, it would burn ~30,000 wake-ups of 1ms each on one thread-pool thread. At end-of-sync, with no new peer responses coming in, that thread is otherwise idle, so there's no starvation.
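
For reference, a minimal sketch of the counter-plus-drain pattern analyzed above, reconstructed only from the member names quoted in this thread (`OnTreeCreated`, `OnTreeDisposed`, `_inFlightTrees`, `WaitForInFlightTreesDrained`). It is an illustration under those assumptions, written as a hypothetical stand-alone class, not the actual FlatSnapTrieFactory code:

```csharp
using System;
using System.Threading;

// Hypothetical stand-alone illustration; the real members live on FlatSnapTrieFactory
// and may be implemented differently.
internal sealed class InFlightTreeTracker
{
    private int _inFlightTrees;

    public void OnTreeCreated() => Interlocked.Increment(ref _inFlightTrees);

    public void OnTreeDisposed() => Interlocked.Decrement(ref _inFlightTrees);

    // Spin-wait until every tree handed out so far has been disposed.
    // Returns false on timeout; the caller logs a warning in that case.
    public bool WaitForInFlightTreesDrained(TimeSpan timeout)
    {
        long deadline = Environment.TickCount64 + (long)timeout.TotalMilliseconds;
        SpinWait spinner = new();

        while (Volatile.Read(ref _inFlightTrees) != 0)
        {
            if (Environment.TickCount64 >= deadline) return false;
            spinner.SpinOnce(); // escalates: CPU spin -> Thread.Yield() -> Thread.Sleep(1)
        }

        return true; // normal case: the counter is already 0, the loop body never runs
    }
}
```

FinalizeSync would call it as `WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30))`; as the analysis above notes, the common case never enters the loop body.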

What is NOT affected

  • Block processing, EVM execution, state reads — none of this starts until after sync transitions out of SyncMode.StateNodes, which requires _rootSaved == 1, which requires FinalizeSync to return. The wait is entirely within the sync window, not after it.
  • SnapshotableMemColumnsDb changes are test-backend only.
  • The volatile bool _initialized double-checked read and the Interlocked counter ops add negligible overhead at the call frequencies seen during sync.

Regarding the benchmark comment

The ~88% superblocks regression in the benchmark above is almost certainly not caused by this PR. None of the changed code touches the block-processing path: FlatSnapTrieFactory, FlatSnapStateTree, FlatTreeSyncStore, and SnapshotableMemColumnsDb are all snap-sync-only components. The benchmark comparison note says "Master (cached)" — a cached master baseline from a different build/environment pass, which commonly introduces large variance on the block-processing benchmark. If the concern is real, re-running with a fresh master baseline on the same runner would confirm.

Contributor

@asdacap asdacap left a comment

Fix issues 2 and 3 on top of #11102. Just call EnsureInitialize at the start of ISnapSyncRunner.Run and FinalizeSync at the end of ISnapSyncRunner.Run. No lock or _initialized flag needed.
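
A minimal sketch of that runner-driven shape, assuming the hook and type names used later in this thread (`ISnapTrieFactory.EnsureInitialize` / `FinalizeSync`, a `Func<CancellationToken, Task>` standing in for the dispatcher); the implementation that eventually landed in #11472 may differ in signatures:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch only — names follow the discussion in this thread, not the merged code.
public interface ISnapTrieFactory
{
    // Default no-op hooks: only the flat factory needs to override them
    // (clear the flat DB before sync, flush persistence after it).
    void EnsureInitialize() { }
    void FinalizeSync() { }
}

public class SnapSyncRunner(ISnapTrieFactory trieFactory, Func<CancellationToken, Task> runDispatcher)
{
    public async Task Run(CancellationToken token)
    {
        trieFactory.EnsureInitialize();   // once, before any ISnapTree is created
        try
        {
            // the dispatcher returns only after every batch handler
            // (and therefore every ISnapTree.Dispose) has completed
            await runDispatcher(token);
        }
        finally
        {
            trieFactory.FinalizeSync();   // runs even if the dispatcher throws or is cancelled
        }
    }
}
```

Because both hooks run from one sequential entry point, the factory itself needs no internal synchronization.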

- Trim verbose comment on `_atomicityLock` in `SnapshotableMemColumnsDb` to a
  single line.
- Use plain `TimeSpan` (not `System.TimeSpan`) in `WaitForInFlightTreesDrained`;
  re-add `using System;`.
- Drop the duplicate warning at the `FinalizeSync` call site —
  `WaitForInFlightTreesDrained` already logs the timeout from inside; remove the
  now-unused `_logger` field.
- Refactor the `OnTreeCreated` / cleanup-on-throw blocks in
  `CreateStateTree` / `CreateStorageTree` to a single try/catch with `treeTracked`
  flag and nullable refs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukaszRozmej
Member Author

@alexb5dh feedback addressed in 91db203:

  • Trimmed the cross-column-atomicity comment to a single line.
  • WaitForInFlightTreesDrained(TimeSpan timeout) — re-added using System; and dropped the System. qualifier.
  • Removed the duplicate warning at the FinalizeSync call site (the timeout warning is logged inside WaitForInFlightTreesDrained already); also removed the now-unused _logger field.
  • Refactored CreateStateTree / CreateStorageTree to your single-try/catch shape with treeTracked flag and nullable refs.

@asdacap re your CHANGES_REQUESTED — opened #11472 stacked on amirul/simple-sync-dispatcher with the runner-driven shape (EnsureInitialize at start, FinalizeSync at end of SnapSyncRunner.Run, no lock or _initialized flag). Stress repro: 0/180 on Windows. Once #11102 lands I can either close this PR (since #11472 supersedes the runtime fixes) or rebase this PR on top to keep just the cross-column fix and the test reproducer — whichever is easier on your end.

LukaszRozmej added a commit that referenced this pull request May 4, 2026
…Runner.Run (#11472)

* test(snap-sync): add Windows-flake stress reproducer

Adds a 30-iteration stress reproducer for the Windows-only
E2ESyncTests.SnapSync flake observed in CI (run #25139586200,
job #73686059011 on PR #11422).

Each iteration is its own NUnit case so a single failure does not
terminate the rest, no [Retry] so every flake surfaces, and
[Explicit] so it does not run in normal CI. Scoped to DbMode.Flat
where the flake has been observed.

Local reproduction on Windows (~1m runtime, 60 iterations across the
two Flat fixtures): 9-19 failures with both observed CI patterns —
"Verification failed: ... N missing in flat" and
"TrieNodeException: Failed to load key X from root hash Y" during
post-snap-sync block processing.

Run: dotnet test --project Nethermind.Synchronization.Test \
  -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run

Replaces the lazy-init lock + `_initialized` flag in `FlatSnapTrieFactory`
(which raced against concurrent `CreateStateTree`/`CreateStorageTree` calls
and caused `Clear()` to wipe freshly-written accounts) with explicit
sequential lifecycle hooks driven by `SnapSyncRunner.Run`:

- `ISnapTrieFactory.EnsureInitialize()` (default no-op) — called once before
  any tree is created. `FlatSnapTrieFactory` clears the flat DB here.
- `ISnapTrieFactory.FinalizeSync()` (default no-op) — called once after the
  dispatcher has finished. `FlatSnapTrieFactory` flushes the persistence here.
- `SnapSyncRunner.Run` brackets `dispatcher.Run` with both calls.

Because both halves run from a single sequential entry point — and the new
`SimpleDispatcher.Run` only returns once every batch handler (and therefore
every `ISnapTree.Dispose`) has completed — the implementation needs no
internal locking and the `_initialized` flag goes away. This is the in-tree
equivalent of the master-branch fix in #11443; the runner-driven shape
suggested by @asdacap is cleaner because it doesn't depend on the factory
guessing when "the first call" happens.

`PatriciaSnapTrieFactory` and the test factories pick up the empty default
implementations of both new methods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): extract RunSnapSyncOnce shared helper

`SnapSync` and `SnapSync_StressRepro` had verbatim-copied bodies; extract a
private `RunSnapSyncOnce(CancellationToken)` helper and have both call it.
Trim the multi-line stress-repro comment to a single line per coding-style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): SnapSyncRunner lifecycle ordering test + cancel/throw cases

Adds an internal-visible Func<CancellationToken, Task> ctor on SnapSyncRunner
so the lifecycle can be exercised without standing up a real SimpleDispatcher,
and three NSubstitute-based tests:

- happy path: EnsureInitialize → dispatcher → FinalizeSync
- dispatcher throws: FinalizeSync still runs
- dispatcher cancelled: FinalizeSync still runs

Trim the doc comments on the new ISnapTrieFactory hooks to a single line per
the repo's coding-style rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
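
For illustration, a hedged sketch of what one of those ordering tests could look like. The fixture name, the SnapSyncRunner constructor shape, and the ISnapTrieFactory surface are assumptions; only the NSubstitute/NUnit approach and the init → dispatcher → finalize ordering come from the commit message above:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using NSubstitute;
using NUnit.Framework;

[TestFixture]
public class SnapSyncRunnerLifecycleSketch
{
    [Test]
    public async Task Run_calls_EnsureInitialize_then_dispatcher_then_FinalizeSync()
    {
        List<string> calls = new();
        ISnapTrieFactory factory = Substitute.For<ISnapTrieFactory>();
        factory.When(f => f.EnsureInitialize()).Do(_ => calls.Add("init"));
        factory.When(f => f.FinalizeSync()).Do(_ => calls.Add("finalize"));

        // The Func<CancellationToken, Task> overload stands in for a real SimpleDispatcher.
        SnapSyncRunner runner = new(factory, _ =>
        {
            calls.Add("dispatcher");
            return Task.CompletedTask;
        });

        await runner.Run(CancellationToken.None);

        Assert.That(calls, Is.EqualTo(new[] { "init", "dispatcher", "finalize" }));
    }
}
```

The throws/cancelled variants described in the commit would differ only in the dispatcher delegate (throwing an exception or an OperationCanceledException) and in asserting that "finalize" is still recorded.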

* refactor(snap-sync): make SnapSyncRunner Func ctor public

The Func<CancellationToken, Task> overload was internal so the test could
reach it via InternalsVisibleTo. Make it public — it's a legitimate way to
construct the runner, the surface is small, and DI still resolves the
SimpleDispatcher overload because Func<CancellationToken, Task> isn't a
registered service.

Promotes the public ctor pattern to use a primary constructor (matches
the rest of the codebase) and drops the assembly-level
InternalsVisibleTo that was added solely for this class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): collapse SnapSyncRunner ordering tests into a single TestCase

Three near-identical tests (happy / throws / cancels) folded into one
parameterized method with a DispatcherOutcome enum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): switch expression for SnapSyncRunner dispatcher outcome

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukaszRozmej
Member Author

Closing in favor of #11472 and #11102.

asdacap added a commit that referenced this pull request May 4, 2026
* refactor: replace snap+state sync dispatcher with SimpleDispatcher

- Introduce SimpleDispatcher that runs feeds sequentially (snap → state)
  replacing the old parallel SyncDispatcher<T> for snap+state sync.
- Merge SyncMode.SnapSync into SyncMode.StateNodes; sequencing is now
  handled by StateSyncRunner rather than the mode selector.
- Move isCloseToHead gate and DB tuning from mode selector / SyncDbTuner
  into StateSyncRunner. DB tune reset wrapped in try/finally so it runs
  on cancellation.
- StateSyncRunner early-returns when FindBestFullState != 0 (state
  already present); Run wraps RunRound (reset pivot, run feed, verify
  trie) plus the production-only waits.
- StateSyncFeed / SnapSyncFeed now implement ISimpleSyncFeed; return
  null signals round finished, dispatcher drains in-flight requests
  with CancellationToken.None so peer allocations are always freed.
- StateSyncFeed/Downloader/AllocationStrategyFactory registered via DI
  and resolved by StateSyncRunner instead of being new'd per round.
- TreeSync.ValidatePrepareRequest collapsed to bool IsSyncRoundFinished;
  the SyncMode parameter and _syncMode field are gone.
- Remove IsSnapGetRangesFinished from ISyncProgressResolver.
- Test harness SafeContext.RunFeed now calls IStateSyncRunner.RunRound;
  feed-level tests resolve StateSyncFeed / StateSyncDownloader via DI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
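
To make the loop described in the commit above concrete, here is a very reduced sketch of the Run shape. Only the quoted behaviors (sequential feed loop, `null` from PrepareRequest meaning the round is finished, draining in-flight requests without the loop's token) come from the commit; every type and member name below is a hypothetical stand-in, not Nethermind API:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class SimpleDispatcherSketch<T> where T : class
{
    private readonly Func<CancellationToken, Task<T?>> _prepareRequest;   // ~ ISimpleSyncFeed<T>.PrepareRequest
    private readonly Func<T, CancellationToken, Task> _downloadAndHandle; // ~ downloader + feed.HandleResponse

    public SimpleDispatcherSketch(
        Func<CancellationToken, Task<T?>> prepareRequest,
        Func<T, CancellationToken, Task> downloadAndHandle)
    {
        _prepareRequest = prepareRequest;
        _downloadAndHandle = downloadAndHandle;
    }

    public async Task Run(CancellationToken token)
    {
        List<Task> inFlight = new();

        while (!token.IsCancellationRequested)
        {
            T? batch = await _prepareRequest(token);
            if (batch is null) break;                      // null from the feed: round finished

            inFlight.Add(_downloadAndHandle(batch, token));
            inFlight.RemoveAll(t => t.IsCompleted);
        }

        // Drain in-flight requests without observing the loop token (the commit's
        // "CancellationToken.None" drain) so every handler finishes and its peer
        // allocation is released even when the round ended because of cancellation.
        await Task.WhenAll(inFlight);
    }
}
```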

* feat: yield state sync between rounds to MultiSyncModeSelector

Restore the old "pause when mode leaves StateNodes" behaviour at the
round boundary: StateSyncRunner.RunRound now awaits
syncModeSelector.WaitUntilMode(StateNodes) at the top of each iteration,
so beacon control / UpdatingPivot / fast-sync re-entry can claim peers
between rounds. Pivot changes mid-round are still handled by the feed
returning null via IsSyncRoundFinished.

Test infra substitutes ISyncModeSelector with one that reports
StateNodes immediately — the tests drive RunRound directly and shouldn't
depend on the real MultiSyncModeSelector settling.

Mid-round yielding (within a single RunFeed call) is not addressed here;
the feed's PrepareRequest doesn't consult the mode selector to avoid
complicating the test harness. Cost is bounded: pivot moves still end
the round naturally, and all currently-observed mode transitions that
matter (beacon reorg, pivot updates) also move the pivot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: make SimpleDispatcher<T> generic, bind feed/downloader at construction

Push T to the class level and move feed, downloader, allocation strategy
factory, and AllocationContexts to the constructor — each dispatcher is
now bound to the single feed it drives, and its entry point is
Run(CancellationToken) rather than a generic RunFeed<T>(...).

Register SimpleDispatcher<StateSyncBatch?> and SimpleDispatcher<SnapSyncBatch?>
as singletons via factory delegates in the SynchronizerModule (feed +
downloader + strategy factory are also DI-resolved). State/SnapSyncRunner
now inject the dispatcher instead of constructing it per call.

Also fix the stale class summary (no more WaitForMode) and the stale cref
on ISimpleSyncFeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: register and resolve sync feed deps via interfaces

Register StateSyncFeed / SnapSyncFeed as ISimpleSyncFeed<T>, downloaders
as ISyncDownloader<T>, strategy factories as IPeerAllocationStrategyFactory<T>,
and resolve the same interfaces from the SimpleDispatcher factory. Tests
that directly touched the feed likewise resolve ISimpleSyncFeed<StateSyncBatch>
and ISyncDownloader<StateSyncBatch> instead of the concrete types.

Normalize the generic parameter to the non-nullable batch type
(StateSyncBatch / SnapSyncBatch) and add `where T : class` to
ISimpleSyncFeed<T> / SimpleDispatcher<T> so the interface and
implementation generics line up (previously ISimpleSyncFeed<StateSyncBatch?>
mismatched ISyncDownloader<StateSyncBatch>, forcing concrete resolution).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: remove unused StateSync using in StateSyncFeedTests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address claude review findings on snap/state sync dispatcher

- SnapSyncFeed: catch OperationCanceledException before general Exception
  so clean shutdown doesn't log ERROR
- SimpleDispatcher: remove token from Task.Run to prevent semaphore leak
  if cancellation fires between WaitAsync and task scheduling
- StateSyncRunner: replace LINQ with foreach in WaitForCloseToHead
- SafeContext: add Feed property so tests use ctx.Feed instead of
  container.Resolve<ISimpleSyncFeed<StateSyncBatch>>
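
The "remove token from Task.Run" item in the commit message above is a classic SemaphoreSlim pitfall. A small illustrative sketch of why passing the token can leak a slot — all names here are hypothetical, not the real SimpleDispatcher code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class ConcurrencyLimiterSketch
{
    private readonly SemaphoreSlim _slots = new(initialCount: 8);

    public async Task DispatchAsync(Func<Task> handleBatch, CancellationToken token)
    {
        await _slots.WaitAsync(token);

        // If this were Task.Run(work, token) and the token were already cancelled at
        // this point, Task.Run would return a cancelled task without ever invoking the
        // delegate — the finally below would never run and one semaphore slot would leak.
        _ = Task.Run(async () =>
        {
            try
            {
                await handleBatch();
            }
            finally
            {
                _slots.Release();   // always pairs the WaitAsync above
            }
        });
    }
}
```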

* fix: swallow OCE on shutdown; retry transient SnapSyncFeed errors

- StateSyncRunner.Run wraps the body in try/catch (OperationCanceledException)
  when token.IsCancellationRequested, so clean shutdown no longer
  propagates OCE to Synchronizer.StartSnapAndStateSyncComponents'
  ContinueWith handler and logs "State sync failed" at ERROR.
- SnapSyncFeed.PrepareRequest moves try/catch inside the while loop
  with a dedicated catch (OperationCanceledException) returning null,
  matching StateSyncFeed's shape. Transient exceptions from
  _snapProvider.IsFinished() now retry on the next iteration instead of
  permanently aborting snap sync by signalling done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: rename RunRound -> RunStateSyncRounds; extract StateSyncPrecursorWait

- Rename `IStateSyncRunner.RunRound` to `RunStateSyncRounds` (multiple
  rounds, plural).
- Combine the WaitUntilMode(StateNodes) and WaitForCloseToHead calls
  into a single private method `StateSyncPrecursorWait` and use it both
  at the top of `Run` and at the top of every `RunStateSyncRounds`
  iteration. Yields between rounds when MSMS leaves StateNodes or the
  node drifts away from head.
- Test harness: also substitute ISyncProgressResolver and
  IBeaconSyncStrategy so the close-to-head check returns immediately
  (default test peers have HeadNumber=0 which would never satisfy the
  real predicate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop unused ITreeSync interface and SyncCompleted event

The SyncCompleted event was fired by TreeSync but no longer had any
subscribers after VerifyStateOnStateSyncFinished was deleted and the
verify-trie call moved into StateSyncRunner.RunStateSyncRounds.
ITreeSync existed only to expose this event, so remove the whole
surface.

- Delete ITreeSync.cs and SyncCompletedEventArgs.
- Drop the event field and its invocation in TreeSync.VerifyPostSyncCleanUp.
- DI: AddSingleton<ITreeSync, TreeSync>() -> AddSingleton<TreeSync>().
- Drop the now-unused Substitute.For<ITreeSync>() in
  SynchronizerModuleTests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run (#11472)


---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Lukasz Rozmej <lukasz.rozmej@gmail.com>