
fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run #11472

Merged
LukaszRozmej merged 7 commits into amirul/simple-sync-dispatcher from fix/snap-sync-init-finalize-on-runner
May 4, 2026

Conversation

@LukaszRozmej
Member

Summary

Ports the snap-sync flake fix from #11443 onto @asdacap's snap+state dispatcher refactor (#11102), in the cleaner shape he suggested: drive FlatSnapTrieFactory.Clear() from SnapSyncRunner.Run boundaries instead of a lazy EnsureDatabaseCleared lock.

This PR is stacked on top of amirul/simple-sync-dispatcher so it can land alongside #11102.

Background

#11443 root-caused a Windows-only E2ESyncTests.SnapSync flake to a race in FlatSnapTrieFactory.EnsureDatabaseCleared:

if (_initialized) return;          // unlocked fast path
using (_lock.EnterScope())
{
    if (_initialized) return;
    _initialized = true;            // ← published BEFORE Clear() ran
    persistence.Clear();             // ← still iterating GetAllKeys / queueing Removes
}

Thread T2 could read _initialized = true via the unlocked fast path while T1 was still inside Clear(). T2 would then call CreateStateTree and start writing accounts. When T1's Clear batch finally disposed, every queued Remove (including the keys T2 had just committed) would land on the live DB, and the flat DB ended up missing entries the trie still referenced — surfacing as "Account in trie not found in flat" in the verifier.

#11443 fixed it by publishing _initialized after Clear() and marking the field volatile. Validated as 0/300 stress runs vs. 8–15/60 before.
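In shape, the #11443 fix simply reordered the publish (a sketch: field names and `Lock.EnterScope()` come from the snippet above, but the enclosing signature and `persistence` type are reconstructed for illustration):

```csharp
private volatile bool _initialized;          // volatile: no stale reads on the unlocked fast path
private readonly Lock _lock = new();

private void EnsureDatabaseCleared()
{
    if (_initialized) return;                // unlocked fast path
    using (_lock.EnterScope())
    {
        if (_initialized) return;
        persistence.Clear();                 // finish wiping the flat DB first…
        _initialized = true;                 // …then publish, so no thread sees a half-cleared DB
    }
}
```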

What this PR does

@asdacap pointed out that on top of #11102's runner refactor, the lazy-init pattern can go away entirely — the snap-sync run has a single sequential entry/exit point in SnapSyncRunner.Run, so init/cleanup can be hooked there:

  • Add two default-no-op methods to ISnapTrieFactory:
    • EnsureInitialize() — called once before any tree is created.
    • FinalizeSync() — called once after the dispatcher has drained.
  • FlatSnapTrieFactory implements them with persistence.Clear() and persistence.Flush() respectively. The lock and the _initialized flag are deleted.
  • SnapSyncRunner.Run brackets dispatcher.Run(token) with EnsureInitialize (before) and FinalizeSync (in finally).
  • PatriciaSnapTrieFactory, TestSnapTrieFactory, and the other test factories pick up the empty default implementations and don't need changes.

Because SimpleDispatcher.Run (introduced by #11102) only returns once every batch handler — and therefore every ISnapTree.Dispose — has completed, neither hook needs internal locking. This also obviates the in-flight tree drain that #11443 needed on master (commit 12a66f1e1b).
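Put together, the runner-driven lifecycle looks roughly like this (a sketch: member names follow the PR description, but the exact signatures are assumptions, not the committed surface):

```csharp
public interface ISnapTrieFactory
{
    // Default no-ops, so PatriciaSnapTrieFactory and the test factories need no changes.
    void EnsureInitialize() { }
    void FinalizeSync() { }
}

public class SnapSyncRunner(ISnapTrieFactory trieFactory, Func<CancellationToken, Task> dispatcherRun)
{
    public async Task Run(CancellationToken token)
    {
        trieFactory.EnsureInitialize();      // FlatSnapTrieFactory: persistence.Clear()
        try
        {
            await dispatcherRun(token);      // only returns once every batch handler has completed
        }
        finally
        {
            trieFactory.FinalizeSync();      // FlatSnapTrieFactory: persistence.Flush()
        }
    }
}
```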

Stress reproducer

The same SnapSync_StressRepro test from #11443 (cherry-picked here) runs 30 iterations × 2 Flat fixtures = 60 cases per run, with no [Retry] (so every flake surfaces) and marked [Explicit] (so it stays out of normal CI). Local Windows results on this branch:

| Run   | Verification failed | Test failures |
|-------|---------------------|---------------|
| Run 1 | 0                   | 0             |
| Run 2 | 0                   | 0             |
| Run 3 | 0                   | 0             |
| Total | 0/180               | 0/180         |

In addition, the full Flat test suite (292 tests, 4 skipped), RecreateState* (20 tests), and StateSyncFeed* (834 tests) all pass.

Test plan

  • Solution builds clean (0 warnings, 0 errors).
  • Nethermind.State.Flat.Test passes (292 tests).
  • RecreateStateFromAccountRanges / RecreateStateFromStorageRanges pass (20 tests).
  • StateSyncFeed* passes (834 tests).
  • SnapSync_StressRepro: 0 failures across 3 stress runs (180 invocations).

LukaszRozmej and others added 3 commits May 3, 2026 07:43
Adds a 30-iteration stress reproducer for the Windows-only
E2ESyncTests.SnapSync flake observed in CI (run #25139586200,
job #73686059011 on PR #11422).

Each iteration is its own NUnit case so a single failure does not
terminate the rest, no [Retry] so every flake surfaces, and
[Explicit] so it does not run in normal CI. Scoped to DbMode.Flat
where the flake has been observed.

Local reproduction on Windows (~1m runtime, 60 iterations across the
two Flat fixtures): 9-19 failures with both observed CI patterns —
"Verification failed: ... N missing in flat" and
"TrieNodeException: Failed to load key X from root hash Y" during
post-snap-sync block processing.

Run: dotnet test --project Nethermind.Synchronization.Test \
  -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Runner.Run

Replaces the lazy-init lock + `_initialized` flag in `FlatSnapTrieFactory`
(which raced against concurrent `CreateStateTree`/`CreateStorageTree` calls
and caused `Clear()` to wipe freshly-written accounts) with explicit
sequential lifecycle hooks driven by `SnapSyncRunner.Run`:

- `ISnapTrieFactory.EnsureInitialize()` (default no-op) — called once before
  any tree is created. `FlatSnapTrieFactory` clears the flat DB here.
- `ISnapTrieFactory.FinalizeSync()` (default no-op) — called once after the
  dispatcher has finished. `FlatSnapTrieFactory` flushes the persistence here.
- `SnapSyncRunner.Run` brackets `dispatcher.Run` with both calls.

Because both halves run from a single sequential entry point — and the new
`SimpleDispatcher.Run` only returns once every batch handler (and therefore
every `ISnapTree.Dispose`) has completed — the implementation needs no
internal locking and the `_initialized` flag goes away. This is the in-tree
equivalent of the master-branch fix in #11443; the runner-driven shape
suggested by @asdacap is cleaner because it doesn't depend on the factory
guessing when "the first call" happens.

`PatriciaSnapTrieFactory` and the test factories pick up the empty default
implementations of both new methods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`SnapSync` and `SnapSync_StressRepro` had verbatim-copied bodies; extract a
private `RunSnapSyncOnce(CancellationToken)` helper and have both call it.
Trim the multi-line stress-repro comment to a single line per coding-style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@claude
Contributor

claude Bot commented May 3, 2026

Claude finished @LukaszRozmej's task in 3m 53s


PR Review

  • Load repo rules and context
  • Read changed files
  • Analyze correctness and edge cases
  • Check security and performance
  • Post findings

Summary

The fix is correct and well-designed. The core reasoning holds up: since ISnapTree instances are always created and disposed within using blocks in SnapProviderHelper.AddAccountRange/AddStorageRange, they are fully disposed before HandleResponse returns, and SimpleDispatcher.Run waits for all in-flight Task.Run workers via the semaphore drain before returning. This means FinalizeSync is guaranteed to run after every write batch has been committed and disposed.

The lazy-init pattern (_initialized + lock) is cleanly replaced by lifecycle hooks at the single sequential entry/exit point — a meaningfully simpler and safer design.

Critical: 0 | High: 0 | Medium: 0 | Low: 2


Findings

Low — FinalizeSync now also runs on every cancellation/exception path

SnapSyncRunner.Run calls FinalizeSync() in a finally, so persistence.Flush() is called on every exit including early cancellation (e.g. shutdown, pivot change). The old code never flushed on these paths. In practice this is harmless — flushing a partial or empty WAL is idempotent and fast — but it is a behavioural change worth noting for reviewers who benchmark startup/shutdown times.

If a deliberate no-flush-on-cancel behaviour is ever needed, a flag on FinalizeSync can be added later; the current choice (always flush) is the safer default for durability.

Low — No targeted unit test for SnapSyncRunner lifecycle ordering

The E2E stress test covers the concrete race via 180 invocations, but there is no lightweight unit test that asserts EnsureInitialize is called before any CreateStateTree/CreateStorageTree, and FinalizeSync after. A mock ISnapTrieFactory + mock SimpleDispatcher could capture call order cheaply. Optional, given the stress coverage, but would document the contract as executable spec.
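Such an ordering test could be sketched with NSubstitute roughly as follows (the `Func<CancellationToken, Task>` runner ctor added later in this PR makes it cheap; the exact ctor shape used here is an assumption):

```csharp
[Test]
public async Task Run_orders_EnsureInitialize_dispatcher_FinalizeSync()
{
    List<string> calls = new();
    ISnapTrieFactory factory = Substitute.For<ISnapTrieFactory>();
    factory.When(f => f.EnsureInitialize()).Do(_ => calls.Add("init"));
    factory.When(f => f.FinalizeSync()).Do(_ => calls.Add("finalize"));

    SnapSyncRunner runner = new(factory, _ =>
    {
        calls.Add("dispatch");
        return Task.CompletedTask;
    });

    await runner.Run(CancellationToken.None);

    Assert.That(calls, Is.EqualTo(new[] { "init", "dispatch", "finalize" }));
}
```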


Code review notes

SimpleDispatcher.Run drain correctness (lines 80-81):

for (int i = 0; i < maxThreads; i++)
    await semaphore.WaitAsync(CancellationToken.None);

Each started Task.Run holds exactly one semaphore slot and releases it in finally. When semaphore.WaitAsync(token) at line 64 is cancelled the slot is not acquired so no release is needed. The drain therefore correctly accounts for all and only started tasks — confirming dispatcher.Run only returns after every HandleResponse (and therefore every ISnapTree.Dispose) has completed.
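The start/drain discipline can be modeled in isolation (a self-contained sketch of the pattern, not the actual SimpleDispatcher source; cancellation mid-loop is left unhandled here for brevity):

```csharp
async Task RunBatchesAsync(IReadOnlyList<Func<Task>> handlers, int maxThreads, CancellationToken token)
{
    using SemaphoreSlim semaphore = new(maxThreads, maxThreads);

    foreach (Func<Task> handler in handlers)
    {
        await semaphore.WaitAsync(token);    // cancelled here => slot never taken, nothing to release
        _ = Task.Run(async () =>
        {
            try { await handler(); }
            finally { semaphore.Release(); } // every started worker releases exactly once
        });
    }

    // Drain: reacquire all maxThreads slots, so this method returns
    // only after every started worker has run its finally block.
    for (int i = 0; i < maxThreads; i++)
        await semaphore.WaitAsync(CancellationToken.None);
}
```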

FlatSnapStateTree.Dispose disposes _writeBatch (line 77) before control returns to SnapProviderHelper, which returns before HandleResponse returns, which returns before the semaphore is released, which happens before the drain unblocks dispatcher.Run. So persistence.Flush() in FinalizeSync is always called after every write batch is closed — no partially-open batches can exist.

DI: the new ISnapTrieFactory constructor parameter in SnapSyncRunner is already registered as a singleton in both Synchronizer.cs (via PatriciaSnapTrieFactory) and FlatWorldStateModule.cs (via FlatSnapTrieFactory), so no DI wiring changes are needed.

The [Explicit] stress reproducer won't run in CI unless manually triggered — correct for a load test. The _ = iteration; discard keeps the parameter in the NUnit test name (e.g. SnapSync_StressRepro(5)) while suppressing the unused-parameter warning; a comment explains the intent.

…w cases

Adds an internal-visible Func<CancellationToken, Task> ctor on SnapSyncRunner
so the lifecycle can be exercised without standing up a real SimpleDispatcher,
and three NSubstitute-based tests:

- happy path: EnsureInitialize → dispatcher → FinalizeSync
- dispatcher throws: FinalizeSync still runs
- dispatcher cancelled: FinalizeSync still runs

Trim the doc comments on the new ISnapTrieFactory hooks to a single line per
the repo's coding-style rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukaszRozmej
Member Author

Both Low findings addressed in 0a94bfd:

  • Lifecycle ordering test: added SnapSyncRunnerTests with 3 cases (happy path, dispatcher throws, dispatcher cancelled) — all assert EnsureInitialize → dispatcher → FinalizeSync ordering. Required exposing an internal Func<CancellationToken, Task> ctor on SnapSyncRunner so the runner can be exercised without standing up a real SimpleDispatcher. InternalsVisibleTo("Nethermind.Synchronization.Test") already in scope.
  • FinalizeSync on cancel/throw: kept the always-flush behaviour (idempotent, durability-safer default as you noted) and added a one-line comment in the finally block making the intent explicit. Cancel and throw paths are both covered by the new tests.

Also trimmed the multi-line XML doc comments on the new ISnapTrieFactory hooks to single-line comments per the repo's coding-style rule.
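A parameterized shape for these three cases could look like the following (an illustrative reconstruction, again assuming the Func-based ctor, not the committed test):

```csharp
public enum DispatcherOutcome { Completes, Throws, Cancels }

[TestCase(DispatcherOutcome.Completes)]
[TestCase(DispatcherOutcome.Throws)]
[TestCase(DispatcherOutcome.Cancels)]
public async Task FinalizeSync_runs_on_every_dispatcher_outcome(DispatcherOutcome outcome)
{
    ISnapTrieFactory factory = Substitute.For<ISnapTrieFactory>();
    Func<CancellationToken, Task> dispatcher = outcome switch
    {
        DispatcherOutcome.Completes => _ => Task.CompletedTask,
        DispatcherOutcome.Throws => _ => Task.FromException(new InvalidOperationException()),
        DispatcherOutcome.Cancels => _ => Task.FromCanceled(new CancellationToken(canceled: true)),
        _ => throw new ArgumentOutOfRangeException(nameof(outcome)),
    };

    SnapSyncRunner runner = new(factory, dispatcher);
    try { await runner.Run(CancellationToken.None); } catch { /* throws/cancels are expected */ }

    factory.Received(1).EnsureInitialize();
    factory.Received(1).FinalizeSync();
}
```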

LukaszRozmej and others added 3 commits May 3, 2026 08:27
The Func<CancellationToken, Task> overload was internal so the test could
reach it via InternalsVisibleTo. Make it public — it's a legitimate way to
construct the runner, the surface is small, and DI still resolves the
SimpleDispatcher overload because Func<CancellationToken, Task> isn't a
registered service.

Promotes the public ctor pattern to use a primary constructor (matches
the rest of the codebase) and drops the assembly-level
InternalsVisibleTo that was added solely for this class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… TestCase

Three near-identical tests (happy / throws / cancels) folded into one
parameterized method with a DispatcherOutcome enum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukaszRozmej LukaszRozmej merged commit 0658ccf into amirul/simple-sync-dispatcher May 4, 2026
411 checks passed
@LukaszRozmej LukaszRozmej deleted the fix/snap-sync-init-finalize-on-runner branch May 4, 2026 08:20
asdacap added a commit that referenced this pull request May 4, 2026
* refactor: replace snap+state sync dispatcher with SimpleDispatcher

- Introduce SimpleDispatcher that runs feeds sequentially (snap → state)
  replacing the old parallel SyncDispatcher<T> for snap+state sync.
- Merge SyncMode.SnapSync into SyncMode.StateNodes; sequencing is now
  handled by StateSyncRunner rather than the mode selector.
- Move isCloseToHead gate and DB tuning from mode selector / SyncDbTuner
  into StateSyncRunner. DB tune reset wrapped in try/finally so it runs
  on cancellation.
- StateSyncRunner early-returns when FindBestFullState != 0 (state
  already present); Run wraps RunRound (reset pivot, run feed, verify
  trie) plus the production-only waits.
- StateSyncFeed / SnapSyncFeed now implement ISimpleSyncFeed; return
  null signals round finished, dispatcher drains in-flight requests
  with CancellationToken.None so peer allocations are always freed.
- StateSyncFeed/Downloader/AllocationStrategyFactory registered via DI
  and resolved by StateSyncRunner instead of being new'd per round.
- TreeSync.ValidatePrepareRequest collapsed to bool IsSyncRoundFinished;
  the SyncMode parameter and _syncMode field are gone.
- Remove IsSnapGetRangesFinished from ISyncProgressResolver.
- Test harness SafeContext.RunFeed now calls IStateSyncRunner.RunRound;
  feed-level tests resolve StateSyncFeed / StateSyncDownloader via DI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: yield state sync between rounds to MultiSyncModeSelector

Restore the old "pause when mode leaves StateNodes" behaviour at the
round boundary: StateSyncRunner.RunRound now awaits
syncModeSelector.WaitUntilMode(StateNodes) at the top of each iteration,
so beacon control / UpdatingPivot / fast-sync re-entry can claim peers
between rounds. Pivot changes mid-round are still handled by the feed
returning null via IsSyncRoundFinished.

Test infra substitutes ISyncModeSelector with one that reports
StateNodes immediately — the tests drive RunRound directly and shouldn't
depend on the real MultiSyncModeSelector settling.

Mid-round yielding (within a single RunFeed call) is not addressed here;
the feed's PrepareRequest doesn't consult the mode selector to avoid
complicating the test harness. Cost is bounded: pivot moves still end
the round naturally, and all currently-observed mode transitions that
matter (beacon reorg, pivot updates) also move the pivot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: make SimpleDispatcher<T> generic, bind feed/downloader at construction

Push T to the class level and move feed, downloader, allocation strategy
factory, and AllocationContexts to the constructor — each dispatcher is
now bound to the single feed it drives, and its entry point is
Run(CancellationToken) rather than a generic RunFeed<T>(...).

Register SimpleDispatcher<StateSyncBatch?> and SimpleDispatcher<SnapSyncBatch?>
as singletons via factory delegates in the SynchronizerModule (feed +
downloader + strategy factory are also DI-resolved). State/SnapSyncRunner
now inject the dispatcher instead of constructing it per call.

Also fix the stale class summary (no more WaitForMode) and the stale cref
on ISimpleSyncFeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: register and resolve sync feed deps via interfaces

Register StateSyncFeed / SnapSyncFeed as ISimpleSyncFeed<T>, downloaders
as ISyncDownloader<T>, strategy factories as IPeerAllocationStrategyFactory<T>,
and resolve the same interfaces from the SimpleDispatcher factory. Tests
that directly touched the feed likewise resolve ISimpleSyncFeed<StateSyncBatch>
and ISyncDownloader<StateSyncBatch> instead of the concrete types.

Normalize the generic parameter to the non-nullable batch type
(StateSyncBatch / SnapSyncBatch) and add `where T : class` to
ISimpleSyncFeed<T> / SimpleDispatcher<T> so the interface and
implementation generics line up (previously ISimpleSyncFeed<StateSyncBatch?>
mismatched ISyncDownloader<StateSyncBatch>, forcing concrete resolution).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: remove unused StateSync using in StateSyncFeedTests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address claude review findings on snap/state sync dispatcher

- SnapSyncFeed: catch OperationCanceledException before general Exception
  so clean shutdown doesn't log ERROR
- SimpleDispatcher: remove token from Task.Run to prevent semaphore leak
  if cancellation fires between WaitAsync and task scheduling
- StateSyncRunner: replace LINQ with foreach in WaitForCloseToHead
- SafeContext: add Feed property so tests use ctx.Feed instead of
  container.Resolve<ISimpleSyncFeed<StateSyncBatch>>

* fix: swallow OCE on shutdown; retry transient SnapSyncFeed errors

- StateSyncRunner.Run wraps the body in try/catch (OperationCanceledException)
  when token.IsCancellationRequested, so clean shutdown no longer
  propagates OCE to Synchronizer.StartSnapAndStateSyncComponents'
  ContinueWith handler and logs "State sync failed" at ERROR.
- SnapSyncFeed.PrepareRequest moves try/catch inside the while loop
  with a dedicated catch (OperationCanceledException) returning null,
  matching StateSyncFeed's shape. Transient exceptions from
  _snapProvider.IsFinished() now retry on the next iteration instead of
  permanently aborting snap sync by signalling done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: rename RunRound -> RunStateSyncRounds; extract StateSyncPrecursorWait

- Rename `IStateSyncRunner.RunRound` to `RunStateSyncRounds` (multiple
  rounds, plural).
- Combine the WaitUntilMode(StateNodes) and WaitForCloseToHead calls
  into a single private method `StateSyncPrecursorWait` and use it both
  at the top of `Run` and at the top of every `RunStateSyncRounds`
  iteration. Yields between rounds when MSMS leaves StateNodes or the
  node drifts away from head.
- Test harness: also substitute ISyncProgressResolver and
  IBeaconSyncStrategy so the close-to-head check returns immediately
  (default test peers have HeadNumber=0 which would never satisfy the
  real predicate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop unused ITreeSync interface and SyncCompleted event

The SyncCompleted event was fired by TreeSync but no longer had any
subscribers after VerifyStateOnStateSyncFinished was deleted and the
verify-trie call moved into StateSyncRunner.RunStateSyncRounds.
ITreeSync existed only to expose this event, so remove the whole
surface.

- Delete ITreeSync.cs and SyncCompletedEventArgs.
- Drop the event field and its invocation in TreeSync.VerifyPostSyncCleanUp.
- DI: AddSingleton<ITreeSync, TreeSync>() -> AddSingleton<TreeSync>().
- Drop the now-unused Substitute.For<ITreeSync>() in
  SynchronizerModuleTests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run (#11472)

* test(snap-sync): add Windows-flake stress reproducer

Adds a 30-iteration stress reproducer for the Windows-only
E2ESyncTests.SnapSync flake observed in CI (run #25139586200,
job #73686059011 on PR #11422).

Each iteration is its own NUnit case so a single failure does not
terminate the rest, no [Retry] so every flake surfaces, and
[Explicit] so it does not run in normal CI. Scoped to DbMode.Flat
where the flake has been observed.

Local reproduction on Windows (~1m runtime, 60 iterations across the
two Flat fixtures): 9-19 failures with both observed CI patterns —
"Verification failed: ... N missing in flat" and
"TrieNodeException: Failed to load key X from root hash Y" during
post-snap-sync block processing.

Run: dotnet test --project Nethermind.Synchronization.Test \
  -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run

Replaces the lazy-init lock + `_initialized` flag in `FlatSnapTrieFactory`
(which raced against concurrent `CreateStateTree`/`CreateStorageTree` calls
and caused `Clear()` to wipe freshly-written accounts) with explicit
sequential lifecycle hooks driven by `SnapSyncRunner.Run`:

- `ISnapTrieFactory.EnsureInitialize()` (default no-op) — called once before
  any tree is created. `FlatSnapTrieFactory` clears the flat DB here.
- `ISnapTrieFactory.FinalizeSync()` (default no-op) — called once after the
  dispatcher has finished. `FlatSnapTrieFactory` flushes the persistence here.
- `SnapSyncRunner.Run` brackets `dispatcher.Run` with both calls.

Because both halves run from a single sequential entry point — and the new
`SimpleDispatcher.Run` only returns once every batch handler (and therefore
every `ISnapTree.Dispose`) has completed — the implementation needs no
internal locking and the `_initialized` flag goes away. This is the in-tree
equivalent of the master-branch fix in #11443; the runner-driven shape
suggested by @asdacap is cleaner because it doesn't depend on the factory
guessing when "the first call" happens.

`PatriciaSnapTrieFactory` and the test factories pick up the empty default
implementations of both new methods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): extract RunSnapSyncOnce shared helper

`SnapSync` and `SnapSync_StressRepro` had verbatim-copied bodies; extract a
private `RunSnapSyncOnce(CancellationToken)` helper and have both call it.
Trim the multi-line stress-repro comment to a single line per coding-style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): SnapSyncRunner lifecycle ordering test + cancel/throw cases

Adds an internal-visible Func<CancellationToken, Task> ctor on SnapSyncRunner
so the lifecycle can be exercised without standing up a real SimpleDispatcher,
and three NSubstitute-based tests:

- happy path: EnsureInitialize → dispatcher → FinalizeSync
- dispatcher throws: FinalizeSync still runs
- dispatcher cancelled: FinalizeSync still runs

Trim the doc comments on the new ISnapTrieFactory hooks to a single line per
the repo's coding-style rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(snap-sync): make SnapSyncRunner Func ctor public

The Func<CancellationToken, Task> overload was internal so the test could
reach it via InternalsVisibleTo. Make it public — it's a legitimate way to
construct the runner, the surface is small, and DI still resolves the
SimpleDispatcher overload because Func<CancellationToken, Task> isn't a
registered service.

Promotes the public ctor pattern to use a primary constructor (matches
the rest of the codebase) and drops the assembly-level
InternalsVisibleTo that was added solely for this class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): collapse SnapSyncRunner ordering tests into a single TestCase

Three near-identical tests (happy / throws / cancels) folded into one
parameterized method with a DispatcherOutcome enum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): switch expression for SnapSyncRunner dispatcher outcome

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Lukasz Rozmej <lukasz.rozmej@gmail.com>
