fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run #11472
Conversation
test(snap-sync): add Windows-flake stress reproducer

Adds a 30-iteration stress reproducer for the Windows-only E2ESyncTests.SnapSync flake observed in CI (run #25139586200, job #73686059011 on PR #11422). Each iteration is its own NUnit case so a single failure does not terminate the rest; there is no [Retry], so every flake surfaces; and it is [Explicit], so it does not run in normal CI. Scoped to DbMode.Flat, where the flake has been observed.

Local reproduction on Windows (~1m runtime, 60 iterations across the two Flat fixtures): 9-19 failures with both observed CI patterns — "Verification failed: ... N missing in flat" and "TrieNodeException: Failed to load key X from root hash Y" during post-snap-sync block processing.

Run: dotnet test --project Nethermind.Synchronization.Test -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run

Replaces the lazy-init lock + `_initialized` flag in `FlatSnapTrieFactory` (which raced against concurrent `CreateStateTree`/`CreateStorageTree` calls and caused `Clear()` to wipe freshly-written accounts) with explicit sequential lifecycle hooks driven by `SnapSyncRunner.Run`:

- `ISnapTrieFactory.EnsureInitialize()` (default no-op) — called once before any tree is created. `FlatSnapTrieFactory` clears the flat DB here.
- `ISnapTrieFactory.FinalizeSync()` (default no-op) — called once after the dispatcher has finished. `FlatSnapTrieFactory` flushes the persistence here.
- `SnapSyncRunner.Run` brackets `dispatcher.Run` with both calls.

Because both halves run from a single sequential entry point — and the new `SimpleDispatcher.Run` only returns once every batch handler (and therefore every `ISnapTree.Dispose`) has completed — the implementation needs no internal locking and the `_initialized` flag goes away.

This is the in-tree equivalent of the master-branch fix in #11443; the runner-driven shape suggested by @asdacap is cleaner because it doesn't depend on the factory guessing when "the first call" happens. `PatriciaSnapTrieFactory` and the test factories pick up the empty default implementations of both new methods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
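The bracketing described above can be sketched as a minimal, dependency-free C# program. Assumed simplifications: `RecordingFactory`, `Demo`, `RunAsync`, and the console output are illustrative stand-ins, not the real Nethermind types; only the hook names come from the commit message.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Default interface methods give Patricia-style factories free no-ops.
public interface ISnapTrieFactory
{
    void EnsureInitialize() { } // default no-op; the flat factory clears its DB here
    void FinalizeSync() { }     // default no-op; the flat factory flushes here
}

// Stand-in for FlatSnapTrieFactory that just records the calls.
public sealed class RecordingFactory : ISnapTrieFactory
{
    public void EnsureInitialize() => Console.WriteLine("EnsureInitialize");
    public void FinalizeSync() => Console.WriteLine("FinalizeSync");
}

public static class Demo
{
    // The runner brackets the dispatcher: init before any tree is created,
    // finalize in finally so it runs even on fault or cancellation.
    public static async Task RunAsync(ISnapTrieFactory factory,
                                      Func<CancellationToken, Task> dispatch,
                                      CancellationToken token)
    {
        factory.EnsureInitialize();
        try
        {
            await dispatch(token);
        }
        finally
        {
            factory.FinalizeSync();
        }
    }

    public static async Task Main()
    {
        await RunAsync(new RecordingFactory(),
                       _ => { Console.WriteLine("dispatcher.Run"); return Task.CompletedTask; },
                       CancellationToken.None);
    }
}
```

Because the two hooks are only ever reached from this single sequential entry point, neither needs its own locking.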
test(snap-sync): extract RunSnapSyncOnce shared helper

`SnapSync` and `SnapSync_StressRepro` had verbatim-copied bodies; extract a private `RunSnapSyncOnce(CancellationToken)` helper and have both call it. Trim the multi-line stress-repro comment to a single line per coding-style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude finished @LukaszRozmej's task in 3m 53s (PR Review).
test(snap-sync): SnapSyncRunner lifecycle ordering test + cancel/throw cases

Adds an internal-visible Func<CancellationToken, Task> ctor on SnapSyncRunner so the lifecycle can be exercised without standing up a real SimpleDispatcher, and three NSubstitute-based tests:

- happy path: EnsureInitialize → dispatcher → FinalizeSync
- dispatcher throws: FinalizeSync still runs
- dispatcher cancelled: FinalizeSync still runs

Trim the doc comments on the new ISnapTrieFactory hooks to a single line per the repo's coding-style rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
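The ordering those tests assert can be shown without NSubstitute. The following hand-rolled sketch uses hypothetical names (`Run`, `Record`, the recorded call list); the inline `Run` stands in for the runner body, and only the throw and cancel cases from the commit are exercised.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class LifecycleOrderingSketch
{
    // Same bracket shape as the runner: FinalizeSync sits in finally, so it
    // must be recorded even when the dispatcher delegate faults or cancels.
    private static async Task Run(List<string> calls,
                                  Func<CancellationToken, Task> dispatch,
                                  CancellationToken token)
    {
        calls.Add("EnsureInitialize");
        try { await dispatch(token); }
        finally { calls.Add("FinalizeSync"); }
    }

    private static async Task<string> Record(Func<CancellationToken, Task> dispatch)
    {
        var calls = new List<string>();
        try { await Run(calls, dispatch, CancellationToken.None); }
        catch (Exception) { /* the fault propagates, but only after FinalizeSync */ }
        return string.Join(",", calls);
    }

    public static async Task Main()
    {
        Console.WriteLine("throws: " + await Record(
            _ => Task.FromException(new InvalidOperationException("boom"))));
        Console.WriteLine("cancelled: " + await Record(
            _ => Task.FromCanceled(new CancellationToken(canceled: true))));
    }
}
```

In the real tests the factory is an NSubstitute substitute and the ordering is asserted on the received calls; the invariant checked is the same.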
Both Low findings addressed in 0a94bfd. Also trimmed the multi-line XML doc comments on the new `ISnapTrieFactory` hooks.
refactor(snap-sync): make SnapSyncRunner Func ctor public

The Func<CancellationToken, Task> overload was internal so the test could reach it via InternalsVisibleTo. Make it public — it's a legitimate way to construct the runner, the surface is small, and DI still resolves the SimpleDispatcher overload because Func<CancellationToken, Task> isn't a registered service.

Promote the public ctor to a primary constructor (matches the rest of the codebase) and drop the assembly-level InternalsVisibleTo that was added solely for this class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
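A minimal sketch of that constructor shape, under assumed simplified signatures: `SimpleDispatcher` below is a stub standing in for the real type, and `CtorDemo` plus its output are illustrative only.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stub standing in for the real dispatcher; only Run matters here.
public class SimpleDispatcher
{
    public virtual Task Run(CancellationToken token) => Task.CompletedTask;
}

// Primary constructor takes the Func overload; the DI-facing overload
// delegates to it via the dispatcher's Run method group. DI keeps resolving
// the SimpleDispatcher overload because Func<CancellationToken, Task> is not
// a registered service.
public class SnapSyncRunner(Func<CancellationToken, Task> dispatch)
{
    public SnapSyncRunner(SimpleDispatcher dispatcher) : this(dispatcher.Run) { }

    public Task Run(CancellationToken token) => dispatch(token);
}

public static class CtorDemo
{
    public static async Task Main()
    {
        // Tests can inject a delegate without standing up a real dispatcher.
        var runner = new SnapSyncRunner(_ =>
        {
            Console.WriteLine("fake dispatcher ran");
            return Task.CompletedTask;
        });
        await runner.Run(CancellationToken.None);
    }
}
```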
test(snap-sync): collapse SnapSyncRunner ordering tests into a single TestCase

Three near-identical tests (happy / throws / cancels) folded into one parameterized method with a DispatcherOutcome enum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
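The fold can be sketched as follows. `DispatcherOutcome` is named in the commit; `DispatchFor`, `OutcomeSketch`, and the console loop are illustrative assumptions (the real tests drive this through NUnit `[TestCase]` and NSubstitute rather than a self-contained `Main`).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum DispatcherOutcome { Completes, Throws, Cancelled }

public static class OutcomeSketch
{
    // One switch expression maps the test parameter to a dispatcher stub.
    private static Func<CancellationToken, Task> DispatchFor(DispatcherOutcome outcome) => outcome switch
    {
        DispatcherOutcome.Completes => (CancellationToken _) => Task.CompletedTask,
        DispatcherOutcome.Throws => (CancellationToken _) => Task.FromException(new InvalidOperationException("boom")),
        DispatcherOutcome.Cancelled => (CancellationToken _) => Task.FromCanceled(new CancellationToken(canceled: true)),
        _ => throw new ArgumentOutOfRangeException(nameof(outcome)),
    };

    public static async Task Main()
    {
        foreach (DispatcherOutcome outcome in Enum.GetValues<DispatcherOutcome>())
        {
            bool finalized = false;
            try
            {
                // Same bracket as the runner: finally must run for every outcome.
                try { await DispatchFor(outcome)(CancellationToken.None); }
                finally { finalized = true; }
            }
            catch (Exception) { /* Throws and Cancelled propagate to here */ }
            Console.WriteLine($"{outcome}: finalized={finalized}");
        }
    }
}
```

One enum parameter keeps the three scenarios in a single method body while still exercising each dispatcher behaviour.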
test(snap-sync): switch expression for SnapSyncRunner dispatcher outcome

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor: replace snap+state sync dispatcher with SimpleDispatcher

  - Introduce SimpleDispatcher that runs feeds sequentially (snap → state), replacing the old parallel SyncDispatcher<T> for snap+state sync.
  - Merge SyncMode.SnapSync into SyncMode.StateNodes; sequencing is now handled by StateSyncRunner rather than the mode selector.
  - Move the isCloseToHead gate and DB tuning from the mode selector / SyncDbTuner into StateSyncRunner. The DB tune reset is wrapped in try/finally so it runs on cancellation.
  - StateSyncRunner early-returns when FindBestFullState != 0 (state already present); Run wraps RunRound (reset pivot, run feed, verify trie) plus the production-only waits.
  - StateSyncFeed / SnapSyncFeed now implement ISimpleSyncFeed; returning null signals the round is finished, and the dispatcher drains in-flight requests with CancellationToken.None so peer allocations are always freed.
  - StateSyncFeed / Downloader / AllocationStrategyFactory are registered via DI and resolved by StateSyncRunner instead of being new'd per round.
  - TreeSync.ValidatePrepareRequest collapsed to bool IsSyncRoundFinished; the SyncMode parameter and _syncMode field are gone.
  - Remove IsSnapGetRangesFinished from ISyncProgressResolver.
  - Test harness SafeContext.RunFeed now calls IStateSyncRunner.RunRound; feed-level tests resolve StateSyncFeed / StateSyncDownloader via DI.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: yield state sync between rounds to MultiSyncModeSelector

  Restore the old "pause when mode leaves StateNodes" behaviour at the round boundary: StateSyncRunner.RunRound now awaits syncModeSelector.WaitUntilMode(StateNodes) at the top of each iteration, so beacon control / UpdatingPivot / fast-sync re-entry can claim peers between rounds. Pivot changes mid-round are still handled by the feed returning null via IsSyncRoundFinished.

  Test infra substitutes ISyncModeSelector with one that reports StateNodes immediately — the tests drive RunRound directly and shouldn't depend on the real MultiSyncModeSelector settling.

  Mid-round yielding (within a single RunFeed call) is not addressed here; the feed's PrepareRequest doesn't consult the mode selector, to avoid complicating the test harness. The cost is bounded: pivot moves still end the round naturally, and all currently-observed mode transitions that matter (beacon reorg, pivot updates) also move the pivot.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: make SimpleDispatcher<T> generic, bind feed/downloader at construction

  Push T to the class level and move the feed, downloader, allocation strategy factory, and AllocationContexts to the constructor — each dispatcher is now bound to the single feed it drives, and its entry point is Run(CancellationToken) rather than a generic RunFeed<T>(...).

  Register SimpleDispatcher<StateSyncBatch?> and SimpleDispatcher<SnapSyncBatch?> as singletons via factory delegates in the SynchronizerModule (feed + downloader + strategy factory are also DI-resolved). State/SnapSyncRunner now inject the dispatcher instead of constructing it per call.

  Also fix the stale class summary (no more WaitForMode) and the stale cref on ISimpleSyncFeed.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: register and resolve sync feed deps via interfaces

  Register StateSyncFeed / SnapSyncFeed as ISimpleSyncFeed<T>, downloaders as ISyncDownloader<T>, strategy factories as IPeerAllocationStrategyFactory<T>, and resolve the same interfaces from the SimpleDispatcher factory. Tests that directly touched the feed likewise resolve ISimpleSyncFeed<StateSyncBatch> and ISyncDownloader<StateSyncBatch> instead of the concrete types.

  Normalize the generic parameter to the non-nullable batch type (StateSyncBatch / SnapSyncBatch) and add `where T : class` to ISimpleSyncFeed<T> / SimpleDispatcher<T> so the interface and implementation generics line up (previously ISimpleSyncFeed<StateSyncBatch?> mismatched ISyncDownloader<StateSyncBatch>, forcing concrete resolution).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: remove unused StateSync using in StateSyncFeedTests

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address claude review findings on snap/state sync dispatcher

  - SnapSyncFeed: catch OperationCanceledException before the general Exception so clean shutdown doesn't log ERROR.
  - SimpleDispatcher: remove the token from Task.Run to prevent a semaphore leak if cancellation fires between WaitAsync and task scheduling.
  - StateSyncRunner: replace LINQ with foreach in WaitForCloseToHead.
  - SafeContext: add a Feed property so tests use ctx.Feed instead of container.Resolve<ISimpleSyncFeed<StateSyncBatch>>.

* fix: swallow OCE on shutdown; retry transient SnapSyncFeed errors

  - StateSyncRunner.Run wraps the body in try/catch (OperationCanceledException) when token.IsCancellationRequested, so clean shutdown no longer propagates OCE to Synchronizer.StartSnapAndStateSyncComponents' ContinueWith handler and logs "State sync failed" at ERROR.
  - SnapSyncFeed.PrepareRequest moves try/catch inside the while loop with a dedicated catch (OperationCanceledException) returning null, matching StateSyncFeed's shape. Transient exceptions from _snapProvider.IsFinished() now retry on the next iteration instead of permanently aborting snap sync by signalling done.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: rename RunRound -> RunStateSyncRounds; extract StateSyncPrecursorWait

  - Rename `IStateSyncRunner.RunRound` to `RunStateSyncRounds` (multiple rounds, plural).
  - Combine the WaitUntilMode(StateNodes) and WaitForCloseToHead calls into a single private method `StateSyncPrecursorWait` and use it both at the top of `Run` and at the top of every `RunStateSyncRounds` iteration. Yields between rounds when MSMS leaves StateNodes or the node drifts away from head.
  - Test harness: also substitute ISyncProgressResolver and IBeaconSyncStrategy so the close-to-head check returns immediately (default test peers have HeadNumber=0, which would never satisfy the real predicate).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop unused ITreeSync interface and SyncCompleted event

  The SyncCompleted event was fired by TreeSync but no longer had any subscribers after VerifyStateOnStateSyncFinished was deleted and the verify-trie call moved into StateSyncRunner.RunStateSyncRounds. ITreeSync existed only to expose this event, so remove the whole surface.

  - Delete ITreeSync.cs and SyncCompletedEventArgs.
  - Drop the event field and its invocation in TreeSync.VerifyPostSyncCleanUp.
  - DI: AddSingleton<ITreeSync, TreeSync>() -> AddSingleton<TreeSync>().
  - Drop the now-unused Substitute.For<ITreeSync>() in SynchronizerModuleTests.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run (#11472)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Lukasz Rozmej <lukasz.rozmej@gmail.com>

Summary
Ports the snap-sync flake fix from #11443 onto @asdacap's snap+state dispatcher refactor (#11102), in the cleaner shape he suggested: drive `FlatSnapTrieFactory.Clear()` from `SnapSyncRunner.Run` boundaries instead of a lazy `EnsureDatabaseCleared` lock.

This PR is stacked on top of `amirul/simple-sync-dispatcher` so it can land alongside #11102.

Background
#11443 root-caused a Windows-only `E2ESyncTests.SnapSync` flake to a race in `FlatSnapTrieFactory.EnsureDatabaseCleared`:

Thread T2 could read `_initialized = true` via the unlocked fast path while T1 was still inside `Clear()`. T2 would then call `CreateStateTree` and start writing accounts. When T1's `Clear` batch finally disposed, every queued `Remove` (including the keys T2 had just committed) would land on the live DB, and the flat DB ended up missing entries the trie still referenced — surfacing as "Account in trie not found in flat" in the verifier.

#11443 fixed it by publishing `_initialized` after `Clear()` and marking the field `volatile`. Validated as 0/300 stress runs vs. 8–15/60 before.

What this PR does
@asdacap pointed out that on top of #11102's runner refactor, the lazy-init pattern can go away entirely — the snap-sync run has a single sequential entry/exit point in `SnapSyncRunner.Run`, so init/cleanup can be hooked there:

- `ISnapTrieFactory`: `EnsureInitialize()` — called once before any tree is created; `FinalizeSync()` — called once after the dispatcher has drained.
- `FlatSnapTrieFactory` implements them with `persistence.Clear()` and `persistence.Flush()` respectively. The lock and the `_initialized` flag are deleted.
- `SnapSyncRunner.Run` brackets `dispatcher.Run(token)` with `EnsureInitialize` (before) and `FinalizeSync` (in `finally`).
- `PatriciaSnapTrieFactory`, `TestSnapTrieFactory`, and the other test factories pick up the empty default implementations and don't need changes.

Because `SimpleDispatcher.Run` (introduced by #11102) only returns once every batch handler — and therefore every `ISnapTree.Dispose` — has completed, neither hook needs internal locking. This also obviates the in-flight tree drain that #11443 needed on master (commit `12a66f1e1b`).

Stress reproducer
The same `SnapSync_StressRepro` test from #11443 (cherry-picked here) runs 30 iterations × 2 Flat fixtures = 60 cases, with no `[Retry]` and marked `[Explicit]`. Local Windows results on this branch: [results table: "Verification failed" counts]

Plus the full Flat test suite (292 tests, 4 skipped), `RecreateState*` (20 tests), and `StateSyncFeed*` (834 tests) all pass.

Test plan

- `Nethermind.State.Flat.Test` passes (292 tests).
- `RecreateStateFromAccountRanges` / `RecreateStateFromStorageRanges` pass (20 tests).
- `StateSyncFeed*` passes (834 tests).
- `SnapSync_StressRepro`: 0 failures across 3 stress runs (180 invocations).