fix(snap-sync): three Windows-flake fixes for E2ESyncTests.SnapSync#11443
LukaszRozmej wants to merge 6 commits
Conversation
test(snap-sync): add Windows-flake stress reproducer

Adds a 30-iteration stress reproducer for the Windows-only E2ESyncTests.SnapSync flake observed in CI (run #25139586200, job #73686059011 on PR #11422). Each iteration is its own NUnit case so a single failure does not terminate the rest, no [Retry] so every flake surfaces, and [Explicit] so it does not run in normal CI. Scoped to DbMode.Flat where the flake has been observed.

Local reproduction on Windows (~1m runtime, 60 iterations across the two Flat fixtures): 9-19 failures with both observed CI patterns — "Verification failed: ... N missing in flat" and "TrieNodeException: Failed to load key X from root hash Y" during post-snap-sync block processing.

Run: `dotnet test --project Nethermind.Synchronization.Test -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
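For orientation, a minimal sketch of the reproducer shape this commit describes — one NUnit case per iteration via a `[Range]` parameter, `[Explicit]`, and deliberately no `[Retry]`. The class name, helper, and timeout here are illustrative assumptions, not the actual test code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using NUnit.Framework;

public partial class E2ESyncTests
{
    // Each [Range] value becomes its own NUnit case, so one flaky failure
    // doesn't terminate the remaining iterations; [Explicit] keeps the
    // reproducer out of normal CI runs.
    [Test]
    [Explicit("Stress reproducer — run manually; see PR #11443 for context.")]
    public async Task SnapSync_StressRepro([Range(0, 29)] int iteration)
    {
        using CancellationTokenSource cts = new(TimeSpan.FromMinutes(2)); // assumed timeout
        await RunSnapSyncOnce(cts.Token); // shared helper with the same body as SnapSync()
    }
}
```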
Claude finished @LukaszRozmej's task in 2m 49s

Code Review
Summary

This is a well-motivated diagnostic PR.

Findings

Medium — Code duplication violates DRY and repo test rules
```csharp
private async Task RunSnapSyncClientOnce(CancellationToken cancellationToken)
{
PrivateKey clientKey = TestItem.PrivateKeyD;
await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
{
SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
syncConfig.FastSync = true;
syncConfig.SnapSync = true;
await SetPivot(syncConfig, cancellationToken);
INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
networkConfig.P2PPort = AllocatePort();
networkConfig.FilterPeersByRecentIp = false;
networkConfig.FilterDiscoveryNodesByRecentIp = false;
});
await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationToken);
}
```

Then both tests call this helper.

Low — Verbose multi-line comment block violates coding style

The 7-line comment block above the test violates the one-line-max rule; trim it to a single line:

```csharp
// Stress reproducer — run manually; see PR #11443 for context.
```

Positive notes
```csharp
PrivateKey clientKey = TestItem.PrivateKeyD;
await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
{
    SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
    syncConfig.FastSync = true;
    syncConfig.SnapSync = true;

    await SetPivot(syncConfig, cancellationTokenSource.Token);

    INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
    networkConfig.P2PPort = AllocatePort();
    networkConfig.FilterPeersByRecentIp = false;
    networkConfig.FilterDiscoveryNodesByRecentIp = false;
});

await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationTokenSource.Token);
}
```
The body of this test is a verbatim copy of SnapSync(). Per the repo's test-infrastructure rules, duplicating test method bodies is not allowed — extract the shared logic into a private helper (e.g. RunSnapSyncClientOnce(CancellationToken)) and have both methods call it.
```csharp
// Stress reproducer for the Windows-only SnapSync flake observed in CI:
// - run #25139586200 / job #73686059011 (PR #11422)
// Re-runs the same logic as SnapSync() many times, each iteration as its own NUnit case
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).
```
This 7-line comment block violates the coding-style rule: "Don't write multi-paragraph docstrings or multi-line comment blocks — one short line max." The run IDs, invocation instructions, and rationale belong in the PR description (where they already are). Trim to a single line, e.g.:
```csharp
// Stress reproducer — run manually; see PR #11443 for context.
```
SnapshotableMemColumnsDb.CreateSnapshot iterates columns in a loop and captures each column's snapshot independently. InMemoryColumnWriteBatch.Dispose iterates the per-column batches and disposes each independently. Without a global guard, a CreateSnapshot call concurrent with an in-flight writeBatch.Dispose can capture some columns AFTER the new writes and others BEFORE — producing a cross-column-inconsistent reader view.

This race only matters in the test backend (RocksDB has atomic cross-CF snapshots), but it manifested concretely in E2ESyncTests.SnapSync on Windows: post-snap-sync block processing leases a fresh persistenceReader while the persistence pipeline is committing, gets a snapshot whose Accounts column is updated but whose StateNodes column is not, then walks the trie at the new state root and throws TrieNodeException for a node that "should" be there.

Add a ReaderWriterLockSlim around CreateSnapshot (read) and writeBatch.Dispose (write) so multi-column commits are atomic w.r.t. snapshot creation. Multiple snapshots can still proceed concurrently — the only contention is the rare overlap of a snapshot creation with a multi-column commit.

This eliminates the TrieNodeException failure mode in the stress reproducer (`SnapSync_StressRepro`). The "missing in flat" mode is a separate, deterministic bug (the same ~9 addresses miss reliably) that appears unaffected by this fix and needs follow-up investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
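As a hedged illustration of that locking shape — simplified stand-in types, not the real `SnapshotableMemColumnsDb` / `InMemoryColumnWriteBatch` surfaces:

```csharp
using System.Collections.Generic;
using System.Threading;

// One ReaderWriterLockSlim makes multi-column commits atomic with respect to
// snapshot creation, while any number of snapshots can still be taken
// concurrently on the read side.
public class SnapshotableColumnsDbSketch
{
    private readonly ReaderWriterLockSlim _snapshotLock = new();
    private readonly Dictionary<string, Dictionary<string, byte[]>> _columns = new();

    public Dictionary<string, Dictionary<string, byte[]>> CreateSnapshot()
    {
        _snapshotLock.EnterReadLock();
        try
        {
            // The per-column copies are only consistent as a set because a
            // multi-column commit cannot interleave with this loop.
            Dictionary<string, Dictionary<string, byte[]>> snapshot = new();
            foreach ((string column, Dictionary<string, byte[]> data) in _columns)
                snapshot[column] = new Dictionary<string, byte[]>(data);
            return snapshot;
        }
        finally { _snapshotLock.ExitReadLock(); }
    }

    // Stands in for the write batch's Dispose: all columns land atomically.
    public void CommitMultiColumnBatch(IEnumerable<(string Column, string Key, byte[] Value)> writes)
    {
        _snapshotLock.EnterWriteLock();
        try
        {
            foreach ((string column, string key, byte[] value) in writes)
            {
                if (!_columns.TryGetValue(column, out Dictionary<string, byte[]>? data))
                    _columns[column] = data = new Dictionary<string, byte[]>();
                data[key] = value;
            }
        }
        finally { _snapshotLock.ExitWriteLock(); }
    }
}
```

The read/write split preserves the property the commit calls out: snapshot creation stays concurrent with other snapshot creation and only contends with the rare multi-column commit.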
Mode A diagnostic — phase-1 SnapSync trees not disposed when verifier runs

Added targeted instrumentation to tree creation and disposal. In iter 10 of the traced run: 27 trees created, 24 disposed, 3 stuck (id=499 among them).

Why a delay-based fix doesn't work:
| Approach | Result |
|---|---|
| `Task.Delay(1s)` before FlushCache | 12-15 fail (similar baseline) |
| `Task.Delay(10s)` before FlushCache | 13 fail (similar baseline) |
Even 10s isn't enough — some disposals appear to take much longer or are genuinely stuck on something. A delay-based fix isn't sufficient.
Proposed real fix (next step, not in this PR)
Add an in-flight counter to FlatSnapTrieFactory:

- Increment on `CreateStateTree`/`CreateStorageTree`.
- Decrement when the returned tree's `Dispose()` completes.
- Expose `WaitForDrained(timeout)`.
- Call `WaitForDrained` in `FlatTreeSyncStore.FinalizeSync` before returning.
That would block the sync→post-sync transition until every phase-1 writeBatch is committed, eliminating the race. Production also benefits: a node that restarts mid-snap-sync currently risks losing uncommitted phase-1 writes.
…izeSync

Phase-1 ISnapTrees produced by FlatSnapTrieFactory each hold a per-tree IPersistence.IWriteBatch. The writeBatch's writes only become visible to subsequent IPersistence readers after the tree's Dispose() returns. When FlatTreeSyncStore.FinalizeSync ran without waiting for outstanding tree disposals, callers that observed "sync finished" could read state that was missing accounts whose enclosing tree hadn't fully disposed yet.

Track in-flight ISnapTrees on FlatSnapTrieFactory (incremented on Create*Tree, decremented in tree Dispose via a back-reference) and have FlatTreeSyncStore.FinalizeSync drain that counter (with a 30s timeout) before flushing. This closes the gap where a snap-sync writeBatch was queued but not yet committed when the verifier ran.

Reduces E2ESyncTests.SnapSync_StressRepro Windows flake rate from ~13-18/60 to ~8-15/60. Doesn't fully eliminate Mode A — the remaining failures involve a deterministic set of ~9 paths that appear unaffected by the drain (likely a separate issue in how state-nodes-sync writes flat entries for some specific topology, requiring further work).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
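A sketch of the drain mechanism using the names from the commit message (the real factory threads the decrement through a back-reference each tree holds):

```csharp
using System;
using System.Threading;

public class FlatSnapTrieFactorySketch
{
    private int _inFlightTrees;

    // Called from CreateStateTree / CreateStorageTree before a tree is handed out.
    public void OnTreeCreated() => Interlocked.Increment(ref _inFlightTrees);

    // Called from the tree's Dispose() once its write batch has committed.
    public void OnTreeDisposed() => Interlocked.Decrement(ref _inFlightTrees);

    // Spin-wait until every handed-out tree has disposed; false on timeout.
    public bool WaitForInFlightTreesDrained(TimeSpan timeout)
    {
        long deadline = Environment.TickCount64 + (long)timeout.TotalMilliseconds;
        while (Volatile.Read(ref _inFlightTrees) > 0)
        {
            if (Environment.TickCount64 >= deadline) return false;
            Thread.Sleep(10); // coarse poll; tree disposal latency dominates
        }
        return true;
    }
}
```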
Update — added drain fix for phase-1 in-flight trees
| Configuration | Failures | Failure type |
|---|---|---|
| Master baseline | 9–19 | Mix of Mode A + Mode B |
| + cross-column snapshot fix | 13–18 | All Mode A |
| + drain fix (this) | 8–15 | All Mode A |
Modest but real improvement. The drain doesn't fully eliminate Mode A.
What's still unfixed
The remaining Mode A failures hit the same deterministic 9 paths as before — the drain doesn't change the failure signature, it just makes failure less frequent. In iterations where phase-1 (state ranges) doesn't progress at all (logs show Phase 1: ( 0.00 %) then immediately finished), all work happens via phase-2 state-nodes-sync (FlatTreeSyncStore.SaveNode), and the drain catches no trees. These iterations still fail with the same 9 missing accounts.
So the remaining bug is in phase-2 state-nodes-sync's flat-entry path, not phase-1. The 9 deterministic addresses suggest a specific trie topology (something about how those leaves are reached) that causes WriteAccountFlatEntries to miss them, or some other structural issue in FlatTreeSyncStore.SaveNode.
Investigation handed off to whoever picks this up next: the data points to either (a) leaves that are inline in some node type that WriteAccountFlatEntries doesn't handle, or (b) a specific code path in the Branch deletion-range computation that's incorrect for those 9 addresses' positions in the trie.
The fast-path `if (_initialized) return;` lets a concurrent caller skip EnsureDatabaseCleared and proceed to write accounts while the very first caller is still inside persistence.Clear() — iterating GetAllKeys() and queueing Removes for every existing key. When that batch finally disposes, those Removes overwrite the just-written accounts, leaving the flat DB missing entries that the trie still references. The verifier then reports "Account in trie not found in flat" for whichever accounts landed in the window.

Stress-running E2ESyncTests.SnapSync on Windows reproduced this in roughly 13–18 of 60 iterations; setting _initialized after Clear() drops failures to 0/300.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
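A sketch of the corrected publication order — the same double-checked pattern as the snippet quoted below, but with the flag published only after the clear finishes (`IPersistenceSketch` is a stand-in):

```csharp
using System.Threading;

public interface IPersistenceSketch { void Clear(); }

public class FlatSnapTrieFactorySketch
{
    private readonly Lock _lock = new();
    private volatile bool _initialized; // volatile: safe publication to fast-path readers
    private readonly IPersistenceSketch _persistence;

    public FlatSnapTrieFactorySketch(IPersistenceSketch persistence) => _persistence = persistence;

    private void EnsureDatabaseCleared()
    {
        if (_initialized) return; // fast path is only ever true once Clear() has completed
        using (_lock.EnterScope())
        {
            if (_initialized) return;
            _persistence.Clear(); // finish wiping the flat DB first...
            _initialized = true;  // ...then publish
        }
    }
}
```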
Root cause found and fixed in 73b160f. The remaining "missing in flat" failures came from a race in `FlatSnapTrieFactory.EnsureDatabaseCleared`:

```csharp
if (_initialized) return;
using (_lock.EnterScope())
{
    if (_initialized) return;
    _initialized = true;   // ← published BEFORE Clear() ran
    persistence.Clear();   // ← still iterating GetAllKeys() / queueing Removes
}
```

Sequence that wipes accounts:
1. T1 enters `EnsureDatabaseCleared`, sets `_initialized = true`, then starts `persistence.Clear()`, which iterates `GetAllKeys()` per column and queues a `Remove` for every key into one columns batch.
2. T2 hits the fast path `if (_initialized) return;`, sees `true`, and skips Clear entirely.
3. T2 starts writing accounts via `SetAccount`; those write batches commit normally.
4. T1's still-running `Clear()` sees the keys T2 just committed and queues Removes for them; when its batch disposes, those Removes wipe T2's freshly-written accounts.

How I confirmed it: instrumented every source-side write and sink-side Remove; 507,072 of the 509,598 sink Removes came from `ClearAllColumns`.

Fix: publish `_initialized` only after `Clear()` returns (and mark it volatile).

Verified: 5 stress runs × 60 tests = 300 invocations, 0 failures. Previously 8–15 failures per 60-test run on Windows. This was the last bug. Combined effect of the three commits in this PR:
Marking the PR ready for review.
Claude finished @LukaszRozmej's task in 4m 11s

Code Review
Summary

All three bug fixes are correct. The two findings from the previous review remain unaddressed. One prior Medium (code duplication) must be fixed before merge.

Findings

Medium — Code duplication still present (prior finding, unaddressed)
Low — Verbose comment block (prior finding, unaddressed)

The 7-line comment at lines 431–437 still violates the coding-style one-line-max rule. The run IDs and invocation instructions already live in the PR description.

Low — `WaitForInFlightTreesDrained` bool return silently discarded
```csharp
});

await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationTokenSource.Token);
}
```
Still a verbatim copy of SnapSync() (lines 412–428). Per coding-style.md ("extract repeated blocks (5+ lines)") and test-infrastructure.md ("Do not duplicate test methods"), extract the shared body into a private helper and have both methods call it:
```csharp
private async Task RunSnapSyncOnce(CancellationToken cancellationToken)
{
PrivateKey clientKey = TestItem.PrivateKeyD;
await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
{
SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
syncConfig.FastSync = true;
syncConfig.SnapSync = true;
await SetPivot(syncConfig, cancellationToken);
INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
networkConfig.P2PPort = AllocatePort();
networkConfig.FilterPeersByRecentIp = false;
networkConfig.FilterDiscoveryNodesByRecentIp = false;
});
await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationToken);
}
```

Then both tests become:
```csharp
await RunSnapSyncOnce(cancellationTokenSource.Token);
```

```csharp
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).
```
7-line comment block violates the one-line-max rule in coding-style.md. The run IDs, invocation instructions, and rationale live in the PR description and in the [Explicit] message. Trim to a single line:
```csharp
// Stress reproducer for SnapSync Windows flake — run manually; see PR #11443 for context.
```
```csharp
// aren't yet committed to the underlying persistence. Without this drain a caller that
// observes "sync finished" can read state that's missing accounts from late-disposed
// trees (manifests as E2ESyncTests.SnapSync "Verification failed: N missing in flat").
snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30));
```
The bool return is silently discarded. On a 30-second drain timeout the warning is logged inside WaitForInFlightTreesDrained, but FinalizeSync proceeds to persistence.Flush() without any signal at this layer. Add a site-local log so finalization visibility is complete:
```csharp
if (!snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30)))
    if (_logger.IsWarn) _logger.Warn("FinalizeSync: in-flight ISnapTree drain timed out — flat state may be incomplete");
```
@claude re-review
Claude finished @LukaszRozmej's task in 3m 9s

Code Review (re-review)
Summary

All three production bug fixes are correct and well-implemented. Three prior findings remain unaddressed (one Medium, two Low), and one new Low was identified. The Medium blocks merge.

Findings

Medium — Code duplication (3rd occurrence, unaddressed)
Low — Verbose comment block (3rd occurrence, unaddressed)

The 7-line comment at lines 431–437 still violates the one-line-max rule.

Low — `WaitForInFlightTreesDrained` bool return discarded (2nd occurrence, unaddressed)
```csharp
});

await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationTokenSource.Token);
}
```
Medium (third occurrence — unaddressed): Lines 447–462 are still a verbatim copy of SnapSync() lines 412–428. coding-style.md requires extracting repeated blocks of 5+ lines into shared methods. Extract to a private helper:
```csharp
private async Task RunSnapSyncOnce(CancellationToken cancellationToken)
{
PrivateKey clientKey = TestItem.PrivateKeyD;
await using IContainer client = await CreateNode(clientKey, async (cfg, spec) =>
{
SyncConfig syncConfig = (SyncConfig)cfg.GetConfig<ISyncConfig>();
syncConfig.FastSync = true;
syncConfig.SnapSync = true;
await SetPivot(syncConfig, cancellationToken);
INetworkConfig networkConfig = cfg.GetConfig<INetworkConfig>();
networkConfig.P2PPort = AllocatePort();
networkConfig.FilterPeersByRecentIp = false;
networkConfig.FilterDiscoveryNodesByRecentIp = false;
});
await client.Resolve<SyncTestContext>().SyncFromServer(_server, cancellationToken);
}
```

Both SnapSync() and SnapSync_StressRepro() then become a one-liner call to this helper.
```csharp
// Stress reproducer for the Windows-only SnapSync flake observed in CI:
// - run #25139586200 / job #73686059011 (PR #11422)
// Re-runs the same logic as SnapSync() many times, each iteration as its own NUnit case
// so a single flake doesn't terminate the rest. Intentionally NO [Retry] — we want every
// failure to surface. Marked [Explicit] so it doesn't run in normal CI; invoke with:
// dotnet test --filter "FullyQualifiedName~SnapSync_StressRepro"
// Runs only on Flat dbMode (where the flake has been observed).
```
Low (third occurrence — unaddressed): This 7-line block still violates coding-style.md ("Code comments explain why, not what"). The [Explicit] message (line 438) already covers the essential intent for IDE readers. Run IDs and invocation instructions belong in the PR description (where they already are). Trim to one line:
```csharp
// Stress reproducer for the SnapSync Windows flake — run manually; see PR #11443 for context.
```
```csharp
// aren't yet committed to the underlying persistence. Without this drain a caller that
// observes "sync finished" can read state that's missing accounts from late-disposed
// trees (manifests as E2ESyncTests.SnapSync "Verification failed: N missing in flat").
snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30));
```
Low (second occurrence — unaddressed): The bool return from WaitForInFlightTreesDrained is silently discarded. The timeout warning is logged inside the callee, but FinalizeSync proceeds to persistence.Flush() with no signal at this layer. Add a site-local log so finalization-level visibility is complete:
```csharp
if (!snapTrieFactory.WaitForInFlightTreesDrained(TimeSpan.FromSeconds(30)))
    if (_logger.IsWarn) _logger.Warn("FinalizeSync: in-flight ISnapTree drain timed out — flat state may be incomplete");
```
```diff
  IPersistence.IPersistenceReader reader = persistence.CreateReader(ReaderFlags.Sync);
  IPersistence.IWriteBatch writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
- return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager);
+ OnTreeCreated();
+ return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);
```
Low (new): OnTreeCreated() is called at line 66 before new FlatSnapStateTree(...) finishes constructing. If the constructor throws (e.g. new StateTree(...) fails), OnTreeDisposed() is never called — the in-flight counter is permanently incremented, causing WaitForInFlightTreesDrained to spin until its 30-second timeout on every subsequent FinalizeSync call.
The fix: call OnTreeCreated() inside the FlatSnapStateTree constructor (where _ownerFactory is already available), and have the finally block in Dispose pair with it. This way the increment is always matched by the decrement, even if the caller-side code throws after creation.
Same issue exists in CreateStorageTree / FlatSnapStorageTree.
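A sketch of the constructor-side pairing this suggests, reduced to the counter mechanics (stub types, not the real classes):

```csharp
using System;

public class SnapTreeFactoryStub
{
    public void OnTreeCreated() { /* Interlocked.Increment(ref _inFlightTrees) */ }
    public void OnTreeDisposed() { /* Interlocked.Decrement(ref _inFlightTrees) */ }
}

public sealed class FlatSnapStateTreeSketch : IDisposable
{
    private readonly SnapTreeFactoryStub _ownerFactory;
    private bool _disposed;

    public FlatSnapStateTreeSketch(SnapTreeFactoryStub ownerFactory)
    {
        _ownerFactory = ownerFactory;
        // Increment as the constructor's last statement: if anything earlier
        // throws, no increment has happened, so nothing can leak.
        _ownerFactory.OnTreeCreated();
    }

    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        try
        {
            // ... commit the per-tree write batch ...
        }
        finally
        {
            _ownerFactory.OnTreeDisposed(); // always pairs the constructor's increment
        }
    }
}
```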
…rden tree-counter on throw

- Extract `RunSnapSyncOnce(CancellationToken)` and call from both `SnapSync()` and `SnapSync_StressRepro(int)` (was a verbatim copy of the test body).
- Trim the stress-repro comment block to a single line (coding-style: one-line max).
- Update `FlatTreeSyncStoreTests` constructor call for the new `FlatSnapTrieFactory` parameter on `FlatTreeSyncStore` (build was failing).
- Drop unused `using System;` / `using System.Threading;` (ImplicitUsings is on for this project; lint flagged them).
- Log a site-local warning in `FinalizeSync` when `WaitForInFlightTreesDrained` returns false, instead of silently discarding the bool — finalization-level visibility was missing.
- In `FlatSnapTrieFactory.CreateStateTree` / `CreateStorageTree`, pair every `OnTreeCreated()` with cleanup if the subsequent constructor throws; otherwise the in-flight counter leaks and `WaitForInFlightTreesDrained` would spin to its 30-second timeout on every `FinalizeSync`. Also dispose the per-tree reader and write batch on the throw path so they don't leak either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@claude re-review
Claude finished @LukaszRozmej's task in 2m 30s

Code Review (re-review)
Summary

All four prior findings are resolved.

Prior findings — all resolved ✅
Positive notes
@claude will this cause performance regression?
Claude finished @LukaszRozmej's task in 1m 0s

Performance regression analysis
Short answer: No meaningful regression on any production path. Each fix touches a different scope; here's the breakdown:
| Fix | Affected scope | Overhead per tree | Total overhead |
|---|---|---|---|
| Cross-column lock | Test backend only | 0 in production | None |
| `Interlocked` counter | Snap sync tree lifetime | ~10 ns | ~10 ms over full sync |
| `volatile` read | Snap sync tree creation | ~0.7 ns delta | Sub-millisecond over full sync |
None of the changes touch the normal block-processing path (RocksDB, state reads/writes, EVM execution). The overhead is confined to snap sync tree creation/disposal, which is a one-time background operation, not a latency-sensitive path.
@claude does this also fix some potential edge-case production bugs?
Claude finished @LukaszRozmej's task in 2m 22s

Production bug analysis
Yes — two of the three fixes address real production bugs, not just test artifacts. Here's the breakdown:
| Fix | Affects production? | Impact if triggered |
|---|---|---|
| Fix 1 — cross-column lock | No — test backend only | N/A |
| Fix 2 — drain in-flight trees | Yes — FlatDB snap sync finalization | Flat state incomplete after sync; block replay errors |
| Fix 3 — `_initialized` DCL | Yes — FlatDB snap sync first-batch race | Accounts silently erased from flat; transactions fail post-sync |
Both Fix 2 and Fix 3 are latent production correctness bugs for any FlatDB node that performs snap sync. Fix 3 is harder to trigger on Linux (scheduler dependent) but carries a more severe consequence (silent data loss). Fix 2 is a simpler ordering gap that the drain closes cleanly. Both are worth having regardless of the test flake.
EXPB Benchmark Comparison — superblocks and realblocks scenarios (Client Processing (SSE), K6 TTFB).
```csharp
/// Spin-wait until all trees handed out by this factory have been disposed. Returns true
/// if drained within <paramref name="timeout"/>, false on timeout.
/// </summary>
public bool WaitForInFlightTreesDrained(System.TimeSpan timeout)
```
```csharp
public bool WaitForInFlightTreesDrained(TimeSpan timeout)
```
```csharp
IPersistence.IWriteBatch writeBatch;
try
{
    writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
}
catch
{
    reader.Dispose();
    throw;
}
OnTreeCreated();
try
{
    return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);
}
catch
{
    // Constructor failed: pair the OnTreeCreated above with OnTreeDisposed and release
    // the per-tree resources ourselves, since no FlatSnapStateTree.Dispose will run.
    OnTreeDisposed();
    writeBatch.Dispose();
    reader.Dispose();
    throw;
}
```
Alternative with single try/catch?
```csharp
IPersistence.IPersistenceReader? reader = null;
IPersistence.IWriteBatch? writeBatch = null;
bool treeTracked = false;
try
{
    reader = persistence.CreateReader(ReaderFlags.Sync);
    writeBatch = persistence.CreateWriteBatch(StateId.Sync, StateId.Sync, WriteFlags.DisableWAL);
    OnTreeCreated();
    treeTracked = true;
    return new FlatSnapStateTree(reader, writeBatch, syncConfig.EnableSnapDoubleWriteCheck, logManager, this);
}
catch
{
    if (treeTracked) OnTreeDisposed();
    writeBatch?.Dispose();
    reader?.Dispose();
    throw;
}
```
@claude what execution paths / operations could end up waiting for up to 30 seconds in `WaitForInFlightTreesDrained`?
Claude finished @alexb5dh's task in 1m 45s

Performance impact of the 30-second drain in `WaitForInFlightTreesDrained`
- Trim verbose comment on `_atomicityLock` in `SnapshotableMemColumnsDb` to a single line.
- Use plain `TimeSpan` (not `System.TimeSpan`) in `WaitForInFlightTreesDrained`; re-add `using System;`.
- Drop the duplicate warning at the `FinalizeSync` call site — `WaitForInFlightTreesDrained` already logs the timeout from inside; remove the now-unused `_logger` field.
- Refactor the `OnTreeCreated` / cleanup-on-throw blocks in `CreateStateTree` / `CreateStorageTree` to a single try/catch with `treeTracked` flag and nullable refs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexb5dh feedback addressed in 91db203:
@asdacap re your CHANGES_REQUESTED — opened #11472 stacked on
…Runner.Run (#11472)

* test(snap-sync): add Windows-flake stress reproducer

Adds a 30-iteration stress reproducer for the Windows-only E2ESyncTests.SnapSync flake observed in CI (run #25139586200, job #73686059011 on PR #11422). Each iteration is its own NUnit case so a single failure does not terminate the rest, no [Retry] so every flake surfaces, and [Explicit] so it does not run in normal CI. Scoped to DbMode.Flat where the flake has been observed.

Local reproduction on Windows (~1m runtime, 60 iterations across the two Flat fixtures): 9-19 failures with both observed CI patterns — "Verification failed: ... N missing in flat" and "TrieNodeException: Failed to load key X from root hash Y" during post-snap-sync block processing.

Run: `dotnet test --project Nethermind.Synchronization.Test -c release -- --filter "FullyQualifiedName~SnapSync_StressRepro"`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run

Replaces the lazy-init lock + `_initialized` flag in `FlatSnapTrieFactory` (which raced against concurrent `CreateStateTree`/`CreateStorageTree` calls and caused `Clear()` to wipe freshly-written accounts) with explicit sequential lifecycle hooks driven by `SnapSyncRunner.Run`:

- `ISnapTrieFactory.EnsureInitialize()` (default no-op) — called once before any tree is created. `FlatSnapTrieFactory` clears the flat DB here.
- `ISnapTrieFactory.FinalizeSync()` (default no-op) — called once after the dispatcher has finished. `FlatSnapTrieFactory` flushes the persistence here.
- `SnapSyncRunner.Run` brackets `dispatcher.Run` with both calls.

Because both halves run from a single sequential entry point — and the new `SimpleDispatcher.Run` only returns once every batch handler (and therefore every `ISnapTree.Dispose`) has completed — the implementation needs no internal locking and the `_initialized` flag goes away.

This is the in-tree equivalent of the master-branch fix in #11443; the runner-driven shape suggested by @asdacap is cleaner because it doesn't depend on the factory guessing when "the first call" happens.

`PatriciaSnapTrieFactory` and the test factories pick up the empty default implementations of both new methods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): extract RunSnapSyncOnce shared helper

`SnapSync` and `SnapSync_StressRepro` had verbatim-copied bodies; extract a private `RunSnapSyncOnce(CancellationToken)` helper and have both call it. Trim the multi-line stress-repro comment to a single line per coding-style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): SnapSyncRunner lifecycle ordering test + cancel/throw cases

Adds an internal-visible Func<CancellationToken, Task> ctor on SnapSyncRunner so the lifecycle can be exercised without standing up a real SimpleDispatcher, and three NSubstitute-based tests:

- happy path: EnsureInitialize → dispatcher → FinalizeSync
- dispatcher throws: FinalizeSync still runs
- dispatcher cancelled: FinalizeSync still runs

Trim the doc comments on the new ISnapTrieFactory hooks to a single line per the repo's coding-style rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(snap-sync): make SnapSyncRunner Func ctor public

The Func<CancellationToken, Task> overload was internal so the test could reach it via InternalsVisibleTo. Make it public — it's a legitimate way to construct the runner, the surface is small, and DI still resolves the SimpleDispatcher overload because Func<CancellationToken, Task> isn't a registered service.

Promotes the public ctor pattern to use a primary constructor (matches the rest of the codebase) and drops the assembly-level InternalsVisibleTo that was added solely for this class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): collapse SnapSyncRunner ordering tests into a single TestCase

Three near-identical tests (happy / throws / cancels) folded into one parameterized method with a DispatcherOutcome enum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(snap-sync): switch expression for SnapSyncRunner dispatcher outcome

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
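A minimal sketch of the runner-driven lifecycle these commits describe — `EnsureInitialize` before the dispatcher, `FinalizeSync` in a finally so it also runs on throw or cancellation (signatures assumed, not the actual Nethermind code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public interface ISnapTrieFactory
{
    void EnsureInitialize() { } // default no-op; FlatSnapTrieFactory clears the flat DB here
    void FinalizeSync() { }     // default no-op; FlatSnapTrieFactory flushes persistence here
}

public class SnapSyncRunnerSketch(ISnapTrieFactory trieFactory, Func<CancellationToken, Task> runDispatcher)
{
    public async Task Run(CancellationToken token)
    {
        trieFactory.EnsureInitialize(); // once, before any tree is created
        try
        {
            await runDispatcher(token); // returns only after every batch handler completed
        }
        finally
        {
            trieFactory.FinalizeSync(); // runs even when the dispatcher throws or is cancelled
        }
    }
}
```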
* refactor: replace snap+state sync dispatcher with SimpleDispatcher

- Introduce SimpleDispatcher that runs feeds sequentially (snap → state) replacing the old parallel SyncDispatcher<T> for snap+state sync.
- Merge SyncMode.SnapSync into SyncMode.StateNodes; sequencing is now handled by StateSyncRunner rather than the mode selector.
- Move isCloseToHead gate and DB tuning from mode selector / SyncDbTuner into StateSyncRunner. DB tune reset wrapped in try/finally so it runs on cancellation.
- StateSyncRunner early-returns when FindBestFullState != 0 (state already present); Run wraps RunRound (reset pivot, run feed, verify trie) plus the production-only waits.
- StateSyncFeed / SnapSyncFeed now implement ISimpleSyncFeed; return null signals round finished, dispatcher drains in-flight requests with CancellationToken.None so peer allocations are always freed.
- StateSyncFeed/Downloader/AllocationStrategyFactory registered via DI and resolved by StateSyncRunner instead of being new'd per round.
- TreeSync.ValidatePrepareRequest collapsed to bool IsSyncRoundFinished; the SyncMode parameter and _syncMode field are gone.
- Remove IsSnapGetRangesFinished from ISyncProgressResolver.
- Test harness SafeContext.RunFeed now calls IStateSyncRunner.RunRound; feed-level tests resolve StateSyncFeed / StateSyncDownloader via DI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: yield state sync between rounds to MultiSyncModeSelector

Restore the old "pause when mode leaves StateNodes" behaviour at the round boundary: StateSyncRunner.RunRound now awaits syncModeSelector.WaitUntilMode(StateNodes) at the top of each iteration, so beacon control / UpdatingPivot / fast-sync re-entry can claim peers between rounds. Pivot changes mid-round are still handled by the feed returning null via IsSyncRoundFinished.

Test infra substitutes ISyncModeSelector with one that reports StateNodes immediately — the tests drive RunRound directly and shouldn't depend on the real MultiSyncModeSelector settling.

Mid-round yielding (within a single RunFeed call) is not addressed here; the feed's PrepareRequest doesn't consult the mode selector to avoid complicating the test harness. Cost is bounded: pivot moves still end the round naturally, and all currently-observed mode transitions that matter (beacon reorg, pivot updates) also move the pivot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: make SimpleDispatcher<T> generic, bind feed/downloader at construction

Push T to the class level and move feed, downloader, allocation strategy factory, and AllocationContexts to the constructor — each dispatcher is now bound to the single feed it drives, and its entry point is Run(CancellationToken) rather than a generic RunFeed<T>(...).

Register SimpleDispatcher<StateSyncBatch?> and SimpleDispatcher<SnapSyncBatch?> as singletons via factory delegates in the SynchronizerModule (feed + downloader + strategy factory are also DI-resolved). State/SnapSyncRunner now inject the dispatcher instead of constructing it per call.

Also fix the stale class summary (no more WaitForMode) and the stale cref on ISimpleSyncFeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: register and resolve sync feed deps via interfaces

Register StateSyncFeed / SnapSyncFeed as ISimpleSyncFeed<T>, downloaders as ISyncDownloader<T>, strategy factories as IPeerAllocationStrategyFactory<T>, and resolve the same interfaces from the SimpleDispatcher factory. Tests that directly touched the feed likewise resolve ISimpleSyncFeed<StateSyncBatch> and ISyncDownloader<StateSyncBatch> instead of the concrete types.

Normalize the generic parameter to the non-nullable batch type (StateSyncBatch / SnapSyncBatch) and add `where T : class` to ISimpleSyncFeed<T> / SimpleDispatcher<T> so the interface and implementation generics line up (previously ISimpleSyncFeed<StateSyncBatch?> mismatched ISyncDownloader<StateSyncBatch>, forcing concrete resolution).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: remove unused StateSync using in StateSyncFeedTests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address claude review findings on snap/state sync dispatcher

- SnapSyncFeed: catch OperationCanceledException before general Exception so clean shutdown doesn't log ERROR
- SimpleDispatcher: remove token from Task.Run to prevent semaphore leak if cancellation fires between WaitAsync and task scheduling
- StateSyncRunner: replace LINQ with foreach in WaitForCloseToHead
- SafeContext: add Feed property so tests use ctx.Feed instead of container.Resolve<ISimpleSyncFeed<StateSyncBatch>>

* fix: swallow OCE on shutdown; retry transient SnapSyncFeed errors

- StateSyncRunner.Run wraps the body in try/catch (OperationCanceledException) when token.IsCancellationRequested, so clean shutdown no longer propagates OCE to Synchronizer.StartSnapAndStateSyncComponents' ContinueWith handler and logs "State sync failed" at ERROR.
- SnapSyncFeed.PrepareRequest moves try/catch inside the while loop with a dedicated catch (OperationCanceledException) returning null, matching StateSyncFeed's shape. Transient exceptions from _snapProvider.IsFinished() now retry on the next iteration instead of permanently aborting snap sync by signalling done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: rename RunRound -> RunStateSyncRounds; extract StateSyncPrecursorWait

- Rename `IStateSyncRunner.RunRound` to `RunStateSyncRounds` (multiple rounds, plural).
- Combine the WaitUntilMode(StateNodes) and WaitForCloseToHead calls into a single private method `StateSyncPrecursorWait` and use it both at the top of `Run` and at the top of every `RunStateSyncRounds` iteration. Yields between rounds when MSMS leaves StateNodes or the node drifts away from head.
- Test harness: also substitute ISyncProgressResolver and IBeaconSyncStrategy so the close-to-head check returns immediately (default test peers have HeadNumber=0 which would never satisfy the real predicate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop unused ITreeSync interface and SyncCompleted event

The SyncCompleted event was fired by TreeSync but no longer had any subscribers after VerifyStateOnStateSyncFinished was deleted and the verify-trie call moved into StateSyncRunner.RunStateSyncRounds. ITreeSync existed only to expose this event, so remove the whole surface.

- Delete ITreeSync.cs and SyncCompletedEventArgs.
- Drop the event field and its invocation in TreeSync.VerifyPostSyncCleanUp.
- DI: AddSingleton<ITreeSync, TreeSync>() -> AddSingleton<TreeSync>().
- Drop the now-unused Substitute.For<ITreeSync>() in SynchronizerModuleTests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(snap-sync): drive FlatSnapTrieFactory init/finalize from SnapSyncRunner.Run (#11472) — squash of the #11472 commit series reproduced above.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Lukasz Rozmej <lukasz.rozmej@gmail.com>
Summary
Three commits in this PR fix the Windows-only `E2ESyncTests.SnapSync` flake observed in CI (run #25139586200, job #73686059011 on PR #11422):

1. Cross-column snapshot atomicity in `SnapshotableMemColumnsDb` (c14030da4d) — eliminates the `TrieNodeException` mode.
2. Drain in-flight ISnapTrees in `FlatTreeSyncStore.FinalizeSync` (12a66f1e1b) — closes a separate ordering gap where post-sync readers could see state from not-yet-disposed phase-1 trees.
3. Don't publish `_initialized` until `Clear()` completes (73b160f057) — eliminates the deterministic "missing in flat" mode.

Plus the stress reproducer (edc0d3dbc9) used to drive the diagnosis.

Result: 5 stress runs × 60 tests = 300 invocations, 0 failures. Previously 8–15 failures per 60-test run on Windows.
Failure modes the flake produces
Mode B: `TrieNodeException` (cross-column tearing)

Block processing leases a fresh `persistenceReader` while the persistence pipeline is committing, gets a snapshot where the Accounts column is updated but the StateNodes column is not, walks the trie at the new state root, and throws `TrieNodeException` for a node that should be there.
SetAccount/RemoveAccount/DeleteAccountRangeand sink-sideInMemoryWriteBatch.Setfiltered toColumn_Account) showed source vs. sink Set counts matched perfectly while the sink had ~507K unaccounted Removes. Tracing back uncoveredClearAllColumnsracing against fresh sync writes — the third fix below.Fix 1 —
c14030da4dcross-column snapshot atomicitySnapshotableMemColumnsDb.CreateSnapshot()iterates columns and captures each column's snapshot independently.InMemoryColumnWriteBatch.Dispose()iterates per-column batches and disposes each independently. Without a global guard, aCreateSnapshotconcurrent with an in-flightwriteBatch.Disposecan capture some columns after the new writes and others before — a cross-column-inconsistent reader view.This race only affects the test backend — RocksDB has atomic cross-CF snapshots — but it manifests concretely in
E2ESyncTests.SnapSync. The fix wrapsCreateSnapshotand the wrappedAtomicColumnsWriteBatch.Disposein a singleLockso per-Dispose multi-column commits are atomic w.r.t. snapshot creation.Fix 2 —
12a66f1e1bdrain in-flight ISnapTrees in FinalizeSyncPhase-1 / phase-2 trees produced by
FlatSnapTrieFactoryhold per-treeIWriteBatchinstances; until each tree'sDisposereturns, its writes aren't yet committed to the underlying persistence. Without a drain, a caller that observes "sync finished" can read state that's missing accounts from late-disposed trees.FlatSnapTrieFactorynow tracks an_inFlightTreescounter (incremented inCreateStateTree/CreateStorageTree, decremented inFlatSnapStateTree.Dispose/FlatSnapStorageTree.Dispose).FlatTreeSyncStore.FinalizeSynccallsWaitForInFlightTreesDrained(spin-wait, 30s timeout) before bumping the persisted state ID. This is independent of fix 3 — it closes a different ordering window that can otherwise re-emerge as the same symptom.Fix 3 —
73b160f057don't publish_initializeduntilClear()completesFlatSnapTrieFactory.EnsureDatabaseClearedhad this ordering:Sequence that wipes accounts:
EnsureDatabaseCleared, sets_initialized = true, then startspersistence.Clear().Clear()opens a singleIColumnsWriteBatch, then for each column iteratesdb.GetColumnDb(column).GetAllKeys()and queuescolumnBatch.Remove(key)for every key it sees — and only disposes the batch (applying the queued Removes) when all columns are done.if (_initialized) return;, seestrue, and skips Clear entirely.persistence.CreateWriteBatch(...)and starts writing accounts viaSetAccount. These commit normally (their ownIPersistence.IWriteBatch.Dispose).Clear()is still inside itsIColumnsWriteBatch. As it iteratesGetAllKeys()for the Account column, it sees the keys T2 just committed and queuescolumnBatch.Remove(key)for them.Removelands on the live DB — including T2's freshly-written accounts. The flat DB ends up missing entries that the trie still references; the verifier reports "Account in trie not found in flat".The fix: publish
_initialized = trueafterpersistence.Clear()returns, and mark the fieldvolatile. Concurrent fast-path readers either observefalse(and wait on the lock) or observetrue(and Clear is genuinely done, so it's safe to proceed).How fix 3 was diagnosed
Surface symptoms (key seen by trie iterator but not by flat iterator) were consistent across runs but didn't point at any obvious caller. Source-side instrumentation in
BaseFlatPersistence.WriteBatch(SetAccount/RemoveAccount/DeleteAccountRangepreview) plus sink-side instrumentation inInMemoryWriteBatch.Set(filtered to_store.Name == "Column_Account") showed:507,072 of the 509,598 sink Removes came from
ClearAllColumns— i.e. ~8,451 accounts removed per test, where the DB was supposed to be empty when Clear ran. ~8,451 ≈ the test fixture's account count, confirming Clear was racing against sync writes. The race is timing-sensitive (any instrumentation in the hot path masks it) which is why Linux/macOS pass and only Windows reproduces.Stress reproducer
SnapSync_StressReproruns 30 iterations × 2 Flat fixtures = 60 cases, each as its own NUnit case (so a single failure doesn't terminate the rest), no[Retry],[Explicit]. ~1m runtime._initializedrace (fix 3)Across 5 consecutive runs of the stress reproducer with all three fixes applied: 0/300 failures.
Test plan
Nethermind.Dbbuilds (0 warnings, 0 errors)Nethermind.State.Flatbuilds (0 warnings, 0 errors)Nethermind.Synchronization.Testbuilds (0 warnings, 0 errors)