db: opt-in ReadAhead of Table with limited "ahead window"#21880
db: opt-in ReadAhead of Table with limited "ahead window"#21880AskAlexSharov wants to merge 57 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds an opt-in mechanism to warm specific MDBX tables into the OS page cache to improve operation when chaindata is much larger than RAM, and wires this warmup into several read-heavy workflows (collate/prune, backup/copy, integration stages).
Changes:
- Extend
kv.RoDB/kv.TemporalRoDBwithWarmupTable/WarmupAPIs and provide implementations (MDBX + no-ops for remote/memory DBs). - Integrate per-table warmup into state domain/inverted-index collate/prune and into backup table clear/copy paths.
- Update call sites/signatures in stagedsync resets and integration commands to pass the DB handle where needed; adjust
stage_execflow aroundBuildFiles.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| polygon/db/database.go | Exposes WarmupTable on Polygon DB wrapper by delegating to underlying kv.RoDB. |
| execution/stagedsync/stage_custom_trace.go | Updates backup.ClearTables call to pass the DB handle. |
| execution/stagedsync/rawdbreset/reset_stages.go | Adds DB arg to reset helpers and passes it to backup.ClearTables for warmup-while-clear. |
| db/state/inverted_index.go | Adds kv.RoDB field and fires background WarmupTable before reading keys table in collate. |
| db/state/domain.go | Fires background WarmupTable before reading domain values table in collate/prune. |
| db/state/aggregator.go | Wires Aggregator DB handle into registered Domains/InvertedIndexes (enables warmup). |
| db/rawdb/blockio/block_writer.go | Passes db into backup.ClearTables when truncating bodies-related tables. |
| db/kv/temporal/kv_temporal.go | Adds TemporalRoDB.Warmup(ctx, force) implementation via underlying MDBX tx warmup. |
| db/kv/remotedb/kv_remote.go | Implements no-op Warmup/WarmupTable for remote DB provider. |
| db/kv/membatchwithdb/memory_store.go | Implements no-op WarmupTable for in-memory store DB. |
| db/kv/membatchwithdb/memory_mutation.go | Implements no-op Warmup/WarmupTable for temporal in-memory DB wrapper. |
| db/kv/mdbx/kv_mdbx.go | Implements MDBX WarmupTable via parallel prefix/seek scans. |
| db/kv/kv_interface.go | Extends interfaces with WarmupTable and Warmup API contracts. |
| db/kv/helpers.go | Forwards WarmupTable through gatedRoDB. |
| db/kv/backup/backup.go | Adds warmupWhile helper; warms tables during Kv2kv and ClearTables; updates signatures. |
| common/dbg/experiments.go | Adds env-controlled WARMUP_TABLE_WORKERS knob (default 0 disables warmup). |
| cmd/integration/commands/stages.go | Updates reset call signature; modifies stageExec commit/build flow. |
| cmd/integration/commands/reset_state.go | Updates backup.ClearTables call to pass DB handle. |
| cmd/integration/commands/refetence_db.go | Updates backup.Kv2kv call signature (drops threads arg). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
a9c47e9 to
d3aa6e4
Compare
|
A few substantive concerns on the warmup logic before this comes out of WIP: 1. No parallelism on counter-keyed tables. 2. The 1000-goroutine second loop is redundant. 3. Long-lived read txns. Each prefix scan wraps the full iteration in one 4. The Minor: the new |
d3aa6e4 to
6bf1e15
Compare
b83f030 to
6aea736
Compare
yperbasis
left a comment
There was a problem hiding this comment.
Much nicer design than the previous round. Two blockers plus assorted smaller items.
Blockers:
-
clearTablereuses un-cloned mdbx memory acrossDeleteRangecalls (db/kv/backup/backup.go).chunkBounds' doc says "returns the cloned boundaries", but it returnsDistributeCursorsoutput raw — zero-copy slices aliasing the mmap. The delete loop then feedsbounds[i]/bounds[i+1]intoDeleteRangeafter earlier chunks' deletes have already mutated the tree in the same write tx. Per the MDBX contract those slices are only valid until the next update op, and with the default--db.writemap=truepages retired by an earlier chunk can be recycled and rewritten in place — later bounds become garbage, so chunks get silently skipped (from >= toreturns 0) or the wrong range is deleted, i.e.stage_exec --resetcan leave residual rows in state tables.ReadAhead.plan()clones the same output for exactly this reason (bounds[i] = bytes.Clone(k));chunkBoundsneeds the same. -
kv.BlockAccessListdoesn't belong instateHistoryBuckets(execution/stagedsync/rawdbreset/reset_stages.go). BALs are written only at block-insert time from received payloads (execmodule/inserters.go); exec computes them for validation but never persists. OnceResetExecwipes the table, re-exec silently falls back to non-BAL scheduling — under NO_PRUNE that invalidates BAL vs non-BAL benchmark comparisons, and the data can't be rebuilt without re-downloading blocks. Block-layer resets (ResetBlocks→TruncateBlocks) already clear it at the right layer. Please drop this line, or state the rationale.
Also:
warm()'s_, _ = v[0], v[len(v)-1]is dead-load-eliminated (I checked the assembly: the guarded touch compiles to a bareRET), so only the C-side cursor traversal faults pages and multi-page overflow values (EthTx, Code) are warmed only up to their first page. Fold the bytes into a sink the compiler can't elide, e.g. XOR into an atomic.- Effective prefetch concurrency is capped at
ahead+1≈ 33 regardless ofWARMUP_TABLE_WORKERS(ahead= 1GB/32MB, independent of workers, and parked workers hold errgroup slots insidewaitTurn). At the suggested 256 workers, ~223 goroutines only poll their 1ms tickers (~220k wakeups/s for hours). Suggestmin(workers, ahead+1), or signal fromSetPoson chunk advance instead of polling. AlwaysGenerateChangesetsFlagis registered onsnapshots retire, butdoRetireCommandnever reads it — a silent no-op flag.- No tests cover
MdbxTx.DeleteRange, temporalRwTx.DeleteRange,kv.ReadAhead, or the chunkedClearTablespath (onlyDistributeCursorsis tested). A multi-chunkClearTablestest on a WriteMap DB would have caught the first blocker. - The temporal
RwTx.DeleteRangefallback buffers every key of the range in RAM and returnslen(keys)even when a mid-loop delete fails; it's also unreachable today (every temporal backend is mdbx). Stream it likedb/state.DeleteRangeFromTbl, or drop it. - The new TODO in
backupTablestill has no linked issue/owner (per repo policy).
Nit-level: newReadAhead's full/label params are dead (nothing passes full=true); the 32MB chunk constant is defined twice and bounds are computed twice per clear when warmup is on; the two no_prune_test.go hunks are pure churn; NewReadAhead dropped the old nil-db guard (latent — a nil db now panics in a detached goroutine).
FYI: the eest-spec-enginextests-stable-sequential failure is 1/63920 (CancunToPrague excess_blob_gas transition, payload INVALID) — looks flaky, but worth a re-run given the mdbx-go bump touches every code path.
yperbasis
left a comment
There was a problem hiding this comment.
Follow-up on head cd869c7: new findings plus what's still open from my July 2 round, by severity.
Blocking:
- Still no tests for
MdbxTx.DeleteRangeor the chunkedClearTablespath (onlyDistributeCursorsis tested). These now run ungated on every reset path, and the bug class they can hide is exactly blocker 1 from the last round (silently skipped chunks → residual state rows). A multi-chunkClearTablestest on a WriteMap DB would pin it — also prudent given the mdbx-go bump touches every code path.
Should fix before merge (cheap, and they matter more if warmup gets auto-enabled):
-
waitTurn's 1ms poll (db/kv/readahead.go): at 256 workers that's ~250k timer wakeups/s (~0.13–0.5 of a core) for the whole copy/clear, and effective concurrency is silently capped atahead+1≈ 33 (≈5 for ≥1TB tables past the 4096-chunk cap) — soWARMUP_TABLE_WORKERS=256doesn't deliver 256.min(workers, ahead+1)is lossless; a SetPos-driven wakeup would be nicer. -
AlwaysGenerateChangesetsFlagonseg retireis still inert —doRetireCommandnever reads it and the flag has no EnvVars. Drop it or wire it.
Fix-or-drop while fresh (latent — unreachable today):
-
temporal
RwTx.DeleteRangeemulation still buffers every key of the range in RAM and returnslen(keys)on a mid-loop delete error (the interface says keys removed). Every temporal backend today is mdbx, so it's dead code that mishandles the one case it exists for — either drop it (lettingclearTable'sClearTablefallback fire) or stream it likedb/state.DeleteRangeFromTbl. -
NewReadAheadlost the nil-db guard the oldkv.ReadAheadhad; nil db +WARMUP_TABLE_WORKERS>0panics in a detached goroutine. -
(new)
DistributeCursorsdiscards mdbx-go'sallSetreturn and treats anyGetCurrenterror as "unset surplus cursor", so a genuineMDBX_BAD_TXN/EBADSIGN/MDBX_PANICsilently degrades to coarser bounds (low impact: range coverage is preserved and the error resurfaces at the next same-tx op). Related: the unconditionalkeys = keys[:len(keys)-1]is only correct whenallSetis true — when surplus cursors were left unset it drops a real interior boundary and merges the final two chunks. CheckingallSetfixes both.
Nits (all still open): the backupTable TODO has no tracking issue/owner; newReadAhead's full/label params are dead; the 32MB chunk constant is defined twice and bounds are computed twice per clear when warmup is on; the two no_prune_test.go hunks are pure churn; ReadAheadDeprecated should use the machine-readable // Deprecated: form.
One downgrade from my July 2 list: the warm()/value-touch item is less material than I stated. mdbx's page-header validation already faults the entire first overflow page, so at our 16KB pages only values >16.4KB (code near the 24KB cap, blob-carrying EthTx) leave tail pages cold — and MADV_NORMAL sequential readahead likely covers those. Nit-level now: fold a compiler-proof sink in whenever the file is next touched.
…eleteRange clones keys warm/ReadAheadDeprecated discarded the key/value from the iterator, so the Go compiler dead-code-eliminated the touch loads (verified: a guarded v[i] whose result is unused compiles to a bare RET) — only the cgo cursor traversal faulted pages, leaving multi-page overflow values (EthTx, Code) warmed no further than their first page. Fold one byte per OS page into an atomic sink the compiler can't elide, so every value page is actually faulted. temporal.RwTx.DeleteRange emulation: clone keys before deleting (Range yields zero-copy keys invalidated by the next Next()) and return the count actually deleted instead of the planned count.
warm() is best-effort, but a genuine cursor/range/IO error should be diagnosable. Log a warning on error when the context isn't canceled; stay silent on the expected cancellation path.
Completely changed PR.
Now:
mdbx_cursor_distribute()1gbahead)ClearTable()method now based onmdbxgo_cursor_delete_range()+ReadAheader()It greatly speedup "clear table" and "prune" operations on
chaindata >> RAMcaseSpeedup tools:
integration stage_exec --resetintegration mdbx_to_mdbxReadAhead is behind
WARMUP_TABLE_WORKERS=256feature-flag.By default not enabled. But maybe i will auto-enable it - because it's used on offline-maintainance-tools only.
FYI:
mdbx_cursor_distribute()is fast ondb >> rambecause i run it only on BranchNodes of btree:table.stats.depth - 2