Skip to content

db: opt-in ReadAhead of Table with limited "ahead window"#21880

Open
AskAlexSharov wants to merge 57 commits into
mainfrom
alex/warmup_table_36
Open

db: opt-in ReadAhead of Table with limited "ahead window"#21880
AskAlexSharov wants to merge 57 commits into
mainfrom
alex/warmup_table_36

Conversation

@AskAlexSharov

@AskAlexSharov AskAlexSharov commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Completely changed PR.

Now:

  • table split on ch"keys amount aware" and equal chunks - by mdbx_cursor_distribute()
  • read-aheader doesn't run to far ahead from consumer (1gb ahead)
  • ClearTable() method now based on mdbxgo_cursor_delete_range() + ReadAheader()

It greatly speedup "clear table" and "prune" operations on chaindata >> RAM case

Speedup tools:

  • integration stage_exec --reset
  • integration mdbx_to_mdbx

ReadAhead is behind WARMUP_TABLE_WORKERS=256 feature-flag.
By default not enabled. But maybe i will auto-enable it - because it's used on offline-maintainance-tools only.

FYI: mdbx_cursor_distribute() is fast on db >> ram because i run it only on BranchNodes of btree: table.stats.depth - 2

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in mechanism to warm specific MDBX tables into the OS page cache to improve operation when chaindata is much larger than RAM, and wires this warmup into several read-heavy workflows (collate/prune, backup/copy, integration stages).

Changes:

  • Extend kv.RoDB / kv.TemporalRoDB with WarmupTable / Warmup APIs and provide implementations (MDBX + no-ops for remote/memory DBs).
  • Integrate per-table warmup into state domain/inverted-index collate/prune and into backup table clear/copy paths.
  • Update call sites/signatures in stagedsync resets and integration commands to pass the DB handle where needed; adjust stage_exec flow around BuildFiles.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
polygon/db/database.go Exposes WarmupTable on Polygon DB wrapper by delegating to underlying kv.RoDB.
execution/stagedsync/stage_custom_trace.go Updates backup.ClearTables call to pass the DB handle.
execution/stagedsync/rawdbreset/reset_stages.go Adds DB arg to reset helpers and passes it to backup.ClearTables for warmup-while-clear.
db/state/inverted_index.go Adds kv.RoDB field and fires background WarmupTable before reading keys table in collate.
db/state/domain.go Fires background WarmupTable before reading domain values table in collate/prune.
db/state/aggregator.go Wires Aggregator DB handle into registered Domains/InvertedIndexes (enables warmup).
db/rawdb/blockio/block_writer.go Passes db into backup.ClearTables when truncating bodies-related tables.
db/kv/temporal/kv_temporal.go Adds TemporalRoDB.Warmup(ctx, force) implementation via underlying MDBX tx warmup.
db/kv/remotedb/kv_remote.go Implements no-op Warmup/WarmupTable for remote DB provider.
db/kv/membatchwithdb/memory_store.go Implements no-op WarmupTable for in-memory store DB.
db/kv/membatchwithdb/memory_mutation.go Implements no-op Warmup/WarmupTable for temporal in-memory DB wrapper.
db/kv/mdbx/kv_mdbx.go Implements MDBX WarmupTable via parallel prefix/seek scans.
db/kv/kv_interface.go Extends interfaces with WarmupTable and Warmup API contracts.
db/kv/helpers.go Forwards WarmupTable through gatedRoDB.
db/kv/backup/backup.go Adds warmupWhile helper; warms tables during Kv2kv and ClearTables; updates signatures.
common/dbg/experiments.go Adds env-controlled WARMUP_TABLE_WORKERS knob (default 0 disables warmup).
cmd/integration/commands/stages.go Updates reset call signature; modifies stageExec commit/build flow.
cmd/integration/commands/reset_state.go Updates backup.ClearTables call to pass DB handle.
cmd/integration/commands/refetence_db.go Updates backup.Kv2kv call signature (drops threads arg).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread db/kv/mdbx/kv_mdbx.go Outdated
Comment thread db/kv/mdbx/kv_mdbx.go Outdated
Comment thread db/kv/mdbx/kv_mdbx.go Outdated
Comment thread cmd/integration/commands/stages.go Outdated
@AskAlexSharov AskAlexSharov changed the title db/kv, db/state, execution: opt-in per-table page-cache warmup [wip] db: opt-in per-table warmup Jun 18, 2026
@AskAlexSharov AskAlexSharov requested a review from Copilot June 18, 2026 09:20
@yperbasis yperbasis added this to the 3.6.0 milestone Jun 18, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

Comment thread db/kv/mdbx/kv_mdbx.go Outdated
Comment thread db/kv/mdbx/kv_mdbx.go Outdated
Comment thread db/kv/mdbx/kv_mdbx.go Outdated
Comment thread db/kv/mdbx/kv_mdbx.go Outdated
Comment thread db/kv/backup/backup.go Outdated
Comment thread db/kv/helpers.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Comment thread db/kv/mdbx/kv_mdbx.go Outdated
Comment thread db/kv/mdbx/kv_mdbx.go Outdated
@yperbasis

Copy link
Copy Markdown
Member

A few substantive concerns on the warmup logic before this comes out of WIP:

1. No parallelism on counter-keyed tables. WarmupTable partitions by the first 2 bytes (Prefix([]byte{i,j})), which only fans out when those bytes are well-distributed (domain ValuesTable — the big win). For 8-byte-BE counter keys (txNum/blockNum) the top 2–4 bytes are always 0x00 even at mainnet scale, so the whole table sits under prefix [0,0] and one goroutine scans it while the other 65 535 find nothing. This hits ii.KeysTable — exactly what the II collate warmup targets — plus EthTx/MaxTxNum/Senders (Kv2kv/TruncateBodies/ResetSenders). Those paths degenerate to a single-threaded full scan.

2. The 1000-goroutine second loop is redundant. Prefix(8-byte seek) = Range(seek, NextSubtree(seek)) matches only keys whose first 8 bytes equal seek — it doesn't scan between seeks. So its coverage is a strict subset of loop 1 (every seek starts with 00 00), it double-counts into progress, and it doesn't actually help the counter tables it looks aimed at. Either drop it, or replace both loops with Range(seek_i, seek_{i+1}) partitioning — which would also fix point 1.

3. Long-lived read txns. Each prefix scan wraps the full iteration in one db.View (one read txn). The dominant [0,0] goroutine on a counter table then holds a single reader open for the entire >>RAM scan, concurrent with the collate/prune writes that spawned it → pins the MVCC snapshot and blocks free-page reclamation (DB bloat). Same in ClearTables (warm-while-drop pins the pages the drop frees). Worth bounding scan-txn lifetime before this is enabled anywhere.

4. The stage_exec change is always-on, not gated by the flag. The new commit → BuildFiles(EndTxNumMinimax()+StepSize()) → reopen block is gated by !noCommit, not WARMUP_TABLE_WORKERS, so it changes integration stage_exec regardless of the feature (synchronous build + extra commit/reopen per iteration). Probably worth splitting out / calling out. Also EndTxNumMinimax()+StepSize() builds only one step past the frontier per iteration with no final catch-up — intended?

Minor: the new // TODO in backupTable (sequence/Unwind) has no linked issue/owner and is unrelated to warmup — per repo policy, file an issue or drop it.

@AskAlexSharov AskAlexSharov force-pushed the alex/warmup_table_36 branch from d3aa6e4 to 6bf1e15 Compare June 19, 2026 01:50
@AskAlexSharov AskAlexSharov requested a review from Copilot June 19, 2026 01:50

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@AskAlexSharov AskAlexSharov requested a review from Copilot June 19, 2026 01:52

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@AskAlexSharov AskAlexSharov force-pushed the alex/warmup_table_36 branch 3 times, most recently from b83f030 to 6aea736 Compare June 21, 2026 06:45
@AskAlexSharov AskAlexSharov changed the title [wip] db: opt-in per-table warmup [wip] db: opt-in ReadAhead with "run too far ahead" prevention Jun 21, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 2 comments.

Comment thread db/kv/mdbx/kv_mdbx.go
Comment thread db/kv/backup/backup.go
@AskAlexSharov AskAlexSharov requested a review from yperbasis June 29, 2026 12:08
@AskAlexSharov AskAlexSharov enabled auto-merge July 1, 2026 10:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 1 comment.

Comment thread db/kv/backup/backup.go

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much nicer design than the previous round. Two blockers plus assorted smaller items.

Blockers:

  1. clearTable reuses un-cloned mdbx memory across DeleteRange calls (db/kv/backup/backup.go). chunkBounds' doc says "returns the cloned boundaries", but it returns DistributeCursors output raw — zero-copy slices aliasing the mmap. The delete loop then feeds bounds[i]/bounds[i+1] into DeleteRange after earlier chunks' deletes have already mutated the tree in the same write tx. Per the MDBX contract those slices are only valid until the next update op, and with the default --db.writemap=true pages retired by an earlier chunk can be recycled and rewritten in place — later bounds become garbage, so chunks get silently skipped (from >= to returns 0) or the wrong range is deleted, i.e. stage_exec --reset can leave residual rows in state tables. ReadAhead.plan() clones the same output for exactly this reason (bounds[i] = bytes.Clone(k)); chunkBounds needs the same.

  2. kv.BlockAccessList doesn't belong in stateHistoryBuckets (execution/stagedsync/rawdbreset/reset_stages.go). BALs are written only at block-insert time from received payloads (execmodule/inserters.go); exec computes them for validation but never persists. Once ResetExec wipes the table, re-exec silently falls back to non-BAL scheduling — under NO_PRUNE that invalidates BAL vs non-BAL benchmark comparisons, and the data can't be rebuilt without re-downloading blocks. Block-layer resets (ResetBlocksTruncateBlocks) already clear it at the right layer. Please drop this line, or state the rationale.

Also:

  • warm()'s _, _ = v[0], v[len(v)-1] is dead-load-eliminated (I checked the assembly: the guarded touch compiles to a bare RET), so only the C-side cursor traversal faults pages and multi-page overflow values (EthTx, Code) are warmed only up to their first page. Fold the bytes into a sink the compiler can't elide, e.g. XOR into an atomic.
  • Effective prefetch concurrency is capped at ahead+1 ≈ 33 regardless of WARMUP_TABLE_WORKERS (ahead = 1GB/32MB, independent of workers, and parked workers hold errgroup slots inside waitTurn). At the suggested 256 workers, ~223 goroutines only poll their 1ms tickers (~220k wakeups/s for hours). Suggest min(workers, ahead+1), or signal from SetPos on chunk advance instead of polling.
  • AlwaysGenerateChangesetsFlag is registered on snapshots retire, but doRetireCommand never reads it — a silent no-op flag.
  • No tests cover MdbxTx.DeleteRange, temporal RwTx.DeleteRange, kv.ReadAhead, or the chunked ClearTables path (only DistributeCursors is tested). A multi-chunk ClearTables test on a WriteMap DB would have caught the first blocker.
  • The temporal RwTx.DeleteRange fallback buffers every key of the range in RAM and returns len(keys) even when a mid-loop delete fails; it's also unreachable today (every temporal backend is mdbx). Stream it like db/state.DeleteRangeFromTbl, or drop it.
  • The new TODO in backupTable still has no linked issue/owner (per repo policy).

Nit-level: newReadAhead's full/label params are dead (nothing passes full=true); the 32MB chunk constant is defined twice and bounds are computed twice per clear when warmup is on; the two no_prune_test.go hunks are pure churn; NewReadAhead dropped the old nil-db guard (latent — a nil db now panics in a detached goroutine).

FYI: the eest-spec-enginextests-stable-sequential failure is 1/63920 (CancunToPrague excess_blob_gas transition, payload INVALID) — looks flaky, but worth a re-run given the mdbx-go bump touches every code path.

@AskAlexSharov AskAlexSharov requested a review from yperbasis July 3, 2026 02:37

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 3 comments.

Comment thread db/kv/temporal/kv_temporal.go
Comment thread db/kv/helpers.go Outdated
Comment thread db/kv/readahead.go

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up on head cd869c7: new findings plus what's still open from my July 2 round, by severity.

Blocking:

  1. Still no tests for MdbxTx.DeleteRange or the chunked ClearTables path (only DistributeCursors is tested). These now run ungated on every reset path, and the bug class they can hide is exactly blocker 1 from the last round (silently skipped chunks → residual state rows). A multi-chunk ClearTables test on a WriteMap DB would pin it — also prudent given the mdbx-go bump touches every code path.

Should fix before merge (cheap, and they matter more if warmup gets auto-enabled):

  1. waitTurn's 1ms poll (db/kv/readahead.go): at 256 workers that's ~250k timer wakeups/s (~0.13–0.5 of a core) for the whole copy/clear, and effective concurrency is silently capped at ahead+1 ≈ 33 (≈5 for ≥1TB tables past the 4096-chunk cap) — so WARMUP_TABLE_WORKERS=256 doesn't deliver 256. min(workers, ahead+1) is lossless; a SetPos-driven wakeup would be nicer.

  2. AlwaysGenerateChangesetsFlag on seg retire is still inert — doRetireCommand never reads it and the flag has no EnvVars. Drop it or wire it.

Fix-or-drop while fresh (latent — unreachable today):

  1. temporal RwTx.DeleteRange emulation still buffers every key of the range in RAM and returns len(keys) on a mid-loop delete error (the interface says keys removed). Every temporal backend today is mdbx, so it's dead code that mishandles the one case it exists for — either drop it (letting clearTable's ClearTable fallback fire) or stream it like db/state.DeleteRangeFromTbl.

  2. NewReadAhead lost the nil-db guard the old kv.ReadAhead had; nil db + WARMUP_TABLE_WORKERS>0 panics in a detached goroutine.

  3. (new) DistributeCursors discards mdbx-go's allSet return and treats any GetCurrent error as "unset surplus cursor", so a genuine MDBX_BAD_TXN/EBADSIGN/MDBX_PANIC silently degrades to coarser bounds (low impact: range coverage is preserved and the error resurfaces at the next same-tx op). Related: the unconditional keys = keys[:len(keys)-1] is only correct when allSet is true — when surplus cursors were left unset it drops a real interior boundary and merges the final two chunks. Checking allSet fixes both.

Nits (all still open): the backupTable TODO has no tracking issue/owner; newReadAhead's full/label params are dead; the 32MB chunk constant is defined twice and bounds are computed twice per clear when warmup is on; the two no_prune_test.go hunks are pure churn; ReadAheadDeprecated should use the machine-readable // Deprecated: form.

One downgrade from my July 2 list: the warm()/value-touch item is less material than I stated. mdbx's page-header validation already faults the entire first overflow page, so at our 16KB pages only values >16.4KB (code near the 24KB cap, blob-carrying EthTx) leave tail pages cold — and MADV_NORMAL sequential readahead likely covers those. Nit-level now: fold a compiler-proof sink in whenever the file is next touched.

…eleteRange clones keys

warm/ReadAheadDeprecated discarded the key/value from the iterator, so the
Go compiler dead-code-eliminated the touch loads (verified: a guarded v[i]
whose result is unused compiles to a bare RET) — only the cgo cursor
traversal faulted pages, leaving multi-page overflow values (EthTx, Code)
warmed no further than their first page. Fold one byte per OS page into an
atomic sink the compiler can't elide, so every value page is actually faulted.

temporal.RwTx.DeleteRange emulation: clone keys before deleting (Range yields
zero-copy keys invalidated by the next Next()) and return the count actually
deleted instead of the planned count.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 1 comment.

Comment thread db/kv/readahead.go
@AskAlexSharov AskAlexSharov requested a review from yperbasis July 4, 2026 03:00
warm() is best-effort, but a genuine cursor/range/IO error should be
diagnosable. Log a warning on error when the context isn't canceled;
stay silent on the expected cancellation path.
@AskAlexSharov AskAlexSharov requested a review from Copilot July 4, 2026 03:00

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants