Enable RocksDB atomic_flush to keep the index crash-consistent during IBD #11
Closed
DeviaVir wants to merge 3 commits into
… IBD

The WAL is disabled during IBD for write speed. Each block update is a single WriteBatch spanning multiple column families (utxos, history, block hash/height). Without the WAL, column families flush their memtables independently, so a hard crash mid-sync (OOM, SIGKILL, host loss) can persist some families ahead of others: for example, the block-height marker advances while a UTXO that a later block spends was never flushed. On restart this is fatal and unrecoverable: indexing the next block hits `every utxo must exist when spent` and the process crash-loops forever.

Set `atomic_flush(true)` so multi-family flushes commit atomically; the recovered on-disk state is then always a consistent prefix even with the WAL off. It only changes flush grouping (not the write path), so the cost at tip is negligible; the work falls on IBD, where the guarantee is needed.

Also flush all column families on graceful shutdown so a clean restart keeps its sync progress instead of replaying from the last automatic flush.
Extracts the DB-wide options into `DBStore::db_options` so the test opens with the exact production options (including atomic_flush) and asserts the fix end to end. The test writes a UTXO and then enough height markers (WAL-off, tiny write buffer) to trigger a background flush, snapshots the on-disk directory the moment the flush lands — modelling an ungraceful crash, since only flushed data survives — and reopens from the snapshot. With atomic_flush the UTXO is co-flushed and survives; without it the UTXO is lost and the assertion fails, reproducing the "every utxo must exist when spent" corruption.
An ignored test that replays a synthetic initial sync (WAL off) through the real update path with atomic_flush off then on, printing wall-clock time and flush/compaction bytes. Production column-family settings by default; set BENCH_WRITE_BUFFER_KIB to force frequent flushes and probe the worst case.

Run: `cargo test --release --lib bench_ibd_atomic_flush -- --ignored --nocapture`

Measured (6M UTXO writes, this machine):
- 64 MiB buffers (production): 5.36s off vs 5.34s on, within noise.
- 256 KiB buffers (worst case, constant flushing): 5.42s off vs 4.89s on, with less compaction on.

No measurable slowdown from atomic_flush.
Author: Superseded by RCasatta#84 (review)
Problem
The WAL is disabled during IBD (`write()` → `write_without_wal()` while `ibd == true`) to speed up the initial sync. A single block update is one `WriteBatch` spanning several column families: UTXOs, history, and the block hash/height marker (plus reorg data once at tip). The batch is atomic at write time, but durability is a separate matter.

With the WAL off, each column family flushes its memtable to SST independently. RocksDB makes no cross-CF ordering guarantee for these background flushes, so a hard, ungraceful stop mid-sync (OOM kill, `SIGKILL`, power/host loss) can leave the families persisted to different points. The dangerous case:

- the block-height marker (`hashes` CF) gets flushed through block N, but
- a UTXO that a later block spends (`utxos` CF) was still in an un-flushed memtable and is lost.

On the next startup the index believes it is synced through N, fetches N+1, and tries to spend a UTXO that no longer exists. That hits the `every utxo must exist when spent` check, which is a hard panic. Because the on-disk state is permanently inconsistent, every subsequent restart re-reads the same block and panics again: an unrecoverable crash loop. The only recovery is wiping the DB and re-indexing (or restoring a known-good copy).

Note this window is not limited to the first genesis sync: IBD-with-WAL-off also runs briefly on every restart, while the node catches up from its last indexed height to the tip. Any ungraceful kill landing in that window can corrupt the index.
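The failure mode can be illustrated with a toy model. This is a minimal sketch in plain Rust that does not use RocksDB at all: `ToyCf`, `ToyIndex`, and `survives_crash` are hypothetical names invented for illustration, each "column family" is just a memtable map plus a "disk" map, and a crash is modelled by discarding the memtables. It only shows why flushing families independently can leave the height marker ahead of the UTXO set, and why flushing them together avoids that.

```rust
use std::collections::BTreeMap;

/// Toy column family (hypothetical): a memtable that is lost on a crash
/// and a "disk" map that survives it.
#[derive(Default)]
struct ToyCf {
    memtable: BTreeMap<String, String>,
    disk: BTreeMap<String, String>,
}

impl ToyCf {
    /// Persist everything currently buffered in the memtable.
    fn flush(&mut self) {
        let buffered = std::mem::take(&mut self.memtable);
        self.disk.extend(buffered);
    }
}

/// Toy index holding the two families involved in the failure.
#[derive(Default)]
struct ToyIndex {
    utxos: ToyCf,
    hashes: ToyCf,
    atomic_flush: bool,
}

impl ToyIndex {
    /// One block update with the WAL off: both writes only reach memtables.
    fn index_block(&mut self, height: u32, outpoint: &str) {
        self.utxos
            .memtable
            .insert(outpoint.to_string(), format!("created in block {height}"));
        self.hashes
            .memtable
            .insert("height".to_string(), height.to_string());
    }

    /// A background flush triggered by the `hashes` family. In this model,
    /// `atomic_flush` means the other family is flushed along with it.
    fn background_flush(&mut self) {
        self.hashes.flush();
        if self.atomic_flush {
            self.utxos.flush();
        }
    }

    /// Hard crash: every memtable is discarded, only "disk" remains.
    fn crash(&mut self) {
        self.utxos.memtable.clear();
        self.hashes.memtable.clear();
    }
}

/// Returns true if, after a crash, the recovered state is consistent:
/// whenever the height marker is on disk, the UTXO it implies is too.
fn survives_crash(atomic_flush: bool) -> bool {
    let mut index = ToyIndex { atomic_flush, ..Default::default() };
    index.index_block(1, "tx0:0");
    index.background_flush();
    index.crash();
    let marker_on_disk = index.hashes.disk.contains_key("height");
    let utxo_on_disk = index.utxos.disk.contains_key("tx0:0");
    !marker_on_disk || utxo_on_disk
}

fn main() {
    println!("independent flush consistent: {}", survives_crash(false));
    println!("atomic flush consistent:      {}", survives_crash(true));
}
```

In this model the independent-flush run recovers a height marker with no matching UTXO, which is the state that would make the next block's spend fail; the co-flushed run recovers both.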
Fix
Enable RocksDB's `atomic_flush` by calling `set_atomic_flush(true)` on the DB-wide options.

`atomic_flush` makes multi-column-family flushes commit atomically, so the recovered on-disk state is always a consistent prefix of the write history, even with the WAL disabled. This is precisely the configuration RocksDB documents it for; the rust-rocksdb docs note it is "only useful when the WAL is disabled."

Secondarily, this flushes all column families on graceful shutdown too, so a clean restart keeps its sync progress instead of replaying from the last automatic flush. (This only helps clean exits; `atomic_flush` is what makes the ungraceful kills safe.)

Performance
This was deliberately scoped to preserve the speed the WAL-off path was added for:

- At tip: `atomic_flush` changes flush grouping, not the write path, so per-write latency is unaffected. With the WAL enabled at tip, writes are already durable and consistent regardless. Flushes there are infrequent (one block at a time + small mempool deltas), so the coordination cost is in the noise.
- During IBD: when the `utxos` memtable triggers a flush, the smaller CFs (e.g. the 40-bytes-per-block height marker) are flushed alongside it, producing some additional small SSTs / write amplification. The dominant flush cost, the UTXO data itself, is unchanged, and the big win (no per-batch WAL fsync) is fully retained. If IBD throughput regresses measurably, the clean lever is a larger `write_buffer_size` / `max_write_buffer_number` so flushes fire less often; that reduces the number of atomic-flush events without weakening the guarantee.
- `atomic_flush` is an immutable DB-open option (it cannot be toggled at runtime without reopening the DB), so it is set once at open. Given the at-tip cost is ~zero, leaving it on permanently is simpler than reopening the DB after IBD and carries no meaningful steady-state penalty.

Measured
An ignored benchmark (`bench_ibd_atomic_flush`) replays a synthetic initial sync (6M UTXO writes through the real `update` path, WAL off) with `atomic_flush` off then on:

- 64 MiB buffers (production): 5.36s off vs 5.34s on, within noise.
- 256 KiB buffers (worst case, constant flushing): 5.42s off vs 4.89s on, with less compaction on.

So even when flushes are forced to be pathologically frequent, `atomic_flush` adds no measurable cost (coordinated flushes produce fewer tiny SSTs, so compaction work is equal-or-lower). Run it with `cargo test --release --lib bench_ibd_atomic_flush -- --ignored --nocapture` (scale via `BENCH_BLOCKS` / `BENCH_CREATES_PER_BLOCK` / `BENCH_WRITE_BUFFER_KIB`).

Testing
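Added `store::db::test::atomic_flush_keeps_utxos_consistent_on_crash`, which reproduces the corruption end to end:

1. Open the DB with the exact production options (via `db_options` so the test exercises the real `atomic_flush` setting) and a tiny write buffer.
2. Write a UTXO, then enough height markers with the WAL off to trigger a background flush.
3. Snapshot the on-disk directory the moment the flush lands. This models an ungraceful crash, since only flushed data survives.
4. Reopen from the snapshot and check that the UTXO is still there.

With `atomic_flush` the UTXO is co-flushed and survives; remove `set_atomic_flush(true)` and the test fails with the UTXO gone, i.e. it directly guards the fix and demonstrates the failure mode. (The flush has to be the automatic memtable-pressure one; a manual single-family flush does not co-flush other families even with `atomic_flush`.)

`cargo test --lib` green (incl. the new test); `cargo clippy -- -D warnings` and `cargo fmt` clean.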
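The "snapshot the on-disk directory" step can be sketched with the standard library alone. The helper below is an assumption for illustration, not the code from this PR: `snapshot_dir` is a hypothetical name, the file names are made up, and how the real test detects that the flush has landed is not shown here. It only demonstrates the idea that a copy taken at a given moment contains what was on disk then and nothing written afterwards.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: recursively copy `src` into `dst`, creating `dst`
/// if needed. Whatever is on disk at copy time is all the copy contains.
fn snapshot_dir(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            snapshot_dir(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Scratch directories under the system temp dir (names are illustrative).
    let base = std::env::temp_dir().join(format!("snapshot-sketch-{}", std::process::id()));
    let live = base.join("live");
    let snap = base.join("snap");

    // Stand-ins for files that have already been flushed to disk.
    fs::create_dir_all(live.join("sub"))?;
    fs::write(live.join("000001.sst"), b"flushed data")?;
    fs::write(live.join("sub").join("meta"), b"metadata")?;

    snapshot_dir(&live, &snap)?;

    // Anything written after the snapshot is absent from it, just as
    // un-flushed data is absent after a hard crash.
    fs::write(live.join("000002.sst"), b"written later")?;
    assert!(snap.join("000001.sst").exists());
    assert!(snap.join("sub").join("meta").exists());
    assert!(!snap.join("000002.sst").exists());

    fs::remove_dir_all(&base)?;
    Ok(())
}
```

Reopening the database from such a copy, rather than from the live directory, is what lets a test observe only the flushed state without actually killing the process.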