Enable RocksDB atomic_flush to keep the index crash-consistent during IBD by DeviaVir · Pull Request #11 · Blockstream/waterfalls

DeviaVir · 2026-06-29T17:41:32Z

Problem

The WAL is disabled during IBD (write() → write_without_wal() while ibd == true) to speed up the initial sync. A single block update is one WriteBatch spanning several column families — UTXOs, history, and the block hash/height marker (plus reorg data once at tip). The batch is atomic at write time, but durability is a separate matter.

With the WAL off, each column family flushes its memtable to SST independently. RocksDB makes no cross-CF ordering guarantee for these background flushes, so a hard, ungraceful stop mid-sync — OOM kill, SIGKILL, power/host loss — can leave the families persisted to different points. The dangerous case:

the block-height marker (hashes CF) gets flushed through block N, but
a UTXO created at some block ≤ N (in the utxos CF) was still in an un-flushed memtable and is lost.
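
The dangerous case can be reproduced in a toy model. This is a std-only sketch, not RocksDB code: `ToyCf`, its methods, and the keys used are hypothetical names chosen for illustration, and a "flush" here is just a move from an in-memory map to a second map standing in for SST files.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for one column family: an in-memory memtable plus the
/// part that has already been flushed to "disk". Hypothetical type.
#[derive(Default)]
struct ToyCf {
    memtable: BTreeMap<String, String>,
    disk: BTreeMap<String, String>,
}

impl ToyCf {
    fn put(&mut self, key: &str, value: &str) {
        self.memtable.insert(key.to_string(), value.to_string());
    }

    /// Independent flush: only this family's memtable is persisted.
    fn flush(&mut self) {
        let pending = std::mem::take(&mut self.memtable);
        self.disk.extend(pending);
    }

    /// Hard kill with the WAL off: whatever was not flushed is gone.
    fn crash(&mut self) {
        self.memtable.clear();
    }
}

/// Replays the dangerous case and returns
/// (height marker found on disk, whether the UTXO is on disk).
fn torn_state_after_crash() -> (Option<String>, bool) {
    let mut utxos = ToyCf::default();
    let mut hashes = ToyCf::default();

    // One logical block update touches both families.
    utxos.put("txid:0", "unspent");
    hashes.put("tip_height", "100");

    // Only the hashes family happens to flush before the kill.
    hashes.flush();
    utxos.crash();
    hashes.crash();

    (
        hashes.disk.get("tip_height").cloned(),
        utxos.disk.contains_key("txid:0"),
    )
}
```

Calling `torn_state_after_crash()` yields a marker at height 100 with the UTXO missing, which is the state that later trips the "every utxo must exist when spent" check.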

On the next startup the index believes it is synced through N, fetches N+1, and tries to spend a UTXO that no longer exists. That hits:

every utxo must exist when spent, can't find <outpoint>

which is a hard panic. Because the on-disk state is permanently inconsistent, every subsequent restart re-reads the same block and panics again — an unrecoverable crash loop. The only recovery is wiping the DB and re-indexing (or restoring a known-good copy).

Note that this window is not limited to the initial sync from genesis: IBD with the WAL off also runs briefly on every restart, while the node catches up from its last indexed height to the tip. Any ungraceful kill landing in that window can corrupt the index.

Fix

Enable RocksDB's atomic_flush:

db_opts.set_atomic_flush(true);

atomic_flush makes multi-column-family flushes commit atomically, so the recovered on-disk state is always a consistent prefix of the write history — even with the WAL disabled. This is precisely the configuration RocksDB documents it for; the rust-rocksdb docs note it is "only useful when the WAL is disabled."
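
What "consistent prefix" means can be shown with a small std-only model. It is a sketch under stated assumptions rather than RocksDB itself: `ToyDb`, `flush_all_together`, `index_block`, and `recovered_state` are hypothetical names, and the grouped flush is modelled as a single step that persists every family's memtable at once.

```rust
use std::collections::BTreeMap;

type Table = BTreeMap<String, String>;

/// Toy database: per-family memtables plus per-family "disk" state.
#[derive(Default)]
struct ToyDb {
    memtables: BTreeMap<String, Table>,
    disk: BTreeMap<String, Table>,
}

impl ToyDb {
    fn put(&mut self, cf: &str, key: &str, value: &str) {
        self.memtables
            .entry(cf.to_string())
            .or_default()
            .insert(key.to_string(), value.to_string());
    }

    /// Grouped flush: every family's memtable is persisted in one step,
    /// so no family can end up ahead of another on disk.
    fn flush_all_together(&mut self) {
        let pending = std::mem::take(&mut self.memtables);
        for (cf, table) in pending {
            self.disk.entry(cf).or_default().extend(table);
        }
    }

    /// Hard kill with the WAL off: unflushed memtables are lost.
    fn crash(&mut self) {
        self.memtables.clear();
    }

    fn on_disk(&self, cf: &str, key: &str) -> Option<&str> {
        self.disk.get(cf)?.get(key).map(String::as_str)
    }
}

/// One block update: a new UTXO plus the height marker.
fn index_block(db: &mut ToyDb, height: u32) {
    db.put("utxos", &format!("utxo@{height}"), "unspent");
    db.put("hashes", "tip_height", &height.to_string());
}

/// Index block 100, flush as a group, index block 101, then crash.
fn recovered_state() -> ToyDb {
    let mut db = ToyDb::default();
    index_block(&mut db, 100);
    db.flush_all_together();
    index_block(&mut db, 101); // still only in memtables
    db.crash();
    db
}
```

After `recovered_state()`, the marker reads 100, the UTXO from block 100 is present, and nothing from block 101 survives. Progress is lost, but the families agree, so re-indexing block 101 is safe.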

Secondarily, this PR also flushes all column families on graceful shutdown, so a clean restart keeps its sync progress instead of replaying from the last automatic flush. (This only helps clean exits; atomic_flush is what makes ungraceful kills safe.)

Performance

This was deliberately scoped to preserve the speed the WAL-off path was added for:

At tip (steady state): negligible. atomic_flush changes flush grouping, not the write path — per-write latency is unaffected. With the WAL enabled at tip, writes are already durable and consistent regardless. Flushes there are infrequent (one block at a time + small mempool deltas), so the coordination cost is in the noise.
During IBD: a small, bounded overhead. When the large utxos memtable triggers a flush, the smaller CFs (e.g. the 40-bytes-per-block height marker) are flushed alongside it, producing some additional small SSTs / write amplification. The dominant flush cost — the UTXO data itself — is unchanged, and the big win (no per-batch WAL fsync) is fully retained. If IBD throughput regresses measurably, the clean lever is a larger write_buffer_size / max_write_buffer_number so flushes fire less often; that reduces the number of atomic-flush events without weakening the guarantee.
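
The write_buffer_size lever can be sized with back-of-the-envelope arithmetic. The sketch below assumes a flush fires each time the largest memtable reaches write_buffer_size and ignores per-key overhead, overwrites, and early flushes; `estimated_flushes` is a hypothetical helper, and the 6 GiB total is an illustrative figure, not a measurement from this PR.

```rust
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

/// Rough number of memtable-full flushes needed to persist `total_bytes`
/// when each flush fires at `write_buffer_size` bytes.
/// Simplified estimate, not a RocksDB formula.
fn estimated_flushes(total_bytes: u64, write_buffer_size: u64) -> u64 {
    assert!(write_buffer_size > 0, "write_buffer_size must be non-zero");
    total_bytes / write_buffer_size
}

/// Illustrative comparison for a hypothetical 6 GiB of UTXO writes:
/// returns (flushes at 64 MiB buffers, flushes at 256 MiB buffers).
fn example_flush_counts() -> (u64, u64) {
    (
        estimated_flushes(6 * GIB, 64 * MIB),
        estimated_flushes(6 * GIB, 256 * MIB),
    )
}
```

Under these assumptions, quadrupling the buffer cuts the estimated number of atomic-flush events from 96 to 24, at the cost of more memory held in memtables.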

atomic_flush is an immutable DB-open option (it cannot be toggled at runtime without reopening the DB), so it is set once at open. Given the at-tip cost is ~zero, leaving it on permanently is simpler than reopening the DB after IBD and carries no meaningful steady-state penalty.

Measured

An ignored benchmark (bench_ibd_atomic_flush) replays a synthetic initial sync — 6M UTXO writes through the real update path, WAL off — with atomic_flush off then on:

write buffers	atomic_flush off	atomic_flush on	notes
64 MiB (production)	5.36 s	5.34 s	within run-to-run noise; identical flush/compaction bytes
256 KiB (worst case, constant flushing)	5.42 s	4.89 s	no slowdown — slightly less compaction with it on

So even when flushes are forced to be pathologically frequent, atomic_flush adds no measurable cost (coordinated flushes produce fewer tiny SSTs, so compaction work is equal-or-lower). Run it with cargo test --release --lib bench_ibd_atomic_flush -- --ignored --nocapture (scale via BENCH_BLOCKS / BENCH_CREATES_PER_BLOCK / BENCH_WRITE_BUFFER_KIB).
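
For readers who want the table as relative numbers, the wall-clock difference can be computed as below. `percent_change` is a hypothetical helper, not part of the benchmark; the inputs are the four timings reported above.

```rust
/// Relative change in wall-clock time when atomic_flush is turned on,
/// in percent. Negative means the run with atomic_flush on was faster.
fn percent_change(off_secs: f64, on_secs: f64) -> f64 {
    (on_secs - off_secs) / off_secs * 100.0
}

/// The two rows of the table: (64 MiB buffers, 256 KiB buffers).
fn table_deltas() -> (f64, f64) {
    (percent_change(5.36, 5.34), percent_change(5.42, 4.89))
}
```

This gives roughly -0.4% for the 64 MiB production buffers and roughly -9.8% for the 256 KiB worst case. These are single reported timings, so the first figure in particular should be read as noise rather than a speedup.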

Testing

Added store::db::test::atomic_flush_keeps_utxos_consistent_on_crash, which reproduces the corruption end to end:

Open with the production DB options (the options builder is extracted into db_options so the test exercises the real atomic_flush setting).
Write a UTXO, then enough block-height markers (WAL-off, tiny write buffer) to trigger a background flush.
Snapshot the on-disk directory the moment the flush lands — this models an ungraceful crash, since only data already flushed to disk survives.
Reopen from the snapshot and assert the UTXO is still there.

With atomic_flush the UTXO is co-flushed and survives; remove set_atomic_flush(true) and the test fails with the UTXO gone — i.e. it directly guards the fix and demonstrates the failure mode. (The flush has to be the automatic memtable-pressure one; a manual single-family flush does not co-flush other families even with atomic_flush.)
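
The "snapshot the on-disk directory" step needs some way to copy the DB directory. The PR text does not show how the test does this, so the following is only a std-only sketch of one possible approach; `snapshot_dir` is a hypothetical helper name.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively copies everything under `src` into `dst`, creating `dst`
/// if needed. Only what is already on disk is copied, so unflushed
/// in-memory state is dropped, as it would be after a hard kill.
fn snapshot_dir(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            // Descend into subdirectories.
            snapshot_dir(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}
```

A test would call `snapshot_dir(db_path, copy_path)` once it has observed the flush, then open the database from `copy_path`. Copying the directory of a live database is inherently racy, so the timing of the copy relative to the flush matters.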

cargo test --lib green (incl. the new test); cargo clippy -- -D warnings and cargo fmt clean.

… IBD

The WAL is disabled during IBD for write speed. Each block update is a single WriteBatch spanning multiple column families (utxos, history, block hash/height). Without the WAL, column families flush their memtables independently, so a hard crash mid-sync (OOM, SIGKILL, host loss) can persist some families ahead of others: for example, the block-height marker advances while a UTXO that a later block spends was never flushed. On restart this is fatal and unrecoverable: indexing the next block hits `every utxo must exist when spent` and the process crash-loops forever.

Set `atomic_flush(true)` so multi-family flushes commit atomically; the recovered on-disk state is then always a consistent prefix even with the WAL off. It only changes flush grouping (not the write path), so the cost at tip is negligible; the work falls on IBD, where the guarantee is needed.

Also flush all column families on graceful shutdown so a clean restart keeps its sync progress instead of replaying from the last automatic flush.

Extracts the DB-wide options into `DBStore::db_options` so the test opens with the exact production options (including atomic_flush) and asserts the fix end to end. The test writes a UTXO and then enough height markers (WAL-off, tiny write buffer) to trigger a background flush, snapshots the on-disk directory the moment the flush lands — modelling an ungraceful crash, since only flushed data survives — and reopens from the snapshot. With atomic_flush the UTXO is co-flushed and survives; without it the UTXO is lost and the assertion fails, reproducing the "every utxo must exist when spent" corruption.

An ignored test that replays a synthetic initial sync (WAL off) through the real update path with atomic_flush off then on, printing wall-clock time and flush/compaction bytes. Production column-family settings by default; set BENCH_WRITE_BUFFER_KIB to force frequent flushes and probe the worst case.

Run: cargo test --release --lib bench_ibd_atomic_flush -- --ignored --nocapture

Measured (6M UTXO writes, this machine):
- 64 MiB buffers (production): 5.36s off vs 5.34s on, within noise.
- 256 KiB buffers (worst case, constant flushing): 5.42s off vs 4.89s on, with less compaction on.

No measurable slowdown from atomic_flush.

DeviaVir · 2026-07-01T12:23:18Z

Superseded by RCasatta#84 (review)

DeviaVir self-assigned this Jun 29, 2026

DeviaVir added 2 commits June 29, 2026 20:33

DeviaVir marked this pull request as ready for review June 29, 2026 19:37

DeviaVir requested a review from RCasatta June 29, 2026 19:38

RCasatta mentioned this pull request Jul 1, 2026

Remove disabling WAL during IBD RCasatta/waterfalls#84

Merged

DeviaVir closed this Jul 1, 2026

DeviaVir wants to merge 3 commits into master from atomic-flush-ibd-durability
