Enable RocksDB atomic_flush to keep the index crash-consistent during IBD #11

Closed
DeviaVir wants to merge 3 commits into master from atomic-flush-ibd-durability

DeviaVir commented Jun 29, 2026

Problem

The WAL is disabled during IBD (write() → write_without_wal() while ibd == true) to speed up the initial sync. A single block update is one WriteBatch spanning several column families: UTXOs, history, and the block hash/height marker (plus reorg data once at tip). The batch is applied atomically at write time, but that says nothing about when each family's share of it becomes durable.

With the WAL off, each column family flushes its memtable to SST independently. RocksDB makes no cross-CF ordering guarantee for these background flushes, so a hard, ungraceful stop mid-sync — OOM kill, SIGKILL, power/host loss — can leave the families persisted to different points. The dangerous case:

  • the block-height marker (hashes CF) gets flushed through block N, but
  • a UTXO created at some block ≤ N (in the utxos CF) was still in an un-flushed memtable and is lost.

On the next startup the index believes it is synced through N, fetches N+1, and tries to spend a UTXO that no longer exists. That hits:

every utxo must exist when spent, can't find <outpoint>

which is a hard panic. Because the on-disk state is permanently inconsistent, every subsequent restart re-reads the same block and panics again — an unrecoverable crash loop. The only recovery is wiping the DB and re-indexing (or restoring a known-good copy).
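
The failure mode above can be sketched with a toy model. This is only an illustration, not RocksDB or the index code: `ToyCf`, `ToyIndex`, and `crash_scenario` are hypothetical names, each "column family" is a pair of in-memory maps, and a flush of the `hashes` family stands in for an automatic background flush.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for one column family: a memtable plus already-flushed "disk" data.
#[derive(Default)]
struct ToyCf {
    memtable: BTreeMap<String, String>,
    disk: BTreeMap<String, String>,
}

impl ToyCf {
    fn put(&mut self, key: &str, value: &str) {
        self.memtable.insert(key.to_string(), value.to_string());
    }

    /// Move everything buffered in the memtable to "disk".
    fn flush(&mut self) {
        let buffered = std::mem::take(&mut self.memtable);
        self.disk.extend(buffered);
    }
}

/// Toy index with two families and no WAL (hypothetical, for illustration only).
struct ToyIndex {
    utxos: ToyCf,
    hashes: ToyCf,
    atomic_flush: bool,
}

impl ToyIndex {
    fn new(atomic_flush: bool) -> Self {
        ToyIndex { utxos: ToyCf::default(), hashes: ToyCf::default(), atomic_flush }
    }

    /// One block update: create a UTXO and advance the height marker together.
    fn index_block(&mut self, height: u32, created_outpoint: &str) {
        self.utxos.put(created_outpoint, "unspent");
        self.hashes.put("tip_height", &height.to_string());
    }

    /// A background flush triggered by the `hashes` family.
    /// In this model, atomic_flush means the other family is flushed with it.
    fn background_flush_of_hashes(&mut self) {
        self.hashes.flush();
        if self.atomic_flush {
            self.utxos.flush();
        }
    }

    /// Hard kill: with no WAL, whatever is still in a memtable is gone.
    fn crash(&mut self) {
        self.utxos.memtable.clear();
        self.hashes.memtable.clear();
    }

    /// The height the index believes it has reached after recovery.
    fn recovered_height(&self) -> Option<u32> {
        self.hashes.disk.get("tip_height").and_then(|h| h.parse().ok())
    }

    /// Whether the UTXO survived on "disk".
    fn has_utxo(&self, outpoint: &str) -> bool {
        self.utxos.disk.contains_key(outpoint)
    }
}

/// Index one block, flush the marker family, crash, and report
/// (recovered height, whether the created UTXO survived).
fn crash_scenario(atomic_flush: bool) -> (Option<u32>, bool) {
    let mut index = ToyIndex::new(atomic_flush);
    index.index_block(100, "txid:0");
    index.background_flush_of_hashes();
    index.crash();
    (index.recovered_height(), index.has_utxo("txid:0"))
}
```

Calling `crash_scenario(false)` leaves the marker at height 100 with the UTXO missing, which is the inconsistent state that later triggers the panic; `crash_scenario(true)` recovers both.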

Note this window is not limited to the first genesis sync: IBD-with-WAL-off also runs briefly on every restart, while the node catches up from its last indexed height to the tip. Any ungraceful kill landing in that window can corrupt the index.

Fix

Enable RocksDB's atomic_flush:

db_opts.set_atomic_flush(true);

atomic_flush makes multi-column-family flushes commit atomically, so the recovered on-disk state is always a consistent prefix of the write history — even with the WAL disabled. This is precisely the configuration RocksDB documents it for; the rust-rocksdb docs note it is "only useful when the WAL is disabled."
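
For orientation, here is a minimal sketch of how the option sits next to a WAL-less multi-family write when using the rust-rocksdb crate with its default features. It is not the PR's code: the path, the keys and values, and the exact column-family names are placeholders chosen for illustration, and the real options builder sets more than is shown here.

```rust
// Requires the `rocksdb` crate as a dependency (not part of the standard library).
use rocksdb::{Options, WriteBatch, WriteOptions, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut db_opts = Options::default();
    db_opts.create_if_missing(true);
    db_opts.create_missing_column_families(true);
    // The setting this PR adds: flush all column families together.
    db_opts.set_atomic_flush(true);

    // Placeholder path and family names, for illustration only.
    let db = DB::open_cf(&db_opts, "/tmp/atomic-flush-sketch", ["utxos", "history", "hashes"])?;
    let utxos = db.cf_handle("utxos").expect("utxos family was opened above");
    let hashes = db.cf_handle("hashes").expect("hashes family was opened above");

    // One batch touching several families, as a block update does.
    let mut batch = WriteBatch::default();
    batch.put_cf(utxos, b"example-outpoint", b"example-txout");
    batch.put_cf(hashes, b"example-height", b"example-block-hash");

    // Skip the WAL, as the IBD write path does.
    let mut write_opts = WriteOptions::default();
    write_opts.disable_wal(true);
    db.write_opt(batch, &write_opts)?;
    Ok(())
}
```

Because the option is passed at open time, the same `db_opts` value has to be used by every code path that opens the database, tests included.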

Secondarily, this flushes all column families on graceful shutdown too, so a clean restart keeps its sync progress instead of replaying from the last automatic flush. (This only helps clean exits; atomic_flush is what makes the ungraceful kills safe.)

Performance

This was deliberately scoped to preserve the speed the WAL-off path was added for:

  • At tip (steady state): negligible. atomic_flush changes flush grouping, not the write path — per-write latency is unaffected. With the WAL enabled at tip, writes are already durable and consistent regardless. Flushes there are infrequent (one block at a time + small mempool deltas), so the coordination cost is in the noise.
  • During IBD: a small, bounded overhead. When the large utxos memtable triggers a flush, the smaller CFs (e.g. the 40-bytes-per-block height marker) are flushed alongside it, producing some additional small SSTs / write amplification. The dominant flush cost — the UTXO data itself — is unchanged, and the big win (no per-batch WAL fsync) is fully retained. If IBD throughput regresses measurably, the clean lever is a larger write_buffer_size / max_write_buffer_number so flushes fire less often; that reduces the number of atomic-flush events without weakening the guarantee.
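
The relationship between buffer size and the number of atomic-flush events can be approximated with simple arithmetic. The sketch below is a deliberately crude model, not a RocksDB formula: it assumes flushes are driven only by the utxos memtable filling up, ignores max_write_buffer_number and per-entry overhead, and the ingest volume and block count in the usage note are hypothetical inputs. Only the 40 bytes per block figure comes from the description above.

```rust
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

/// Rough number of memtable-full flushes while ingesting `total_bytes`
/// into a family whose memtable holds `write_buffer_bytes`.
fn estimated_flushes(total_bytes: u64, write_buffer_bytes: u64) -> u64 {
    assert!(write_buffer_bytes > 0, "write buffer must be non-zero");
    total_bytes / write_buffer_bytes
}

/// Average height-marker bytes co-flushed per atomic flush,
/// at 40 bytes per block (the figure quoted in the PR text).
fn marker_bytes_per_flush(total_blocks: u64, flushes: u64) -> u64 {
    let marker_bytes = 40 * total_blocks;
    if flushes == 0 {
        // No memtable-full flush happened: everything lands in one final flush.
        return marker_bytes;
    }
    marker_bytes / flushes
}
```

Under a hypothetical 16 GiB of UTXO writes, `estimated_flushes(16 * GIB, 64 * MIB)` gives 256 flush events and `estimated_flushes(16 * GIB, 256 * MIB)` gives 64, so a larger buffer means fewer but larger co-flushed marker SSTs.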

atomic_flush is an immutable DB-open option (it cannot be toggled at runtime without reopening the DB), so it is set once at open. Given the at-tip cost is ~zero, leaving it on permanently is simpler than reopening the DB after IBD and carries no meaningful steady-state penalty.

Measured

An ignored benchmark (bench_ibd_atomic_flush) replays a synthetic initial sync — 6M UTXO writes through the real update path, WAL off — with atomic_flush off then on:

| write buffers | atomic_flush off | atomic_flush on | notes |
|---|---|---|---|
| 64 MiB (production) | 5.36 s | 5.34 s | within run-to-run noise; identical flush/compaction bytes |
| 256 KiB (worst case, constant flushing) | 5.42 s | 4.89 s | no slowdown; slightly less compaction with it on |

So even when flushes are forced to be pathologically frequent, atomic_flush adds no measurable cost (coordinated flushes produce fewer tiny SSTs, so compaction work is equal-or-lower). Run it with cargo test --release --lib bench_ibd_atomic_flush -- --ignored --nocapture (scale via BENCH_BLOCKS / BENCH_CREATES_PER_BLOCK / BENCH_WRITE_BUFFER_KIB).
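
One plausible way a benchmark could read those scaling knobs is sketched below. The environment variable names come from the command above; everything else is an assumption: `BenchConfig`, `parse_or`, and `env_or` are hypothetical helpers, and the default values are placeholders picked so that blocks times creates-per-block equals the 6M writes quoted, not the benchmark's actual defaults.

```rust
use std::env;

/// Parse an optional raw value, falling back to `default` when it is absent
/// or not a valid number.
fn parse_or(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

/// Read a numeric environment variable with a fallback.
fn env_or(name: &str, default: usize) -> usize {
    parse_or(env::var(name).ok().as_deref(), default)
}

/// Hypothetical container for the three knobs named in the PR.
struct BenchConfig {
    blocks: usize,
    creates_per_block: usize,
    write_buffer_kib: usize,
}

impl BenchConfig {
    /// Defaults here are assumptions for illustration only.
    fn from_env() -> Self {
        BenchConfig {
            blocks: env_or("BENCH_BLOCKS", 6_000),
            creates_per_block: env_or("BENCH_CREATES_PER_BLOCK", 1_000),
            write_buffer_kib: env_or("BENCH_WRITE_BUFFER_KIB", 64 * 1024),
        }
    }

    /// Total UTXO writes the run will perform.
    fn total_writes(&self) -> usize {
        self.blocks * self.creates_per_block
    }

    /// Write buffer size in bytes.
    fn write_buffer_bytes(&self) -> usize {
        self.write_buffer_kib * 1024
    }
}
```

Keeping the parsing in a pure function (`parse_or`) makes the fallback behavior checkable without touching the process environment.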

Testing

Added store::db::test::atomic_flush_keeps_utxos_consistent_on_crash, which reproduces the corruption end to end:

  1. Open with the production DB options (the options builder is extracted into db_options so the test exercises the real atomic_flush setting).
  2. Write a UTXO, then enough block-height markers (WAL-off, tiny write buffer) to trigger a background flush.
  3. Snapshot the on-disk directory the moment the flush lands — this models an ungraceful crash, since only data already flushed to disk survives.
  4. Reopen from the snapshot and assert the UTXO is still there.

With atomic_flush the UTXO is co-flushed and survives; remove set_atomic_flush(true) and the test fails with the UTXO gone — i.e. it directly guards the fix and demonstrates the failure mode. (The flush has to be the automatic memtable-pressure one; a manual single-family flush does not co-flush other families even with atomic_flush.)
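
Step 3, snapshotting the on-disk directory, can be approximated with a plain recursive copy. This is a sketch under assumptions: `snapshot_dir` is a hypothetical helper, and the actual test may take its snapshot differently. A copy of a live directory is only meaningful as a crash model if it is taken at a point where the files of interest are fully written, which is why the test waits for the flush to land first.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively copy every file under `src` into `dst`, creating `dst` if needed.
/// Only what is already on disk is copied; in-memory state is not captured,
/// which is what makes the copy behave like the aftermath of a hard kill.
fn snapshot_dir(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            // Descend into subdirectories.
            snapshot_dir(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}
```

Reopening the database from the copied directory, with the same production options, then shows exactly what a restart after the crash would see.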

  • cargo test --lib green (incl. the new test); cargo clippy -- -D warnings and cargo fmt clean.

… IBD

The WAL is disabled during IBD for write speed. Each block update is a
single WriteBatch spanning multiple column families (utxos, history,
block hash/height). Without the WAL, column families flush their
memtables independently, so a hard crash mid-sync (OOM, SIGKILL, host
loss) can persist some families ahead of others — e.g. the block-height
marker advances while a UTXO that a later block spends was never flushed.

On restart this is fatal and unrecoverable: indexing the next block hits
`every utxo must exist when spent` and the process crash-loops forever.

Set `atomic_flush(true)` so multi-family flushes commit atomically; the
recovered on-disk state is then always a consistent prefix even with the
WAL off. It only changes flush grouping (not the write path), so the cost
at tip is negligible; the work falls on IBD, where the guarantee is needed.

Also flush all column families on graceful shutdown so a clean restart
keeps its sync progress instead of replaying from the last automatic flush.
@DeviaVir DeviaVir self-assigned this Jun 29, 2026
DeviaVir added 2 commits June 29, 2026 20:33
Extracts the DB-wide options into `DBStore::db_options` so the test opens
with the exact production options (including atomic_flush) and asserts the
fix end to end.

The test writes a UTXO and then enough height markers (WAL-off, tiny write
buffer) to trigger a background flush, snapshots the on-disk directory the
moment the flush lands — modelling an ungraceful crash, since only flushed
data survives — and reopens from the snapshot. With atomic_flush the UTXO
is co-flushed and survives; without it the UTXO is lost and the assertion
fails, reproducing the "every utxo must exist when spent" corruption.

An ignored test that replays a synthetic initial sync (WAL off) through the
real update path with atomic_flush off then on, printing wall-clock time and
flush/compaction bytes. Production column-family settings by default; set
BENCH_WRITE_BUFFER_KIB to force frequent flushes and probe the worst case.

Run: cargo test --release --lib bench_ibd_atomic_flush -- --ignored --nocapture

Measured (6M UTXO writes, this machine):
- 64 MiB buffers (production): 5.36s off vs 5.34s on — within noise.
- 256 KiB buffers (worst case, constant flushing): 5.42s off vs 4.89s on,
  with less compaction on. No measurable slowdown from atomic_flush.
@DeviaVir DeviaVir marked this pull request as ready for review June 29, 2026 19:37
@DeviaVir DeviaVir requested a review from RCasatta June 29, 2026 19:38
DeviaVir commented Jul 1, 2026

Superseded by RCasatta#84 (review)

@DeviaVir DeviaVir closed this Jul 1, 2026