Skip to content

Add async WAL precreation#14738

Open
xingbowang wants to merge 3 commits into
facebook:mainfrom
xingbowang:2026_05_11_async_wal_rotation
Open

Add async WAL precreation#14738
xingbowang wants to merge 3 commits into
facebook:mainfrom
xingbowang:2026_05_11_async_wal_rotation

Conversation

@xingbowang
Copy link
Copy Markdown
Contributor

@xingbowang xingbowang commented May 13, 2026

Summary

  • Add experimental immutable DBOptions::async_wal_precreate to reserve and open one future WAL on a background HIGH-priority task, with sanitization that disables the optimization when WAL recycling is configured.
  • Split WAL creation into open/preallocate and start phases so SwitchMemtable() can consume a prepared WAL after writing normal WAL metadata, wait for in-flight precreation, fall back to synchronous creation, and delete an unstarted prepared WAL on start failure.
  • Keep WAL numbering, close, recovery, and read-only open safe for empty future WAL files left by async precreation; error_if_wal_file_exists=true now rejects non-empty WALs while tolerating empty WALs.
  • Add public option plumbing for the C API, options parsing/stringification, random option testing, db_bench, db_stress, and crash-test configuration.
  • Add WAL precreate statistics counters plus Java TickerType/JNI mappings, and update C++, C, and Java read-only-open documentation for the empty-WAL behavior.
  • Add focused WAL/option/C/Java tests for async precreate ready/wait/failure/recovery paths, read-only WAL detection, option sanitization, and API plumbing, plus write-flow docs and unreleased history entries for the new feature and behavior change.

Testing

  • git diff --check
  • Not run locally: Java test execution requires a local Java runtime, and this checkout has no Java runtime / JAVA_HOME.

@meta-cla meta-cla Bot added the CLA Signed label May 13, 2026
Add an immutable async_wal_precreate DB option that reserves and creates the next WAL on a background thread, then consumes it during WAL rotation when available.

Fall back to synchronous WAL creation when precreation is unavailable, wait for pending precreation at rotation, and release unused precreated WAL writers on close without deleting the empty future WAL. Add option plumbing, db_bench/db_stress flags, crash-test coverage, unit tests, docs, and release note.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 13, 2026

✅ clang-tidy: No findings on changed lines

Completed in 564.2s.

@xingbowang xingbowang force-pushed the 2026_05_11_async_wal_rotation branch from a558318 to 63839c6 Compare May 13, 2026 14:33
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 13, 2026

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 8c7ad3b


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 13, 2026

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 8c7ad3b


Summary

Well-designed feature that follows established RocksDB async patterns (mirrors AsyncFileOpenState). The state machine, mutex coordination, error handling with synchronous fallback, and recovery tolerance are all correctly implemented. The main concerns are around the error_if_wal_file_exists behavioral change, BuildDBOptions copy overhead, and a few missing test coverage areas.

High-severity findings (1):

  • [db/db_impl/db_impl_open.cc:813] error_if_wal_file_exists behavioral change: read-only open now silently accepts empty WAL files — undocumented public API semantic change.
Full review (click to expand)

Findings

🔴 HIGH

H1. error_if_wal_file_exists Public API Behavioral Change — db/db_impl/db_impl_open.cc:813
  • Issue: The PR changes error_if_wal_file_exists from rejecting ANY WAL file to rejecting only non-empty WALs. This is a behavioral change to a public API parameter used in DB::OpenForReadOnly(), the C API, and Java bindings. Existing tools that use this flag to detect unclean shutdowns (where any WAL indicates unprocessed data) will silently succeed when empty WAL files from async precreation are present.
  • Root cause: The change is necessary for async precreation to work with read-only opens, but it changes the semantic contract of an existing API parameter.
  • Suggested fix: Document this as a deliberate behavioral change in the release notes. Update the C++ header doc at OpenForReadOnly to clarify that empty WAL files are now tolerated. The Java doc update in the diff is good but the C++ header should match. Alternatively, a new flag could be introduced, but this may be over-engineering.

🟡 MEDIUM

M1. BuildDBOptions Copy Overhead in Warm Path — db/db_impl/db_impl_open.cc:2499
  • Issue: BuildDBOptions() constructs a full DBOptions with ~160 fields including strings and shared_ptrs. Called in both MaybeScheduleAsyncWALPrecreate() and SwitchMemtable() sync fallback. The async path adds one extra copy per SwitchMemtable beyond what existed before.
  • Suggested fix: Create a lightweight WalCreationParams struct capturing only the ~10 fields actually used by CreateWALWriter.
M2. Missing Test: kAbsoluteConsistency + Empty Future WAL — db/db_wal_test.cc
  • Issue: No test covers the strictest WAL recovery mode with empty trailing WALs.
  • Suggested fix: Add a variant of RecoveryToleratesEmptyFutureWal with wal_recovery_mode = kAbsoluteConsistency.
M3. Missing Test: two_write_queues + Async WAL Precreation — db/db_wal_test.cc
  • Issue: No test coverage for WritePreparedTxnDB (which uses two_write_queues_=true) with async WAL precreation.
  • Suggested fix: Add at least one test combining these features.

🟢 LOW / NIT

L1. UnpublishedWAL Custom Move Operations — db/db_impl/db_impl.h:2835
  • Issue: Custom move constructor/assignment could be = default since the struct only contains uint64_t + unique_ptr.
L2. Release Note Missing Behavioral Change — unreleased_history/new_features/async_wal_precreate.md
  • Issue: Does not mention the error_if_wal_file_exists behavioral change.

Cross-Component Analysis

Context Safe? Notes
WritePreparedTxnDB YES SwitchMemtable path unchanged; WAL creation independent of txn state
ReadOnly DB See H1 error_if_wal_file_exists change is the only interaction
WAL recycling N/A Correctly sanitized to false when recycle > 0
2PC YES WAL precreation orthogonal to 2PC lifecycle
User-defined timestamps YES UDT doesn't affect WAL file creation

Disproven claims from initial agents: TOCTOU race (mutex prevents), double file number allocation (state check prevents), file leak on StartWALFile failure (cleanup code exists), ticker ABI breakage (documented as unstable), shutdown race (wait loop + cleanup handles correctly).

Positive Observations

  • Clean CreateWAL decomposition into async-safe I/O + sync metadata
  • Best-effort fallback ensures correctness is never compromised
  • Good test coverage: 7 feature tests + deterministic sync points
  • Consistent with existing AsyncFileOpenState pattern
  • Comprehensive option plumbing across C/Java/db_bench/db_stress/crash test

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 13, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D105020559.

@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 53186ec


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 53186ec


Summary

Well-structured PR that adds async WAL precreation with clean separation of concerns (CreateWALWriter/StartWALFile), proper state machine, synchronous fallback, and comprehensive tests. The main concerns are: (1) a behavioral change to error_if_wal_file_exists that affects all ReadOnly users regardless of whether they use async WAL precreation, (2) new statistics tickers inserted mid-enum which shifts subsequent values, and (3) a BuildDBOptions() call on the hot path that can be moved into a conditional block.

High-severity findings (2):

  • [statistics.h:226-234] New WAL_PRECREATE_* tickers inserted mid-enum between WAL_FILE_BYTES and WRITE_DONE_BY_SELF, shifting all subsequent ticker numeric values and breaking serialization compatibility.
  • [db_impl_open.cc:810-828] error_if_wal_file_exists=true behavioral change affects ALL ReadOnly users—empty WALs are now silently tolerated even when async_wal_precreate is disabled.
Full review (click to expand)

Findings

🔴 HIGH

H1. Statistics enum insertion breaks ticker value stability — include/rocksdb/statistics.h:226
  • Issue: Five new tickers (WAL_PRECREATE_HIT through WAL_PRECREATE_FAILED) are inserted between WAL_FILE_BYTES (0x54) and WRITE_DONE_BY_SELF (0x55 in the old enum). This shifts all subsequent ticker numeric values. Any downstream system that serializes or persists ticker enum values (monitoring dashboards, stats collectors, JNI without remapping) will silently report incorrect metrics.
  • Root cause: Enum members are inserted mid-sequence instead of appended near TICKER_ENUM_MAX.
  • Suggested fix: Move the five new tickers to the end of the enum, just before TICKER_ENUM_MAX. This preserves all existing ticker values. Update the Java JNI mappings accordingly.
H2. error_if_wal_file_exists semantic change is unconditional — db/db_impl/db_impl_open.cc:810
  • Issue: The error_if_wal_file_exists code path is changed from "reject if any WAL exists" to "reject only if a non-empty WAL exists." This behavioral change applies to ALL ReadOnly opens, even when async_wal_precreate=false. Users who rely on this flag to detect any WAL presence (e.g., verifying a clean checkpoint before serving reads) will no longer detect empty WAL files.
  • Root cause: The change is motivated by async WAL precreation leaving empty files, but the tolerance is applied globally.
  • Suggested fix: Either (a) only tolerate empty WALs when async_wal_precreate is known to be enabled in the DB's persisted options, or (b) clearly document this as a deliberate, universal behavioral change in the release notes and update the error_if_wal_file_exists_empty_wals.md to emphasize it affects all users. Option (b) may be acceptable if the RocksDB team considers empty WALs harmless in all contexts.

🟡 MEDIUM

M1. BuildDBOptions() called unconditionally on SwitchMemtable hot path — db/db_impl/db_impl_write.cc:3167
  • Issue: const DBOptions db_options_snapshot = BuildDBOptions(immutable_db_options_, mutable_db_options_) is executed on every SwitchMemtable call, even when creating_new_log is false (i.e., the WAL is still empty and no new file is needed). BuildDBOptions copies ~100+ fields including strings, vectors, and shared_ptrs.
  • Root cause: The variable is declared outside the if (creating_new_log) block.
  • Suggested fix: Move the BuildDBOptions call and db_options_snapshot declaration inside the if (creating_new_log) block (around line 3183).
M2. Heap-allocated AsyncWALPrecreateContext copies full DBOptionsdb/db_impl/db_impl_open.cc:2488
  • Issue: Each MaybeScheduleAsyncWALPrecreate call allocates an AsyncWALPrecreateContext on the heap and copies the full DBOptions into it. DBOptions is a large struct. This happens while holding mutex_.
  • Root cause: The background task needs DB options to create the WAL file, so a snapshot is taken at schedule time.
  • Suggested fix: Consider storing only the fields actually used by CreateWALWriter (file options, WAL dir, listeners, clock, compression, etc.) in a lighter-weight struct, or use an ImmutableDBOptions reference (which is already immutable and outlives the task).
M3. Precreated WAL file not protected from FindObsoleteFiles full scan — db/db_impl/db_impl_files.cc:286-318
  • Issue: During a full directory scan in FindObsoleteFiles, all files are enumerated as candidates. The precreated WAL file exists on disk but is NOT in alive_wal_files_ or logs_. The WAL deletion logic (line 392) only deletes WALs with numbers BELOW min_log_number, so a precreated WAL with a higher number should be safe. However, PurgeObsoleteFiles processes full_scan_candidate_files and may delete WAL files it considers unknown.
  • Root cause: The precreated WAL is in a liminal state—on disk but not tracked in any live-file list.
  • Suggested fix: Verify that PurgeObsoleteFiles does NOT delete WAL files with numbers >= min_log_number. If it does, the precreated WAL number should be registered in a protected set. If current behavior already protects it (WAL numbers >= min_log are never purged), add a comment documenting this invariant.
M4. CloseHelper drops mutex_ during async WAL writer destruction — db/db_impl/db_impl.cc:802-808
  • Issue: After the wait loop completes and the precreated WAL is moved out, CloseHelper unlocks mutex_ to destroy the writer (which may flush/close the file), then relocks. This pattern is common in RocksDB but creates a window where the DB state could be observed in an intermediate state. Since CloseHelper has already waited for all background work, this window is likely safe, but it warrants a comment.
  • Root cause: Writer destruction involves file I/O that shouldn't be done under mutex_.
  • Suggested fix: Add a comment explaining why the unlock/destroy/lock window is safe (all bg work has already completed, and the DB is shutting down). The existing code has a comment but it could be more explicit.
M5. WaitForAsyncWALPrecreate called unconditionally — db/db_impl/db_impl_write.cc:3139
  • Issue: WaitForAsyncWALPrecreate() is called on every SwitchMemtable when creating_new_log is true, even when async_wal_precreate is disabled. The function checks AsyncWALPrecreateEnabled() and returns early, but this adds a function call to the hot path for all users.
  • Root cause: Simplicity of the code structure.
  • Suggested fix: Guard the call with if (AsyncWALPrecreateEnabled()) at the call site, or mark the function inline/ROCKSDB_LIKELY for the early return path. This is a minor optimization.

🟢 LOW / NIT

L1. UnpublishedWAL move assignment calls Reset() on source — db/db_impl/db_impl.h:2839
  • Issue: The move assignment operator calls other.Reset() which sets log_number = 0 and calls writer.reset(). The writer.reset() is redundant after std::move(other.writer) which already leaves the source in a valid empty state.
  • Suggested fix: Remove other.Reset() and just set other.log_number = 0 since std::move already handles the unique_ptr.
L2. Missing test for two_write_queues mode — db/db_wal_test.cc
  • Issue: All async WAL tests use default (single write queue) configuration. The SwitchMemtable code has special two_write_queues_ paths that interact with wal_write_mutex_ differently.
  • Suggested fix: Add at least one test with Options::two_write_queues = true and Options::async_wal_precreate = true.
L3. No test for shutdown racing with async WAL creation — db/db_wal_test.cc
  • Issue: No test explicitly exercises the CloseHelper path where the DB is closed while async_wal_precreate_state_ == kScheduled.
  • Suggested fix: Add a test using sync points to hold BGWorkAsyncWALPrecreate:Start while calling Close(), verifying clean shutdown.
L4. Java TickerType byte range approaching limit — java/src/main/java/org/rocksdb/TickerType.java
  • Issue: New tickers use -0x6C through -0x70, approaching the signed byte minimum (-0x80). Only 16 more negative values remain.
  • Suggested fix: Plan for a mapping scheme change (e.g., short or int) before the byte range is exhausted.
L5. Release note for behavioral change could be more prominent — unreleased_history/behavior_changes/error_if_wal_file_exists_empty_wals.md
  • Issue: The release note is brief and doesn't emphasize that this affects all users of error_if_wal_file_exists, not just those enabling async WAL precreation.
  • Suggested fix: Expand the note to mention the universal scope of the change and cross-reference the async WAL precreation feature.

Cross-Component Analysis

Context Code executes? Assumptions hold? Action needed?
WritePreparedTxnDB YES (inherits SwitchMemtable) YES (precreated WALs start empty) Safe
ReadOnly DB NO (no writes) N/A (recovery change is separate) H2 above
Secondary Instance PARTIAL (scans WAL dirs) MOSTLY (empty WALs treated as EOF) Minor: empty WAL in GetSortedWals
User-defined timestamps YES YES (UDT is above WAL creation layer) Safe
MemPurge YES (calls SwitchMemtable) YES Safe
FIFO/Universal compaction YES (same flush path) YES Safe
allow_2pc=true YES YES (precreated WALs have no prepared txns) Safe
WAL recycling NO (sanitized to false) YES Safe

File number safety after crash: MaybeUpdateNextFileNumber() in db_impl_files.cc:1187 scans ALL database paths during recovery and advances next_file_number_ past any existing file, including orphaned precreated WALs. This protects against file number reuse.

WAL preallocation vs GetFileSize: SetPreallocationBlockSize() uses fallocate() on Linux which does NOT change the file's st_size (only allocated extents). GetFileSize() returns st_size, so a preallocated-but-unwritten file correctly reports 0 bytes. This makes the error_if_wal_file_exists size check reliable on Linux. On Windows, SetEndOfFile behavior may differ; this should be verified.

FindObsoleteFiles safety: WAL deletion in FindObsoleteFiles only removes WALs from alive_wal_files_ with numbers BELOW min_log_number. A precreated WAL has a number HIGHER than any active WAL and is therefore safe from this path. The full_scan_candidate_files path also uses min_log_number to determine WAL obsolescence.

Positive Observations

  • Clean two-phase split (CreateWALWriter/StartWALFile) that separates I/O from metadata, following single-responsibility principle.
  • State machine pattern (kNotScheduled/kScheduled/kReady) is consistent with existing AsyncFileOpenState.
  • Best-effort design with synchronous fallback is production-friendly.
  • Comprehensive sync-point-based tests covering ready, wait, failure, and recovery paths.
  • Statistics counters (hit/miss/waited/wait_micros/failed) provide excellent production observability.
  • Proper sanitization against WAL recycling.
  • Recovery robustness via MaybeUpdateNextFileNumber filesystem scan.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

meta-codesync Bot pushed a commit that referenced this pull request May 15, 2026
Summary:
- Add experimental immutable `DBOptions::async_wal_precreate` to reserve and open one future WAL on a background HIGH-priority task, with sanitization that disables the optimization when WAL recycling is configured.
- Split WAL creation into open/preallocate and start phases so `SwitchMemtable()` can consume a prepared WAL after writing normal WAL metadata, wait for in-flight precreation, fall back to synchronous creation, and delete an unstarted prepared WAL on start failure.
- Keep WAL numbering, close, recovery, and read-only open safe for empty future WAL files left by async precreation; `error_if_wal_file_exists=true` now rejects non-empty WALs while tolerating empty WALs.
- Add public option plumbing for the C API, options parsing/stringification, random option testing, `db_bench`, `db_stress`, and crash-test configuration.
- Add WAL precreate statistics counters plus Java `TickerType`/JNI mappings, and update C++, C, and Java read-only-open documentation for the empty-WAL behavior.
- Add focused WAL/option/C/Java tests for async precreate ready/wait/failure/recovery paths, read-only WAL detection, option sanitization, and API plumbing, plus write-flow docs and unreleased history entries for the new feature and behavior change.

PR #14738

Reviewed By: pdillinger

Differential Revision: D105020559

fbshipit-source-id: 5059b424702e021abb8de65ceeb6d3b975280ffc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant