Skip to content

[rocksdb] Sync recovery SST directory before reused MANIFEST append#14780

Open
xingbowang wants to merge 1 commit into
facebook:mainfrom
xingbowang:2026_05_23_oncall_T272584339
Open

[rocksdb] Sync recovery SST directory before reused MANIFEST append#14780
xingbowang wants to merge 1 commit into
facebook:mainfrom
xingbowang:2026_05_23_oncall_T272584339

Conversation

@xingbowang
Copy link
Copy Markdown
Contributor

@xingbowang xingbowang commented May 23, 2026

Summary

  • When reuse_manifest_on_open reuses the current MANIFEST, DB::Open recovery can flush WAL data into a new L0 SST and append the corresponding VersionEdit to that already-current MANIFEST.
  • If open later fails and the process crashes, the MANIFEST edit can be durable while the recovered SST directory entry is not, leaving the DB pointing at a missing SST.
  • Fsync the recovered SST's data directory before adding the file to the recovery edit when appending to a reused MANIFEST.
  • Add a regression test that injects failure after MANIFEST sync, simulates crash cleanup of files created after the last directory sync, and verifies the recovered key remains readable.

Task

  • T272584339

Test Plan

CI

Summary: When `reuse_manifest_on_open` is active, `DB::Open` can recover WAL data into a new L0 SST and append the corresponding VersionEdit to the already-current MANIFEST. The MANIFEST sync makes that edit durable before `SetCurrentFile` has a chance to fsync the DB directory, since the reused-MANIFEST path does not create a new CURRENT file.

If Open later fails and the process crashes, the SST directory entry can be lost while the MANIFEST still references it. The stress test reproduced this by deleting files created after the last directory sync, then reopening into a MANIFEST that referenced missing `001674.sst`.

Fix by fsyncing the recovery SST data directory before adding the file to the recovery edit when the existing MANIFEST writer is being reused.

Test Plan: make -j128 db_basic_test

timeout 60 ./db_basic_test --gtest_filter=DBBasicTest.ReuseManifestOnOpenSyncsRecoverySstDirBeforeManifestAppend --gtest_repeat=5

timeout 60 ./db_basic_test --gtest_filter='DBBasicTest.ReuseManifestOnOpen*'

Reverted the production fix and confirmed the new test fails with a missing SST referenced by MANIFEST, then reapplied the fix.

make clean

COERCE_CONTEXT_SWITCH=1 make -j128 db_basic_test

timeout 60 ./db_basic_test --gtest_filter=DBBasicTest.ReuseManifestOnOpenSyncsRecoverySstDirBeforeManifestAppend --gtest_repeat=5

make check-sources

git diff --check

Reviewers:

Subscribers:

Tasks: 272584339

Tags:
@meta-cla meta-cla Bot added the CLA Signed label May 23, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 23, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106201774.

@github-actions
Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 329.1s.

@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 1c0d2f5


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 1c0d2f5


Summary

Correct and well-targeted fix for a real crash-safety gap in the reuse_manifest_on_open recovery path. The approach follows established RocksDB patterns (FlushJob, CompactionJob). One medium-severity defensive coding issue found.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Missing null check on GetDataDir() return value -- db/db_impl/db_impl_open.cc:2240
  • Issue: cfd->GetDataDir(meta.fd.GetPathId()) can return nullptr when data_dirs_ is empty (see column_family.cc:1752-1754). The code dereferences the result without a null check, which would crash.
  • Root cause: GetDataDir() returns nullptr when data_dirs_ is empty. In normal recovery flow, AddDirectories() (line 719) runs before RecoverLogFiles() (line 851), so data_dirs_ is populated. However, both FlushJob (flush_job.cc:1104) and CompactionJob (compaction_job.cc:864) defensively null-check before calling FsyncWithDirOptions, establishing a clear codebase convention. Edge cases (partial AddDirectories failure, unusual configurations) could theoretically leave data_dirs_ empty.
  • Suggested fix:
    if (s.ok() && has_output) {
      if (immutable_db_options_.reuse_manifest_on_open &&
          versions_->descriptor_log_ != nullptr) {
        FSDirectory* data_dir = cfd->GetDataDir(meta.fd.GetPathId());
        if (data_dir != nullptr) {
          s = data_dir->FsyncWithDirOptions(
                  IOOptions(), nullptr,
                  DirFsyncOptions(
                      DirFsyncOptions::FsyncReason::kNewFileSynced));
        }
      }
    }

🟢 LOW / NIT

L1. Blob file directory sync not addressed -- db/db_impl/db_impl_open.cc:2248
  • Issue: BuildTable can produce blob file additions during recovery (line 2074, 2170, 2248-2250). The PR syncs the SST directory but not the blob file directory. CompactionJob handles this separately (compaction_job.cc:870-873).
  • Root cause: In practice, during recovery blob files are created in cf_paths.front().path (blob_file_builder.cc:191), which is typically the same directory as the SST. The risk is low because blob files share the SST directory in the common case, and the same directory fsync covers both. This would only matter if blob files used a different directory, which doesn't happen via BuildTable during recovery.
  • Suggested fix: Consider adding a comment noting that blob files share the SST directory during recovery, or adding a separate blob dir fsync if blob_output_directory != data_dir (matching compaction_job pattern) for completeness.
L2. Test covers only single-CF, single-SST recovery -- db/db_basic_test.cc:770
  • Issue: The regression test exercises the fix for a single column family with a single recovered key. Multi-CF recovery and multiple recovered SSTs are not tested.
  • Suggested fix: Consider adding a variant with multiple column families to ensure directory sync works correctly across CFs.

Cross-Component Analysis

Context Affected? Safe? Notes
WritePreparedTxnDB Yes (same recovery) Yes Same code path
ReadOnly DB No N/A No recovery flush in read-only mode
best_efforts_recovery No N/A Mutually exclusive with reuse_manifest_on_open
User-defined timestamps Yes Yes Orthogonal to dir sync
btrfs filesystem Yes Yes btrfs skips dir fsync for kNewFileSynced intentionally -- btrfs guarantees new file visibility after file fsync without dir fsync. This is a documented RocksDB optimization (io_posix.cc:1959-1963), not a bug.
avoid_flush_during_recovery No N/A Recovery SSTs not created
Multiple db_paths Yes Yes GetPathId() selects correct dir

Positive Observations

  • Correct fix location: The fsync is placed after SST creation and before edit->AddFile(), matching the established pattern in FlushJob and CompactionJob.
  • Minimal and focused scope: Only the necessary code path is modified (reused MANIFEST during recovery).
  • Proper condition: Checking both reuse_manifest_on_open and descriptor_log_ != nullptr correctly identifies exactly the case needing the fix.
  • Good test design: The regression test uses FaultInjectionTestFS and SyncPoint injection (not sleep), making it deterministic and non-flaky. The test correctly simulates the crash scenario using DeleteFilesCreatedAfterLastDirSync.
  • Cold path: This runs only during DB::Open recovery, so there is zero performance impact on steady-state operations.
  • Error propagation is correct: A failed fsync properly prevents edit->AddFile() from executing (the subsequent if (s.ok() && has_output) block is gated on s).

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant