Introduces a new DBOption `use_direct_reads_for_compaction` (default
false) that lets users route compaction (and flush) background reads
through O_DIRECT while keeping user reads on the buffered/page-cache
path. Sequential compaction reads otherwise pollute the OS page cache
with read-once data that evicts the hot user-read working set;
bypassing the cache for those reads protects user-read tail latency on
write-heavy workloads without forcing users onto the global
`use_direct_reads` path (which also takes user reads off the page
cache).
A naive implementation that only flipped the FileOptions returned by
`OptimizeForCompactionTableRead` does not actually trigger the
OS-level O_DIRECT open, because the TableCache (and
FileMetaData::pinned_reader) already holds long-lived buffered
handles opened at flush time or at DB::Open via LoadTableHandlers.
Compaction would silently reuse those cached buffered handles and the
kernel would never see the O_DIRECT flag.
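The reuse problem described above can be illustrated with a toy model.
This is only a sketch: `ToyHandle`, `ToyFileOptions` and `ToyTableCache`
are hypothetical stand-ins, not the RocksDB classes, and the real
TableCache lookup is considerably more involved.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Toy stand-ins, not RocksDB types: a handle only remembers how it was opened.
struct ToyHandle {
  bool opened_direct;
};

struct ToyFileOptions {
  bool use_direct_reads = false;
};

// Toy cache keyed by file number. On a hit it returns the stored handle and
// never looks at the caller's options.
class ToyTableCache {
 public:
  std::shared_ptr<ToyHandle> FindTable(int file_number,
                                       const ToyFileOptions& opts) {
    auto it = cache_.find(file_number);
    if (it != cache_.end()) {
      return it->second;  // cache hit: opts.use_direct_reads is ignored
    }
    auto handle =
        std::make_shared<ToyHandle>(ToyHandle{opts.use_direct_reads});
    cache_.emplace(file_number, handle);
    return handle;
  }

 private:
  std::map<int, std::shared_ptr<ToyHandle>> cache_;
};

// Returns whether a compaction read that asks for direct I/O actually gets a
// direct handle after the file was first opened buffered (e.g. at flush time).
bool CompactionGetsDirectHandle() {
  ToyTableCache cache;
  cache.FindTable(7, ToyFileOptions{});  // buffered open populates the cache
  ToyFileOptions compaction_opts;
  compaction_opts.use_direct_reads = true;
  return cache.FindTable(7, compaction_opts)->opened_direct;
}
```

In this model `CompactionGetsDirectHandle()` returns false: the request
for direct reads is dropped on the cache hit, which mirrors why flipping
the FileOptions alone is not enough.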
The fix opens ephemeral O_DIRECT handles for the lifetime of the
compaction scan, separate from the cache:
* TableCache::FindTable / NewIterator learn a `bypass_cache_for_scan`
mode. When set, the pinned-reader fast path and the shared cache
are skipped, GetTableReader is called directly with the caller's
FileOptions, and ownership of the freshly opened TableReader is
handed back to the caller. The iterator takes ownership via
RegisterCleanup and frees the reader on destruction.
* VersionSet::MakeInputIterator and LevelIterator plumb the flag
through both the L0 and L1+ compaction-input paths.
* CompactionJob::ProcessKeyValueCompaction enables the flag exactly
when `use_direct_reads_for_compaction` is on, the global
`use_direct_reads` is off, and `OptimizeForCompactionTableRead`
actually produced `use_direct_reads=true` in the
compaction-read FileOptions.
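A minimal sketch of two ideas from the list above, under stated
assumptions: the three-part enabling condition, and an iterator that
owns an ephemeral reader through a registered cleanup. All `Sketch*`
types and `NewBypassIterator` are hypothetical; they only mirror the
shape of the change and are not the code in this commit.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical mirrors of the relevant option fields; not the RocksDB structs.
struct SketchDBOptions {
  bool use_direct_reads = false;
  bool use_direct_reads_for_compaction = false;
};
struct SketchFileOptions {
  bool use_direct_reads = false;
};

// Bypass only when the new option is on, the global option is off, and the
// compaction-read FileOptions really ask for direct reads.
bool ShouldBypassCacheForScan(const SketchDBOptions& db,
                              const SketchFileOptions& compaction_read_opts) {
  return db.use_direct_reads_for_compaction && !db.use_direct_reads &&
         compaction_read_opts.use_direct_reads;
}

// Toy reader that counts live instances so ownership can be observed.
struct SketchReader {
  explicit SketchReader(int* live) : live_(live) { ++*live_; }
  ~SketchReader() { --*live_; }
  int* live_;
};

// Toy iterator that runs registered cleanups when it is destroyed.
class SketchIterator {
 public:
  SketchIterator() = default;
  SketchIterator(const SketchIterator&) = delete;
  SketchIterator& operator=(const SketchIterator&) = delete;
  ~SketchIterator() {
    for (auto& fn : cleanups_) fn();
  }
  void RegisterCleanup(std::function<void()> fn) {
    cleanups_.push_back(std::move(fn));
  }

 private:
  std::vector<std::function<void()>> cleanups_;
};

// Opens a fresh reader outside any cache and hands ownership to the iterator.
std::unique_ptr<SketchIterator> NewBypassIterator(int* live_readers) {
  std::unique_ptr<SketchIterator> iter(new SketchIterator());
  SketchReader* reader = new SketchReader(live_readers);
  iter->RegisterCleanup([reader] { delete reader; });
  return iter;
}
```

The point of the cleanup hook in this sketch is that the ephemeral
reader lives exactly as long as the scan and is never inserted into a
shared cache.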
An end-to-end test in db_compaction_test.cc uses the existing
`NewRandomAccessFile:O_DIRECT` sync point in env/fs_posix.cc to assert
that the kernel-level open really happens for compaction inputs when
the flag is set, and never fires when the flag is off. The test is
scoped to platforms that use the O_DIRECT path.
A small unrelated convenience also lands here: a new db_bench flag
`--bgwriter_num` that lets the writer thread in readwhilewriting use a
wider keyspace than the readers. This is what made it possible to
benchmark the new option realistically: the readers see a small hot
subset (cache-resident), while the writer spreads puts across the full
DB, which drives continuous compaction.
The new option follows the existing add_option.md checklist: it is
registered in ImmutableDBOptions for serialization, surfaced through
the C API, exposed in db_bench / db_stress / db_crashtest.py,
randomized in RandomInitDBOptions, validated against allow_mmap_reads
at Open time, and documented in unreleased_history. Java JNI is left
for a follow-up.
Benchmark results
=================
Setup: Ubuntu 24.04 (kernel 7.0.5 OrbStack Linux VM on Apple Silicon),
14 vCPUs, virtio-blk disk. MGLRU disabled (echo 0 >
/sys/kernel/mm/lru_gen/enabled). 14 GB DB (3.5M keys * 4 KB values),
no compression. Each measurement run pinned to a 1 GB cgroup
via `systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0`, so
DB-to-cache ratio is ~14x. Page cache dropped between configs.
Workload: readwhilewriting for 180 s, 4 reader threads on a hot
2,000-key subset (~8 MB, ~3% of cache) + 1 writer thread spreading
overwrites across the full 3.5M-key keyspace
(via `--bgwriter_num=3500000`), throttled at 100 MB/s. Compaction
ran at ~500 MB/s read/write during the buffered run, ~400 MB/s with
direct compaction.
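In the table, "buffered" is the existing default;
`direct_compaction_read_only` sets only
`use_direct_reads_for_compaction = true`, and
`direct_compaction_read_write` additionally sets
`use_direct_io_for_flush_and_compaction = true`.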
| Config | Throughput | Read P50 | Read P99 | Read P99.9 | Read P99.99 |
|-------------------------------------------|-----------------|---------------|---------------|----------------|----------------|
| buffered (default) | 406 K ops/s | 7.34 us | 79.11 us | 533.14 us | 1647.79 us |
| direct_compaction_read_write | **464 K ops/s** | **6.37 us** | **71.64 us** | **468.28 us** | **1363.91 us** |
| | (+14%) | (-13%) | (-9%) | (-12%) | (-17%) |
| direct_compaction_read_only | 421 K ops/s | 6.99 us | 88.95 us | 504.32 us | 1456.75 us |
| | (+4%) | (-5%) | (+12%) | (-5%) | (-12%) |
| use_direct_reads = true (existing global) | 442 K ops/s | 7.37 us | 50.82 us | 472.23 us | 1626.77 us |
| | (+9%) | (0%) | (-36%) | (-11%) | (-1%) |
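The parenthesized deltas are relative changes against the buffered row.
As a sketch of that arithmetic (the helper name `PercentDelta` is an
assumption made for illustration, not something from db_bench):

```cpp
#include <cassert>
#include <cmath>

// Relative change of `value` against `baseline`, in percent, rounded to the
// nearest integer. Negative means lower than the baseline.
long PercentDelta(double value, double baseline) {
  return std::lround((value - baseline) / baseline * 100.0);
}
```

For example, `PercentDelta(464, 406)` gives 14 for the throughput of the
direct read + write configuration, and `PercentDelta(50.82, 79.11)` gives
-36 for the P99 of the global `use_direct_reads` run.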
The recommended production configuration is
`use_direct_reads_for_compaction = true` together with
`use_direct_io_for_flush_and_compaction = true` ("direct reads + writes
for compaction"). It beats the buffered default on every metric
simultaneously: throughput up 14%, every read percentile from P50 to
P99.99 down 9 to 17%. The existing global `use_direct_reads = true`
flag gives the best P99 (-36%), but it trails the recommended
configuration on throughput (442 K vs 464 K ops/s) and barely moves
P99.99 (-1%); outside P99, the new compaction-only path is the better
choice for the write-heavy workloads it is designed for.
Higher DB-to-cache ratios (the Cassandra blog at
https://lightfoot.dev/direct-i-o-for-cassandra-compaction-cutting-p99-read-latency-by-5x/
reports ~5x P99 improvement at a 43x ratio) should widen the gap
further; the 14x ratio used above is what fit in a single laptop's
disk budget.
Repro recipe
============
Setup:
- Install OrbStack on macOS or use any Linux host
- On macOS: orb create -t ubuntu rocksdb-bench
- Inside the Linux machine (from the RocksDB source tree):
    apt-get install -y build-essential clang cmake git pkg-config \
      libgflags-dev libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev \
      libzstd-dev rsync
    mkdir -p build && cd build
    cmake -DCMAKE_BUILD_TYPE=Release -DPORTABLE=1 -DWITH_GFLAGS=1 \
      -DWITH_TESTS=0 .. && make -j db_bench
Build the source DB (once, unrestricted memory):
    ./db_bench --benchmarks=fillrandom,compact,waitforcompaction,stats \
      --db=/path/to/source_db --num=3500000 --key_size=16 \
      --value_size=4096 --write_buffer_size=16777216 \
      --target_file_size_base=16777216 --max_background_jobs=4 \
      --compression_type=none --cache_size=4194304 \
      --max_bytes_for_level_base=67108864 --disable_wal=1 --sync=0
Per-config measurement (copy source_db -> scratch_db first, then
drop_caches, then run under cgroup):
    sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 \
      ./db_bench --use_existing_db=1 \
      --benchmarks=readwhilewriting,stats --db=/path/to/scratch_db \
      --threads=5 --duration=180 --statistics=true --histogram=1 \
      --num=2000 --bgwriter_num=3500000 \
      --key_size=16 --value_size=4096 \
      --write_buffer_size=16777216 --target_file_size_base=16777216 \
      --max_background_jobs=4 --compression_type=none \
      --cache_size=4194304 --open_files=200 \
      --skip_stats_update_on_db_open=true \
      --max_bytes_for_level_base=67108864 \
      --benchmark_write_rate_limit=104857600 \
      --rate_limiter_bytes_per_sec=0 \
      --use_direct_reads={true|false} \
      --use_direct_reads_for_compaction={true|false} \
      --use_direct_io_for_flush_and_compaction={true|false}
Disable MGLRU first so the kernel uses the classic active/inactive LRU:
    echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled