CASSANDRA-21134: Direct I/O for background SSTable writes #4815
Closed
samueldlightfoot wants to merge 1 commit into
3349322 to 005f4e1
a49806c to ca8ef09
aweisberg requested changes on May 29, 2026
aweisberg (Contributor) left a comment:
Mostly just nits, but there are at least one or two things worth doing. I also shared a pastebin with you containing an analysis of the DirectCompressedSequentialWriterTest code and its boundary coverage that is worth considering.
8af930a to 9ab9c8d
a5c418e to aee37bc
aweisberg approved these changes on Jun 11, 2026
aweisberg (Contributor) left a comment:
Only minor nits left. TY!
netudima reviewed on Jun 18, 2026
netudima reviewed on Jun 18, 2026
netudima reviewed on Jun 19, 2026
netudima reviewed on Jun 19, 2026
netudima reviewed on Jun 20, 2026
netudima reviewed on Jun 20, 2026
netudima reviewed on Jun 20, 2026
netudima reviewed on Jun 21, 2026
    throw new IllegalStateException("Filesystem block size must be a power of two for Direct IO. " +
                                    "Block size: " + blockSize);

    int configuredSize = DatabaseDescriptor.getDirectWriteBufferSize().toBytes();
Contributor
It would be nice to add validation for the maximum value of this configuration parameter too; overly large values can cause int overflow when the buffer size is calculated.
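A minimal sketch of the kind of check this comment asks for, assuming a hypothetical helper class `DirectBufferSizing` and a hypothetical 1 GiB cap (`MAX_BUFFER_BYTES`); neither name nor the limit comes from the patch. Rounding up to the block size in `int` arithmetic can wrap for values near `Integer.MAX_VALUE`, so the sketch validates the range first and does the rounding in `long`.

```java
// Hypothetical helper, not taken from the patch: the class name, method name
// and the 1 GiB cap are assumptions made for illustration.
final class DirectBufferSizing
{
    // Assumed upper bound for the configured buffer size.
    static final long MAX_BUFFER_BYTES = 1L << 30;

    /** Validates the configured size and rounds it up to a multiple of blockSize. */
    static int alignedBufferSize(long configuredBytes, int blockSize)
    {
        if (blockSize <= 0 || Integer.bitCount(blockSize) != 1)
            throw new IllegalStateException("Filesystem block size must be a power of two for Direct IO. " +
                                            "Block size: " + blockSize);
        if (configuredBytes <= 0 || configuredBytes > MAX_BUFFER_BYTES)
            throw new IllegalArgumentException("direct_write_buffer_size must be in (0, " + MAX_BUFFER_BYTES +
                                               "] bytes, got " + configuredBytes);

        // Round up in long arithmetic so the intermediate sum cannot wrap.
        long aligned = (configuredBytes + blockSize - 1L) & ~(blockSize - 1L);

        // Safety net: fail loudly instead of silently truncating to int.
        return Math.toIntExact(aligned);
    }
}
```

With a 4096-byte block, a configured size of 4097 bytes rounds up to 8192, while a value above the assumed cap is rejected before any buffer arithmetic runs.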
netudima reviewed on Jun 21, 2026
netudima reviewed on Jun 21, 2026
netudima reviewed on Jun 21, 2026
netudima approved these changes on Jun 22, 2026
Adds an opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for data that is unlikely to be re-read soon after being written. Memtable flushes remain buffered.

Enabled via two new YAML knobs:
- background_write_disk_access_mode: standard (default) | direct
- direct_write_buffer_size: 1MiB (default; aligned up to FS block size, auto-grown to fit a worst-case compressed chunk)

The path is gated by config, table compression being enabled, and an OperationType allowlist in DataComponent. The allowlist is exhaustive over OperationType: any new value left unclassified fails static initialization.

Operations on the DIO path: COMPACTION, MAJOR_COMPACTION, TOMBSTONE_COMPACTION, ANTICOMPACTION, GARBAGE_COLLECT, CLEANUP, UPGRADE_SSTABLES, WRITE, STREAM (chunked receiver only), RELOCATE, UNKNOWN (offline sstablesplit).

Operations off the DIO path:
- FLUSH (policy: just-flushed data is hot, keep in page cache)
- SCRUB (correctness: tryAppend needs mark/resetAndTruncate)
- Zero-Copy Streaming (bypasses DataComponent.buildWriter)
- Uncompressed writers (only CompressedSequentialWriter has a DIO subclass in this change)

StartupChecks fails fast if 'direct' is requested on a platform/FS that does not support O_DIRECT.

patch by Sam Lightfoot; reviewed by Ariel Weisberg, Dmitry Konstantinov for CASSANDRA-21134
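The three gates and the exhaustive allowlist described in the commit message could be sketched roughly as follows. This is an illustration under assumptions, not the patch's code: the `OperationType` enum is cut down to four values, the class `DirectWriteSelection` and the methods `checkExhaustive`, `supportFor` and `useDirectWrites` are hypothetical, and mapping non-writing operations to `NOT_APPLICABLE` is a guess at how that enum value is used.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative only: a cut-down OperationType and a hypothetical support table.
// The real enum, field names and classifications live in the patch.
final class DirectWriteSelection
{
    enum Support { SUPPORTED, UNSUPPORTED_CORRECTNESS, UNSUPPORTED_POLICY, NOT_APPLICABLE }

    enum OperationType
    {
        COMPACTION(true), FLUSH(true), SCRUB(true), VERIFY(false);

        final boolean writesData;
        OperationType(boolean writesData) { this.writesData = writesData; }
    }

    static final Map<OperationType, Support> DIRECT_WRITE_SUPPORT = new EnumMap<>(OperationType.class);

    static
    {
        DIRECT_WRITE_SUPPORT.put(OperationType.COMPACTION, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(OperationType.FLUSH, Support.UNSUPPORTED_POLICY);
        DIRECT_WRITE_SUPPORT.put(OperationType.SCRUB, Support.UNSUPPORTED_CORRECTNESS);
        // Runs during class initialization, so a forgotten entry fails at load time.
        checkExhaustive(DIRECT_WRITE_SUPPORT);
    }

    /** Throws if any data-writing operation has no classification in the table. */
    static void checkExhaustive(Map<OperationType, Support> table)
    {
        for (OperationType type : OperationType.values())
            if (type.writesData && !table.containsKey(type))
                throw new AssertionError("Unclassified OperationType: " + type);
    }

    // Assumption: operations that never write data are treated as NOT_APPLICABLE.
    static Support supportFor(OperationType type)
    {
        return DIRECT_WRITE_SUPPORT.getOrDefault(type, Support.NOT_APPLICABLE);
    }

    /** All three gates must pass: config mode, table compression, allowlist. */
    static boolean useDirectWrites(boolean directModeConfigured, boolean compressionEnabled, OperationType type)
    {
        return directModeConfigured && compressionEnabled && supportFor(type) == Support.SUPPORTED;
    }
}
```

Keeping the decision in one function mirrors the stated design in which producers are unchanged and only the writer selection differs.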
5437990 to c0beaa7
CASSANDRA-21134: Direct I/O for background SSTable writes
Summary
Opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for write-once read-never data. Memtable flushes remain buffered (hot data benefits from the cache).

Gated by (1) config, (2) table compression enabled, (3) an OperationType allowlist (DataComponent#DIRECT_WRITE_SUPPORT). Selection is centralized in DataComponent.buildWriter; producers are unchanged.

Performance
Benchmark results are attached to the JIRA; they show significant p99 read latency improvements under throttled compaction.
Operations covered (DIO eligible)
WRITE | CQLSSTableWriterDaemonTest (parameterised on disk mode)
COMPACTION | CompactionsTest (parameterised on disk mode)
MAJOR_COMPACTION | CompactionsTest.testCompactionWithSizeLimitedRewriter
CLEANUP, GARBAGE_COLLECT, TOMBSTONE_COMPACTION, UPGRADE_SSTABLES | CompactionsTest (transitive)
ANTICOMPACTION | AntiCompactionTest.testAntiCompactionWithCompressedTableAndDirectWrites
STREAM | DataComponentDirectWriteSelectionTest (selection only; incoming stream buffers reach the writer like any other producer)

The allowlist is exhaustive: any new OperationType with writesData == true that is not classified fails static initialization (AssertionError).

Operations NOT covered
FLUSH (memtable) | UNSUPPORTED_POLICY | DataComponentDirectWriteSelectionTest
SCRUB | UNSUPPORTED_CORRECTNESS | tryAppend needs mark()/resetAndTruncate(), which DIO cannot satisfy. | DataComponentDirectWriteSelectionTest
Zero-Copy Streaming | Bypasses DataComponent.buildWriter.
Uncompressed writers | Only CompressedSequentialWriter has a DIO subclass. | DataComponentDirectWriteSelectionTest (compression gate)

Removing an UNSUPPORTED_CORRECTNESS entry requires code changes; UNSUPPORTED_POLICY is a policy decision.

Key code
- io/DirectIoSupport.java: eligibility enum (SUPPORTED / UNSUPPORTED_CORRECTNESS / UNSUPPORTED_POLICY / NOT_APPLICABLE).
- io/sstable/format/DataComponent.java: selection, allowlist, exhaustiveness check.
- io/compress/DirectCompressedSequentialWriter.java: new writer; aligned buffers, mark()/resetAndTruncate() unsupported; durable-offset tracking (chunk-boundary ring buffer + post-flush listener) so preemptive early open binds only to whole-block durable offsets.
- io/compress/CompressedSequentialWriter.java: refactored so the DIO subclass can override the write-chunk path; writeChunk contract documented and asserted.
- config/Config.java, config/DatabaseDescriptor.java: new knobs, validation, startup wiring; buffer size aligned to FS block size, auto-grown to chunk length.
- service/StartupChecks.java: fails fast if direct is requested on a platform/FS that does not support O_DIRECT.

Tests introduced
- [...] vs. the buffered writer over compressors × chunk lengths × random payload sizes; seed-logged for repro (DirectCompressedSequentialWriterTest).
- [...] reports exactly the on-disk whole-block offsets (including under preemptive early open); fault-injecting FileChannel verifies write/truncate/position failures surface as FSWriteError/FSReadError with clean abort (DirectCompressedSequentialWriterTest).
- [...] flushCompleteBlocks (DirectCompressedSequentialWriterTest).
- [...] OperationType eligibility, allowlist exhaustiveness, compression gate, config-mode gate (DataComponentDirectWriteSelectionTest).
- [...] WRITE (CQLSSTableWriterDaemonTest) and the compaction family + ANTICOMPACTION (extended CompactionsTest, AntiCompactionTest).
- [...] block-size rejection, once-per-JVM undersized-buffer warn, SCRUB-gating canaries (DirectCompressedSequentialWriterTest).
- [...] BufferPoolMXBean check that the off-heap aligned buffer is returned on close (DirectCompressedSequentialWriterTest).
- [...] in DatabaseDescriptorTest.

Not in scope
Reviewer notes
Findings from the Cassandra bug-hunting skills (Opus 4.7 xhigh & kimi-k2.6:cloud) were addressed prior to review.
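The "aligned buffers" mentioned under Key code can be illustrated with the JDK's own `ByteBuffer.alignedSlice` (Java 9+). This is a sketch under assumptions: `AlignedBuffers` and `allocateAligned` are hypothetical names, and the description does not say how the patch itself allocates or pools its buffer.

```java
import java.nio.ByteBuffer;

// Sketch only: the patch's own allocation and pooling code is not shown in
// this description, so the class and method names here are hypothetical.
final class AlignedBuffers
{
    /** Allocates an off-heap buffer whose start address and capacity are multiples of blockSize. */
    static ByteBuffer allocateAligned(int capacity, int blockSize)
    {
        if (blockSize <= 0 || Integer.bitCount(blockSize) != 1)
            throw new IllegalArgumentException("blockSize must be a power of two: " + blockSize);
        if (capacity <= 0 || capacity % blockSize != 0)
            throw new IllegalArgumentException("capacity must be a positive multiple of blockSize: " + capacity);

        // Over-allocate by one block so an aligned window of `capacity` bytes always fits.
        ByteBuffer raw = ByteBuffer.allocateDirect(Math.addExact(capacity, blockSize));

        // alignedSlice starts at the first address that is a multiple of blockSize.
        ByteBuffer aligned = raw.alignedSlice(blockSize);

        // Trim the window to exactly the requested capacity.
        aligned.limit(capacity);
        return aligned.slice();
    }
}
```

On JDK 10 and later, a `FileChannel` opened with `com.sun.nio.file.ExtendedOpenOption.DIRECT` expects buffers and write sizes aligned in this way; whether the patch uses that option or another mechanism is not stated in this description.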