CASSANDRA-21134: Direct I/O for background SSTable writes #4815

Closed
samueldlightfoot wants to merge 1 commit into apache:cassandra-6.0 from samueldlightfoot:CASSANDRA-21134-direct-compaction-writes

@samueldlightfoot (Contributor) commented May 17, 2026

CASSANDRA-21134: Direct I/O for background SSTable writes

Summary

Opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for write-once read-never data. Memtable flushes remain buffered (hot data benefits from the cache).

background_write_disk_access_mode: direct    # default: standard
direct_write_buffer_size: 1MiB                # aligned up to FS block size; auto-grows to chunk_length
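
The sizing rule in the comment above can be sketched as follows. This is a minimal illustration, not the code in the patch: `DirectBufferSizing`, `alignUp`, and `effectiveBufferSize` are hypothetical names, and the sketch assumes the growth target is the table's chunk length, as the YAML comment states.

```java
final class DirectBufferSizing {
    /** Rounds value up to the next multiple of a power-of-two alignment. */
    static long alignUp(long value, long alignment) {
        if (alignment <= 0 || Long.bitCount(alignment) != 1)
            throw new IllegalArgumentException("alignment must be a power of two: " + alignment);
        // For a power of two, -alignment is a mask that clears the low bits.
        return (value + alignment - 1) & -alignment;
    }

    /** Grows the configured size to hold at least one chunk, then aligns it to the block size. */
    static long effectiveBufferSize(long configuredBytes, long chunkLengthBytes, long blockSize) {
        return alignUp(Math.max(configuredBytes, chunkLengthBytes), blockSize);
    }

    public static void main(String[] args) {
        // 1 MiB buffer, 16 KiB chunks, 4 KiB blocks: already aligned and large enough.
        System.out.println(effectiveBufferSize(1 << 20, 16 << 10, 4096)); // prints 1048576
        // An undersized, unaligned buffer is rounded up to two blocks.
        System.out.println(effectiveBufferSize(5000, 1000, 4096));        // prints 8192
    }
}
```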

Gated by (1) config, (2) table compression enabled, (3) an OperationType allowlist (DataComponent#DIRECT_WRITE_SUPPORT). Selection is central in DataComponent.buildWriter; producers are unchanged.
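
As a rough sketch of the three gates, assuming they combine as a plain conjunction: the eligibility values below are the ones listed for DirectIoSupport, while `WriterSelectionSketch`, `AccessMode`, and `useDirectWriter` are hypothetical names and not the actual DataComponent.buildWriter code.

```java
final class WriterSelectionSketch {
    /** Hypothetical stand-in for the background_write_disk_access_mode setting. */
    enum AccessMode { STANDARD, DIRECT }

    /** Eligibility values as named in the PR description. */
    enum Eligibility { SUPPORTED, UNSUPPORTED_CORRECTNESS, UNSUPPORTED_POLICY, NOT_APPLICABLE }

    /** The direct writer is chosen only when all three gates pass. */
    static boolean useDirectWriter(AccessMode mode, boolean compressionEnabled, Eligibility eligibility) {
        return mode == AccessMode.DIRECT
            && compressionEnabled
            && eligibility == Eligibility.SUPPORTED;
    }

    public static void main(String[] args) {
        // Compaction of a compressed table with direct mode configured: direct writer.
        System.out.println(useDirectWriter(AccessMode.DIRECT, true, Eligibility.SUPPORTED));
        // Flush is excluded by policy, so it stays on the buffered writer.
        System.out.println(useDirectWriter(AccessMode.DIRECT, true, Eligibility.UNSUPPORTED_POLICY));
    }
}
```

Keeping the decision in one predicate matches the stated design goal that producers stay unchanged: they ask for a writer and never see which one they get.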

Performance

Benchmark results are attached to the JIRA. They show significant p99 read latency improvements under throttled compaction.

Operations covered (DIO eligible)

| OperationType | End-to-end test |
|---|---|
| WRITE | CQLSSTableWriterDaemonTest (parameterised on disk mode) |
| COMPACTION | CompactionsTest (parameterised on disk mode) |
| MAJOR_COMPACTION | CompactionsTest.testCompactionWithSizeLimitedRewriter |
| CLEANUP, GARBAGE_COLLECT, TOMBSTONE_COMPACTION, UPGRADE_SSTABLES | CompactionsTest (transitive) |
| ANTICOMPACTION | AntiCompactionTest.testAntiCompactionWithCompressedTableAndDirectWrites |
| STREAM | DataComponentDirectWriteSelectionTest (selection only; incoming stream buffers reach the writer like any other producer) |

The allowlist is exhaustive: any new OperationType with writesData == true that is not classified fails static initialization (AssertionError).
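
One way such a check could look, as a self-contained sketch: the `Op` enum is a reduced, hypothetical stand-in for OperationType (with `VERIFY` assumed here to write no data), and `checkExhaustive` is an illustrative name rather than the method in DataComponent.

```java
import java.util.EnumMap;
import java.util.Map;

final class AllowlistSketch {
    /** Reduced, hypothetical stand-in for OperationType. */
    enum Op {
        COMPACTION(true), FLUSH(true), SCRUB(true), VERIFY(false);

        final boolean writesData;
        Op(boolean writesData) { this.writesData = writesData; }
    }

    enum Support { SUPPORTED, UNSUPPORTED_CORRECTNESS, UNSUPPORTED_POLICY }

    static final Map<Op, Support> DIRECT_WRITE_SUPPORT = new EnumMap<>(Op.class);

    static {
        DIRECT_WRITE_SUPPORT.put(Op.COMPACTION, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(Op.FLUSH, Support.UNSUPPORTED_POLICY);
        DIRECT_WRITE_SUPPORT.put(Op.SCRUB, Support.UNSUPPORTED_CORRECTNESS);
        // Runs at class load, so a forgotten entry breaks every caller immediately.
        checkExhaustive(DIRECT_WRITE_SUPPORT);
    }

    /** Throws if any data-writing operation has no classification. */
    static void checkExhaustive(Map<Op, Support> classified) {
        for (Op op : Op.values())
            if (op.writesData && !classified.containsKey(op))
                throw new AssertionError("Unclassified data-writing operation: " + op);
    }

    public static void main(String[] args) {
        System.out.println(DIRECT_WRITE_SUPPORT);
    }
}
```

Adding a new data-writing constant to `Op` without a matching `put` makes class initialization fail, which is the behaviour the description attributes to the real allowlist.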

Operations NOT covered

| Path | Classification | Reason | Coverage |
|---|---|---|---|
| FLUSH (memtable) | UNSUPPORTED_POLICY | Just-flushed data is hot, so keep it in the page cache. | DataComponentDirectWriteSelectionTest |
| SCRUB | UNSUPPORTED_CORRECTNESS | tryAppend needs mark() / resetAndTruncate(), which DIO cannot satisfy. | DataComponentDirectWriteSelectionTest |
| Zero-Copy Streaming | n/a (path bypass) | Entire-SSTable streaming bypasses DataComponent.buildWriter. | n/a (never reaches the selection gate) |
| Uncompressed writers | n/a (path bypass) | Only CompressedSequentialWriter has a DIO subclass. | DataComponentDirectWriteSelectionTest (compression gate) |

Removing an UNSUPPORTED_CORRECTNESS entry requires code changes; UNSUPPORTED_POLICY is a policy decision.

Key code

  • io/DirectIoSupport.java — eligibility enum (SUPPORTED / UNSUPPORTED_CORRECTNESS /
    UNSUPPORTED_POLICY / NOT_APPLICABLE).
  • io/sstable/format/DataComponent.java — selection, allowlist, exhaustiveness check.
  • io/compress/DirectCompressedSequentialWriter.java — new writer; aligned buffers,
    mark()/resetAndTruncate() unsupported; durable-offset tracking (chunk-boundary ring
    buffer + post-flush listener) so preemptive early open binds only to whole-block durable
    offsets.
  • io/compress/CompressedSequentialWriter.java — refactored so the DIO subclass can
    override the write-chunk path; writeChunk contract documented and asserted.
  • config/Config.java, config/DatabaseDescriptor.java — new knobs, validation, startup
    wiring; buffer size aligned to FS block size, auto-grown to chunk length.
  • service/StartupChecks.java — fails fast if direct is requested on a platform/FS
    that does not support O_DIRECT.
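
The durable-offset idea can be read as follows; this is one plausible interpretation, not the patch's implementation. Because only whole blocks are written, the on-disk position after a flush is a block multiple and may fall in the middle of a chunk, so a reader opened early should only see chunks that end at or before that position. The sketch uses an `ArrayDeque` instead of a fixed-size ring buffer, and `DurableOffsetSketch`, `onChunkWritten`, and `onFlushed` are hypothetical names.

```java
import java.util.ArrayDeque;

final class DurableOffsetSketch {
    /** End offsets of compressed chunks that may not be fully on disk yet, in write order. */
    private final ArrayDeque<Long> pendingChunkEnds = new ArrayDeque<>();
    private long durableChunkEnd = 0;

    /** Records the file offset at which a newly compressed chunk ends. */
    void onChunkWritten(long chunkEndOffset) {
        pendingChunkEnds.addLast(chunkEndOffset);
    }

    /**
     * Post-flush hook: bytes [0, flushedUpTo) are on disk, where flushedUpTo is a block multiple.
     * Returns the end of the last chunk that is entirely durable.
     */
    long onFlushed(long flushedUpTo) {
        while (!pendingChunkEnds.isEmpty() && pendingChunkEnds.peekFirst() <= flushedUpTo)
            durableChunkEnd = pendingChunkEnds.pollFirst();
        return durableChunkEnd;
    }

    public static void main(String[] args) {
        DurableOffsetSketch tracker = new DurableOffsetSketch();
        tracker.onChunkWritten(3000);
        tracker.onChunkWritten(7000);
        // One 4 KiB block is on disk: only the first chunk is complete.
        System.out.println(tracker.onFlushed(4096)); // prints 3000
    }
}
```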

Tests introduced

  • Property-based DIO-writer sweep — write/read integrity and on-disk byte-identity
    vs. the buffered writer over compressors × chunk lengths × random payload sizes;
    seed-logged for repro (DirectCompressedSequentialWriterTest).
  • Durable-offset + fault-injection tests — property-test that the post-flush listener
    reports exactly the on-disk whole-block offsets (including under preemptive early open);
    fault-injecting FileChannel verifies write/truncate/position failures surface as
    FSWriteError/FSReadError with clean abort (DirectCompressedSequentialWriterTest).
  • Parameterised buffer-size tests — three regimes pinning the distinct branches of
    flushCompleteBlocks (DirectCompressedSequentialWriterTest).
  • Selection-matrix tests — per-OperationType eligibility, allowlist exhaustiveness,
    compression gate, config-mode gate (DataComponentDirectWriteSelectionTest).
  • End-to-end coverage per allowlist arm: WRITE (CQLSSTableWriterDaemonTest) and
    the compaction family + ANTICOMPACTION (extended CompactionsTest,
    AntiCompactionTest).
  • Regression guards — constructor channel-leak protection, non-power-of-two
    block-size rejection, once-per-JVM undersized-buffer warn, SCRUB-gating canaries
    (DirectCompressedSequentialWriterTest).
  • Resource-leak detection: BufferPoolMXBean check that the off-heap aligned
    buffer is returned on close (DirectCompressedSequentialWriterTest).
  • Config validation — new YAML knobs (mode parsing, buffer-size bounds, defaults)
    in DatabaseDescriptorTest.

Not in scope

  • Uncompressed SSTable writers.
  • ZCS streaming.

Reviewer notes

Findings from the Cassandra bug-hunting skills (Opus 4.7 xhigh & kimi-k2.6:cloud) were addressed prior to
review.

@samueldlightfoot samueldlightfoot force-pushed the CASSANDRA-21134-direct-compaction-writes branch from 3349322 to 005f4e1 Compare May 17, 2026 09:25
@samueldlightfoot samueldlightfoot marked this pull request as ready for review May 17, 2026 09:25
@samueldlightfoot samueldlightfoot force-pushed the CASSANDRA-21134-direct-compaction-writes branch 5 times, most recently from a49806c to ca8ef09 Compare May 19, 2026 17:23
@aweisberg aweisberg self-requested a review May 28, 2026 20:46

@aweisberg (Contributor) left a comment

Mostly just nits, but at least one or two things are worth doing. I also shared a pastebin with you containing an analysis of the DirectCompressedSequentialWriterTest code and boundary coverage that is worth considering.

Comment thread conf/cassandra.yaml Outdated
Comment thread src/java/org/apache/cassandra/config/DatabaseDescriptor.java Outdated
Comment thread src/java/org/apache/cassandra/io/compress/CompressedSequentialWriter.java Outdated
Comment thread src/java/org/apache/cassandra/io/sstable/format/DataComponent.java Outdated
Comment thread src/java/org/apache/cassandra/io/compress/DirectCompressedSequentialWriter.java Outdated
Comment thread test/unit/org/apache/cassandra/config/DatabaseDescriptorTest.java
Comment thread test/unit/org/apache/cassandra/io/sstable/CQLSSTableWriterTest.java Outdated
Comment thread test/unit/org/apache/cassandra/io/sstable/CQLSSTableWriterTest.java Outdated
@samueldlightfoot samueldlightfoot force-pushed the CASSANDRA-21134-direct-compaction-writes branch 7 times, most recently from 8af930a to 9ab9c8d Compare June 7, 2026 15:21
@samueldlightfoot samueldlightfoot force-pushed the CASSANDRA-21134-direct-compaction-writes branch from a5c418e to aee37bc Compare June 9, 2026 20:57

@aweisberg (Contributor) left a comment

Only minor nits left. TY!

Comment thread src/java/org/apache/cassandra/config/DatabaseDescriptor.java Outdated
Comment thread test/unit/org/apache/cassandra/config/DatabaseDescriptorTest.java Outdated
Comment thread conf/cassandra.yaml
Comment thread conf/cassandra.yaml
Comment thread src/java/org/apache/cassandra/io/compress/DirectCompressedSequentialWriter.java Outdated
throw new IllegalStateException("Filesystem block size must be a power of two for Direct IO. " +
"Block size: " + blockSize);

int configuredSize = DatabaseDescriptor.getDirectWriteBufferSize().toBytes();

It would be nice to add validation for the maximum value of this configuration parameter too: overly large values can cause int overflow when the buffer size is calculated.
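
A sketch of the kind of bound check being asked for, under the assumption that the buffer size is ultimately narrowed to an `int`: the arithmetic is done in `long` and the configured value is rejected before alignment can push it past `Integer.MAX_VALUE`. `BufferSizeBounds` and `checkedBufferSize` are hypothetical names, and the real upper bound chosen for the knob may be much lower.

```java
final class BufferSizeBounds {
    /**
     * Validates a configured size in bytes and returns it aligned up to the block size.
     * Rejects values whose aligned form would not fit in an int.
     */
    static int checkedBufferSize(long configuredBytes, int blockSize) {
        if (blockSize <= 0 || Integer.bitCount(blockSize) != 1)
            throw new IllegalArgumentException("Block size must be a power of two: " + blockSize);

        // Largest block-aligned value that still fits in an int.
        long maxAligned = Integer.MAX_VALUE & -(long) blockSize;
        if (configuredBytes <= 0 || configuredBytes > maxAligned)
            throw new IllegalArgumentException("Buffer size out of range (1.." + maxAligned + "): " + configuredBytes);

        long aligned = (configuredBytes + blockSize - 1) & -(long) blockSize;
        return Math.toIntExact(aligned); // cannot overflow: aligned <= maxAligned
    }

    public static void main(String[] args) {
        System.out.println(checkedBufferSize(1L << 20, 4096)); // prints 1048576
    }
}
```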

Comment thread src/java/org/apache/cassandra/io/sstable/format/DataComponent.java
Adds an opt-in O_DIRECT write path for background SSTable producers,
bypassing the OS page cache for data that is unlikely to be re-read
soon after being written. Memtable flushes remain buffered.

Enabled via two new YAML knobs:
 - background_write_disk_access_mode: standard (default) | direct
 - direct_write_buffer_size: 1MiB (default; aligned up to FS block
   size, auto-grown to fit a worst-case compressed chunk)

The path is gated by config, table compression being enabled, and an
OperationType allowlist in DataComponent. The allowlist is exhaustive
over OperationType: any new value left unclassified fails static
initialization.

Operations on the DIO path: COMPACTION, MAJOR_COMPACTION,
TOMBSTONE_COMPACTION, ANTICOMPACTION, GARBAGE_COLLECT, CLEANUP,
UPGRADE_SSTABLES, WRITE, STREAM (chunked receiver only), RELOCATE,
UNKNOWN (offline sstablesplit).

Operations off the DIO path:
 - FLUSH (policy: just-flushed data is hot, keep in page cache)
 - SCRUB (correctness: tryAppend needs mark/resetAndTruncate)
 - Zero-Copy Streaming (bypasses DataComponent.buildWriter)
 - Uncompressed writers (only CompressedSequentialWriter has a DIO
   subclass in this change)

StartupChecks fails fast if 'direct' is requested on a platform/FS
that does not support O_DIRECT.

patch by Sam Lightfoot; reviewed by Ariel Weisberg, Dmitry Konstantinov for CASSANDRA-21134
@samueldlightfoot samueldlightfoot force-pushed the CASSANDRA-21134-direct-compaction-writes branch from 5437990 to c0beaa7 Compare June 24, 2026 08:23
4 participants