@@ -81,10 +81,18 @@ Native shuffle (`CometExchange`) is selected when all of the following condition
 └─────────────────────────────────────────────────────────────────────────────┘
 │ │
 ▼ ▼
-┌───────────────────────────────────┐ ┌───────────────────────────────────┐
-│ MultiPartitionShuffleRepartitioner │ │ SinglePartitionShufflePartitioner │
-│ (hash/range partitioning) │ │ (single partition case) │
-└───────────────────────────────────┘ └───────────────────────────────────┘
+┌───────────────────────────────────────────────────────────────────────┐
+│                         Partitioner Selection                         │
+│        Controlled by spark.comet.exec.shuffle.partitionerMode         │
+├───────────────────────────┬───────────────────────────────────────────┤
+│ immediate (default)       │ buffered                                  │
+│ ImmediateModePartitioner  │ MultiPartitionShuffleRepartitioner        │
+│ (hash/range/round-robin)  │ (hash/range/round-robin)                  │
+│ Writes IPC blocks as      │ Buffers all rows in memory                │
+│ batches arrive            │ before writing                            │
+├───────────────────────────┴───────────────────────────────────────────┤
+│       SinglePartitionShufflePartitioner (single partition case)       │
+└───────────────────────────────────────────────────────────────────────┘
 │
 ▼
 ┌───────────────────────────────────┐
@@ -113,11 +121,13 @@ Native shuffle (`CometExchange`) is selected when all of the following condition
 
 ### Rust Side
 
-| File                    | Location                             | Description                                                                          |
-| ----------------------- | ------------------------------------ | ------------------------------------------------------------------------------------ |
-| `shuffle_writer.rs`     | `native/core/src/execution/shuffle/` | `ShuffleWriterExec` plan and partitioners. Main shuffle logic.                       |
-| `codec.rs`              | `native/core/src/execution/shuffle/` | `ShuffleBlockWriter` for Arrow IPC encoding with compression. Also handles decoding. |
-| `comet_partitioning.rs` | `native/core/src/execution/shuffle/` | `CometPartitioning` enum defining partition schemes (Hash, Range, Single).           |
+| File                    | Location                           | Description                                                                                                                            |
+| ----------------------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
+| `shuffle_writer.rs`     | `native/shuffle/src/`              | `ShuffleWriterExec` plan. Selects partitioner based on `immediate_mode` flag.                                                          |
+| `immediate_mode.rs`     | `native/shuffle/src/partitioners/` | `ImmediateModePartitioner`. Scatter-writes rows into per-partition Arrow builders and flushes IPC blocks to in-memory buffers eagerly. |
+| `multi_partition.rs`    | `native/shuffle/src/partitioners/` | `MultiPartitionShuffleRepartitioner`. Buffers all rows in memory, then writes partitions.                                              |
+| `codec.rs`              | `native/shuffle/src/`              | `ShuffleBlockWriter` for Arrow IPC encoding with compression. Also handles decoding.                                                   |
+| `comet_partitioning.rs` | `native/shuffle/src/`              | `CometPartitioning` enum defining partition schemes (Hash, Range, Single).                                                             |
 
 ## Data Flow
 
@@ -129,23 +139,33 @@ Native shuffle (`CometExchange`) is selected when all of the following condition
 
 2. **Native execution**: `CometExec.getCometIterator()` executes the plan in Rust.
 
-3. **Partitioning**: `ShuffleWriterExec` receives batches and routes to the appropriate partitioner:
-   - `MultiPartitionShuffleRepartitioner`: For hash/range/round-robin partitioning
-   - `SinglePartitionShufflePartitioner`: For single partition (simpler path)
+3. **Partitioning**: `ShuffleWriterExec` receives batches and routes to the appropriate partitioner
+   based on the `partitionerMode` configuration:
+   - **Immediate mode** (`ImmediateModePartitioner`): For hash/range/round-robin partitioning.
+     As each batch arrives, rows are scattered into per-partition Arrow array builders. When a
+     partition's builder reaches the target batch size, it is flushed as a compressed Arrow IPC
+     block to an in-memory buffer. Under memory pressure, these buffers are spilled to
+     per-partition temporary files. This keeps memory usage much lower than buffered mode since
+     data is encoded into compact IPC format eagerly rather than held as raw Arrow arrays.
 
-4. **Buffering and spilling**: The partitioner buffers rows per partition. When memory pressure
-   exceeds the threshold, partitions spill to temporary files.
+   - **Buffered mode** (`MultiPartitionShuffleRepartitioner`): For hash/range/round-robin
+     partitioning. Buffers all input `RecordBatch`es in memory, then partitions and writes
+     them in a single pass. When memory pressure exceeds the threshold, partitions spill to
+     temporary files.
 
-5. **Encoding**: `ShuffleBlockWriter` encodes each partition's data as compressed Arrow IPC:
+   - `SinglePartitionShufflePartitioner`: For single partition (simpler path, used regardless
+     of partitioner mode).
+
+4. **Encoding**: `ShuffleBlockWriter` encodes each partition's data as compressed Arrow IPC:
    - Writes compression type header
    - Writes field count header
    - Writes compressed IPC stream
 
-6. **Output files**: Two files are produced:
+5. **Output files**: Two files are produced:
    - **Data file**: Concatenated partition data
    - **Index file**: Array of 8-byte little-endian offsets marking partition boundaries
 
-7. **Commit**: Back in JVM, `CometNativeShuffleWriter` reads the index file to get partition
+6. **Commit**: Back in JVM, `CometNativeShuffleWriter` reads the index file to get partition
    lengths and commits via Spark's `IndexShuffleBlockResolver`.
 
 ### Read Path
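
Note on the index-file format described in the hunk above (the "Output files" and "Commit" steps): the partition lengths the JVM side needs follow from the differences between consecutive offsets. Below is a minimal Rust sketch of that decoding, under assumptions the text does not state: the file is assumed to hold `num_partitions + 1` offsets (a start offset plus one end offset per partition), and the offsets are treated as unsigned. `partition_lengths` is a hypothetical helper for illustration, not a function from Comet or Spark.

```rust
use std::convert::TryFrom;

/// Decode an index file (raw bytes) into per-partition byte lengths.
///
/// Assumptions, not confirmed by the document: the file contains
/// `num_partitions + 1` unsigned 8-byte little-endian offsets, so that
/// `length[i] = offset[i + 1] - offset[i]`.
fn partition_lengths(index_bytes: &[u8]) -> Result<Vec<u64>, String> {
    if index_bytes.len() % 8 != 0 {
        return Err(format!(
            "index length {} is not a multiple of 8",
            index_bytes.len()
        ));
    }

    // Each 8-byte chunk is one little-endian offset into the data file.
    let offsets: Vec<u64> = index_bytes
        .chunks_exact(8)
        .map(|chunk| u64::from_le_bytes(<[u8; 8]>::try_from(chunk).expect("chunk is 8 bytes")))
        .collect();

    // Consecutive offsets bound one partition; reject decreasing offsets.
    offsets
        .windows(2)
        .map(|pair| {
            pair[1]
                .checked_sub(pair[0])
                .ok_or_else(|| format!("offsets decrease: {} -> {}", pair[0], pair[1]))
        })
        .collect()
}
```

For instance, an index holding the offsets `[0, 100, 100, 250]` would decode to the lengths `[100, 0, 150]` under these assumptions, where the zero entry is an empty partition.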
@@ -201,10 +221,31 @@ sizes.
 
 ## Memory Management
 
-Native shuffle uses DataFusion's memory management with spilling support:
+Native shuffle uses DataFusion's memory management. The memory characteristics differ
+between the two partitioner modes:
+
+### Immediate Mode
+
+Immediate mode keeps memory usage low by partitioning and encoding data eagerly as it arrives,
+rather than buffering all input rows before writing:
+
+- **Per-partition builders**: Each partition has a set of Arrow array builders sized to the
+  target batch size. When a builder fills up, it is flushed as a compressed IPC block to an
+  in-memory buffer.
+- **Memory footprint**: Proportional to `num_partitions × batch_size` for the builders, plus
+  the accumulated IPC buffers. This is typically much smaller than buffered mode since IPC
+  encoding is more compact than raw Arrow arrays.
+- **Spilling**: When memory pressure is detected via DataFusion's `MemoryConsumer` trait,
+  partition builders are flushed and all IPC buffers are drained to per-partition temporary
+  files on disk.
+
+### Buffered Mode
+
+Buffered mode holds all input data in memory before writing:
 
-- **Memory pool**: Tracks memory usage across the shuffle operation.
-- **Spill threshold**: When buffered data exceeds the threshold, partitions spill to disk.
+- **Buffered batches**: All incoming `RecordBatch`es are accumulated in a `Vec`.
+- **Spill threshold**: When buffered data exceeds the memory threshold, partitions spill to
+  temporary files on disk.
 - **Per-partition spilling**: Each partition has its own spill file. Multiple spills for a
   partition are concatenated when writing the final output.
 - **Scratch space**: Reusable buffers for partition ID computation to reduce allocations.
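
Note on the immediate-mode behaviour added in the hunk above: the scatter-then-flush loop can be pictured with a toy model. The Rust sketch below is illustrative only and makes several simplifying assumptions: rows are plain `i64` keys, each "builder" is a `Vec<i64>` rather than an Arrow array builder, a flushed "IPC block" is just the rows as little-endian bytes with no compression, the partition is chosen by a modulo instead of a real hash, and spilling is left out. `ImmediateSketch` and its methods are hypothetical names, not the API of `ImmediateModePartitioner`.

```rust
/// Toy model of immediate-mode partitioning (not Comet's implementation).
struct ImmediateSketch {
    batch_size: usize,
    /// One open builder per partition; each holds fewer than `batch_size` rows
    /// between calls, so open builders total at most num_partitions * batch_size.
    builders: Vec<Vec<i64>>,
    /// Blocks already flushed for each partition (stand-in for IPC buffers).
    blocks: Vec<Vec<Vec<u8>>>,
}

impl ImmediateSketch {
    fn new(num_partitions: usize, batch_size: usize) -> Self {
        assert!(num_partitions > 0 && batch_size > 0);
        Self {
            batch_size,
            builders: (0..num_partitions)
                .map(|_| Vec::with_capacity(batch_size))
                .collect(),
            blocks: vec![Vec::new(); num_partitions],
        }
    }

    /// Scatter one input batch into the builders, flushing any builder
    /// as soon as it reaches the target batch size.
    fn insert_batch(&mut self, batch: &[i64]) {
        let n = self.builders.len() as i64;
        for &row in batch {
            // Placeholder for a real partitioning function (assumption).
            let p = row.rem_euclid(n) as usize;
            self.builders[p].push(row);
            if self.builders[p].len() >= self.batch_size {
                self.flush(p);
            }
        }
    }

    /// Encode the builder's rows into one block and reset the builder.
    fn flush(&mut self, p: usize) {
        if self.builders[p].is_empty() {
            return;
        }
        let block: Vec<u8> = self.builders[p]
            .drain(..)
            .flat_map(|v| v.to_le_bytes())
            .collect();
        self.blocks[p].push(block);
    }

    /// Flush partially filled builders, e.g. at end of input.
    fn finish(&mut self) {
        for p in 0..self.builders.len() {
            self.flush(p);
        }
    }
}
```

In this model the open builders never hold more than `num_partitions × batch_size` rows in total, which mirrors the footprint bullet above; everything beyond that lives in the already-encoded blocks.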
@@ -232,14 +273,15 @@ independently compressed, allowing parallel decompression during reads.
 
 ## Configuration
 
-| Config                                            | Default | Description                              |
-| ------------------------------------------------- | ------- | ---------------------------------------- |
-| `spark.comet.exec.shuffle.enabled`                | `true`  | Enable Comet shuffle                     |
-| `spark.comet.exec.shuffle.mode`                   | `auto`  | Shuffle mode: `native`, `jvm`, or `auto` |
-| `spark.comet.exec.shuffle.compression.codec`      | `zstd`  | Compression codec                        |
-| `spark.comet.exec.shuffle.compression.zstd.level` | `1`     | Zstd compression level                   |
-| `spark.comet.shuffle.write.buffer.size`           | `1MB`   | Write buffer size                        |
-| `spark.comet.columnar.shuffle.batch.size`         | `8192`  | Target rows per batch                    |
+| Config                                            | Default     | Description                                 |
+| ------------------------------------------------- | ----------- | ------------------------------------------- |
+| `spark.comet.exec.shuffle.enabled`                | `true`      | Enable Comet shuffle                        |
+| `spark.comet.exec.shuffle.mode`                   | `auto`      | Shuffle mode: `native`, `jvm`, or `auto`    |
+| `spark.comet.exec.shuffle.partitionerMode`        | `immediate` | Partitioner mode: `immediate` or `buffered` |
+| `spark.comet.exec.shuffle.compression.codec`      | `zstd`      | Compression codec                           |
+| `spark.comet.exec.shuffle.compression.zstd.level` | `1`         | Zstd compression level                      |
+| `spark.comet.shuffle.write.buffer.size`           | `1MB`       | Write buffer size                           |
+| `spark.comet.columnar.shuffle.batch.size`         | `8192`      | Target rows per batch                       |
 
 ## Comparison with JVM Shuffle
 