
Fix memory reservation starvation in sort-merge #20642

Merged
alamb merged 5 commits into apache:main from xudong963:fix/sort-merge-reservation-starvation
Mar 18, 2026

Conversation


@xudong963 xudong963 commented Mar 2, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

This PR fixes memory reservation starvation in sort-merge when multiple sort partitions share a GreedyMemoryPool.

When multiple ExternalSorter instances run concurrently and share a single memory pool, the merge phase starves:

  1. Each partition pre-reserves sort_spill_reservation_bytes via merge_reservation
  2. When entering the merge phase, new_empty() was used to create a new reservation starting at 0 bytes, while the pre-reserved bytes sat idle in ExternalSorter.merge_reservation
  3. Those freed bytes were immediately consumed by other partitions racing for memory
  4. The merge could no longer allocate memory from the pool → OOM / starvation

What changes are included in this PR?

Are these changes tested?

I couldn't find a deterministic way to reproduce the bug, but it occurs in our production environment. This PR adds an end-to-end test to verify the fix.

Are there any user-facing changes?

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Mar 2, 2026
@xudong963 xudong963 marked this pull request as draft March 2, 2026 09:49
@xudong963 xudong963 marked this pull request as ready for review March 2, 2026 10:09

xudong963 commented Mar 2, 2026

I can't find a deterministic way to reproduce the bug right now, but it occurs in our production environment. I'd like to get more eyes on this PR!

Update: I added an end-to-end test which fails on main

@xudong963 xudong963 force-pushed the fix/sort-merge-reservation-starvation branch from 0f2140c to 651e9c9 on March 2, 2026 21:15
Comment on lines +358 to +364
// Transfer the pre-reserved merge memory to the streaming merge
// using `take()` instead of `new_empty()`. This ensures the merge
// stream starts with `sort_spill_reservation_bytes` already
// allocated, preventing starvation when concurrent sort partitions
// compete for pool memory. `take()` moves the bytes atomically
// without releasing them back to the pool, so other partitions
// cannot race to consume the freed memory.
Member


The pre-reserved merge memory should be used as part of the sort-merge stream.
I mean that if x bytes of merge memory were pre-reserved, the sort-merge stream should know about that so it won't think it is starting from 0; otherwise this just reserves unaccounted memory.

Member Author


Thanks! Good point — just take() alone wouldn't be enough if the merge stream doesn't know about the pre-reserved bytes.

The PR does address this in the other changed files:

  • In builder.rs: BatchBuilder now tracks batches_mem_used separately and only calls try_grow() when actual usage exceeds the current reservation size. It also records initial_reservation so
    it never shrinks below that during build_output. This way the pre-reserved bytes are used as the initial budget rather than requesting from the pool on top of them.
  • In multi_level_merge.rs: get_sorted_spill_files_to_merge now tracks total_needed and only requests additional pool memory when total_needed > reservation.size(), so spill file buffers
    covered by the pre-reserved bytes don't trigger extra pool allocations.

So the merge stream is aware of the pre-reserved bytes and uses them as its starting budget — it doesn't think it's starting from 0.

coracuity added a commit to acuitymd/silk-chiffon that referenced this pull request Mar 4, 2026
DataFusion 52.1.0 has a TOCTOU race in ExternalSorter where merge
reservations are freed and re-created empty, letting other partitions
steal the memory (apache/datafusion#20642). Until the upstream fix
lands, compute a data-aware sort_spill_reservation_bytes by sampling
actual Arrow row sizes from the input, estimating spill file count,
and reserving enough for the merge phase.
coracuity added a commit to acuitymd/silk-chiffon that referenced this pull request Mar 4, 2026
* Sample-based sort spill reservation to mitigate merge OOM

DataFusion 52.1.0 has a TOCTOU race in ExternalSorter where merge
reservations are freed and re-created empty, letting other partitions
steal the memory (apache/datafusion#20642). Until the upstream fix
lands, compute a data-aware sort_spill_reservation_bytes by sampling
actual Arrow row sizes from the input, estimating spill file count,
and reserving enough for the merge phase.

* Add more tests

* formatting

* Allow some truncation here

* Handle when budgets are tight

* Better estimate in-memory size using row count and avg row size

* Remove dead code

* Lint fix

* linting

* Fix low memory situations
@xudong963 xudong963 force-pushed the fix/sort-merge-reservation-starvation branch from 01ae92d to acd7caa on March 10, 2026 12:08

for spill in &self.sorted_spill_files {
// For memory pools that are not shared this is fine; for shared pools it is not,
// and there should be some upper limit on the memory reservation so we won't starve the system
Contributor


I think this comment still applies: if you have multiple partitions running, one partition will still be able to starve the others

Member Author


yes, I'll keep the comment

) -> Result<(Vec<SortedSpillFile>, usize)> {
assert_ne!(buffer_len, 0, "Buffer length must be greater than 0");
let mut number_of_spills_to_read_for_current_phase = 0;
// Track total memory needed for spill file buffers. When the
Contributor


It feels like this whole method is ripe for a refactor, and introducing a memory floor is making it even more complex. Is there a way to incorporate the memory floor but also simplify this a little bit?

Member Author


I added a try_grow_reservation_to_at_least helper to reduce the complexity.

Contributor

@kosiew kosiew left a comment


@xudong963

Thanks for working on this.

// concurrent sort partitions compete for pool memory: the pre-reserved
// bytes cover spill file buffer reservations without additional pool
// allocation.
let mut memory_reservation = self.reservation.take();
Contributor


It looks like merge_sorted_runs_within_mem_limit() is transferring self.reservation into memory_reservation before it actually knows whether any spill files will be merged. If the builder already has enough in-memory streams to satisfy minimum_number_of_required_streams, but the first spill file still cannot fit, then get_sorted_spill_files_to_merge() could legitimately return zero spill files.

In that situation, is_only_merging_memory_streams would become true, but memory_reservation would still contain the bytes taken from self.reservation. That seems like it could trigger the assertion at lines 297–302 even though falling back to an all-in-memory merge is valid.

My understanding is that this creates a behavior regression in the mixed {sorted_streams + sorted_spill_files} path. Should the reservation transfer instead happen only after at least one spill file is selected, or should the unused reservation be returned to the all-in-memory merge path rather than being asserted away? 🤔


Comment on lines +96 to +98
if self.batches_mem_used > self.reservation.size() {
self.reservation
.try_grow(self.batches_mem_used - self.reservation.size())?;
Contributor


This “grow only when usage exceeds current reservation” pattern is also applied in get_sorted_spill_files_to_merge in multi_level_merge.rs.
I think extracting it into a helper would make the intended invariant easier to check.


Contributor

@kosiew kosiew left a comment


lgtm

@xudong963
Member Author

I plan to merge the PR tomorrow if there are no more comments

@xudong963 xudong963 requested review from cetra3 and rluvaton March 17, 2026 01:42
@alamb alamb added this pull request to the merge queue Mar 18, 2026
Merged via the queue into apache:main with commit a6a4df9 Mar 18, 2026
34 checks passed

alamb commented Mar 18, 2026

I am merging to try and keep the code flowing


alamb commented Mar 18, 2026

Let's address any additional comments as follow on PRs

xudong963 added a commit to massive-com/arrow-datafusion that referenced this pull request Mar 20, 2026
xudong963 added a commit to massive-com/arrow-datafusion that referenced this pull request Mar 23, 2026
de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026