Skip to content

fix: Ensure target file size is properly respected in vortex sink#33

Merged
peasee merged 20 commits into
spiceai-52from
peasee/260310-partitioned-file-split-target-size
Mar 12, 2026
Merged

fix: Ensure target file size is properly respected in vortex sink#33
peasee merged 20 commits into
spiceai-52from
peasee/260310-partitioned-file-split-target-size

Conversation

@peasee
Copy link
Copy Markdown

@peasee peasee commented Mar 11, 2026

  • Replaces the DataFusion demuxer with our own handling implementation to ensure we track files across streams
  • Ensures target file size is respected in cases with no partitioning columns as well as with partitioning columns
  • Adds a comprehensive test suite to check different data insertion scenarios
  • Ensures all error cases are properly handled and returned to the caller, so that an invalid data state can never be entered

@peasee peasee self-assigned this Mar 11, 2026
@peasee peasee added the bug Something isn't working label Mar 11, 2026
Copilot AI review requested due to automatic review settings March 11, 2026 10:54
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the Vortex DataFusion sink write path to better respect target_file_size across both partitioned and non-partitioned outputs, replacing reliance on DataFusion’s demux behavior with sink-managed routing and file rotation. It also adds an expanded test suite covering file splitting and multi-partition write scenarios.

Changes:

  • Implement custom file rotation and partition routing in VortexSink, including a compression-ratio-based estimate to approximate target compressed size.
  • Add a concurrency cap for partition writers and new tests for partitioned writes, file splitting, and write-id consistency.
  • Force a single sink input stream via a repartition step to ensure one coordinated write per statement.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
vortex-datafusion/src/tests/nested_projection.rs Marks nested projection tests as ignored.
vortex-datafusion/src/persistent/sink.rs Replaces the prior sink writing/demux approach with custom per-file/per-partition writers, rotation logic, and extensive tests.
vortex-datafusion/src/persistent/format.rs Adds a repartition-to-1 step before DataSinkExec to ensure a single coordinated sink write.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread vortex-datafusion/src/tests/nested_projection.rs Outdated
Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Comment thread vortex-datafusion/src/persistent/format.rs Outdated
Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Comment thread vortex-datafusion/src/tests/nested_projection.rs Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 11:03
peasee and others added 2 commits March 11, 2026 21:03
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
peasee and others added 3 commits March 11, 2026 21:03
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Comment thread vortex-datafusion/src/persistent/format.rs Outdated
Comment thread vortex-datafusion/src/persistent/sink.rs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 11:08
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Comment thread vortex-datafusion/src/persistent/format.rs Outdated
Copilot AI review requested due to automatic review settings March 11, 2026 22:50
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread vortex-datafusion/src/persistent/sink.rs
Comment thread vortex-datafusion/src/persistent/sink.rs
Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Comment thread vortex-datafusion/src/persistent/sink.rs Outdated
Copilot AI review requested due to automatic review settings March 12, 2026 04:44
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread vortex-datafusion/src/persistent/sink.rs
@peasee peasee merged commit 8a09bc7 into spiceai-52 Mar 12, 2026
17 of 46 checks passed
@peasee peasee deleted the peasee/260310-partitioned-file-split-target-size branch March 12, 2026 05:05
krinart pushed a commit that referenced this pull request Mar 31, 2026
* fix: Ensure target file sizes are respected

* fix: Improve robustness for streams split over multiple cores

* wip

* wip

* wip

* wip

* refactor: Improve robustness

* refactor: Improve concurrent writer robustness for memory bounds

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/persistent/format.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/tests/nested_projection.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* review: Address comments

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* review: Address comments

* Revert "Update vortex-datafusion/src/persistent/sink.rs"

This reverts commit 6c3be0f.

* refactor: Simplify

* test: Add file writer finish notify test

* review: Address comments

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@krinart krinart mentioned this pull request Mar 31, 2026
krinart added a commit that referenced this pull request Mar 31, 2026
* fix: Ensure target file size is properly respected in vortex sink (#33)

* fix: Ensure target file sizes are respected

* fix: Improve robustness for streams split over multiple cores

* wip

* wip

* wip

* wip

* refactor: Improve robustness

* refactor: Improve concurrent writer robustness for memory bounds

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/persistent/format.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/tests/nested_projection.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* review: Address comments

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* review: Address comments

* Revert "Update vortex-datafusion/src/persistent/sink.rs"

This reverts commit 6c3be0f.

* refactor: Simplify

* test: Add file writer finish notify test

* review: Address comments

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update cargo.lock

---------

Co-authored-by: William <98815791+peasee@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants