Skip to content

Backport file size fix#35

Merged
krinart merged 2 commits intospiceai-51from
viktor/backport-file-size-fix
Mar 31, 2026
Merged

Backport file size fix#35
krinart merged 2 commits intospiceai-51from
viktor/backport-file-size-fix

Conversation

@krinart
Copy link
Copy Markdown

@krinart krinart commented Mar 31, 2026

Cherry-pick of #33 from spiceai-52.

  • Replaces the DataFusion FileSink demuxer-based write path with a custom write_all implementation that splits output files based on estimated compressed size rather than uncompressed Arrow memory.

  • Adds compression ratio tracking across file rolls so that compressible data produces correctly-sized files instead of many tiny ones. Non-partitioned writes are coalesced into a single input stream to ensure one coordinated write per statement

Note: tests (miri) CI check is failing in target spiceai-51 branch: #36

peasee and others added 2 commits March 31, 2026 09:59
* fix: Ensure target file sizes are respected

* fix: Improve robustness for streams split over multiple cores

* wip

* wip

* wip

* wip

* refactor: Improve robustness

* refactor: Improve concurrent writer robustness for memory bounds

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/persistent/format.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vortex-datafusion/src/tests/nested_projection.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* review: Address comments

* Update vortex-datafusion/src/persistent/sink.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* review: Address comments

* Revert "Update vortex-datafusion/src/persistent/sink.rs"

This reverts commit 6c3be0f.

* refactor: Simplify

* test: Add file writer finish notify test

* review: Address comments

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@krinart krinart marked this pull request as ready for review March 31, 2026 19:58
Copilot AI review requested due to automatic review settings March 31, 2026 19:58
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR backports a Vortex/DataFusion sink fix to split output files based on estimated compressed size (instead of Arrow in-memory size) and to coordinate non-partitioned writes through a single sink stream for consistent output file sizing/naming.

Changes:

  • Replaced the previous DataFusion FileSink-based writer path with a custom rolling writer that estimates compressed size and rotates files accordingly.
  • Added target_file_size_mb as a configurable Vortex format option and coalesced non-partitioned write inputs to a single stream.
  • Added extensive end-to-end tests for file splitting behavior plus shared test utilities and the uuid dependency for write IDs.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
vortex-datafusion/src/persistent/sink.rs Implements rolling file writer, compressed-size estimation, cleanup on failure, and adds new splitting/coalescing tests.
vortex-datafusion/src/persistent/format.rs Adds target_file_size_mb option and coalesces partitions for non-partitioned writes; plumbs target size into VortexSink.
vortex-datafusion/src/lib.rs Introduces common_tests::TestSessionContext helper for integration-style sink tests.
vortex-datafusion/Cargo.toml Adds uuid dependency (v7 feature) for per-write IDs.
Cargo.lock Locks uuid addition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread vortex-datafusion/src/lib.rs
Comment thread vortex-datafusion/src/persistent/sink.rs
Comment thread vortex-datafusion/src/persistent/sink.rs
@krinart krinart merged commit 14aeae4 into spiceai-51 Mar 31, 2026
15 of 46 checks passed
@krinart krinart deleted the viktor/backport-file-size-fix branch March 31, 2026 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants