Backport file size fix#35
Merged
krinart merged 2 commits intospiceai-51from Mar 31, 2026
Merged
Conversation
* fix: Ensure target file sizes are respected * fix: Improve robustness for streams split over multiple cores * wip * wip * wip * wip * refactor: Improve robustness * refactor: Improve concurrent writer robustness for memory bounds * Update vortex-datafusion/src/persistent/sink.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update vortex-datafusion/src/persistent/format.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update vortex-datafusion/src/persistent/sink.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update vortex-datafusion/src/persistent/sink.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update vortex-datafusion/src/tests/nested_projection.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * review: Address comments * Update vortex-datafusion/src/persistent/sink.rs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * review: Address comments * Revert "Update vortex-datafusion/src/persistent/sink.rs" This reverts commit 6c3be0f. * refactor: Simplify * test: Add file writer finish notify test * review: Address comments --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR backports a Vortex/DataFusion sink fix to split output files based on estimated compressed size (instead of Arrow in-memory size) and to coordinate non-partitioned writes through a single sink stream for consistent output file sizing/naming.
Changes:
- Replaced the previous DataFusion
FileSink-based writer path with a custom rolling writer that estimates compressed size and rotates files accordingly. - Added
target_file_size_mbas a configurable Vortex format option and coalesced non-partitioned write inputs to a single stream. - Added extensive end-to-end tests for file splitting behavior plus shared test utilities and the
uuiddependency for write IDs.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| vortex-datafusion/src/persistent/sink.rs | Implements rolling file writer, compressed-size estimation, cleanup on failure, and adds new splitting/coalescing tests. |
| vortex-datafusion/src/persistent/format.rs | Adds target_file_size_mb option and coalesces partitions for non-partitioned writes; plumbs target size into VortexSink. |
| vortex-datafusion/src/lib.rs | Introduces common_tests::TestSessionContext helper for integration-style sink tests. |
| vortex-datafusion/Cargo.toml | Adds uuid dependency (v7 feature) for per-write IDs. |
| Cargo.lock | Locks uuid addition. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Jeadie
approved these changes
Mar 31, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cherry-pick of #33 from
spiceai-52.Replaces the
DataFusionFileSinkdemuxer-based write path with a customwrite_allimplementation that splits output files based on estimated compressed size rather than uncompressed Arrow memory.Adds compression ratio tracking across file rolls so that compressible data produces correctly-sized files instead of many tiny ones. Non-partitioned writes are coalesced into a single input stream to ensure one coordinated write per statement
Note:
tests (miri)CI check is failing in targetspiceai-51branch: #36