fix: short-circuit on empty micropartitions#6956
Conversation
Rust Dependency DiffHead: ✅ OK: Within budget.
|
Greptile SummaryThis PR short-circuits
Confidence Score: 3/5The write() short-circuit is correct, but close() is not updated to match, leaving two crash paths and a phantom-file path in the returned metadata. When a writer receives only empty partitions, IcebergWriter.close() and DeltalakeWriter.close() will raise IndexError on metadata_collector[0] because no writer was ever opened. ParquetFileWriter and CSVFileWriter avoid the crash but return a path to a file that does not exist on disk, which any downstream read would fail on. Tests do not cover Iceberg or Delta Lake writers in the empty case, so these paths are unvalidated. daft/io/writer.py — specifically IcebergWriter.close() and DeltalakeWriter.close(), which are unguarded when metadata_collector is empty. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["write(table)"] --> B{"len(table) == 0?"}
B -- Yes --> C["return 0 - short-circuit NEW"]
B -- No --> D{"current_writer is None?"}
D -- Yes --> E["_create_writer()"]
D -- No --> F["write_table()"]
E --> F
F --> G["return bytes_written"]
H["close()"] --> I{"current_writer is None?"}
I -- "Yes - all writes were empty" --> J["Parquet/CSV: return RecordBatch with phantom path"]
I -- "Yes - all writes were empty" --> K["Iceberg/Deltalake: metadata_collector index 0 - IndexError"]
I -- No --> L["writer.close() - return RecordBatch with real path"]
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6956 +/- ##
==========================================
+ Coverage 73.92% 75.61% +1.69%
==========================================
Files 1136 1141 +5
Lines 161625 161728 +103
==========================================
+ Hits 119475 122294 +2819
+ Misses 42150 39434 -2716
🚀 New features to boost your workflow:
|
colin-ho
left a comment
There was a problem hiding this comment.
the greptile comment:
IcebergWriter.close() and DeltalakeWriter.close() access self.metadata_collector[0] without guarding against an empty list, which will raise IndexError when the writer was fed only empty partitions and current_writer was therefore never initialised.
Can we add a defensive check on self.metadata_collector[0] accesses in the iceberg / delta writers?
Changes Made
Calling .to_arrow() on an empty micropartition triggers a ValueError because of empty row groups. This short-circuits the write for empty micropartitions.
Related Issues
ValueError: Row group size cannot be 0in parquet writer #6927