Skip to content

fix: short-circuit on empty micropartitions#6956

Merged
rchowell merged 2 commits into
mainfrom
rchowell/guard-write
May 19, 2026
Merged

fix: short-circuit on empty micropartitions#6956
rchowell merged 2 commits into
mainfrom
rchowell/guard-write

Conversation

@rchowell
Copy link
Copy Markdown
Contributor

Changes Made

Calling .to_arrow() on an empty micropartition triggers a ValueError because of empty row groups. This short-circuits the write for empty micropartitions.

Related Issues

@github-actions github-actions Bot added the fix label May 18, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 18, 2026

Rust Dependency Diff

Head: ea829e8aabb46a598526176a93a6a7da797c7e19 vs Base: 44df2d4685fcf73057f35aef90417b6b501a57f0.

OK: Within budget.

  • New Crates: 0
  • Removed Crates: 0

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 18, 2026

Greptile Summary

This PR short-circuits write() in all four file writers (ParquetFileWriter, CSVFileWriter, IcebergWriter, DeltalakeWriter) to return 0 immediately when an empty MicroPartition is passed, avoiding a ValueError triggered by calling .to_arrow() on an empty partition.

  • All four write() methods gain an early if len(table) == 0: return 0 guard, with unit tests added for the Parquet and CSV writers.
  • IcebergWriter.close() and DeltalakeWriter.close() access self.metadata_collector[0] without guarding against an empty list, which will raise IndexError when the writer was fed only empty partitions and current_writer was therefore never initialised.
  • ParquetFileWriter.close() and CSVFileWriter.close() return a RecordBatch containing a path to a file that was never written to disk in the all-empty case, which downstream readers would find missing.

Confidence Score: 3/5

The write() short-circuit is correct, but close() is not updated to match, leaving two crash paths and a phantom-file path in the returned metadata.

When a writer receives only empty partitions, IcebergWriter.close() and DeltalakeWriter.close() will raise IndexError on metadata_collector[0] because no writer was ever opened. ParquetFileWriter and CSVFileWriter avoid the crash but return a path to a file that does not exist on disk, which any downstream read would fail on. Tests do not cover Iceberg or Delta Lake writers in the empty case, so these paths are unvalidated.

daft/io/writer.py — specifically IcebergWriter.close() and DeltalakeWriter.close(), which are unguarded when metadata_collector is empty.

Important Files Changed

Filename Overview
daft/io/writer.py Adds empty-table short-circuit to all four writer write() methods, but IcebergWriter.close() and DeltalakeWriter.close() will IndexError on metadata_collector[0] if no actual data was ever written, and the Parquet/CSV writers return a path to a file that was never created.
tests/io/test_parquet_roundtrip.py Adds a unit test for ParquetFileWriter with an empty micropartition; verifies 0 bytes written and that close() does not raise.
tests/io/test_csv.py Adds a unit test for CSVFileWriter with an empty micropartition; verifies 0 bytes written and that close() does not raise. IcebergWriter and DeltalakeWriter are not covered.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["write(table)"] --> B{"len(table) == 0?"}
    B -- Yes --> C["return 0 - short-circuit NEW"]
    B -- No --> D{"current_writer is None?"}
    D -- Yes --> E["_create_writer()"]
    D -- No --> F["write_table()"]
    E --> F
    F --> G["return bytes_written"]
    H["close()"] --> I{"current_writer is None?"}
    I -- "Yes - all writes were empty" --> J["Parquet/CSV: return RecordBatch with phantom path"]
    I -- "Yes - all writes were empty" --> K["Iceberg/Deltalake: metadata_collector index 0 - IndexError"]
    I -- No --> L["writer.close() - return RecordBatch with real path"]
Loading

Comments Outside Diff (2)

  1. daft/io/writer.py, line 321-339 (link)

    P1 IndexError in IcebergWriter.close() after empty-only writes

    If every call to write() short-circuits on an empty table (introduced by this PR), self.current_writer is never created and self.metadata_collector stays as []. Then close() crashes on self.metadata_collector[0] with IndexError: list index out of range. The same pattern exists in DeltalakeWriter.close() (line 393). The empty-micropartition fix is incomplete for Iceberg and Delta Lake writers without also handling the "no data was ever written" case in close().

  2. daft/io/writer.py, line 169-178 (link)

    P1 ParquetFileWriter.close() returns a phantom path after all-empty writes

    When every write() call short-circuits on an empty table, no file is ever created at self.full_path, yet close() still returns a RecordBatch with that path. Any downstream consumer (e.g., daft.read_parquet()) that trusts the returned path will encounter a missing file. CSVFileWriter.close() has the same behaviour. Callers likely need a None / empty record or a flag indicating that no file was materialised.

Reviews (1): Last reviewed commit: "fix: short-circuit on empty micropartiti..." | Re-trigger Greptile

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.61%. Comparing base (44df2d4) to head (f121b2d).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
daft/io/writer.py 87.50% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6956      +/-   ##
==========================================
+ Coverage   73.92%   75.61%   +1.69%     
==========================================
  Files        1136     1141       +5     
  Lines      161625   161728     +103     
==========================================
+ Hits       119475   122294    +2819     
+ Misses      42150    39434    -2716     
Files with missing lines Coverage Δ
daft/io/writer.py 73.01% <87.50%> (+16.44%) ⬆️

... and 111 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Collaborator

@colin-ho colin-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the greptile comment:
IcebergWriter.close() and DeltalakeWriter.close() access self.metadata_collector[0] without guarding against an empty list, which will raise IndexError when the writer was fed only empty partitions and current_writer was therefore never initialised.

Can we add a defensive check on self.metadata_collector[0] accesses in the iceberg / delta writers?

@rchowell rchowell merged commit 3e3deba into main May 19, 2026
38 checks passed
@rchowell rchowell deleted the rchowell/guard-write branch May 19, 2026 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Empty MicroPartitions from distributed sort trip ValueError: Row group size cannot be 0 in parquet writer

2 participants