Improve iterator streaming and reduce eager materialization in scan/write paths #35

@kevinjqliu

Summary

Several hot paths eagerly materialize iterator output with list(...) (or an equivalent full-consumption pattern), which raises peak memory and delays the first result until the whole input has been produced. This issue proposes targeted refactors that preserve iterator semantics where possible and reduce intermediate allocations.
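
To make the difference concrete, here is a minimal sketch of eager versus streaming consumption. It is not taken from the codebase: `produce_batches` is a hypothetical stand-in for any expensive per-item producer, and the logs only record the order in which work happens.

```python
from typing import Iterator, List


def produce_batches(n: int, log: List[str]) -> Iterator[int]:
    """Hypothetical producer standing in for an expensive per-item step."""
    for i in range(n):
        log.append(f"produced {i}")
        yield i


# Eager: list(...) runs the producer to completion before any item is used,
# so every item is held in memory at the same time.
eager_log: List[str] = []
eager_items = list(produce_batches(3, eager_log))
eager_log.append("consumer starts")

# Streaming: each item is consumed as soon as it is produced, so only one
# item needs to be alive at a time and the first result is available early.
stream_log: List[str] = []
for item in produce_batches(3, stream_log):
    stream_log.append(f"consumed {item}")
```

In the eager log all "produced" entries come before the consumer starts; in the streaming log production and consumption interleave.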

All links below are immutable permalinks pinned to commit:
7425bc4657cd0e1f8a3003cc2f6493300e0b1d60


Problem areas and proposed improvements

1) ArrowScan task-level batch materialization

2) Manifest list read materialization

3) Append path: _dataframe_to_data_files eagerly materialized

4) Dynamic partition overwrite: _dataframe_to_data_files eager list

5) Delete rewrite path: _dataframe_to_data_files eager list


Equivalent eager-consumption improvements identified

6) Ancestor membership check materialization

7) latest_ancestor_before_timestamp full scan when early exit is possible

8) _validation_history temporary list allocation in extend([...])

9) chain.from_iterable(scan.scan_plan_helper()) readability and streaming clarity

10) InspectTable.history minor optimization
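
Items 6 to 8 share one pattern: stop consuming an ancestor iterator as soon as the answer is known, and avoid temporary lists. The sketch below illustrates that pattern only. `Snap`, `ancestors_of`, `is_ancestor`, and `latest_before` are hypothetical names, not the project's actual classes or functions, and the early return assumes timestamps strictly decrease toward older ancestors.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass(frozen=True)
class Snap:
    """Hypothetical stand-in for a table snapshot."""
    snapshot_id: int
    parent_id: Optional[int]
    timestamp_ms: int


def ancestors_of(start: Optional[Snap], by_id: Dict[int, Snap]) -> Iterator[Snap]:
    """Lazily yield `start` and then each parent, newest first."""
    current = start
    while current is not None:
        yield current
        current = by_id.get(current.parent_id) if current.parent_id is not None else None


def is_ancestor(start: Snap, by_id: Dict[int, Snap], candidate_id: int) -> bool:
    # any() stops at the first match instead of building a full ID list.
    return any(s.snapshot_id == candidate_id for s in ancestors_of(start, by_id))


def latest_before(start: Snap, by_id: Dict[int, Snap], timestamp_ms: int) -> Optional[Snap]:
    # ASSUMPTION: timestamps strictly decrease toward older ancestors, so the
    # first ancestor older than the cutoff is also the latest one.
    for s in ancestors_of(start, by_id):
        if s.timestamp_ms < timestamp_ms:
            return s
    return None


# Sample lineage: 1 <- 2 <- 3, with 3 as the current snapshot.
s1 = Snap(1, None, 100)
s2 = Snap(2, 1, 200)
s3 = Snap(3, 2, 300)
by_id = {s.snapshot_id: s for s in (s1, s2, s3)}

# extend() accepts any iterable, so a generator expression avoids the
# temporary list that extend([...]) would allocate.
recent_ids: List[int] = []
recent_ids.extend(s.snapshot_id for s in ancestors_of(s3, by_id) if s.timestamp_ms >= 200)
```

If the timestamp ordering assumption does not hold for a given lineage, the early return is not safe and a full scan is still required.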


Expected impact

  • Lower peak memory in scan/write-heavy workloads.
  • Better streaming behavior and earlier first-result availability in key paths.
  • Reduced temporary allocations in history/validation helpers.

Suggested implementation plan

  1. Low-risk changes first:
    • Append path streaming in Transaction.append.
    • manifest.py direct iteration without temporary list.
    • validate.py generator expression.
    • snapshots.py early return for latest_ancestor_before_timestamp.
  2. Medium-risk:
    • Delete rewrite path incremental appends.
  3. Higher-risk/high-impact:
    • Queue-based threaded streaming for ArrowScan.to_record_batches replacing per-task list(...) materialization.
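
As a starting point for discussion, queue-based threaded streaming could look roughly like the sketch below. This is not the ArrowScan implementation: `stream_results` and its parameters are hypothetical, it starts one thread per task for brevity, and it does not preserve ordering across tasks, so an order-preserving variant (for example, per-task buffering drained in task order) would still be needed to meet the ordering requirement.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def stream_results(
    tasks: Iterable[T],
    run_task: Callable[[T], Iterator[R]],
    max_buffered: int = 16,
) -> Iterator[R]:
    """Yield items from run_task(task) for every task as they are produced.

    Hypothetical helper. The bounded queue caps how many items are buffered,
    so producers block instead of materializing whole task outputs.
    """
    out: "queue.Queue[Tuple[str, object]]" = queue.Queue(maxsize=max_buffered)
    stop = threading.Event()

    def put(message: Tuple[str, object]) -> bool:
        # Retry with a timeout so a blocked producer notices cancellation.
        while not stop.is_set():
            try:
                out.put(message, timeout=0.05)
                return True
            except queue.Full:
                continue
        return False

    def worker(task: T) -> None:
        try:
            for item in run_task(task):
                if not put(("item", item)):
                    return  # consumer went away; stop producing
        except Exception as exc:
            put(("error", exc))  # forwarded and re-raised in the consumer
        finally:
            put(("done", None))

    threads = [threading.Thread(target=worker, args=(t,), daemon=True) for t in tasks]
    for th in threads:
        th.start()

    remaining = len(threads)
    try:
        while remaining:
            kind, payload = out.get()
            if kind == "done":
                remaining -= 1
            elif kind == "error":
                raise payload  # type: ignore[misc]
            else:
                yield payload  # type: ignore[misc]
    finally:
        # Runs on exhaustion, error, or early close (e.g. a limit was hit).
        stop.set()
        for th in threads:
            th.join()


def demo_task(task: int) -> Iterator[int]:
    """Hypothetical task: yields three values derived from the task number."""
    for j in range(3):
        yield task * 10 + j
```

Because the generator's `finally` block signals the workers, a consumer that stops early (for example after reaching a row limit) does not leave producers blocked on a full queue.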

Acceptance criteria

  • No regression in existing tests.
  • Memory benchmarks show reduced peak RSS for large scans/writes.
  • to_record_batches preserves row ordering and limit behavior.
  • Snapshot rollback/validation semantics unchanged.
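
For the memory criterion, a rough way to compare an eager and a streaming code path is sketched below. It uses `tracemalloc`, which reports peak Python-level allocations rather than process RSS, so it only approximates the benchmark described above. `make_chunks`, `consume_eager`, and `consume_streaming` are hypothetical stand-ins for a real scan or write path.

```python
import tracemalloc
from typing import Callable, Iterator


def make_chunks(n: int, size: int) -> Iterator[bytes]:
    """Hypothetical producer: yields n freshly allocated buffers."""
    for _ in range(n):
        yield bytes(size)


def consume_eager(n: int, size: int) -> int:
    chunks = list(make_chunks(n, size))  # all buffers alive at once
    return sum(len(c) for c in chunks)


def consume_streaming(n: int, size: int) -> int:
    # Each buffer can be freed before the next one is produced.
    return sum(len(c) for c in make_chunks(n, size))


def peak_traced_bytes(fn: Callable[[], int]) -> int:
    """Return the peak traced allocation size while fn runs."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


eager_peak = peak_traced_bytes(lambda: consume_eager(2000, 4096))
streaming_peak = peak_traced_bytes(lambda: consume_streaming(2000, 4096))
```

Both functions return the same total, so the comparison isolates the memory profile rather than the result.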
