
_task_to_record_batches stale PyArrow workaround causes SIGSEGV on Apple Silicon #3272

@cpixton9


Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

Description

pyiceberg/io/pyarrow.py contains the following workaround inside _task_to_record_batches:

# Temporary fix until PyArrow 21 is released ( https://github.com/apache/arrow/pull/46057 )
table = pa.Table.from_batches([current_batch])
table = table.filter(pyarrow_filter)
if table.num_rows == 0:
    continue
current_batch = table.combine_chunks().to_batches()[0]

PyArrow 21.0.0 was released on July 17, 2025. The condition described in the comment has been met, but the workaround was not removed. It is present in all released versions through the current latest (0.11.1).

Impact

On Apple Silicon Macs (M-series), all expression-based PyArrow filtering routes through pyarrow.acero._filter_table, which may crash with a fatal SIGSEGV. This affects both:

  • Native linux/arm64 containers (untested by us, but suspected based on the ARM64 memory crash reported in issue #1809, "Pyiceberg leaks memory on table write")
  • linux/amd64 containers running under Rosetta 2 on Apple Silicon (confirmed)

The crash is not limited to the stale workaround line. During debugging we found that pyarrow.acero._filter_table is also invoked by:

  • ds.Scanner.from_fragment(filter=pyarrow_filter, ...).to_batches() — the scanner applies the filter lazily through Acero during batch iteration
  • RecordBatch.filter(pyarrow_filter) — converts the batch to a Table internally and then calls through Acero

This means simply removing the stale Table.from_batches().filter() workaround is necessary but not sufficient: any code path that evaluates a pyarrow.compute.Expression via Acero may crash on this platform. The upsert() method is particularly affected because it calls expression-based filtering in multiple places beyond _task_to_record_batches.

Expression complexity correlates with crash probability. The match predicate passed to _task_to_record_batches is a disjunction built from the incoming DataFrame — one AND clause per unique join-key combination in df. Upserting a large batch (many key combinations) produces a large expression tree, and we observed that breaking the upsert into smaller per-metric calls appeared to delay the crash. This is consistent with Acero's memory allocation or execution plan complexity being the trigger: a more complex expression makes the crash more likely, but does not eliminate it.

Steps to reproduce

  1. Run on Apple Silicon Mac with a linux/amd64 Docker container (Rosetta 2) and PyArrow >= 21
  2. Call table.upsert(df, join_cols=[...]) on a table with existing data and a multi-column join key. We observed the crash with a four-column join key and roughly 10,000 unique join-key combinations.
  3. Observe SIGSEGV in pyarrow/acero.py::_filter_table

Expected behavior

No crash. The filter is already applied by the fragment scanner (when no positional deletes are present) and should not be reapplied via a secondary Acero call.

Related issues

PR #1621 implemented the correct fix (current_batch = current_batch.filter(pyarrow_filter)) and was merged February 13, 2025. It was subsequently reverted by PR #1901 due to an unrelated "list index out of range" issue with empty batch handling (issue #1804). Now that PyArrow 21 is available, the original approach from #1621 should be revisited. However, note that RecordBatch.filter(pyarrow_filter) also routes through Acero on the affected platform — the fix may need to evaluate the filter expression to a boolean array first and use RecordBatch.filter(boolean_array) instead, which bypasses Acero.

Workaround

We worked around the issue by replacing table.upsert() with a custom implementation that performs a full table scan with no row_filter (so pyarrow_filter is never constructed) and merges the result using polars before calling table.overwrite(). This avoids all Acero expression evaluation.

Environment

  • PyIceberg: 0.9.1 (confirmed present through 0.11.1)
  • PyArrow: 23.0.1
  • Platform: linux/amd64 Docker container on Apple Silicon M-series Mac via Rosetta 2 (OrbStack)
  • Native linux/arm64 suspected to be affected but not directly tested

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
