`pyiceberg/io/pyarrow.py` contains the following workaround inside `_task_to_record_batches`:
```python
# Temporary fix until PyArrow 21 is released ( https://github.com/apache/arrow/pull/46057 )
table = pa.Table.from_batches([current_batch])
table = table.filter(pyarrow_filter)
if table.num_rows == 0:
    continue
current_batch = table.combine_chunks().to_batches()[0]
```
PyArrow 21.0.0 was released on July 17, 2025. The condition described in the comment has been met, but the workaround was not removed. It is present in all released versions through the current latest (0.11.1).
Impact
On Apple Silicon Macs (M-series), all expression-based PyArrow filtering routes through `pyarrow.acero._filter_table`, which may crash with a fatal SIGSEGV. This affects both:
- `linux/arm64` containers (untested by us, but suspected based on the ARM64 memory crash in issue #1809, "Pyiceberg leaks memory on table write")
- `linux/amd64` containers running under Rosetta 2 on Apple Silicon — confirmed
The crash is not limited to the stale workaround line. During debugging we found that `pyarrow.acero._filter_table` is also invoked by:
- `ds.Scanner.from_fragment(filter=pyarrow_filter, ...).to_batches()` — the scanner applies the filter lazily through Acero during batch iteration
- `RecordBatch.filter(pyarrow_filter)` — converts the batch to a Table internally and then calls through Acero
This means simply removing the stale `Table.from_batches().filter()` workaround is necessary but not sufficient: any code path that evaluates a `pyarrow.compute.Expression` via Acero may crash on this platform. The `upsert()` method is particularly affected because it calls expression-based filtering in multiple places beyond `_task_to_record_batches`.
Expression complexity correlates with crash probability. The match predicate passed to `_task_to_record_batches` is a disjunction built from the incoming DataFrame — one `AND` clause per unique join-key combination in `df`. Upserting a large batch (many key combinations) produces a large expression tree, and we observed that breaking the upsert into smaller per-metric calls appeared to delay the crash. This is consistent with Acero's memory allocation or execution plan complexity being the trigger: a more complex expression makes the crash more likely, but does not eliminate it.
Steps to reproduce
1. Run on an Apple Silicon Mac with a `linux/amd64` Docker container (Rosetta 2) and PyArrow >= 21.
2. Call `table.upsert(df, join_cols=[...])` on a table with existing data and a multi-column join key. We observed the error with four join columns and ~10,000 unique combinations of join-key values.
3. Observe a SIGSEGV in `pyarrow/acero.py::_filter_table`.
Expected behavior
No crash. The filter is already applied by the fragment scanner (when no positional deletes are present) and should not be reapplied via a secondary Acero call.
Related issues
PR #1621 implemented the correct fix (`current_batch = current_batch.filter(pyarrow_filter)`) and was merged February 13, 2025. It was subsequently reverted by PR #1901 due to an unrelated "list index out of range" issue with empty batch handling (issue #1804). Now that PyArrow 21 is available, the original approach from #1621 should be revisited. However, note that `RecordBatch.filter(pyarrow_filter)` also routes through Acero on the affected platform — the fix may need to evaluate the filter expression to a boolean array first and use `RecordBatch.filter(boolean_array)` instead, which bypasses Acero.
Workaround
We worked around the issue by replacing `table.upsert()` with a custom implementation that performs a full table scan with no `row_filter` (so `pyarrow_filter` is never constructed) and merges the result using Polars before calling `table.overwrite()`. This avoids all Acero expression evaluation.
Environment
- PyIceberg: 0.9.1 (confirmed present through 0.11.1)
- PyArrow: 23.0.1
- Platform: linux/amd64 Docker container on an Apple Silicon M-series Mac via Rosetta 2 (OrbStack)
- Native linux/arm64 suspected to be affected but not directly tested
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time
Apache Iceberg version
0.11.0 (latest release)