
_task_to_record_batches stale PyArrow workaround causes SIGSEGV on Apple Silicon #3272

@cpixton9


Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

Description

pyiceberg/io/pyarrow.py contains the following workaround inside _task_to_record_batches:

# Temporary fix until PyArrow 21 is released ( https://github.com/apache/arrow/pull/46057 )
table = pa.Table.from_batches([current_batch])
table = table.filter(pyarrow_filter)
if table.num_rows == 0:
    continue
current_batch = table.combine_chunks().to_batches()[0]

PyArrow 21.0.0 was released on July 17, 2025. The condition described in the comment has been met, but the workaround was not removed. It is present in all released versions through the current latest (0.11.1).

Impact

On Apple Silicon Macs (M-series), all expression-based PyArrow filtering routes through pyarrow.acero._filter_table, which may crash with a fatal SIGSEGV. This affects both:

  • Native linux/arm64 containers (untested by us, but suspected based on the ARM64 memory crash reported in issue #1809, "Pyiceberg leaks memory on table write")
  • linux/amd64 containers running under Rosetta 2 on Apple Silicon (confirmed)

The crash is not limited to the stale workaround line. During debugging we found that pyarrow.acero._filter_table is also invoked by:

  • ds.Scanner.from_fragment(filter=pyarrow_filter, ...).to_batches() — the scanner applies the filter lazily through Acero during batch iteration
  • RecordBatch.filter(pyarrow_filter) — converts the batch to a Table internally and then calls through Acero

This means simply removing the stale Table.from_batches().filter() workaround is necessary but not sufficient: any code path that evaluates a pyarrow.compute.Expression via Acero may crash on this platform. The upsert() method is particularly affected because it calls expression-based filtering in multiple places beyond _task_to_record_batches.

Expression complexity correlates with crash probability. The match predicate passed to _task_to_record_batches is a disjunction built from the incoming DataFrame — one AND clause per unique join-key combination in df. Upserting a large batch (many key combinations) produces a large expression tree, and we observed that breaking the upsert into smaller per-metric calls appeared to delay the crash. This is consistent with Acero's memory allocation or execution plan complexity being the trigger: a more complex expression makes the crash more likely, but does not eliminate it.

Steps to reproduce

  1. Run on Apple Silicon Mac with a linux/amd64 Docker container (Rosetta 2) and PyArrow >= 21
  2. Call table.upsert(df, join_cols=[...]) on a table with existing data and a multi-column join key. We observed the crash with a four-column join key and roughly 10,000 unique join-key combinations.
  3. Observe SIGSEGV in pyarrow/acero.py::_filter_table

Expected behavior

No crash. The filter is already applied by the fragment scanner (when no positional deletes are present) and should not be reapplied via a secondary Acero call.

Related issues

PR #1621 implemented the correct fix (current_batch = current_batch.filter(pyarrow_filter)) and was merged February 13, 2025. It was subsequently reverted by PR #1901 due to an unrelated "list index out of range" issue with empty batch handling (issue #1804). Now that PyArrow 21 is available, the original approach from #1621 should be revisited. However, note that RecordBatch.filter(pyarrow_filter) also routes through Acero on the affected platform — the fix may need to evaluate the filter expression to a boolean array first and use RecordBatch.filter(boolean_array) instead, which bypasses Acero.

Workaround

We worked around the issue by replacing table.upsert() with a custom implementation that performs a full table scan with no row_filter (so pyarrow_filter is never constructed) and merges the result using polars before calling table.overwrite(). This avoids all Acero expression evaluation.

Environment

  • PyIceberg: 0.9.1 (confirmed present through 0.11.1)
  • PyArrow: 23.0.1
  • Platform: linux/amd64 Docker container on Apple Silicon M-series Mac via Rosetta 2 (OrbStack)
  • Native linux/arm64 suspected to be affected but not directly tested

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
