# Spark Ticket Proposal

## Title
[SPARK-XXXXX][PYTHON] Remove SPARK-51112 workaround for empty table toPandas conversion

## Summary
Remove the workaround added in SPARK-51112 that bypasses PyArrow's `to_pandas()` for empty tables. This workaround is no longer necessary after SPARK-55056 fixed the root cause in `ArrayWriter.finish()`.

## Background

SPARK-51112 added a workaround in `python/pyspark/sql/pandas/conversion.py`:

```python
# SPARK-51112: If the table is empty, we avoid using pyarrow to_pandas to create the
# DataFrame, as it may fail with a segmentation fault.
if arrow_table.num_rows == 0:
    # For empty tables, create empty Series to preserve dtypes
    column_data = (
        pd.Series([], name=temp_col_names[i], dtype="object") for i in range(len(schema.fields))
    )
```

This workaround avoided a SIGSEGV when converting empty DataFrames with nested array columns to Pandas.

## Why Remove It

SPARK-55056 fixed the root cause: `ArrayWriter.finish()` now properly initializes the Arrow ListArray offset buffer even when `count == 0`. This ensures all Arrow data generated by Spark is valid, regardless of nesting depth or empty arrays.

**Tested scenarios that now work without the workaround:**
- Empty table with `Array<Array<String>>`
- Empty table with `Array<Array<Array<String>>>`
- PyArrow's native `empty_table().to_pandas()` with nested arrays
- All combinations of nested Array/Map structures

## Proposed Change

Remove the `if arrow_table.num_rows == 0` special case in `_convert_arrow_table_to_pandas()` and let all conversions go through PyArrow's standard path.

## Benefits
1. Simpler code - removes special case handling
2. Better type preservation - PyArrow's `to_pandas()` handles types more accurately than manually creating empty Series with `dtype="object"`
3. Consistent code path for empty and non-empty tables

## Testing
- Verify that the existing SPARK-51112 test case still passes
- Add a test for an empty table with a triple-nested array schema