
Commit 1f8c172

fix: add the same for reset
1 parent 9c926d3 commit 1f8c172

4 files changed

Lines changed: 92 additions & 0 deletions


SPARK_TICKET.txt

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
Title: [PYTHON] Remove redundant _accumulatorRegistry.clear() call in worker.py

Type: Improvement

---

In {{worker.py}}, {{_accumulatorRegistry.clear()}} is called twice with no accumulator-modifying code in between:

{code:python}
shuffle.MemoryBytesSpilled = 0
shuffle.DiskBytesSpilled = 0
_accumulatorRegistry.clear()  # first call

setup_spark_files(infile)
setup_broadcasts(infile)

_accumulatorRegistry.clear()  # second call (redundant)
{code}

Neither {{setup_spark_files}} nor {{setup_broadcasts}} adds anything to {{_accumulatorRegistry}}, so the registry is already empty when the second {{clear()}} runs, making it redundant.

This happened because SPARK-3463 (2014) added the first {{clear()}} and SPARK-3030 (2014) added the second. When SPARK-44533 (2023) refactored the code to extract {{setup_spark_files}} and {{setup_broadcasts}}, both {{clear()}} calls were preserved even though one of them had become redundant.
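
For illustration, a sketch of the block after the proposed cleanup (hedged; it assumes the surrounding {{worker.py}} code is exactly as quoted above):

{code:python}
shuffle.MemoryBytesSpilled = 0
shuffle.DiskBytesSpilled = 0
_accumulatorRegistry.clear()  # single remaining call

setup_spark_files(infile)
setup_broadcasts(infile)
{code}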

SPARK_TICKET_51112_CLEANUP.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
# Spark Ticket Proposal

## Title

[SPARK-XXXXX][PYTHON] Remove SPARK-51112 workaround for empty table toPandas conversion

## Summary

Remove the workaround added in SPARK-51112 that bypasses PyArrow's `to_pandas()` for empty tables. This workaround is no longer necessary after SPARK-55056 fixed the root cause in `ArrayWriter.finish()`.

## Background

SPARK-51112 added a workaround in `python/pyspark/sql/pandas/conversion.py`:

```python
# SPARK-51112: If the table is empty, we avoid using pyarrow to_pandas to create the
# DataFrame, as it may fail with a segmentation fault.
if arrow_table.num_rows == 0:
    # For empty tables, create empty Series to preserve dtypes
    column_data = (
        pd.Series([], name=temp_col_names[i], dtype="object")
        for i in range(len(schema.fields))
    )
```

This workaround avoided a SIGSEGV when converting empty DataFrames with nested array columns to pandas.

## Why Remove It

SPARK-55056 fixed the root cause: `ArrayWriter.finish()` now properly initializes the Arrow ListArray offset buffer even when `count == 0`. This ensures that all Arrow data generated by Spark is valid, regardless of nesting depth or empty arrays.

**Tested scenarios that now work without the workaround** (see the reproduction sketch below):

- Empty table with `Array<Array<String>>`
- Empty table with `Array<Array<Array<String>>>`
- PyArrow's native `empty_table().to_pandas()` with nested arrays
- All combinations of nested Array/Map structures
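
A minimal reproduction sketch for the third scenario (hedged; assumes only `pyarrow` and `pandas` are installed, and the column name `col` is illustrative):

```python
import pyarrow as pa

# Empty table with a doubly nested list column. Spark-generated Arrow data of
# this shape could previously crash to_pandas(); PyArrow's own empty_table()
# builds a valid table and should convert cleanly.
schema = pa.schema([("col", pa.list_(pa.list_(pa.string())))])
df = schema.empty_table().to_pandas()
print(len(df), df.dtypes.tolist())  # 0 rows, one 'object' column
```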
## Proposed Change

Remove the `if arrow_table.num_rows == 0` special case in `_convert_arrow_table_to_pandas()` and let all conversions go through PyArrow's standard path.
## Benefits

1. Simpler code - removes the special-case handling
2. Better type preservation - PyArrow's `to_pandas()` handles types more accurately than manually creating empty Series with `dtype="object"` (illustrated below)
3. Consistent code path for empty and non-empty tables
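
A small illustration of benefit 2 (hedged; standalone PyArrow rather than the Spark code path):

```python
import pandas as pd
import pyarrow as pa

schema = pa.schema([("a", pa.int64()), ("b", pa.string())])

# PyArrow path: dtypes follow the Arrow schema.
print(schema.empty_table().to_pandas().dtypes)  # a: int64, b: object

# Manual path mirroring the workaround: every column collapses to 'object'.
manual = pd.DataFrame({f.name: pd.Series([], dtype="object") for f in schema})
print(manual.dtypes)  # a: object, b: object
```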
## Testing

- Verify that the existing SPARK-51112 test case still passes
- Add a test for an empty table with a triple-nested array schema

SPARK_TICKET_51112_CLEANUP.txt

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
Title: [FOLLOWUP] Remove SPARK-51112 workaround after SPARK-55056 fix

Type: Improvement

Component: PySpark

---

SPARK-51112 added a workaround in {{_convert_arrow_table_to_pandas()}} to avoid a segfault when converting empty tables with nested array columns:

{code:python}
# SPARK-51112: If the table is empty, we avoid using pyarrow to_pandas to create the
# DataFrame, as it may fail with a segmentation fault.
if arrow_table.num_rows == 0:
    column_data = (
        pd.Series([], name=temp_col_names[i], dtype="object")
        for i in range(len(schema.fields))
    )
{code}

This workaround is no longer necessary after SPARK-55056, which fixed the root cause in {{ArrayWriter.finish()}} by properly initializing the Arrow ListArray offset buffer when {{count == 0}}.

Proposal: Remove the SPARK-51112 workaround and let pyarrow handle empty tables directly.
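
A quick end-to-end check of the proposal (hedged sketch; assumes a local {{SparkSession}} and the Arrow-based {{toPandas}} path):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Empty DataFrame with a nested array column; with the workaround removed,
# this should flow through pyarrow's standard to_pandas() without crashing.
schema = StructType([StructField("col", ArrayType(ArrayType(StringType())))])
print(spark.createDataFrame([], schema).toPandas())
{code}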

sql/catalyst/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala

Lines changed: 2 additions & 0 deletions
@@ -413,6 +413,8 @@ private[arrow] class ArrayWriter(

   override def reset(): Unit = {
     super.reset()
+    // Re-initialize offset buffer after reset (see constructor comment)
+    valueVector.getOffsetBuffer.setInt(0, 0)
     elementWriter.reset()
   }
 }
