
Commit 1f8c172

fix: add the same for reset
1 parent 9c926d3 commit 1f8c172

4 files changed

Lines changed: 92 additions & 0 deletions


SPARK_TICKET.txt

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
Title: [PYTHON] Remove redundant _accumulatorRegistry.clear() call in worker.py

Type: Improvement

---

In {{worker.py}}, {{_accumulatorRegistry.clear()}} is called twice with no accumulator-modifying code in between:

{code:python}
shuffle.MemoryBytesSpilled = 0
shuffle.DiskBytesSpilled = 0
_accumulatorRegistry.clear()  # first call

setup_spark_files(infile)
setup_broadcasts(infile)

_accumulatorRegistry.clear()  # second call (redundant)
{code}

Neither {{setup_spark_files}} nor {{setup_broadcasts}} adds anything to {{_accumulatorRegistry}}, so the registry is already empty when the second {{clear()}} runs, making it redundant.

This happened because SPARK-3463 (2014) added the first {{clear()}} and SPARK-3030 (2014) added the second. When SPARK-44533 (2023) refactored the code to extract {{setup_spark_files}} and {{setup_broadcasts}}, both {{clear()}} calls were preserved even though one of them had become redundant.
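
For illustration, a sketch of the block after the proposed cleanup (hedged; it assumes the surrounding {{worker.py}} code is exactly as quoted above):

{code:python}
shuffle.MemoryBytesSpilled = 0
shuffle.DiskBytesSpilled = 0
_accumulatorRegistry.clear()  # single remaining call

setup_spark_files(infile)
setup_broadcasts(infile)
{code}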

SPARK_TICKET_51112_CLEANUP.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
# Spark Ticket Proposal

## Title

[SPARK-XXXXX][PYTHON] Remove SPARK-51112 workaround for empty table toPandas conversion

## Summary

Remove the workaround added in SPARK-51112 that bypasses PyArrow's `to_pandas()` for empty tables. This workaround is no longer necessary after SPARK-55056 fixed the root cause in `ArrayWriter.finish()`.

## Background

SPARK-51112 added a workaround in `python/pyspark/sql/pandas/conversion.py`:

```python
# SPARK-51112: If the table is empty, we avoid using pyarrow to_pandas to create the
# DataFrame, as it may fail with a segmentation fault.
if arrow_table.num_rows == 0:
    # For empty tables, create empty Series to preserve dtypes
    column_data = (
        pd.Series([], name=temp_col_names[i], dtype="object")
        for i in range(len(schema.fields))
    )
```

This workaround avoided a SIGSEGV when converting empty DataFrames with nested array columns to pandas.

## Why Remove It

SPARK-55056 fixed the root cause: `ArrayWriter.finish()` now properly initializes the Arrow ListArray offset buffer even when `count == 0`. This ensures that all Arrow data generated by Spark is valid, regardless of nesting depth or empty arrays.

**Tested scenarios that now work without the workaround** (see the reproduction sketch below):

- Empty table with `Array<Array<String>>`
- Empty table with `Array<Array<Array<String>>>`
- PyArrow's native `empty_table().to_pandas()` with nested arrays
- All combinations of nested Array/Map structures
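
A minimal reproduction sketch for the third scenario (hedged; assumes only `pyarrow` and `pandas` are installed, and the column name `col` is illustrative):

```python
import pyarrow as pa

# Empty table with a doubly nested list column. Spark-generated Arrow data of
# this shape could previously crash to_pandas(); PyArrow's own empty_table()
# builds a valid table and should convert cleanly.
schema = pa.schema([("col", pa.list_(pa.list_(pa.string())))])
df = schema.empty_table().to_pandas()
print(len(df), df.dtypes.tolist())  # 0 rows, one 'object' column
```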
## Proposed Change

Remove the `if arrow_table.num_rows == 0` special case in `_convert_arrow_table_to_pandas()` and let all conversions go through PyArrow's standard path.
## Benefits

1. Simpler code - removes the special-case handling
2. Better type preservation - PyArrow's `to_pandas()` handles types more accurately than manually creating empty Series with `dtype="object"` (illustrated below)
3. Consistent code path for empty and non-empty tables
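
A small illustration of benefit 2 (hedged; standalone PyArrow rather than the Spark code path):

```python
import pandas as pd
import pyarrow as pa

schema = pa.schema([("a", pa.int64()), ("b", pa.string())])

# PyArrow path: dtypes follow the Arrow schema.
print(schema.empty_table().to_pandas().dtypes)  # a: int64, b: object

# Manual path mirroring the workaround: every column collapses to 'object'.
manual = pd.DataFrame({f.name: pd.Series([], dtype="object") for f in schema})
print(manual.dtypes)  # a: object, b: object
```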
## Testing

- Verify that the existing SPARK-51112 test case still passes
- Add a test for an empty table with a triple-nested array schema

SPARK_TICKET_51112_CLEANUP.txt

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
Title: [FOLLOWUP] Remove SPARK-51112 workaround after SPARK-55056 fix

Type: Improvement

Component: PySpark

---

SPARK-51112 added a workaround in {{_convert_arrow_table_to_pandas()}} to avoid a segfault when converting empty tables with nested array columns:

{code:python}
# SPARK-51112: If the table is empty, we avoid using pyarrow to_pandas to create the
# DataFrame, as it may fail with a segmentation fault.
if arrow_table.num_rows == 0:
    column_data = (
        pd.Series([], name=temp_col_names[i], dtype="object")
        for i in range(len(schema.fields))
    )
{code}

This workaround is no longer necessary after SPARK-55056, which fixed the root cause in {{ArrayWriter.finish()}} by properly initializing the Arrow ListArray offset buffer when {{count == 0}}.

Proposal: Remove the SPARK-51112 workaround and let pyarrow handle empty tables directly.
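
A quick end-to-end check of the proposal (hedged sketch; assumes a local {{SparkSession}} and the Arrow-based {{toPandas}} path):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Empty DataFrame with a nested array column; with the workaround removed,
# this should flow through pyarrow's standard to_pandas() without crashing.
schema = StructType([StructField("col", ArrayType(ArrayType(StringType())))])
print(spark.createDataFrame([], schema).toPandas())
{code}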

sql/catalyst/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala

Lines changed: 2 additions & 0 deletions
@@ -413,6 +413,8 @@ private[arrow] class ArrayWriter(

   override def reset(): Unit = {
     super.reset()
+    // Re-initialize offset buffer after reset (see constructor comment)
+    valueVector.getOffsetBuffer.setInt(0, 0)
     elementWriter.reset()
   }
 }
