GH-50326: [Python] Speed up to_pylist for list-like and string arrays #50327

Open
viirya wants to merge 1 commit into apache:main from viirya:GH-50326-python-bulk-to-pylist

viirya commented Jul 1, 2026

Member

Rationale for this change

Array.to_pylist() on list-typed arrays is 2.5–10x slower than converting the same array via to_pandas() and rebuilding Python lists from the resulting numpy arrays, even though to_pylist does strictly less work. The cause is the per-element conversion loop ([x.as_py() for x in self]): every row allocates a C++ Scalar (Array::GetScalar), a Python Scalar wrapper and, for list types, a Python Array wrapper for the row's values slice plus a fresh generator before recursing per element. Besides the allocation cost, these GC-tracked wrappers repeatedly trigger CPython collections that traverse the ever-growing result list (~20% of runtime in a sample profile; details in #50326).

This hit Apache Spark when it enabled Arrow-serialized Python UDFs by default (apache/spark#56940, apache/spark#56943); working around it via to_pandas() was rejected there because the pandas detour coerces list<int32> with nulls to numpy float64 ([1., nan, 3.] instead of [1, None, 3]).

Benchmarks (macOS arm64, Python 3.11; 2M rows of 2-element lists / 1M rows of nested lists):

| benchmark | before | after | speedup |
|---|---|---|---|
| list<string> to_pylist | 1.93 s | 0.34 s | 5.7x |
| list<list<int32>> to_pylist | 2.10 s | 0.65 s | 3.2x |
| flat string to_pylist (4M) | 0.83 s | 0.05 s | 16x |

For reference, the pandas detour (to_pandas() + per-row tolist()) takes 0.75 s on the list<string> case, so to_pylist goes from 2.5x slower to ~2.2x faster.
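As a quick check of the arithmetic, the ratios can be recomputed from the timings quoted above. The variable names below are illustrative only; the numbers are copied from the table and from the pandas-detour figure.

```python
# Timings in seconds, copied from the benchmark table: (before, after).
timings = {
    "list<string>": (1.93, 0.34),
    "list<list<int32>>": (2.10, 0.65),
    "flat string (4M)": (0.83, 0.05),
}
# to_pandas() + per-row tolist() on the list<string> case.
PANDAS_DETOUR_S = 0.75

# Speedup of the new to_pylist over the old one, per benchmark.
speedups = {name: before / after for name, (before, after) in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")

before, after = timings["list<string>"]
slower_before = before / PANDAS_DETOUR_S  # old to_pylist relative to the detour
faster_after = PANDAS_DETOUR_S / after    # detour relative to the new to_pylist
print(f"before: {slower_before:.2f}x slower, after: {faster_after:.2f}x faster")
```

The first ratio comes out slightly above 2.5, which the description rounds down.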

What changes are included in this PR?

Bulk to_pylist overrides in array.pxi:

  • ListArray / LargeListArray / FixedSizeListArray: convert the referenced range of child values with a single recursive to_pylist call, then slice the resulting Python list per row using the raw offsets and the validity bitmap. No per-row Scalar, Python Array wrapper or generator. MapArray explicitly keeps the generic scalar-based path (association-tuple / maps_as_pydicts duplicate-key semantics), as do the list-view types (overlapping views must not share sublist objects).
  • StringArray / LargeStringArray: decode values directly from the data buffer (GetValue + PyUnicode_DecodeUTF8), which matches StringScalar.as_py (= str(buf, 'utf8')) exactly.
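The list-array bullet can be illustrated with a small pure-Python model. This is only a sketch of the idea, not the Cython code in array.pxi: the function name `bulk_list_to_pylist` is hypothetical, and the offsets, validity flags and child values are passed as plain Python sequences rather than Arrow buffers.

```python
def bulk_list_to_pylist(offsets, validity, child_values):
    """Model of the bulk path for a list<T> array.

    offsets      -- n + 1 non-decreasing ints indexing into the child values
    validity     -- n bools (False marks a null row), or None for "no nulls"
    child_values -- the flat child array, already a Python sequence here
    """
    n = len(offsets) - 1
    if n <= 0:
        return []
    start, end = offsets[0], offsets[-1]
    # One bulk conversion of the referenced child range. In the PR this is
    # the single recursive to_pylist call on the child array.
    flat = list(child_values[start:end])
    result = []
    for i in range(n):
        if validity is not None and not validity[i]:
            result.append(None)
        else:
            # Slice the already-converted Python list; no per-row wrappers.
            result.append(flat[offsets[i] - start:offsets[i + 1] - start])
    return result


# Rows: [1, None], null, [], [3]
child = [1, None, 3]
rows = bulk_list_to_pylist([0, 2, 2, 2, 3], [True, False, True, True], child)

# A sliced view of the last two rows reuses the same child values.
tail = bulk_list_to_pylist([2, 2, 3], [True, True], child)
```

Each row gets its own sublist object because slicing a Python list copies, which is consistent with the concern about list-view types sharing sublists when views overlap.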

Output is unchanged, including exact element types: None stays None, values inside numeric lists stay Python ints (never floats/NaN), strings/bytes are unchanged. ChunkedArray.to_pylist, Table.to_pylist and ListScalar.as_py delegate to Array.to_pylist and pick up the speedup automatically.

Follow-up candidates (not in this PR): leaf fast paths for primitive/binary/view types, a bulk path for maps and structs, or a general C++ ToPyList visitor covering all types.

Are these changes tested?

  • New test_to_pylist_bulk_paths compares the bulk paths against the per-scalar conversion ([x.as_py() for x in arr]) for list/large_list/fixed_size_list/nested/map/string/large_string arrays, including sliced, empty and all-null arrays, and asserts exact element types for list<int32> with nulls.
  • Existing suites pass: test_array.py, test_scalars.py, test_convert_builtin.py, test_table.py (1209 passed locally).
  • Additionally verified with a randomized differential test (8 leaf types x list/large_list/fixed_size_list/map, nested lists, list<struct>, list<map>, slices, both maps_as_pydicts modes, multibyte strings) with exact-type comparison: no differences.
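The shape of such a randomized differential test can be sketched in pure Python. This is an assumption-laden illustration, not the test in the PR: it models a list<int> column with plain offsets and validity lists, and all function names here are hypothetical.

```python
import random


def per_row(offsets, validity, child):
    # Reference path: build every row independently, element by element.
    out = []
    for i in range(len(offsets) - 1):
        if not validity[i]:
            out.append(None)
        else:
            out.append([child[j] for j in range(offsets[i], offsets[i + 1])])
    return out


def bulk(offsets, validity, child):
    # Bulk path: convert the referenced child range once, then slice per row.
    base = offsets[0]
    flat = list(child[base:offsets[-1]])
    return [
        flat[offsets[i] - base:offsets[i + 1] - base] if validity[i] else None
        for i in range(len(offsets) - 1)
    ]


def same_with_types(a, b):
    # Equality that also requires identical types, so 1 and 1.0 differ.
    if type(a) is not type(b):
        return False
    if isinstance(a, list):
        return len(a) == len(b) and all(
            same_with_types(x, y) for x, y in zip(a, b))
    return a == b


def random_case(rng):
    n = rng.randint(0, 20)
    offsets = [0]
    for _ in range(n):
        offsets.append(offsets[-1] + rng.randint(0, 4))
    child = [rng.choice([None, rng.randint(-5, 5)])
             for _ in range(offsets[-1])]
    validity = [rng.random() > 0.2 for _ in range(n)]
    # Take a random slice so non-zero starting offsets are exercised too.
    lo = rng.randint(0, n)
    hi = rng.randint(lo, n)
    return offsets[lo:hi + 1], validity[lo:hi], child


def count_mismatches(trials=500, seed=0):
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        offsets, validity, child = random_case(rng)
        if not same_with_types(per_row(offsets, validity, child),
                               bulk(offsets, validity, child)):
            mismatches += 1
    return mismatches
```

Comparing with `type(a) is not type(b)` rather than plain `==` is what would catch an int silently becoming a float, since `1 == 1.0` is true in Python.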

Are there any user-facing changes?

No behavior changes, only performance: to_pylist() on list-like and string arrays is several times faster.

This pull request and its description were written by Isaac.

…arrays

Array.to_pylist() converts one element at a time: each row allocates a
C++ Scalar (Array::GetScalar), a Python Scalar wrapper and, for list
types, a Python Array wrapper for the row's values slice plus a fresh
generator, before recursing per element. On top of the allocation cost
itself, these GC-tracked wrappers repeatedly trigger collections that
traverse the growing result list (~20% of runtime). This makes
to_pylist on list-typed arrays several times slower than the bulk
to_pandas conversion path.

Add bulk to_pylist overrides:

* ListArray / LargeListArray / FixedSizeListArray convert the
  referenced range of child values with a single recursive to_pylist
  call, then slice the resulting Python list per row using the raw
  offsets and the validity bitmap. MapArray keeps the generic
  scalar-based path (association-tuple / maps_as_pydicts duplicate-key
  semantics), as do the list-view types (overlapping views must not
  share sublist objects).
* StringArray / LargeStringArray decode values directly from the data
  buffer (GetValue + PyUnicode_DecodeUTF8), matching
  StringScalar.as_py (= str(buf, 'utf8')) exactly.

Semantics are unchanged; values inside numeric lists stay Python
ints/None. Benchmarks (M4 Max, 2M rows of 2-element lists / 1M rows
nested): list<string> 1.93s -> 0.34s, list<list<int32>> 2.10s -> 0.65s,
flat string (4M) 0.83s -> 0.05s.

Co-authored-by: Isaac
Copilot AI review requested due to automatic review settings July 1, 2026 23:30
@viirya viirya requested review from AlenkaF, raulcd and rok as code owners July 1, 2026 23:30
github-actions Bot commented Jul 1, 2026

⚠️ GitHub issue #50326 has been automatically assigned in GitHub to PR creator.

Copilot AI left a comment

Contributor

Pull request overview

This PR improves pyarrow.Array.to_pylist() performance for list-like arrays and (large) string arrays by adding specialized bulk conversion implementations that avoid per-element Scalar allocation and wrapper overhead, while keeping output semantics unchanged.

Changes:

  • Add bulk to_pylist() implementations for ListArray, LargeListArray, and FixedSizeListArray that convert child values once and slice per row using offsets.
  • Add fast to_pylist() implementations for StringArray and LargeStringArray that decode directly from the value buffer.
  • Add a new test validating bulk-path results against the scalar-based reference across nested, sliced, empty, and all-null inputs.
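The string fast path can be modeled in pure Python as follows. This is a sketch under stated assumptions: the helper names `encode_column` and `strings_to_pylist` are hypothetical, and the offsets and data buffers are ordinary Python objects standing in for the Arrow buffers that the Cython code reads via GetValue.

```python
def encode_column(values):
    """Build (offsets, data, validity) for a list of str-or-None values."""
    offsets, chunks, validity = [0], [], []
    for value in values:
        encoded = b"" if value is None else value.encode("utf8")
        chunks.append(encoded)
        offsets.append(offsets[-1] + len(encoded))
        validity.append(value is not None)
    return offsets, b"".join(chunks), validity


def strings_to_pylist(offsets, data, validity):
    """Decode each value straight from the shared data buffer."""
    view = memoryview(data)
    n = len(offsets) - 1
    result = []
    if all(validity):
        # No nulls in range: skip the per-element validity check entirely.
        for i in range(n):
            result.append(str(view[offsets[i]:offsets[i + 1]], "utf8"))
    else:
        for i in range(n):
            if not validity[i]:
                result.append(None)
            else:
                result.append(str(view[offsets[i]:offsets[i + 1]], "utf8"))
    return result


values = ["héllo", None, "", "日本"]
decoded = strings_to_pylist(*encode_column(values))
```

Decoding with `str(buffer, "utf8")` mirrors the `str(buf, 'utf8')` form that the description cites for StringScalar.as_py, so multibyte values round-trip the same way.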

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
python/pyarrow/array.pxi Adds type-specific to_pylist() fast paths for list-like and string arrays.
python/pyarrow/tests/test_array.py Adds a regression/differential test to ensure bulk paths match scalar-based conversion.

Comment thread python/pyarrow/array.pxi
Comment on lines +4178 to +4195
n = arr.length()
result = []
# Decode values straight from the data buffer instead of creating
# a C++ Scalar and a Python Scalar wrapper per value (see GH-28694).
if arr.null_count() == 0:
    for i in range(n):
        data = arr.GetValue(i, &length)
        result.append(
            cp.PyUnicode_DecodeUTF8(<const char*> data, length, NULL))
else:
    for i in range(n):
        if arr.IsNull(i):
            result.append(None)
        else:
            data = arr.GetValue(i, &length)
            result.append(
                cp.PyUnicode_DecodeUTF8(<const char*> data, length, NULL))
return result

Member Author

null_count() is a one-time vectorized popcount over the validity bitmap (~n/8 bytes, well under a millisecond for 2M rows), computed and cached per ArrayData. In exchange, the no-null branch skips the per-element IsNull() check entirely. Branching on null_bitmap_data() == NULL instead would save that single scan but degrade the common case of a sliced/combined array that has a bitmap yet contains no nulls in range — that would take the per-element IsNull() path forever. So the current form should be the better trade-off in practice.
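The trade-off described here can be sketched with a small model of a validity bitmap. It assumes the Arrow convention of least-significant-bit-first ordering within each byte; the function names are hypothetical, and the popcount is done on a Python integer rather than with the vectorized routine the comment refers to.

```python
def is_valid(bitmap, i):
    # Bit i lives in byte i // 8, at bit position i % 8 (LSB first).
    return (bitmap[i >> 3] >> (i & 7)) & 1 == 1


def null_count(bitmap, offset, length):
    """Count nulls in [offset, offset + length) with a single popcount."""
    if bitmap is None:
        return 0
    bits = int.from_bytes(bitmap, "little") >> offset
    bits &= (1 << length) - 1
    return length - bin(bits).count("1")


def convert(values, bitmap, offset, length):
    window = values[offset:offset + length]
    if null_count(bitmap, offset, length) == 0:
        # A bitmap may exist, but nothing in range is null: no per-element check.
        return list(window), "no-null loop"
    result = []
    for i, value in enumerate(window):
        result.append(value if is_valid(bitmap, offset + i) else None)
    return result, "null-checking loop"


values = list("abcdefgh")
bitmap = bytes([0b11110101])  # indices 1 and 3 are null

full = convert(values, bitmap, 0, 8)
sliced = convert(values, bitmap, 4, 4)
```

The sliced call shows the case the comment argues for: the array carries a bitmap, yet the range contains no nulls, so branching on the count keeps it on the faster loop, whereas branching on the mere presence of a bitmap would not.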

@gaogaotiantian

I'm not an expert in Cython but curious about how result.append() works. I think as long as result is not defined as a Cython list, result.append still works as a pure Python object? In that case, each append would trigger a dynamic attribute search and the list would be reallocated a few times during appending.

It would be an interesting experiment to do a full-allocation for the list before assigning the data, as we already know the length of the list. An extra step forward is to declare the return value as a list in Cython so it can optimize setitem even more.

Something like

cdef list result = [None] * n
cdef Py_ssize_t i

for i in range(n):
    result[i] = ...

return result

For a long list this might push the performance even further.

viirya commented Jul 2, 2026

Member Author

Good idea — I tried exactly that (cdef list result = [None] * n + indexed assignment, also letting null rows keep the prefilled None), but it benchmarks the same or slightly worse: list<string> 0.34 s → 0.35–0.37 s, list<list<int32>> 0.65 s → 0.69–0.70 s (best of 3, consistent across reruns).

Two reasons, I think: Cython already lowers result.append(x) on a local it can see is a list to a PyList_Append call, and CPython lists over-allocate geometrically, so the resize cost is amortized and tiny. Meanwhile the preallocated version pays an extra refcount round-trip per slot (every result[i] = x has to decref the prefilled None). The loops are dominated by creating the per-row slice objects / decoding UTF-8 either way, so I kept the simpler append form.
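The amortization argument can be observed from pure Python on CPython. The sketch below is only an illustration with hypothetical function names; it relies on `sys.getsizeof` reflecting the list's allocated capacity, which is a CPython implementation detail, and it says nothing about the Cython-level timings reported above.

```python
import sys


def count_resizes(n):
    """Append n items and count how often the list's allocation changes."""
    result = []
    resizes = 0
    last_size = sys.getsizeof(result)
    for i in range(n):
        result.append(i)
        size = sys.getsizeof(result)
        if size != last_size:
            resizes += 1
            last_size = size
    return resizes


def build_by_append(items):
    result = []
    for item in items:
        result.append(item)
    return result


def build_prefilled(items):
    # Prefill with None, then overwrite each slot by index.
    result = [None] * len(items)
    for i, item in enumerate(items):
        result[i] = item
    return result


n = 100_000
print(f"{n} appends caused {count_resizes(n)} reallocations")
```

Both builders return equal lists, so the choice between them is purely about cost; with reallocations this rare, the append form has little resize overhead left for preallocation to remove.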

@gaogaotiantian

Okay if the benchmark is similar this is good. Would defining result as a list help? Like a cdef list for it and do append. Just curious whether cython knows it's a list already - maybe it does and that's why it's fast.

github-actions Bot added the "awaiting changes" label and removed the "awaiting committer review" label Jul 2, 2026
viirya commented Jul 2, 2026

Member Author

Good question — Cython already knows. Its type inference marks result = [] as list, so the untyped version and an explicit cdef list result = [] compile to the identical generated C: both lower result.append(x) to Cython's inlined __Pyx_PyList_Append(), which appends in place via PyList_SET_ITEM while there's spare capacity and only falls back to PyList_Append (geometric resize) otherwise. I verified by cythonizing both variants side by side and diffing the generated C — the loop bodies are byte-identical. That's indeed why the append form is already fast, and why adding cdef list doesn't move the numbers.
