GH-50326: [Python] Speed up to_pylist for list-like and string arrays by viirya · Pull Request #50327 · apache/arrow

viirya · 2026-07-01T23:30:31Z

Rationale for this change

Array.to_pylist() on list-typed arrays is 2.5–10x slower than converting the same array via to_pandas() and rebuilding Python lists from the resulting numpy arrays, even though to_pylist does strictly less work. The cause is the per-element conversion loop ([x.as_py() for x in self]): every row allocates a C++ Scalar (Array::GetScalar), a Python Scalar wrapper and, for list types, a Python Array wrapper for the row's values slice plus a fresh generator before recursing per element. Besides the allocation cost, these GC-tracked wrappers repeatedly trigger CPython collections that traverse the ever-growing result list (~20% of runtime in a sample profile; details in #50326).

This hit Apache Spark when it enabled Arrow-serialized Python UDFs by default (apache/spark#56940, apache/spark#56943); working around it via to_pandas() was rejected there because the pandas detour coerces list<int32> with nulls to numpy float64 ([1., nan, 3.] instead of [1, None, 3]).

Benchmarks (macOS arm64, Python 3.11; 2M rows of 2-element lists / 1M rows of nested lists):

benchmark	before	after	speedup
`list<string>` to_pylist	1.93 s	0.34 s	5.7x
`list<list<int32>>` to_pylist	2.10 s	0.65 s	3.2x
flat `string` to_pylist (4M)	0.83 s	0.05 s	16x

For reference, the pandas detour (to_pandas() + per-row tolist()) takes 0.75 s on the list<string> case, so to_pylist goes from 2.5x slower to ~2.2x faster.
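The ratios quoted above can be re-derived from the raw timings. The snippet below is only an arithmetic check on the numbers reported in this description (the dictionary keys and variable names are made up for the illustration); it reproduces the quoted figures up to rounding.

```python
# Timings in seconds, copied from the benchmark table and the pandas-detour figure.
timings = {
    "list<string>": (1.93, 0.34),
    "list<list<int32>>": (2.10, 0.65),
    "flat string (4M)": (0.83, 0.05),
}
pandas_detour = 0.75  # to_pandas() + per-row tolist() on the list<string> case

# Speedup = time before / time after.
speedups = {name: before / after for name, (before, after) in timings.items()}
for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x")

before, after = timings["list<string>"]
slower_before = before / pandas_detour  # how much slower to_pylist was than the detour
faster_after = pandas_detour / after    # how much faster to_pylist is after the change
print(f"before: {slower_before:.2f}x slower, after: {faster_after:.2f}x faster")
```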

What changes are included in this PR?

Bulk to_pylist overrides in array.pxi:

ListArray / LargeListArray / FixedSizeListArray: convert the referenced range of child values with a single recursive to_pylist call, then slice the resulting Python list per row using the raw offsets and the validity bitmap. No per-row Scalar, Python Array wrapper or generator. MapArray explicitly keeps the generic scalar-based path (association-tuple / maps_as_pydicts duplicate-key semantics), as do the list-view types (overlapping views must not share sublist objects).
StringArray / LargeStringArray: decode values directly from the data buffer (GetValue + PyUnicode_DecodeUTF8), which matches StringScalar.as_py (= str(buf, 'utf8')) exactly.
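The list-array bulk path described above can be modelled in pure Python. This is a minimal sketch of the idea, not the Cython code in array.pxi: the function name `list_array_to_pylist` is hypothetical, the child values are passed in as an already-converted Python list (standing in for the single recursive to_pylist call), and the validity bitmap is represented as a list of booleans.

```python
def list_array_to_pylist(child_values, offsets, validity=None):
    """Model of the bulk conversion for a list array (illustrative only).

    child_values: Python list of the child array's converted values
    offsets:      n + 1 non-decreasing ints; row i spans
                  child_values[offsets[i]:offsets[i + 1]]
    validity:     optional list of n bools standing in for the validity
                  bitmap; False marks a null row
    """
    n = len(offsets) - 1
    base = offsets[0]  # non-zero when the array is a slice of a larger one
    # Take only the referenced range of child values, once for all rows.
    values = child_values[base:offsets[n]]
    result = []
    for i in range(n):
        if validity is not None and not validity[i]:
            result.append(None)
        else:
            # Slice the already-converted values; no per-row wrapper objects.
            result.append(values[offsets[i] - base:offsets[i + 1] - base])
    return result


# list<int32> with a null element, a null row and an empty row
child = [1, None, 3, 4, 5]
offsets = [0, 3, 3, 3, 5]
validity = [True, False, True, True]
rows = list_array_to_pylist(child, offsets, validity)
print(rows)  # [[1, None, 3], None, [], [4, 5]]
```

Elements are never routed through a float representation, so the ints and None inside each row come out exactly as the child conversion produced them.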

Output is unchanged, including exact element types: None stays None, values inside numeric lists stay Python ints (never floats/NaN), strings/bytes are unchanged. ChunkedArray.to_pylist, Table.to_pylist and ListScalar.as_py delegate to Array.to_pylist and pick up the speedup automatically.

Follow-up candidates (not in this PR): leaf fast paths for primitive/binary/view types, a bulk path for maps and structs, or a general C++ ToPyList visitor covering all types.

Are these changes tested?

New test_to_pylist_bulk_paths compares the bulk paths against the per-scalar conversion ([x.as_py() for x in arr]) for list/large_list/fixed_size_list/nested/map/string/large_string arrays, including sliced, empty and all-null arrays, and asserts exact element types for list<int32> with nulls.
Existing suites pass: test_array.py, test_scalars.py, test_convert_builtin.py, test_table.py (1209 passed locally).
Additionally verified with a randomized differential test (8 leaf types x list/large_list/fixed_size_list/map, nested lists, list<struct>, list<map>, slices, both maps_as_pydicts modes, multibyte strings) with exact-type comparison: no differences.
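To illustrate why the tests compare exact element types rather than relying on ==, here is a small stand-alone sketch. The helper `same_exact` is hypothetical and is not the helper used in test_array.py; it only shows the kind of comparison that tells [1, None, 3] apart from a float-coerced result.

```python
import math


def same_exact(a, b):
    """True if a and b are equal and have identical types at every nesting level."""
    if type(a) is not type(b):
        return False
    if isinstance(a, (list, tuple)):
        return len(a) == len(b) and all(same_exact(x, y) for x, y in zip(a, b))
    if isinstance(a, dict):
        # Compare keys and values in order, including their types.
        return same_exact(list(a.items()), list(b.items()))
    if isinstance(a, float) and math.isnan(a):
        return math.isnan(b)  # NaN != NaN under ==, so handle it explicitly
    return a == b


expected = [[1, None, 3], None, []]
coerced = [[1.0, float("nan"), 3.0], None, []]  # shape of a float64-coerced result

print(expected[0][0] == coerced[0][0])  # True: plain == cannot tell 1 from 1.0
print(same_exact(expected, expected))   # True
print(same_exact(expected, coerced))    # False
```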

Are there any user-facing changes?

No behavior changes, only performance: to_pylist() on list-like and string arrays is several times faster.

GitHub Issue: [Python] to_pylist() on list-typed arrays is several times slower than converting via to_pandas() #50326

This pull request and its description were written by Isaac.

…arrays

Array.to_pylist() converts one element at a time: each row allocates a C++ Scalar (Array::GetScalar), a Python Scalar wrapper and, for list types, a Python Array wrapper for the row's values slice plus a fresh generator, before recursing per element. On top of the allocation cost itself, these GC-tracked wrappers repeatedly trigger collections that traverse the growing result list (~20% of runtime). This makes to_pylist on list-typed arrays several times slower than the bulk to_pandas conversion path.

Add bulk to_pylist overrides:

* ListArray / LargeListArray / FixedSizeListArray convert the referenced range of child values with a single recursive to_pylist call, then slice the resulting Python list per row using the raw offsets and the validity bitmap. MapArray keeps the generic scalar-based path (association-tuple / maps_as_pydicts duplicate-key semantics), as do the list-view types (overlapping views must not share sublist objects).
* StringArray / LargeStringArray decode values directly from the data buffer (GetValue + PyUnicode_DecodeUTF8), matching StringScalar.as_py (= str(buf, 'utf8')) exactly.

Semantics are unchanged; values inside numeric lists stay Python ints/None.

Benchmarks (M4 Max, 2M rows of 2-element lists / 1M rows nested): list<string> 1.93s -> 0.34s, list<list<int32>> 2.10s -> 0.65s, flat string (4M) 0.83s -> 0.05s.

Co-authored-by: Isaac

github-actions · 2026-07-01T23:30:53Z

⚠️ GitHub issue #50326 has been automatically assigned in GitHub to PR creator.

Copilot

Pull request overview

This PR improves pyarrow.Array.to_pylist() performance for list-like arrays and (large) string arrays by adding specialized bulk conversion implementations that avoid per-element Scalar allocation and wrapper overhead, while keeping output semantics unchanged.

Changes:

Add bulk to_pylist() implementations for ListArray, LargeListArray, and FixedSizeListArray that convert child values once and slice per row using offsets.
Add fast to_pylist() implementations for StringArray and LargeStringArray that decode directly from the value buffer.
Add a new test validating bulk-path results against the scalar-based reference across nested, sliced, empty, and all-null inputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
python/pyarrow/array.pxi	Adds type-specific `to_pylist()` fast paths for list-like and string arrays.
python/pyarrow/tests/test_array.py	Adds a regression/differential test to ensure bulk paths match scalar-based conversion.

viirya · 2026-07-02T00:43:05Z

+        n = arr.length()
+        result = []
+        # Decode values straight from the data buffer instead of creating
+        # a C++ Scalar and a Python Scalar wrapper per value (see GH-28694).
+        if arr.null_count() == 0:
+            for i in range(n):
+                data = arr.GetValue(i, &length)
+                result.append(
+                    cp.PyUnicode_DecodeUTF8(<const char*> data, length, NULL))
+        else:
+            for i in range(n):
+                if arr.IsNull(i):
+                    result.append(None)
+                else:
+                    data = arr.GetValue(i, &length)
+                    result.append(
+                        cp.PyUnicode_DecodeUTF8(<const char*> data, length, NULL))
+        return result


null_count() is a one-time vectorized popcount over the validity bitmap (~n/8 bytes, well under a millisecond for 2M rows), computed and cached per ArrayData. In exchange, the no-null branch skips the per-element IsNull() check entirely. Branching on null_bitmap_data() == NULL instead would save that single scan but degrade the common case of a sliced/combined array that has a bitmap yet contains no nulls in range: such an array would always take the per-element IsNull() path. So the current form should be the better trade-off in practice.
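The trade-off can be modelled in pure Python. This is a sketch under simplifying assumptions, not the Cython implementation: the function names are hypothetical, the string data is a bytes object with n + 1 offsets, the validity bitmap uses Arrow's least-significant-bit-first layout, and the bitmap is assumed to start at bit 0 (no slice offset).

```python
def is_valid(bitmap, i):
    # Arrow validity bitmaps are LSB-first: bit i lives in byte i // 8.
    return (bitmap[i >> 3] >> (i & 7)) & 1 == 1


def null_count(bitmap, n):
    """Count nulls among the first n slots with one pass over ~n/8 bytes."""
    if bitmap is None:
        return 0
    full, rest = divmod(n, 8)
    valid = sum(bin(byte).count("1") for byte in bitmap[:full])
    if rest:
        # Only the low `rest` bits of the last byte belong to the array.
        valid += bin(bitmap[full] & ((1 << rest) - 1)).count("1")
    return n - valid


def strings_to_pylist(data, offsets, bitmap=None):
    n = len(offsets) - 1
    result = []
    if null_count(bitmap, n) == 0:
        # No nulls in range: skip the per-element check even if a bitmap exists.
        for i in range(n):
            result.append(data[offsets[i]:offsets[i + 1]].decode("utf8"))
    else:
        for i in range(n):
            if is_valid(bitmap, i):
                result.append(data[offsets[i]:offsets[i + 1]].decode("utf8"))
            else:
                result.append(None)
    return result


data = "a".encode("utf8") + "héé".encode("utf8")  # 1 + 5 bytes
offsets = [0, 1, 6, 6]                            # "a", "héé", ""

print(strings_to_pylist(data, offsets, bytes([0b00000111])))  # bitmap present, no nulls
print(strings_to_pylist(data, offsets, bytes([0b00000101])))  # middle value is null
```

The first call has a bitmap but still takes the fast branch, which is the case that branching on the bitmap pointer alone would miss.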

gaogaotiantian · 2026-07-01T23:52:01Z

I'm not an expert in Cython, but I'm curious how result.append() works here. As long as result is not declared as a Cython list, I think result.append still treats it as a plain Python object? In that case, each append would trigger a dynamic attribute lookup, and the list would be reallocated several times as it grows.

It would be an interesting experiment to preallocate the full list before assigning the data, since we already know its length. A further step is to declare the return value as a list in Cython so it can optimize setitem even more.

Something like

cdef list result = [None] * n
cdef Py_ssize_t i

for i in range(n):
    result[i] = ...

return result

For a long list this might push the performance even further.

viirya · 2026-07-02T00:03:28Z

Good idea — I tried exactly that (cdef list result = [None] * n + indexed assignment, also letting null rows keep the prefilled None), but it benchmarks the same or slightly worse: list<string> 0.34 s → 0.35–0.37 s, list<list<int32>> 0.65 s → 0.69–0.70 s (best of 3, consistent across reruns).

Two reasons, I think: Cython already lowers result.append(x) on a local it can see is a list to a PyList_Append call, and CPython lists over-allocate geometrically, so the resize cost is amortized and tiny. Meanwhile the preallocated version pays an extra refcount round-trip per slot (every result[i] = x has to decref the prefilled None). The loops are dominated by creating the per-row slice objects / decoding UTF-8 either way, so I kept the simpler append form.
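The amortization argument can be observed from plain Python. The sketch below is not the Cython benchmark and makes no timing claims; it assumes CPython, where sys.getsizeof on a list reflects its allocated capacity, and all function names are made up for the illustration. It shows that both forms build the same list and that appending n items changes the allocation far fewer than n times.

```python
import sys


def build_with_append(items):
    result = []
    for x in items:
        result.append(x)
    return result


def build_preallocated(items):
    n = len(items)
    result = [None] * n
    for i in range(n):
        result[i] = items[i]  # replaces (and decrefs) the prefilled None
    return result


def count_resizes(n):
    """Count how often the list's allocation changes while appending n items.

    CPython-specific: relies on sys.getsizeof including the spare capacity.
    """
    lst = []
    resizes = 0
    last = sys.getsizeof(lst)
    for _ in range(n):
        lst.append(None)
        size = sys.getsizeof(lst)
        if size != last:
            resizes += 1
            last = size
    return resizes


items = [str(i) for i in range(1000)]
print(build_with_append(items) == build_preallocated(items))
print(count_resizes(100_000), "reallocations for 100000 appends")
```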

gaogaotiantian · 2026-07-02T00:24:31Z

Okay, if the benchmark is similar, this is good. Would declaring result as a list help, i.e. a cdef list with append? Just curious whether Cython already knows it's a list; maybe it does, and that's why it's fast.

viirya · 2026-07-02T00:47:52Z

Good question — Cython already knows. Its type inference marks result = [] as list, so the untyped version and an explicit cdef list result = [] compile to the identical generated C: both lower result.append(x) to Cython's inlined __Pyx_PyList_Append(), which appends in place via PyList_SET_ITEM while there's spare capacity and only falls back to PyList_Append (geometric resize) otherwise. I verified by cythonizing both variants side by side and diffing the generated C — the loop bodies are byte-identical. That's indeed why the append form is already fast, and why adding cdef list doesn't move the numbers.

Copilot AI review requested due to automatic review settings July 1, 2026 23:30

viirya requested review from AlenkaF, raulcd and rok as code owners July 1, 2026 23:30

Copilot started reviewing on behalf of viirya July 1, 2026 23:30

github-actions Bot added the Component: Python and awaiting committer review labels Jul 1, 2026

Copilot AI reviewed Jul 1, 2026

github-actions Bot added the awaiting changes label and removed the awaiting committer review label Jul 2, 2026

viirya wants to merge 1 commit into apache:main from viirya:GH-50326-python-bulk-to-pylist

3 participants
