GH-50326: [Python] Speed up to_pylist for list-like and string arrays#50327
viirya wants to merge 1 commit
…arrays

Array.to_pylist() converts one element at a time: each row allocates a C++ Scalar (Array::GetScalar), a Python Scalar wrapper and, for list types, a Python Array wrapper for the row's values slice plus a fresh generator, before recursing per element. On top of the allocation cost itself, these GC-tracked wrappers repeatedly trigger collections that traverse the growing result list (~20% of runtime). This makes to_pylist on list-typed arrays several times slower than the bulk to_pandas conversion path.

Add bulk to_pylist overrides:

* ListArray / LargeListArray / FixedSizeListArray convert the referenced range of child values with a single recursive to_pylist call, then slice the resulting Python list per row using the raw offsets and the validity bitmap. MapArray keeps the generic scalar-based path (association-tuple / maps_as_pydicts duplicate-key semantics), as do the list-view types (overlapping views must not share sublist objects).
* StringArray / LargeStringArray decode values directly from the data buffer (GetValue + PyUnicode_DecodeUTF8), matching StringScalar.as_py (= str(buf, 'utf8')) exactly.

Semantics are unchanged; values inside numeric lists stay Python ints/None.

Benchmarks (M4 Max, 2M rows of 2-element lists / 1M rows nested): list<string> 1.93s -> 0.34s, list<list<int32>> 2.10s -> 0.65s, flat string (4M) 0.83s -> 0.05s.

Co-authored-by: Isaac
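The bulk list conversion described in the commit message can be sketched in plain Python. This is only an illustration under stated assumptions: `list_array_to_pylist` is a hypothetical helper, and ordinary Python sequences stand in for the Arrow offsets buffer, validity bitmap, and child array. It is not the Cython code from the PR.

```python
from typing import Any, List, Optional, Sequence


def list_array_to_pylist(
    offsets: Sequence[int],
    validity: Optional[Sequence[bool]],
    child_values: List[Any],
) -> List[Optional[List[Any]]]:
    """Toy bulk conversion: plain Python stand-ins for Arrow buffers (hypothetical helper)."""
    n_rows = len(offsets) - 1
    # Only the child range referenced by this (possibly sliced) array is needed.
    # In the PR this flat list comes from one recursive to_pylist call on the
    # child range; here a list slice plays that role.
    start, end = offsets[0], offsets[n_rows]
    flat = child_values[start:end]

    result: List[Optional[List[Any]]] = []
    for i in range(n_rows):
        if validity is not None and not validity[i]:
            # Null row: emit None regardless of its offsets.
            result.append(None)
        else:
            # Rebase the raw offsets onto the converted child range.
            lo = offsets[i] - start
            hi = offsets[i + 1] - start
            result.append(flat[lo:hi])
    return result


rows = list_array_to_pylist(
    offsets=[0, 2, 2, 5],
    validity=[True, False, True],
    child_values=[1, 2, 3, None, 5],
)
print(rows)  # [[1, 2], None, [3, None, 5]]
```

Each row receives its own slice of the flat list, so rows never share sublist objects. With overlapping list views, care would be needed to keep that property, which is consistent with the commit message leaving the list-view types on the generic path.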
Pull request overview
This PR improves pyarrow.Array.to_pylist() performance for list-like arrays and (large) string arrays by adding specialized bulk conversion implementations that avoid per-element Scalar allocation and wrapper overhead, while keeping output semantics unchanged.
Changes:
- Add bulk to_pylist() implementations for ListArray, LargeListArray, and FixedSizeListArray that convert child values once and slice per row using offsets.
- Add fast to_pylist() implementations for StringArray and LargeStringArray that decode directly from the value buffer.
- Add a new test validating bulk-path results against the scalar-based reference across nested, sliced, empty, and all-null inputs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/pyarrow/array.pxi | Adds type-specific to_pylist() fast paths for list-like and string arrays. |
| python/pyarrow/tests/test_array.py | Adds a regression/differential test to ensure bulk paths match scalar-based conversion. |
    n = arr.length()
    result = []
    # Decode values straight from the data buffer instead of creating
    # a C++ Scalar and a Python Scalar wrapper per value (see GH-28694).
    if arr.null_count() == 0:
        for i in range(n):
            data = arr.GetValue(i, &length)
            result.append(
                cp.PyUnicode_DecodeUTF8(<const char*> data, length, NULL))
    else:
        for i in range(n):
            if arr.IsNull(i):
                result.append(None)
            else:
                data = arr.GetValue(i, &length)
                result.append(
                    cp.PyUnicode_DecodeUTF8(<const char*> data, length, NULL))
    return result
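For readers less familiar with the string layout, the hunk above can be approximated in pure Python. The sketch assumes a data buffer of concatenated UTF-8 bytes plus an offsets sequence; `decode_string_array` is a hypothetical name, and a list of booleans stands in for the validity bitmap. It mirrors the two-branch structure of the diff, not its actual implementation.

```python
from typing import List, Optional, Sequence


def decode_string_array(
    data: bytes,
    offsets: Sequence[int],
    validity: Optional[Sequence[bool]] = None,
) -> List[Optional[str]]:
    """Decode each value straight from the shared data buffer (hypothetical helper)."""
    n = len(offsets) - 1
    if validity is None:
        # No nulls: skip the per-element validity check entirely.
        return [str(data[offsets[i]:offsets[i + 1]], "utf8") for i in range(n)]
    return [
        str(data[offsets[i]:offsets[i + 1]], "utf8") if validity[i] else None
        for i in range(n)
    ]


# Three values: "héllo", a null, and "日本" (both strings are 6 bytes in UTF-8).
data = "héllo".encode("utf8") + "日本".encode("utf8")
decoded = decode_string_array(data, [0, 6, 6, 12], [True, False, True])
print(decoded)  # ['héllo', None, '日本']
```

The `str(buf, "utf8")` call is the same expression the PR description quotes for StringScalar.as_py, which is why the two paths are expected to agree on multibyte input.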
null_count() is a one-time vectorized popcount over the validity bitmap (~n/8 bytes, well under a millisecond for 2M rows), computed and cached per ArrayData. In exchange, the no-null branch skips the per-element IsNull() check entirely. Branching on null_bitmap_data() == NULL instead would save that single scan but degrade the common case of a sliced/combined array that has a bitmap yet contains no nulls in range — that would take the per-element IsNull() path forever. So the current form should be the better trade-off in practice.
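The trade-off in this reply can be illustrated with a small pure-Python model. It assumes Arrow's LSB-first validity bitmap (a set bit means the slot is valid); `is_valid`, `count_nulls`, and `convert` are hypothetical helpers, and the per-bit loop stands in for the vectorized popcount that the reply describes.

```python
from typing import Any, List, Optional, Sequence


def is_valid(bitmap: bytes, i: int) -> bool:
    # LSB-first bit order: bit i lives in byte i // 8, at position i % 8.
    return bool((bitmap[i >> 3] >> (i & 7)) & 1)


def count_nulls(bitmap: Optional[bytes], offset: int, length: int) -> int:
    # Stand-in for the one-time popcount over the referenced bitmap range.
    if bitmap is None:
        return 0
    return sum(1 for i in range(offset, offset + length) if not is_valid(bitmap, i))


def convert(values: Sequence[Any], bitmap: Optional[bytes],
            offset: int, length: int) -> List[Any]:
    if count_nulls(bitmap, offset, length) == 0:
        # Fast branch: no per-element validity check.
        return [values[i] for i in range(offset, offset + length)]
    return [values[i] if is_valid(bitmap, i) else None
            for i in range(offset, offset + length)]


values = ["a", "?", "c", "d", "e"]   # slot 1 is null, its stored value is unspecified
bitmap = bytes([0b00011101])         # bits 0, 2, 3, 4 set; bit 1 clear

print(count_nulls(bitmap, 0, 5))     # 1
print(count_nulls(bitmap, 2, 3))     # 0
print(convert(values, bitmap, 0, 5)) # ['a', None, 'c', 'd', 'e']
print(convert(values, bitmap, 2, 3)) # ['c', 'd', 'e']
```

The slice starting at offset 2 has a bitmap but no nulls in range, so the count-based check sends it down the fast branch, whereas a check on bitmap presence alone would not.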
I'm not an expert in Cython but curious about how It would be an interesting experiment to do a full allocation for the list before assigning the data, as we already know the length of the list. An extra step forward is to declare the return value as a list in Cython so it can optimize Something like

    cdef list result = [None] * n
    cdef Py_ssize_t i
    for i in range(n):
        result[i] = ...
    return result

For a long list this might push the performance even further.
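A rough way to try the suggestion outside Cython is a pure-Python comparison of the two construction styles. This is only a sketch: `build_with_append` and `build_preallocated` are hypothetical names, `i * 2` stands in for the decoded value, and interpreter-level timings need not carry over to compiled Cython code.

```python
import timeit


def build_with_append(n: int) -> list:
    # Grow the list one element at a time.
    result = []
    for i in range(n):
        result.append(i * 2)
    return result


def build_preallocated(n: int) -> list:
    # Allocate all slots up front, then assign by index.
    result = [None] * n
    for i in range(n):
        result[i] = i * 2
    return result


n = 200_000
t_append = timeit.timeit(lambda: build_with_append(n), number=3)
t_prealloc = timeit.timeit(lambda: build_preallocated(n), number=3)
print(f"append:       {t_append:.3f} s")
print(f"preallocated: {t_prealloc:.3f} s")
```

Both functions return the same list, so any difference is purely in construction cost. Measuring on the actual Cython code path is the only reliable way to decide between them.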
Good idea, I tried exactly that ( Two reasons, I think: Cython already lowers
Okay, if the benchmark is similar, this is good. Would declaring result as a list help, i.e. a cdef list that is then appended to? Just curious whether Cython already knows it's a list; maybe it does, and that's why it's fast.
Good question: Cython already knows. Its type inference marks
Rationale for this change

Array.to_pylist() on list-typed arrays is 2.5–10x slower than converting the same array via to_pandas() and rebuilding Python lists from the resulting numpy arrays, even though to_pylist does strictly less work. The cause is the per-element conversion loop ([x.as_py() for x in self]): every row allocates a C++ Scalar (Array::GetScalar), a Python Scalar wrapper and, for list types, a Python Array wrapper for the row's values slice plus a fresh generator before recursing per element. Besides the allocation cost, these GC-tracked wrappers repeatedly trigger CPython collections that traverse the ever-growing result list (~20% of runtime in a sample profile; details in #50326).

This hit Apache Spark when it enabled Arrow-serialized Python UDFs by default (apache/spark#56940, apache/spark#56943); working around it via to_pandas() was rejected there because the pandas detour coerces list<int32> with nulls to numpy float64 ([1., nan, 3.] instead of [1, None, 3]).

Benchmarks (macOS arm64, Python 3.11; 2M rows of 2-element lists / 1M rows of nested lists):
| Case | Before | After |
|---|---|---|
| list<string> to_pylist | 1.93 s | 0.34 s |
| list<list<int32>> to_pylist | 2.10 s | 0.65 s |
| string to_pylist (4M) | 0.83 s | 0.05 s |

For reference, the pandas detour (to_pandas() + per-row tolist()) takes 0.75 s on the list<string> case, so to_pylist goes from 2.5x slower to ~2.2x faster.

What changes are included in this PR?
Bulk to_pylist overrides in array.pxi:

- ListArray / LargeListArray / FixedSizeListArray: convert the referenced range of child values with a single recursive to_pylist call, then slice the resulting Python list per row using the raw offsets and the validity bitmap. No per-row Scalar, Python Array wrapper or generator. MapArray explicitly keeps the generic scalar-based path (association-tuple / maps_as_pydicts duplicate-key semantics), as do the list-view types (overlapping views must not share sublist objects).
- StringArray / LargeStringArray: decode values directly from the data buffer (GetValue + PyUnicode_DecodeUTF8), which matches StringScalar.as_py (= str(buf, 'utf8')) exactly.

Output is unchanged, including exact element types: None stays None, values inside numeric lists stay Python ints (never floats/NaN), strings/bytes are unchanged. ChunkedArray.to_pylist, Table.to_pylist and ListScalar.as_py delegate to Array.to_pylist and pick up the speedup automatically.

Follow-up candidates (not in this PR): leaf fast paths for primitive/binary/view types, a bulk path for maps and structs, or a general C++ ToPyList visitor covering all types.

Are these changes tested?
- test_to_pylist_bulk_paths compares the bulk paths against the per-scalar conversion ([x.as_py() for x in arr]) for list/large_list/fixed_size_list/nested/map/string/large_string arrays, including sliced, empty and all-null arrays, and asserts exact element types for list<int32> with nulls.
- test_array.py, test_scalars.py, test_convert_builtin.py, test_table.py (1209 passed locally).
- …maps_as_pydicts modes, multibyte strings) with exact-type comparison: no differences.

Are there any user-facing changes?
No behavior changes, only performance: to_pylist() on list-like and string arrays is several times faster.

This pull request and its description were written by Isaac.
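The exact-type comparison mentioned under "Are these changes tested?" matters because plain Python equality treats 1 and 1.0 as equal, so it would not notice an int silently becoming a float. A minimal sketch of such a check follows; `same_with_types` is a hypothetical helper, not the comparison used in the PR's test.

```python
from typing import Any


def same_with_types(a: Any, b: Any) -> bool:
    """Recursive equality that also requires identical element types (hypothetical helper)."""
    if type(a) is not type(b):
        return False
    if isinstance(a, (list, tuple)):
        return len(a) == len(b) and all(
            same_with_types(x, y) for x, y in zip(a, b)
        )
    return a == b


reference = [[1, None, 3], None, []]    # what the per-scalar path would return
coerced = [[1.0, None, 3.0], None, []]  # same values after an int -> float coercion

print(reference == coerced)                  # True: plain equality misses the change
print(same_with_types(reference, coerced))   # False: the type check catches it
print(same_with_types(reference, [[1, None, 3], None, []]))  # True
```

A differential test built this way compares the bulk result against the per-scalar reference for each input array and fails on any type drift, not only on value differences.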