Describe the enhancement requested
pa.Array.to_pylist() on list-typed arrays is 2.5–10x slower than converting the
same array to pandas and then turning the resulting numpy arrays back into Python
lists — even though to_pylist does strictly less work conceptually.
This matters in practice: Apache Spark switched regular Python UDFs to Arrow
serialization by default and hit a performance regression on array columns caused
by this (see apache/spark#56940, apache/spark#56943). Working around it in Spark
via the pandas detour was rejected because it introduces type-coercion bugs
(e.g. list<int32> with a null element comes back as numpy float64
[1., nan, 3.] instead of [1, None, 3]), so the right fix is making
to_pylist() itself fast.
Reproduction (pyarrow 24.0.0, Python 3.11, macOS arm64; same numbers on current master)
import pyarrow as pa
N = 2_000_000
arr = pa.array([[f"s{j}", f"t{j}"] for j in range(N)], type=pa.list_(pa.string()))
arr.to_pylist() # 1.97 s
arr.to_pandas() # 0.46 s (4.3x faster, does MORE work)
[x.tolist() for x in arr.to_pandas()] # 0.78 s (2.5x faster incl. ndarray->list)
arr.values.to_pylist() # 0.82 s (4M flat strings)
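# nested: 1M rows of [[j, j+1], [j+2]] as list<list<int32>>
nested = pa.array([[[j, j + 1], [j + 2]] for j in range(1_000_000)], type=pa.list_(pa.list_(pa.int32())))
nested.to_pylist() # 2.00 s
nested.to_pandas() # 0.20 s (10x faster)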
Root cause
Array.to_pylist is implemented as a per-element scalar conversion
(python/pyarrow/array.pxi):
return [x.as_py(maps_as_pydicts=maps_as_pydicts) for x in self]
For a list<string> array, every row pays for:
1. Array.__iter__ → getitem(i) → C++ arrow::Array::GetScalar(i), which
   allocates a ListScalar holding a sliced values array;
2. a Python Scalar wrapper (Scalar.wrap);
3. ListScalar.as_py → the values property wraps the slice in a new Python
   Array object (pyarrow_wrap_array), then recursively calls .to_pylist()
   on it, which allocates a fresh generator and repeats 1–2 for every element,
   where C++ GetScalar on a string array copies each value into a
   std::string, wraps it in a Buffer and allocates a StringScalar.
A sample profile of the repro shows where the time goes (~8365 samples):
- ~20% CPython GC (gc_collect_main): the per-row generator/Scalar/Array
  allocations are GC-tracked and repeatedly trigger collections that traverse
  the ever-growing result list;
- ~25% C++ Array::GetScalar (per-element scalar allocation + per-row values
  slicing);
- most of the rest is Python wrapper allocation and method dispatch
  (Scalar.wrap, ListScalar.values → pyarrow_wrap_array, as_py calls);
- the useful work, actually creating the 4M str objects (unicode_new),
  is only ~7% of samples.
This was diagnosed back in 2021 in #28694 (ARROW-12976): maintainers agreed the
fix is to bypass Scalar creation entirely, but the issue was closed as stale in
Feb 2026 without a fix. #28689 is related.
Prototype fix and results
A ~250-line Cython-level prototype on master (no C++ changes) gives:
| benchmark (2M / 1M rows) | master | patched | speedup |
|---|---|---|---|
| list<string> to_pylist | 1.93 s | 0.34 s | 5.7x |
| list<list<int32>> to_pylist | 2.10 s | 0.65 s | 3.2x |
| flat string to_pylist (4M) | 0.83 s | 0.05 s | 16x |
i.e. to_pylist becomes ~2.2x faster than the pandas detour
(0.75 s) instead of 2.5x slower.
Two independent parts:
- Bulk list conversion: to_pylist overrides on ListArray, LargeListArray and
  FixedSizeListArray that convert the referenced range of child values with a
  single recursive to_pylist call and then slice the resulting Python list per
  row using the raw C offsets and the validity bitmap. No per-row Scalar, no
  per-row Python Array wrapper, no per-row generator. MapArray explicitly keeps
  the generic path (association-tuple / maps_as_pydicts duplicate-key
  semantics).
- String leaf fast path: to_pylist overrides on StringArray / LargeStringArray
  that decode values straight from the data buffer
  (GetValue + PyUnicode_DecodeUTF8), matching StringScalar.as_py
  (= str(buf, 'utf8')) exactly.
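The two parts can be modelled in pure Python to show the idea. This is only a sketch of the technique, not the prototype's Cython code: the helper names decode_strings and bulk_list_to_pylist are hypothetical, buffers are plain Python bytes and lists, and the whole child is converted rather than only the referenced range.

```python
def decode_strings(data: bytes, offsets: list[int]) -> list[str]:
    """Leaf fast path: decode each value straight from the shared data buffer."""
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf8")
        for i in range(len(offsets) - 1)
    ]


def bulk_list_to_pylist(child: list, offsets: list[int], valid: list[bool]) -> list:
    """Slice an already-converted child list per row using offsets and validity."""
    return [
        child[offsets[i]:offsets[i + 1]] if valid[i] else None
        for i in range(len(valid))
    ]


# Model of a list<string> array holding [["s0", "t0"], None, [], ["é"]].
data = "s0t0é".encode("utf8")   # string data buffer (6 bytes, "é" takes 2)
str_offsets = [0, 2, 4, 6]      # string value offsets into the data buffer
list_offsets = [0, 2, 2, 2, 3]  # list offsets into the flat string values
valid = [True, False, True, True]

flat = decode_strings(data, str_offsets)                 # one pass over the leaves
rows = bulk_list_to_pylist(flat, list_offsets, valid)    # one slice per row
print(rows)
```

The child values are converted once, and each row then costs a single list slice instead of a scalar, a wrapper and a generator.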
Semantics are unchanged: a differential test comparing the patched to_pylist
against the reference [x.as_py() for x in arr] with exact-type equality passes
for list/large_list/fixed_size_list/map over 8 leaf types, nested lists,
list, list, sliced arrays, all-null/empty arrays, and both
maps_as_pydicts modes; in particular list<int32> [1, None, 3] stays
[1, None, 3] (ints + None).
pytest pyarrow/tests/test_array.py test_scalars.py test_convert_builtin.py test_table.py passes (1208 passed).
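Exact-type equality matters because plain == treats 1 and 1.0 as equal, which would hide exactly the coercion described earlier. The comparison could look roughly like the following; same_exact is a hypothetical helper written for illustration, not the function used in the actual differential test.

```python
import math


def same_exact(a, b) -> bool:
    """Recursively compare values and require identical Python types."""
    if type(a) is not type(b):
        return False
    if isinstance(a, (list, tuple)):
        return len(a) == len(b) and all(same_exact(x, y) for x, y in zip(a, b))
    if isinstance(a, dict):
        # Compare keys and values in insertion order.
        return same_exact(list(a.items()), list(b.items()))
    if isinstance(a, float) and math.isnan(a):
        # NaN != NaN under ==, so treat two NaNs as matching.
        return math.isnan(b)
    return a == b


reference = [[1, None, 3]]
coerced = [[1.0, None, 3.0]]

print(reference == coerced)            # plain equality does not see the difference
print(same_exact(reference, coerced))  # exact-type comparison does
```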
Natural follow-ups (same pattern): leaf fast paths for primitive/binary types
(would speed up the list<list<int32>> case further), string/binary views,
struct arrays, a bulk path for maps, and list-view types (these need care:
overlapping views should not share mutable sublist objects). Longer-term, a
single C++ ToPyList visitor (like MonthDayNanoIntervalArrayToPyList) could
cover all types without per-class Cython code.
I can submit the prototype as a PR.
Component(s)
Python