
[Python] to_pylist() on list-typed arrays is several times slower than converting via to_pandas() #50326

Description

@viirya

Describe the enhancement requested

pa.Array.to_pylist() on list-typed arrays is 2.5–10x slower than converting the
same array to pandas and then turning the resulting numpy arrays back into Python
lists — even though to_pylist does strictly less work conceptually.

This matters in practice: Apache Spark switched regular Python UDFs to Arrow
serialization by default and hit a performance regression on array columns caused
by this (see apache/spark#56940, apache/spark#56943). Working around it in Spark
via the pandas detour was rejected because it introduces type-coercion bugs
(e.g. list<int32> with a null element comes back as numpy float64
[1., nan, 3.] instead of [1, None, 3]), so the right fix is making
to_pylist() itself fast.

Reproduction (pyarrow 24.0.0, Python 3.11, macOS arm64; same numbers on current master)

import pyarrow as pa

N = 2_000_000
arr = pa.array([[f"s{j}", f"t{j}"] for j in range(N)], type=pa.list_(pa.string()))

arr.to_pylist()                          # 1.97 s
arr.to_pandas()                          # 0.46 s  (4.3x faster, does MORE work)
[x.tolist() for x in arr.to_pandas()]    # 0.78 s  (2.5x faster incl. ndarray->list)
arr.values.to_pylist()                   # 0.82 s  (4M flat strings)

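# nested: 1M rows of [[j, j+1], [j+2]] as list<list<int32>>
nested = pa.array([[[j, j + 1], [j + 2]] for j in range(1_000_000)],
                  type=pa.list_(pa.list_(pa.int32())))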
nested.to_pylist()                       # 2.00 s
nested.to_pandas()                       # 0.20 s  (10x faster)

Root cause

Array.to_pylist is implemented as a per-element scalar conversion
(python/pyarrow/array.pxi):

return [x.as_py(maps_as_pydicts=maps_as_pydicts) for x in self]

For a list<string> array, every row pays for:

  1. Array.__iter__ → getitem(i) → C++ arrow::Array::GetScalar(i), which
    allocates a ListScalar holding a sliced values array;
  2. a Python Scalar wrapper (Scalar.wrap);
  3. ListScalar.as_py → the values property wraps the slice in a new Python
    Array object (pyarrow_wrap_array), then recursively calls .to_pylist()
    on it, which allocates a fresh generator and repeats 1–2 for every element,
    where C++ GetScalar on a string array copies each value into a
    std::string, wraps it in a Buffer and allocates a StringScalar.

A sample profile of the repro shows where the time goes (~8365 samples):

  • ~20% CPython GC (gc_collect_main): the per-row generator/Scalar/Array
    allocations are GC-tracked and repeatedly trigger collections that traverse
    the ever-growing result list;
  • ~25% C++ Array::GetScalar (per-element scalar allocation + per-row values
    slicing);
  • most of the rest is Python wrapper allocation and method dispatch
    (Scalar.wrap, ListScalar.values → pyarrow_wrap_array, as_py calls);
  • the useful work — actually creating the 4M str objects (unicode_new) —
    is only ~7% of samples.
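
The GC share in the profile can be illustrated without pyarrow. The sketch below is only a stand-in, not the profiled code path: it assumes CPython, uses small inner lists in place of the per-row Scalar/Array/generator objects, and counts collector passes with the standard gc.get_stats(). The helper name count_gc_passes is hypothetical.

```python
import gc

def count_gc_passes(build):
    """Run build() and return how many collector passes (all generations) it triggered."""
    was_enabled = gc.isenabled()
    gc.enable()
    gc.collect()  # start from a clean state so earlier garbage does not skew the count
    before = sum(s["collections"] for s in gc.get_stats())
    try:
        result = build()  # keep the result alive, like to_pylist's growing output list
    finally:
        if not was_enabled:
            gc.disable()
    after = sum(s["collections"] for s in gc.get_stats())
    del result
    return after - before

N = 200_000

# Each inner list is a GC-tracked container: a stand-in (assumption) for the
# per-row wrapper objects allocated on the generic to_pylist path.
tracked_passes = count_gc_passes(lambda: [[j, j + 1] for j in range(N)])

# str objects are not GC-tracked, so building only strings should trigger
# far fewer passes.
untracked_passes = count_gc_passes(lambda: [f"s{j}" for j in range(N)])

print("passes with per-row containers:", tracked_passes)
print("passes with plain strings:     ", untracked_passes)
```

The exact counts depend on the interpreter version and its GC thresholds, so treat them as qualitative only.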

This was diagnosed back in 2021 in #28694 (ARROW-12976): maintainers agreed the
fix is to bypass Scalar creation entirely, but the issue was closed as stale in
Feb 2026 without a fix. #28689 is related.

Prototype fix and results

A ~250-line Cython-level prototype on master (no C++ changes) gives:

benchmark (2M / 1M rows)       master   patched   speedup
list<string> to_pylist         1.93 s   0.34 s    5.7x
list<list<int32>> to_pylist    2.10 s   0.65 s    3.2x
flat string to_pylist (4M)     0.83 s   0.05 s    16x

i.e. for list<string>, to_pylist becomes ~2.2x faster than the pandas detour
(0.75 s) instead of 2.5x slower.

Two independent parts:

  1. Bulk list conversion: to_pylist overrides on ListArray,
    LargeListArray and FixedSizeListArray that convert the referenced range
    of child values with a single recursive to_pylist call and then slice the
    resulting Python list per row using the raw C offsets and the validity
    bitmap. No per-row Scalar, no per-row Python Array wrapper, no per-row
    generator. MapArray explicitly keeps the generic path (association-tuple /
    maps_as_pydicts duplicate-key semantics).
  2. String leaf fast path: to_pylist overrides on StringArray /
    LargeStringArray that decode values straight from the data buffer
    (GetValue + PyUnicode_DecodeUTF8), matching StringScalar.as_py
    (= str(buf, 'utf8')) exactly.
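
The two parts above can be modeled in pure Python. This is a minimal sketch of the idea, not the prototype's Cython code: Arrow buffers are represented as plain Python bytes and lists, and the names strings_to_pylist and lists_to_pylist are hypothetical.

```python
def strings_to_pylist(data, offsets, validity, start, stop):
    """Leaf fast path: decode values [start, stop) straight from the UTF-8 data buffer."""
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf8") if validity[i] else None
        for i in range(start, stop)
    ]

def lists_to_pylist(convert_child, offsets, validity):
    """Bulk list path: convert the referenced child range once, then slice per row."""
    start, stop = offsets[0], offsets[-1]
    flat = convert_child(start, stop)  # single recursive conversion of the child
    return [
        flat[offsets[i] - start:offsets[i + 1] - start] if validity[i] else None
        for i in range(len(offsets) - 1)
    ]

# Toy model of the list<string> array [["s0", "t0"], None, [], ["é"]].
child_data = "s0t0é".encode("utf8")
child_offsets = [0, 2, 4, 6]        # byte offsets; "é" takes two bytes
child_validity = [True, True, True]
list_offsets = [0, 2, 2, 2, 3]
list_validity = [True, False, True, True]

def convert_child(start, stop):
    return strings_to_pylist(child_data, child_offsets, child_validity, start, stop)

rows = lists_to_pylist(convert_child, list_offsets, list_validity)

# A sliced array (last two rows) only converts the child range it references.
sliced_rows = lists_to_pylist(convert_child, list_offsets[2:], list_validity[2:])
```

No per-row wrapper object is created here: each row costs one list slice, and each string costs one decode.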

Semantics are unchanged: a differential test comparing the patched to_pylist
against the reference [x.as_py() for x in arr] with exact-type equality passes
for list/large_list/fixed_size_list/map over 8 leaf types, nested lists,
list, list, sliced arrays, all-null/empty arrays, and both
maps_as_pydicts modes; in particular list<int32> [1, None, 3] stays
[1, None, 3] (ints + None). Running
pytest pyarrow/tests/test_array.py test_scalars.py test_convert_builtin.py test_table.py
passes (1208 passed).
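
Exact-type equality matters because plain == would accept the float-coerced result. A minimal sketch of such a comparison follows; exact_equal is a hypothetical helper for illustration, not the code used in the differential test.

```python
import math

def exact_equal(a, b):
    """True only if a and b match in type and value at every nesting level."""
    if type(a) is not type(b):
        return False
    if isinstance(a, (list, tuple)):
        return len(a) == len(b) and all(exact_equal(x, y) for x, y in zip(a, b))
    if isinstance(a, dict):
        # Compare as ordered (key, value) pairs so key types are checked too.
        return exact_equal(list(a.items()), list(b.items()))
    if isinstance(a, float) and math.isnan(a):
        return math.isnan(b)  # treat NaN as equal to NaN for testing purposes
    return a == b

reference = [[1, None, 3]]
coerced = [[1.0, None, 3.0]]   # the shape a float-coercing detour could produce

plain_says_equal = reference == coerced              # int 1 == float 1.0 in Python
exact_says_equal = exact_equal(reference, coerced)   # the type mismatch is caught
```

With a helper like this, a differential test reduces to asserting exact_equal(arr.to_pylist(), [x.as_py() for x in arr]) over the generated arrays.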

Natural follow-ups (same pattern): leaf fast paths for primitive/binary types
(would speed up the list<list<int32>> case further), string/binary views,
struct arrays, a bulk path for maps, and list-view types (these need care:
overlapping views should not share mutable sublist objects). Longer-term, a
single C++ ToPyList visitor (like MonthDayNanoIntervalArrayToPyList) could
cover all types without per-class Cython code.
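
One way the list-view caveat could show up, sketched in pure Python under assumptions: a hypothetical list_view<list<int64>> array whose two rows overlap in the child, modeled with plain lists. If the child is converted once and rows are only sliced, overlapping rows end up holding the same inner list objects.

```python
import copy

# Child values of the hypothetical array, plus two overlapping views:
# row 0 -> child[0:2], row 1 -> child[1:3].
child = [[1], [2], [3]]
view_offsets = [0, 1]
view_sizes = [2, 2]

def sliced_rows():
    """Convert the child once, then slice per row: inner lists are shared."""
    flat = copy.deepcopy(child)  # stands in for the single bulk child conversion
    return [flat[o:o + s] for o, s in zip(view_offsets, view_sizes)]

def independent_rows():
    """Copy the inner values per row, so overlapping views stay independent."""
    return [copy.deepcopy(child[o:o + s]) for o, s in zip(view_offsets, view_sizes)]

shared = sliced_rows()
shared[0][1].append(99)   # mutating row 0 ...
leaked = shared[1][0]     # ... is visible through row 1

safe = independent_rows()
safe[0][1].append(99)
unaffected = safe[1][0]
```

With regular list offsets the row ranges are normally disjoint, so this aliasing would not arise on the bulk path described above.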

I can submit the prototype as a PR.

Component(s)

Python
