Pyarrow multi-threading support for struct-list parquet file reads

As originally discussed in #414, pyarrow reads parquet columns with multi-threading enabled by default, which carries over to pandas read times when pyarrow is used as the engine. However, while this multi-threading works for separate list-array columns, it does not work when loading the separate fields stored within a single struct-of-lists column. This is not something we can change on our end, but this issue logs some reproducer code:

File Generation:

```
# Code block to generate needed parquet files
from nested_pandas.datasets import generate_data

# Generate a parquet dataset with struct-list format
nf = generate_data(100, 2000, seed=1)[["nested"]]
nf.to_parquet("nested_parquet.parquet")

# Generate a parquet dataset with list-array format
nf["nested"].to_lists().to_parquet("list_parquet.parquet")
```

Versioning & Storage Context:

```
import pyarrow as pa
# Import the submodule explicitly; `import pyarrow` alone may not load `pa.parquet`
import pyarrow.parquet as pq

pa.__version__
> '22.0.0'

# struct of lists storage as read by pyarrow
pq.read_table("nested_parquet.parquet").field("nested")
> pyarrow.Field<nested: struct<t: list<element: double>, flux: list<element: double>, band: list<element: string>>>

# list storage as read by pyarrow
pq.read_table("list_parquet.parquet").field("t")
> pyarrow.Field<t: list<element: double>>
```

Single-Thread Timings:

<img width="616" height="165" alt="Image" src="https://github.com/user-attachments/assets/42cf6292-26d5-4c9e-8bb2-255af3f98b08" />

Multi-Thread Timings:

<img width="618" height="166" alt="Image" src="https://github.com/user-attachments/assets/4c3229df-2870-4550-bfcb-f661682bc50c" />

We see that multi-threading improves the read speed for list-arrays, but not for struct-list formatted data.
