
Pyarrow multi-threading support for struct-list parquet file reads #421

Description

@dougbrn

As originally discussed in #414, pyarrow enables multi-threaded reading of parquet columns by default (which carries over to pandas read times when pyarrow is used as the engine). However, while this multi-threading works for separate list-array columns, it does not work when loading the separate fields stored within a single struct of lists. This is not something that we can change on our end, but this issue logs some reproducer code:

File Generation:

# Code block to generate needed parquet files
from nested_pandas.datasets import generate_data

# Generate a parquet dataset with struct-list format
nf = generate_data(100, 2000, seed=1)[["nested"]]
nf.to_parquet("nested_parquet.parquet")

# Generate a parquet dataset with list-array format
nf["nested"].to_lists().to_parquet("list_parquet.parquet")

Versioning & Storage Context

import pyarrow as pa
# Import the parquet submodule explicitly; `import pyarrow` alone
# does not guarantee that `pa.parquet` is loaded
import pyarrow.parquet as pq

pa.__version__
> '22.0.0'

# struct of lists storage as read by pyarrow
pq.read_table("nested_parquet.parquet").field("nested")
> pyarrow.Field<nested: struct<t: list<element: double>, flux: list<element: double>, band: list<element: string>>>

# list storage as read by pyarrow
pq.read_table("list_parquet.parquet").field("t")
> pyarrow.Field<t: list<element: double>>

Single-Thread Timings:

Image

Multi-Thread Timings:

Image

We see that multi-threading improves the read speed for list-arrays, but not for struct-list formatted data.


Labels: LSDB (Highlight ticket for LSDB development), performance (Addresses computational performance)
