Add support for Arrow LargeListArray in Parquet data projection #502

@callmepandey

Summary

The ProjectRecordBatch function in parquet_data_util.cc supports only ::arrow::ListArray (32-bit offsets), not ::arrow::LargeListArray (64-bit offsets). This limitation is marked with a FIXME comment at line 151.

Problem

Arrow's LargeListArray uses 64-bit offsets instead of 32-bit offsets, so it can hold lists with more than 2^31-1 total child elements. Currently, attempting to project a LargeListArray fails with an error like:

Expected list type, got: large_list<...>
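
To make the offset-width limit concrete, here is a minimal standalone sketch. FitsInOffsets is a hypothetical helper written for this issue, not part of Arrow or iceberg-cpp; it only checks whether a given total child count can be addressed by a given offset type.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not an Arrow API): can a list column holding
// `total_children` child elements be addressed with offsets of type OffsetT?
template <typename OffsetT>
constexpr bool FitsInOffsets(std::uint64_t total_children) {
  return total_children <=
         static_cast<std::uint64_t>(std::numeric_limits<OffsetT>::max());
}

// 2^31 - 1 children is the ceiling for 32-bit offsets (ListArray).
static_assert(FitsInOffsets<std::int32_t>(2147483647ULL),
              "2^31 - 1 children fit in 32-bit offsets");
// One more child no longer fits and needs 64-bit offsets (LargeListArray).
static_assert(!FitsInOffsets<std::int32_t>(2147483648ULL),
              "2^31 children do not fit in 32-bit offsets");
static_assert(FitsInOffsets<std::int64_t>(2147483648ULL),
              "2^31 children fit in 64-bit offsets");
```

The checks run at compile time, so the file compiles only if the stated limits hold.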

Proposed Solution

  1. Add a templated ProjectListArrayImpl<> function: a generic implementation that works with both ListArray and LargeListArray

  2. Add a ProjectLargeListArray wrapper: calls the template with LargeListArray and LargeListType

  3. Update ProjectNestedArray: handle both ::arrow::Type::LIST and ::arrow::Type::LARGE_LIST in the TypeId::kList case

  4. Add a test case: verify that LargeListArray projection works correctly
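
The shape of steps 1 to 3 can be sketched without depending on Arrow. Everything below is an illustrative stand-in and an assumption, not the actual iceberg-cpp code: FakeListArray models a list array as offsets plus flattened int values, ElementFn stands in for the recursive projection of the child array, and AnyArray with ArrowTypeId stands in for the type dispatch in ProjectNestedArray. The real functions would presumably operate on ::arrow::ListArray and ::arrow::LargeListArray and return a Result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// Stand-in for ::arrow::ListArray / ::arrow::LargeListArray (assumption):
// the two differ only in the width of their offsets.
template <typename OffsetT>
struct FakeListArray {
  std::vector<OffsetT> offsets;  // number of lists + 1 entries
  std::vector<int> values;       // flattened child elements
};
using FakeList = FakeListArray<std::int32_t>;
using FakeLargeList = FakeListArray<std::int64_t>;

// Stand-in for projecting the child array (hypothetical).
using ElementFn = std::function<int(int)>;

// Step 1: one generic implementation shared by both list flavours.
// It projects the child values and reuses the offsets unchanged.
template <typename ListArrayT>
ListArrayT ProjectListArrayImpl(const ListArrayT& input, const ElementFn& fn) {
  // The last offset must equal the number of child elements.
  assert(input.offsets.empty() ||
         static_cast<std::size_t>(input.offsets.back()) == input.values.size());
  ListArrayT out;
  out.offsets = input.offsets;
  out.values.reserve(input.values.size());
  for (int v : input.values) out.values.push_back(fn(v));
  return out;
}

// Step 2: thin wrappers that pin the template to a concrete array type.
FakeList ProjectListArray(const FakeList& in, const ElementFn& fn) {
  return ProjectListArrayImpl<FakeList>(in, fn);
}
FakeLargeList ProjectLargeListArray(const FakeLargeList& in, const ElementFn& fn) {
  return ProjectListArrayImpl<FakeLargeList>(in, fn);
}

// Step 3: dispatch on the physical Arrow type instead of rejecting LARGE_LIST.
enum class ArrowTypeId { kList, kLargeList, kInt32 };

struct AnyArray {  // crude tagged union, for illustration only
  ArrowTypeId id;
  FakeList list;
  FakeLargeList large_list;
};

AnyArray ProjectNestedArray(const AnyArray& in, const ElementFn& fn) {
  switch (in.id) {
    case ArrowTypeId::kList:
      return AnyArray{in.id, ProjectListArray(in.list, fn), FakeLargeList{}};
    case ArrowTypeId::kLargeList:
      return AnyArray{in.id, FakeList{}, ProjectLargeListArray(in.large_list, fn)};
    default:
      throw std::invalid_argument("Expected list type");
  }
}
```

With these stand-ins, calling ProjectNestedArray on an AnyArray tagged kLargeList returns the projected values with the original 64-bit offsets, while a non-list tag still raises the "Expected list type" error. Templating on the array type keeps the two code paths identical apart from the offset width, which is the main reason to prefer it over a copied LargeListArray function.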

Files to Change

  • src/iceberg/parquet/parquet_data_util.cc
  • src/iceberg/test/parquet_data_test.cc
