Summary
The ProjectRecordBatch function in parquet_data_util.cc supports only ::arrow::ListArray (32-bit offsets), not ::arrow::LargeListArray (64-bit offsets). This limitation is marked with a FIXME comment at line 151.
Problem
Arrow's LargeListArray uses 64-bit offsets instead of 32-bit, so a single array can hold more than 2^31-1 total child elements. Currently, attempting to project a LargeListArray fails with an error like:
Expected list type, got: large_list<...>
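The offset-width limit can be illustrated without Arrow itself. The sketch below assumes only that list offsets are stored as int32_t and large-list offsets as int64_t, which matches Arrow's ListType::offset_type and LargeListType::offset_type. The helper name FitsInOffsetType is hypothetical and is not part of Arrow or this codebase.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: returns true if `total_child_elements` can be
// addressed by offsets of type `Offset`. The last offset of a list array
// equals the total number of child elements, so it must fit in `Offset`.
template <typename Offset>
constexpr bool FitsInOffsetType(int64_t total_child_elements) {
  return total_child_elements >= 0 &&
         total_child_elements <=
             static_cast<int64_t>(std::numeric_limits<Offset>::max());
}

// 2^31 - 1 child elements is the most a 32-bit offset can address.
static_assert(FitsInOffsetType<int32_t>(2147483647LL), "fits in 32-bit offsets");
static_assert(!FitsInOffsetType<int32_t>(2147483648LL), "needs 64-bit offsets");
static_assert(FitsInOffsetType<int64_t>(2147483648LL), "fits in 64-bit offsets");
```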
Proposed Solution
- Add templated ProjectListArrayImpl<> function - Generic implementation that works with both ListArray and LargeListArray
- Add ProjectLargeListArray wrapper - Calls the template with LargeListArray and LargeListType
- Update ProjectNestedArray - Handle both ::arrow::Type::LIST and ::arrow::Type::LARGE_LIST in the TypeId::kList case
- Add test case - Verify LargeListArray projection works correctly
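The shape of the proposed template could look roughly like the sketch below. It is not the actual implementation: ToyListArray is a stand-in for the Arrow array classes, ChildProjector is a hypothetical placeholder for projecting the child array against the Iceberg element type, and the dispatch on ::arrow::Type::LIST versus ::arrow::Type::LARGE_LIST in ProjectNestedArray is not modeled. The point is only that one generic body, parameterized on the array type, can serve both offset widths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in for an Arrow list array (NOT the real class). `Offset` plays the
// role of ListType::offset_type (int32_t) or LargeListType::offset_type
// (int64_t).
template <typename Offset>
struct ToyListArray {
  using offset_type = Offset;
  std::vector<Offset> offsets;  // number of lists + 1 entries
  std::vector<int64_t> values;  // flattened child elements
};
using ToyList = ToyListArray<int32_t>;
using ToyLargeList = ToyListArray<int64_t>;

// Hypothetical child projection; the real code would project a child array
// rather than map individual values.
using ChildProjector = std::function<int64_t(int64_t)>;

// Generic implementation shared by both widths: project the child values
// and carry the offsets over unchanged.
template <typename ListArrayT>
ListArrayT ProjectListArrayImpl(const ListArrayT& input,
                                const ChildProjector& project) {
  // In this toy model offsets start at 0, so the last offset must equal
  // the number of child elements.
  assert(!input.offsets.empty());
  assert(static_cast<std::size_t>(input.offsets.back()) == input.values.size());

  ListArrayT output;
  output.offsets = input.offsets;
  output.values.reserve(input.values.size());
  for (int64_t v : input.values) {
    output.values.push_back(project(v));
  }
  return output;
}

// Thin wrappers that pin the template to one offset width each.
ToyList ProjectListArray(const ToyList& input, const ChildProjector& project) {
  return ProjectListArrayImpl<ToyList>(input, project);
}

ToyLargeList ProjectLargeListArray(const ToyLargeList& input,
                                   const ChildProjector& project) {
  return ProjectListArrayImpl<ToyLargeList>(input, project);
}
```

A test along the lines of the last bullet would build a large-list input, project it, and check that the offsets are preserved and the child values are transformed.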
Files to Change
src/iceberg/parquet/parquet_data_util.cc
src/iceberg/test/parquet_data_test.cc
References
src/iceberg/parquet/parquet_data_util.cc:151