Commit 47ae4ac

Python: Compute parquet stats (#7831)
* Add function to compute parquet file metadata
* Addition of docstring and extra parameter to avoid reading the file unnecessarily
* Refactor the statistics computation entirely to use pyarrow metadata
  This commit makes sure to test the metadata computation both using `pyarrow.parquet.ParquetWriter` and `pyarrow.parquet.write_to_dataset`.
* Appease pre-commit hooks
* Fix temporary path
* Make the metrics mode configurable as documented here: https://iceberg.apache.org/docs/latest/configuration/
* Initialize binary serializers only once
* Log arrow not implemented exception
* Fix None comparison expression
* Add map column to test data
* Moving pyarrow specific code to io.pyarrow
* type annotation
* Refactor the stats collection using the pyarrow visitor
* Clean redundant code and add warning message to the log
* Address some of the review comments
* Add tests to check that the number of columns found by the statistics collector is correct
* We don't want to truncate numeric data types
* Verify match of Iceberg types with Parquet physical types
* Fix truncation of upper bounds
* Transform asserts to ValueErrors
* Add review suggestions
* Address simple code style review comments
* Fix potential null write
* Apply function name refactoring
* Move pyarrow statistics tests to a new file
* Disable stats computation for nested types
* Modularize the fill_parquet_file_metadata function
* Allow metrics modes to have extra whitespace but not other trailing characters
* Move upper bound truncation logic to another file
* Be defensive with regards to missing row group statistics
* Add tests for structs
* Remove special treatment of UUIDType
* Rely on parquet column path rather than column order
  This commit adds a visitor to compute a mapping from parquet column path to iceberg field ID.
* Change mood to imperative to appease linter
* Factor out the logic to obtain the current table schema
1 parent c1e877a commit 47ae4ac

5 files changed

Lines changed: 1318 additions & 0 deletions

python/pyiceberg/avro/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -16,5 +16,8 @@
 # under the License.
 import struct
 
+STRUCT_BOOL = struct.Struct("?")
 STRUCT_FLOAT = struct.Struct("<f") # little-endian float
 STRUCT_DOUBLE = struct.Struct("<d") # little-endian double
+STRUCT_INT32 = struct.Struct("<i")
+STRUCT_INT64 = struct.Struct("<q")
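
The module-level `struct.Struct` constants in this diff match the commit message item "Initialize binary serializers only once": each format string is compiled once at import time and then reused. The sketch below illustrates how such precompiled packers behave. The five constants are copied from the diff, while the `roundtrip` helper and the sample values are hypothetical and only serve the illustration; how the rest of the commit actually calls these constants is not shown in this file.

```python
import struct

# Constants as they appear in python/pyiceberg/avro/__init__.py after this commit.
STRUCT_BOOL = struct.Struct("?")
STRUCT_FLOAT = struct.Struct("<f")  # little-endian float
STRUCT_DOUBLE = struct.Struct("<d")  # little-endian double
STRUCT_INT32 = struct.Struct("<i")  # little-endian signed 32-bit integer
STRUCT_INT64 = struct.Struct("<q")  # little-endian signed 64-bit integer


def roundtrip(packer: struct.Struct, value):
    """Hypothetical helper: pack a value to bytes, then unpack it again."""
    raw = packer.pack(value)
    (decoded,) = packer.unpack(raw)  # unpack always returns a tuple
    return raw, decoded


int32_raw, int32_val = roundtrip(STRUCT_INT32, 1)
int64_raw, int64_val = roundtrip(STRUCT_INT64, -2)
bool_raw, bool_val = roundtrip(STRUCT_BOOL, True)
double_raw, double_val = roundtrip(STRUCT_DOUBLE, 1.5)

print(int32_raw)  # least significant byte comes first
print(STRUCT_INT64.size, STRUCT_INT32.size, STRUCT_BOOL.size)
```

The `<` prefix fixes the byte order to little-endian and disables native alignment, so the encoded size is the same on every platform (4 bytes for `<i`, 8 bytes for `<q`).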
