Commit 47ae4ac
Python: Compute parquet stats (#7831)
* Add function to compute parquet file metadata
* Addition of docstring and extra parameter to avoid reading the file
unnecessarily
* Refactor the statistics computation entirely to use pyarrow metadata
This commit makes sure to test the metadata computation both using
`pyarrow.parquet.ParquetWriter` and `pyarrow.parquet.write_to_dataset`.
* Appease pre-commit hooks
* Fix temporary path
* Make the metrics mode configurable as documented here: https://iceberg.apache.org/docs/latest/configuration/
* Initialize binary serializers only once
* Log arrow not implemented exception
* Fix None comparison expression
* Add map column to test data
* Moving pyarrow specific code to io.pyarrow
* type annotation
* Refactor the stats collection using the pyarrow visitor
* Clean redundant code and add warning message to the log
* Address some of the review comments
* Add tests to check that the number of columns found by the statistics
collector is correct
* We don't want to truncate numeric data types
* Verify match of Iceberg types with Parquet physical types
* Fix truncation of upper bounds
* Transform asserts to ValueErrors
* Add review suggestions
* Address simple code style review comments
* Fix potential null write
* Apply function name refactoring
* Move pyarrow statistics tests to a new file
* Disable stats computation for nested types
* Modularize the fill_parquet_file_metadata function
* Allow metrics modes to have extra whitespace but not other trailing
characters
* Move upper bound truncation logic to another file
* Be defensive with regard to missing row group statistics
* Add tests for structs
* Remove special treatment of UUIDType
* Rely on parquet column path rather than column order
This commit adds a visitor to compute a mapping from
parquet column path to iceberg field ID.
* Change mood to imperative to appease linter
* Factor out the logic to obtain the current table schema

1 parent c1e877a commit 47ae4ac
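
The commit message mentions making the metrics mode configurable and allowing "extra whitespace but not other trailing characters". The sketch below illustrates one way such parsing could work for the modes listed in the Iceberg configuration docs (`none`, `counts`, `truncate(length)`, `full`). The names `MetricsMode` and `parse_metrics_mode` are hypothetical and chosen for illustration; this is not the code added by the commit.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical helper names for illustration; not taken from the commit.
_TRUNCATE = re.compile(r"truncate\((\d+)\)", re.IGNORECASE)


@dataclass(frozen=True)
class MetricsMode:
    kind: str                     # "none", "counts", "truncate" or "full"
    length: Optional[int] = None  # only set when kind == "truncate"


def parse_metrics_mode(text: str) -> MetricsMode:
    """Parse a metrics mode string, tolerating surrounding whitespace only."""
    cleaned = text.strip()
    lowered = cleaned.lower()
    if lowered in ("none", "counts", "full"):
        return MetricsMode(lowered)
    # fullmatch rejects anything after the closing parenthesis.
    match = _TRUNCATE.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"Unsupported metrics mode: {text!r}")
    length = int(match.group(1))
    if length < 1:
        raise ValueError(f"Truncation length must be positive: {text!r}")
    return MetricsMode("truncate", length)
```

Stripping first and then requiring a full match keeps `"  truncate(16) "` valid while rejecting inputs such as `"truncate(16)abc"`.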
5 files changed
Lines changed: 1318 additions & 0 deletions
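
Two items in the commit message concern truncated bounds ("Fix truncation of upper bounds" and "We don't want to truncate numeric data types"). As a rough illustration of why upper bounds need special handling: a plain prefix is always a valid lower bound, but a truncated upper bound must be bumped so that it still compares greater than or equal to the original value. The functions below are a minimal sketch for `bytes` values under that assumption; their names are hypothetical and the commit's actual implementation may differ.

```python
from typing import Optional


def truncate_lower_bound(value: bytes, length: int) -> bytes:
    # A prefix never compares greater than the original value.
    return value[:length]


def truncate_upper_bound(value: bytes, length: int) -> Optional[bytes]:
    """Return a bound of at most `length` bytes that is >= value, or None."""
    if len(value) <= length:
        return value  # nothing to truncate
    prefix = bytearray(value[:length])
    # Walk backwards to the last byte that can be incremented without overflow.
    for i in reversed(range(len(prefix))):
        if prefix[i] < 0xFF:
            prefix[i] += 1
            return bytes(prefix[: i + 1])
    # Every byte in the prefix is 0xFF, so no shorter upper bound exists.
    return None
```

Numeric types are left out of this sketch on purpose, in line with the commit message's note that numeric data types should not be truncated.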
File tree
- python
  - pyiceberg
    - avro
    - io
    - utils
  - tests
    - io
    - utils