Add support for Bodo DataFrame #2167
Changes from 9 commits
```diff
@@ -451,6 +451,11 @@ def test_dynamic_partition_overwrite_unpartitioned_evolve_to_identity_transform(

 @pytest.mark.integration
 def test_summaries_with_null(spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:
+    import pyarrow
+    from packaging import version
+
+    under_20_arrow = version.parse(pyarrow.__version__) < version.parse("20.0.0")
+
     identifier = "default.arrow_table_summaries"

     try:
@@ -547,14 +552,14 @@ def test_summaries_with_null(spark: SparkSession, session_catalog: Catalog, arro
         "total-records": "6",
     }
     assert summaries[5] == {
-        "removed-files-size": "16174",
+        "removed-files-size": "15774" if under_20_arrow else "16174",
         "changed-partition-count": "2",
         "total-equality-deletes": "0",
         "deleted-data-files": "4",
         "total-position-deletes": "0",
         "total-delete-files": "0",
         "deleted-records": "4",
-        "total-files-size": "8884",
+        "total-files-size": "8684" if under_20_arrow else "8884",
         "total-data-files": "2",
         "total-records": "2",
     }
@@ -564,9 +569,9 @@ def test_summaries_with_null(spark: SparkSession, session_catalog: Catalog, arro
         "total-equality-deletes": "0",
         "added-records": "2",
         "total-position-deletes": "0",
-        "added-files-size": "8087",
+        "added-files-size": "7887" if under_20_arrow else "8087",
         "total-delete-files": "0",
-        "total-files-size": "16971",
+        "total-files-size": "16571" if under_20_arrow else "16971",
         "total-data-files": "4",
         "total-records": "4",
     }
```

**Contributor** (inline, on the `"removed-files-size"` change): lets just do this instead since we're not really testing for the file size

**Contributor (Author):** Sounds good.
**Reviewer:** we should find another way to make these tests pass instead of branching on the pyarrow version
**Author:** Any ideas? Maybe use a range of "safe" values instead of a single file-size value? I'd be happy to open another PR if there is more work for this.

Bodo is currently pinned to Arrow 19, since the current release of PyIceberg supports up to Arrow 19. Bodo uses Arrow C++, which currently requires pinning to a single Arrow version for pip wheels to work (conda-forge builds against the four latest Arrow versions in this case, but pip doesn't support this yet). It would be great if PyIceberg didn't set an upper version bound for Arrow, if possible.
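The pinning situation described above can be illustrated with hypothetical pip requirement specifiers (the exact version numbers below are examples, not Bodo's or PyIceberg's actual pins):

```
pyarrow==19.0.1          # single-version pin: what Arrow C++-based pip wheels currently require
pyarrow>=17.0.0,<20.0.0  # upper-bounded range: what a PyIceberg-style cap looks like
pyarrow>=17.0.0          # no upper bound: what the author is asking for
```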
**Reviewer:** I think we can just parameterize the file size. We're not really testing anything related to the size of the file.

**Reviewer:** Yeah, agreed. Let's see if we can remove the upper bound.
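One way to read that suggestion is to compare everything in the snapshot summaries except the size-dependent fields. The sketch below is hypothetical — the helper `without_sizes` and the sample dicts are illustrative, not code from the PR:

```python
# Sketch of the reviewers' suggestion: drop the size-dependent summary keys
# before comparing, since the test is not about exact file sizes.
# `without_sizes` and the sample dicts are hypothetical, not the PR's code.

SIZE_KEYS = {"removed-files-size", "added-files-size", "total-files-size"}

def without_sizes(summary: dict) -> dict:
    """Return a copy of a snapshot summary with file-size entries removed."""
    return {k: v for k, v in summary.items() if k not in SIZE_KEYS}

# Example: the same summary written by two pyarrow versions differs only in sizes.
expected = {
    "removed-files-size": "16174",  # varies with the Parquet writer version
    "deleted-data-files": "4",
    "total-records": "2",
}
actual = {
    "removed-files-size": "15774",  # smaller under pyarrow < 20
    "deleted-data-files": "4",
    "total-records": "2",
}

assert without_sizes(actual) == without_sizes(expected)
# Size fields can still be sanity-checked without pinning exact byte counts.
assert int(actual["removed-files-size"]) > 0
```

This keeps the assertions independent of the Parquet writer's exact output size while still verifying that the size fields are present and well-formed.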