Fixes #22644: Add Iceberg table support for GCS and S3 Datalake connector #27735
mohitjeswani01 wants to merge 13 commits into open-metadata:main
Conversation
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help!
Hi @harshach sir, could you please add a
The Python checkstyle failed. Please run You can install the pre-commit hooks with
Thanks @harshach sir, I will monitor the checks and work accordingly! 🙏
🔴 Playwright Results — 5 failure(s), 12 flaky
✅ 3980 passed · ❌ 5 failed · 🟡 12 flaky · ⏭️ 86 skipped
Genuine Failures (failed on all attempts) ❌
Oops, the SonarQube check failed, I will check it shortly!
…tadata#22644)
- Integer version comparison (v10 > v9) via regex group capture
- Single-pass listing: eliminates double bucket scan for non-Iceberg buckets
- Mixed buckets: regular files outside Iceberg dirs are now yielded
- Removes extra head_object/get_blob API calls (use listing size directly)
- Fix get_tables_name_and_type return type annotation to 5-tuple
- Update tests: remove _get_iceberg_tables direct calls, add v10 regression
Force-pushed from 59033b6 to b7b8904
… tests per review
Code Review 👍 Approved with suggestions (7 resolved / 8 findings)

Adds Iceberg table support for GCS and S3 Datalake connectors, resolving memory bloat, redundant API calls, and versioning bugs. Refactors test classes to move away from unittest.TestCase to align with project guidelines.

💡 Quality: Test classes inherit unittest.TestCase contrary to guidelines
📄 ingestion/tests/unit/readers/test_json_reader.py:264-278
The new tests in

✅ 7 resolved
- ✅ Bug: Lexicographic version comparison fails for v10+ metadata
- ✅ Bug: Mixed Iceberg + regular files: regular tables silently dropped
- ✅ Performance: Double bucket listing for non-Iceberg S3/GCS buckets
- ✅ Quality: Return type annotation still says 4-tuple, yields 5-tuple
- ✅ Performance: Entire bucket listing materialized in memory before yielding
...and 2 more resolved from earlier reviews
@harshach sir, I solved all the problems that appeared in SonarQube, but I cannot figure out exactly why it is continuously failing. Could you check that once? Thanks 🙏
```diff
 def get_tables_name_and_type(  # pylint: disable=too-many-branches
     self,
-) -> Iterable[Tuple[str, TableType, SupportedTypes, Optional[int]]]:  # noqa: UP006, UP045
+) -> Iterable[Tuple[str, TableType, SupportedTypes, Optional[int], str]]:  # noqa: UP006, UP045
     """
```
DatalakeSource overrides DatabaseServiceSource.get_tables_name_and_type() / yield_table() with incompatible tuple shapes (5-tuple instead of the base 2-tuple). Ingestion runs basedpyright with reportIncompatibleMethodOverride = "error", so this change is very likely to fail type-checking and break the expected contract for the database topology stages.
Recommendation: keep the override signatures compatible with DatabaseServiceSource (return/accept Tuple[str, TableType]) and carry fetch_key via an internal mapping/context (e.g., store table_name -> key_name during discovery) or by using displayName for the human-readable name while keeping name/fetch key stable.
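A minimal sketch of that direction, assuming a hypothetical `_fetch_keys` dict on the source; `TableType` and `standardize_table_name` are the PR's names, while the listing call and attribute names are illustrative:

```python
from typing import Dict, Iterable, Tuple

# Sketch of a method on DatalakeSource: keep the base-class 2-tuple
# contract and stash extra discovery details on the instance instead
# of widening the yielded tuple.
def get_tables_name_and_type(self) -> Iterable[Tuple[str, TableType]]:
    self._fetch_keys: Dict[str, str] = {}  # table name -> original blob key (illustrative)
    for key in self.client.get_table_names(self.bucket_name):  # assumed listing call
        table_name = standardize_table_name(key)  # existing helper per the PR
        self._fetch_keys[table_name] = key  # keep the real fetch path out of the tuple
        is_iceberg = key.endswith(".metadata.json")  # simplified detection for the sketch
        yield table_name, TableType.Iceberg if is_iceberg else TableType.Regular
```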
```diff
 def yield_table(
     self,
-    table_name_and_type: Tuple[str, TableType, SupportedTypes, Optional[int]],  # noqa: UP006, UP045
+    table_name_and_type: Tuple[str, TableType, SupportedTypes, Optional[int], str],  # noqa: UP006, UP045
 ) -> Iterable[Either[CreateTableRequest]]:
```
yield_table() now expects a 5-tuple, but DatabaseServiceSource.yield_table() is defined to accept Tuple[str, TableType]. With basedpyright configured to error on incompatible overrides, this signature change is likely to fail type-checking.
Recommendation: keep the override signature compatible and source fetch_key from an internal mapping/context built during get_tables_name_and_type() (or refactor the topology contract to carry a dedicated table-info object).
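And the matching consumption side of the same sketch, reading the stashed key back instead of widening the tuple (`_create_table_request` is a hypothetical helper, not the actual API):

```python
from typing import Iterable, Tuple

# Sketch: yield_table keeps the base-class 2-tuple signature.
def yield_table(
    self,
    table_name_and_type: Tuple[str, TableType],  # base-class 2-tuple preserved
) -> Iterable[Either[CreateTableRequest]]:
    table_name, table_type = table_name_and_type
    # Recover the original metadata blob path recorded during discovery,
    # falling back to the table name for regular entries.
    fetch_key = self._fetch_keys.get(table_name, table_name)
    yield from self._create_table_request(table_name, table_type, fetch_key)  # hypothetical helper
```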
🧊 Iceberg Table Support for GCS and S3 Datalake Connector
Fixes #22644
The Problem
Companies migrating from BigQuery to Iceberg on GCS could not
ingest Iceberg table metadata into OpenMetadata. The Datalake
connector already supported GCS and already had Iceberg JSON
parsing logic — but two precise bugs prevented them from
working together.
Bug 1 — Wrong Table Discovery
`DatalakeGcsClient.get_table_names()` listed every blob individually. An Iceberg table `orders/` produced 5+ entries:

- `orders/metadata/v1.metadata.json` → treated as separate table
- `orders/metadata/v2.metadata.json` → treated as separate table
- `orders/data/00000-0-abc.parquet` → treated as separate table

Result: 5 bogus tables instead of 1 correct `orders` table.

Bug 2 — Iceberg Column Parser Never Called
In `readers/dataframe/json.py`, the `raw_data` gate only opened for JSON files containing a `"$schema"` key. Iceberg metadata JSON uses `"format-version"`, so `raw_data` was always `None`, meaning `_is_iceberg_delta_metadata()` and `_parse_iceberg_delta_schema()` were never reached despite being already correctly implemented.

Result: Iceberg tables showed garbage columns (`format-version`, `table-uuid`, `location`) instead of the actual data schema.
The Fix — 5 Production Files, Zero New Abstractions
All fixes follow existing patterns exactly and reuse
existing parsing logic unchanged.
Fix 1 — `raw_data` Gate (`readers/dataframe/json.py`)

This single change unlocks the entire Iceberg parsing pipeline that was already correctly implemented in `datalake_utils.py`.
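The diff itself isn't shown in this thread, but from the bug description the change amounts to widening the gate condition, roughly as below (a sketch with an illustrative function name, not the actual code in `readers/dataframe/json.py`):

```python
import json
from typing import Any, Optional

def _get_raw_data(text: bytes) -> Optional[Any]:
    """Sketch of the widened gate: keep the parsed JSON around when the
    file is either a JSON Schema document or Iceberg table metadata."""
    data = json.loads(text)
    # Before the fix only "$schema" opened the gate; Iceberg metadata is
    # keyed by "format-version" instead (per the bug description above).
    if isinstance(data, dict) and ("$schema" in data or "format-version" in data):
        return data
    return None
```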
gcs.py+s3.py)Added
_get_iceberg_tables()to bothDatalakeGcsClientand
DatalakeS3Client. Detects Iceberg table directoriesby scanning for
*/metadata/v*.metadata.jsonpattern,keeps only the latest version per table directory, and
yields one entry per Iceberg table instead of individual blobs.
Non-Iceberg buckets fall through to the original listing
behavior — zero breaking changes.
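A standalone sketch of that discovery rule, including the integer version comparison from the commit message ("v10 > v9 via regex group capture"); the function and pattern names are illustrative:

```python
import re
from typing import Dict, Iterable, Tuple

# Illustrative pattern for "<table dir>/metadata/v<N>.metadata.json"
METADATA_PATTERN = re.compile(r"^(?P<table>.+)/metadata/v(?P<version>\d+)\.metadata\.json$")

def latest_iceberg_metadata(keys: Iterable[str]) -> Dict[str, str]:
    """Map each Iceberg table directory to its highest-versioned metadata
    file, comparing versions as integers so v10 sorts after v9."""
    latest: Dict[str, Tuple[int, str]] = {}
    for key in keys:
        match = METADATA_PATTERN.match(key)
        if not match:
            continue  # non-Iceberg keys fall through to the regular listing
        table = match.group("table")
        version = int(match.group("version"))  # integer, not lexicographic
        if table not in latest or version > latest[table][0]:
            latest[table] = (version, key)
    return {table: key for table, (_, key) in latest.items()}
```

Comparing `int(match.group("version"))` rather than the raw string is what makes `v10` sort after `v9`, which a lexicographic comparison gets wrong (`"v10" < "v9"` as strings).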
Fix 3 — Table Name + Type (`datalake_utils.py` + `metadata.py`)

Added `get_iceberg_table_name_from_metadata_path()`, which extracts the directory name from the metadata path: `"warehouse/orders/metadata/v2.metadata.json"` → `"orders"`. `standardize_table_name()` now uses this for Iceberg paths, and `get_tables_name_and_type()` now yields `TableType.Iceberg` for Iceberg metadata files.
Fix 4 — Fetch Path Preservation (`metadata.py`)

Ensured `fetch_dataframe_first_chunk()` receives the original metadata blob path for fetching while the Table entity displays the clean directory name. Both values flow through the pipeline correctly.
Complete E2E Flow After Fix

For an Iceberg table `warehouse/orders/` on GCS:

| Step | Before | After |
| --- | --- | --- |
| `get_table_names()` | one entry per blob | one entry per table directory |
| `standardize_table_name()` | `warehouse/orders/metadata/v2.metadata.json` | `orders` |
| `TableType` | `Regular` | `Iceberg` |
| `raw_data` gate | `None` → parser never called | `_is_iceberg_delta_metadata()` called |
| `get_columns()` | raw metadata keys as columns | parsed from `schema.fields` |

What Was NOT Changed (Reused 100%)
Following @harshach's direction to reuse existing logic:
- `_is_iceberg_delta_metadata()` — zero changes
- `_parse_iceberg_delta_schema()` — zero changes
- `_parse_struct_fields()` — zero changes
- `set_google_credentials()` — zero changes
- `DatalakeGcsClient` credential initialization — zero changes

Tests — 18 Tests, Zero Infrastructure Required
Test run screenshots for `tests/unit/readers/test_json_reader.py` and `tests/unit/source/database/test_iceberg_discovery.py`.
New tests added: 18 across 2 files
`tests/unit/readers/test_json_reader.py` — 4 new tests:

- `test_raw_data_set_for_iceberg_metadata` — gate opens for Iceberg JSON
- `test_iceberg_columns_parsed_correctly` — correct columns extracted
- `test_raw_data_none_for_regular_json` — backward compatibility
- `test_raw_data_set_for_json_schema` — existing behavior preserved

`tests/unit/source/database/test_iceberg_discovery.py` — 14 new tests: fallback for non-Iceberg, mixed bucket handling
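For flavor, a hedged sketch of the two gate tests as plain pytest functions, reusing the illustrative `_get_raw_data` from the Fix 1 sketch (the real tests live in `tests/unit/readers/test_json_reader.py` and may differ):

```python
import json

def test_raw_data_set_for_iceberg_metadata():
    # Minimal Iceberg metadata: keyed by "format-version", not "$schema"
    payload = json.dumps({"format-version": 2, "table-uuid": "abc"}).encode()
    assert _get_raw_data(payload) is not None  # gate opens for Iceberg JSON

def test_raw_data_none_for_regular_json():
    # Plain data JSON must keep the old behavior and stay closed
    payload = json.dumps({"order_id": 1, "status": "shipped"}).encode()
    assert _get_raw_data(payload) is None
```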
Files Changed
- `readers/dataframe/json.py`
- `datalake/clients/gcs.py`
- `datalake/clients/s3.py`
- `utils/datalake/datalake_utils.py`
- `datalake/metadata.py`
- `tests/unit/readers/test_json_reader.py`
- `tests/unit/source/database/test_iceberg_discovery.py`

Type of Change
Checklist
- Fixes #22644: Add Iceberg table support for GCS and S3 Datalake connector
- All fixes are within the Python ingestion layer only.
- Existing GCS and S3 credential schemas are reused unchanged.
- Implements it fully following maintainer guidance.
Summary by Gitar
- `_parse_column`: handles type inference for complex `ARRAY` and `JSON` types.
- `_process_unique_json_key`: streamlines the logic for merging nested structures and `_ArrayOfStruct` types during metadata discovery.