[#11039] fix(iceberg): Fast fail missing OSS table metadata #11042
atovk wants to merge 3 commits into
A stale Iceberg catalog entry can point at an OSS metadata file that no longer exists. Loading that table through the Aliyun OSS path may spend a long time in Iceberg's metadata read retry loop while Gravitino table operations still hold tree locks.

Use SupportsMetadataLocation to check OSS metadata with the configured Iceberg FileIO before loadTable and tableExists enter the full catalog load path. Missing OSS metadata now returns the normal no-such-table result without calling catalog.loadTable.

Tests:
- JAVA_HOME=/Users/nullwo/.gradle/jdks/amazon_com_inc_-17-aarch64-os_x/amazon-corretto-17.jdk/Contents/Home ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-common:test -PskipITs
- JAVA_HOME=/Users/nullwo/.gradle/jdks/amazon_com_inc_-17-aarch64-os_x/amazon-corretto-17.jdk/Contents/Home ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-rest-server:test -PskipITs
Should we fix this issue on the Iceberg side instead of the Gravitino side?
I think we should keep the fix on the Gravitino side for this PR, and I will also follow up in Iceberg via apache/iceberg#16299.

Gravitino already knows the table metadata location through SupportsMetadataLocation, so it can short-circuit before entering Iceberg’s metadata read path. That gives us a fast and consistent failure for loadTable and tableExists, instead of waiting for Iceberg’s metadata read retry loop.

The Iceberg issue is still valid and complementary: some FileIO read paths, such as Aliyun OSS and Dell ECS, should translate missing-object read failures into Iceberg NotFoundException consistently. I’ll address that upstream in Iceberg, but it depends on an Iceberg change and a later

So for this PR, I would keep the storage-neutral Gravitino fix, and use apache/iceberg#16299 as the upstream follow-up for the underlying FileIO/read-path behavior.
Metadata fast-fail was scoped to OSS, but missing metadata is storage-agnostic once a catalog exposes the latest metadata location. The wrapper now uses the configured FileIO.exists path for any metadata location and falls back only when the FileIO rejects that location, preserving REST cases where S3FileIO is configured while the memory catalog stores local metadata paths.

Constraint: Gravitino needs a direct fast-fail before an upstream Iceberg fix is released and adopted
Rejected: Keep the oss:// prefix check | misses S3, GCS, ADLS, and local metadata with the same stale-pointer failure mode
Rejected: Swallow FileIO initialization failures | hides invalid io-impl configuration
Confidence: high
Scope-risk: narrow
Directive: Keep fallback limited to location incompatibility; do not catch FileIO initialization or configuration failures
Tested: ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-common:test --tests org.apache.gravitino.iceberg.common.ops.TestIcebergCatalogWrapper -PskipITs
Tested: ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-common:test :iceberg:iceberg-rest-server:test -PskipITs
Not-tested: Full ./gradlew test
Related: apache/iceberg#16299
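The fallback rule in this commit message (fall back only on location incompatibility, never on FileIO initialization or configuration failures) can be illustrated with a minimal, self-contained sketch. It does not reproduce the actual wrapper code: `NeutralFileIO`, `NeutralInputFile`, `MetadataCheck`, and `StorageNeutralCheckSketch` are hypothetical stand-ins, and modeling a rejected location as `IllegalArgumentException` is an assumption made only for illustration.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins for Iceberg's InputFile and FileIO; only the two
// calls named in the commit message (newInputFile, exists) are modeled.
interface NeutralInputFile {
  boolean exists();
}

interface NeutralFileIO {
  NeutralInputFile newInputFile(String location);
}

// Outcome of the pre-check. UNKNOWN means "fall back to the normal load path".
enum MetadataCheck {
  PRESENT,
  MISSING,
  UNKNOWN
}

class StorageNeutralCheckSketch {
  static MetadataCheck check(Supplier<NeutralFileIO> fileIOSupplier, String metadataLocation) {
    // Deliberately outside the try block: a failure while creating or
    // configuring the FileIO propagates to the caller instead of being hidden.
    NeutralFileIO fileIO = fileIOSupplier.get();
    try {
      return fileIO.newInputFile(metadataLocation).exists()
          ? MetadataCheck.PRESENT
          : MetadataCheck.MISSING;
    } catch (IllegalArgumentException locationRejected) {
      // Assumption: a FileIO that cannot handle this location (for example an
      // S3-style FileIO given a local path) signals it with this exception type.
      return MetadataCheck.UNKNOWN;
    }
  }
}
```

In this sketch a caller would treat `MISSING` as the no-such-table case and treat `UNKNOWN` the same as having no pre-check at all, so an incompatible FileIO never turns an existing table into a missing one.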
What changes were proposed in this pull request?
This PR adds an OSS metadata fast-fail path in IcebergCatalogWrapper:
- Before loadTable enters Iceberg's full catalog load path, use SupportsMetadataLocation to get the table metadata file location.
- For oss:// metadata locations with a configured Iceberg FileIO, check newInputFile(location).exists() first.
- If the metadata file is missing, loadTable throws NoSuchTableException and tableExists returns false without calling catalog.loadTable.
Why are the changes needed?
A stale Iceberg catalog entry can point to an OSS metadata file that has already been removed. In that case, Iceberg's Aliyun OSS metadata read path may spend a long time retrying NoSuchKey before the failure surfaces. Gravitino table operations hold tree locks while dispatching those catalog calls, so the retry loop can make UI/API table operations appear stuck and can block related operations behind the same lock path.
Using the OSS exists() path lets Gravitino detect the missing metadata file before entering Iceberg's metadata read retry loop and return the normal no-such-table result quickly.
Fix: #11039
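The fast-fail flow described in this PR can be sketched roughly as follows. This is a simplified illustration, not the IcebergCatalogWrapper code: every type here (`SketchFileIO`, `SketchInputFile`, `SketchMetadataLocations`, `SketchNoSuchTableException`, `OssFastFailSketch`) is a hypothetical stand-in, and the full catalog load and existence check are passed in as plain functions.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-ins for Iceberg's FileIO/InputFile and for the metadata
// location lookup; the real interfaces are not reproduced here.
interface SketchInputFile {
  boolean exists();
}

interface SketchFileIO {
  SketchInputFile newInputFile(String location);
}

interface SketchMetadataLocations {
  Optional<String> metadataLocation(String tableName);
}

// Hypothetical exception standing in for the no-such-table error.
class SketchNoSuchTableException extends RuntimeException {
  SketchNoSuchTableException(String tableName) {
    super("Table does not exist: " + tableName);
  }
}

class OssFastFailSketch {
  private final SketchMetadataLocations locations;
  private final SketchFileIO fileIO;
  private final Function<String, String> fullLoad; // stands in for catalog.loadTable
  private final Predicate<String> fullExists; // stands in for the full existence check

  OssFastFailSketch(
      SketchMetadataLocations locations,
      SketchFileIO fileIO,
      Function<String, String> fullLoad,
      Predicate<String> fullExists) {
    this.locations = locations;
    this.fileIO = fileIO;
    this.fullLoad = fullLoad;
    this.fullExists = fullExists;
  }

  // True only when an oss:// metadata location is known and the file is gone.
  boolean ossMetadataMissing(String tableName) {
    Optional<String> location = locations.metadataLocation(tableName);
    if (!location.isPresent() || !location.get().startsWith("oss://")) {
      return false; // nothing to pre-check; let the normal path decide
    }
    return !fileIO.newInputFile(location.get()).exists();
  }

  String loadTable(String tableName) {
    if (ossMetadataMissing(tableName)) {
      // Fail fast without touching the full load path.
      throw new SketchNoSuchTableException(tableName);
    }
    return fullLoad.apply(tableName);
  }

  boolean tableExists(String tableName) {
    if (ossMetadataMissing(tableName)) {
      return false;
    }
    return fullExists.test(tableName);
  }
}
```

The point of the ordering is that a cheap existence check runs before the expensive load, so a stale entry never reaches the code path that retries on the missing object.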
Does this PR introduce any user-facing change?
No API or configuration changes.
For stale OSS-backed Iceberg table entries whose metadata file is already missing, table load/existence checks now fail fast with the existing no-such-table behavior instead of waiting for the full Iceberg metadata read retry loop.
How was this patch tested?
./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-common:test -PskipITs
./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-rest-server:test -PskipITs