[#11039] fix(iceberg): Fast fail missing OSS table metadata #11042

Open
atovk wants to merge 3 commits into apache:main from atovk:codex/fast-fail-aliyun-oss-nosuchkey

Conversation

atovk commented May 11, 2026

What changes were proposed in this pull request?

This PR adds an OSS metadata fast-fail path in IcebergCatalogWrapper:

  • Before loadTable enters Iceberg's full catalog load path, use SupportsMetadataLocation to get the table metadata file location.
  • For oss:// metadata locations with a configured Iceberg FileIO, check newInputFile(location).exists() first.
  • If the OSS metadata file is missing, loadTable throws NoSuchTableException and tableExists returns false without calling catalog.loadTable.
  • Add regression tests that verify missing OSS metadata does not enter the catalog load path.
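The short-circuit described above can be sketched as follows. The `FileIO` and `InputFile` interfaces here are minimal stand-ins for Iceberg's, and `metadataMissing` is a hypothetical helper name, not the actual wrapper code:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins for Iceberg's FileIO/InputFile; illustrative only.
interface InputFile { boolean exists(); }
interface FileIO { InputFile newInputFile(String location); }

public class MetadataFastFailSketch {
  // Returns true when the metadata file is provably missing, so the caller
  // can throw NoSuchTableException (or return false from tableExists)
  // without entering the full catalog load and its retry loop.
  static boolean metadataMissing(FileIO io, String metadataLocation) {
    if (io == null || metadataLocation == null) {
      return false; // no cheap check possible; fall through to catalog load
    }
    return !io.newInputFile(metadataLocation).exists();
  }

  public static void main(String[] args) {
    // In-memory FileIO: only "oss://bucket/meta/v2.json" exists.
    Map<String, Boolean> objects = new HashMap<>();
    objects.put("oss://bucket/meta/v2.json", true);
    FileIO io = loc -> () -> objects.getOrDefault(loc, false);

    System.out.println(metadataMissing(io, "oss://bucket/meta/v2.json")); // false
    System.out.println(metadataMissing(io, "oss://bucket/meta/v1.json")); // true
  }
}
```

When the helper returns false (the file exists, or no check is possible), the wrapper proceeds with the normal `catalog.loadTable` path.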

Why are the changes needed?

A stale Iceberg catalog entry can point to an OSS metadata file that has already been removed. In that case, Iceberg's Aliyun OSS metadata read path may spend a long time retrying NoSuchKey before the failure surfaces. Gravitino table operations hold tree locks while dispatching those catalog calls, so the retry loop can make UI/API table operations appear stuck and can block related operations behind the same lock path.

Using the OSS exists() path lets Gravitino detect the missing metadata file before entering Iceberg's metadata read retry loop and return the normal no-such-table result quickly.

Fix: #11039

Does this PR introduce any user-facing change?

No API or configuration changes.

For stale OSS-backed Iceberg table entries whose metadata file is already missing, table load/existence checks now fail fast with the existing no-such-table behavior instead of waiting for the full Iceberg metadata read retry loop.

How was this patch tested?

  • JDK 17: ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-common:test -PskipITs
  • JDK 17: ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-rest-server:test -PskipITs

A stale Iceberg catalog entry can point at an OSS metadata file that no longer exists. Loading that table through the Aliyun OSS path may spend a long time in Iceberg's metadata read retry loop while Gravitino table operations still hold tree locks.

Use SupportsMetadataLocation to check OSS metadata with the configured Iceberg FileIO before loadTable and tableExists enter the full catalog load path. Missing OSS metadata now returns the normal no-such-table result without calling catalog.loadTable.

Tests:

- JAVA_HOME=/Users/nullwo/.gradle/jdks/amazon_com_inc_-17-aarch64-os_x/amazon-corretto-17.jdk/Contents/Home ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-common:test -PskipITs

- JAVA_HOME=/Users/nullwo/.gradle/jdks/amazon_com_inc_-17-aarch64-os_x/amazon-corretto-17.jdk/Contents/Home ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-rest-server:test -PskipITs
roryqi (Contributor) commented May 12, 2026

Should we fix this issue in the Iceberg side instead of Gravitino side?

atovk (Author) commented May 12, 2026

> Should we fix this issue in the Iceberg side instead of Gravitino side?

I think we should keep the fix on the Gravitino side for this PR, and I will also follow up on the Iceberg side via apache/iceberg#16299.

Gravitino already knows the table metadata location through SupportsMetadataLocation, so it can short-circuit before entering Iceberg's metadata read path. That gives us a fast and consistent failure for loadTable and tableExists, instead of waiting for Iceberg's metadata read retry loop when the metadata file is missing.

The Iceberg issue is still valid and complementary: some FileIO read paths, such as Aliyun OSS and Dell ECS, should translate missing-object read failures into Iceberg's NotFoundException consistently. I'll address that upstream in Iceberg, but it depends on an Iceberg change and a later dependency upgrade in Gravitino.

So for this PR, I would keep the storage-neutral Gravitino fix, and use apache/iceberg#16299 as the upstream follow-up for the underlying FileIO/read-path behavior.

atovk and others added 2 commits May 12, 2026 20:27
Metadata fast-fail was scoped to OSS, but missing metadata is storage-agnostic once a catalog exposes the latest metadata location. The wrapper now uses the configured FileIO.exists path for any metadata location and falls back only when the FileIO rejects that location, preserving REST cases where S3FileIO is configured while the memory catalog stores local metadata paths.

Constraint: Gravitino needs a direct fast-fail before an upstream Iceberg fix is released and adopted
Rejected: Keep the oss:// prefix check | misses S3, GCS, ADLS, and local metadata with the same stale-pointer failure mode
Rejected: Swallow FileIO initialization failures | hides invalid io-impl configuration
Confidence: high
Scope-risk: narrow
Directive: Keep fallback limited to location incompatibility; do not catch FileIO initialization or configuration failures
Tested: ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-common:test --tests org.apache.gravitino.iceberg.common.ops.TestIcebergCatalogWrapper -PskipITs
Tested: ./gradlew --no-daemon --max-workers=1 :iceberg:iceberg-common:test :iceberg:iceberg-rest-server:test -PskipITs
Not-tested: Full ./gradlew test
Related: apache/iceberg#16299
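The fallback directive above can be sketched as follows: only a location the configured FileIO rejects defers to the normal catalog load, while initialization or configuration failures propagate. `LocationRejectedException` and `Prober` are illustrative stand-ins, not real Iceberg or Gravitino types:

```java
// Hypothetical stand-in for whatever the configured FileIO throws when
// handed an unsupported scheme (e.g. S3FileIO given a local path).
class LocationRejectedException extends RuntimeException {}

interface Prober { boolean exists(String location); }

public class FallbackSketch {
  // true  -> metadata file confirmed present, proceed with load
  // false -> metadata file confirmed absent, fast-fail with no-such-table
  // null  -> FileIO cannot speak for this location; defer to catalog load
  static Boolean probeMetadata(Prober prober, String location) {
    try {
      return prober.exists(location);
    } catch (LocationRejectedException e) {
      return null; // scheme mismatch: fall back, do not fail the lookup
    }
    // Deliberately no catch-all: FileIO initialization/configuration
    // failures must stay visible rather than being swallowed.
  }

  public static void main(String[] args) {
    // A prober that only understands s3:// and knows about v2.json.
    Prober s3Only = loc -> {
      if (!loc.startsWith("s3://")) throw new LocationRejectedException();
      return loc.endsWith("v2.json");
    };
    System.out.println(probeMetadata(s3Only, "s3://b/meta/v2.json")); // true
    System.out.println(probeMetadata(s3Only, "s3://b/meta/v1.json")); // false
    System.out.println(probeMetadata(s3Only, "/tmp/meta/v0.json"));   // null
  }
}
```

This preserves the REST case described in the commit message, where S3FileIO is configured but the memory catalog stores local metadata paths.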


Development

Successfully merging this pull request may close these issues.

[Bug] Iceberg REST retries Aliyun OSS NoSuchKey and holds Gravitino tree locks for missing metadata files

2 participants