[#11891] fix(lance): retry repair-on-load metadata update on optimistic-lock conflict by yuqi1129 · Pull Request #11892 · apache/gravitino

yuqi1129 · 2026-07-03T08:50:18Z

What changes were proposed in this pull request?

LanceTableOperations.repairTableMetadata and recordCheckedEmptyVersion run an optimistic-locked EntityStore.update on every loadTable. When two loads repair the same table concurrently, the slower CAS matches zero rows and surfaces as IOException("Failed to update the entity"), which was rethrown as a fatal RuntimeException("Failed to repair table") (HTTP 500).

This wraps that update in a bounded CAS retry (updateTableWithCasRetry, 5 attempts): on conflict it re-reads the latest (already-repaired) entity and re-applies the idempotent updater, returning a usable table instead of failing.
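The bounded retry described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `VersionedTable`, `InMemoryStore`, and `CasRetrySketch.updateWithCasRetry` are hypothetical stand-ins, the real `EntityStore.update` has a different signature, and this sketch treats every `IOException` as a conflict, which the real implementation may not do.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a versioned table entity (not Gravitino's TableEntity).
final class VersionedTable {
  final long version;
  final boolean repaired;

  VersionedTable(long version, boolean repaired) {
    this.version = version;
    this.repaired = repaired;
  }
}

// Hypothetical in-memory store; the real EntityStore.update has a different signature.
final class InMemoryStore {
  private final AtomicReference<VersionedTable> current =
      new AtomicReference<>(new VersionedTable(0, false));

  VersionedTable get() {
    return current.get();
  }

  // Optimistic-locked update: commits only if `expected` is still the stored entity.
  VersionedTable update(VersionedTable expected, UnaryOperator<VersionedTable> updater)
      throws IOException {
    VersionedTable changed = updater.apply(expected);
    VersionedTable next = new VersionedTable(expected.version + 1, changed.repaired);
    if (!current.compareAndSet(expected, next)) {
      // Models the CAS that matches zero rows.
      throw new IOException("Failed to update the entity");
    }
    return next;
  }
}

final class CasRetrySketch {
  static final int MAX_ATTEMPTS = 5; // the bound stated in the PR description

  static VersionedTable updateWithCasRetry(
      InMemoryStore store, UnaryOperator<VersionedTable> updater) throws IOException {
    IOException lastConflict = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      VersionedTable latest = store.get(); // re-read the latest entity on every attempt
      try {
        return store.update(latest, updater);
      } catch (IOException conflict) {
        lastConflict = conflict; // lost the race; retry against the winner's state
      }
    }
    throw lastConflict; // still conflicting after MAX_ATTEMPTS
  }
}
```

The retry is only safe because the updater is idempotent: re-applying it to an entity that another load already repaired produces the same repaired state.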

Why are the changes needed?

Concurrent repair-on-load races (e.g. Spark issuing parallel loads during planning and execution) intermittently fail table loads with HTTP 500. This shows up as the flaky test LanceSparkRESTServiceIT.testSelectFromEmptyTableViaSpark.

Fix: #11891

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added TestLanceTableOperations.testLoadTableSurvivesConcurrentRepairVersionRace: it models a lost CAS (first store.update throws the conflict IOException, the retry re-reads the winner's already-repaired entity) and asserts loadTable returns the repaired table. It fails before the fix and passes after. The full TestLanceTableOperations suite (23 tests) passes locally.
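The lost-CAS scenario the test models can be pictured with a hand-written fake. The sketch below is an assumption-laden illustration, not the PR's test: `LostCasStore`, `LostCasScenario`, and the string-valued "entity" are hypothetical, and the real test may well use a mocking library instead.

```java
import java.io.IOException;
import java.util.function.UnaryOperator;

// Hypothetical fake store: the first update loses the CAS, later updates succeed.
final class LostCasStore {
  private String state = null; // null models metadata that still needs repair
  private int updateCalls = 0;

  int updateCalls() {
    return updateCalls;
  }

  String update(UnaryOperator<String> updater) throws IOException {
    updateCalls++;
    if (updateCalls == 1) {
      // A concurrent load wins the race and commits its repair first,
      // so this caller's CAS matches zero rows.
      state = "repaired";
      throw new IOException("Failed to update the entity");
    }
    state = updater.apply(state); // the retry sees the winner's repaired entity
    return state;
  }
}

final class LostCasScenario {
  // Idempotent updater: an already-repaired entity is returned unchanged.
  static String repair(String current) {
    return current == null ? "repaired" : current;
  }

  // Hypothetical load path: retries the update a bounded number of times.
  static String loadWithRetry(LostCasStore store, int maxAttempts) throws IOException {
    IOException last = new IOException("no attempts made");
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return store.update(LostCasScenario::repair);
      } catch (IOException conflict) {
        last = conflict;
      }
    }
    throw last;
  }
}
```

With a single attempt allowed, `loadWithRetry` propagates the conflict, which mirrors the pre-fix failure; with more attempts it returns the repaired state after the second update call.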

…timistic-lock conflict

repairTableMetadata and recordCheckedEmptyVersion run an optimistic-locked EntityStore.update on every loadTable. Concurrent loads repairing the same table race on the version CAS; the loser gets IOException("Failed to update the entity"), which was rethrown as a fatal RuntimeException (HTTP 500).

Wrap the update in a bounded CAS retry (updateTableWithCasRetry): on conflict, re-read the latest already-repaired entity and re-apply the idempotent updater, returning a usable table instead of failing the load.

Fix: apache#11891

Signed-off-by: yuqi <yuqi@datastrato.com>

Copilot

Pull request overview

This PR improves Lance table loadTable robustness under concurrent “repair-on-load” by adding a bounded retry loop around optimistic-lock (EntityStore.update) failures, preventing intermittent HTTP 500s when concurrent loads race to repair the same table metadata.

Changes:

- Add updateTableWithCasRetry (max 5 attempts) and route repair-on-load updates through it to tolerate lost optimistic-lock CAS races.
- Apply the retry wrapper to both repairTableMetadata and recordCheckedEmptyVersion.
- Add a unit test reproducing the CAS-loss scenario and asserting loadTable returns a usable, repaired table instead of failing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`catalogs/catalog-lakehouse-generic/src/main/java/org/apache/gravitino/catalog/lakehouse/lance/LanceTableOperations.java`	Introduces a bounded retry for optimistic-lock update conflicts during repair-on-load paths.
`catalogs/catalog-lakehouse-generic/src/test/java/org/apache/gravitino/catalog/lakehouse/lance/TestLanceTableOperations.java`	Adds a regression test that models a lost CAS on the first update and validates recovery via retry.

github-actions · 2026-07-03T10:08:02Z

Code Coverage Report

Overall Project	67.45% `-0.03%`	🟢
Files changed	60.36%	🟢

Module	Coverage
aliyun	1.72%	🔴
api	46.52%	🟢
authorization-common	85.96%	🟢
aws	42.04%	🟢
azure	2.47%	🔴
catalog-common	9.92%	🔴
catalog-fileset	80.23%	🟢
catalog-glue	66.91%	🟢
catalog-hive	79.42%	🟢
catalog-jdbc-clickhouse	80.55%	🟢
catalog-jdbc-common	44.22%	🟢
catalog-jdbc-doris	80.28%	🟢
catalog-jdbc-hologres	54.03%	🟢
catalog-jdbc-mysql	79.23%	🟢
catalog-jdbc-oceanbase	80.91%	🟢
catalog-jdbc-postgresql	82.29%	🟢
catalog-jdbc-starrocks	78.51%	🟢
catalog-kafka	77.01%	🟢
catalog-lakehouse-generic	59.12% `+1.8%`	🟢
catalog-lakehouse-hudi	79.1%	🟢
catalog-lakehouse-iceberg	85.86%	🟢
catalog-lakehouse-paimon	84.25%	🟢
catalog-model	77.72%	🟢
cli	44.51%	🟢
client-java	78.01%	🟢
common	50.17%	🟢
core	82.59%	🟢
filesystem-hadoop3	77.3%	🟢
flink	0.0%	🔴
flink-common	47.09%	🟢
flink-runtime	0.0%	🔴
gcp	14.12%	🔴
hadoop-auth	66.67%	🟢
hadoop-common	12.7%	🔴
hive-metastore-common	53.29%	🟢
iceberg-common	58.3%	🟢
iceberg-rest-server	73.94%	🟢
idp-basic	85.71%	🟢
integration-test-common	0.0%	🔴
jobs	66.17%	🟢
lance-common	20.81%	🔴
lance-rest-server	64.84%	🟢
lineage	53.02%	🟢
optimizer	83.24%	🟢
optimizer-api	21.95%	🔴
server	85.94%	🟢
server-common	74.62%	🟢
spark	28.57%	🔴
spark-common	46.01%	🟢
tencent	69.84%	🟢
trino-connector	40.29%	🟢

Files

Module	File	Coverage
catalog-lakehouse-generic	LanceTableOperations.java	60.36%	🟢

jerryshao · 2026-07-03T10:31:59Z

Why has this been happening so frequently recently?

yuqi1129 · 2026-07-03T11:16:32Z

> Why has this been happening so frequently recently?

I hadn't seen this problem before today, and I will keep following up on it over the next few weeks. In theory, it should not happen this frequently.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings July 3, 2026 08:50

Copilot started reviewing on behalf of yuqi1129 July 3, 2026 08:50

Copilot AI reviewed Jul 3, 2026

jerryshao assigned yuqi1129 Jul 3, 2026

[apache#11891] fix(lance): address repair retry review comments

d79c7b2

yuqi1129 requested a review from Copilot July 3, 2026 11:24

Copilot started reviewing on behalf of yuqi1129 July 3, 2026 11:24

Copilot AI reviewed Jul 3, 2026

Comment thread ...generic/src/main/java/org/apache/gravitino/catalog/lakehouse/lance/LanceTableOperations.java

Potential fix for pull request finding

b41e0d0

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

yuqi1129 closed this Jul 4, 2026

yuqi1129 reopened this Jul 4, 2026

yuqi1129 wants to merge 3 commits into apache:main from yuqi1129:fix-lance-repair-cas-11891

3 participants
