[#11891] fix(lance): retry repair-on-load metadata update on optimistic-lock conflict #11892

Open
yuqi1129 wants to merge 3 commits into apache:main from yuqi1129:fix-lance-repair-cas-11891
yuqi1129 commented Jul 3, 2026

What changes were proposed in this pull request?

LanceTableOperations.repairTableMetadata and recordCheckedEmptyVersion run an optimistic-locked EntityStore.update on every loadTable. When two loads repair the same table concurrently, the slower CAS matches zero rows and surfaces as IOException("Failed to update the entity"), which was rethrown as a fatal RuntimeException("Failed to repair table") (HTTP 500).

This wraps that update in a bounded CAS retry (updateTableWithCasRetry, 5 attempts): on conflict it re-reads the latest (already-repaired) entity and re-applies the idempotent updater, returning a usable table instead of failing.
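
A minimal sketch of the bounded CAS retry described above, assuming a hypothetical in-memory store with a version-checked update. `Entity`, `Store`, and `updateWithCasRetry` here are illustrative stand-ins, not the Gravitino classes or the code in this PR; only the 5-attempt bound and the conflict message come from the description.

```java
import java.io.IOException;
import java.util.function.UnaryOperator;

class CasRetrySketch {
  /** Hypothetical versioned entity; a stand-in, not Gravitino's table entity. */
  static final class Entity {
    final long version;
    final boolean repaired;

    Entity(long version, boolean repaired) {
      this.version = version;
      this.repaired = repaired;
    }
  }

  /** Hypothetical in-memory store with a version-checked (optimistic-lock) update. */
  static class Store {
    private Entity current = new Entity(0, false);

    synchronized Entity get() {
      return current;
    }

    synchronized Entity update(long expectedVersion, UnaryOperator<Entity> updater)
        throws IOException {
      if (current.version != expectedVersion) {
        // Lost CAS: someone else bumped the version after the caller's read.
        throw new IOException("Failed to update the entity");
      }
      Entity changed = updater.apply(current);
      current = new Entity(expectedVersion + 1, changed.repaired);
      return current;
    }
  }

  static final int MAX_ATTEMPTS = 5; // bound stated in the PR description

  /** Re-reads the latest entity and re-applies the idempotent updater on each conflict. */
  static Entity updateWithCasRetry(Store store, UnaryOperator<Entity> updater)
      throws IOException {
    IOException lastConflict = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      Entity latest = store.get();
      try {
        return store.update(latest.version, updater);
      } catch (IOException conflict) {
        // Simplification: this sketch treats every IOException as a CAS conflict.
        lastConflict = conflict;
      }
    }
    throw new IOException("Still conflicting after " + MAX_ATTEMPTS + " attempts", lastConflict);
  }

  public static void main(String[] args) throws IOException {
    UnaryOperator<Entity> repair = e -> new Entity(e.version, true); // idempotent updater

    // Simulate a rival load that wins the CAS right after our first read.
    Store store = new Store() {
      private boolean raced = false;

      @Override
      synchronized Entity get() {
        Entity snapshot = super.get();
        if (!raced) {
          raced = true;
          try {
            super.update(snapshot.version, repair);
          } catch (IOException ex) {
            throw new IllegalStateException(ex);
          }
        }
        return snapshot; // now stale on the first call
      }
    };

    Entity result = updateWithCasRetry(store, repair);
    System.out.println("version=" + result.version + " repaired=" + result.repaired);
  }
}
```

Running `main` prints `version=2 repaired=true`: the first attempt loses the CAS to the simulated rival, and the second attempt re-reads version 1 and succeeds. Because the updater is idempotent, re-applying it to an already-repaired entity is harmless.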

Why are the changes needed?

Concurrent repair-on-load races (e.g. Spark issuing parallel loads of the same table during planning and execution) intermittently fail table loads with HTTP 500. This shows up as flakiness in LanceSparkRESTServiceIT.testSelectFromEmptyTableViaSpark.

Fix: #11891

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added TestLanceTableOperations.testLoadTableSurvivesConcurrentRepairVersionRace: it models a lost CAS (first store.update throws the conflict IOException, the retry re-reads the winner's already-repaired entity) and asserts loadTable returns the repaired table. It fails before the fix and passes after. The full TestLanceTableOperations suite (23 tests) passes locally.
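
The lost-CAS scenario that the test models can be sketched with a hand-written fake, shown here without JUnit or Mockito so it runs on its own. `FakeStore` and `loadWithRepair` are hypothetical names for illustration; the actual test exercises `loadTable` against a stubbed `store.update`, and its details are not reproduced here.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class LostCasTestSketch {
  /** Hypothetical minimal store interface; the real EntityStore API is richer. */
  interface FakeStore {
    String get() throws IOException;

    String update(String repaired) throws IOException;
  }

  /** Hypothetical stand-in for the code under test: repair with a bounded retry. */
  static String loadWithRepair(FakeStore store, int maxAttempts) throws IOException {
    IOException last = null;
    for (int i = 0; i < maxAttempts; i++) {
      String latest = store.get(); // re-read on every attempt
      try {
        // Idempotent repair: an already-repaired value is written back unchanged.
        return store.update(latest.endsWith("+repaired") ? latest : latest + "+repaired");
      } catch (IOException conflict) {
        last = conflict;
      }
    }
    throw new IOException("Failed to repair table", last);
  }

  public static void main(String[] args) throws IOException {
    AtomicInteger updateCalls = new AtomicInteger();
    FakeStore store = new FakeStore() {
      private String state = "table";

      public String get() {
        return state;
      }

      public String update(String repaired) throws IOException {
        if (updateCalls.incrementAndGet() == 1) {
          state = "table+repaired"; // the rival load's write landed first
          throw new IOException("Failed to update the entity");
        }
        state = repaired;
        return state;
      }
    };

    String loaded = loadWithRepair(store, 5);
    if (!loaded.equals("table+repaired")) throw new AssertionError(loaded);
    if (updateCalls.get() != 2) throw new AssertionError("expected exactly one retry");
    System.out.println("loaded=" + loaded + " updateCalls=" + updateCalls.get());
  }
}
```

Without the retry loop (equivalent to `maxAttempts = 1`), the first conflict would propagate as "Failed to repair table", which matches the before-the-fix behaviour the description reports.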

…timistic-lock conflict

repairTableMetadata and recordCheckedEmptyVersion run an optimistic-locked
EntityStore.update on every loadTable. Concurrent loads repairing the same table
race on the version CAS; the loser gets IOException("Failed to update the
entity"), which was rethrown as a fatal RuntimeException (HTTP 500).

Wrap the update in a bounded CAS retry (updateTableWithCasRetry): on conflict,
re-read the latest already-repaired entity and re-apply the idempotent updater,
returning a usable table instead of failing the load.

Fix: apache#11891
Signed-off-by: yuqi <yuqi@datastrato.com>
Copilot AI review requested due to automatic review settings July 3, 2026 08:50

Copilot AI left a comment

Pull request overview

This PR improves Lance table loadTable robustness under concurrent “repair-on-load” by adding a bounded retry loop around optimistic-lock (EntityStore.update) failures, preventing intermittent HTTP 500s when concurrent loads race to repair the same table metadata.

Changes:

  • Add updateTableWithCasRetry (max 5 attempts) and route repair-on-load updates through it to tolerate lost optimistic-lock CAS races.
  • Apply the retry wrapper to both repairTableMetadata and recordCheckedEmptyVersion.
  • Add a unit test reproducing the CAS-loss scenario and asserting loadTable returns a usable, repaired table instead of failing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| catalogs/catalog-lakehouse-generic/src/main/java/org/apache/gravitino/catalog/lakehouse/lance/LanceTableOperations.java | Introduces a bounded retry for optimistic-lock update conflicts during repair-on-load paths. |
| catalogs/catalog-lakehouse-generic/src/test/java/org/apache/gravitino/catalog/lakehouse/lance/TestLanceTableOperations.java | Adds a regression test that models a lost CAS on the first update and validates recovery via retry. |

github-actions Bot commented Jul 3, 2026

Code Coverage Report

Overall Project 67.45% -0.03% 🟢
Files changed 60.36% 🟢

| Module | Coverage |
| --- | --- |
| aliyun | 1.72% 🔴 |
| api | 46.52% 🟢 |
| authorization-common | 85.96% 🟢 |
| aws | 42.04% 🟢 |
| azure | 2.47% 🔴 |
| catalog-common | 9.92% 🔴 |
| catalog-fileset | 80.23% 🟢 |
| catalog-glue | 66.91% 🟢 |
| catalog-hive | 79.42% 🟢 |
| catalog-jdbc-clickhouse | 80.55% 🟢 |
| catalog-jdbc-common | 44.22% 🟢 |
| catalog-jdbc-doris | 80.28% 🟢 |
| catalog-jdbc-hologres | 54.03% 🟢 |
| catalog-jdbc-mysql | 79.23% 🟢 |
| catalog-jdbc-oceanbase | 80.91% 🟢 |
| catalog-jdbc-postgresql | 82.29% 🟢 |
| catalog-jdbc-starrocks | 78.51% 🟢 |
| catalog-kafka | 77.01% 🟢 |
| catalog-lakehouse-generic | 59.12% +1.8% 🟢 |
| catalog-lakehouse-hudi | 79.1% 🟢 |
| catalog-lakehouse-iceberg | 85.86% 🟢 |
| catalog-lakehouse-paimon | 84.25% 🟢 |
| catalog-model | 77.72% 🟢 |
| cli | 44.51% 🟢 |
| client-java | 78.01% 🟢 |
| common | 50.17% 🟢 |
| core | 82.59% 🟢 |
| filesystem-hadoop3 | 77.3% 🟢 |
| flink | 0.0% 🔴 |
| flink-common | 47.09% 🟢 |
| flink-runtime | 0.0% 🔴 |
| gcp | 14.12% 🔴 |
| hadoop-auth | 66.67% 🟢 |
| hadoop-common | 12.7% 🔴 |
| hive-metastore-common | 53.29% 🟢 |
| iceberg-common | 58.3% 🟢 |
| iceberg-rest-server | 73.94% 🟢 |
| idp-basic | 85.71% 🟢 |
| integration-test-common | 0.0% 🔴 |
| jobs | 66.17% 🟢 |
| lance-common | 20.81% 🔴 |
| lance-rest-server | 64.84% 🟢 |
| lineage | 53.02% 🟢 |
| optimizer | 83.24% 🟢 |
| optimizer-api | 21.95% 🔴 |
| server | 85.94% 🟢 |
| server-common | 74.62% 🟢 |
| spark | 28.57% 🔴 |
| spark-common | 46.01% 🟢 |
| tencent | 69.84% 🟢 |
| trino-connector | 40.29% 🟢 |

Files

| Module | File | Coverage |
| --- | --- | --- |
| catalog-lakehouse-generic | LanceTableOperations.java | 60.36% 🟢 |

@jerryshao

Why has this been happening so frequently recently?

yuqi1129 commented Jul 3, 2026

> Why has this been happening so frequently recently?

I had not seen this problem before today, and I will keep following it over the next few weeks. In theory, it should not happen this frequently.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
yuqi1129 closed this Jul 4, 2026
yuqi1129 reopened this Jul 4, 2026
Development

Successfully merging this pull request may close these issues.

[Bug report] Lance loadTable throws "Failed to repair table" when concurrent repair-on-load loses the optimistic-lock CAS

3 participants