Fixes #25666: Add retry-with-backoff to getServiceStatus for transient failure recovery#27112
Conversation
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
This PR improves ingestion agent health-check resilience by adding Resilience4j retry-with-backoff to PipelineServiceClient.getServiceStatus(), so transient pipeline-service failures (e.g., during Airflow restarts) don’t immediately mark the agent as unavailable.
Changes:
- Moved retry-with-backoff logic into
PipelineServiceClient.getServiceStatus()and simplifiedgetServiceStatusBackoff()to delegate. - Updated retry to operate on
PipelineServiceClientResponse(retrying when HTTP code is non-200). - Added an Airflow REST client test ensuring a transient 500 followed by 200 recovers successfully and issues the expected number of requests.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| openmetadata-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java | Centralizes retry-with-backoff in getServiceStatus() so all callers benefit from transient failure tolerance. |
| openmetadata-service/src/test/java/org/openmetadata/service/clients/pipeline/airflow/AirflowRESTClientTest.java | Adds regression test verifying recovery from a transient health-check failure. |
...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Outdated
Show resolved
Hide resolved
...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Outdated
Show resolved
Hide resolved
...e/src/test/java/org/openmetadata/service/clients/pipeline/airflow/AirflowRESTClientTest.java
Show resolved
Hide resolved
68791c8 to
9336512
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/GetEntityVersionsTool.java
Outdated
Show resolved
Hide resolved
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java
Outdated
Show resolved
Hide resolved
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java
Outdated
Show resolved
Hide resolved
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java
Outdated
Show resolved
Hide resolved
9336512 to
f32ab33
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Outdated
Show resolved
Hide resolved
...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Outdated
Show resolved
Hide resolved
...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Outdated
Show resolved
Hide resolved
f32ab33 to
0d023dc
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Outdated
Show resolved
Hide resolved
0d023dc to
7a38087
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Outdated
Show resolved
Hide resolved
7a38087 to
a6412ae
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Outdated
Show resolved
Hide resolved
a6412ae to
2676e07
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
ecd31ba to
f33984c
Compare
...data-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java
Outdated
Show resolved
Hide resolved
… for transient failure recovery
f33984c to
b0f9406
Compare
| /** | ||
| * Check the status of pipeline service to ensure it is healthy. Retries up to {@link | ||
| * #MAX_ATTEMPTS} times with backoff so that a single transient failure (e.g. Airflow restart, | ||
| * network blip) does not permanently mark the agent as unavailable. Responses with status codes | ||
| * ≥ 500 are retried; known non-transient 4xx responses (401, 403, 404, version mismatch) are | ||
| * returned immediately. Any other unexpected HTTP status code is passed through as-is and will | ||
| * not trigger retries. |
There was a problem hiding this comment.
The Javadoc implies that auth/plugin/version mismatch cases will come back as 4xx and therefore bypass retries, but getServiceStatus() only decides retries based on response.getCode() >= 500. Implementations that translate permanent config/auth/RBAC failures into buildUnhealthyStatus(...) (code 500) will still be retried and incur backoff delays. Consider updating the Javadoc to describe the actual contract (implementations must preserve non-retryable 4xx/422 codes), and/or add a protected isRetryable(PipelineServiceClientResponse) hook so clients can explicitly mark permanent failures as non-retryable.
| var retryConfig = | ||
| RetryConfig.<PipelineServiceClientResponse>custom() | ||
| .maxAttempts(MAX_ATTEMPTS) | ||
| .waitDuration(Duration.ofMillis(getRetryBackoffMillis())) | ||
| .retryOnResult(response -> response == null || response.getCode() >= 500) | ||
| .failAfterMaxAttempts(false) | ||
| .build(); |
There was a problem hiding this comment.
The PR description says getServiceStatus() retries when the response code is not 200, but the current predicate only retries on response == null or code >= 500 (and will pass through 4xx/429/etc without retry). Please either update the PR description / method docs to match the actual behavior, or broaden the retry predicate to include the intended retryable non-200 statuses (e.g., 429/408) if that was the goal.
| private static final Integer MAX_ATTEMPTS = 3; | ||
| private static final Integer BACKOFF_TIME_SECONDS = 5; | ||
| private static final long DEFAULT_BACKOFF_MILLIS = 5_000L; | ||
| private static final String DISABLED_STATUS = "disabled"; | ||
|
|
||
| private volatile Retry serviceStatusRetry; | ||
|
|
There was a problem hiding this comment.
DEFAULT_BACKOFF_MILLIS hardcodes a 5s sleep between retries, meaning getServiceStatus() can block for ~10s on repeated retryable failures. This will directly impact REST endpoints that call this method (e.g., validation/status checks) and can also slow down unit tests that exercise unhealthy paths unless they override getRetryBackoffMillis(). Consider making the backoff configurable via PipelineServiceClientConfiguration (or similar) so production can keep 5s while tests/interactive calls can use a smaller value.
Code Review ✅ Approved 10 resolved / 10 findingsAdds retry-with-backoff to getServiceStatus for transient failure recovery, addressing 10 issues including null response handling, permanent error classification, and unnecessary retry delays. All findings resolved. ✅ 10 resolved✅ Bug: GetEntityVersionsTool calls non-existent CommonUtils.parseIntParam()
✅ Bug: CompareEntityVersionsTool NPE when pojoToJson returns null
✅ Bug: CompareEntityVersionsTool hardcodes table-specific fields
✅ Edge Case: MCP tools NPE when exception message is null
✅ Quality: PR includes unrelated files (notes, scripts, issue templates)
...and 5 more resolved from earlier reviews OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|



Describe your changes:
Fixes #25666
Problem:
PipelineServiceClient.getServiceStatus()delegates togetServiceStatusInternal()with zero retry logic. A single transient REST failure (e.g.,selector manager closedduring an Airflow restart or network blip) immediately returns an unhealthy response (code 500), causing OpenMetadata to mark the Airflow agent asUNAVAILABLE. The agent never auto-recovers — a manual UI refresh or service restart is required.Root Cause: While a
getServiceStatusBackoff()method with Resilience4j retry already existed inPipelineServiceClient, it was never called by any production code. Both callers (IngestionPipelineResource.getRESTStatus()andSystemRepository.getPipelineServiceClientValidation()) callgetServiceStatus()directly, which had no retry.Fix: Move the Resilience4j retry-with-backoff logic into
getServiceStatus()itself, so every health check benefits from transient failure tolerance:getServiceStatus()now retries up to 3 attempts with 5-second backoff intervals when the response code is not 200getServiceStatusBackoff()is simplified to delegate togetServiceStatus()(avoids nested/double retry)SupplierimportTest: Added
getServiceStatusRecoversFromTransientFailuretest that enqueues a transient 500 followed by a healthy 200 and verifies the final status is healthy with exactly 3 requests (detection + failed attempt + successful retry).Type of change:
Checklist:
Fixes <issue-number>: <short explanation>