Skip to content

Fixes #25666: Add retry-with-backoff to getServiceStatus for transient failure recovery#27112

Open
RajdeepKushwaha5 wants to merge 2 commits intoopen-metadata:mainfrom
RajdeepKushwaha5:fix/25666-airflow-status-retry-on-transient-errors
Open

Fixes #25666: Add retry-with-backoff to getServiceStatus for transient failure recovery#27112
RajdeepKushwaha5 wants to merge 2 commits intoopen-metadata:mainfrom
RajdeepKushwaha5:fix/25666-airflow-status-retry-on-transient-errors

Conversation

@RajdeepKushwaha5
Copy link
Copy Markdown
Contributor

Describe your changes:

Fixes #25666

Problem: PipelineServiceClient.getServiceStatus() delegates to getServiceStatusInternal() with zero retry logic. A single transient REST failure (e.g., selector manager closed during an Airflow restart or network blip) immediately returns an unhealthy response (code 500), causing OpenMetadata to mark the Airflow agent as UNAVAILABLE. The agent never auto-recovers — a manual UI refresh or service restart is required.

Root Cause: While a getServiceStatusBackoff() method with Resilience4j retry already existed in PipelineServiceClient, it was never called by any production code. Both callers (IngestionPipelineResource.getRESTStatus() and SystemRepository.getPipelineServiceClientValidation()) call getServiceStatus() directly, which had no retry.

Fix: Move the Resilience4j retry-with-backoff logic into getServiceStatus() itself, so every health check benefits from transient failure tolerance:

  • getServiceStatus() now retries up to 3 attempts with 5-second backoff intervals when the response code is not 200
  • getServiceStatusBackoff() is simplified to delegate to getServiceStatus() (avoids nested/double retry)
  • Removed the unused Supplier import

Test: Added getServiceStatusRecoversFromTransientFailure test that enqueues a transient 500 followed by a healthy 200 and verifies the final status is healthy with exactly 3 requests (detection + failed attempt + successful retry).

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

Copilot AI review requested due to automatic review settings April 7, 2026 02:51
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves ingestion agent health-check resilience by adding Resilience4j retry-with-backoff to PipelineServiceClient.getServiceStatus(), so transient pipeline-service failures (e.g., during Airflow restarts) don’t immediately mark the agent as unavailable.

Changes:

  • Moved retry-with-backoff logic into PipelineServiceClient.getServiceStatus() and simplified getServiceStatusBackoff() to delegate.
  • Updated retry to operate on PipelineServiceClientResponse (retrying when HTTP code is non-200).
  • Added an Airflow REST client test ensuring a transient 500 followed by 200 recovers successfully and issues the expected number of requests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
openmetadata-spec/src/main/java/org/openmetadata/service/clients/pipeline/PipelineServiceClient.java Centralizes retry-with-backoff in getServiceStatus() so all callers benefit from transient failure tolerance.
openmetadata-service/src/test/java/org/openmetadata/service/clients/pipeline/airflow/AirflowRESTClientTest.java Adds regression test verifying recovery from a transient health-check failure.

@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from 68791c8 to 9336512 Compare April 7, 2026 03:30
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot AI review requested due to automatic review settings April 7, 2026 04:54
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from 9336512 to f32ab33 Compare April 7, 2026 04:54
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from f32ab33 to 0d023dc Compare April 7, 2026 05:38
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot AI review requested due to automatic review settings April 7, 2026 05:49
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from 0d023dc to 7a38087 Compare April 7, 2026 05:49
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from 7a38087 to a6412ae Compare April 7, 2026 06:00
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot AI review requested due to automatic review settings April 7, 2026 06:12
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from a6412ae to 2676e07 Compare April 7, 2026 06:12
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot AI review requested due to automatic review settings April 8, 2026 00:39
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/25666-airflow-status-retry-on-transient-errors branch from f33984c to b0f9406 Compare April 8, 2026 00:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +208 to +214
/**
* Check the status of pipeline service to ensure it is healthy. Retries up to {@link
* #MAX_ATTEMPTS} times with backoff so that a single transient failure (e.g. Airflow restart,
* network blip) does not permanently mark the agent as unavailable. Responses with status codes
* &ge; 500 are retried; known non-transient 4xx responses (401, 403, 404, version mismatch) are
* returned immediately. Any other unexpected HTTP status code is passed through as-is and will
* not trigger retries.
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Javadoc implies that auth/plugin/version mismatch cases will come back as 4xx and therefore bypass retries, but getServiceStatus() only decides retries based on response.getCode() >= 500. Implementations that translate permanent config/auth/RBAC failures into buildUnhealthyStatus(...) (code 500) will still be retried and incur backoff delays. Consider updating the Javadoc to describe the actual contract (implementations must preserve non-retryable 4xx/422 codes), and/or add a protected isRetryable(PipelineServiceClientResponse) hook so clients can explicitly mark permanent failures as non-retryable.

Copilot uses AI. Check for mistakes.
Comment on lines +236 to +242
var retryConfig =
RetryConfig.<PipelineServiceClientResponse>custom()
.maxAttempts(MAX_ATTEMPTS)
.waitDuration(Duration.ofMillis(getRetryBackoffMillis()))
.retryOnResult(response -> response == null || response.getCode() >= 500)
.failAfterMaxAttempts(false)
.build();
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says getServiceStatus() retries when the response code is not 200, but the current predicate only retries on response == null or code >= 500 (and will pass through 4xx/429/etc without retry). Please either update the PR description / method docs to match the actual behavior, or broaden the retry predicate to include the intended retryable non-200 statuses (e.g., 429/408) if that was the goal.

Copilot uses AI. Check for mistakes.
Comment on lines 67 to +72
private static final Integer MAX_ATTEMPTS = 3;
private static final Integer BACKOFF_TIME_SECONDS = 5;
private static final long DEFAULT_BACKOFF_MILLIS = 5_000L;
private static final String DISABLED_STATUS = "disabled";

private volatile Retry serviceStatusRetry;

Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEFAULT_BACKOFF_MILLIS hardcodes a 5s sleep between retries, meaning getServiceStatus() can block for ~10s on repeated retryable failures. This will directly impact REST endpoints that call this method (e.g., validation/status checks) and can also slow down unit tests that exercise unhealthy paths unless they override getRetryBackoffMillis(). Consider making the backoff configurable via PipelineServiceClientConfiguration (or similar) so production can keep 5s while tests/interactive calls can use a smaller value.

Copilot uses AI. Check for mistakes.
@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 12, 2026

Code Review ✅ Approved 10 resolved / 10 findings

Adds retry-with-backoff to getServiceStatus for transient failure recovery, addressing 10 issues including null response handling, permanent error classification, and unnecessary retry delays. All findings resolved.

✅ 10 resolved
Bug: GetEntityVersionsTool calls non-existent CommonUtils.parseIntParam()

📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/GetEntityVersionsTool.java:51
At line 51, CommonUtils.parseIntParam(params.get("limit"), DEFAULT_LIMIT) is called, but the CommonUtils class in the MCP tools package does not define a parseIntParam() method. This will cause a compilation error.

Bug: CompareEntityVersionsTool NPE when pojoToJson returns null

📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java:115-117
JsonUtils.pojoToJson(null) returns null. At line 117, fromJson.equals(toJson) will throw a NullPointerException if either fromVal or toVal is null (e.g., a field not present in one version). This is a reachable condition since version comparisons commonly involve fields that were added or removed.

Bug: CompareEntityVersionsTool hardcodes table-specific fields

📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java:97-100
The computeDifferences method compares a hardcoded list of fields (columns, tableConstraints, tableType, etc.) that are specific to Table entities. When comparing other entity types (Pipeline, Dashboard, Topic, etc.), this tool will miss all entity-specific fields and only compare the few common ones (description, owners, tags, displayName). The tool's name and parameters suggest it is generic.

Edge Case: MCP tools NPE when exception message is null

📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java:90 📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/GetEntityVersionsTool.java:72
Both CompareEntityVersionsTool (line 90) and GetEntityVersionsTool (line 72) catch all exceptions and call Map.of("error", e.getMessage()). Map.of() does not permit null values, so if e.getMessage() is null (common for NPE, ClassCastException, etc.), this will throw a NullPointerException, masking the original error.

Quality: PR includes unrelated files (notes, scripts, issue templates)

📄 TOP_10_CONTRIBUTIONS.txt 📄 TOP_10_CONTRIBUTIONS_V2.txt 📄 issue-assignment-requests.md 📄 ISSUE_DESCRIPTION.md 📄 ISSUE_DESCRIPTION_RESOURCE_LEAKS.md 📄 ISSUE_SUBJECT_CONTEXT_SILENT_CATCHES.md 📄 PR_DESCRIPTION.md 📄 PR_DESCRIPTION_AIRFLOW_RETRY.md 📄 PR_DESCRIPTION_RESOURCE_LEAKS.md 📄 scripts/fix_basic.py 📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/CompareEntityVersionsTool.java 📄 openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/GetEntityVersionsTool.java
The PR includes 12 files that are unrelated to the stated fix (retry-with-backoff for getServiceStatus): TOP_10_CONTRIBUTIONS.txt, TOP_10_CONTRIBUTIONS_V2.txt, issue-assignment-requests.md, ISSUE_DESCRIPTION.md, ISSUE_DESCRIPTION_RESOURCE_LEAKS.md, ISSUE_SUBJECT_CONTEXT_SILENT_CATCHES.md, PR_DESCRIPTION*.md, scripts/fix_basic.py, and the two MCP tool files. These appear to be personal working notes and unrelated feature additions that should be in separate PRs. The fix_basic.py script contains a hardcoded Windows path (D:\OpenMetadata\...).

...and 5 more resolved from earlier reviews

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AirflowRESTClient marks agent UNAVAILABLE on transient error (selector manager closed) and does not auto-recover

3 participants