fix(async-job): propagate wrapped FailureType on async job aggregation#994
Draft
devin-ai-integration[bot] wants to merge 1 commit intomainfrom
Draft
fix(async-job): propagate wrapped FailureType on async job aggregation#994devin-ai-integration[bot] wants to merge 1 commit intomainfrom
devin-ai-integration[bot] wants to merge 1 commit intomainfrom
Conversation
Replace hardcoded system_error in AsyncJobOrchestrator's aggregated failure with the highest-precedence FailureType among wrapped non-breaking exceptions (config_error > transient_error > system_error). The user-facing message is chosen per FailureType to stay deterministic; underlying failure-type counts and exception reprs are moved into internal_message. Co-Authored-By: bot_apk <apk@cognition.ai>
Contributor
Author
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1776776408-async-job-failure-type-propagation#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1776776408-async-job-failure-type-propagationPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/16221
Related to https://github.com/airbytehq/oncall/issues/12043
Summary
AsyncJobOrchestrator.create_and_get_completed_partitions()wraps every failed partition into a single aggregatedAirbyteTracedExceptionwithFailureType.system_errorand the generic message"One or more async jobs failed after exhausting all retry attempts."That hides the true underlying failure type — typicallytransient_error(HTTP 429) orconfig_error(HTTP 403) — so oncall triage sees a platform-stylesystem_errorfor what is usually source-side throttling or a customer-side permission issue.This change:
FailureTypefrom the wrapped_non_breaking_exceptions, using precedenceconfig_error>transient_error>system_error. If any wrapped exception is aconfig_error, the sync stops being treated as an internal bug and is correctly surfaced as a user-actionable failure. Transient failures similarly propagate so retry behavior at the platform level is no longer suppressed.messagewith a small deterministic set keyed by dominantFailureType. Keepingmessagedeterministic preserves its usefulness as a log aggregation key (per writing-good-error-messages).transient_error=5, system_error=2) plus raw exception reprs intointernal_messageso operators still get the detail they need without polluting the user-facing text._aggregate_failure_type,_count_failure_types) covering precedence, mixed traced/plain exceptions, and default fallback.No behavioral change when all wrapped exceptions are
system_error— existing testtest_given_exception_when_start_job_and_skip_this_exceptionstill assertsFailureType.system_errorand passes unchanged.Review & Testing Checklist for Human
FailureTypereads well in product (surfaced in Airbyte UI / emails). The three variants are:Async jobs failed because the source API rejected the request as unauthorized or forbidden.(config_error)Async jobs failed after exhausting retries for source API rate limit or transient errors.(transient_error)Async jobs failed after exhausting retry attempts.(system_error — unchanged baseline)FailureTypefromsystem_error→transient_error(when wrapped exceptions are transient) does not unexpectedly enable platform-level retries for connectors where that behavior is undesirable. Expected: platform will now correctly retry instead of failing hard on transient causes, which matches the reporter's intent. Callers that relied on the oldsystem_errorclassification for ops alerting should be double-checked.source-amazon-seller-partner, any other declarativeAsyncRetrieveruser) that may assert on the exact old message string in integration tests.Notes
create_and_get_completed_partitions()is changed. The per-partition wrapper inside_process_partitions_with_errors(which also hardcodessystem_error) is left alone because at that call site the orchestrator does not have visibility into the underlying job-level cause; improving that path would require a larger refactor of howAsyncJobcarries failure context. Filed as a follow-up consideration rather than rolled into this PR.FailureTypereclassification can alter retry behavior at the platform level but that is the intended fix, andsystem_error→transient_errorenables retries which is safe.system_error→config_errorstops retries, which is correct when the wrapped cause is genuinely a permission/auth failure.Link to Devin session: https://app.devin.ai/sessions/bda7c7c9b3ea47f5b51b848569e8f397