fix(async-job): propagate wrapped FailureType on async job aggregation#994

Draft
devin-ai-integration[bot] wants to merge 1 commit into main from
devin/1776776408-async-job-failure-type-propagation

Conversation

@devin-ai-integration
Contributor

Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/16221
Related to https://github.com/airbytehq/oncall/issues/12043

Summary

AsyncJobOrchestrator.create_and_get_completed_partitions() wraps every failed partition into a single aggregated AirbyteTracedException with FailureType.system_error and the generic message "One or more async jobs failed after exhausting all retry attempts." That hides the true underlying failure type — typically transient_error (HTTP 429) or config_error (HTTP 403) — so oncall triage sees a platform-style system_error for what is usually source-side throttling or a customer-side permission issue.

This change:

  • Derives the aggregated FailureType from the wrapped _non_breaking_exceptions, using precedence config_error > transient_error > system_error. If any wrapped exception is a config_error, the sync stops being treated as an internal bug and is correctly surfaced as a user-actionable failure. Transient failures similarly propagate so retry behavior at the platform level is no longer suppressed.
  • Replaces the generic user-facing message with a small deterministic set keyed by dominant FailureType. Keeping message deterministic preserves its usefulness as a log aggregation key (per writing-good-error-messages).
  • Moves a failure-type count breakdown (e.g. transient_error=5, system_error=2) plus raw exception reprs into internal_message so operators still get the detail they need without polluting the user-facing text.
  • Adds unit tests for the two new private helpers (_aggregate_failure_type, _count_failure_types) covering precedence, mixed traced/plain exceptions, and default fallback.
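The precedence and counting rules described above can be sketched as follows. This is a minimal illustration, not the code in this PR: FailureType and TracedError are local stand-ins for the CDK's FailureType and AirbyteTracedException, the function names drop the leading underscore, and treating plain (non-traced) exceptions as system_error is an assumption based on the "default fallback" wording.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class FailureType(Enum):
    # Local stand-in for the CDK enum; only the three values named in this PR.
    system_error = "system_error"
    config_error = "config_error"
    transient_error = "transient_error"


class TracedError(Exception):
    # Hypothetical stand-in for AirbyteTracedException.
    def __init__(self, message: str, failure_type: FailureType = FailureType.system_error):
        super().__init__(message)
        self.failure_type = failure_type


# Highest precedence first: config_error > transient_error > system_error.
_PRECEDENCE = (
    FailureType.config_error,
    FailureType.transient_error,
    FailureType.system_error,
)


def failure_type_of(exc: Exception) -> FailureType:
    # Assumption: exceptions without a failure type count as system_error.
    if isinstance(exc, TracedError):
        return exc.failure_type
    return FailureType.system_error


def aggregate_failure_type(exceptions: Iterable[Exception]) -> FailureType:
    # Return the highest-precedence failure type present among the exceptions.
    present = {failure_type_of(exc) for exc in exceptions}
    for candidate in _PRECEDENCE:
        if candidate in present:
            return candidate
    # Nothing wrapped at all: fall back to the old default.
    return FailureType.system_error


def count_failure_types(exceptions: Iterable[Exception]) -> Counter:
    # Tally how many wrapped exceptions fall under each failure type.
    return Counter(failure_type_of(exc) for exc in exceptions)
```

With this shape, a single config_error among many transient failures is enough to classify the whole aggregate as config_error, which is the behavior the first bullet describes.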

No behavioral change when all wrapped exceptions are system_error — existing test test_given_exception_when_start_job_and_skip_this_exception still asserts FailureType.system_error and passes unchanged.

Review & Testing Checklist for Human

  • Confirm the new message text for each FailureType reads well in product (surfaced in Airbyte UI / emails). The three variants are:
    • Async jobs failed because the source API rejected the request as unauthorized or forbidden. (config_error)
    • Async jobs failed after exhausting retries for source API rate limit or transient errors. (transient_error)
    • Async jobs failed after exhausting retry attempts. (system_error — unchanged baseline)
  • Sanity-check that changing the aggregated FailureType from system_error → transient_error (when wrapped exceptions are transient) does not unexpectedly enable platform-level retries for connectors where that behavior is undesirable. Expected: the platform will now retry instead of failing hard on transient causes, which matches the reporter's intent. Callers that relied on the old system_error classification for ops alerting should be double-checked.
  • Watch for downstream connectors (e.g. source-amazon-seller-partner, any other declarative AsyncRetriever user) that may assert on the exact old message string in integration tests.
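For reviewers checking the message text, here is one way the deterministic message lookup and the internal_message breakdown could fit together. It is a sketch under assumptions: the three message strings are the ones listed in this checklist, but the FailureType stand-in, the MESSAGES table, the build_internal_message helper and the exact internal_message layout are hypothetical and may differ from the PR.

```python
from collections import Counter
from enum import Enum
from typing import Sequence


class FailureType(Enum):
    # Local stand-in for the CDK enum, declared in precedence order.
    config_error = "config_error"
    transient_error = "transient_error"
    system_error = "system_error"


# One fixed user-facing message per dominant failure type, so the text
# stays usable as a log aggregation key.
MESSAGES = {
    FailureType.config_error: (
        "Async jobs failed because the source API rejected the request "
        "as unauthorized or forbidden."
    ),
    FailureType.transient_error: (
        "Async jobs failed after exhausting retries for source API "
        "rate limit or transient errors."
    ),
    FailureType.system_error: "Async jobs failed after exhausting retry attempts.",
}


def build_internal_message(counts: Counter, exceptions: Sequence[Exception]) -> str:
    # Variable detail (counts and reprs) goes here, not into the user message.
    # The "Failure types: ... Exceptions: ..." layout is an assumed format.
    breakdown = ", ".join(
        f"{ft.value}={counts[ft]}" for ft in FailureType if counts[ft]
    )
    details = "; ".join(repr(exc) for exc in exceptions)
    return f"Failure types: {breakdown}. Exceptions: {details}"
```

Because only internal_message varies with the run, downstream integration tests that match on the user-facing string need at most three expected values.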

Notes

  • Scope is intentionally narrow: only the final aggregated raise at the bottom of create_and_get_completed_partitions() is changed. The per-partition wrapper inside _process_partitions_with_errors (which also hardcodes system_error) is left alone because at that call site the orchestrator does not have visibility into the underlying job-level cause; improving that path would require a larger refactor of how AsyncJob carries failure context. Filed as a follow-up consideration rather than rolled into this PR.
  • A companion connector-level fix for Amazon SP-API throttling is in flight in airbyte#75966, "fix(source-amazon-seller-partner): Increase max_retries for async report error handlers to handle 429 rate limits". It is orthogonal to this CDK change.
  • Breaking change evaluation: this is not a breaking change per Managing Breaking Changes. No spec, schema, or state format change. FailureType reclassification can alter retry behavior at the platform level, but that is the intended fix: system_error → transient_error enables retries, which is safe, and system_error → config_error stops retries, which is correct when the wrapped cause is genuinely a permission/auth failure.

Link to Devin session: https://app.devin.ai/sessions/bda7c7c9b3ea47f5b51b848569e8f397

Replace hardcoded system_error in AsyncJobOrchestrator's aggregated failure with the highest-precedence FailureType among wrapped non-breaking exceptions (config_error > transient_error > system_error). The user-facing message is chosen per FailureType to stay deterministic; underlying failure-type counts and exception reprs are moved into internal_message.

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1776776408-async-job-failure-type-propagation#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1776776408-async-job-failure-type-propagation

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
@github-actions

PyTest Results (Fast)

4 022 tests  +4   4 011 ✅ +4   7m 41s ⏱️ -1s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit b00c20c. ± Comparison against base commit 1256a1f.

@github-actions

PyTest Results (Full)

4 025 tests  +4   4 013 ✅ +4   11m 11s ⏱️ -22s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit b00c20c. ± Comparison against base commit 1256a1f.
