fix(cdk): reclassify async job failure from config_error to system_error and add user-facing message#961
Conversation
…ror and add user-facing message - Change FailureType from config_error to system_error in _process_partitions_with_errors and the final aggregated exception in create_and_get_completed_partitions - Add user-facing message field to both AirbyteTracedException instances - Update test assertion to expect system_error instead of config_error The previous config_error classification incorrectly prevented automatic retries and misled oncall into thinking these were user configuration issues. These are job execution failures (e.g., expired download URLs, HTTP 403) that are not caused by user configuration. Co-Authored-By: bot_apk <apk@cognition.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1774353551-fix-job-orchestrator-error-message#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1774353551-fix-job-orchestrator-error-messagePR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
|
This reclassification looks directionally right: expired Bing Ads SAS URLs are not a user config problem, and I would still call out that this is a mitigation, not the root-cause fix. The root cause is that Recommended root-cause fix on the Bing Ads side:
Recommended streams to use that custom requester on:
Please also add unit tests for the new custom requester when that connector-side fix is implemented. At minimum I would want coverage for:
So: I support this PR as a correctness/triage improvement, but I would recommend tracking the connector-level fresh-URL requester as the actual root-cause fix for oncall#11749. |
|
Thanks for the thorough review, Patrick Nilan (@pnilan)! Fully agree this is a mitigation, not the root-cause fix — the PR description and the triage report on the issue both call this out explicitly. Your recommended connector-side fix (re-polling I'd suggest tracking that connector-level fix as a separate follow-up issue linked to oncall#11749, since it's a distinct change in |
There was a problem hiding this comment.
Pull request overview
Reclassifies async job failures in the declarative CDK AsyncJobOrchestrator from config_error to system_error and adds user-facing error messages so platform-level retry behavior and oncall triage align with the actual failure mode (system-side retry exhaustion rather than user misconfiguration).
Changes:
- Updated
_process_partitions_with_errorsto emitFailureType.system_errorand attach a user-facing message. - Updated the final re-raise in
create_and_get_completed_partitionsto useFailureType.system_errorand a descriptive user-facing message. - Adjusted the affected unit test assertion to expect
system_error.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
airbyte_cdk/sources/declarative/async_job/job_orchestrator.py |
Reclassifies async job retry-exhaustion failures to system_error and provides user-facing messages. |
unit_tests/sources/declarative/async_job/test_job_orchestrator.py |
Updates expectation to match new system_error classification. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| AirbyteTracedException: An exception indicating that at least one job could not be completed. | ||
| Raises: | ||
| AirbyteTracedException: If at least one job could not be completed. | ||
| """ | ||
| status_by_job_id = {job.api_job_id(): job.status() for job in partition.jobs} |
There was a problem hiding this comment.
The docstring for _process_partitions_with_errors says it returns/raises AirbyteTracedException, but the method returns None and only appends an exception to _non_breaking_exceptions. Please update the docstring to reflect the actual behavior (no return value, no raise).
Summary
Reclassifies the
FailureTypefor async job failures inAsyncJobOrchestratorfromconfig_errortosystem_errorand adds user-facing error messages.Problem: When async jobs fail after exhausting retries (e.g., expired SAS download URLs in Bing Ads causing HTTP 403), the error was classified as
config_error. This prevented automatic retries at the platform level and misled oncall into investigating user configuration — when the failure had nothing to do with user config.Fix (2 locations in
job_orchestrator.py):_process_partitions_with_errors: Addedmessagefield, changedfailure_typetosystem_errorcreate_and_get_completed_partitions(final re-raise): ChangedmessagefromNoneto a descriptive string, changedfailure_typetosystem_errorRelated https://github.com/airbytehq/airbyte-internal-issues/issues/16099
Related to https://github.com/airbytehq/oncall/issues/11749
Review & Testing Checklist for Human
_is_breaking_exceptioninteraction is safe. That method (line ~508) treatsconfig_erroras breaking. Withsystem_error, these exceptions are no longer "breaking." Confirm that the exceptions created in_process_partitions_with_errorsare never evaluated by_is_breaking_exception(they are appended directly to_non_breaking_exceptions, so this should be fine — but please verify).system_errorenables platform-level retries for these failures. Verify this won't cause issues like infinite retry storms for connectors using async jobs (Bing Ads, Facebook Marketing, etc.)."Async job failed after exhausting all retry attempts."and"One or more async jobs failed after exhausting all retry attempts."are generic by design (the orchestrator doesn't know the specific failure cause). Confirm these are acceptable or suggest alternatives.Notes
AsyncHttpJobRepository.fetch_records). That would be a separate, larger change.config_error→system_errorreclassification is not a connector-level breaking change (no schema, state, or spec changes). It's a bug fix in error classification.config_errortosystem_error.Link to Devin session: https://app.devin.ai/sessions/67fd62259c8a464f92fbf9bae3b43982