
fix(cdk): reclassify async job failure from config_error to system_error and add user-facing message #961

Merged
Patrick Nilan (pnilan) merged 1 commit into main from
devin/1774353551-fix-job-orchestrator-error-message
Mar 24, 2026

Conversation

@devin-ai-integration devin-ai-integration bot commented Mar 24, 2026

Summary

Reclassifies the FailureType for async job failures in AsyncJobOrchestrator from config_error to system_error and adds user-facing error messages.

Problem: When async jobs fail after exhausting retries (e.g., expired SAS download URLs in Bing Ads causing HTTP 403), the error was classified as config_error. This prevented automatic retries at the platform level and misled oncall into investigating user configuration — when the failure had nothing to do with user config.

Fix (2 locations in job_orchestrator.py):

  • _process_partitions_with_errors: Added message field, changed failure_type to system_error
  • create_and_get_completed_partitions (final re-raise): Changed message from None to a descriptive string, changed failure_type to system_error
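The shape of the change in both locations can be sketched as follows. This is a minimal illustration using simplified stand-ins, not the real airbyte_cdk classes; the actual AirbyteTracedException and FailureType live in the CDK and protocol packages:

```python
from enum import Enum


class FailureType(Enum):
    """Simplified stand-in for the Airbyte protocol FailureType enum."""
    config_error = "config_error"
    system_error = "system_error"


class AirbyteTracedException(Exception):
    """Simplified stand-in for the CDK's AirbyteTracedException."""

    def __init__(self, message=None, internal_message=None, failure_type=None):
        self.message = message  # user-facing text shown in the UI
        self.internal_message = internal_message  # internal/log-facing text
        self.failure_type = failure_type
        super().__init__(internal_message or message)


# Before: no user-facing message, and config_error suppressed platform retries.
before = AirbyteTracedException(
    message=None,
    internal_message="Job could not be completed",
    failure_type=FailureType.config_error,
)

# After: a generic user-facing message plus system_error, which enables
# platform-level retries and points oncall away from user configuration.
after = AirbyteTracedException(
    message="Async job failed after exhausting all retry attempts.",
    internal_message="Job could not be completed",
    failure_type=FailureType.system_error,
)
```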

Related to https://github.com/airbytehq/airbyte-internal-issues/issues/16099
Related to https://github.com/airbytehq/oncall/issues/11749

Review & Testing Checklist for Human

  • Verify _is_breaking_exception interaction is safe. That method (line ~508) treats config_error as breaking. With system_error, these exceptions are no longer "breaking." Confirm that the exceptions created in _process_partitions_with_errors are never evaluated by _is_breaking_exception (they are appended directly to _non_breaking_exceptions, so this should be fine — but please verify).
  • Confirm platform retry behavior change is desired. system_error enables platform-level retries for these failures. Verify this won't cause issues like infinite retry storms for connectors using async jobs (Bing Ads, Facebook Marketing, etc.).
  • Evaluate user-facing message text. The messages "Async job failed after exhausting all retry attempts." and "One or more async jobs failed after exhausting all retry attempts." are generic by design (the orchestrator doesn't know the specific failure cause). Confirm these are acceptable or suggest alternatives.
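The first checklist item can be made concrete with a minimal sketch. This is not the actual CDK code (the real _is_breaking_exception also inspects other conditions); it only illustrates why the reclassification does not change the abort/continue decision in _process_partitions_with_errors:

```python
from enum import Enum


class FailureType(Enum):
    config_error = "config_error"
    system_error = "system_error"


def is_breaking_exception(failure_type: FailureType) -> bool:
    """Simplified stand-in for AsyncJobOrchestrator._is_breaking_exception:
    config_error exceptions abort the sync immediately."""
    return failure_type is FailureType.config_error


non_breaking_exceptions = []

# _process_partitions_with_errors appends its exception directly to
# _non_breaking_exceptions without consulting _is_breaking_exception,
# so the abort/continue decision there is unaffected by this PR.
non_breaking_exceptions.append(FailureType.system_error)

# Anywhere the check *is* consulted, the reclassified exceptions are
# no longer treated as breaking:
print(is_breaking_exception(FailureType.system_error))   # False
print(is_breaking_exception(FailureType.config_error))   # True
```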

Notes

  • This does not fix the deeper root cause (CDK not refreshing expired download URLs in AsyncHttpJobRepository.fetch_records). That would be a separate, larger change.
  • The config_error → system_error reclassification is not a connector-level breaking change (no schema, state, or spec changes). It's a bug fix in error classification.
  • All 16 existing unit tests pass. The only test change is updating one assertion from config_error to system_error.

Link to Devin session: https://app.devin.ai/sessions/67fd62259c8a464f92fbf9bae3b43982

fix(cdk): reclassify async job failure from config_error to system_error and add user-facing message

- Change FailureType from config_error to system_error in _process_partitions_with_errors
  and the final aggregated exception in create_and_get_completed_partitions
- Add user-facing message field to both AirbyteTracedException instances
- Update test assertion to expect system_error instead of config_error

The previous config_error classification incorrectly prevented automatic retries
and misled oncall into thinking these were user configuration issues. These are
job execution failures (e.g., expired download URLs, HTTP 403) that are not
caused by user configuration.

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1774353551-fix-job-orchestrator-error-message#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1774353551-fix-job-orchestrator-error-message

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@github-actions

PyTest Results (Fast)

3 934 tests ±0   3 923 ✅ +1   7m 48s ⏱️ +39s
    1 suites ±0      11 💤 −1
    1 files  ±0       0 ❌ ±0

Results for commit fb799dd. ± Comparison against base commit 0e57414.

@github-actions

PyTest Results (Full)

3 937 tests  ±0   3 925 ✅ ±0   11m 11s ⏱️ -3s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit fb799dd. ± Comparison against base commit 0e57414.

@pnilan
Contributor

This reclassification looks directionally right: expired Bing Ads SAS URLs are not a user config problem, and system_error is a better fit than config_error for platform retry behavior.

I would still call out that this is a mitigation, not the root-cause fix. The root cause is that source-bing-ads report streams cache ReportDownloadUrl at poll-completion time and may not attempt the actual GET until well after the Azure SAS TTL has expired. Retrying the sync at the platform level helps, but the connector/CDK async download path can still reproduce the same failure mode under long-running/high-concurrency workloads.

Recommended root-cause fix on the Bing Ads side:

  • Add a custom download requester for report streams that repolls GenerateReport/Poll immediately before the download step using creation_response["ReportRequestId"], extracts a fresh ReportDownloadUrl, and then downloads from that fresh URL.
  • Wire that requester into the report async retriever rather than relying on the stale URL captured from the earlier polling response.

Recommended streams to use that custom requester on:

  • All built-in report streams derived from report_base_stream / basic_async_retriever (for example campaign_performance_report_*, user_location_performance_report_*, etc.).
  • Custom reports derived from the same report template.
  • I would not scope this to bulk streams unless we have evidence that their URL lifecycle has the same problem path.

Please also add unit tests for the new custom requester when that connector-side fix is implemented. At minimum I would want coverage for:

  • repolling with the original ReportRequestId at download time,
  • using the newly returned URL rather than the stale cached one,
  • successful download after refresh,
  • failure behavior when the repoll response is missing a download URL or returns a non-success status.

So: I support this PR as a correctness/triage improvement, but I would recommend tracking the connector-level fresh-URL requester as the actual root-cause fix for oncall#11749.
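The repoll-before-download idea above can be sketched roughly as follows. This is a hypothetical outline, not connector code: ReportRequestId / ReportDownloadUrl are field names from the Bing Ads Reporting API responses, while repoll and download are placeholders for the actual GenerateReport/Poll call and HTTP GET that the custom requester would wire into the retriever:

```python
class StaleDownloadUrlError(Exception):
    """Raised when a repoll response has no usable download URL
    (one of the failure cases the unit tests above should cover)."""


def extract_fresh_download_url(poll_response: dict) -> str:
    """Pull a fresh ReportDownloadUrl from a just-issued poll response,
    failing loudly when the URL is missing or empty."""
    url = poll_response.get("ReportDownloadUrl")
    if not url:
        raise StaleDownloadUrlError(
            f"Poll response has no download URL: status={poll_response.get('Status')}"
        )
    return url


def download_with_fresh_url(creation_response: dict, repoll, download):
    """Repoll with the original ReportRequestId immediately before the GET,
    then download from the freshly returned (unexpired) SAS URL instead of
    the stale one cached at poll-completion time."""
    report_request_id = creation_response["ReportRequestId"]
    poll_response = repoll(report_request_id)
    return download(extract_fresh_download_url(poll_response))
```

A unit test would then stub repoll to return a fresh URL and assert the download used it rather than the cached one, plus assert StaleDownloadUrlError on a URL-less poll response.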

@devin-ai-integration
Contributor Author

Thanks for the thorough review, Patrick Nilan (@pnilan)! Fully agree this is a mitigation, not the root-cause fix — the PR description and the triage report on the issue both call this out explicitly.

Your recommended connector-side fix (re-polling GenerateReport/Poll for a fresh ReportDownloadUrl immediately before download) is a great approach. That would address the SAS URL expiration under high-concurrency workloads where the current cached-URL path fails.

I'd suggest tracking that connector-level fix as a separate follow-up issue linked to oncall#11749, since it's a distinct change in source-bing-ads rather than the CDK.



@pnilan Patrick Nilan (pnilan) left a comment


Agree

@pnilan Patrick Nilan (pnilan) marked this pull request as ready for review March 24, 2026 22:45
Copilot AI review requested due to automatic review settings March 24, 2026 22:45
@pnilan Patrick Nilan (pnilan) merged commit acafc75 into main Mar 24, 2026
32 of 34 checks passed
@pnilan Patrick Nilan (pnilan) deleted the devin/1774353551-fix-job-orchestrator-error-message branch March 24, 2026 22:45

Copilot AI left a comment


Pull request overview

Reclassifies async job failures in the declarative CDK AsyncJobOrchestrator from config_error to system_error and adds user-facing error messages so platform-level retry behavior and oncall triage align with the actual failure mode (system-side retry exhaustion rather than user misconfiguration).

Changes:

  • Updated _process_partitions_with_errors to emit FailureType.system_error and attach a user-facing message.
  • Updated the final re-raise in create_and_get_completed_partitions to use FailureType.system_error and a descriptive user-facing message.
  • Adjusted the affected unit test assertion to expect system_error.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • airbyte_cdk/sources/declarative/async_job/job_orchestrator.py: Reclassifies async job retry-exhaustion failures to system_error and provides user-facing messages.
  • unit_tests/sources/declarative/async_job/test_job_orchestrator.py: Updates the expectation to match the new system_error classification.


Comment on lines 430 to 434
AirbyteTracedException: An exception indicating that at least one job could not be completed.
Raises:
AirbyteTracedException: If at least one job could not be completed.
"""
status_by_job_id = {job.api_job_id(): job.status() for job in partition.jobs}

Copilot AI Mar 24, 2026


The docstring for _process_partitions_with_errors says it returns/raises AirbyteTracedException, but the method returns None and only appends an exception to _non_breaking_exceptions. Please update the docstring to reflect the actual behavior (no return value, no raise).

