
fix(cdk): reclassify async job failure from config_error to system_error and add user-facing message #961

Merged
Patrick Nilan (pnilan) merged 1 commit into main from
devin/1774353551-fix-job-orchestrator-error-message
Mar 24, 2026

Conversation

@devin-ai-integration devin-ai-integration bot commented Mar 24, 2026

Summary

Reclassifies the FailureType for async job failures in AsyncJobOrchestrator from config_error to system_error and adds user-facing error messages.

Problem: When async jobs fail after exhausting retries (e.g., expired SAS download URLs in Bing Ads causing HTTP 403), the error was classified as config_error. This prevented automatic retries at the platform level and misled oncall into investigating user configuration — when the failure had nothing to do with user config.

Fix (2 locations in job_orchestrator.py):

  • _process_partitions_with_errors: Added message field, changed failure_type to system_error
  • create_and_get_completed_partitions (final re-raise): Changed message from None to a descriptive string, changed failure_type to system_error
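The shape of the change in both locations can be sketched as follows. This is a minimal illustration using simplified stand-ins, not the real airbyte_cdk classes; the actual AirbyteTracedException and FailureType live in the CDK and protocol packages:

```python
from enum import Enum


class FailureType(Enum):
    """Simplified stand-in for the Airbyte protocol FailureType enum."""
    config_error = "config_error"
    system_error = "system_error"


class AirbyteTracedException(Exception):
    """Simplified stand-in for the CDK's AirbyteTracedException."""

    def __init__(self, message=None, internal_message=None, failure_type=None):
        self.message = message  # user-facing text shown in the UI
        self.internal_message = internal_message  # internal/log-facing text
        self.failure_type = failure_type
        super().__init__(internal_message or message)


# Before: no user-facing message, and config_error suppressed platform retries.
before = AirbyteTracedException(
    message=None,
    internal_message="Job could not be completed",
    failure_type=FailureType.config_error,
)

# After: a generic user-facing message plus system_error, which enables
# platform-level retries and points oncall away from user configuration.
after = AirbyteTracedException(
    message="Async job failed after exhausting all retry attempts.",
    internal_message="Job could not be completed",
    failure_type=FailureType.system_error,
)
```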

Related to https://github.com/airbytehq/airbyte-internal-issues/issues/16099
Related to https://github.com/airbytehq/oncall/issues/11749

Review & Testing Checklist for Human

  • Verify _is_breaking_exception interaction is safe. That method (line ~508) treats config_error as breaking. With system_error, these exceptions are no longer "breaking." Confirm that the exceptions created in _process_partitions_with_errors are never evaluated by _is_breaking_exception (they are appended directly to _non_breaking_exceptions, so this should be fine — but please verify).
  • Confirm platform retry behavior change is desired. system_error enables platform-level retries for these failures. Verify this won't cause issues like infinite retry storms for connectors using async jobs (Bing Ads, Facebook Marketing, etc.).
  • Evaluate user-facing message text. The messages "Async job failed after exhausting all retry attempts." and "One or more async jobs failed after exhausting all retry attempts." are generic by design (the orchestrator doesn't know the specific failure cause). Confirm these are acceptable or suggest alternatives.
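The first checklist item can be made concrete with a minimal sketch. This is not the actual CDK code (the real _is_breaking_exception also inspects other conditions); it only illustrates why the reclassification does not change the abort/continue decision in _process_partitions_with_errors:

```python
from enum import Enum


class FailureType(Enum):
    config_error = "config_error"
    system_error = "system_error"


def is_breaking_exception(failure_type: FailureType) -> bool:
    """Simplified stand-in for AsyncJobOrchestrator._is_breaking_exception:
    config_error exceptions abort the sync immediately."""
    return failure_type is FailureType.config_error


non_breaking_exceptions = []

# _process_partitions_with_errors appends its exception directly to
# _non_breaking_exceptions without consulting _is_breaking_exception,
# so the abort/continue decision there is unaffected by this PR.
non_breaking_exceptions.append(FailureType.system_error)

# Anywhere the check *is* consulted, the reclassified exceptions are
# no longer treated as breaking:
print(is_breaking_exception(FailureType.system_error))   # False
print(is_breaking_exception(FailureType.config_error))   # True
```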

Notes

  • This does not fix the deeper root cause (CDK not refreshing expired download URLs in AsyncHttpJobRepository.fetch_records). That would be a separate, larger change.
  • The config_error → system_error reclassification is not a connector-level breaking change (no schema, state, or spec changes). It's a bug fix in error classification.
  • All 16 existing unit tests pass. The only test change is updating one assertion from config_error to system_error.

Link to Devin session: https://app.devin.ai/sessions/67fd62259c8a464f92fbf9bae3b43982

fix(cdk): reclassify async job failure from config_error to system_error and add user-facing message

- Change FailureType from config_error to system_error in _process_partitions_with_errors
  and the final aggregated exception in create_and_get_completed_partitions
- Add user-facing message field to both AirbyteTracedException instances
- Update test assertion to expect system_error instead of config_error

The previous config_error classification incorrectly prevented automatic retries
and misled oncall into thinking these were user configuration issues. These are
job execution failures (e.g., expired download URLs, HTTP 403) that are not
caused by user configuration.

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1774353551-fix-job-orchestrator-error-message#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1774353551-fix-job-orchestrator-error-message

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@github-actions

PyTest Results (Fast)

3 934 tests ±0   3 923 ✅ +1   7m 48s ⏱️ +39s
    1 suites ±0      11 💤 −1
    1 files  ±0       0 ❌ ±0

Results for commit fb799dd. ± Comparison against base commit 0e57414.

@github-actions

PyTest Results (Full)

3 937 tests  ±0   3 925 ✅ ±0   11m 11s ⏱️ -3s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit fb799dd. ± Comparison against base commit 0e57414.

@pnilan
Contributor

This reclassification looks directionally right: expired Bing Ads SAS URLs are not a user config problem, and system_error is a better fit than config_error for platform retry behavior.

I would still call out that this is a mitigation, not the root-cause fix. The root cause is that source-bing-ads report streams cache ReportDownloadUrl at poll-completion time and may not attempt the actual GET until well after the Azure SAS TTL has expired. Retrying the sync at the platform level helps, but the connector/CDK async download path can still reproduce the same failure mode under long-running/high-concurrency workloads.

Recommended root-cause fix on the Bing Ads side:

  • Add a custom download requester for report streams that repolls GenerateReport/Poll immediately before the download step using creation_response["ReportRequestId"], extracts a fresh ReportDownloadUrl, and then downloads from that fresh URL.
  • Wire that requester into the report async retriever rather than relying on the stale URL captured from the earlier polling response.

Recommended streams to use that custom requester on:

  • All built-in report streams derived from report_base_stream / basic_async_retriever (for example campaign_performance_report_*, user_location_performance_report_*, etc.).
  • Custom reports derived from the same report template.
  • I would not scope this to bulk streams unless we have evidence that their URL lifecycle has the same problem path.

Please also add unit tests for the new custom requester when that connector-side fix is implemented. At minimum I would want coverage for:

  • repolling with the original ReportRequestId at download time,
  • using the newly returned URL rather than the stale cached one,
  • successful download after refresh,
  • failure behavior when the repoll response is missing a download URL or returns a non-success status.

So: I support this PR as a correctness/triage improvement, but I would recommend tracking the connector-level fresh-URL requester as the actual root-cause fix for oncall#11749.
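The repoll-before-download idea above can be sketched roughly as follows. This is a hypothetical outline, not connector code: ReportRequestId / ReportDownloadUrl are field names from the Bing Ads Reporting API responses, while repoll and download are placeholders for the actual GenerateReport/Poll call and HTTP GET that the custom requester would wire into the retriever:

```python
class StaleDownloadUrlError(Exception):
    """Raised when a repoll response has no usable download URL
    (one of the failure cases the unit tests above should cover)."""


def extract_fresh_download_url(poll_response: dict) -> str:
    """Pull a fresh ReportDownloadUrl from a just-issued poll response,
    failing loudly when the URL is missing or empty."""
    url = poll_response.get("ReportDownloadUrl")
    if not url:
        raise StaleDownloadUrlError(
            f"Poll response has no download URL: status={poll_response.get('Status')}"
        )
    return url


def download_with_fresh_url(creation_response: dict, repoll, download):
    """Repoll with the original ReportRequestId immediately before the GET,
    then download from the freshly returned (unexpired) SAS URL instead of
    the stale one cached at poll-completion time."""
    report_request_id = creation_response["ReportRequestId"]
    poll_response = repoll(report_request_id)
    return download(extract_fresh_download_url(poll_response))
```

A unit test would then stub repoll to return a fresh URL and assert the download used it rather than the cached one, plus assert StaleDownloadUrlError on a URL-less poll response.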

@devin-ai-integration
Contributor Author

Thanks for the thorough review, Patrick Nilan (@pnilan)! Fully agree this is a mitigation, not the root-cause fix — the PR description and the triage report on the issue both call this out explicitly.

Your recommended connector-side fix (re-polling GenerateReport/Poll for a fresh ReportDownloadUrl immediately before download) is a great approach. That would address the SAS URL expiration under high-concurrency workloads where the current cached-URL path fails.

I'd suggest tracking that connector-level fix as a separate follow-up issue linked to oncall#11749, since it's a distinct change in source-bing-ads rather than the CDK.



@pnilan Patrick Nilan (pnilan) left a comment


Agree

@pnilan Patrick Nilan (pnilan) marked this pull request as ready for review March 24, 2026 22:45
Copilot AI review requested due to automatic review settings March 24, 2026 22:45
@pnilan Patrick Nilan (pnilan) merged commit acafc75 into main Mar 24, 2026
32 of 34 checks passed
@pnilan Patrick Nilan (pnilan) deleted the devin/1774353551-fix-job-orchestrator-error-message branch March 24, 2026 22:45

Copilot AI left a comment


Pull request overview

Reclassifies async job failures in the declarative CDK AsyncJobOrchestrator from config_error to system_error and adds user-facing error messages so platform-level retry behavior and oncall triage align with the actual failure mode (system-side retry exhaustion rather than user misconfiguration).

Changes:

  • Updated _process_partitions_with_errors to emit FailureType.system_error and attach a user-facing message.
  • Updated the final re-raise in create_and_get_completed_partitions to use FailureType.system_error and a descriptive user-facing message.
  • Adjusted the affected unit test assertion to expect system_error.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • airbyte_cdk/sources/declarative/async_job/job_orchestrator.py: Reclassifies async job retry-exhaustion failures to system_error and provides user-facing messages.
  • unit_tests/sources/declarative/async_job/test_job_orchestrator.py: Updates the expectation to match the new system_error classification.


Comment on lines 430 to 434
AirbyteTracedException: An exception indicating that at least one job could not be completed.
Raises:
AirbyteTracedException: If at least one job could not be completed.
"""
status_by_job_id = {job.api_job_id(): job.status() for job in partition.jobs}

Copilot AI Mar 24, 2026


The docstring for _process_partitions_with_errors says it returns/raises AirbyteTracedException, but the method returns None and only appends an exception to _non_breaking_exceptions. Please update the docstring to reflect the actual behavior (no return value, no raise).

