fix(source-github): prevent blocking sleep during rate limit handling#74758
Conversation
Replace single long blocking time.sleep() with chunked 60-second intervals during rate limit wait periods. This prevents the platform heartbeat from timing out when the connector waits for GitHub API rate limits to reset. Also fix FailureType from config_error to transient_error for rate limit errors, since rate limits are temporary conditions, not configuration issues. Resolves airbytehq/oncall#11614 Co-Authored-By: bot_apk <apk@cognition.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksPR Slash CommandsAirbyte Maintainers (that's you!) can execute the following slash commands on your PR:
📚 Show Repo GuidanceHelpful Resources
|
Co-Authored-By: bot_apk <apk@cognition.ai>
|
Deploy preview for airbyte-docs ready!
Deployed with vercel-action |
|
|
↪️ Triggering Reason: Draft PR with CI passing. Ready for regression validation and live testing. |
|
Fix Validation EvidenceOutcome: Fix/Feature Proven Successfully (regression tests) | Live testing pending approval Evidence SummaryRegression tests (SPEC, CHECK, DISCOVER, READ) all passed with no regressions vs baseline v2.1.14. The fix replaces a single blocking Live connection testing against the customer connection from the oncall issue is pending Slack approval for version pinning. Next Steps
Connector & PR DetailsConnector: Evidence PlanProving Criteria
Disproving Criteria
Cases Attempted
Pre-flight Checks
Detailed Evidence Log2026-03-13 11:53 UTC — Initial status comment posted, pre-flight checks started Note: Connection IDs and detailed logs are recorded in the linked private issue. Devin Session: https://app.devin.ai/sessions/515bfef5c6ec4515a97d359c702b0234 |
|
|
↪️ Triggering Reason: Prove-fix passed (regression tests all green). PR marked as ready-for-review. Ready for final AI review gate. |
Reviewing PR for connector safety and quality.
|
|
AI PR Review in progress. Gathering evidence and evaluating gates now. Session: https://app.devin.ai/sessions/cc6ab7c7ff3b44138ac6060d993aadce |
AI PR Review ReportReview Action: NO ACTION (NOT ELIGIBLE)
📋 PR Details & EligibilityConnector & PR InfoConnector(s): Auto-Approve EligibilityEligible: No Review Action DetailsNO ACTION (NOT ELIGIBLE) — All enforced gates pass, but the Behavioral Changes anti-pattern gate is flagged. This PR changes runtime behavior (sleep chunking, error type reclassification) which requires human sign-off. No PR review submitted.
🔍 Gate Evaluation DetailsGate-by-Gate Analysis
Behavioral Changes DetailChange 1 — Chunked sleep (
Change 2 — Error classification (
Breaking Change EvaluationEvaluated against the breaking change checklist:
Conclusion: NOT a breaking change. PATCH version bump (2.1.14 → 2.1.15) is correct. Progressive rollout is disabled ( 📚 Evidence ConsultedEvidence
❓ How to RespondBehavioral Changes — Human Sign-Off RequiredThe Behavioral Changes gate is flagged because this PR changes:
These are intentional bug fixes. A human reviewer should verify:
Minor HousekeepingThe changelog entry at Providing Context or JustificationYou can add explanations that the bot will see on the next review: Option 1: PR Description (recommended) ## AI PR Review Justification
### Behavioral Changes
[Your explanation here]Option 2: PR Comment After adding your response, re-run Note: Justifications provide context for the bot to evaluate. For the Behavioral Changes gate, justifications help explain the situation but still require human sign-off. |
|
↪️ Triggering Reason: Draft PR with prove-fix passed (regression tests all green). Ready for AI review before merge consideration. |
Reviewing PR for connector safety and quality.
|
|
AI PR Review in progress. Gathering evidence and evaluating gates now. Session: https://app.devin.ai/sessions/9900985c80f4426ca43c5a2a6ab01ab0 |
AI PR Review ReportReview Action: NO ACTION (NOT ELIGIBLE)
📋 PR Details & EligibilityConnector & PR InfoConnector(s): Auto-Approve EligibilityEligible: No Review Action DetailsNO ACTION (NOT ELIGIBLE) -- All enforced gates pass, but the Behavioral Changes anti-pattern gate is flagged. This PR changes runtime behavior (sleep chunking, error type reclassification) which requires human sign-off. No PR review submitted.
🔍 Gate Evaluation DetailsGate-by-Gate Analysis
Behavioral Changes DetailChange 1 -- Chunked sleep (
Change 2 -- Error classification (
Breaking Change EvaluationEvaluated against the breaking change checklist:
Conclusion: NOT a breaking change. PATCH version bump (2.1.14 -> 2.1.15) is correct. 📚 Evidence ConsultedEvidence
❓ How to RespondBehavioral Changes -- Human Sign-Off RequiredThe Behavioral Changes gate is flagged because this PR changes:
These are intentional bug fixes. A human reviewer should verify:
Minor HousekeepingThe changelog entry at Providing Context or JustificationYou can add explanations that the bot will see on the next review: Option 1: PR Description (recommended) ## AI PR Review Justification
### Behavioral Changes
[Your explanation here]Option 2: PR Comment After adding your response, re-run Note: Justifications provide context for the bot to evaluate. For the Behavioral Changes gate, justifications help explain the situation but still require human sign-off. |
…before rate limit exhaustion Instead of draining all tokens to zero and then blocking with a long sleep, inject small proportional delays once every token's remaining quota drops below a configurable reserve (default: 50 calls or 10% of limit). This spreads remaining calls over the reset window and reduces the chance of hitting the wall entirely. Also reclassifies rate-limit exhaustion as transient_error (not config_error) since it is a temporary condition.
|
🙋 Escalating per Hands-Free AI Triage Project triage. Reason: PR has prove-fix passed and AI review completed twice, but the |
|
/format-fix
|
|
/publish-connectors-prerelease
|
… to 2.1.19 Co-Authored-By: gl_serhii.lazebnyi <serglazebny@gmail.com>
…ary keys of bidding strategy streams Remove bidding_strategy.id from the primary keys of campaign_bidding_strategy and ad_group_bidding_strategy streams. This field is nullable in the Google Ads API schema, and including it as a primary key causes sync failures for destinations that enforce non-null PK constraints (e.g., Iceberg). BREAKING CHANGE: Primary keys changed for campaign_bidding_strategy and ad_group_bidding_strategy streams. Users syncing these streams must refresh the source schema and reset the affected streams after upgrading. Co-Authored-By: bot_apk <apk@cognition.ai>
Co-Authored-By: bot_apk <apk@cognition.ai>
Co-Authored-By: bot_apk <apk@cognition.ai>
|
/publish-connectors-prerelease
|
… to 2.1.21 Co-Authored-By: Daryna Ishchenko <darina.ishchenko17@gmail.com>
|
/ai-canary-prerelease
|
|
❌ Cannot revive Devin session - the session is too old. Please start a new session instead. |
Canary Prerelease: StartingStarting canary testing for this prerelease. I'll:
Sensitive customer details (workspace IDs, connection IDs) will be kept in the linked private issue; this PR will only show anonymized references (e.g. Devin session: https://app.devin.ai/sessions/219af73c46994d4e8c89e4d8751bc515 |
|
Canary Prerelease: Pins AppliedConnector: 9 connections pinned for canary testing. Coverage spans 4 US + 2 US-Central + 4 EU dataplanes and BigQuery, S3, S3 Data Lake, and ab-analytics destinations.
Monitoring period: 2–4 hours minimum. I'll post periodic status updates as sync results come in. Full customer details and connection IDs are in the linked private oncall issue. |
Canary Monitoring Update — 1h check-inMonitoring window: approximately 30 minutes since pins applied. 3 of 9 connections have completed a sync on the prerelease; remaining 6 are idle on their natural schedules.
Overall status: HEALTHY — 3/3 post-pin syncs succeeded, no new error patterns observed. Continuing to monitor. Next check-in in approximately 2h. |
Canary Monitoring Update — approximately 3h check-inMonitoring window: approximately 3 hours since pins applied. 10 syncs observed on the prerelease across 5 of 9 actors; all completed syncs have succeeded. No new error patterns observed, no rate-limit-related failures, no log-timeout heartbeat issues.
Overall status: HEALTHY (9 succeeded / 0 failed / 1 running). Continuing to monitor idle connections on their natural schedules. Final report after approximately 4h total window. |
Canary Prerelease: Final ReportConnector: SummaryThe prerelease performed cleanly across all exercised canary connections. 13 sync jobs completed post-pin across 6 distinct actors with a 100% success rate, spanning BigQuery / S3 / Snowflake-adjacent destinations in US and EU dataplanes. No rate-limit related failures, no heartbeat/log-timeout anomalies, no new error patterns vs. the GA 2.1.20 baseline. 3 connections remained idle on their natural schedules during the window but are pinned and healthy. Detailed Results
Canary VerdictOverall Status: PASS The chunked-sleep-with-heartbeat fix for rate-limit handling behaves as intended on live traffic. Recommend proceeding to formal release. Next steps:
For full customer details and canary connection mapping, see the linked private oncall issue. |
…mit-sleep Co-Authored-By: Daryna Ishchenko <darina.ishchenko17@gmail.com>
d8a3f69
into
master
What
Resolves https://github.com/airbytehq/oncall/issues/11614:
When all GitHub API tokens hit rate limits, the connector sleeps in a single blocking
time.sleep()call for the entire wait duration. This blocks all output, causing the platform's heartbeat mechanism to consider the connector dead and terminate the sync.Additionally, the old default
max_waiting_timeof 10 minutes was well below GitHub's 60-minute rate limit reset window. This meant the chunked sleep path was almost never reachable — the connector would immediately raise an exception instead of waiting for the reset.Rate limit exhaustion was also classified as
FailureType.config_error, which is incorrect — rate limits are a transient condition, not a configuration problem.How
Chunked sleep (
utils.py): Replaced the singletime.sleep(min_time_to_wait)call with_sleep_with_heartbeat(), which sleeps in 60-second intervals and emits a log line between each interval. This keeps the platform heartbeat alive during long rate-limit waits.API budget throttle (
utils.py): Added a proactive throttling mechanism that injects small delays when all tokens drop below a reserve threshold (10% of quota or 50 calls minimum). This spreads remaining calls over the reset window to reduce the chance of full exhaustion.Increased default
max_waiting_time(spec.json,source.py,utils.py): Bumped the default from 10 minutes → 120 minutes and the spec maximum from 60 → 240. GitHub rate limits reset every 60 minutes, so the old 10-minute default meant the connector would always fail immediately rather than wait for a reset. The new 120-minute default ensures the connector can survive a full rate-limit cycle.Error classification (
streams.py): ChangedFailureType.config_error→FailureType.transient_errorforGitHubAPILimitException. Also cleaned up the user-facing error message (removed embedded URL and remediation language).Test updates: Adjusted assertions to match the new chunked-sleep behavior, updated expected
FailureType, and added tests for the budget throttle mechanism.Review guide
source_github/utils.py— core changes:_sleep_with_heartbeat(),_apply_budget_throttle(),_get_budget_reserve(), and the_max_timedefault increasesource_github/spec.json— default/maximum/description changes formax_waiting_timesource_github/source.py— default fallback updated from 10 → 120source_github/streams.py— error classification and message changesunit_tests/test_multiple_token_authenticator.py— updated test assertions and new budget throttle testsmetadata.yaml/pyproject.toml/docs/integrations/sources/github.md— version bump to 2.1.20, changelog entryKey things to verify in review:
max_waiting_timeincrease (10 min → 120 min): Existing connections that relied on the old 10-minute default will now wait up to 2 hours when rate-limited instead of failing fast. Is this the desired behavior for all users, or should the default be lower (e.g. 60 min)?_sleep_with_heartbeatis sufficient to satisfy the platform heartbeat (vs. needing to emit records). The v2.1.14 fix forpull_request_statsused a similar pattern.FailureType.transient_error— does the platform retry differently for transient vs config errors? This means the platform may auto-retry on rate limit exhaustion._budget_loggedflag only resets oncheck_all_tokens()(after exhaustion sleep completes). During normal throttled operation, the log message appears only once per connector lifetime.getattr/setattrfor dynamic attribute access (count_restvscount_graphql) — this is inherited from the existing pattern inprocess_token.backoff_strategies.pywhere the CDK does a single blockingtime.sleep()on 403 responses when backoff < 10 minutes. That path is a separate issue.Updates since last revision
Second merge conflict resolution with master. Master landed #76090 (
source-githubv2.1.19: fix 403 permission errors misclassified as retryable). Version bumped to 2.1.20. Also reverted 3 unrelatedsource-google-adscommits that were accidentally pushed to this branch by another session — the PR diff now only containssource-githubchanges.User Impact
Syncs with many configured repositories (or other configurations that exhaust all token rate limits) should no longer time out during rate-limit waits. The connector will log progress periodically and resume once limits reset, rather than silently blocking and getting killed by the platform.
Behavior change: The default
max_waiting_timeincreases from 10 to 120 minutes. Existing connections using the default will now wait longer before failing on rate limits. This is intentional — the old default was too low to ever allow waiting through a GitHub rate limit reset cycle.Rate limit failures that do exceed
max_waiting_timewill now surface as transient errors instead of config errors, which more accurately reflects the nature of the problem.Can this PR be safely reverted and rolled back?
No state, config, or schema changes. The fix is purely behavioral (sleep chunking, error classification, and default tuning). Reverting restores the previous defaults and behavior.
Link to Devin session: https://app.devin.ai/sessions/d857c90514c549fda9c170dbac70633e
Requested by: Serhii Lazebnyi (@lazebnyi)