Skip to content

Event Hubs: Handle additional retriable/reconnectable recoverable AMQP connection cases#4213

Open
milas wants to merge 1 commit intoAzure:mainfrom
milas:milas/fix-amqp-reconnection
Open

Event Hubs: Handle additional retriable/reconnectable recoverable AMQP connection cases#4213
milas wants to merge 1 commit intoAzure:mainfrom
milas:milas/fix-amqp-reconnection

Conversation

@milas
Copy link
Copy Markdown

@milas milas commented Apr 17, 2026

The AMQP recoverable connection is not recovering in a variety of situations.

This makes it really hard to use -- the connection is created globally, passed to the event processor, which then gives it to the various partition workers.

Thus, without it recovering gracefully, in the event of a transient issue, the event processor must be discarded and a new instance created with a new connection, which is highly disruptive to partition balancing.

Added cases:

  • Connection dropped -> reconnect
  • Connection forced -> reconnect
  • Connection framing -> reconnect
  • Authorization -> reconnection
  • Internal timeout -> retry

I added very simple test cases that assert on the mapping of these errors -- if these are extraneous, let me know, and I'll get rid of them.

* Connection dropped -> reconnect
* Connection forced -> reconnect
* Connection framing -> reconnect
* Authorization -> reconnection
* Internal timeout -> retry
@milas milas requested a review from j7nw4r as a code owner April 17, 2026 18:05
@github-actions github-actions Bot added Community Contribution Community members are working on the issue customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs labels Apr 17, 2026
@github-actions
Copy link
Copy Markdown

Thank you for your contribution @milas! We will review the pull request and get back to you soon.

@milas

This comment was marked as outdated.

@milas
Copy link
Copy Markdown
Author

milas commented Apr 18, 2026

@microsoft-github-policy-service agree company="Realm.Security, Inc."

@milas
Copy link
Copy Markdown
Author

milas commented Apr 18, 2026

I believe I got the CLA properly signed - CI is showing ✅ on license/cla now.

The "azsdk/checkenforcer" status still shows as pending but the linked job shows as successful, so I'm not sure if it got wedged and needs a contributor to kick it or if that's part of an internal Microsoft CI process that requires maintainers to do something externally before it officially reports back status.

Regardless, if there's anything else needed from me either process or code feedback-wise, just let me know!


FYI that the policy bot is going a bit nuts -- it commented 4x yesterday before I was able to respond and then it rejected me because the syntax it tells you to use is incorrect 🙃

@milas
Copy link
Copy Markdown
Author

milas commented Apr 28, 2026

FYI this PR is not sufficient as-is right now. In addition to the missing cases, the errors get wrapped from AMQP Error -> azure_core -> AMQP Error (sometimes more than once!).

As a result, unwrapping/following the error source chain is necessary. I have a fix for that as well but am waiting for input/feedback from maintainers before I move forward with any extra commits / PRs.

I'm concerned there are other correctness issues hidden by the retry, e.g. deadlocking when actually recovering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Community Contribution Community members are working on the issue customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant