Event Hubs: Handle additional retriable/reconnectable recoverable AMQP connection cases#4213
Event Hubs: Handle additional retriable/reconnectable recoverable AMQP connection cases#4213milas wants to merge 1 commit intoAzure:mainfrom
Conversation
* Connection dropped -> reconnect * Connection forced -> reconnect * Connection framing -> reconnect * Authorization -> reconnection * Internal timeout -> retry
|
Thank you for your contribution @milas! We will review the pull request and get back to you soon. |
This comment was marked as outdated.
This comment was marked as outdated.
|
@microsoft-github-policy-service agree company="Realm.Security, Inc." |
|
I believe I got the CLA properly signed - CI is showing ✅ on The "azsdk/checkenforcer" status still shows as pending but the linked job shows as successful, so I'm not sure if it got wedged and needs a contributor to kick it or if that's part of an internal Microsoft CI process that requires maintainers to do something externally before it officially reports back status. Regardless, if there's anything else needed from me either process or code feedback-wise, just let me know! FYI that the policy bot is going a bit nuts -- it commented 4x yesterday before I was able to respond and then it rejected me because the syntax it tells you to use is incorrect 🙃 |
|
FYI this PR is not sufficient as-is right now. In addition to the missing cases, the errors get wrapped from AMQP Error -> azure_core -> AMQP Error (sometimes more than once!). As a result, unwrapping/following the error I'm concerned there are other correctness issues hidden by the retry, e.g. deadlocking when actually recovering. |
The AMQP recoverable connection is not recovering in a variety of situations.
This makes it really hard to use -- the connection is created globally, passed to the event processor, which then gives it to the various partition workers.
Thus, without it recovering gracefully, in the event of a transient issue, the event processor must be discarded and a new instance created with a new connection, which is highly disruptive to partition balancing.
Added cases:
I added very simple test cases that assert on the mapping of these errors -- if these are extraneous, let me know, and I'll get rid of them.