Resilience to RabbitMQ cluster outage/behavior #465
Replies: 1 comment
Hey, thanks for raising this scenario. I would suggest using the latest 3.4.0 (there were a few improvements to RMQ and we've moved to the latest RMQ client). My expectation would be that the consumers resume. Are you using an outbox plugin or the health check circuit breaker plugin? They might play a role. At the end of the day, turning on INFO/DEBUG logging should help here. I'm curious whether the issue is isolated to SMB or the RMQ client. We're on the latest version as of this time. Let me know how I can help; it would be useful if you shared the logs or a sample. Also happy to take in any PRs for a load/resiliency test that others would benefit from.
We are currently evaluating SMB as the messaging framework for an application. We use a 3-node RabbitMQ cluster with quorum queues.
Basic behavior seems fine for our application, although we haven't yet tested with the 100+ exchanges/queues that we intend to use.
Our main focus right now is on how SMB behaves when the RabbitMQ cluster (with HAProxy in front) has issues, which effectively comes down to two scenarios.
We want recovery preferably within seconds, since we use the messages for operational flow control and delays of 5+ seconds are directly visible to our users.
The partial split-brain scenario seems to behave well, which is surprising since other messaging frameworks seem to give a whole lot of trouble.
However, when a node goes down we experience all sorts of problems. Detection of the disconnect is almost instant, and recovery of the connection (to a different node) happens after a few seconds; this is the default behavior.
However, publishers hang and never seem to recover. Using a timeout on the publish prevents the hang, but after that publishes seem 'flaky'.
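To make the timeout workaround concrete, here is a minimal sketch of the idea in Python with asyncio. It is not the .NET code from our application and not SMB or RabbitMQ client API: `publish_with_timeout`, `PublishTimeout`, and the two simulated publishers are hypothetical names used only for illustration. It shows how bounding the wait unblocks the caller when a publish never completes.

```python
import asyncio


class PublishTimeout(Exception):
    """Raised when a publish does not complete within the allowed time."""


async def publish_with_timeout(publish, message, timeout_s=2.0):
    """Await publish(message), but give up after timeout_s seconds.

    `publish` is a hypothetical async callable standing in for the real
    producer call; it is an assumption, not an SMB or RabbitMQ client API.
    """
    try:
        return await asyncio.wait_for(publish(message), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending publish before raising, so the
        # caller is unblocked instead of hanging on a dead connection.
        raise PublishTimeout(
            f"publish did not complete within {timeout_s}s"
        ) from None


async def _hanging_publish(message):
    # Simulates a publisher stuck after a node outage: never completes.
    await asyncio.Event().wait()


async def _healthy_publish(message):
    # Simulates a broker that confirms immediately.
    return f"confirmed:{message}"


def demo():
    """Run one healthy and one hanging publish; return both outcomes."""
    async def run():
        ok = await publish_with_timeout(_healthy_publish, "order-1", timeout_s=0.5)
        try:
            await publish_with_timeout(_hanging_publish, "order-2", timeout_s=0.1)
            hung = "completed"
        except PublishTimeout:
            hung = "timed out"
        return ok, hung

    return asyncio.run(run())
```

A timeout like this only stops the caller from blocking. It does not repair the underlying channel or connection, which may be why publishes still look flaky afterwards.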
And some consumers don't recover at all: message consumption simply stops. Running multiple instances of the consumers partly solves this (it actually masks the issue), since the next disconnect will likely block the remaining consumers as well.
Are there any suggestions for what we could try to improve the behavior?
Best regards,
Paul