fix(deps): bump node-rdkafka to ^3.6.0 to fix cooperative-sticky rebalance bug #2728
Conversation
Hello delthas, my role is to assist you with the merge of this pull request.

Status report is not available.

Waiting for approval
The following approvals are needed before I can proceed with the merge:
Codecov Report

Additional details and impacted files:
... and 4 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.4 #2728 +/- ##
===================================================
+ Coverage 74.50% 74.68% +0.17%
===================================================
Files 200 200
Lines 13610 13615 +5
===================================================
+ Hits 10140 10168 +28
+ Misses 3460 3437 -23
Partials 10 10
Flags with carried forward coverage won't be shown.
LGTM — the dependency bump is clean with no code changes required. The node-rdkafka 3.x release has no API breaking changes (only dropped EOL Node.js support), and the librdkafka upgrade path from 2.3.0 to 2.12.0 has no consumer-breaking changes. The root cause analysis in the PR body is thorough.
Force-pushed from f528083 to 1735ada
Excellent investigation and write-up. The dependency bump itself is clean — only …
LGTM
On hold because it is not known yet whether this PR actually helps with the CI.
Jira issue not found
The Jira issue BB-760 was not found.
Upgrades node-rdkafka to ^3.6.0 (resolving to 3.6.1 / librdkafka 2.12.0) to fix a cooperative-sticky partition assignor bug that causes partitions to become orphaned during consumer group rebalances. Issue: BB-760
…scribe
The bootstrap consumer uses setInterval(200ms) to call consume(1, cb), creating multiple concurrent C++ async workers (each with a 1000 ms timeout). Since librdkafka 2.10.0, these workers survive an unsubscribe → subscribe transition and can dequeue messages from the next subscription, causing them to be lost to the normal consume pipeline. Track in-flight workers with a counter and defer unsubscribe() until all have completed.

Issue: BB-760
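The counter-based approach described in this commit might look roughly like the sketch below. `makeBootstrap`, `consumer`, and `isBootstrapMessage` are illustrative stand-ins for the real node-rdkafka consumer wiring, not Backbeat's actual code:

```javascript
// Hypothetical sketch of the in-flight-worker counting fix (not the real
// Backbeat code). `consumer` stands in for a node-rdkafka KafkaConsumer.
function makeBootstrap(consumer, isBootstrapMessage, done) {
    let inFlight = 0;   // consume(1) calls whose callback has not fired yet
    let matched = false;

    const interval = setInterval(() => {
        inFlight += 1;
        consumer.consume(1, (err, messages) => {
            inFlight -= 1;
            if (!matched && !err && messages.some(isBootstrapMessage)) {
                matched = true;
                clearInterval(interval); // stop dispatching new workers
            }
            // Defer unsubscribe() until every dispatched worker has
            // completed: a worker still in the C++ thread pool could
            // otherwise dequeue messages from the next subscription
            // (librdkafka >= 2.10.0).
            if (matched && inFlight === 0) {
                consumer.unsubscribe();
                done();
            }
        });
    }, 200);
}
```

The key point is that `unsubscribe()` only runs once the counter drains to zero, not immediately after the match is found.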
Force-pushed from aa79ee9 to 11fe6ee
LGTM
Waiting for approval
The following approvals are needed before I can proceed with the merge:
Pin 'range,roundrobin' (the current librdkafka default) explicitly so that future version updates cannot silently change the rebalance strategy. Our consumers rely on eager rebalancing. Issue: BB-760
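Sketched as a config fragment, the pin might look as follows; the broker and group values are placeholders, not Backbeat's actual settings:

```javascript
// Illustrative consumer config fragment; everything except the
// partition.assignment.strategy line is a placeholder.
const consumerConfig = {
    'metadata.broker.list': 'localhost:9092', // placeholder
    'group.id': 'example-group',              // placeholder
    // Pin the eager assignors explicitly. 'range,roundrobin' is the current
    // librdkafka default, but pinning it keeps a future librdkafka upgrade
    // from silently switching the group to cooperative-sticky rebalancing.
    'partition.assignment.strategy': 'range,roundrobin',
};
```

An object like this would then be passed as the first argument to node-rdkafka's `new Kafka.KafkaConsumer(globalConfig, topicConfig)`.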
LGTM
…trap
Address review feedback: instead of tracking in-flight consume workers and polling for drain, eliminate concurrent workers entirely by using chained setTimeout. Each consume(1) call is only scheduled after the previous one completes, guaranteeing at most one C++ async worker is in flight. This makes unsubscribe() safe to call immediately from the callback.

Issue: BB-760
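A minimal sketch of that chained pattern, with illustrative names (`bootstrapLoop` and friends are not the actual Backbeat identifiers):

```javascript
// Hypothetical sketch of the chained-setTimeout bootstrap loop.
// `consumer` stands in for a node-rdkafka KafkaConsumer.
function bootstrapLoop(consumer, isBootstrapMessage, done) {
    const consumeOnce = () => {
        consumer.consume(1, (err, messages) => {
            if (!err && messages.some(isBootstrapMessage)) {
                // At most one consume(1) worker is ever in flight, so no
                // stale worker can outlive this unsubscribe and steal
                // messages from the next subscription.
                consumer.unsubscribe();
                return done();
            }
            // Re-arm only after the previous consume has fully completed.
            setTimeout(consumeOnce, 200);
        });
    };
    consumeOnce();
}
```

Unlike the `setInterval` version, the next `consume(1)` is scheduled from inside the previous call's callback, so concurrency is structurally impossible rather than merely counted.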
LGTM
- Rename consumeNext to consume
- Remove bootstrapDone flag (unnecessary with chained setTimeout since only one consume worker is ever in flight)
- Use find() instead of forEach + matched flag
- Flatten the match handling out of the forEach callback

Issue: BB-760
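As a generic before/after illustration of the find() change (names are stand-ins, not the actual Backbeat code):

```javascript
// Before: forEach plus a mutable flag keeps scanning every message even
// after a hit, and the match handling has to live inside the callback.
function findBootstrapOld(messages, isBootstrapMessage) {
    let matched = null;
    messages.forEach(m => {
        if (matched === null && isBootstrapMessage(m)) {
            matched = m;
        }
    });
    return matched;
}

// After: find() returns the first matching element and stops early;
// the caller handles the match outside any iteration callback.
function findBootstrap(messages, isBootstrapMessage) {
    return messages.find(isBootstrapMessage) || null;
}
```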
LGTM

LGTM
/approve
In the queue
The changeset has received all authorizations and has been added to the queue. The changeset will be merged in:
The following branches will NOT be impacted:
There is no action required on your side. You will be notified here once the changeset has been merged.
IMPORTANT: Please do not attempt to modify this pull request.
If you need this pull request to be removed from the queue, please contact an administrator.
The following options are set: approve
I have successfully merged the changeset of this pull request.
The following branches have NOT changed:
Please check the status of the associated issue BB-760. Goodbye delthas.
Summary
Bumps `node-rdkafka` from `^2.12.0` (librdkafka 2.3.0) to `^3.6.0` (currently resolving to 3.6.1 / librdkafka 2.12.0) as a maintenance upgrade, picking up bugfixes accumulated across librdkafka 2.3.0 → 2.12.0. Also fixes a latent race condition in `BackbeatConsumer._bootstrapConsumer()` exposed by librdkafka 2.10.0+.

Context
This was initially motivated by investigating a flaky CI failure in the "Kafka Cleaner" test (Zenko CI run), where the `backbeat-metrics` topic had unconsumed messages after rapid pod replacement. While the original hypothesis (cooperative-sticky rebalance bug, confluentinc/librdkafka#4908) turned out not to apply — BackbeatConsumer uses eager rebalance (`range,roundrobin`), not cooperative-sticky — the upgrade is still worthwhile for the accumulated bugfixes, and it exposed a real bug in the bootstrap code that needed fixing.

Bootstrap consumer fix
The librdkafka upgrade exposed a latent race condition in `_bootstrapConsumer()`. The bootstrap previously used `setInterval(200ms)` to call `consume(1, consumeCb)`, dispatching C++ async workers to the libuv thread pool (each with a 1000 ms timeout). Up to 5 workers could be in flight concurrently.

When the bootstrap match was found, the old code called `clearInterval` then immediately `unsubscribe()`. But `clearInterval` only prevents new consume calls — workers already in the C++ thread pool continue running. With librdkafka < 2.10.0, `unsubscribe()` effectively invalidated these stale workers (they would return empty or error). With librdkafka >= 2.10.0 ("Enhanced handling for subscribe/unsubscribe edge cases"), these workers survive across the unsubscribe → subscribe transition and dequeue messages from the next subscription, delivering them to the bootstrap's `consumeCb`, which silently ignores non-bootstrap messages.

This was confirmed by local reproduction and bisection:
the `_bootstrapConsumer` test

The fix replaces `setInterval` with chained `setTimeout`: each `consume(1)` is only scheduled after the previous one completes, guaranteeing at most one C++ async worker is in flight at a time. This makes `unsubscribe()` safe to call directly from the callback, with no drain/polling needed.

Why ^3.6.0
We set the floor to `^3.6.0` (librdkafka 2.12.0). KIP-848 (the new consumer group protocol) is opt-in (`group.protocol=consumer` must be explicitly set) and does not affect the default `classic` protocol.

Upgrade safety
librdkafka 2.3.0 → 2.12.0: The librdkafka CHANGELOG shows no consumer-breaking changes in this range. The only notable breaking change is in v2.4.0, where `INVALID_RECORD` producer errors became non-retriable — this does not affect consumers. The metadata recovery behavior change in 2.10.0 (brokers not in metadata responses are removed and clients re-bootstrap) is a minor behavioral difference that should not impact normal operation.

node-rdkafka 2.x → 3.x: The 3.0.0 release only dropped support for EOL Node.js versions — no API changes. Backbeat requires Node >= 20 and runs on Node 22.14.0.
Issue: BB-760