Commit 1aeea3e

LOGC-63: Enable rdkafka cgrp debug logs on workbench crr-queue-processor
Add a workflow step that appends RDKAFKA_DEBUG_LOGS=cgrp to the backbeat container's /conf/env and SIGKILLs the queue-processor so supervisord respawns it with the new env.

With this set, librdkafka surfaces consumer-group state transitions (JoinGroup/SyncGroup/Heartbeat/rebalance triggers) as rdkafka.log events in the process stdout, which are dumped by the existing "Dump backbeat logs on failure" step. Needed to identify the broker-side cause of the 1s rebalance loop the next time the CRR e2e flake reproduces.

The env mutation has to live in CI rather than supervisord.conf because env/default/config/backbeat/supervisord.conf is generated by workbench's docker-compose action on each run.

Scope kept to cgrp to keep log volume manageable. Workbench-only; production deployments use Federation and are unaffected.
1 parent 9e043f5 commit 1aeea3e
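
The append described in the commit message is guarded so that re-running the step does not duplicate the export line. The following is a minimal sketch of that guard, run against a temporary file instead of the container's /conf/env. The helper name `ensure_env_line` and the temporary file are assumptions made for illustration; they are not part of the commit.

```shell
#!/usr/bin/env bash
# Sketch only: idempotent "append if missing" against a throwaway file.
# ensure_env_line is a hypothetical helper, not code from the commit.
set -e

ensure_env_line() {
    local file="$1"
    local name="$2"
    local value="$3"
    # Same shape as the guard in the commit: look for the variable name
    # anywhere in the file and append the export only when it is absent.
    grep -q "$name" "$file" 2>/dev/null || echo "export $name=$value" >> "$file"
}

env_file=$(mktemp)
echo "export EXISTING=1" > "$env_file"

ensure_env_line "$env_file" RDKAFKA_DEBUG_LOGS cgrp
ensure_env_line "$env_file" RDKAFKA_DEBUG_LOGS cgrp   # second call is a no-op

cat "$env_file"
rm -f "$env_file"
```

Because the guard matches on the variable name only, an existing line that sets the variable to a different value is left untouched rather than updated.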

2 files changed

Lines changed: 56 additions & 0 deletions

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Enable librdkafka cgrp debug logs on the workbench backbeat
+# crr-queue-processor so consumer-group state transitions
+# (JoinGroup/SyncGroup/Heartbeat/rebalance triggers) surface as
+# rdkafka.log events in the process stdout. Needed to diagnose an
+# intermittent 1s rebalance loop that leaves CRR e2e tests stuck at
+# ReplicationStatus=PENDING.
+#
+# We can't bake this into the supervisord.conf because that file is
+# generated by workbench's docker-compose action. Instead, append the
+# export to /conf/env inside the container and SIGKILL the QProc;
+# supervisord's autorestart=true respawns it, the new instance sources
+# /conf/env and starts with RDKAFKA_DEBUG_LOGS=cgrp set.
+#
+# Workbench-only; production deployments use Federation and are
+# unaffected.
+set -e
+
+echo "Enabling librdkafka cgrp debug on crr-queue-processor..."
+
+docker exec workbench-backbeat sh -c \
+    'grep -q "RDKAFKA_DEBUG_LOGS" /conf/env || echo "export RDKAFKA_DEBUG_LOGS=cgrp" >> /conf/env'
+
+proc=$(docker exec workbench-backbeat supervisord ctl -c /conf/supervisord.conf status \
+    | sed 's/\x1b\[[0-9;]*m//g' \
+    | awk '/^crr-queue-processor_[0-9]+[[:space:]]/ {print $1; exit}')
+if [ -z "$proc" ]; then
+    echo "ERROR: crr-queue-processor not found in supervisord status"
+    docker exec workbench-backbeat supervisord ctl -c /conf/supervisord.conf status || true
+    exit 1
+fi
+
+# Count current "ready to consume" markers; after restart we expect
+# this count to grow by one.
+prev_ready=$(docker exec workbench-backbeat sh -c \
+    "cat /logs/crr-queue-processor_*.log 2>/dev/null | grep -c 'queue processor is ready to consume replication entries' || true")
+
+docker exec workbench-backbeat supervisord ctl -c /conf/supervisord.conf signal KILL "$proc" || true
+
+# Wait for the respawned instance to log a fresh ready marker.
+for i in {1..60}; do
+    sleep 1
+    new_ready=$(docker exec workbench-backbeat sh -c \
+        "cat /logs/crr-queue-processor_*.log 2>/dev/null | grep -c 'queue processor is ready to consume replication entries' || true")
+    if [ "${new_ready:-0}" -gt "${prev_ready:-0}" ]; then
+        echo "✓ crr-queue-processor restarted with RDKAFKA_DEBUG_LOGS=cgrp"
+        exit 0
+    fi
+done
+
+echo "ERROR: crr-queue-processor did not re-emit 'ready to consume' within 60s after restart"
+docker exec workbench-backbeat supervisord ctl -c /conf/supervisord.conf status || true
+exit 1
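
Two pieces of the script's logic can be exercised without Docker: picking the process name out of colourised status output, and waiting for a log marker count to grow. The sketch below is illustrative only. The function names (`first_qproc`, `count_markers`, `wait_for_growth`), the sample status lines, and the short marker string are all assumptions; the real supervisord status format may differ.

```shell
#!/usr/bin/env bash
# Sketch only: Docker-free versions of the parse and wait steps.
# All function names and sample data here are hypothetical.
set -e

# Strip ANSI colour sequences from stdin, then print the first
# crr-queue-processor_<N> name. The ESC byte is built with printf so the
# sed expression does not rely on GNU sed's \x1b escape.
first_qproc() {
    local esc
    esc=$(printf '\033')
    sed "s/${esc}\[[0-9;]*m//g" \
        | awk '/^crr-queue-processor_[0-9]+[[:space:]]/ {print $1; exit}'
}

# Count marker lines in a log file. grep -c prints 0 but exits 1 when
# nothing matches, so `|| true` keeps `set -e` from aborting.
count_markers() {
    cat "$1" 2>/dev/null | grep -c "$2" || true
}

# Poll until the marker count exceeds prev; give up after `tries` polls.
wait_for_growth() {
    local log="$1" marker="$2" prev="$3" tries="$4" delay="$5"
    local i=0 now
    while [ "$i" -lt "$tries" ]; do
        sleep "$delay"
        now=$(count_markers "$log" "$marker")
        if [ "${now:-0}" -gt "${prev:-0}" ]; then
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Hypothetical status output with colour codes around the process name.
printf '\033[32mcrr-queue-processor_0\033[0m   RUNNING\n' | first_qproc

log=$(mktemp)
echo "ready to consume" > "$log"
prev=$(count_markers "$log" "ready to consume")
echo "ready to consume" >> "$log"    # stands in for the respawned process
wait_for_growth "$log" "ready to consume" "$prev" 3 0 && echo "restart observed"
rm -f "$log"
```

Comparing counts before and after the kill, rather than checking for the marker's presence, is what distinguishes the respawned instance's ready line from the one logged at first startup.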

.github/workflows/e2e-tests.yaml

Lines changed: 3 additions & 0 deletions
@@ -41,6 +41,9 @@ jobs:
       - name: Start Workbench
         uses: scality/workbench@v0.16.0
 
+      - name: Enable rdkafka cgrp debug on crr-queue-processor
+        run: ./.github/scripts/enable-qproc-rdkafka-debug.sh
+
       - name: Wait for all services
         run: |
           set -e
