liveness probe to check message processing#848
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the Kafka broker Kubernetes health checking to detect “frozen” brokers that keep the TCP port open by switching the liveness probe from a TCP socket check to an exec-based Kafka protocol request (ApiVersions).
Changes:
- Replaced
livenessProbe.tcpSocketwith anexecprobe that runskafka-broker-api-versions.shagainst localhost. - Tuned liveness probe timings (initial delay, timeout, failure threshold) to trigger restarts on non-responsive brokers.
| command: | ||
| - /bin/sh | ||
| - -c | ||
| - /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000 |
There was a problem hiding this comment.
The liveness probe hard-codes localhost:9092, but the container port is templated via .Values.apps.events.kafka.port. If the chart is deployed with a non-9092 port, liveness will fail even though the broker is healthy. Consider templating the port here (or using the named port) to keep this probe consistent with the rest of the manifest.
| - /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000 | |
| - /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:{{ .Values.apps.events.kafka.port }} --timeout 10000 |
| command: | ||
| - /bin/sh | ||
| - -c | ||
| - /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000 |
There was a problem hiding this comment.
kafka-broker-api-versions.sh writes the supported API versions to stdout; running this every 30s as a liveness probe will continuously add noise to the container logs. Consider suppressing output (e.g., redirect to /dev/null) or switching to a quieter command so logs remain actionable.
| - /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000 | |
| - /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000 >/dev/null 2>&1 |
| initialDelaySeconds: 60 | ||
| periodSeconds: 30 | ||
| timeoutSeconds: 5 | ||
| timeoutSeconds: 15 | ||
| failureThreshold: 3 |
There was a problem hiding this comment.
This liveness probe starts a Kafka CLI command (which typically spins up a new JVM) every periodSeconds: 30. That can add steady CPU/memory overhead and can become significant on small nodes or under load. Consider increasing the probe period, using a lighter-weight probe implementation, or documenting/confirming the resource impact is acceptable for this deployment.
| exec: | ||
| command: | ||
| - /bin/sh | ||
| - -c | ||
| - timeout 10 /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 |
There was a problem hiding this comment.
The exec.command list items are indented at the same level as command:, which makes the YAML invalid (the sequence must be nested under the command key). Adjust indentation so the - /bin/sh, - -c, and the actual command are children of command:.
Implemented solution
Replacing TCP socket probe with liveness probe to check that the broker is still alive:
If the broker freeze (happened in the past during cluster upgrades) this should trigger a liveness failure and K8s will take care of the rest.
How to test this PR
Freeze the Kafka JVM without killing the TCP port.
Then watch K8s:
After ~90 s (3 × 30 s periods) you should see the RESTARTS counter increment.
Sanity checks:
Breaking changes (select one):
breaking-changeand the migration procedure is well described abovePossible deployment updates issues (select one):
alert:deploymentTest coverage (select one):
Documentation (select one):
Nice to have (if relevant):