liveness probe to check message processing #848

Merged
filippomc merged 2 commits into develop from feature/events_probe
Apr 22, 2026

@ddelpiano

Implemented solution

Replaces the TCP socket liveness probe with an exec-based liveness probe that checks the broker is still responding to Kafka requests:

  • Opens a connection to localhost:9092
  • Sends an ApiVersions request (a standard Kafka protocol request)
  • The broker responds with the list of API versions it supports
  • The script prints them and exits 0

If the broker freezes (this has happened in the past during cluster upgrades), the request gets no response, the liveness probe fails, and K8s restarts the container.
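
The check can be sketched as a small shell function. This is only an illustration: `probe_broker`, `PROBE_CMD` and `PROBE_TIMEOUT` are hypothetical names that are not part of this PR, while the script path, the bootstrap address and the 10 s limit are taken from the probe command under review.

```shell
#!/bin/sh
# Hypothetical wrapper around the liveness check described in this PR.
# PROBE_CMD and PROBE_TIMEOUT are illustrative variables, not part of the chart.
PROBE_CMD="${PROBE_CMD:-/opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092}"
PROBE_TIMEOUT="${PROBE_TIMEOUT:-10}"

probe_broker() {
    # timeout(1) aborts the command after PROBE_TIMEOUT seconds (GNU timeout
    # then exits with 124), so a frozen broker produces a non-zero exit code
    # instead of a probe that hangs forever.
    # PROBE_CMD is intentionally unquoted so it splits into command + arguments.
    timeout "$PROBE_TIMEOUT" $PROBE_CMD
}
```

A probe script would call `probe_broker` as its last command so that its exit status becomes the probe result: 0 means the broker answered, anything else counts as a failure.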

How to test this PR

Freeze the Kafka JVM while leaving its TCP port open.

# Freeze the process (port stays open, broker stops responding)
kubectl exec -n ifn kafka-0 -- kill -STOP 1

Then watch K8s:

kubectl get pod -n ifn kafka-0 -w

After ~90 s (failureThreshold 3 × periodSeconds 30 s) you should see the RESTARTS counter increment.
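
The ~90 s figure can be sanity-checked with a rough estimate. The sketch below assumes that one probe starts every period and that a hung probe counts as failed once its timeout elapses; it ignores the time K8s needs to terminate and restart the container. `detection_window` is a hypothetical helper name.

```shell
#!/bin/sh
# Rough detection-window estimate for a frozen broker (illustrative only).
# Assumes one probe starts every `period` seconds and a hung probe is
# counted as failed after `probe_timeout` seconds.
detection_window() {
    period=$1
    threshold=$2
    probe_timeout=$3
    # Best case: the broker freezes right before a probe starts.
    min_s=$(( (threshold - 1) * period + probe_timeout ))
    # Worst case: the broker freezes right after a probe ran, so a full
    # period passes before the first failing probe starts.
    max_s=$(( threshold * period + probe_timeout ))
    echo "$min_s $max_s"
}

# Values from this PR: periodSeconds 30, failureThreshold 3, `timeout 10`.
detection_window 30 3 10   # prints "70 100"
```

Under these assumptions the restart is triggered between 70 s and 100 s after the freeze, which is consistent with the ~90 s stated here.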

Sanity checks:

  • The pull request is explicitly linked to the relevant issue(s)
  • The issue is well described: clearly states the problem and the general proposed solution(s)
  • In this PR it is explicitly stated how to test the current change
  • The labels in the issue set the scope and the type of issue (bug, feature, etc.)
  • The relevant components are indicated in the issue (if any)
  • All the automated test checks are passing
  • All the linked issues are included in one Sprint
  • All the linked issues are in the Review state
  • All the linked issues are assigned

Breaking changes (select one):

  • The present changes do not change the preexisting api in any way
  • This PR and the issue are tagged as a breaking-change and the migration procedure is well described above

Possible deployment updates issues (select one):

  • There is no reason why deployments based on CloudHarness may break after the current update
  • This PR and the issue are tagged as alert:deployment

Test coverage (select one):

  • Tests for the relevant cases are included in this PR
  • The changes included in this PR are out of the current test coverage scope

Documentation (select one):

  • The documentation has been updated to match the current changes
  • The changes included in this PR are out of the current documentation scope

Nice to have (if relevant):

  • Screenshots of the changes
  • Explanatory video/animated gif

Copilot AI left a comment

Pull request overview

This PR updates the Kafka broker Kubernetes health checking to detect “frozen” brokers that keep the TCP port open by switching the liveness probe from a TCP socket check to an exec-based Kafka protocol request (ApiVersions).

Changes:

  • Replaced livenessProbe.tcpSocket with an exec probe that runs kafka-broker-api-versions.sh against localhost.
  • Tuned liveness probe timings (initial delay, timeout, failure threshold) to trigger restarts on non-responsive brokers.

command:
- /bin/sh
- -c
- /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000

Copilot AI Apr 16, 2026

The liveness probe hard-codes localhost:9092, but the container port is templated via .Values.apps.events.kafka.port. If the chart is deployed with a non-9092 port, liveness will fail even though the broker is healthy. Consider templating the port here (or using the named port) to keep this probe consistent with the rest of the manifest.

Suggested change
- /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000
- /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:{{ .Values.apps.events.kafka.port }} --timeout 10000

command:
- /bin/sh
- -c
- /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000

Copilot AI Apr 16, 2026

kafka-broker-api-versions.sh writes the supported API versions to stdout; running this every 30s as a liveness probe will continuously add noise to the container logs. Consider suppressing output (e.g., redirect to /dev/null) or switching to a quieter command so logs remain actionable.

Suggested change
- /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000
- /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --timeout 10000 >/dev/null 2>&1

Comment on lines +70 to +73
 initialDelaySeconds: 60
 periodSeconds: 30
-timeoutSeconds: 5
+timeoutSeconds: 15
 failureThreshold: 3

Copilot AI Apr 16, 2026

This liveness probe starts a Kafka CLI command (which typically spins up a new JVM) every periodSeconds: 30. That can add steady CPU/memory overhead and can become significant on small nodes or under load. Consider increasing the probe period, using a lighter-weight probe implementation, or documenting/confirming the resource impact is acceptable for this deployment.
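
One way to confirm whether the overhead is acceptable is to time a single probe run inside the broker container. The helper below is a minimal sketch: `measure_probe` and `PROBE_CMD` are hypothetical names, and the default command is the one from this PR. It reports wall-clock seconds only and does not measure CPU or memory.

```shell
#!/bin/sh
# Hypothetical helper: time one probe invocation to gauge its cost.
# PROBE_CMD defaults to the command from this PR; override it to compare
# alternatives.
PROBE_CMD="${PROBE_CMD:-/opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092}"

measure_probe() {
    start=$(date +%s)
    # Unquoted on purpose so PROBE_CMD splits into command + arguments.
    $PROBE_CMD >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    # Report the probe's exit status and its wall-clock duration in seconds.
    echo "exit=$status elapsed=$(( end - start ))s"
}
```

Running `measure_probe` a few times under normal load gives a first impression of how much of each 30 s period the probe occupies.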

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines +65 to +69
exec:
command:
- /bin/sh
- -c
- timeout 10 /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092

Copilot AI Apr 17, 2026

The exec.command list items are indented at the same level as command:, which makes the YAML invalid (the sequence must be nested under the command key). Adjust indentation so the - /bin/sh, - -c, and the actual command are children of command:.

@filippomc merged commit 1066c53 into develop on Apr 22, 2026
13 of 15 checks passed
@filippomc deleted the feature/events_probe branch on April 22, 2026 07:42

3 participants