
ci: wait for stream subsystem readiness in test_prometheus_stream.sh #13253

Closed
nic-6443 wants to merge 1 commit into apache:master from nic-6443:nic/fix-prom-stream-flake

Conversation

@nic-6443
Member

Description

t/cli/test_prometheus_stream.sh is flaky in CI: many recent runs on master and on feature branches fail with "failed: prometheus can't work in stream subsystem". A few examples just from the last few days:

Root cause

The script does:

make run
sleep 0.5
# ...
curl http://127.0.0.1:9100 || true   # probe stream proxy
sleep 1
out="$(curl http://127.0.0.1:9091/apisix/prometheus/metrics)"
grep "apisix_stream_connection_total{route=\"1\"} 1"

On slower GitHub Actions runners, 0.5s is not enough for the stream subsystem's prometheus exporter to come up after make run. The probe to port 9100 hits a half-initialized listener and gets Connection reset by peer, so the connection counter is never bumped, and the metric grep fails. Confirmed by inspecting the failure logs:

  • HELP/TYPE lines for apisix_stream_connection_total are absent from the metrics output in the failing runs (they're present in the passing runs).
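The check described in the bullet above can be repeated against a saved metrics dump. The following is a minimal sketch, not part of the PR: `has_metric_metadata` is a hypothetical helper name, and it assumes only the standard Prometheus text format, in which a metric's metadata appears as `# HELP <name> ...` and `# TYPE <name> ...` lines.

```shell
# Hypothetical helper (not from the PR): read Prometheus text-format output on
# stdin and succeed only if both the HELP and TYPE lines for the metric exist.
has_metric_metadata() {
    metric=$1
    text=$(cat)
    # Both metadata lines must be present; the trailing space avoids matching
    # a longer metric name that merely starts with the same prefix.
    printf '%s\n' "$text" | grep -q "^# HELP $metric " &&
        printf '%s\n' "$text" | grep -q "^# TYPE $metric "
}

# Possible usage against a running instance (endpoint taken from the test script):
#   curl -s http://127.0.0.1:9091/apisix/prometheus/metrics \
#       | has_metric_metadata apisix_stream_connection_total \
#       || echo "stream metric metadata missing"
```

A non-zero exit status here corresponds to the failing runs described above, where the stream metric was never registered.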

Fix

Replace the fixed sleep 0.5 with a polling loop (up to ~10s) that waits until the prometheus metrics endpoint exposes the stream metric's HELP line, which is a reliable signal that the stream prometheus plugin is loaded and ready to record traffic.
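As an illustration of the polling approach, here is a sketch of a reusable wait helper. It is not the code in this PR: `wait_for_output` is a hypothetical name, and unlike a bare loop with `break`, it returns a non-zero status when the attempts run out so the caller can fail fast with a clear message.

```shell
# Hypothetical helper (not from the PR): run a probe command repeatedly until
# its output contains a literal string, or give up after a bounded number of tries.
# usage: wait_for_output <attempts> <interval_seconds> <literal_string> <command...>
wait_for_output() {
    attempts=$1
    interval=$2
    needle=$3
    shift 3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -F matches the needle literally; -q stops at the first match.
        if "$@" 2>/dev/null | grep -qF -- "$needle"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}

# Possible usage mirroring the ~10s budget described above (20 tries x 0.5s):
#   wait_for_output 20 0.5 "# HELP apisix_stream_connection_total" \
#       curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
#       || { echo "failed: stream prometheus exporter not ready"; exit 1; }
```

Fractional intervals such as `0.5` rely on a `sleep` that accepts them (GNU coreutils does). The worst-case wait can exceed the nominal budget, because each attempt may also spend up to the probe's own timeout.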

The downstream curl ... :9100; sleep 1 is left as-is — once the plugin is loaded, that single probe + 1s flush window already works (the metric has refresh_interval: 1).

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to the changes introduced in this PR — N/A, this PR fixes an existing test
  • I have added proper labels to this PR
  • Review and CI completed successfully

The test relied on a fixed `sleep 0.5` after `make run` before probing the
stream proxy on port 9100. On slow GitHub Actions runners this is often not
enough for the stream subsystem's prometheus exporter to come up, leading
to "Connection reset by peer" on the probe and a missing
`apisix_stream_connection_total` metric. The test then fails with
"prometheus can't work in stream subsystem".

Replace the fixed sleep with a polling loop (up to ~10s) that waits until
the prometheus metrics endpoint exposes the stream metric's HELP line,
which indicates the stream prometheus plugin is loaded and ready.
Copilot AI review requested due to automatic review settings April 18, 2026 03:53
@dosubot (bot) added the size:XS (This PR changes 0-9 lines, ignoring generated files.) and CI labels Apr 18, 2026

Copilot AI left a comment

Pull request overview

This PR reduces CI flakiness in the stream Prometheus CLI test by replacing a fixed post-startup sleep with a readiness wait that polls the Prometheus metrics endpoint until the stream metric is exposed.

Changes:

  • Replace sleep 0.5 with a polling loop that waits for apisix_stream_connection_total’s # HELP line before proceeding (in two test scenarios).
  • Add comments explaining the readiness signal and why it’s needed on slower CI runners.

Comment on lines +98 to +106
# Same readiness wait as above, since admin is disabled here we can't probe
# admin API; the prometheus HELP line is still a reliable readiness signal.
for _ in $(seq 1 20); do
    if curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
            2>/dev/null | grep -q "# HELP apisix_stream_connection_total"; then
        break
    fi
    sleep 0.5
done
Comment on lines +44 to 51
for _ in $(seq 1 20); do
    if curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
            2>/dev/null | grep -q "# HELP apisix_stream_connection_total"; then
        break
    fi
    sleep 0.5
done

Comment on lines +44 to +49
for _ in $(seq 1 20); do
    if curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
            2>/dev/null | grep -q "# HELP apisix_stream_connection_total"; then
        break
    fi
    sleep 0.5
Comment on lines +100 to +106
for _ in $(seq 1 20); do
    if curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
            2>/dev/null | grep -q "# HELP apisix_stream_connection_total"; then
        break
    fi
    sleep 0.5
done
@nic-6443 nic-6443 closed this Apr 22, 2026
@nic-6443 nic-6443 deleted the nic/fix-prom-stream-flake branch April 22, 2026 03:26

Labels

CI size:XS This PR changes 0-9 lines, ignoring generated files.

3 participants