ci: wait for stream subsystem readiness in test_prometheus_stream.sh by nic-6443 · Pull Request #13253 · apache/apisix

nic-6443 · 2026-04-18T03:53:54Z

Description

t/cli/test_prometheus_stream.sh is flaky in CI: many recent runs on master and feature branches fail with "failed: prometheus can't work in stream subsystem". A few examples from the last few days:

master 41775ef5 — https://github.com/apache/apisix/actions/runs/24496623331
master ddddeafc — https://github.com/apache/apisix/actions/runs/24488366032
branch feat-support-bedrock 8bf841b4 — https://github.com/apache/apisix/actions/runs/24561304700

Root cause

The script does:

make run
sleep 0.5
# ...
curl http://127.0.0.1:9100 || true   # probe stream proxy
sleep 1
out="$(curl http://127.0.0.1:9091/apisix/prometheus/metrics)"
echo "$out" | grep "apisix_stream_connection_total{route=\"1\"} 1"

On slower GitHub Actions runners, 0.5s is not always enough for the stream subsystem's prometheus exporter to come up after make run. The probe to port 9100 then hits a half-initialized listener and gets Connection reset by peer, so the connection counter is never bumped and the metric grep fails. Confirmed by inspecting the failure logs:

HELP/TYPE lines for apisix_stream_connection_total are absent from the metrics output in the failing runs (they're present in the passing runs).
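The log check above can be sketched as a small helper that reads a saved metrics dump on stdin. The function name `declares_metric` is hypothetical and not part of the PR; it only assumes the usual Prometheus text format, where a metric is announced by `# HELP <name> ...` and `# TYPE <name> ...` lines.

```shell
# Hypothetical helper (not in the PR): succeeds only if the metrics text
# on stdin contains both the HELP and the TYPE line for the given metric.
declares_metric() {
    local name=$1 text
    text=$(cat)   # read stdin once so it can be searched twice
    printf '%s\n' "$text" | grep -qF "# HELP $name " &&
        printf '%s\n' "$text" | grep -qF "# TYPE $name "
}
```

Something like `declares_metric apisix_stream_connection_total < metrics.txt` would then separate a failing dump from a passing one, assuming the dump was saved to a file first.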

Fix

Replace the fixed sleep 0.5 with a polling loop (up to ~10s) that waits until the prometheus metrics endpoint exposes the stream metric's HELP line, which is a reliable signal that the stream prometheus plugin is loaded and ready to record traffic.

The downstream curl ... :9100; sleep 1 is left as-is — once the plugin is loaded, that single probe + 1s flush window already works (the metric has refresh_interval: 1).
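For comparison, here is a minimal sketch of the same bounded wait written as a reusable function that reports a timeout instead of falling through. This is not what the PR does: `wait_until` and `stream_metric_ready` are hypothetical names, and only the curl/grep check itself is taken from the PR's loop.

```shell
# Hypothetical helper (not in the PR): wait_until ATTEMPTS INTERVAL CMD...
# Runs CMD up to ATTEMPTS times, sleeping INTERVAL seconds after each
# failure. Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
    local attempts=$1 interval=$2 i=0
    shift 2
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}

# Readiness check copied from the PR's loop: the stream metric's HELP line
# is visible on the prometheus metrics endpoint.
stream_metric_ready() {
    curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
        2>/dev/null | grep -q "# HELP apisix_stream_connection_total"
}

# Example call (commented out so that sourcing this file has no side effects):
# wait_until 20 0.5 stream_metric_ready || echo "stream prometheus not ready"
```

With 20 attempts and a 0.5s interval this roughly matches the ~10s budget described above. The difference is that a timeout becomes visible to the caller, whereas a plain `for` loop with `break` continues either way and leaves the later metric grep to report the failure.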

Checklist

I have explained the need for this PR and the problem it solves
I have explained the changes or the new features added to this PR
I have added tests corresponding to the changes introduced in this PR — N/A, this PR fixes an existing test
I have added proper labels to this PR
Review and CI completed successfully

The test relied on a fixed `sleep 0.5` after `make run` before probing the stream proxy on port 9100. On slow GitHub Actions runners this is often not enough for the stream subsystem's prometheus exporter to come up, leading to "Connection reset by peer" on the probe and a missing `apisix_stream_connection_total` metric. The test then fails with "prometheus can't work in stream subsystem". Replace the fixed sleep with a polling loop (up to ~10s) that waits until the prometheus metrics endpoint exposes the stream metric's HELP line, which indicates the stream prometheus plugin is loaded and ready.

Copilot

Pull request overview

This PR reduces CI flakiness in the stream Prometheus CLI test by replacing a fixed post-startup sleep with a readiness wait that polls the Prometheus metrics endpoint until the stream metric is exposed.

Changes:

Replace sleep 0.5 with a polling loop that waits for apisix_stream_connection_total’s # HELP line before proceeding (in two test scenarios).
Add comments explaining the readiness signal and why it’s needed on slower CI runners.

+# Same readiness wait as above, since admin is disabled here we can't probe
+# admin API; the prometheus HELP line is still a reliable readiness signal.
+for _ in $(seq 1 20); do
+    if curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
+            2>/dev/null | grep -q "# HELP apisix_stream_connection_total"; then
+        break
+    fi
+    sleep 0.5
+done


Copilot AI review requested due to automatic review settings April 18, 2026 03:53

dosubot Bot added the size:XS (This PR changes 0-9 lines, ignoring generated files.) and CI labels Apr 18, 2026

Copilot started reviewing on behalf of nic-6443 April 18, 2026 03:54

Copilot AI reviewed Apr 18, 2026

moonming approved these changes Apr 20, 2026

nic-6443 closed this Apr 22, 2026

nic-6443 deleted the nic/fix-prom-stream-flake branch April 22, 2026 03:26

nic-6443 wants to merge 1 commit into apache:master from nic-6443:nic/fix-prom-stream-flake

3 participants
