ci: wait for stream subsystem readiness in test_prometheus_stream.sh #13253
Closed
nic-6443 wants to merge 1 commit into apache:master
Conversation
The test relied on a fixed `sleep 0.5` after `make run` before probing the stream proxy on port 9100. On slow GitHub Actions runners this is often not enough for the stream subsystem's prometheus exporter to come up, leading to "Connection reset by peer" on the probe and a missing `apisix_stream_connection_total` metric. The test then fails with "prometheus can't work in stream subsystem". Replace the fixed sleep with a polling loop (up to ~10s) that waits until the prometheus metrics endpoint exposes the stream metric's HELP line, which indicates the stream prometheus plugin is loaded and ready.
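The polling idea described above can be sketched as a small reusable helper. This is only an illustration, not code from the PR: `wait_until` and `stream_metric_ready` are hypothetical names, while the `curl | grep` check, the 20 attempts, and the 0.5s interval mirror the loop in the diff. Unlike the loop in the diff, which simply falls through after the last attempt, this sketch returns a non-zero status on timeout so the caller can decide how to fail.

```shell
# Hypothetical helper (not part of the PR): retry a command until it succeeds.
# Usage: wait_until <attempts> <interval_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt failed.
wait_until() {
    _wu_attempts=$1
    _wu_interval=$2
    shift 2
    _wu_i=0
    while [ "$_wu_i" -lt "$_wu_attempts" ]; do
        if "$@"; then
            return 0
        fi
        sleep "$_wu_interval"
        _wu_i=$((_wu_i + 1))
    done
    return 1
}

# Readiness check taken from the PR's loop: the stream metric's HELP line
# is present in the prometheus metrics output.
stream_metric_ready() {
    curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
        2>/dev/null | grep -q "# HELP apisix_stream_connection_total"
}

# Example call (commented out so sourcing this snippet has no side effects):
# wait_until 20 0.5 stream_metric_ready || { echo "stream prometheus not ready"; exit 1; }
```

With 20 attempts and a 0.5s pause the sleeps add up to the ~10s budget mentioned above; each `curl` call can add up to 2s on top of that because of `--max-time 2`.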
Pull request overview
This PR reduces CI flakiness in the stream Prometheus CLI test by replacing a fixed post-startup sleep with a readiness wait that polls the Prometheus metrics endpoint until the stream metric is exposed.
Changes:
- Replace `sleep 0.5` with a polling loop that waits for `apisix_stream_connection_total`'s `# HELP` line before proceeding (in two test scenarios).
- Add comments explaining the readiness signal and why it's needed on slower CI runners.
Comment on lines +98 to +106

    # Same readiness wait as above, since admin is disabled here we can't probe
    # admin API; the prometheus HELP line is still a reliable readiness signal.
    for _ in $(seq 1 20); do
        if curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
            2>/dev/null | grep -q "# HELP apisix_stream_connection_total"; then
            break
        fi
        sleep 0.5
    done

Comment on lines +44 to 51

    for _ in $(seq 1 20); do
        if curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
            2>/dev/null | grep -q "# HELP apisix_stream_connection_total"; then
            break
        fi
        sleep 0.5
    done

Comment on lines +44 to +49

    for _ in $(seq 1 20); do
        if curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
            2>/dev/null | grep -q "# HELP apisix_stream_connection_total"; then
            break
        fi
        sleep 0.5

Comment on lines +100 to +106

    for _ in $(seq 1 20); do
        if curl -s --max-time 2 http://127.0.0.1:9091/apisix/prometheus/metrics \
            2>/dev/null | grep -q "# HELP apisix_stream_connection_total"; then
            break
        fi
        sleep 0.5
    done

moonming approved these changes on Apr 20, 2026
Description
`t/cli/test_prometheus_stream.sh` is flaky on the CI: many recent runs on `master` and feature branches fail with `failed: prometheus can't work in stream subsystem`. A few examples just from the last few days:
- `41775ef5`: https://github.com/apache/apisix/actions/runs/24496623331
- `ddddeafc`: https://github.com/apache/apisix/actions/runs/24488366032
- `feat-support-bedrock`, `8bf841b4`: https://github.com/apache/apisix/actions/runs/24561304700

Root cause
The script runs `make run`, waits a fixed `sleep 0.5`, and then probes the stream proxy on port 9100.
On slower GitHub Actions runners, 0.5s is not enough for the stream subsystem's prometheus exporter to come up after `make run`. The probe to port 9100 hits a half-initialized listener and gets `Connection reset by peer`, so the connection counter is never bumped and the metric grep fails. Confirmed by inspecting the failure logs: `apisix_stream_connection_total` is absent from the metrics output in the failing runs (it is present in the passing runs).

Fix
Replace the fixed `sleep 0.5` with a polling loop (up to ~10s) that waits until the prometheus metrics endpoint exposes the stream metric's HELP line, which is a reliable signal that the stream prometheus plugin is loaded and ready to record traffic.

The downstream `curl ... :9100; sleep 1` is left as-is: once the plugin is loaded, that single probe + 1s flush window already works (the metric has `refresh_interval: 1`).

Checklist