Commit 952ecf2

fix(wait-for-grafana): two-phase polling with smarter timeout handling (#213)
* fix(wait-for-grafana): two-phase polling with smarter timeout handling

  Split the polling loop into two phases:

  1. TCP-bind phase (new `startupTimeout` input, default 300s): polls every 5s while curl returns 000 (ECONNREFUSED). Grafana 13 and dev-preview images on contested runners can take >60s to bind the port; the old single-phase timeout would fail the job before the process was ready.
  2. Health phase (existing `timeout` input, default 60s): once the port responds with anything other than 000, switches to fast polling (existing `interval`, default 0.5s) until the expected status code is received.

  Additional improvements:

  - Fail fast on 4xx responses in the health phase: these indicate a URL misconfiguration rather than a timing issue and won't self-resolve.
  - Pass --connect-timeout 2 to curl in phase 1 so each attempt doesn't hang waiting for a connection that will never come.

  Backward compatible: callers that don't pass `startupTimeout` get the 300s default, which is strictly better than the previous 60s limit.

* babysit: address Copilot review feedback

  - Default startup_timeout to 300 if 5th arg is absent, preventing immediate Phase 1 exit when called with the old 4-arg interface
  - Distinguish ECONNREFUSED (curl exit 7) from other 000-producing errors (DNS failure = exit 6, TLS errors, etc.); fail fast on anything that isn't exit 7, avoiding a 300s stall on misconfigured URLs
  - Add --connect-timeout 2 and --max-time 10 to Phase 2 curl calls so a stalled connection cannot outlast the health timeout window

* babysit: document startupTimeout input in README

* babysit: remove version-specific language from docs

  000 failures confirmed on grafana-enterprise@12.3.6, not just v13 or dev-preview images; the issue is pure runner contention, not image version. Update action.yml description and README to reflect this.
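The Phase 1 rule in the commit message (keep waiting on connection refused, fail fast on any other curl error) can be sketched as a small helper. This is an illustrative sketch, not code from the commit: the function name `classify_curl_exit` and the `proceed`/`retry`/`fail` labels are hypothetical. The exit-code meanings (6 for an unresolvable host, 7 for a failed connection, 28 for a timeout) are curl's documented ones.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): maps a curl exit code
# to the Phase 1 decision described in the commit message.
classify_curl_exit() {
  local code="$1"
  case "$code" in
    0) echo "proceed" ;;  # curl completed an HTTP exchange: the port is bound
    7) echo "retry" ;;    # failed to connect: nothing is listening yet
    *) echo "fail" ;;     # e.g. 6 (could not resolve host): waiting will not help
  esac
}

for code in 0 6 7 28; do
  echo "curl exit $code -> $(classify_curl_exit "$code")"
done
```

Under this rule a connect timeout (curl exit 28) also lands in the `fail` branch, which matches the `-ne 0` / `-ne 7` condition in the script.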
1 parent 4698961 commit 952ecf2

3 files changed

Lines changed: 61 additions & 6 deletions

wait-for-grafana/README.md

Lines changed: 8 additions & 0 deletions
@@ -20,6 +20,14 @@ The maximum time to wait for the server to respond, in seconds. Default is `60`.

 The time to wait between each check, in seconds. Default is `0.5`.

+### `startupTimeout` (optional)
+
+The maximum time to wait for the server's TCP port to bind, in seconds. Default is `300`.
+
+This covers the window between the container starting and Grafana's HTTP listener becoming active. During this phase the action polls every 5 seconds. Once the port responds (with any status other than `000`), normal health polling begins using the `timeout` and `interval` values above.
+
+On contested CI runners, Grafana's HTTP listener can take longer to bind than the default health-check window allows, regardless of Grafana version. Increasing this value gives the process more time to start without affecting the health-check phase.
+
 ## How to use?

 You can use this action in your workflow to wait for a Grafana server to become available before running tests or other operations. Here's an example of how to use it:

wait-for-grafana/action.yml

Lines changed: 6 additions & 1 deletion
@@ -17,13 +17,18 @@ inputs:
     description: "Interval between checks in seconds"
     required: true
     default: "0.5"
+  startupTimeout:
+    description: "Seconds to wait for the TCP port to bind before health polling begins. On contested CI runners, Grafana's HTTP listener can take longer to start than the default health-check window allows."
+    required: true
+    default: "300"
 runs:
   using: "composite"
   steps:
-    - run: ${{ github.action_path }}/wait-for-grafana.sh "${GRAFANA_URL}" "${GRAFANA_RESPONSE_CODE}" "${GRAFANA_TIMEOUT}" "${GRAFANA_INTERVAL}"
+    - run: ${{ github.action_path }}/wait-for-grafana.sh "${GRAFANA_URL}" "${GRAFANA_RESPONSE_CODE}" "${GRAFANA_TIMEOUT}" "${GRAFANA_INTERVAL}" "${GRAFANA_STARTUP_TIMEOUT}"
       shell: bash
       env:
         GRAFANA_URL: ${{ inputs.url }}
         GRAFANA_RESPONSE_CODE: ${{ inputs.responseCode }}
         GRAFANA_TIMEOUT: ${{ inputs.timeout }}
         GRAFANA_INTERVAL: ${{ inputs.interval }}
+        GRAFANA_STARTUP_TIMEOUT: ${{ inputs.startupTimeout }}
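The action now passes `"${GRAFANA_STARTUP_TIMEOUT}"` as a fifth positional argument, and `wait-for-grafana.sh` reads it with `startup_timeout="${5:-300}"`. A minimal sketch of how that expansion behaves, assuming a hypothetical wrapper function `resolve_startup_timeout` and a placeholder URL (neither is in the commit):

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of the commit) that mirrors only the
# script's handling of its fifth positional argument.
resolve_startup_timeout() {
  # ${5:-300} falls back to 300 when the fifth argument is unset OR empty.
  echo "${5:-300}"
}

# Placeholder URL, not taken from the commit.
url="http://localhost:3000"

resolve_startup_timeout "$url" 200 60 0.5        # old 4-argument call: prints 300
resolve_startup_timeout "$url" 200 60 0.5 ""     # fifth argument empty: prints 300
resolve_startup_timeout "$url" 200 60 0.5 120    # explicit override: prints 120
```

The `:-` form matters here because a quoted but empty variable still counts as a supplied argument; plain `${5-300}` would keep the empty string in that case.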

wait-for-grafana/wait-for-grafana.sh

Lines changed: 47 additions & 5 deletions
@@ -4,25 +4,67 @@ url="$1"
 expected_response_code="$2"
 timeout="$3"
 interval="$4"
+startup_timeout="${5:-300}"

 echo "Checking URL: $url"
 echo "Expected response code: $expected_response_code"
-echo "Timeout: $timeout seconds"
+echo "Startup timeout (TCP bind): $startup_timeout seconds"
+echo "Health timeout (after bind): $timeout seconds"
 echo "Interval: $interval seconds"

-end_time=$((SECONDS + timeout))
+# Phase 1: wait for TCP port to bind.
+# curl exit code 7 = ECONNREFUSED: the process isn't listening yet, safe to keep waiting.
+# Any other non-zero exit code (e.g. 6 = DNS failure) indicates misconfiguration — fail fast.
+startup_end=$((SECONDS + startup_timeout))
+port_bound=false

-while [ $SECONDS -lt $end_time ]; do
-  response=$(curl -s -o /dev/null -w "%{http_code}" "$url")
+while [ $SECONDS -lt $startup_end ]; do
+  response=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 2 "$url")
+  curl_exit=$?
+
+  if [ $curl_exit -ne 0 ] && [ $curl_exit -ne 7 ]; then
+    echo "curl failed with exit code $curl_exit (not a connection-refused error) — failing fast"
+    exit 1
+  fi
+
+  if [ "$response" != "000" ]; then
+    port_bound=true
+    break
+  fi
+
+  echo "Waiting for TCP bind (curl exit: $curl_exit). Current status: $response"
+  sleep 5
+done
+
+if [ "$port_bound" = false ]; then
+  echo "Startup timeout reached. Server TCP port did not bind within $startup_timeout seconds"
+  exit 1
+fi
+
+echo "TCP port bound. Waiting for server to respond with status code $expected_response_code..."
+
+# Phase 2: port is open, wait for a healthy response.
+# --connect-timeout and --max-time bound each curl call so a stalled connection
+# cannot outlast the health window.
+# Fail fast on 4xx — indicates a URL misconfiguration, not a timing issue.
+health_end=$((SECONDS + timeout))
+
+while [ $SECONDS -lt $health_end ]; do
+  response=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 2 --max-time 10 "$url")

   if [ "$response" -eq "$expected_response_code" ]; then
     echo "Server is up and responding with status code $expected_response_code"
     exit 0
   fi

+  if [ "$response" -ge 400 ] && [ "$response" -lt 500 ]; then
+    echo "Server returned $response — likely a URL misconfiguration, failing fast"
+    exit 1
+  fi
+
   echo "Waiting for server to respond with status code $expected_response_code. Current status: $response"
   sleep "$interval"
 done

-echo "Timeout reached. Server did not respond with status code $expected_response_code within $timeout seconds"
+echo "Timeout reached. Server did not respond with status code $expected_response_code within $timeout seconds after TCP bind"
 exit 1
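The order of the two Phase 2 checks matters: the expected-status comparison runs before the 4xx check, so a caller who deliberately expects a 4xx code still succeeds. A minimal sketch of that decision, assuming a hypothetical helper `classify_health_response` and hypothetical `healthy`/`fatal`/`retry` labels (neither appears in the commit):

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): classifies one Phase 2
# poll result using the same check order as the script's health loop.
classify_health_response() {
  local response="$1" expected="$2"
  if [ "$response" -eq "$expected" ]; then
    echo "healthy"   # the expected status wins, even if it is a 4xx
  elif [ "$response" -ge 400 ] && [ "$response" -lt 500 ]; then
    echo "fatal"     # unexpected 4xx: treated as a URL misconfiguration
  else
    echo "retry"     # 000, 3xx, 5xx and so on: keep polling until the timeout
  fi
}

classify_health_response 200 200   # prints healthy
classify_health_response 404 200   # prints fatal
classify_health_response 503 200   # prints retry
classify_health_response 401 401   # prints healthy
classify_health_response 000 200   # prints retry
```

A `000` response (for example when a Phase 2 curl call hits `--max-time`) falls through to `retry`, so only the overall health timeout ends the loop in that case.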
