fix(wait-for-grafana): two-phase polling with smarter timeout handling #213
darrenjaneczek merged 4 commits into main from
Conversation
Split the polling loop into two phases:

1. TCP-bind phase (new `startupTimeout` input, default 300s): polls every 5s while curl returns 000 (ECONNREFUSED). Grafana 13 and dev-preview images on contested runners can take >60s to bind the port; the old single-phase timeout would fail the job before the process was ready.
2. Health phase (existing `timeout` input, default 60s): once the port responds with anything other than 000, switches to fast polling (existing `interval`, default 0.5s) until the expected status code is received.

Additional improvements:

- Fail fast on 4xx responses in the health phase — these indicate a URL misconfiguration rather than a timing issue and won't self-resolve.
- Pass `--connect-timeout 2` to curl in phase 1 so each attempt doesn't hang waiting for a connection that will never come.

Backward compatible: callers that don't pass `startupTimeout` get the 300s default, which is strictly better than the previous 60s limit.
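For illustration, a minimal sketch of the loop structure this describes. The argument names, order, and messages are assumptions, not the actual interface of wait-for-grafana.sh; `--max-time 10` in Phase 2 reflects the review follow-up below rather than this initial description.

```bash
#!/usr/bin/env bash
# Sketch only: two-phase polling as described above.
url="${1:?url required}"
expected="${2:-200}"
interval="${3:-0.5}"
timeout="${4:-60}"
startup_timeout="${5:-300}"

probe() {
  # Prints the HTTP status; curl prints 000 when no connection could be
  # made at all (e.g. ECONNREFUSED while the port is unbound).
  curl -s -o /dev/null -w '%{http_code}' --connect-timeout 2 "$@" "$url" || true
}

# Phase 1: slow-poll (every 5s) until the port is bound, i.e. the
# response is anything other than 000.
deadline=$((SECONDS + startup_timeout))
while [ "$(probe)" = "000" ]; do
  if (( SECONDS >= deadline )); then
    echo "Gave up after ${startup_timeout}s: port never bound" >&2
    exit 1
  fi
  sleep 5
done

# Phase 2: fast-poll for the expected status. A 4xx will not
# self-resolve, so it fails immediately instead of burning the window.
deadline=$((SECONDS + timeout))
while :; do
  status=$(probe --max-time 10)
  [ "$status" = "$expected" ] && exit 0
  case "$status" in
    4??) echo "Got $status: the URL looks misconfigured" >&2; exit 1 ;;
  esac
  if (( SECONDS >= deadline )); then
    echo "Gave up after ${timeout}s: last status $status, wanted $expected" >&2
    exit 1
  fi
  sleep "$interval"
done
```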
Pull request overview
This PR updates the wait-for-grafana composite action to be more resilient to slow-starting Grafana containers on CI by splitting readiness polling into a TCP-bind phase followed by a health-check phase.
Changes:
- Added a Phase 1 “TCP bind” polling loop that waits while curl reports `000`, with a new `startupTimeout` input (default 300s).
- Kept Phase 2 health polling for the expected HTTP status code, and added a fast-fail path for `4xx` responses.
- Updated the composite action to pass the new startup timeout input through to the script.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| wait-for-grafana/wait-for-grafana.sh | Implements two-phase polling (bind then health) and adds 4xx fast-fail behavior. |
| wait-for-grafana/action.yml | Introduces startupTimeout input and wires it into the script invocation. |
- Default `startup_timeout` to 300 if the 5th arg is absent, preventing immediate Phase 1 exit when called with the old 4-arg interface
- Distinguish ECONNREFUSED (curl exit 7) from other 000-producing errors (DNS failure = exit 6, TLS errors, etc.); fail fast on anything that isn't exit 7, avoiding a 300s stall on misconfigured URLs
- Add `--connect-timeout 2` and `--max-time 10` to Phase 2 curl calls so a stalled connection cannot outlast the health timeout window
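As a sketch of the second point, one way to keep the HTTP code and curl's exit code together in a single poll (variable names are assumptions; note that #215 later removed this enumeration entirely):

```bash
#!/usr/bin/env bash
# Sketch of the exit-code distinction suggested above.
url="${1:?usage: $0 <url>}"

status=$(curl -s -o /dev/null -w '%{http_code}' --connect-timeout 2 "$url")
curl_exit=$?  # exit status of the command substitution, i.e. curl's

if [ "$status" = "000" ] && [ "$curl_exit" -ne 7 ]; then
  # Exit 7 = ECONNREFUSED (keep waiting); anything else producing 000,
  # e.g. exit 6 = DNS failure, points at a misconfigured URL and will
  # never self-resolve, so fail fast instead of stalling for 300s.
  echo "curl failed with exit $curl_exit; aborting" >&2
  exit 1
fi
echo "status=$status curl_exit=$curl_exit"
```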
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
000 failures confirmed on grafana-enterprise@12.3.6, not just v13 or dev-preview images -- the issue is pure runner contention, not image version. Update action.yml description and README to reflect this.
Any curl error during startup is a transient condition for this action's use case (localhost Grafana container). Enumerating specific exit codes to fail-fast on is fragile — exit 56 broke a real CI run immediately after #213 merged. Simplify Phase 1: keep waiting on any curl error until startup_timeout expires or a non-000 response is received. Only Phase 2 fails fast (on 4xx), where a bad response genuinely indicates misconfiguration.
…ansient (#215)

* fix(wait-for-grafana): treat exit 56 (recv error) as transient startup

  The Phase 1 allowlist was too narrow: only exit 7 (ECONNREFUSED) was treated as safe to keep waiting on. Exit 56 (CURLE_RECV_ERROR) means the TCP connection was accepted but reset before HTTP headers were sent -- a valid transient state when Grafana's listener is up but not yet ready to serve. This caused premature failures on real CI runs.

  Flip the guard: fail fast only on codes that indicate misconfiguration and will never self-resolve (exit 3 = malformed URL, exit 6 = DNS failure). Keep waiting on all other non-zero exits, including:

  - exit 7 = ECONNREFUSED (port not yet bound)
  - exit 52 = got nothing (port open, server not yet responding)
  - exit 56 = recv error (connection accepted then reset during startup)

* fix(wait-for-grafana): remove fail-fast from Phase 1

  Any curl error during startup is a transient condition for this action's use case (localhost Grafana container). Enumerating specific exit codes to fail-fast on is fragile — exit 56 broke a real CI run immediately after #213 merged. Simplify Phase 1: keep waiting on any curl error until startup_timeout expires or a non-000 response is received. Only Phase 2 fails fast (on 4xx), where a bad response genuinely indicates misconfiguration.

* babysit: address Copilot review feedback on PR #215

  - Rename port_bound -> server_up and update all related messages to reflect that Phase 1 exits on any non-000 response, not just TCP bind
  - Update echo labels: "TCP bind" / "after bind" wording removed
  - Add --max-time 5 to Phase 1 curl so stalled connections are bounded
  - Log curl exit code in Phase 1 wait message to aid CI debugging
  - Clarify comment: note that action targets localhost, so persistent errors expire naturally via startup_timeout rather than fail-fast
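Per the commit messages above, a sketch of what the simplified Phase 1 looks like after #215. Names and messages are illustrative, not the script's actual code; `--max-time 5` and the logged curl exit code come from the review-feedback commit.

```bash
#!/usr/bin/env bash
# Sketch of Phase 1 after #215: any curl error is transient; only
# startup_timeout can end the wait.
url="${1:?usage: $0 <url>}"
startup_timeout="${2:-300}"

deadline=$((SECONDS + startup_timeout))
while :; do
  status=$(curl -s -o /dev/null -w '%{http_code}' \
           --connect-timeout 2 --max-time 5 "$url")
  curl_exit=$?
  if [ "$status" != "000" ]; then
    # Any real response, even an error page, ends Phase 1.
    echo "Server up (status $status); handing off to Phase 2"
    break
  fi
  if (( SECONDS >= deadline )); then
    echo "Server did not respond within ${startup_timeout}s (last curl exit: $curl_exit)" >&2
    exit 1
  fi
  echo "Waiting for server... (curl exit $curl_exit)"
  sleep 5
done
```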
Problem
On contested CI runners, Grafana's HTTP listener can take longer than 60 seconds to bind its TCP port. The previous single-phase loop treated every non-200 status (including `000` — curl's code for ECONNREFUSED) the same, so a slow-starting container would time out the job before Grafana was ever ready.

This shows up as `wait-for-grafana` timing out with repeated `Current status: 000` lines. It is a runner resource contention issue, not specific to any Grafana version — confirmed on `grafana-enterprise@12.3.6`, `grafana-enterprise@13.0.0`, and `dev-preview-react19`.

Solution
Split polling into two phases:
Phase 1 — TCP bind (`startupTimeout`, default 300s)

- Polls while curl returns `000` (ECONNREFUSED, curl exit code 7)
- `000` with exit 7 specifically means the process isn't listening yet — always safe to keep waiting

Phase 2 — Health check (`timeout`, default 60s)

- Begins once the response is anything other than `000`
- Fails fast on `4xx` — a client error indicates URL misconfiguration, not a timing issue
- `--connect-timeout 2 --max-time 10` bounds each curl call so a stalled connection can't outlast the health window

Backward compatibility

- Callers that don't pass `startupTimeout` get the 300s default
- `timeout` and `interval` inputs are unchanged
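For illustration, assuming the script takes its values positionally as url, expected status, interval, timeout, startup timeout (an assumption about the interface, not documented behavior, and the health URL is hypothetical), both call shapes work:

```bash
# Old 4-arg interface: startup_timeout falls back to the 300s default.
./wait-for-grafana/wait-for-grafana.sh http://localhost:3000/api/health 200 0.5 60

# New 5-arg interface: explicit 120s startup window.
./wait-for-grafana/wait-for-grafana.sh http://localhost:3000/api/health 200 0.5 60 120
```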