fix(wait-for-grafana): two-phase polling with smarter timeout handling #213

Merged
darrenjaneczek merged 4 commits into main from
fix/wait-for-grafana-smarter-timeout
Apr 22, 2026
Conversation

@darrenjaneczek
Contributor

@darrenjaneczek darrenjaneczek commented Apr 16, 2026

Problem

On contended CI runners, Grafana's HTTP listener can take longer than 60 seconds to bind its TCP port. The previous single-phase loop treated every non-200 status the same, including 000, which curl reports as the HTTP code when it received no HTTP response at all (for example on ECONNREFUSED). A slow-starting container would therefore time out the job before Grafana was ever ready.

This shows up as wait-for-grafana timing out with repeated "Current status: 000" lines. It is a runner resource contention issue, not specific to any Grafana version: it has been seen on grafana-enterprise@12.3.6, grafana-enterprise@13.0.0, and dev-preview-react19.
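
For readers unfamiliar with where the 000 comes from, here is a minimal sketch of a single probe. The helper name `probe` and the example URL are assumptions for illustration, not taken from the action's script; the curl flags themselves are standard.

```shell
# Hypothetical helper: print "<http_status> <curl_exit_code>" for one request.
# With --write-out '%{http_code}', curl prints 000 when it got no HTTP
# response; the exit code then says why (7 = connection refused).
probe() {
  local url="$1" status rc
  status=$(curl --silent --output /dev/null --write-out '%{http_code}' \
    --max-time 10 "$url")
  rc=$?
  echo "${status:-000} ${rc}"
}

# Usage (the URL is an assumption about what the action polls):
#   probe "http://localhost:3000/api/health"
```

Against a port that nothing is listening on, this would typically print "000 7", which is the state the old loop could not tell apart from an unhealthy but running server.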

Solution

Split polling into two phases:

Phase 1 — TCP bind (startupTimeout, default 300s)

  • Polls every 5s while curl returns 000 (ECONNREFUSED, curl exit code 7)
  • 000 with exit 7 specifically means the process isn't listening yet — always safe to keep waiting
  • Fails fast on other non-zero curl exit codes (e.g. exit 6 = DNS failure) that won't self-resolve
  • Long default (300s) covers worst-case runner contention without impacting fast-starting instances
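
The Phase 1 rules above could be sketched roughly as follows. This is an illustration, not the code in wait-for-grafana.sh: the function name `wait_for_bind`, its argument order, its return codes, and the `probe_cmd` hook (a command that prints "<status> <curl_exit>") are all assumptions.

```shell
# Sketch of Phase 1: wait while the port refuses connections.
# Returns 0 once anything other than 000 comes back, 1 on timeout,
# 2 on a curl error that will not self-resolve.
wait_for_bind() {
  local probe_cmd="$1" startup_timeout="${2:-300}" poll_interval="${3:-5}"
  local deadline out status rc
  deadline=$(( $(date +%s) + startup_timeout ))
  while :; do
    out=$("$probe_cmd")
    status=${out%% *}   # text before the first space
    rc=${out##* }       # text after the last space
    if [ "$status" != "000" ]; then
      return 0          # the port answered: hand over to the health phase
    fi
    if [ "$rc" -ne 0 ] && [ "$rc" -ne 7 ]; then
      echo "curl exit $rc will not self-resolve; failing fast" >&2
      return 2
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "port not bound after ${startup_timeout}s" >&2
      return 1
    fi
    sleep "$poll_interval"
  done
}
```

Taking the probe as a parameter keeps the loop testable without a running Grafana; a real script would more likely call curl inline.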

Phase 2 — Health check (timeout, default 60s)

  • Kicks in the moment the port responds with anything other than 000
  • Keeps the existing fast interval (default 0.5s) for snappy success detection
  • Fails fast on 4xx — a client error indicates URL misconfiguration, not a timing issue
  • --connect-timeout 2 --max-time 10 bounds each curl call so a stalled connection can't outlast the health window
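
A comparable sketch of Phase 2, under the same caveats: `wait_for_health`, its argument order, its return codes, and the `probe_cmd` hook are hypothetical, and only the behavior (fast polling, 4xx fail-fast, bounded by timeout) comes from the description above.

```shell
# Sketch of Phase 2: poll quickly until the expected status arrives.
# Returns 0 on the expected status, 1 on timeout, 2 on an unexpected 4xx.
wait_for_health() {
  local probe_cmd="$1" expected="${2:-200}" timeout="${3:-60}" interval="${4:-0.5}"
  local deadline out status
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    out=$("$probe_cmd")
    status=${out%% *}
    case "$status" in
      "$expected") return 0 ;;   # checked first, so an expected 4xx still passes
      4??)
        echo "got $status: likely a URL misconfiguration" >&2
        return 2 ;;
    esac
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "expected $expected, last status $status after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

In a real script, each probe here would be a curl call carrying --connect-timeout 2 --max-time 10, so that no single request can outlast the health window.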

Backward compatibility

  • Existing callers that don't pass startupTimeout get the 300s default
  • The timeout and interval inputs are unchanged
  • If Grafana starts quickly (the common case), Phase 1 exits immediately and Phase 2 behaves exactly as before
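
One way to get this kind of backward compatibility in a shell script is a parameter-expansion default on the new positional argument. The sketch below assumes startup_timeout is the fifth argument (as a later commit message on this PR indicates); the order and names of the first four arguments and the function name `parse_args` are guesses for illustration only.

```shell
# Hypothetical argument parsing: callers using the old 4-argument form
# still get a 300s startup timeout instead of an empty value.
parse_args() {
  url="$1"
  expected_status="${2:-200}"
  timeout="${3:-60}"
  interval="${4:-0.5}"
  startup_timeout="${5:-300}"   # new input; absent in old call sites
}

# Old-style call: parse_args "$URL" 200 60 0.5
# New-style call: parse_args "$URL" 200 60 0.5 120
```

Without the `:-300` default, an old-style call would leave startup_timeout empty, and a loop comparing against it could exit Phase 1 immediately.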

Split the polling loop into two phases:

1. TCP-bind phase (new `startupTimeout` input, default 300s): polls every
   5s while curl returns 000 (ECONNREFUSED). Grafana 13 and dev-preview
   images on contested runners can take >60s to bind the port; the old
   single-phase timeout would fail the job before the process was ready.

2. Health phase (existing `timeout` input, default 60s): once the port
   responds with anything other than 000, switches to fast polling
   (existing `interval`, default 0.5s) until the expected status code
   is received.

Additional improvements:
- Fail fast on 4xx responses in the health phase — these indicate a URL
  misconfiguration rather than a timing issue and won't self-resolve.
- Pass --connect-timeout 2 to curl in phase 1 so each attempt doesn't
  hang waiting for a connection that will never come.

Backward compatible: callers that don't pass `startupTimeout` get the
300s default, which is strictly better than the previous 60s limit.
@cla-assistant

cla-assistant Bot commented Apr 16, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the wait-for-grafana composite action to be more resilient to slow-starting Grafana containers on CI by splitting readiness polling into a TCP-bind phase followed by a health-check phase.

Changes:

  • Added a Phase 1 “TCP bind” polling loop that waits while curl reports 000, with a new startupTimeout input (default 300s).
  • Kept Phase 2 health polling for the expected HTTP status code, and added a fast-fail path for 4xx responses.
  • Updated the composite action to pass the new startup timeout input through to the script.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| wait-for-grafana/wait-for-grafana.sh | Implements two-phase polling (bind then health) and adds 4xx fast-fail behavior. |
| wait-for-grafana/action.yml | Introduces startupTimeout input and wires it into the script invocation. |

Comment thread wait-for-grafana/wait-for-grafana.sh Outdated
Comment thread wait-for-grafana/wait-for-grafana.sh
Comment thread wait-for-grafana/wait-for-grafana.sh
- Default startup_timeout to 300 if 5th arg is absent, preventing
  immediate Phase 1 exit when called with the old 4-arg interface
- Distinguish ECONNREFUSED (curl exit 7) from other 000-producing
  errors (DNS failure = exit 6, TLS errors, etc.); fail fast on
  anything that isn't exit 7, avoiding a 300s stall on misconfigured URLs
- Add --connect-timeout 2 and --max-time 10 to Phase 2 curl calls
  so a stalled connection cannot outlast the health timeout window
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


Comment thread wait-for-grafana/action.yml
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


@darrenjaneczek darrenjaneczek marked this pull request as ready for review April 16, 2026 23:10
@darrenjaneczek darrenjaneczek requested review from a team as code owners April 16, 2026 23:10
000 failures confirmed on grafana-enterprise@12.3.6, not just v13 or
dev-preview images -- the issue is pure runner contention, not image
version. Update action.yml description and README to reflect this.
@darrenjaneczek darrenjaneczek merged commit 952ecf2 into main Apr 22, 2026
10 checks passed
@darrenjaneczek darrenjaneczek deleted the fix/wait-for-grafana-smarter-timeout branch April 22, 2026 13:32
@github-project-automation github-project-automation Bot moved this from 🔬 In review to 🚀 Shipped in Grafana Catalog Team Apr 22, 2026
darrenjaneczek added a commit that referenced this pull request Apr 22, 2026
Any curl error during startup is a transient condition for this action's
use case (localhost Grafana container). Enumerating specific exit codes
to fail-fast on is fragile — exit 56 broke a real CI run immediately
after #213 merged.

Simplify Phase 1: keep waiting on any curl error until startup_timeout
expires or a non-000 response is received. Only Phase 2 fails fast
(on 4xx), where a bad response genuinely indicates misconfiguration.
sunker pushed a commit that referenced this pull request Apr 23, 2026
…ansient (#215)

* fix(wait-for-grafana): treat exit 56 (recv error) as transient startup

The Phase 1 allowlist was too narrow: only exit 7 (ECONNREFUSED) was
treated as safe to keep waiting on. Exit 56 (CURLE_RECV_ERROR) means
the TCP connection was accepted but reset before HTTP headers were sent
-- a valid transient state when Grafana's listener is up but not yet
ready to serve. This caused premature failures on real CI runs.

Flip the guard: fail fast only on codes that indicate misconfiguration
and will never self-resolve (exit 3 = malformed URL, exit 6 = DNS
failure). Keep waiting on all other non-zero exits, including:
  exit 7  = ECONNREFUSED (port not yet bound)
  exit 52 = got nothing (port open, server not yet responding)
  exit 56 = recv error (connection accepted then reset during startup)
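
The flipped guard in this commit amounts to a denylist rather than an allowlist. A minimal sketch, with a hypothetical function name and labels; the exit-code meanings are curl's documented ones. The next commit in this same PR drops the fail-fast branch from Phase 1 altogether.

```shell
# Hypothetical classifier for a curl exit code seen during startup.
classify_curl_exit() {
  case "$1" in
    0)   echo ok ;;          # curl itself succeeded
    3|6) echo fatal ;;       # malformed URL / DNS failure: will not self-resolve
    *)   echo transient ;;   # 7, 52, 56, and anything else: keep waiting
  esac
}
```

The advantage over listing "safe" codes is that an unanticipated exit code defaults to waiting instead of failing the job.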

* fix(wait-for-grafana): remove fail-fast from Phase 1

Any curl error during startup is a transient condition for this action's
use case (localhost Grafana container). Enumerating specific exit codes
to fail-fast on is fragile — exit 56 broke a real CI run immediately
after #213 merged.

Simplify Phase 1: keep waiting on any curl error until startup_timeout
expires or a non-000 response is received. Only Phase 2 fails fast
(on 4xx), where a bad response genuinely indicates misconfiguration.

* babysit: address Copilot review feedback on PR #215

- Rename port_bound -> server_up and update all related messages to
  reflect that Phase 1 exits on any non-000 response, not just TCP bind
- Update echo labels: "TCP bind" / "after bind" wording removed
- Add --max-time 5 to Phase 1 curl so stalled connections are bounded
- Log curl exit code in Phase 1 wait message to aid CI debugging
- Clarify comment: note that action targets localhost, so persistent
  errors expire naturally via startup_timeout rather than fail-fast
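
Putting the #215 changes together, the simplified Phase 1 might look roughly like the sketch below. The function name, argument order, and message wording are assumptions; the flags (--connect-timeout 2, --max-time 5), the logged curl exit code, and the "no fail-fast" behavior come from the commit messages above.

```shell
# Sketch of the simplified Phase 1: wait on any curl error until a
# non-000 response arrives or startup_timeout expires.
wait_for_server_up() {
  local url="$1" startup_timeout="${2:-300}" poll_interval="${3:-5}"
  local deadline status rc
  deadline=$(( $(date +%s) + startup_timeout ))
  while :; do
    status=$(curl --silent --output /dev/null --write-out '%{http_code}' \
      --connect-timeout 2 --max-time 5 "$url")
    rc=$?
    if [ -n "$status" ] && [ "$status" != "000" ]; then
      return 0   # server_up: any HTTP response ends Phase 1
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "server not up after ${startup_timeout}s (last curl exit: $rc)" >&2
      return 1
    fi
    # No fail-fast: persistent errors simply run out the startup timeout.
    echo "waiting for server (status: ${status:-000}, curl exit: $rc)"
    sleep "$poll_interval"
  done
}
```

Because the action targets a local container, the worst case for a genuinely misconfigured URL is waiting out startup_timeout, which this design accepts in exchange for never misclassifying a transient error.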