Skip to content

[serve] Enable HAProxy redispatch and retry-on for backend resilience#63159

Closed
harshit-anyscale wants to merge 27 commits into
ray-project:masterfrom
harshit-anyscale:serve-haproxy-redispatch
Closed

[serve] Enable HAProxy redispatch and retry-on for backend resilience#63159
harshit-anyscale wants to merge 27 commits into
ray-project:masterfrom
harshit-anyscale:serve-haproxy-redispatch

Conversation

@harshit-anyscale
Copy link
Copy Markdown
Contributor

@harshit-anyscale harshit-anyscale commented May 6, 2026

Summary

Add option redispatch and retry-on conn-failure empty-response to each generated HAProxy backend block (python/ray/serve/_private/haproxy_templates.py). This makes the existing default retries 3 actually pick a different server on retry, and extends retry coverage to the safe failure modes (connect failure, empty response).

Motivation

During load testing of a multi-app Serve service with HAProxy ingress, we observed a tail of HTTP 5xx returned to clients despite zero 5xx coming from upstream replicas. The HAProxy Prometheus metrics showed:

  • haproxy_backend_http_responses_total{code="5xx"} ≈ 0 (replicas weren't returning 5xx)
  • haproxy_frontend_http_responses_total{code="5xx"} > 0 (clients saw 5xx)

Root cause: when a replica enters the window between dying and being marked DOWN by HAProxy's health check, requests still get dispatched to it. HAProxy's default retries 3 retries against the same dead server (no redispatch), exhausts retries, falls through to the SERVE_PROXY_ACTOR-fallback backup, and ultimately surfaces a synthesized 502 (which the existing errorfile 502 → 500.http config maps to a 500 visible to the client).

option redispatch makes the retries pick a different healthy server, eliminating most of this tail. retry-on conn-failure empty-response extends retry coverage from "connect-only" (default) to also handle the case where the connection succeeded but the server died before responding.

Why these specific options

Directive Included? Reason
option redispatch Converts default retries 3 from useless-on-same-dead-server to actually useful.
retry-on conn-failure Connect refused / RST during handshake — request never reached server, safe to retry.
retry-on empty-response Connection accepted but server closed before responding — safe to retry, no response bytes seen by client.

The two retry conditions chosen are the body-stream-safe subset that don't risk double-execution of side-effecting application logic.

🤖 Generated with Claude Code

Add `option redispatch` and `retry-on conn-failure empty-response` to
each generated backend block in the HAProxy template.

Without these directives, when a Serve replica becomes unreachable
between health-check intervals (e.g. during autoscaling churn), HAProxy
exhausts its default `retries 3` against the same dead server, falls
through to the slow `SERVE_PROXY_ACTOR-fallback` backup, and surfaces a
synthesized 502 (remapped to 500 by the existing errorfile config) to
the client. With `option redispatch`, the retries pick a fresh server.
With `retry-on conn-failure empty-response`, retries trigger on
connection-establishment failures (replica process gone) and
connected-but-no-response cases (replica died after accept) — both are
body-stream-safe and don't risk double-execution of side-effecting
requests.

The two `retry-on` conditions intentionally exclude `response-timeout`
(no `timeout server` is set in the default config so it would never
fire) and `5xx` (would mask `BackPressureError` from the application
layer, causing thrash instead of respecting back-pressure).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshit-anyscale harshit-anyscale requested a review from a team as a code owner May 6, 2026 17:04
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the HAProxy configuration templates and associated tests to include "option redispatch" and "retry-on conn-failure empty-response" for improved reliability. The review feedback suggests explicitly defining the retry count and adding a note regarding the HAProxy 2.0+ version requirement for the "retry-on" directive to ensure environment compatibility and consistent behavior.

I am having trouble creating individual review comments. Click here to see my feedback.

python/ray/serve/_private/haproxy_templates.py (94-100)

medium

The retry-on directive was introduced in HAProxy 2.0. If the environment uses an older version of HAProxy (e.g., stock versions on older LTS distributions), this configuration will cause HAProxy to fail at startup.

Additionally, while HAProxy defaults to retries 3, it is more robust to set this explicitly in the configuration to ensure consistent behavior across different environments and custom HAProxy builds.

    # On a server failure, retry on a different server (combined with the
    # explicit `retries 3`) instead of giving up to the backup or returning
    # a synthesized 5xx. retry-on covers connection-establishment failures
    # and "connected but no response yet" cases — both are body-stream-safe
    # and don't risk double-execution.
    # Note: retry-on requires HAProxy 2.0+.
    option redispatch
    retries 3
    retry-on conn-failure empty-response

python/ray/serve/tests/test_haproxy_api.py (329-335)

medium

Updating the test fixture to match the suggested template changes (explicit retries 3 and version note).

    # On a server failure, retry on a different server (combined with the
    # explicit `retries 3`) instead of giving up to the backup or returning
    # a synthesized 5xx. retry-on covers connection-establishment failures
    # and "connected but no response yet" cases — both are body-stream-safe
    # and don't risk double-execution.
    # Note: retry-on requires HAProxy 2.0+.
    option redispatch
    retries 3
    retry-on conn-failure empty-response

python/ray/serve/tests/test_haproxy_api.py (354-360)

medium

Updating the test fixture to match the suggested template changes (explicit retries 3 and version note).

    # On a server failure, retry on a different server (combined with the
    # explicit `retries 3`) instead of giving up to the backup or returning
    # a synthesized 5xx. retry-on covers connection-establishment failures
    # and "connected but no response yet" cases — both are body-stream-safe
    # and don't risk double-execution.
    # Note: retry-on requires HAProxy 2.0+.
    option redispatch
    retries 3
    retry-on conn-failure empty-response

@harshit-anyscale harshit-anyscale self-assigned this May 6, 2026
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label May 6, 2026
@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label May 6, 2026
harshit-anyscale and others added 2 commits May 7, 2026 03:45
With HAProxy redispatch enabled, requests to a deployment whose only
replica died mid-request now reach the backup proxy actor instead of
being given up on by HAProxy. The backup raises DeploymentUnavailableError
(→ 503) when the controller has broadcast unavailability, or the request
fails as ActorDiedError (→ 500) if it hasn't. Both are valid 5xx
responses for a deployment that is genuinely unavailable; loosen the
assertion to accept either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related knobs that matter for HAProxy ingress reliability under
autoscaling churn, both previously hardcoded:

1. RAY_SERVE_HAPROXY_RELOAD_TIMEOUT_S (default: 5)
   The wait window in _wait_for_hap_availability() used to be 5s and
   hardcoded. With many backends/servers under load, HAProxy's startup
   (config parse + initial health checks against every server + socket
   transfer) regularly exceeds 5s, the new process is killed, and the
   old process keeps running with stale config. Since Ray's
   HAProxyManager regenerates config rather than using HAProxy's runtime
   API, a failed reload silently drops new replica registrations:
   primaries are added to the broadcast but never to HAProxy's server
   list, leaving the pool to drift smaller than reality and concentrating
   load on a shrinking subset of replicas (or on the single fallback
   ProxyActor).

2. RAY_SERVE_HAPROXY_RETRIES (default: 3)
   Number of connection-level retries. Combined with the existing
   `option redispatch` (added in the same PR series), each retry picks
   a different healthy server. Raising this is useful in environments
   with autoscaling churn where any given primary may briefly be
   unreachable before another sibling can serve the request.

Both default to the previous compiled-in values, preserving behavior
for users who don't opt in. Recommended overrides for high-churn
clusters:

    RAY_SERVE_HAPROXY_RELOAD_TIMEOUT_S=30
    RAY_SERVE_HAPROXY_RETRIES=100

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread python/ray/serve/_private/haproxy.py
harshit-anyscale and others added 3 commits May 7, 2026 15:47
Always return 200 from HAProxy's /-/healthz when HAProxy itself is up
(unless drain mode is active). Previously, the endpoint returned 503
whenever every backend reported nbsrv() == 0, which happens not only
during real outages but also briefly during reload transitions
(initial-check windows where servers are present in the config but
HAProxy's per-server health checks haven't completed yet).

Tying ALB target health to backend pool state via this endpoint causes
flapping cascades: a 1-3 second 503 window during a reload makes the
ALB mark the target unhealthy for minutes (interval × healthy_threshold
to recover). With reloads happening repeatedly under autoscaling
churn, all targets end up unhealthy simultaneously, the ALB has no
healthy targets, and every client request becomes an ALB-synthesized
502.

After this change, /-/healthz reflects only HAProxy's own liveness
(the question the ALB target health check actually needs answered).
Backend availability is still surfaced to clients per-request: if
HAProxy can't reach any backend, it returns 503 from the actual
request path — which is the correct HTTP semantic and uncorrelated
across requests, not a cluster-wide blackout.

Trade-off: clients may now see HAProxy-returned 503s for the small
fraction of requests that genuinely have no backend, instead of
ALB-returned 502s. Net effect is a substantial volume reduction in
client-visible failures because the cascade pattern (one bad probe
blanking out an entire target for minutes) is eliminated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At Ray Serve scales with many replicas under autoscaling churn, every
controller broadcast triggers a full HAProxy config regeneration plus a
graceful reload. Each reload blocks the proxy actor's event loop long
enough (config-string templating, file write, fork+exec of the new
HAProxy process, initial server-state probes) that the actor can miss
the controller's `check_health` window. Three consecutive missed checks
trip PROXY_HEALTH_CHECK_UNHEALTHY_THRESHOLD; the controller kills and
recreates the actor; the new actor immediately receives a broadcast and
starts another reload; repeat. Result: every proxy in the cluster ends
up cycling continuously, the ALB sees its targets as permanently
unhealthy, and clients receive 502s from "no healthy targets" not from
any HAProxy or replica issue.

This change adds an incremental update path: when the only difference
between the previous and new backend configs is the set of servers
within existing backends (the dominant case during autoscaling churn —
adds/removes of individual replicas), apply the change via HAProxy's
runtime API (`add server` / `del server` over the admin socket)
instead of regenerating the config and reloading. Runtime-API server
operations are sub-millisecond per command and don't block the actor's
event loop, so the proxy stays responsive to the controller even while
processing rapid-fire broadcasts.

Structural changes — backend added or removed, ACL/path-prefix change,
fallback-server change, timeout or health-check parameter change —
still take the existing reload path. The diff is conservative: any
field-level discrepancy outside the servers list forces a reload.

If any single runtime-API command fails (socket transient, HAProxy
version doesn't support `add server`, etc.) the change is abandoned
mid-flight and the caller falls back to a reload, so partial state is
never persisted.

Internal state (HAProxyApi.backend_configs and the on-disk config file)
is updated either way so a subsequent reload — for instance, triggered
by a future structural change — sees the current set of servers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review of the prior commit flagged three issues; this addresses
them.

(High) `_send_socket_command` returns HAProxy's response body verbatim,
including error strings — HAProxy conveys most command-level rejections
through the response (e.g. "[ALERT] (123) : 'add server' : Backend X
not found.") rather than via transport-level exceptions. The previous
apply path only caught transport exceptions, so it would treat a rejection
as success and skip the fallback reload, leaving live HAProxy state
diverged from `self.backend_configs`. Add `_send_runtime_command` which
wraps the send + checks the response for HAProxy's error markers
(log-level prefixes plus a small set of common error tokens) and raises
RuntimeError when any are present. Mutating commands in
`try_apply_servers_dynamically` use the new wrapper so application-level
rejections trigger the existing fallback-to-reload path.

(Low) Add `BackendConfig.name` to the structural-change comparison in
`_compute_backend_diff`. The current callers in this file always set
the dict key equal to the BackendConfig.name, but the helper itself
shouldn't assume that invariant — a name change has to reload because
it affects the frontend ACL labels and `use_backend` directives.

(Doc) Update the comment that claimed the apply loop "avoids leaving
HAProxy in a partially-updated state" — commands are sent sequentially
and partial state is live until the fallback reload converges things
back to the desired state. The fallback reload is what guarantees
consistency, not the abort-on-error.

Tests: add coverage for the rejection-via-response-string case (the
common one) and direct unit tests for `_send_runtime_command`'s
validator behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
error_markers = (
"[alert]",
"[warning]",
"[notice]",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice-level marker may defeat incremental optimization entirely

Medium Severity

Including "[notice]" in error_markers within _send_runtime_command may cause the method to raise RuntimeError on successful HAProxy operations. HAProxy's CLI can return [NOTICE]-level messages for non-error events (e.g., server state transitions after enable server, or registration confirmations after add server in some versions). If the HAProxy version in use returns [NOTICE] on any successful runtime command, the incremental path in try_apply_servers_dynamically will always exception out and fall back to a full reload, completely negating the performance optimization that this PR introduces.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit efbe04e. Configure here.

harshit-anyscale and others added 2 commits May 8, 2026 08:02
When a prior reload's `-x admin.sock` socket transfer fails (admin
socket overloaded under autoscaling churn), the new HAProxy exits 1
before binding listeners. The actor's `self._proc` ends up pointing at
that dead process, and every subsequent `_apply_backend_update` raises
"HAProxy crashed during startup: exit code 1" from
`_wait_for_hap_availability(old_proc)` -- the actor is stuck with no
listener on 8000 until the controller kills it.

Detect `old_proc` already exited and start a fresh HAProxy without
`-sf`/`-x`. Drops in-flight connections briefly but lets the actor
return to a healthy state instead of staying down.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under autoscaling churn the Serve controller fires target-group /
fallback-target broadcasts tens of ms apart; without coalescing, each
one runs its own runtime-API command burst on the HAProxy admin
socket. The CLI mux serializes everything (runtime API, stats, the
`-x` socket transfer used by reload) so command-level timeouts and
`-x` failures cluster together once admin-socket pressure exceeds
HAProxy's processing rate.

Coalesce broadcasts in `HAProxyManager` into a single sleeping flush
task. Updates arriving during the sleep are absorbed implicitly via
in-place writes to `self._target_groups` / fallback fields; updates
during the apply re-arm a `_coalesce_pending` flag so the trailing
broadcast in a flurry isn't dropped.

Window is configurable via `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S`
(default 0.5s, set 0 to disable). Trade-off: each broadcast incurs up
to one window of additional latency before reaching HAProxy, which is
small relative to typical replica-start time and well below the rate
at which admin-socket saturation begins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# defensively (in case other fields on ServerConfig change in the
# future) and skip the reload.
self.set_backend_configs(new_backend_configs)
return True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed reload never retried due to premature state update

Medium Severity

try_apply_servers_dynamically calls set_backend_configs(new_backend_configs) on every code path before returning, including when it returns False for structural changes. In _apply_backend_update, if the subsequent reload() call then raises an exception, self.backend_configs already reflects the new desired state. On the next broadcast with identical target groups, _compute_backend_diff sees no difference and try_apply_servers_dynamically returns True (no-op), so the reload is never retried. HAProxy permanently serves stale routing until a different config change arrives. The old code unconditionally regenerated and reloaded on every broadcast, ensuring convergence after transient failures.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ce527e4. Configure here.

# Don't let an apply failure kill the coalescer. The next broadcast
# will schedule a fresh flush task; the underlying error is already
# logged by `_apply_backend_update` / `try_apply_servers_dynamically`.
logger.error(f"Coalesced backend apply failed: {e}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coalescer silently drops pending broadcast on apply failure

Low Severity

When _apply_backend_update raises inside _coalesce_and_apply, the except block logs the error and exits. If a broadcast arrived during the failed apply (setting _coalesce_pending = True), the task ends without re-checking the flag. No new task is scheduled until the next broadcast arrives. If no further broadcasts come, the pending update is silently lost and HAProxy stays stale indefinitely.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ce527e4. Configure here.

harshit-anyscale and others added 2 commits May 8, 2026 09:35
The dead-old-proc fresh-start path didn't account for orphaned HAProxy
processes still holding ports 8000 / 9101 / 8404. When a prior reload's
`-x admin.sock` socket transfer fails, the new process exits without
sending SIGUSR1 to the old (the SIGUSR1 only fires after binding
succeeds), and the old process keeps its listening sockets fully open
indefinitely while the actor logs it as a finished old proc in
`_old_procs`. Subsequent fresh-start attempts then race against those
orphans for ports and crash with `cannot bind socket (Address already
in use)`, only recovering when the orphans naturally drain (tens of
seconds).

SIGKILL tracked old procs before the fresh-start bind so the new
process can take the ports immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The controller's `check_health` was calling `is_running()`, which sends
`show info` over the HAProxy admin socket. Under load that socket is
shared with our runtime-API commands and the `-x` reload FD transfer,
and saturates: `show info` exceeds the 5s `_send_socket_command`
timeout, `is_running()` returns False, the controller counts it as a
missed health check, and after 3 consecutive misses (~30-50s) the
controller marks the proxy unhealthy and kills the actor. The killed
actor's HAProxy child gets orphaned, holds listener ports, and the
replacement actor's HAProxy crashes with EADDRINUSE -- compounding
the cascade we were trying to avoid.

Replace the admin-socket check with a direct probe of the HAProxy HTTP
listener at `/-/healthz`. The listener is served by worker threads
independent of the CLI mux, so it stays responsive when admin socket
is saturated. Matches what the upstream ALB target group health check
sees, so proxy self-reported health aligns with load-balancer view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread python/ray/serve/_private/haproxy.py Outdated
try:
reader, writer = await asyncio.wait_for(
asyncio.open_connection("127.0.0.1", self._http_options.port),
timeout=2.0,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Health probe hardcodes 127.0.0.1 ignoring configured bind address

Medium Severity

_probe_http_listener always connects to "127.0.0.1", but HAProxy's frontend_host property maps a non-loopback http_options.host (e.g., "192.168.1.5") directly into the bind directive — meaning HAProxy only listens on that specific interface. In that configuration the probe fails with connection refused on every health check, causing the controller to consider the proxy unhealthy and kill/recreate it in a loop, even though HAProxy is perfectly functional on its configured address.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit aded8a7. Configure here.

harshit-anyscale and others added 5 commits May 8, 2026 11:08
`_wait_for_hap_availability` polled `is_running()` (which sends
`show info` over the HAProxy admin socket) to decide whether the new
HAProxy process had finished startup. Under load the admin socket is
shared with runtime-API commands and the `-x` FD transfer, so a benign
`show info` can take >5s to return. With the 15s reload timeout, only
a few iterations get a chance to succeed; if all of them hit a slow
admin socket we abort the reload as "did not enter running state
within 15 seconds" -- even though the new HAProxy is actually serving.

Switch the readiness probe to a direct HTTP HEAD/GET to /-/healthz on
the frontend listener. The listener is independent of the CLI mux and
stays responsive when the admin socket is saturated. Crash detection
still uses `proc.returncode` polling, so a genuinely failed startup
is still caught quickly.

Also make `_save_server_state` failures non-fatal in `_graceful_reload`.
The state file is purely an optimization for preserving per-server
connection counts across reload; a `show servers state` admin-socket
timeout used to abort the entire reload, which then cascaded as the
next broadcast retried with the same overloaded socket.

Refactor the HTTP probe into `HAProxyApi.probe_http_listener` so both
`HAProxyManager.check_health` (controller-facing) and
`_wait_for_hap_availability` (internal reload) share one implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit only treated `HTTP/1.x 200` as alive, which broke
two legitimate states where HAProxy is fully functioning but /-/healthz
returns 503:

- Drain mode: `disable()` flips `pass_health_checks=False` so
  /-/healthz returns 503 to signal the upstream LB to stop routing.
  With strict-200, `check_health` returned False during drain and the
  controller would mark the proxy UNHEALTHY, killing a draining proxy
  and triggering the orphan-port cascade we just fixed elsewhere.

- Initial startup: when `has_received_routes=False` (before the first
  controller broadcast), /-/healthz returns 503. Strict-200 made
  `_wait_for_hap_availability` time out during initial `start()`, so
  the actor never reached HEALTHY.

The signal we want from the probe is "HAProxy is alive and parsing
HTTP", not "HAProxy says traffic is healthy". Match the old
admin-socket `is_running()` behavior, which returned True in both
above states.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The HTTP probe takes up to 4s per attempt (2s connect + 2s read), which
exhausted the 5s startup budget in 1-2 iterations under load and made
initial HAProxy startup fail with "did not enter running state within
5 seconds". Even though HAProxy was bound to its port, its worker
threads were busy replaying the server-state file and dispatching
health-check setup for hundreds of servers -- so the first HTTP
request to /-/healthz could take a few hundred ms to be processed,
and the cumulative wait blew the budget.

Split the probe in two:

- `_is_listener_bound()`: TCP open + close, 0.5s timeout. Used inside
  `_wait_for_hap_availability` for startup/reload readiness. Iterates
  fast (sub-millisecond when port not yet bound, sub-millisecond when
  bound) so we get many attempts per second within the budget. The
  signal "kernel accepted my SYN" is sufficient to know HAProxy
  finished binding listeners.

- `probe_http_listener()`: full HTTP GET, kept for the controller's
  `check_health`. Runs every ~10s on stable proxies where HAProxy is
  long past startup, so the per-attempt cost doesn't matter. Verifies
  HAProxy is actually parsing HTTP, which is a stronger signal than
  TCP-bound for ongoing health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_start_and_wait_for_haproxy` had a hardcoded 5s timeout, applied to
both initial actor startup (via `start()`) and graceful reload (via
`_graceful_reload`). With large configs (many backends + large
server-state file replay) HAProxy can need more than 5s to reach the
bind step, especially on hosts under CPU pressure from a cascade
already in progress -- which manifests as "HAProxy did not enter
running state within 5 seconds" right when we need recovery to be
robust.

Default to `RAY_SERVE_HAPROXY_RELOAD_TIMEOUT_S` (15s) to match the
timeout already used elsewhere in the reload path. Belt-and-suspenders
alongside the TCP-only readiness probe: the probe makes per-attempt
cost cheap, the larger budget tolerates genuinely slow startup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reloads can fail with "did not enter running state within N seconds"
without telling us where the time went. Add structured timing to the
graceful reload path so failures expose the bottleneck:

- `_start_and_wait_for_haproxy`: logs spawn vs wait_for_ready time and
  the args used (visible whether `-sf`/`-x` are involved).
- `_graceful_reload`: logs each phase (old_proc_check, save_state,
  new_proc_start) with cumulative duration on success and failure.
- `_wait_for_hap_availability`: logs probe-attempt count on success
  (debug) and includes attempt count in the timeout error.

After deploy, "HAProxy graceful reload OK in 12.34s (old_proc_check=0.05s
save_state=2.10s new_proc_start=10.19s)" or the corresponding failure
message tells you exactly which phase is slow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread python/ray/serve/_private/haproxy.py
harshit-anyscale and others added 3 commits May 8, 2026 18:56
Two independent quick wins to reduce HAProxy reload latency, both via
the rendered config:

1. Split the admin socket into two:
   - `<socket>.fd` (`expose-fd listeners`) — dedicated to the `-x`
     FD-transfer during graceful reload.
   - `<socket>` — runtime API (`add server`, `del server`, `show stat`)
     and stats. No `expose-fd listeners`.

   Previously a single socket carried both. With nbthread > 1 HAProxy
   parallelizes CLI connections across threads, but the `_getsocks`
   exchange during reload still has to share the same connection-slot
   pool with our runtime-API command stream. Splitting them keeps the
   FD-transfer socket idle except during reload, so reloads don't
   queue behind in-flight `add server` commands.

   `transfer_socket_path` is derived deterministically from
   `socket_path` (`.fd` suffix), so existing per-node path scoping
   carries over. `-x` now points at the transfer socket;
   `_send_socket_command` continues to use `socket_path`.

2. Add `default-server init-addr last,libc,none` to the defaults
   block. HAProxy's default `init-addr` order consults libc DNS even
   for IP-literal servers, which can stall startup for seconds when a
   config has thousands of server entries. `last,libc,none` reuses
   the previous process's resolved address (free via the server-state
   file), only consults libc if needed, and accepts "no IP yet" rather
   than failing startup -- so HAProxy reaches its bind step
   immediately.

Updates the rendered-config golden in `test_haproxy_api.py` to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each `add server`/`enable server`/`disable server`/`del server` was
issued as its own Unix socket round-trip via `_send_runtime_command`.
A real autoscaling broadcast (say 50 server changes across 24 backends)
produced ~200 sequential per-command connections to the local HAProxy
admin socket. Under traffic load each round-trip queues behind HAProxy
worker threads serving HTTP, and the cumulative wallclock exceeds the
5s `_send_socket_command` timeout. The apply fails, falls back to
reload, and (in the worst case) cascades into a cluster-wide event of
runtime-API timeouts + fresh-start recoveries.

Add `_send_runtime_commands(commands, chunk_size=64)` which joins
batches of commands with `;` (HAProxy CLI's native multi-command
separator) and sends each chunk over a single admin-socket connection.
Same error-marker substring detection as the single-command path; loses
per-command attribution on failure but the caller's response (fall back
to reload) is the same regardless.

Apply loop in `try_apply_servers_dynamically` switched from per-command
awaits to building a command list and sending via the batched API. For
a 200-command apply, this drops the round-trip count from 200 to
ceil(200/64) = 4, amortizing the per-connection accept/dispatch cost
into a single CLI session per chunk where HAProxy processes commands
in-order without re-acquiring the socket.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chunk timings

Under autoscaling churn at production scale, broadcasts of dozens to
hundreds of `add/del/disable/enable` commands routinely exceeded the
hardcoded 5s admin-socket timeout, falling back to a full reload that
amplifies load cluster-wide. Two tactical fixes:

* `RAY_SERVE_HAPROXY_RUNTIME_CHUNK_SIZE` (default 16, was 64): per-chunk
  timeout budget instead of per-batch budget. Smaller chunks complete
  the runtime-API path instead of failing into the slow reload path.
* `RAY_SERVE_HAPROXY_SOCKET_TIMEOUT_S` (default 15, was hardcoded 5):
  wider per-chunk budget so HAProxy can finish processing even when the
  CLI mux is queueing behind heavy HTTP worker dispatch.

Both knobs are env-configurable so they can be tuned at the next load
test without a redeploy.

`_send_runtime_commands` now logs per-chunk wallclock on failure (which
chunk, how long before it failed, prior chunks' times, total elapsed)
and a summary on success, so we can distinguish "just over the timeout
budget" (smaller chunks help) from "HAProxy genuinely wedged" (a
different problem). Success log is info-level + stderr so chunk-time
distributions are visible in normal proxy stderr alongside the
existing apply summaries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread python/ray/serve/_private/haproxy.py Outdated
Replaces the dynamic `add server` / `del server` / `disable server` /
`enable server` runtime-API model with HAProxy's `server-template`
directive plus per-backend slot pools.

Each backend's rendered config now contains a pre-allocated pool of
`server-template srv 1-N ...` slots that start disabled at a placeholder
address. The proxy actor binds slot ↔ replica mappings in-process
(BackendSlotPool, first-free via min-heap) and uses `set server addr
... port ...` + `set server state ready/maint` to repurpose slots at
runtime. This eliminates server-list mutation under HAProxy's CLI mux
lock (which contends with HTTP worker dispatch under heavy traffic) and
roughly halves command count per broadcast for replica removals.

Architecture:
* `RAY_SERVE_HAPROXY_TOTAL_SLOTS` (4096) caps total slots across all
  backends; `_compute_slot_split` partitions the budget per backend at
  config-generation time using replica-count * headroom proportional
  split with a per-backend minimum floor.
* `BackendSlotPool` per backend tracks replica → slot. Lives on
  `HAProxyApi` so it's adjacent to the runtime-API call sites.
* `_rebind_slot_pools()` runs after every successful (re)load:
  recreates pools sized from BackendConfig.slot_pool_size, then issues
  one batched runtime-API call to populate slots from the current
  backend_configs. This is what makes the post-reload window not lose
  the desired server set.
* `try_apply_servers_dynamically` now translates diffs into
  `set server ... addr/state` commands. Slot-pool-size changes count
  as structural (the rendered template line itself changes), so they
  go through reload, which recomputes the split.
* Pool exhaustion → return False → caller reloads → reload re-splits
  with bigger share for the exhausted backend.

Failure modes:
* Removals processed before adds so 1:1 churn doesn't need 2x slots.
* Slot population after reload is best-effort; on failure HAProxy is
  up but no replicas are routable until next broadcast retries.
* `slot_mapping.json` written to socket_dir after each apply for
  live debugging (translates srvN back to SERVE_REPLICA_ name).

Constants added (all env-overridable):
  RAY_SERVE_HAPROXY_TOTAL_SLOTS=4096
  RAY_SERVE_HAPROXY_MIN_SLOTS_PER_BACKEND=16
  RAY_SERVE_HAPROXY_SLOT_HEADROOM_FACTOR=2.0

Behavior change: `show stat` on the admin socket now reports
`srv1`/`srv2`/... instead of `SERVE_REPLICA_<app>-<replica>`. Use
`slot_mapping.json` or the per-apply log lines to translate.

Test golden config updated for the new template line. Pure-Python
unit tests for BackendSlotPool and _compute_slot_split added in
test_haproxy_slots.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread python/ray/serve/tests/unit/test_haproxy_diff.py
"invalid argument",
"unknown command",
"must be disabled",
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated error markers tuple across two methods

Low Severity

The error_markers tuple is defined identically inside both _send_runtime_command and _send_runtime_commands. If a marker is later added to one but not the other, error detection would silently diverge between single-command and batched-command paths. Additionally, _send_runtime_command (singular) appears to be unused — only _send_runtime_commands (plural) is called from _rebind_slot_pools and try_apply_servers_dynamically.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 9dbe27e. Configure here.

harshit-anyscale and others added 7 commits May 11, 2026 09:52
Five distinct bugs found while re-reading the prior commit:

1. CRITICAL — slot split recomputed on every broadcast → reload-storm.
   `_build_name_to_backend_configs` unconditionally called
   `_compute_slot_split`. The proportional split shifts every time any
   backend's replica count changes (since ratios reallocate among all
   backends), so every broadcast saw `slot_pool_size` differ →
   structural diff → reload. The runtime-API path was effectively never
   taken. Fix: cache `_current_split` on HAProxyManager; recompute only
   when the backend set changes or any backend has overflowed its
   current allocation. Steady-state churn now hits the runtime-API path
   as intended; reloads only happen on topology changes or genuine
   exhaustion.

2. HIGH — `_rebind_slot_pools` failure left inconsistent state.
   The method pre-assigns slots in the pool BEFORE sending runtime-API
   commands. On send failure the pool said slots were assigned but
   HAProxy still had them at the placeholder; next broadcast with the
   same servers would see an empty diff and skip work, leaving HAProxy
   serving nothing forever. Fix: on send failure wipe both pools and
   `backend_configs` so the next broadcast goes through the first-time
   setup path → reload → fresh rebind with a clean admin socket.

3. HIGH (tests) — `BackendConfig(servers=[...])` without
   `slot_pool_size` rendered `server-template srv 1-0`, which HAProxy
   rejects. Production sets the size via `_compute_slot_split`, but
   test fixtures construct BackendConfig directly. Fix: `__post_init__`
   defaults `slot_pool_size` to `max(len(servers)*2, MIN_SLOTS)` when
   not set; production assignment after construction still wins.

4. HIGH — `enable_hap_optimization` (server-state file) collided with
   server-template. The state file persists slot ↔ replica mappings
   across reloads and is loaded by the new HAProxy on startup. Our
   `_rebind_slot_pools` then assigns slots first-free from a FRESH pool
   and sends `set server addr/state` commands. Result: the same replica
   ends up bound to two slots (one restored by state-file load, one
   freshly assigned by the actor), doubling traffic to it. Compounds
   over reloads. Fix: stop loading the state file -- the mapping is now
   owned by the actor's in-memory pool. Removed `_save_server_state`
   call from `_graceful_reload` (and the now-unused method) and dropped
   `load-server-state-from-file global` from the template. Kept `-x`
   FD transfer (independent, still useful for socket migration).

5. LOW — "Slot population OK: X servers placed" log used the desired
   total instead of the actual placed count. Showed the wrong number on
   mid-batch pool exhaustion. Fix: track `placed_count` directly.

Test golden config updated for the removed `load-server-state-from-file`
line. No new tests; the existing slot-pool unit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With server-template, `show stat` reports every templated slot
(srv1..srv4096) including placeholders in MAINT. Callers of
`get_all_stats` (and via it `total_servers`, `active_servers`,
`HAProxyStats.*`) were getting inflated counts that don't reflect
real replicas. `is_drained` / `is_system_idle` happened to be fine
because placeholder slots have 0 sessions and 0 queued, so the
session sums were correct. But anything counting servers --
including the existing integration tests -- broke.

Filter in `get_all_stats`:
* Slots actively assigned in the in-memory pool: kept.
* Slots with current_sessions > 0 or queued > 0: kept (recently
  released slots still draining in-flight work; `is_system_idle`
  needs to see those).
* Fallback servers (named after the actor, not srv-prefixed): kept.
* Everything else (placeholders): dropped.

Added a public `BackendSlotPool.assigned_slots() -> Set[int]`
accessor so the filter doesn't have to reach into private state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e commands

After a graceful reload, `_start_and_wait_for_haproxy` returns as soon
as the new proc's frontend listener accepts TCP -- which is fast via
`-x` FD transfer. But the runtime-API admin socket binding lags behind
that by tens to hundreds of milliseconds. During that lag, connections
to the admin-socket path still go to the OLD proc (it still has the
file bound). For the initial reload (when the old proc started with an
empty config), that means our `set server backend/srvN ...` commands
hit a process that doesn't know about any backends -- "No such
backend." -- and `_rebind_slot_pools` wipes pool+backend_configs to
force recovery, which schedules another reload, which races the same
window, and so on.

Visible failure pattern: bursts of 503s on high-traffic endpoints
during cluster bring-up (~3% aggregate failure rate observed on a
recent load test, ~7% on /highscale-stress/ specifically), centered
on the period between the initial empty-config start and the first
broadcast's reload completing.

Fix: probe `show info` for the new proc's PID after start and before
issuing any runtime commands. Polls every 50ms with a 10s deadline.
On the rare timeout, log a warning and proceed -- runtime commands
may still fail, in which case the existing reload-fallback path
kicks in.

Applied in three reload paths:
* HAProxyApi.start() (initial start)
* _graceful_reload (normal graceful reload)
* _graceful_reload fresh-start branch

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/hard-stop timeouts

Production load tests showed a reload cascade: under traffic spikes the
data-plane threads saturated and starved the admin socket, runtime-API
commands hit their 15s timeout, the proxy actor fell back to a full
reload, and the in-flight streaming and long-running requests on the
old HAProxy process were chopped when hard-stop-after fired.

Three changes that together break the cascade and survive the reloads
it can't prevent:

- haproxy_templates.py: pin both stats sockets to `thread 1` and
  restrict all `bind` lines (prometheus, http_frontend, stats) to
  `thread 2-{nbthread}`. With admin work on a dedicated thread, runtime
  commands stay responsive even when threads 2..N are at 100% CPU.
- constants.py: RAY_SERVE_HAPROXY_SOCKET_TIMEOUT_S 15s -> 60s as a
  safety margin for rare slow windows that isolation alone doesn't
  cover.
- constants.py: RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S 120s -> 600s so
  long-running requests (e.g. SSE streams, /long-runner/-style
  endpoints) survive across reloads instead of being TCP-RST'd.

Thread isolation requires nbthread >= 2; HAProxyConfig.__post_init__
enforces this so a misconfigured template can't render `thread 2-1`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cket diagnostics

Thread isolation didn't fix the admin-socket cascade. Logs from the
post-isolation run show admin_wait consistently hitting the full 60s
ceiling -- meaning the socket is unresponsive, not contended -- and
rebind_slots variance got worse (single-threaded admin processing).

The likely root cause is now: HAProxy in `-db` debug mode emits per
backend-state-change / health-check-failure lines on stderr. The proxy
actor leaves stderr=PIPE without an active reader, so the 64KB pipe
buffer fills under sustained churn and HAProxy blocks on its next
stderr write -- which freezes the admin socket (same proc) and looks
identical to a deadlock from outside.

- haproxy_templates.py: remove `thread 1` from stats sockets and
  `thread 2-N` from frontend/stats binds, restoring HAProxy's default
  thread distribution (all 4 threads handle admin work in parallel).
- haproxy.py: drop nbthread>=2 validation from HAProxyConfig.
- haproxy.py: spawn `_drain_proc_stderr` after each successful start
  to copy HAProxy stderr to a per-pid log file
  (`<socket_path>.stderr.<pid>.log`), keeping the OS pipe buffer
  drained. Bounded to 10MB with truncate-on-roll.
- haproxy.py: on admin-socket-wait timeout, dump a snapshot
  (`_dump_admin_socket_diagnostics`) of the socket path's filesystem
  state, the new proc's `/proc/<pid>/status` (State: line shows
  Sleeping vs Running -- the smoking gun for a stderr write block),
  and a tail of the per-pid stderr log so the next run carries its
  own diagnostic evidence.
- constants.py: trim the SOCKET_TIMEOUT_S comment that referenced the
  now-removed thread isolation. Keep 60s timeout and 600s
  hard-stop-after; both are independently useful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pool

Three changes that address the residual /cpu-fanout/, /heavy-payload/,
and /long-runner/ errors once HAProxy itself was stable:

- constants.py: default `RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S` to 3600
  (matches timeout_client_s) and `RAY_SERVE_HAPROXY_TIMEOUT_CONNECT_S`
  to 5. Both were `None` before -- the rendered defaults block was
  emitting blank lines where `timeout server` and `timeout connect`
  should have been, leaving HAProxy on its undefined-behavior fallback
  for both. `timeout server` unset is particularly suspect during
  graceful reload edges.
- constants.py: bump `RAY_SERVE_HAPROXY_MIN_SLOTS_PER_BACKEND` from
  16 to 32. With `inter 1s fall 2`, a slot can be DOWN for ~2s after a
  replica goes away; backends with only 16 slots (like /heavy-payload/)
  can briefly have zero UP servers under churn and serve 503 to
  whatever requests arrive in that window. 32 gives headroom for
  small-pool backends without significantly affecting total budget.
- haproxy_templates.py: drop `option abortonclose`. It RSTs the
  backend connection any time the client's TCP read-half goes quiet
  -- fine for sub-second GETs but catastrophic for endpoints that
  hold a connection open for tens of seconds (NAT idle reset,
  keep-alive timer, locust pause, all trigger spurious aborts that
  surface as 502 to the caller, normalized to 500 by errorfile).

Tests: golden config in test_haproxy_api.py updated to drop the
`option abortonclose` line; render verified to match exactly. Per-
backend timeout/diff tests in test_haproxy_diff.py and the slot-split
unit tests in test_haproxy_slots.py pass timeouts/min_per_backend
explicitly, so the new global defaults don't affect them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…POptions

For long-running endpoints (e.g. /long-runner/ with med 73s requests),
two changes that together reduce the rate of mid-request kills during
graceful reload:

- constants.py: bump RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S 600s -> 1800s.
  After `-sf`, the old proc is force-killed when hard-stop fires. A
  request in flight on a near-end-of-life old proc gets RST'd, which
  the replica logs as HTTP 499 (client closed before response). With
  1800s, requests up to ~30 minutes have a wide margin to complete
  even if they ride on a near-deadline proc.

- constants.py + haproxy.py: introduce
  RAY_SERVE_HAPROXY_TIMEOUT_HTTP_KEEP_ALIVE_S (default 60s), promote
  `timeout_http_keep_alive_s` from a property reading
  `http_options.keep_alive_timeout_s` to a real field on HAProxyConfig
  defaulting to the new env var. The old setup conflated two unrelated
  timeouts:
    * HTTPOptions.keep_alive_timeout_s -- uvicorn keep-alive on the
      *replica* side (per Ray Serve's existing semantics)
    * HAProxy keep-alive -- HAProxy <-> client idle timeout, used for
      forcing locust/client TCP connections to rotate off old procs
      after a reload
  60s in HAProxy doesn't kill active long-running requests (those
  are bounded by `timeout client`/`timeout server`, both 3600s). It
  only bounds idle time between requests on a keep-alive connection,
  which encourages clients to reopen connections and land on the new
  proc.

Test: test_haproxy_api.py now passes timeout_http_keep_alive_s=55 to
HAProxyConfig directly rather than via HTTPOptions; the rendered
template still matches the golden config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 6 total unresolved issues (including 5 from previous reviews).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 8bb9997. Configure here.

server-state-base /tmp/haproxy-serve
server-state-file /tmp/haproxy-serve/server-state
hard-stop-after 120s
hard-stop-after 600s
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test expects wrong hard-stop-after value (600 vs 1800)

Medium Severity

The test's HAProxyConfig stub does not explicitly set hard_stop_after_s, so it inherits the default RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S. This constant was changed from 120 to 1800 in this PR, but the expected config string says hard-stop-after 600s. The actual rendered value will be 1800s, causing the full-config comparison assertion to fail.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8bb9997. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant