[serve] Enable HAProxy redispatch and retry-on for backend resilience#63159
[serve] Enable HAProxy redispatch and retry-on for backend resilience#63159harshit-anyscale wants to merge 27 commits into
Conversation
Add `option redispatch` and `retry-on conn-failure empty-response` to each generated backend block in the HAProxy template. Without these directives, when a Serve replica becomes unreachable between health-check intervals (e.g. during autoscaling churn), HAProxy exhausts its default `retries 3` against the same dead server, falls through to the slow `SERVE_PROXY_ACTOR-fallback` backup, and surfaces a synthesized 502 (remapped to 500 by the existing errorfile config) to the client. With `option redispatch`, the retries pick a fresh server. With `retry-on conn-failure empty-response`, retries trigger on connection-establishment failures (replica process gone) and connected-but-no-response cases (replica died after accept) — both are body-stream-safe and don't risk double-execution of side-effecting requests. The two `retry-on` conditions intentionally exclude `response-timeout` (no `timeout server` is set in the default config so it would never fire) and `5xx` (would mask `BackPressureError` from the application layer, causing thrash instead of respecting back-pressure). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request updates the HAProxy configuration templates and associated tests to include "option redispatch" and "retry-on conn-failure empty-response" for improved reliability. The review feedback suggests explicitly defining the retry count and adding a note regarding the HAProxy 2.0+ version requirement for the "retry-on" directive to ensure environment compatibility and consistent behavior.
I am having trouble creating individual review comments. Click here to see my feedback.
python/ray/serve/_private/haproxy_templates.py (94-100)
The retry-on directive was introduced in HAProxy 2.0. If the environment uses an older version of HAProxy (e.g., stock versions on older LTS distributions), this configuration will cause HAProxy to fail at startup.
Additionally, while HAProxy defaults to retries 3, it is more robust to set this explicitly in the configuration to ensure consistent behavior across different environments and custom HAProxy builds.
# On a server failure, retry on a different server (combined with the
# explicit `retries 3`) instead of giving up to the backup or returning
# a synthesized 5xx. retry-on covers connection-establishment failures
# and "connected but no response yet" cases — both are body-stream-safe
# and don't risk double-execution.
# Note: retry-on requires HAProxy 2.0+.
option redispatch
retries 3
retry-on conn-failure empty-response
python/ray/serve/tests/test_haproxy_api.py (329-335)
Updating the test fixture to match the suggested template changes (explicit retries 3 and version note).
# On a server failure, retry on a different server (combined with the
# explicit `retries 3`) instead of giving up to the backup or returning
# a synthesized 5xx. retry-on covers connection-establishment failures
# and "connected but no response yet" cases — both are body-stream-safe
# and don't risk double-execution.
# Note: retry-on requires HAProxy 2.0+.
option redispatch
retries 3
retry-on conn-failure empty-response
python/ray/serve/tests/test_haproxy_api.py (354-360)
Updating the test fixture to match the suggested template changes (explicit retries 3 and version note).
# On a server failure, retry on a different server (combined with the
# explicit `retries 3`) instead of giving up to the backup or returning
# a synthesized 5xx. retry-on covers connection-establishment failures
# and "connected but no response yet" cases — both are body-stream-safe
# and don't risk double-execution.
# Note: retry-on requires HAProxy 2.0+.
option redispatch
retries 3
retry-on conn-failure empty-response
With HAProxy redispatch enabled, requests to a deployment whose only replica died mid-request now reach the backup proxy actor instead of being given up on by HAProxy. The backup raises DeploymentUnavailableError (→ 503) when the controller has broadcast unavailability, or the request fails as ActorDiedError (→ 500) if it hasn't. Both are valid 5xx responses for a deployment that is genuinely unavailable; loosen the assertion to accept either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related knobs that matter for HAProxy ingress reliability under
autoscaling churn, both previously hardcoded:
1. RAY_SERVE_HAPROXY_RELOAD_TIMEOUT_S (default: 5)
The wait window in _wait_for_hap_availability() used to be 5s and
hardcoded. With many backends/servers under load, HAProxy's startup
(config parse + initial health checks against every server + socket
transfer) regularly exceeds 5s, the new process is killed, and the
old process keeps running with stale config. Since Ray's
HAProxyManager regenerates config rather than using HAProxy's runtime
API, a failed reload silently drops new replica registrations:
primaries are added to the broadcast but never to HAProxy's server
list, leaving the pool to drift smaller than reality and concentrating
load on a shrinking subset of replicas (or on the single fallback
ProxyActor).
2. RAY_SERVE_HAPROXY_RETRIES (default: 3)
Number of connection-level retries. Combined with the existing
`option redispatch` (added in the same PR series), each retry picks
a different healthy server. Raising this is useful in environments
with autoscaling churn where any given primary may briefly be
unreachable before another sibling can serve the request.
Both default to the previous compiled-in values, preserving behavior
for users who don't opt in. Recommended overrides for high-churn
clusters:
RAY_SERVE_HAPROXY_RELOAD_TIMEOUT_S=30
RAY_SERVE_HAPROXY_RETRIES=100
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Always return 200 from HAProxy's /-/healthz when HAProxy itself is up (unless drain mode is active). Previously, the endpoint returned 503 whenever every backend reported nbsrv() == 0, which happens not only during real outages but also briefly during reload transitions (initial-check windows where servers are present in the config but HAProxy's per-server health checks haven't completed yet). Tying ALB target health to backend pool state via this endpoint causes flapping cascades: a 1-3 second 503 window during a reload makes the ALB mark the target unhealthy for minutes (interval × healthy_threshold to recover). With reloads happening repeatedly under autoscaling churn, all targets end up unhealthy simultaneously, the ALB has no healthy targets, and every client request becomes an ALB-synthesized 502. After this change, /-/healthz reflects only HAProxy's own liveness (the question the ALB target health check actually needs answered). Backend availability is still surfaced to clients per-request: if HAProxy can't reach any backend, it returns 503 from the actual request path — which is the correct HTTP semantic and uncorrelated across requests, not a cluster-wide blackout. Trade-off: clients may now see HAProxy-returned 503s for the small fraction of requests that genuinely have no backend, instead of ALB-returned 502s. Net effect is a substantial volume reduction in client-visible failures because the cascade pattern (one bad probe blanking out an entire target for minutes) is eliminated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At Ray Serve scales with many replicas under autoscaling churn, every controller broadcast triggers a full HAProxy config regeneration plus a graceful reload. Each reload blocks the proxy actor's event loop long enough (config-string templating, file write, fork+exec of the new HAProxy process, initial server-state probes) that the actor can miss the controller's `check_health` window. Three consecutive missed checks trip PROXY_HEALTH_CHECK_UNHEALTHY_THRESHOLD; the controller kills and recreates the actor; the new actor immediately receives a broadcast and starts another reload; repeat. Result: every proxy in the cluster ends up cycling continuously, the ALB sees its targets as permanently unhealthy, and clients receive 502s from "no healthy targets" not from any HAProxy or replica issue. This change adds an incremental update path: when the only difference between the previous and new backend configs is the set of servers within existing backends (the dominant case during autoscaling churn — adds/removes of individual replicas), apply the change via HAProxy's runtime API (`add server` / `del server` over the admin socket) instead of regenerating the config and reloading. Runtime-API server operations are sub-millisecond per command and don't block the actor's event loop, so the proxy stays responsive to the controller even while processing rapid-fire broadcasts. Structural changes — backend added or removed, ACL/path-prefix change, fallback-server change, timeout or health-check parameter change — still take the existing reload path. The diff is conservative: any field-level discrepancy outside the servers list forces a reload. If any single runtime-API command fails (socket transient, HAProxy version doesn't support `add server`, etc.) the change is abandoned mid-flight and the caller falls back to a reload, so partial state is never persisted. Internal state (HAProxyApi.backend_configs and the on-disk config file) is updated either way so a subsequent reload — for instance, triggered by a future structural change — sees the current set of servers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review of the prior commit flagged three issues; this addresses them. (High) `_send_socket_command` returns HAProxy's response body verbatim, including error strings — HAProxy conveys most command-level rejections through the response (e.g. "[ALERT] (123) : 'add server' : Backend X not found.") rather than via transport-level exceptions. The previous apply path only caught transport exceptions, so it would treat a rejection as success and skip the fallback reload, leaving live HAProxy state diverged from `self.backend_configs`. Add `_send_runtime_command` which wraps the send + checks the response for HAProxy's error markers (log-level prefixes plus a small set of common error tokens) and raises RuntimeError when any are present. Mutating commands in `try_apply_servers_dynamically` use the new wrapper so application-level rejections trigger the existing fallback-to-reload path. (Low) Add `BackendConfig.name` to the structural-change comparison in `_compute_backend_diff`. The current callers in this file always set the dict key equal to the BackendConfig.name, but the helper itself shouldn't assume that invariant — a name change has to reload because it affects the frontend ACL labels and `use_backend` directives. (Doc) Update the comment that claimed the apply loop "avoids leaving HAProxy in a partially-updated state" — commands are sent sequentially and partial state is live until the fallback reload converges things back to the desired state. The fallback reload is what guarantees consistency, not the abort-on-error. Tests: add coverage for the rejection-via-response-string case (the common one) and direct unit tests for `_send_runtime_command`'s validator behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| error_markers = ( | ||
| "[alert]", | ||
| "[warning]", | ||
| "[notice]", |
There was a problem hiding this comment.
Notice-level marker may defeat incremental optimization entirely
Medium Severity
Including "[notice]" in error_markers within _send_runtime_command may cause the method to raise RuntimeError on successful HAProxy operations. HAProxy's CLI can return [NOTICE]-level messages for non-error events (e.g., server state transitions after enable server, or registration confirmations after add server in some versions). If the HAProxy version in use returns [NOTICE] on any successful runtime command, the incremental path in try_apply_servers_dynamically will always exception out and fall back to a full reload, completely negating the performance optimization that this PR introduces.
Reviewed by Cursor Bugbot for commit efbe04e. Configure here.
When a prior reload's `-x admin.sock` socket transfer fails (admin socket overloaded under autoscaling churn), the new HAProxy exits 1 before binding listeners. The actor's `self._proc` ends up pointing at that dead process, and every subsequent `_apply_backend_update` raises "HAProxy crashed during startup: exit code 1" from `_wait_for_hap_availability(old_proc)` -- the actor is stuck with no listener on 8000 until the controller kills it. Detect `old_proc` already exited and start a fresh HAProxy without `-sf`/`-x`. Drops in-flight connections briefly but lets the actor return to a healthy state instead of staying down. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under autoscaling churn the Serve controller fires target-group / fallback-target broadcasts tens of ms apart; without coalescing, each one runs its own runtime-API command burst on the HAProxy admin socket. The CLI mux serializes everything (runtime API, stats, the `-x` socket transfer used by reload) so command-level timeouts and `-x` failures cluster together once admin-socket pressure exceeds HAProxy's processing rate. Coalesce broadcasts in `HAProxyManager` into a single sleeping flush task. Updates arriving during the sleep are absorbed implicitly via in-place writes to `self._target_groups` / fallback fields; updates during the apply re-arm a `_coalesce_pending` flag so the trailing broadcast in a flurry isn't dropped. Window is configurable via `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S` (default 0.5s, set 0 to disable). Trade-off: each broadcast incurs up to one window of additional latency before reaching HAProxy, which is small relative to typical replica-start time and well below the rate at which admin-socket saturation begins. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| # defensively (in case other fields on ServerConfig change in the | ||
| # future) and skip the reload. | ||
| self.set_backend_configs(new_backend_configs) | ||
| return True |
There was a problem hiding this comment.
Failed reload never retried due to premature state update
Medium Severity
try_apply_servers_dynamically calls set_backend_configs(new_backend_configs) on every code path before returning, including when it returns False for structural changes. In _apply_backend_update, if the subsequent reload() call then raises an exception, self.backend_configs already reflects the new desired state. On the next broadcast with identical target groups, _compute_backend_diff sees no difference and try_apply_servers_dynamically returns True (no-op), so the reload is never retried. HAProxy permanently serves stale routing until a different config change arrives. The old code unconditionally regenerated and reloaded on every broadcast, ensuring convergence after transient failures.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit ce527e4. Configure here.
| # Don't let an apply failure kill the coalescer. The next broadcast | ||
| # will schedule a fresh flush task; the underlying error is already | ||
| # logged by `_apply_backend_update` / `try_apply_servers_dynamically`. | ||
| logger.error(f"Coalesced backend apply failed: {e}") |
There was a problem hiding this comment.
Coalescer silently drops pending broadcast on apply failure
Low Severity
When _apply_backend_update raises inside _coalesce_and_apply, the except block logs the error and exits. If a broadcast arrived during the failed apply (setting _coalesce_pending = True), the task ends without re-checking the flag. No new task is scheduled until the next broadcast arrives. If no further broadcasts come, the pending update is silently lost and HAProxy stays stale indefinitely.
Reviewed by Cursor Bugbot for commit ce527e4. Configure here.
The dead-old-proc fresh-start path didn't account for orphaned HAProxy processes still holding ports 8000 / 9101 / 8404. When a prior reload's `-x admin.sock` socket transfer fails, the new process exits without sending SIGUSR1 to the old (the SIGUSR1 only fires after binding succeeds), and the old process keeps its listening sockets fully open indefinitely while the actor logs it as a finished old proc in `_old_procs`. Subsequent fresh-start attempts then race against those orphans for ports and crash with `cannot bind socket (Address already in use)`, only recovering when the orphans naturally drain (tens of seconds). SIGKILL tracked old procs before the fresh-start bind so the new process can take the ports immediately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The controller's `check_health` was calling `is_running()`, which sends `show info` over the HAProxy admin socket. Under load that socket is shared with our runtime-API commands and the `-x` reload FD transfer, and saturates: `show info` exceeds the 5s `_send_socket_command` timeout, `is_running()` returns False, the controller counts it as a missed health check, and after 3 consecutive misses (~30-50s) the controller marks the proxy unhealthy and kills the actor. The killed actor's HAProxy child gets orphaned, holds listener ports, and the replacement actor's HAProxy crashes with EADDRINUSE -- compounding the cascade we were trying to avoid. Replace the admin-socket check with a direct probe of the HAProxy HTTP listener at `/-/healthz`. The listener is served by worker threads independent of the CLI mux, so it stays responsive when admin socket is saturated. Matches what the upstream ALB target group health check sees, so proxy self-reported health aligns with load-balancer view. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| try: | ||
| reader, writer = await asyncio.wait_for( | ||
| asyncio.open_connection("127.0.0.1", self._http_options.port), | ||
| timeout=2.0, |
There was a problem hiding this comment.
Health probe hardcodes 127.0.0.1 ignoring configured bind address
Medium Severity
_probe_http_listener always connects to "127.0.0.1", but HAProxy's frontend_host property maps a non-loopback http_options.host (e.g., "192.168.1.5") directly into the bind directive — meaning HAProxy only listens on that specific interface. In that configuration the probe fails with connection refused on every health check, causing the controller to consider the proxy unhealthy and kill/recreate it in a loop, even though HAProxy is perfectly functional on its configured address.
Reviewed by Cursor Bugbot for commit aded8a7. Configure here.
`_wait_for_hap_availability` polled `is_running()` (which sends `show info` over the HAProxy admin socket) to decide whether the new HAProxy process had finished startup. Under load the admin socket is shared with runtime-API commands and the `-x` FD transfer, so a benign `show info` can take >5s to return. With the 15s reload timeout, only a few iterations get a chance to succeed; if all of them hit a slow admin socket we abort the reload as "did not enter running state within 15 seconds" -- even though the new HAProxy is actually serving. Switch the readiness probe to a direct HTTP HEAD/GET to /-/healthz on the frontend listener. The listener is independent of the CLI mux and stays responsive when the admin socket is saturated. Crash detection still uses `proc.returncode` polling, so a genuinely failed startup is still caught quickly. Also make `_save_server_state` failures non-fatal in `_graceful_reload`. The state file is purely an optimization for preserving per-server connection counts across reload; a `show servers state` admin-socket timeout used to abort the entire reload, which then cascaded as the next broadcast retried with the same overloaded socket. Refactor the HTTP probe into `HAProxyApi.probe_http_listener` so both `HAProxyManager.check_health` (controller-facing) and `_wait_for_hap_availability` (internal reload) share one implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit only treated `HTTP/1.x 200` as alive, which broke two legitimate states where HAProxy is fully functioning but /-/healthz returns 503: - Drain mode: `disable()` flips `pass_health_checks=False` so /-/healthz returns 503 to signal the upstream LB to stop routing. With strict-200, `check_health` returned False during drain and the controller would mark the proxy UNHEALTHY, killing a draining proxy and triggering the orphan-port cascade we just fixed elsewhere. - Initial startup: when `has_received_routes=False` (before the first controller broadcast), /-/healthz returns 503. Strict-200 made `_wait_for_hap_availability` time out during initial `start()`, so the actor never reached HEALTHY. The signal we want from the probe is "HAProxy is alive and parsing HTTP", not "HAProxy says traffic is healthy". Match the old admin-socket `is_running()` behavior, which returned True in both above states. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The HTTP probe takes up to 4s per attempt (2s connect + 2s read), which exhausted the 5s startup budget in 1-2 iterations under load and made initial HAProxy startup fail with "did not enter running state within 5 seconds". Even though HAProxy was bound to its port, its worker threads were busy replaying the server-state file and dispatching health-check setup for hundreds of servers -- so the first HTTP request to /-/healthz could take a few hundred ms to be processed, and the cumulative wait blew the budget. Split the probe in two: - `_is_listener_bound()`: TCP open + close, 0.5s timeout. Used inside `_wait_for_hap_availability` for startup/reload readiness. Iterates fast (sub-millisecond when port not yet bound, sub-millisecond when bound) so we get many attempts per second within the budget. The signal "kernel accepted my SYN" is sufficient to know HAProxy finished binding listeners. - `probe_http_listener()`: full HTTP GET, kept for the controller's `check_health`. Runs every ~10s on stable proxies where HAProxy is long past startup, so the per-attempt cost doesn't matter. Verifies HAProxy is actually parsing HTTP, which is a stronger signal than TCP-bound for ongoing health. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_start_and_wait_for_haproxy` had a hardcoded 5s timeout, applied to both initial actor startup (via `start()`) and graceful reload (via `_graceful_reload`). With large configs (many backends + large server-state file replay) HAProxy can need more than 5s to reach the bind step, especially on hosts under CPU pressure from a cascade already in progress -- which manifests as "HAProxy did not enter running state within 5 seconds" right when we need recovery to be robust. Default to `RAY_SERVE_HAPROXY_RELOAD_TIMEOUT_S` (15s) to match the timeout already used elsewhere in the reload path. Belt-and-suspenders alongside the TCP-only readiness probe: the probe makes per-attempt cost cheap, the larger budget tolerates genuinely slow startup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reloads can fail with "did not enter running state within N seconds" without telling us where the time went. Add structured timing to the graceful reload path so failures expose the bottleneck: - `_start_and_wait_for_haproxy`: logs spawn vs wait_for_ready time and the args used (visible whether `-sf`/`-x` are involved). - `_graceful_reload`: logs each phase (old_proc_check, save_state, new_proc_start) with cumulative duration on success and failure. - `_wait_for_hap_availability`: logs probe-attempt count on success (debug) and includes attempt count in the timeout error. After deploy, "HAProxy graceful reload OK in 12.34s (old_proc_check=0.05s save_state=2.10s new_proc_start=10.19s)" or the corresponding failure message tells you exactly which phase is slow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent quick wins to reduce HAProxy reload latency, both via
the rendered config:
1. Split the admin socket into two:
- `<socket>.fd` (`expose-fd listeners`) — dedicated to the `-x`
FD-transfer during graceful reload.
- `<socket>` — runtime API (`add server`, `del server`, `show stat`)
and stats. No `expose-fd listeners`.
Previously a single socket carried both. With nbthread > 1 HAProxy
parallelizes CLI connections across threads, but the `_getsocks`
exchange during reload still has to share the same connection-slot
pool with our runtime-API command stream. Splitting them keeps the
FD-transfer socket idle except during reload, so reloads don't
queue behind in-flight `add server` commands.
`transfer_socket_path` is derived deterministically from
`socket_path` (`.fd` suffix), so existing per-node path scoping
carries over. `-x` now points at the transfer socket;
`_send_socket_command` continues to use `socket_path`.
2. Add `default-server init-addr last,libc,none` to the defaults
block. HAProxy's default `init-addr` order consults libc DNS even
for IP-literal servers, which can stall startup for seconds when a
config has thousands of server entries. `last,libc,none` reuses
the previous process's resolved address (free via the server-state
file), only consults libc if needed, and accepts "no IP yet" rather
than failing startup -- so HAProxy reaches its bind step
immediately.
Updates the rendered-config golden in `test_haproxy_api.py` to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each `add server`/`enable server`/`disable server`/`del server` was issued as its own Unix socket round-trip via `_send_runtime_command`. A real autoscaling broadcast (say 50 server changes across 24 backends) produced ~200 sequential per-command connections to the local HAProxy admin socket. Under traffic load each round-trip queues behind HAProxy worker threads serving HTTP, and the cumulative wallclock exceeds the 5s `_send_socket_command` timeout. The apply fails, falls back to reload, and (in the worst case) cascades into a cluster-wide event of runtime-API timeouts + fresh-start recoveries. Add `_send_runtime_commands(commands, chunk_size=64)` which joins batches of commands with `;` (HAProxy CLI's native multi-command separator) and sends each chunk over a single admin-socket connection. Same error-marker substring detection as the single-command path; loses per-command attribution on failure but the caller's response (fall back to reload) is the same regardless. Apply loop in `try_apply_servers_dynamically` switched from per-command awaits to building a command list and sending via the batched API. For a 200-command apply, this drops the round-trip count from 200 to ceil(200/64) = 4, amortizing the per-connection accept/dispatch cost into a single CLI session per chunk where HAProxy processes commands in-order without re-acquiring the socket. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chunk timings Under autoscaling churn at production scale, broadcasts of dozens to hundreds of `add/del/disable/enable` commands routinely exceeded the hardcoded 5s admin-socket timeout, falling back to a full reload that amplifies load cluster-wide. Two tactical fixes: * `RAY_SERVE_HAPROXY_RUNTIME_CHUNK_SIZE` (default 16, was 64): per-chunk timeout budget instead of per-batch budget. Smaller chunks complete the runtime-API path instead of failing into the slow reload path. * `RAY_SERVE_HAPROXY_SOCKET_TIMEOUT_S` (default 15, was hardcoded 5): wider per-chunk budget so HAProxy can finish processing even when the CLI mux is queueing behind heavy HTTP worker dispatch. Both knobs are env-configurable so they can be tuned at the next load test without a redeploy. `_send_runtime_commands` now logs per-chunk wallclock on failure (which chunk, how long before it failed, prior chunks' times, total elapsed) and a summary on success, so we can distinguish "just over the timeout budget" (smaller chunks help) from "HAProxy genuinely wedged" (a different problem). Success log is info-level + stderr so chunk-time distributions are visible in normal proxy stderr alongside the existing apply summaries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the dynamic `add server` / `del server` / `disable server` / `enable server` runtime-API model with HAProxy's `server-template` directive plus per-backend slot pools. Each backend's rendered config now contains a pre-allocated pool of `server-template srv 1-N ...` slots that start disabled at a placeholder address. The proxy actor binds slot ↔ replica mappings in-process (BackendSlotPool, first-free via min-heap) and uses `set server addr ... port ...` + `set server state ready/maint` to repurpose slots at runtime. This eliminates server-list mutation under HAProxy's CLI mux lock (which contends with HTTP worker dispatch under heavy traffic) and roughly halves command count per broadcast for replica removals. Architecture: * `RAY_SERVE_HAPROXY_TOTAL_SLOTS` (4096) caps total slots across all backends; `_compute_slot_split` partitions the budget per backend at config-generation time using replica-count * headroom proportional split with a per-backend minimum floor. * `BackendSlotPool` per backend tracks replica → slot. Lives on `HAProxyApi` so it's adjacent to the runtime-API call sites. * `_rebind_slot_pools()` runs after every successful (re)load: recreates pools sized from BackendConfig.slot_pool_size, then issues one batched runtime-API call to populate slots from the current backend_configs. This is what makes the post-reload window not lose the desired server set. * `try_apply_servers_dynamically` now translates diffs into `set server ... addr/state` commands. Slot-pool-size changes count as structural (the rendered template line itself changes), so they go through reload, which recomputes the split. * Pool exhaustion → return False → caller reloads → reload re-splits with bigger share for the exhausted backend. Failure modes: * Removals processed before adds so 1:1 churn doesn't need 2x slots. * Slot population after reload is best-effort; on failure HAProxy is up but no replicas are routable until next broadcast retries. * `slot_mapping.json` written to socket_dir after each apply for live debugging (translates srvN back to SERVE_REPLICA_ name). Constants added (all env-overridable): RAY_SERVE_HAPROXY_TOTAL_SLOTS=4096 RAY_SERVE_HAPROXY_MIN_SLOTS_PER_BACKEND=16 RAY_SERVE_HAPROXY_SLOT_HEADROOM_FACTOR=2.0 Behavior change: `show stat` on the admin socket now reports `srv1`/`srv2`/... instead of `SERVE_REPLICA_<app>-<replica>`. Use `slot_mapping.json` or the per-apply log lines to translate. Test golden config updated for the new template line. Pure-Python unit tests for BackendSlotPool and _compute_slot_split added in test_haproxy_slots.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| "invalid argument", | ||
| "unknown command", | ||
| "must be disabled", | ||
| ) |
There was a problem hiding this comment.
Duplicated error markers tuple across two methods
Low Severity
The error_markers tuple is defined identically inside both _send_runtime_command and _send_runtime_commands. If a marker is later added to one but not the other, error detection would silently diverge between single-command and batched-command paths. Additionally, _send_runtime_command (singular) appears to be unused — only _send_runtime_commands (plural) is called from _rebind_slot_pools and try_apply_servers_dynamically.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 9dbe27e. Configure here.
Five distinct bugs found while re-reading the prior commit: 1. CRITICAL — slot split recomputed on every broadcast → reload-storm. `_build_name_to_backend_configs` unconditionally called `_compute_slot_split`. The proportional split shifts every time any backend's replica count changes (since ratios reallocate among all backends), so every broadcast saw `slot_pool_size` differ → structural diff → reload. The runtime-API path was effectively never taken. Fix: cache `_current_split` on HAProxyManager; recompute only when the backend set changes or any backend has overflowed its current allocation. Steady-state churn now hits the runtime-API path as intended; reloads only happen on topology changes or genuine exhaustion. 2. HIGH — `_rebind_slot_pools` failure left inconsistent state. The method pre-assigns slots in the pool BEFORE sending runtime-API commands. On send failure the pool said slots were assigned but HAProxy still had them at the placeholder; next broadcast with the same servers would see an empty diff and skip work, leaving HAProxy serving nothing forever. Fix: on send failure wipe both pools and `backend_configs` so the next broadcast goes through the first-time setup path → reload → fresh rebind with a clean admin socket. 3. HIGH (tests) — `BackendConfig(servers=[...])` without `slot_pool_size` rendered `server-template srv 1-0`, which HAProxy rejects. Production sets the size via `_compute_slot_split`, but test fixtures construct BackendConfig directly. Fix: `__post_init__` defaults `slot_pool_size` to `max(len(servers)*2, MIN_SLOTS)` when not set; production assignment after construction still wins. 4. HIGH — `enable_hap_optimization` (server-state file) collided with server-template. The state file persists slot ↔ replica mappings across reloads and is loaded by the new HAProxy on startup. Our `_rebind_slot_pools` then assigns slots first-free from a FRESH pool and sends `set server addr/state` commands. Result: the same replica ends up bound to two slots (one restored by state-file load, one freshly assigned by the actor), doubling traffic to it. Compounds over reloads. Fix: stop loading the state file -- the mapping is now owned by the actor's in-memory pool. Removed `_save_server_state` call from `_graceful_reload` (and the now-unused method) and dropped `load-server-state-from-file global` from the template. Kept `-x` FD transfer (independent, still useful for socket migration). 5. LOW — "Slot population OK: X servers placed" log used the desired total instead of the actual placed count. Showed the wrong number on mid-batch pool exhaustion. Fix: track `placed_count` directly. Test golden config updated for the removed `load-server-state-from-file` line. No new tests; the existing slot-pool unit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With server-template, `show stat` reports every templated slot (srv1..srv4096) including placeholders in MAINT. Callers of `get_all_stats` (and via it `total_servers`, `active_servers`, `HAProxyStats.*`) were getting inflated counts that don't reflect real replicas. `is_drained` / `is_system_idle` happened to be fine because placeholder slots have 0 sessions and 0 queued, so the session sums were correct. But anything counting servers -- including the existing integration tests -- broke. Filter in `get_all_stats`: * Slots actively assigned in the in-memory pool: kept. * Slots with current_sessions > 0 or queued > 0: kept (recently released slots still draining in-flight work; `is_system_idle` needs to see those). * Fallback servers (named after the actor, not srv-prefixed): kept. * Everything else (placeholders): dropped. Added a public `BackendSlotPool.assigned_slots() -> Set[int]` accessor so the filter doesn't have to reach into private state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e commands After a graceful reload, `_start_and_wait_for_haproxy` returns as soon as the new proc's frontend listener accepts TCP -- which is fast via `-x` FD transfer. But the runtime-API admin socket binding lags behind that by tens to hundreds of milliseconds. During that lag, connections to the admin-socket path still go to the OLD proc (it still has the file bound). For the initial reload (when the old proc started with an empty config), that means our `set server backend/srvN ...` commands hit a process that doesn't know about any backends -- "No such backend." -- and `_rebind_slot_pools` wipes pool+backend_configs to force recovery, which schedules another reload, which races the same window, and so on. Visible failure pattern: bursts of 503s on high-traffic endpoints during cluster bring-up (~3% aggregate failure rate observed on a recent load test, ~7% on /highscale-stress/ specifically), centered on the period between the initial empty-config start and the first broadcast's reload completing. Fix: probe `show info` for the new proc's PID after start and before issuing any runtime commands. Polls every 50ms with a 10s deadline. On the rare timeout, log a warning and proceed -- runtime commands may still fail, in which case the existing reload-fallback path kicks in. Applied in three reload paths: * HAProxyApi.start() (initial start) * _graceful_reload (normal graceful reload) * _graceful_reload fresh-start branch Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/hard-stop timeouts
Production load tests showed a reload cascade: under traffic spikes the
data-plane threads saturated and starved the admin socket, runtime-API
commands hit their 15s timeout, the proxy actor fell back to a full
reload, and the in-flight streaming and long-running requests on the
old HAProxy process were chopped when hard-stop-after fired.
Three changes that together break the cascade and survive the reloads
it can't prevent:
- haproxy_templates.py: pin both stats sockets to `thread 1` and
restrict all `bind` lines (prometheus, http_frontend, stats) to
`thread 2-{nbthread}`. With admin work on a dedicated thread, runtime
commands stay responsive even when threads 2..N are at 100% CPU.
- constants.py: RAY_SERVE_HAPROXY_SOCKET_TIMEOUT_S 15s -> 60s as a
safety margin for rare slow windows that isolation alone doesn't
cover.
- constants.py: RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S 120s -> 600s so
long-running requests (e.g. SSE streams, /long-runner/-style
endpoints) survive across reloads instead of being TCP-RST'd.
Thread isolation requires nbthread >= 2; HAProxyConfig.__post_init__
enforces this so a misconfigured template can't render `thread 2-1`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cket diagnostics Thread isolation didn't fix the admin-socket cascade. Logs from the post-isolation run show admin_wait consistently hitting the full 60s ceiling -- meaning the socket is unresponsive, not contended -- and rebind_slots variance got worse (single-threaded admin processing). The likely root cause is now: HAProxy in `-db` debug mode emits per backend-state-change / health-check-failure lines on stderr. The proxy actor leaves stderr=PIPE without an active reader, so the 64KB pipe buffer fills under sustained churn and HAProxy blocks on its next stderr write -- which freezes the admin socket (same proc) and looks identical to a deadlock from outside. - haproxy_templates.py: remove `thread 1` from stats sockets and `thread 2-N` from frontend/stats binds, restoring HAProxy's default thread distribution (all 4 threads handle admin work in parallel). - haproxy.py: drop nbthread>=2 validation from HAProxyConfig. - haproxy.py: spawn `_drain_proc_stderr` after each successful start to copy HAProxy stderr to a per-pid log file (`<socket_path>.stderr.<pid>.log`), keeping the OS pipe buffer drained. Bounded to 10MB with truncate-on-roll. - haproxy.py: on admin-socket-wait timeout, dump a snapshot (`_dump_admin_socket_diagnostics`) of the socket path's filesystem state, the new proc's `/proc/<pid>/status` (State: line shows Sleeping vs Running -- the smoking gun for a stderr write block), and a tail of the per-pid stderr log so the next run carries its own diagnostic evidence. - constants.py: trim the SOCKET_TIMEOUT_S comment that referenced the now-removed thread isolation. Keep 60s timeout and 600s hard-stop-after; both are independently useful. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pool Three changes that address the residual /cpu-fanout/, /heavy-payload/, and /long-runner/ errors once HAProxy itself was stable: - constants.py: default `RAY_SERVE_HAPROXY_TIMEOUT_SERVER_S` to 3600 (matches timeout_client_s) and `RAY_SERVE_HAPROXY_TIMEOUT_CONNECT_S` to 5. Both were `None` before -- the rendered defaults block was emitting blank lines where `timeout server` and `timeout connect` should have been, leaving HAProxy on its undefined-behavior fallback for both. `timeout server` unset is particularly suspect during graceful reload edges. - constants.py: bump `RAY_SERVE_HAPROXY_MIN_SLOTS_PER_BACKEND` from 16 to 32. With `inter 1s fall 2`, a slot can be DOWN for ~2s after a replica goes away; backends with only 16 slots (like /heavy-payload/) can briefly have zero UP servers under churn and serve 503 to whatever requests arrive in that window. 32 gives headroom for small-pool backends without significantly affecting total budget. - haproxy_templates.py: drop `option abortonclose`. It RSTs the backend connection any time the client's TCP read-half goes quiet -- fine for sub-second GETs but catastrophic for endpoints that hold a connection open for tens of seconds (NAT idle reset, keep-alive timer, locust pause, all trigger spurious aborts that surface as 502 to the caller, normalized to 500 by errorfile). Tests: golden config in test_haproxy_api.py updated to drop the `option abortonclose` line; render verified to match exactly. Per- backend timeout/diff tests in test_haproxy_diff.py and the slot-split unit tests in test_haproxy_slots.py pass timeouts/min_per_backend explicitly, so the new global defaults don't affect them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…POptions
For long-running endpoints (e.g. /long-runner/ with med 73s requests),
two changes that together reduce the rate of mid-request kills during
graceful reload:
- constants.py: bump RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S 600s -> 1800s.
After `-sf`, the old proc is force-killed when hard-stop fires. A
request in flight on a near-end-of-life old proc gets RST'd, which
the replica logs as HTTP 499 (client closed before response). With
1800s, requests up to ~30 minutes have a wide margin to complete
even if they ride on a near-deadline proc.
- constants.py + haproxy.py: introduce
RAY_SERVE_HAPROXY_TIMEOUT_HTTP_KEEP_ALIVE_S (default 60s), promote
`timeout_http_keep_alive_s` from a property reading
`http_options.keep_alive_timeout_s` to a real field on HAProxyConfig
defaulting to the new env var. The old setup conflated two unrelated
timeouts:
* HTTPOptions.keep_alive_timeout_s -- uvicorn keep-alive on the
*replica* side (per Ray Serve's existing semantics)
* HAProxy keep-alive -- HAProxy <-> client idle timeout, used for
forcing locust/client TCP connections to rotate off old procs
after a reload
60s in HAProxy doesn't kill active long-running requests (those
are bounded by `timeout client`/`timeout server`, both 3600s). It
only bounds idle time between requests on a keep-alive connection,
which encourages clients to reopen connections and land on the new
proc.
Test: test_haproxy_api.py now passes timeout_http_keep_alive_s=55 to
HAProxyConfig directly rather than via HTTPOptions; the rendered
template still matches the golden config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 6 total unresolved issues (including 5 from previous reviews).
Reviewed by Cursor Bugbot for commit 8bb9997. Configure here.
| server-state-base /tmp/haproxy-serve | ||
| server-state-file /tmp/haproxy-serve/server-state | ||
| hard-stop-after 120s | ||
| hard-stop-after 600s |
There was a problem hiding this comment.
Test expects wrong hard-stop-after value (600 vs 1800)
Medium Severity
The test's HAProxyConfig stub does not explicitly set hard_stop_after_s, so it inherits the default RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S. This constant was changed from 120 to 1800 in this PR, but the expected config string says hard-stop-after 600s. The actual rendered value will be 1800s, causing the full-config comparison assertion to fail.
Reviewed by Cursor Bugbot for commit 8bb9997. Configure here.


Summary
Add
option redispatchandretry-on conn-failure empty-responseto each generated HAProxy backend block (python/ray/serve/_private/haproxy_templates.py). This makes the existing defaultretries 3actually pick a different server on retry, and extends retry coverage to the safe failure modes (connect failure, empty response).Motivation
During load testing of a multi-app Serve service with HAProxy ingress, we observed a tail of HTTP 5xx returned to clients despite zero 5xx coming from upstream replicas. The HAProxy Prometheus metrics showed:
haproxy_backend_http_responses_total{code="5xx"}≈ 0 (replicas weren't returning 5xx)haproxy_frontend_http_responses_total{code="5xx"}> 0 (clients saw 5xx)Root cause: when a replica enters the window between dying and being marked DOWN by HAProxy's health check, requests still get dispatched to it. HAProxy's default
retries 3retries against the same dead server (no redispatch), exhausts retries, falls through to theSERVE_PROXY_ACTOR-fallbackbackup, and ultimately surfaces a synthesized 502 (which the existingerrorfile 502 → 500.httpconfig maps to a 500 visible to the client).option redispatchmakes the retries pick a different healthy server, eliminating most of this tail.retry-on conn-failure empty-responseextends retry coverage from "connect-only" (default) to also handle the case where the connection succeeded but the server died before responding.Why these specific options
option redispatchretries 3from useless-on-same-dead-server to actually useful.retry-on conn-failureretry-on empty-responseThe two retry conditions chosen are the body-stream-safe subset that don't risk double-execution of side-effecting application logic.
🤖 Generated with Claude Code