[4/N] [HAProxy stability] - coalesce controller broadcasts into a single reload#63623
[4/N] [HAProxy stability] - coalesce controller broadcasts into a single reload#63623harshit-anyscale wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a coalescing mechanism for HAProxy reloads to prevent excessive churn during rapid autoscaling events. It adds a configurable window, RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S, and implements a background task to batch update requests. Feedback indicates that the current retry logic is ineffective because the backend update method is synchronous and does not await the actual reload, meaning exceptions won't be caught as intended. Additionally, improvements were suggested regarding the use of standardized environment variable parsing and the removal of redundant exception information in log messages.
Under autoscaling churn the Serve controller fires `target_groups` and `fallback_targets` broadcasts independently, often only tens of ms apart. Without coalescing, each broadcast triggers its own config regeneration and graceful reload via `-sf`. HAProxyManager now marks state dirty on each broadcast and arms a single sleeping coalesce task; updates arriving during the sleep window are absorbed into the same pending apply. Window defaults to 100 ms via RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S; set to 0 to disable. If an apply fails, the task re-arms and retries on the next tick (up to 3 consecutive failures, then waits for a new broadcast — avoids busy-spinning on a persistent error). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9f1c879 to
4d487f0
Compare
Why are fallback_targets being broadcasted so frequently if it only include the head node serve proxy? We should fix that if that's the case. |
Add a serve_haproxy_update_latency_s histogram measuring seconds from the first coalesced controller broadcast to the HAProxy reload completing. It spans the coalesce window, any time queued behind an in-flight reload, and the reload itself — so a slow-updates investigation can see which part dominates. Latency is measured from the first broadcast since the last apply (_coalesce_armed_at), so coalesced broadcasts report one end-to-end sample, and a failed-then-retried apply still reflects the full delay. Addresses review request on ray-project#63623. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a serve_haproxy_update_latency_s histogram measuring seconds from the first coalesced controller broadcast to the HAProxy reload completing. It spans the coalesce window, any time queued behind an in-flight reload, and the reload itself — so a slow-updates investigation can see which part dominates. Latency is measured from the first broadcast since the last apply (_coalesce_armed_at), so coalesced broadcasts report one end-to-end sample, and a failed-then-retried apply still reflects the full delay. All broadcasts now run through the apply task; a coalesce window of 0 just skips the sleep (apply immediately) instead of taking a separate synchronous path. This keeps the latency metric recorded regardless of the coalesce setting — including when it is disabled for debugging. Addresses review request on ray-project#63623. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
b785566 to
702207a
Compare
my bad, target_groups are getting broadcasted very frequently, not fallback. thanks for pointing, change description as well. |
|
You're right — My description overstated it: the frequent broadcaster under churn is The repeated — Claude (Claude Code) |
_update_haproxy_backends previously returned create_task(_reload_haproxy()) and its only caller, _coalesce_and_apply, awaited that task. The task was leftover indirection from when the synchronous long-poll callbacks called this directly; coalescing moved the sync->async boundary up into _schedule_haproxy_update, so the only caller is now async. Make _update_haproxy_backends async and await _reload_haproxy() directly — same failure propagation and serialization, one less task and no orphaned reload if the coalesce task is cancelled. Addresses review feedback on ray-project#63623. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| self._update_pending: bool = False | ||
| self._coalesce_task: Optional[asyncio.Task] = None | ||
| self._coalesce_armed_at: Optional[float] = None | ||
| self._haproxy_update_latency = metrics.Histogram( |
There was a problem hiding this comment.
include the metrics in the monitoring docs
| "HAProxy reload completing. Includes the coalesce window, time " | ||
| "queued behind an in-flight reload, and the reload itself." | ||
| ), | ||
| boundaries=[0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30], |
There was a problem hiding this comment.
move boundaries in constants.py
Summary
Carved out of #63308 so the coalescing change can land independently of the stderr-redirect (#63621) and redispatch (#63622) changes in that PR.
Under autoscaling churn the controller's
target_groupsbroadcast fires frequently — its replica set changes on essentially every reconcile that adds or removes a replica — and each broadcast triggers its own config regeneration and graceful reload via-sf. Bothtarget_groupsandfallback_targetsare emitted from the same control-loop step, so when both change in one ~100 ms tick they reach the proxy microseconds apart and (pre-coalescing) cause two back-to-back reloads.Both broadcasts are already change-gated on the controller (
broadcast_*_if_changed), so this PR collapses genuine consecutive / same-tick changes into one reload — it is not suppressing redundant no-op broadcasts. (fallback_targetsin particular is near-static in steady state; it only changes when the fallback proxy's health flips or its actor is replaced.)This PR collapses adjacent broadcasts into a single reload:
HAProxyManagerkeeps two pieces of state:_update_pending: booland_coalesce_task: Optional[asyncio.Task]._update_pending = Trueand arms (or re-arms) a single sleeping coalesce task.RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S(default 100 ms; set to 0 to disable) and then runs one_update_haproxy_backends()call against whatever the latest state is.Failure handling. If the apply fails, the task re-arms itself and retries on the next tick. After 3 consecutive failures it stops re-arming and waits for the next broadcast to trigger fresh state — this prevents busy-spinning against a persistent error (e.g. HAProxy crashed and is stuck restarting) while still recovering automatically once a new broadcast lands.
Knob.
RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S(env var, float seconds). 100 ms by default — small enough that the worst-case extra latency from a single broadcast is bounded, large enough to catch the common back-to-back pattern observed during scale events. Set to0to opt out entirely.