Commit 0f219e0

Author: Ignacio Van Droogenbroeck
docs(arc-enterprise): document catch-up gate self-heal
Updates the "Query Gating During Replication Catch-Up" section to reflect the self-heal recovery semantics added in PR #419 (gemini review pass 3). A catch-up failure no longer requires a process restart to clear: when a later pull succeeds for a previously-failed path (via a reactive FSM callback or a subsequent catch-up scan), catchup_failed is decremented and the gate re-opens automatically. Catch-up drops still require a restart or an operator retry; the section now explains why (a drop never takes an inflight slot, so there is no later success to attribute to it).
1 parent bbcc1e3 commit 0f219e0

1 file changed

Lines changed: 4 additions & 2 deletions

docs-arc-enterprise/configuration/clustering.md

@@ -86,9 +86,11 @@ A node is considered ready when **all** of the following are true:
3. No catch-up-batch pulls failed after retries (`catchup_failed == 0`).
4. No catch-up-batch pulls were dropped due to queue saturation (`catchup_dropped == 0`).

-Failures and drops outside the catch-up window do **not** keep the gate red. They're operational concerns surfaced via puller stats but not correctness blockers — by the time the catch-up batch has settled, the reader has reconciled its view of the manifest as of walker start. Steady-state failures are handled by reactive FSM callbacks (which re-enqueue), the [Phase 5 reconciler](/arc-enterprise/configuration/clustering), and operator alerting via the cumulative `failed` / `dropped` counters.
+Failures and drops outside the catch-up window do **not** keep the gate red. They're operational concerns surfaced via puller stats but not correctness blockers — by the time the catch-up batch has settled, the reader has reconciled its view of the manifest as of walker start. Steady-state failures are handled by reactive FSM callbacks (which re-enqueue), the Phase 5 reconciler, and operator alerting via the cumulative `failed` / `dropped` counters.

-If a catch-up-batch failure or drop happens, the gate stays red until the node restarts (re-runs catch-up) or a reactive FSM callback successfully re-enqueues the same path. Both `catchup_failed` and `catchup_dropped` are surfaced in the 503 body so operators see exactly what happened.
+**Self-heal**: a catch-up failure does not require a process restart to clear. When a later pull succeeds for a previously-failed path (a reactive FSM callback re-enqueueing the path after the underlying issue resolves, or a subsequent catch-up scan), `catchup_failed` is decremented and the gate re-opens automatically. Catch-up drops (`catchup_dropped`) do not self-heal in the same way: a drop means no inflight slot was ever taken, so there is no later success to attribute to the original drop. Recovery in that case requires either a node restart (which re-runs the walker) or an operator-initiated retry.
+
+Both `catchup_failed` and `catchup_dropped` are surfaced in the 503 body so operators see exactly what happened.
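
Reviewer note: the self-heal semantics in the added lines can be modeled roughly as below. This is a minimal sketch, not Arc's implementation. The class name `CatchupGate`, its methods, and the per-path set are all hypothetical; only the counter names `catchup_failed` and `catchup_dropped` come from the docs text. It assumes each failed path is counted once, so that one later success maps to exactly one decrement.

```python
from dataclasses import dataclass, field


@dataclass
class CatchupGate:
    """Toy model of the two catch-up counters (hypothetical, not Arc's code)."""

    catchup_failed: int = 0
    catchup_dropped: int = 0
    # Paths whose catch-up pull failed and has not yet succeeded on retry.
    _failed_paths: set = field(default_factory=set)

    def record_failure(self, path: str) -> None:
        # Count each failed path once so a later success maps to one decrement.
        if path not in self._failed_paths:
            self._failed_paths.add(path)
            self.catchup_failed += 1

    def record_drop(self) -> None:
        # A drop never took an inflight slot, so nothing is tracked per path.
        self.catchup_dropped += 1

    def record_success(self, path: str) -> None:
        # Self-heal: a later successful pull clears the earlier failure.
        if path in self._failed_paths:
            self._failed_paths.remove(path)
            self.catchup_failed -= 1

    def counters_clear(self) -> bool:
        # Covers only readiness conditions 3 and 4 from the list above.
        return self.catchup_failed == 0 and self.catchup_dropped == 0


gate = CatchupGate()
gate.record_failure("example/path")   # catch-up pull fails after retries
gate.record_success("example/path")   # a later pull succeeds: failure clears
gate.record_drop()                    # a drop has no path to attribute to
```

In this model a success can undo a failure because the failure is keyed by path, while a drop leaves no key behind, which is why only a restart or an operator retry can clear it.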

:::warning Combining with `replication_catchup_enabled=false`
If you set `cluster.replication_catchup_enabled=false` (the emergency off-switch for pathologically large manifests), the catch-up walker never runs and the gate would never clear. Arc detects this combination at startup, logs a `WARN`, and **auto-disables the gate** so the node isn't permanently 503'd. Operators see a clear log line and can fix the configuration at their leisure. Don't enable the gate if you've also disabled the walker.
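
Reviewer note: the startup check described in the warning could look something like the sketch below. It is an illustration under stated assumptions, not Arc's code. `replication_catchup_enabled` mirrors the key named in the text, but `query_gate_enabled`, `ClusterConfig`, `resolve_gate`, and the log wording are hypothetical names introduced for this example.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("arc.sketch")


@dataclass
class ClusterConfig:
    # Mirrors cluster.replication_catchup_enabled from the docs text.
    replication_catchup_enabled: bool = True
    # Hypothetical name for the gate toggle; the real key is not shown here.
    query_gate_enabled: bool = False


def resolve_gate(cfg: ClusterConfig) -> bool:
    """Return whether the gate should be active after the startup check."""
    if cfg.query_gate_enabled and not cfg.replication_catchup_enabled:
        # Without the walker the gate could never clear, so turn it off.
        logger.warning(
            "catch-up walker disabled; auto-disabling query gate "
            "so the node does not return 503 indefinitely"
        )
        return False
    return cfg.query_gate_enabled
```

Resolving the conflict at startup rather than at query time means the decision is made once and logged once, which matches the "clear log line" behavior the warning describes.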
