Commit bbcc1e3

author
Ignacio Van Droogenbroeck
committed
docs(arc-enterprise): correct query_gate_on_catchup semantics
Updates the "Query Gating During Replication Catch-Up" section to reflect the corrected gate behavior after gemini review on PR #419:

- Predicate is scoped to the startup catch-up batch only, not all puller activity. Adds context explaining why (a naive "wait for everything to settle" would 503 every flush in a busy cluster).
- Replaces the four-point predicate to use the catch-up-scoped counters (catchup_inflight, catchup_failed, catchup_dropped) and clarifies that steady-state failures don't keep the gate red.
- New :::warning admonition for the replication_catchup_enabled=false conflict — Arc auto-disables the gate with a WARN at startup rather than locking the node into permanent 503.
- Updated 503 body shape to match the new fields exposed by ReplicationCatchUpStatus().
1 parent dba6ba5 commit bbcc1e3

1 file changed

Lines changed: 16 additions & 7 deletions

docs-arc-enterprise/configuration/clustering.md

@@ -77,14 +77,22 @@ Turn this on if you'd rather a reader return 503 for a few seconds at startup th
 
 ### What "fully converged" means
 
+The gate is scoped to the **startup catch-up batch only** — not to all pull activity on the node. This distinction matters: in a busy cluster, steady-state ingest constantly puts new files in flight, and a naive "wait for everything to settle" predicate would mean the reader returns 503 every few seconds in normal operation. The gate's job is *"the reader has finished bootstrapping its view of the manifest as of startup,"* not *"no pulls are happening anywhere right now."*
+
 A node is considered ready when **all** of the following are true:
 
 1. The startup catch-up walker has finished its pass over the manifest.
-2. No pulls are in-flight (queue and worker set both empty).
-3. No pulls have failed after retries since the puller started.
-4. No pulls have been dropped due to queue saturation since the puller started.
+2. No paths the walker tagged are still in flight (`catchup_inflight == 0`). Steady-state pulls from reactive FSM callbacks are deliberately excluded.
+3. No catch-up-batch pulls failed after retries (`catchup_failed == 0`).
+4. No catch-up-batch pulls were dropped due to queue saturation (`catchup_dropped == 0`).
+
+Failures and drops outside the catch-up window do **not** keep the gate red. They're operational concerns surfaced via puller stats but not correctness blockers — by the time the catch-up batch has settled, the reader has reconciled its view of the manifest as of walker start. Steady-state failures are handled by reactive FSM callbacks (which re-enqueue), the [Phase 5 reconciler](/arc-enterprise/configuration/clustering), and operator alerting via the cumulative `failed` / `dropped` counters.
 
-Failed and dropped pulls indicate files the manifest promised but this reader does not have. Re-converging requires either restarting the node (re-runs catch-up) or a new FSM callback re-enqueueing the missing path. Both `failed` and `dropped` counts are surfaced in the 503 response body so operators can see when this happens.
+If a catch-up-batch failure or drop happens, the gate stays red until the node restarts (re-runs catch-up) or a reactive FSM callback successfully re-enqueues the same path. Both `catchup_failed` and `catchup_dropped` are surfaced in the 503 body so operators see exactly what happened.
+
+:::warning Combining with `replication_catchup_enabled=false`
+If you set `cluster.replication_catchup_enabled=false` (the emergency off-switch for pathologically large manifests), the catch-up walker never runs and the gate would never clear. Arc detects this combination at startup, logs a `WARN`, and **auto-disables the gate** so the node isn't permanently 503'd. Operators see a clear log line and can fix the configuration at their leisure. Don't enable the gate if you've also disabled the walker.
+:::
 
 ### Endpoints affected
 
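Reviewer note: the corrected four-point predicate and the startup conflict handling described in this hunk can be sketched roughly as follows. This is an illustrative Python model, not Arc's implementation. The counter names (`catchup_inflight`, `catchup_failed`, `catchup_dropped`, `queue_depth`, `inflight_count`, `completed_at`) come from the 503 body in this commit; the function names, the reading of `completed_at == 0` as "walker not finished", the timestamp value, and the wording of the warning are assumptions made for the example.

```python
from typing import Mapping, Optional, Tuple


def catchup_converged(status: Mapping[str, int]) -> bool:
    """Readiness predicate scoped to the startup catch-up batch only."""
    # Assumption: completed_at stays 0 until the walker finishes its pass.
    walker_done = status["completed_at"] != 0
    return (
        walker_done
        and status["catchup_inflight"] == 0
        and status["catchup_failed"] == 0
        and status["catchup_dropped"] == 0
    )
    # queue_depth and inflight_count are deliberately not consulted:
    # steady-state pulls must not keep the gate red.


def effective_gate(gate_enabled: bool, catchup_enabled: bool) -> Tuple[bool, Optional[str]]:
    """Resolve the gate setting at startup; returns (gate_on, warning)."""
    if gate_enabled and not catchup_enabled:
        # The walker never runs, so the gate could never clear.
        # Hypothetical message text; the source only says a WARN is logged.
        return False, "WARN: query gate disabled because catch-up is disabled"
    return gate_enabled, None


# Status resembling the 503 body in this commit: walker still running.
startup_status = {
    "completed_at": 0, "catchup_inflight": 2, "catchup_failed": 0,
    "catchup_dropped": 0, "queue_depth": 7, "inflight_count": 2,
}

# Hypothetical later status: catch-up settled, steady-state pulls still busy.
steady_status = {
    "completed_at": 1700000000, "catchup_inflight": 0, "catchup_failed": 0,
    "catchup_dropped": 0, "queue_depth": 7, "inflight_count": 2,
}

gate_closed_at_startup = not catchup_converged(startup_status)
gate_open_in_steady_state = catchup_converged(steady_status)
```

The second status shows the point of the fix: a non-empty queue and in-flight steady-state pulls do not reopen the 503 window once the catch-up batch has settled.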

@@ -110,11 +118,12 @@ Internal endpoints (cache invalidation, cluster status, replication-control APIs
 "completed_at": 0,
 "entries_walked": 1287,
 "enqueued": 1287,
+"catchup_inflight": 2,
+"catchup_failed": 0,
+"catchup_dropped": 0,
 "queue_depth": 7,
 "inflight_count": 2,
-"pulled": 1278,
-"failed": 0,
-"dropped": 0
+"pulled": 1278
 }
 }
 ```
