Skip to content

Commit dba6ba5

Browse files
author
Ignacio Van Droogenbroeck
committed
docs(arc-enterprise): document cluster.query_gate_on_catchup setting
Adds the new opt-in catch-up query gate (#392 in basekick-labs/arc, landing in 26.06.1) to the Clustering & High Availability page: - TOML and env-var entries in the existing config blocks - New "Query Gating During Replication Catch-Up" section covering when to enable it, the readiness predicate, affected endpoints, the 503 response shape with catchup_status fields, Retry-After header, observability (Prometheus counter, sampled WARN log, /api/v1/cluster status), the known sub-millisecond FSM-apply race, and Pattern A vs B applicability.
1 parent f6d6d16 commit dba6ba5

1 file changed

Lines changed: 77 additions & 0 deletions

File tree

docs-arc-enterprise/configuration/clustering.md

Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -47,6 +47,7 @@ coordinator_addr = ":9000" # Address for inter-node communication
4747
health_check_interval = 10 # Health check interval (seconds)
4848
heartbeat_interval = 5 # Heartbeat interval (seconds)
4949
replication_enabled = true # Enable WAL replication to readers
50+
query_gate_on_catchup = false # See "Query Gating During Replication Catch-Up" below
5051
```
5152

5253
### Environment Variables
@@ -61,8 +62,84 @@ ARC_CLUSTER_COORDINATOR_ADDR=:9000
6162
ARC_CLUSTER_HEALTH_CHECK_INTERVAL=10
6263
ARC_CLUSTER_HEARTBEAT_INTERVAL=5
6364
ARC_CLUSTER_REPLICATION_ENABLED=true
65+
ARC_CLUSTER_QUERY_GATE_ON_CATCHUP=false
6466
```
6567

68+
## Query Gating During Replication Catch-Up
69+
70+
In a [local-storage cluster](/arc-enterprise/configuration/deployment-patterns) with peer replication, a reader node may serve queries before its background puller has finished pulling all the Parquet files the cluster manifest references. Without gating, those queries silently return partial results: the manifest knows about the missing files, but `read_parquet()` globs against local storage and only finds what's already on disk. WAL replication (added in 26.05.1) closes part of this gap for unflushed writer data, but flushed Parquet files still depend on the asynchronous puller.
71+
72+
`cluster.query_gate_on_catchup` (added in 26.06.1, off by default) closes the remaining gap. When enabled, all user-facing read endpoints return `503 Service Unavailable` until peer file replication has fully converged on this node.
73+
74+
:::tip When to enable it
75+
Turn this on if you'd rather a reader return 503 for a few seconds at startup than serve incomplete results. Leave it off if your application can tolerate eventual consistency during catch-up and you'd rather queries always succeed (the existing pre-26.06.1 behavior). Either choice is defensible; this is a correctness-vs-availability knob.
76+
:::
77+
78+
### What "fully converged" means
79+
80+
A node is considered ready when **all** of the following are true:
81+
82+
1. The startup catch-up walker has finished its pass over the manifest.
83+
2. No pulls are in-flight (queue and worker set both empty).
84+
3. No pulls have failed after retries since the puller started.
85+
4. No pulls have been dropped due to queue saturation since the puller started.
86+
87+
Failed and dropped pulls indicate files the manifest promised but this reader does not have. Re-converging requires either restarting the node (re-runs catch-up) or a new FSM callback re-enqueueing the missing path. Both `failed` and `dropped` counts are surfaced in the 503 response body so operators can see when this happens.
88+
89+
### Endpoints affected
90+
91+
When the gate is enabled and the node is still catching up, these endpoints return 503:
92+
93+
- `POST /api/v1/query`
94+
- `POST /api/v1/query/arrow`
95+
- `POST /api/v1/query/estimate`
96+
- `GET /api/v1/query/:measurement`
97+
- `GET /api/v1/measurements`
98+
99+
Internal endpoints (cache invalidation, cluster status, replication-control APIs) are deliberately **not** gated — peer nodes need them to fire during catch-up.
100+
101+
### 503 response shape
102+
103+
```json
104+
{
105+
"success": false,
106+
"error": "replication_catch_up_in_progress",
107+
"message": "Reader is still catching up on replicated files. Retry shortly or check /api/v1/cluster for catch-up progress.",
108+
"catchup_status": {
109+
"started_at": 1714912800,
110+
"completed_at": 0,
111+
"entries_walked": 1287,
112+
"enqueued": 1287,
113+
"queue_depth": 7,
114+
"inflight_count": 2,
115+
"pulled": 1278,
116+
"failed": 0,
117+
"dropped": 0
118+
}
119+
}
120+
```
121+
122+
A `Retry-After: 5` header is also set so HTTP-aware load balancers and clients can back off automatically.
123+
124+
`completed_at = 0` means the catch-up walker is still enumerating; once it flips non-zero, watch `queue_depth + inflight_count` go to zero. Non-zero `failed` or `dropped` means the gate will not clear without a node restart or a follow-up FSM callback.
125+
126+
### Observability
127+
128+
- **Cumulative gate fires**: `QueryHandler.QueryGate503Total()` is exposed for Prometheus / metrics scrapes. Alert on a non-zero rate to detect that the gate is firing without inferring from generic HTTP error logs.
129+
- **Sampled log line**: while the gate is active, Arc emits at most one `WARN` log per second with the gate counter and request path. Avoids flooding under sustained catch-up while still surfacing the degraded state.
130+
- **Live status**: the `/api/v1/cluster` endpoint exposes `replication_catchup_status` with the same fields shown in the 503 body, so dashboards can show catch-up progress without waiting for a query to fail.
131+
132+
### Known limitation
133+
134+
There is a sub-millisecond window between the Raft FSM committing a `RegisterFile` entry and the puller's `Enqueue` callback firing. A query landing in that window can observe `ReplicationReady() == true` while a manifest entry from the same Raft commit is not yet in the in-flight set. Closing this gap requires a per-query Raft `LastApplied()` barrier on the query path, which is out of scope for this gate.
135+
136+
The gate's contract is *"every file the puller has observed has been pulled,"* not *"every file the manifest currently contains has been pulled."* In practice this means the gate may unblock a fraction of a second before the very last files committed before the gate-clear are queryable. This is a tracked follow-up.
137+
138+
### Pattern A vs. Pattern B
139+
140+
- **Shared object storage** (Pattern A): the puller is disabled (`replication_enabled = false`), so `query_gate_on_catchup` is effectively a no-op — readers see the bucket directly and don't need to catch up. Safe to leave the flag at any value.
141+
- **Local storage with peer replication** (Pattern B): this is where the gate matters. Enable it on readers whose application cannot tolerate partial results during cold start or after a network partition.
142+
66143
## Deployment Example
67144

68145
A minimal 3-node cluster with one writer and two readers using Docker Compose:

0 commit comments

Comments
 (0)