You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Adds the new opt-in catch-up query gate (#392 in basekick-labs/arc, landing
in 26.06.1) to the Clustering & High Availability page:
- TOML and env-var entries in the existing config blocks
- New "Query Gating During Replication Catch-Up" section covering when to
enable it, the readiness predicate, affected endpoints, the 503 response
shape with catchup_status fields, Retry-After header, observability
(Prometheus counter, sampled WARN log, /api/v1/cluster status), the
known sub-millisecond FSM-apply race, and Pattern A vs B applicability.
In a [local-storage cluster](/arc-enterprise/configuration/deployment-patterns) with peer replication, a reader node may serve queries before its background puller has finished pulling all the Parquet files the cluster manifest references. Without gating, those queries silently return partial results: the manifest knows about the missing files, but `read_parquet()` globs against local storage and only finds what's already on disk. WAL replication (added in 26.05.1) closes part of this gap for unflushed writer data, but flushed Parquet files still depend on the asynchronous puller.
71
+
72
+
`cluster.query_gate_on_catchup` (added in 26.06.1, off by default) closes the remaining gap. When enabled, all user-facing read endpoints return `503 Service Unavailable` until peer file replication has fully converged on this node.
73
+
74
+
:::tip When to enable it
75
+
Turn this on if you'd rather a reader return 503 for a few seconds at startup than serve incomplete results. Leave it off if your application can tolerate eventual consistency during catch-up and you'd rather queries always succeed (the existing pre-26.06.1 behavior). Either choice is defensible; this is a correctness-vs-availability knob.
76
+
:::
77
+
78
+
### What "fully converged" means
79
+
80
+
A node is considered ready when **all** of the following are true:
81
+
82
+
1. The startup catch-up walker has finished its pass over the manifest.
83
+
2. No pulls are in-flight (queue and worker set both empty).
84
+
3. No pulls have failed after retries since the puller started.
85
+
4. No pulls have been dropped due to queue saturation since the puller started.
86
+
87
+
Failed and dropped pulls indicate files the manifest promised but this reader does not have. Re-converging requires either restarting the node (re-runs catch-up) or a new FSM callback re-enqueueing the missing path. Both `failed` and `dropped` counts are surfaced in the 503 response body so operators can see when this happens.
88
+
89
+
### Endpoints affected
90
+
91
+
When the gate is enabled and the node is still catching up, these endpoints return 503:
92
+
93
+
-`POST /api/v1/query`
94
+
-`POST /api/v1/query/arrow`
95
+
-`POST /api/v1/query/estimate`
96
+
-`GET /api/v1/query/:measurement`
97
+
-`GET /api/v1/measurements`
98
+
99
+
Internal endpoints (cache invalidation, cluster status, replication-control APIs) are deliberately **not** gated — peer nodes need them to fire during catch-up.
100
+
101
+
### 503 response shape
102
+
103
+
```json
104
+
{
105
+
"success": false,
106
+
"error": "replication_catch_up_in_progress",
107
+
"message": "Reader is still catching up on replicated files. Retry shortly or check /api/v1/cluster for catch-up progress.",
108
+
"catchup_status": {
109
+
"started_at": 1714912800,
110
+
"completed_at": 0,
111
+
"entries_walked": 1287,
112
+
"enqueued": 1287,
113
+
"queue_depth": 7,
114
+
"inflight_count": 2,
115
+
"pulled": 1278,
116
+
"failed": 0,
117
+
"dropped": 0
118
+
}
119
+
}
120
+
```
121
+
122
+
A `Retry-After: 5` header is also set so HTTP-aware load balancers and clients can back off automatically.
123
+
124
+
`completed_at = 0` means the catch-up walker is still enumerating; once it flips non-zero, watch `queue_depth + inflight_count` go to zero. Non-zero `failed` or `dropped` means the gate will not clear without a node restart or a follow-up FSM callback.
125
+
126
+
### Observability
127
+
128
+
-**Cumulative gate fires**: `QueryHandler.QueryGate503Total()` is exposed for Prometheus / metrics scrapes. Alert on a non-zero rate to detect that the gate is firing without inferring from generic HTTP error logs.
129
+
-**Sampled log line**: while the gate is active, Arc emits at most one `WARN` log per second with the gate counter and request path. Avoids flooding under sustained catch-up while still surfacing the degraded state.
130
+
-**Live status**: the `/api/v1/cluster` endpoint exposes `replication_catchup_status` with the same fields shown in the 503 body, so dashboards can show catch-up progress without waiting for a query to fail.
131
+
132
+
### Known limitation
133
+
134
+
There is a sub-millisecond window between the Raft FSM committing a `RegisterFile` entry and the puller's `Enqueue` callback firing. A query landing in that window can observe `ReplicationReady() == true` while a manifest entry from the same Raft commit is not yet in the in-flight set. Closing this gap requires a per-query Raft `LastApplied()` barrier on the query path, which is out of scope for this gate.
135
+
136
+
The gate's contract is *"every file the puller has observed has been pulled,"* not *"every file the manifest currently contains has been pulled."* In practice this means the gate may unblock a fraction of a second before the very last files committed before the gate-clear are queryable. This is a tracked follow-up.
137
+
138
+
### Pattern A vs. Pattern B
139
+
140
+
-**Shared object storage** (Pattern A): the puller is disabled (`replication_enabled = false`), so `query_gate_on_catchup` is effectively a no-op — readers see the bucket directly and don't need to catch up. Safe to leave the flag at any value.
141
+
-**Local storage with peer replication** (Pattern B): this is where the gate matters. Enable it on readers whose application cannot tolerate partial results during cold start or after a network partition.
142
+
66
143
## Deployment Example
67
144
68
145
A minimal 3-node cluster with one writer and two readers using Docker Compose:
0 commit comments