docs(authentication): document cluster auth replication (Phase A, v26.06.1+)
Arc Enterprise v26.06.1 routes API token writes through the Raft FSM
so tokens propagate cluster-wide (a token created on any node is
valid on every node; revocation propagates cluster-wide within ~50
ms). This docs update reflects that.
**docs/configuration/authentication.md** — new "Cluster auth
replication (Enterprise)" section under "Bootstrap & Recovery":
- What replicates (tokens yes; RBAC tables in v26.07.1; SSO/audit
intentionally not).
- Eventual-consistency semantics (~50 ms convergence, customer SDKs
already retry on transient 401).
- Leader-only bootstrap banner: every cluster node calls
EnsureInitialToken on boot, only the Raft leader's proposal lands,
followers silently no-op. Operators previously had to scrape every
pod for the banner; now there's exactly one.
- Plaintext-secrecy invariant: only bcrypt hash + prefix go through
Raft; snapshot dumps don't contain plaintext.
- Six new Prometheus counters (arc_cluster_auth_apply_* +
arc_cluster_auth_rejected_total) with operator-facing semantics
(which divergence patterns mean what).
- No-auto-migration policy: pre-26.06.1 tokens remain valid only on
the node that issued them; operators re-issue via the API after
upgrade.
- Divergence-detection error log + remediation: the v26.06.1
materialiser refuses to overwrite a pre-existing AUTOINCREMENT row
that collides with a new cluster-stamped ID, surfacing the upgrade
hazard rather than silently diverging. Includes the sqlite3 cleanup
command.
- Required configuration: ARC_CLUSTER_SHARED_SECRET (32+ chars, same
on every node).
**docs/installation/kubernetes.md** — updated the "Get Your Admin
Token" section with an info callout for v26.06.1+ explaining the
leader-only banner behaviour. The existing `kubectl logs -l app=arc
| grep -i "admin"` selector still works — it just returns the
single banner from whichever pod won the election.
**docs/configuration/authentication.md**: 127 additions, 0 deletions
@@ -105,6 +105,133 @@ After recovering access:
If Arc restarts with `ARC_AUTH_FORCE_BOOTSTRAP=true` and the `arc-recovery` token already exists, it is a no-op. You still hold the token value you provided.
:::

## Cluster auth replication (Enterprise)

:::info Available since v26.06.1
Cluster-wide token replication is available in Arc Enterprise v26.06.1 and later. OSS / standalone deployments are unaffected: tokens stay in the local SQLite database, as they always have.
:::

Before v26.06.1, every Arc Enterprise cluster node carried its own SQLite auth DB. A token created via `POST /api/v1/auth/tokens` on the writer was **not** valid on the reader, because the reader's local SQLite never saw the row. Operators worked around this by pre-seeding `ARC_AUTH_BOOTSTRAP_TOKEN` with the same value on every node, but API-created tokens and revocations did not propagate. A revocation on the writer left the same token still valid on every reader for the lifetime of those reader processes.

v26.06.1 routes auth **writes** through the cluster's Raft consensus. Auth **reads** still hit the local cache, so there is no Raft round trip on every API call.

Replication is eventually consistent, typically converging in under 50 ms via Raft apply on local loopback or LAN. Customer SDKs already retry on a transient 401, so the brief window between leader commit and follower materialisation is invisible in normal usage.

A read-after-write barrier is **not** included in v26.06.1. If a real customer report surfaces the window, a follow-up will add a per-query `LastApplied()` barrier on the query path.

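
To make the convergence window concrete, here is a minimal client-side sketch of retrying on a transient 401. The function name `call_with_401_retry`, the attempt count, and the back-off values are assumptions for illustration only; they are not taken from any Arc SDK.

```python
import time
from typing import Callable


def call_with_401_retry(
    send: Callable[[], int],
    attempts: int = 3,
    backoff_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-401 status or attempts run out.

    send is any zero-argument callable returning an HTTP status code
    (hypothetical; a real SDK would wrap its own HTTP client here).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status = 401
    for attempt in range(attempts):
        status = send()
        if status != 401:
            return status
        if attempt < attempts - 1:
            # Wait roughly one convergence window before trying again.
            sleep(backoff_s * (attempt + 1))
    return status
```

The default back-off is chosen to match the roughly 50 ms convergence figure quoted above; a persistent 401 is still returned to the caller once the attempts are exhausted.
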
### Bootstrap banner now prints on the leader only

Before v26.06.1, every cluster node printed its own randomly-generated admin token at first start, so a 4-node boot produced 4 banners and 4 different admin tokens (each valid only on its own node).

From v26.06.1, every node still calls `EnsureInitialToken` on boot with its own random plaintext, but only the Raft leader's proposal lands cluster-wide. Followers receive the FSM's `"token name already exists"` rejection from the leader and return an empty plaintext to the caller, so **no banner is emitted on the losing nodes**. Each losing node's local SQLite still gets the winner's bcrypt hash + prefix via the FSM materialise callback, so every node converges on the same admin token.

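
The "first proposal wins, everyone else gets an empty plaintext" behaviour can be sketched with a toy model. This is a simplification under stated assumptions: `TokenFSM`, `ensure_initial_token`, `boot_cluster`, and the token name `admin` are hypothetical, the first proposer stands in for the Raft leader, and hashing and replication are omitted.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class TokenFSM:
    """Toy stand-in for the replicated FSM: maps token name -> proposer."""
    tokens: dict = field(default_factory=dict)

    def propose_create(self, name: str, proposer: str) -> bool:
        # The first proposal for a name lands; later ones are rejected,
        # mirroring the "token name already exists" rejection.
        if name in self.tokens:
            return False
        self.tokens[name] = proposer
        return True


def ensure_initial_token(fsm: TokenFSM, node: str, name: str = "admin") -> str:
    """Return the plaintext if this node's proposal landed, else ""."""
    plaintext = secrets.token_urlsafe(32)  # each node picks its own value
    return plaintext if fsm.propose_create(name, node) else ""


def boot_cluster(nodes: list) -> dict:
    """Boot every node in order; the first proposer plays the leader."""
    fsm = TokenFSM()
    return {node: ensure_initial_token(fsm, node) for node in nodes}
```

Only the node that receives a non-empty plaintext has anything to print, which is why a single banner appears for the whole cluster.
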
If you watch a 3-writer + 1-reader cluster boot, expect:
- 1 `Admin API token:` banner on the Raft leader's stderr
- 3 `INFO Deferring initial token bootstrap until cluster Raft proposer is wired (Phase A)` lines during startup (one per node)
- 1 `INFO Cluster auth state replication enabled — token writes now propagate via Raft` line per node after Raft elects a leader

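
If you collect startup logs from every node, a small helper like the one below can confirm that exactly one node printed the banner. It is only a sketch: `count_banners` and the shape of its input (a mapping from node name to captured log text) are assumptions; the marker string is the banner prefix quoted above.

```python
BANNER_MARKER = "Admin API token:"


def count_banners(logs_by_node: dict) -> dict:
    """Return {node: banner_count} for nodes whose logs contain the banner."""
    counts = {}
    for node, text in logs_by_node.items():
        n = sum(1 for line in text.splitlines() if BANNER_MARKER in line)
        if n:
            counts[node] = n
    return counts
```

A healthy v26.06.1+ boot should yield a single entry with a count of 1; several entries suggest the nodes fell back to per-node bootstrap.
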
### Security posture

Plaintext token values are **never** written to the Raft log. The proposer generates the token, hashes it with bcrypt locally, and only the hash + prefix go into the replicated payload. The plaintext is returned to the API caller out-of-band before any Raft work begins. Snapshot dumps don't contain plaintext either; a snapshot-grep test in the test suite verifies this.

Applier-side validation runs on every node before a token command lands in the FSM. An empty name, missing bcrypt hash, missing prefix, malformed permission string, or zero `created_at` causes the entry to be rejected on every node. The cluster-wide rejection counter (`arc_cluster_auth_rejected_total`) increments and the rejection is logged at `Error`.

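
The validation rules listed above can be sketched as a pure function. The field names on `CreateTokenCommand` and the permission-string format (comma-separated lowercase words) are assumptions made for illustration; the source only names the failure conditions, not the payload layout.

```python
import re
from dataclasses import dataclass

# Assumed format: comma-separated lowercase words such as "read,write".
PERMISSIONS_RE = re.compile(r"^[a-z_]+(,[a-z_]+)*$")


@dataclass
class CreateTokenCommand:
    name: str
    token_hash: str   # bcrypt hash only; plaintext never enters the payload
    prefix: str
    permissions: str
    created_at: int   # Unix timestamp; zero means "unset"


def validate(cmd: CreateTokenCommand) -> list:
    """Return a list of rejection reasons; empty means the command is valid."""
    errors = []
    if not cmd.name:
        errors.append("empty name")
    if not cmd.token_hash:
        errors.append("missing bcrypt hash")
    if not cmd.prefix:
        errors.append("missing prefix")
    if not PERMISSIONS_RE.match(cmd.permissions):
        errors.append("malformed permission string")
    if cmd.created_at == 0:
        errors.append("zero created_at")
    return errors
```

Because the check depends only on the command itself, every node reaches the same verdict for the same log entry, which is what lets the rejection happen consistently cluster-wide.
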
### Prometheus counters

Per node, on the `/metrics` endpoint:

```
arc_cluster_auth_apply_create_total
arc_cluster_auth_apply_update_total
arc_cluster_auth_apply_revoke_total
arc_cluster_auth_apply_delete_total
arc_cluster_auth_apply_rotate_total
arc_cluster_auth_rejected_total
```

In a healthy cluster every node sees the same monotonic count for each `apply_*` counter, because they all apply the same Raft log. Divergence across nodes is the key signal that one of them is missing applies (network partition, FSM stall).

`arc_cluster_auth_rejected_total` is the **security alerting signal**: non-zero growth means somebody is proposing tokens that fail applier-side validation (malformed payload, fuzz attempt, or a buggy client). Alert on growth, not on absolute value.

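
As a rough sketch of how these two signals could be checked from scraped values, the helpers below compare `apply_*` counters across nodes and compute growth of the rejected counter between two scrapes. The function names and the input shape (node name mapped to a dictionary of counter values) are assumptions; in practice this logic would more likely live in alerting rules.

```python
APPLY_PREFIX = "arc_cluster_auth_apply_"
REJECTED = "arc_cluster_auth_rejected_total"


def diverging_counters(samples: dict) -> dict:
    """samples maps node -> {counter_name: value}.

    Returns {counter_name: {node: value}} for every apply_* counter whose
    value is not identical on all nodes.
    """
    names = set()
    for counters in samples.values():
        names.update(n for n in counters if n.startswith(APPLY_PREFIX))
    diverging = {}
    for name in sorted(names):
        per_node = {node: c.get(name, 0) for node, c in samples.items()}
        if len(set(per_node.values())) > 1:
            diverging[name] = per_node
    return diverging


def rejected_growth(previous: dict, current: dict) -> dict:
    """Return {node: delta} for nodes whose rejected counter grew."""
    growth = {}
    for node, counters in current.items():
        before = previous.get(node, {}).get(REJECTED, 0)
        delta = counters.get(REJECTED, 0) - before
        if delta > 0:
            growth[node] = delta
    return growth
```

Scrapes taken at slightly different moments can differ briefly while an apply is in flight, so a real check would probably require the divergence to persist across several scrapes before alerting.
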
### Pre-existing tokens DO NOT auto-migrate

Tokens created on a pre-v26.06.1 node by the local-only API path remain valid **only on that node** after upgrade. The expected migration path is:

1. Upgrade every cluster node to v26.06.1.
2. Re-issue API tokens via `POST /api/v1/auth/tokens` after restart. New tokens are cluster-wide automatically.
3. Revoke the old per-node tokens via `POST /api/v1/auth/tokens/:id/revoke` (the revoke also propagates cluster-wide).

Bootstrap tokens set via `ARC_AUTH_BOOTSTRAP_TOKEN` are unaffected if the same value was used on every node (which the pre-v26.06.1 workaround required): the bytes match, so all nodes effectively share the same admin token already.

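
The revoke step in the migration path above could be scripted along these lines. This sketch only builds the requests; the helper name `build_revoke_requests`, the example base URL, and the `Authorization: Bearer` header scheme are assumptions, so check them against your deployment before sending anything.

```python
from urllib.request import Request


def build_revoke_requests(base_url: str, admin_token: str, token_ids: list) -> list:
    """Build (but do not send) one revoke request per old token ID."""
    requests = []
    for token_id in token_ids:
        url = f"{base_url.rstrip('/')}/api/v1/auth/tokens/{token_id}/revoke"
        req = Request(url, method="POST")
        # Assumption: admin auth is sent as a Bearer token.
        req.add_header("Authorization", f"Bearer {admin_token}")
        requests.append(req)
    return requests
```

Each request can then be passed to `urllib.request.urlopen`. Keeping construction separate from sending makes it easy to review the list of IDs first.
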
### Divergence detection

If a pre-v26.06.1 AUTOINCREMENT row in your local `auth.db` happens to share an ID with a new cluster-replicated token (Raft log indices land in the same `INTEGER` space), the cluster apply on that node will detect the collision and **refuse to overwrite** the pre-existing row. The cluster's in-memory FSM map remains authoritative; the local SQLite cache stays divergent until the operator resolves it.

You'll see an `Error`-level log line on the affected node:

```
ApplyCreateToken: id <N> already exists locally with different token (cluster<->local divergence; see upgrade notes for pre-26.06.1 tokens)
```

The `arc_cluster_auth_rejected_total` counter also increments, on that node only.

**Remediation**: drop the diverging rows from the local `auth.db`, or drop the whole local auth DB and let the FSM repopulate from the cluster's snapshot. Stop Arc on the affected node, then run:

```bash
sqlite3 /app/data/arc.db "DELETE FROM api_tokens WHERE id = <N>"
```

Then restart Arc; it will re-apply the cluster's authoritative state on the affected ID range.

Identical hash + name is treated as idempotent log replay (no-op, no error), so a normal cluster restart never surfaces this.

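
The three outcomes described in this section (fresh insert, idempotent replay, refused overwrite) can be illustrated with a small in-memory SQLite sketch. The table name `api_tokens` comes from the cleanup command above; the column names, the simplified schema, and the function names are assumptions, not the actual materialiser.

```python
import sqlite3


def apply_create_token(db: sqlite3.Connection, token_id: int, name: str, token_hash: str) -> str:
    """Materialise a replicated token into the local cache.

    Returns "inserted", "noop" (idempotent replay), or "diverged".
    """
    row = db.execute(
        "SELECT name, token_hash FROM api_tokens WHERE id = ?", (token_id,)
    ).fetchone()
    if row is None:
        db.execute(
            "INSERT INTO api_tokens (id, name, token_hash) VALUES (?, ?, ?)",
            (token_id, name, token_hash),
        )
        db.commit()
        return "inserted"
    if row == (name, token_hash):
        return "noop"      # same hash + name: treated as log replay
    return "diverged"      # a different row holds this ID: do not overwrite


def open_demo_db() -> sqlite3.Connection:
    """In-memory database with a simplified, assumed api_tokens schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE api_tokens ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, token_hash TEXT)"
    )
    return db
```

In this model the `"diverged"` result corresponds to the `Error`-level log line: the pre-existing row is left untouched until an operator removes it.
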
### `arcx`-style upgrade path

Operators of pre-v26.06.1 Enterprise clusters with many AUTOINCREMENT tokens may prefer to **drain the local auth DB** before re-joining:

1. Note down the names of any service tokens currently in active use.
4. Restart Arc with `ARC_AUTH_BOOTSTRAP_TOKEN` matching the cluster's admin token.
5. Re-issue the service tokens cluster-wide via the API on any leader-eligible node.

The cluster's FSM is the source of truth, so this drain-and-rejoin is non-destructive: the only state at risk is per-node tokens that weren't intended to be cluster-wide, and the release notes already document that they don't carry over.

### Required configuration

In addition to standard cluster mode (`cluster.enabled = true`, `cluster.raft_data_dir`, etc.), token replication requires:

```toml
[cluster]
shared_secret = "..."  # min 32 chars; same value on every node
```

The shared secret authenticates leader-forward HMAC for non-leader nodes proposing token writes. Without it, follower nodes refuse to forward auth proposals and the cluster falls back to OSS-mode bootstrap on every node (you'll see 4 banners again).

If nodes have different secrets, follower-to-leader forward-apply fails HMAC validation and token writes that originate on a non-leader silently fail. There is no graceful fallback, so operators must ensure the secret is identical across the cluster (e.g. via Kubernetes Secrets or environment-variable injection from a single source).
:::

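
To see why mismatched secrets make forwarded writes fail, here is a minimal HMAC sketch. It is an illustration only: the use of HMAC-SHA256, the hex encoding, and the function names `sign` and `verify` are assumptions; the source states only that a shared secret of at least 32 characters authenticates leader-forward HMAC.

```python
import hashlib
import hmac

MIN_SECRET_LEN = 32  # the documented minimum length


def sign(secret: str, payload: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for payload (digest choice is assumed)."""
    if len(secret) < MIN_SECRET_LEN:
        raise ValueError("shared secret must be at least 32 characters")
    return hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()


def verify(secret: str, payload: bytes, tag: str) -> bool:
    """Constant-time check that tag matches payload under secret."""
    return hmac.compare_digest(sign(secret, payload), tag)
```

A follower signing with one secret and a leader verifying with another will never agree on the tag, which matches the "no graceful fallback" behaviour described above.
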
## Token Management
All token management endpoints require **admin** authentication.
**docs/installation/kubernetes.md**: 4 additions, 0 deletions
@@ -53,6 +53,10 @@ You should see:
Copy this token immediately - you won't see it again!
:::

:::info Arc Enterprise cluster mode (v26.06.1+)
In a multi-pod Arc Enterprise cluster, **only one pod prints the banner** — the Raft leader that wins the bootstrap election. Other pods log `INFO Deferring initial token bootstrap until cluster Raft proposer is wired` during startup and then `INFO Cluster auth state replication enabled — token writes now propagate via Raft` once the leader is elected. The non-leader pods silently no-op the bootstrap (they get an "already exists" response from the leader's FSM) and converge on the leader's token via Raft. The `kubectl logs -l app=arc | grep -i "admin"` command above still works — it just returns the single banner from whichever pod won the election. See [Cluster auth replication](/docs/configuration/authentication#cluster-auth-replication-enterprise) for the full semantics, including token-propagation behaviour and the divergence-detection error log.