
Commit 07bdc20

docs: add admin dashboard operator guide (P4) (#674)
P4 deliverable from `docs/design/2026_04_24_proposed_admin_dashboard.md` Section 8: a single self-contained operator-facing reference for the admin HTTP listener.

**Stacked on #669 (slice 2a) + #673 (slice 2b)** so the API-surface table can describe S3 write endpoints as shipped. Rebases cleanly onto main once those land.

## Sections

- **Quick start**: minimal loopback-dev invocation
- **Configuration reference**: required vs optional flags, with the rationale for each guard
- **Hard-error startup conditions**: the explicit cases where the process refuses to start (missing creds, partial TLS, non-loopback without TLS, role-list overlap)
- **TLS setup**: loopback / TLS / discouraged plaintext-non-loopback topologies
- **Roles**: read-only vs full + how live role re-validation works on every state-changing request
- **API surface**: full `/admin/api/v1/*` table including the slice 2 S3 write endpoints
- **Follower → leader forwarding**: what the SPA sees, the rolling-update caveat, and the election-period 503 + Retry-After contract
- **Audit log**: `admin_audit` slog shape and the `forwarded_from` field
- **Troubleshooting**: the common bring-up failures (missing creds, TLS hard-error, 401 ambiguity, 503 leader_unavailable, bucket_not_empty, placeholder bundle blank-screen)

## What is NOT in this PR

Section 8's P4 plan also called out "TLS, read-only role, CSRF" as deliverables; those are already implemented in main (see `config.go`'s `validateTLS` / `validateAccessKeyRoles`, the role gates in `DynamoHandler.principalForWrite` + `S3Handler.principalForWrite`, and the `CSRFDoubleSubmit` middleware). This doc stitches them into a single reference operators can land on without reading code.

## Test plan

- [x] Markdown renders cleanly (manual check)
- [x] Cross-references match real file paths (design doc, architecture overview, proto file)
- [x] Flag names match `main.go`'s flag definitions verbatim
- [x] The `bucket_not_empty` 409 response shape matches `S3Handler.writeBucketsError`
2 parents 71a8514 + 85df320 commit 07bdc20

1 file changed

Lines changed: 340 additions & 0 deletions

File tree

docs/admin.md

@@ -0,0 +1,340 @@
1+
# elastickv admin dashboard — operator guide
2+
3+
This document covers configuration and day-2 operation of the admin
4+
HTTP listener. Architecture and design rationale live in
5+
[docs/design/2026_04_24_proposed_admin_dashboard.md](design/2026_04_24_proposed_admin_dashboard.md);
6+
read that first if you're touching the code.
7+
8+
## What the admin dashboard is
9+
10+
A separate HTTP listener (default `127.0.0.1:8080`) that exposes a
11+
React SPA + JSON API for inspecting the cluster and managing
12+
DynamoDB tables / S3 buckets without having to construct SigV4
13+
requests. It is **disabled by default**: set `-adminEnabled` to turn
14+
it on.
15+
16+
The listener is independent of the data-plane DynamoDB
17+
(`-dynamoAddress`) and S3 (`-s3Address`) endpoints — credentials,
18+
TLS, and auth are configured separately.
19+
20+
## Quick start (loopback dev)
21+
22+
The minimum invocation that produces a working dashboard:
23+
24+
```sh
25+
./elastickv \
26+
-raftId=n1 -raftBootstrap \
27+
-dynamoAddress=127.0.0.1:8000 \
28+
-s3Address=127.0.0.1:9000 \
29+
-s3CredentialsFile=/path/to/creds.json \
30+
-adminEnabled \
31+
-adminSessionSigningKeyFile=/path/to/admin-hs256.b64 \
32+
-adminFullAccessKeys=AKIA_ADMIN
33+
```
34+
35+
Then open `http://127.0.0.1:8080/admin/` in a browser and log in
36+
with the access key + secret pair from the credentials file.
37+
38+
## Configuration reference
39+
40+
### Required when `-adminEnabled=true`
41+
42+
| Flag | Description |
43+
|---|---|
44+
| `-adminEnabled` | Master on/off switch. Default `false`. |
45+
| `-adminSessionSigningKey` *or* `-adminSessionSigningKeyFile` *or* `ELASTICKV_ADMIN_SESSION_SIGNING_KEY` | Cluster-shared base64-encoded HS256 key — **exactly 64 raw bytes** (88 base64 chars with standard padding, or 86 with `RawURLEncoding`). The validator rejects any other length at startup with a precise error message. **Must be the same on every node** — JWTs minted by node A are verified by node B during follower→leader forwarding, so a mismatch breaks the dashboard's read paths on follower nodes. The `*File` / env-var forms keep the secret out of `/proc/<pid>/cmdline`. |
46+
| `-s3CredentialsFile` | JSON file with at least one access key + secret key pair. Same file the S3 adapter uses for SigV4; the admin dashboard reuses it for login authentication. |
47+
| `-adminFullAccessKeys` *and/or* `-adminReadOnlyAccessKeys` | Comma-separated allow-lists. Only access keys listed here may log into the dashboard, even if their SigV4 secret validates against the credentials file. Keys must not appear in both lists. |
48+
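The key-length rule in the table above can be illustrated with a small Go sketch. It generates 64 random bytes and encodes them with standard padded base64, then shows a length check in the spirit of the startup validator. The helper names `newSigningKey` and `decodeSigningKey` are hypothetical and are not taken from the elastickv code base; whether the key file tolerates a trailing newline is not stated in this guide, so treat that as something to verify.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// keyLen is the raw key size this guide requires for the HS256 signing key.
const keyLen = 64

// newSigningKey (hypothetical helper) returns a fresh random key encoded
// as standard padded base64, which is 88 characters for 64 raw bytes.
func newSigningKey() (string, error) {
	raw := make([]byte, keyLen)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(raw), nil
}

// decodeSigningKey (hypothetical helper) tries the two encodings the guide
// mentions and rejects input that does not decode to exactly keyLen bytes.
func decodeSigningKey(s string) ([]byte, error) {
	for _, enc := range []*base64.Encoding{base64.StdEncoding, base64.RawURLEncoding} {
		if raw, err := enc.DecodeString(s); err == nil && len(raw) == keyLen {
			return raw, nil
		}
	}
	return nil, fmt.Errorf("signing key must decode to exactly %d bytes", keyLen)
}

func main() {
	key, err := newSigningKey()
	if err != nil {
		panic(err)
	}
	// Store the printed value in the file passed to -adminSessionSigningKeyFile
	// and distribute the same file to every node.
	fmt.Println(key)
}
```

Generating the key once and copying the same file to every node satisfies the "same on every node" requirement.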
49+
### Optional
50+
51+
| Flag | Description |
52+
|---|---|
53+
| `-adminListen` | host:port for the admin listener. Defaults to `127.0.0.1:8080`. |
54+
| `-adminTLSCertFile` / `-adminTLSKeyFile` | PEM cert + key. Both must be set together; a partial config fails validation at startup. |
55+
| `-adminAllowPlaintextNonLoopback` | Explicit opt-out for the non-loopback-without-TLS startup hard-error. **Strongly discouraged** — lets the listener accept plaintext on a non-loopback bind. **Does not** affect the cookie `Secure` attribute (that is `-adminAllowInsecureDevCookie` below); a deployment that sets only this flag will mint `Secure` cookies that the browser refuses to send over the plaintext channel, breaking session lifetime end-to-end. Pair it with `-adminAllowInsecureDevCookie` if the goal is a working plaintext rig. |
56+
| `-adminSessionSigningKeyPrevious` *or* `-adminSessionSigningKeyPreviousFile` *or* `ELASTICKV_ADMIN_SESSION_SIGNING_KEY_PREVIOUS` | Previous HS256 key used only for verification during a rotation window. New tokens always use the primary key; existing tokens minted under the previous key continue to verify until they expire. |
57+
| `-adminAllowInsecureDevCookie` | Mints session cookies without `Secure` for local plaintext development. Do not set on any deployment that touches a network. |
58+
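The rotation window described for `-adminSessionSigningKeyPrevious` amounts to "sign with the primary key, verify with either". The sketch below shows that idea on a bare HMAC-SHA256 signature rather than a full JWT. The function names are hypothetical, and the real verification path in elastickv may differ in detail.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// sign computes an HMAC-SHA256 tag, the primitive underneath HS256.
func sign(key, msg []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(msg)
	return m.Sum(nil)
}

// verifyWithRotation (hypothetical) accepts a tag produced under the primary
// key, or under the previous key when one is configured.
func verifyWithRotation(msg, sig, primary, previous []byte) bool {
	if hmac.Equal(sig, sign(primary, msg)) {
		return true
	}
	return len(previous) > 0 && hmac.Equal(sig, sign(previous, msg))
}

func main() {
	oldKey, newKey := []byte("old-key"), []byte("new-key")
	msg := []byte("header.payload")
	oldSig := sign(oldKey, msg) // token minted before the rotation

	fmt.Println(verifyWithRotation(msg, oldSig, newKey, oldKey)) // previous key configured
	fmt.Println(verifyWithRotation(msg, oldSig, newKey, nil))    // previous key dropped
}
```

Once every token minted under the old key has expired, the previous-key setting can be removed.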
59+
### Hard-error startup conditions
60+
61+
The process fails to start (non-zero exit) when:
62+
63+
- `-adminEnabled=true` but `-s3CredentialsFile` is empty or missing, or its parsed map has zero entries — without credentials every login is rejected, and "locked-down admin" is `-adminEnabled=false`.
64+
- `-adminEnabled=true` but `-adminSessionSigningKey` (and the `*File` / env var) all decode to empty.
65+
- `-adminEnabled=true` but `-adminListen` is empty or not a valid host:port.
66+
- `-adminTLSCertFile` xor `-adminTLSKeyFile` is set (partial TLS config).
67+
- `-adminListen` is bound to a non-loopback address, TLS is not configured, **and** `-adminAllowPlaintextNonLoopback` is not set. The error message names the flag combinations that resolve it.
68+
- `-adminFullAccessKeys` and `-adminReadOnlyAccessKeys` overlap (the same access key listed in both).
69+
70+
These are deliberate — silent fallbacks to "auth disabled" or "TLS
71+
off" would downgrade security guarantees the operator is unaware of.
72+
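Three of the conditions above (partial TLS, plaintext on a non-loopback bind, and role-list overlap) can be sketched as a single validation function. This is an illustration only: the `adminConfig` type and `validate` function are hypothetical, the error strings are not the ones the binary prints, and the sketch treats hostnames such as `localhost` as non-loopback, which may not match the real check.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// adminConfig is a hypothetical stand-in for the parsed admin flags.
type adminConfig struct {
	Listen                    string
	TLSCertFile, TLSKeyFile   string
	AllowPlaintextNonLoopback bool
	FullAccessKeys            []string
	ReadOnlyAccessKeys        []string
}

// isLoopback reports whether host:port binds a literal loopback IP.
func isLoopback(hostport string) (bool, error) {
	host, _, err := net.SplitHostPort(hostport)
	if err != nil {
		return false, err
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback(), nil
}

// validate returns the first hard error it finds, or nil.
func validate(c adminConfig) error {
	loop, err := isLoopback(c.Listen)
	if err != nil {
		return fmt.Errorf("invalid listen address %q: %w", c.Listen, err)
	}
	if (c.TLSCertFile == "") != (c.TLSKeyFile == "") {
		return errors.New("TLS cert and key must be set together")
	}
	tlsOn := c.TLSCertFile != ""
	if !loop && !tlsOn && !c.AllowPlaintextNonLoopback {
		return errors.New("non-loopback listener requires TLS or an explicit opt-out")
	}
	full := map[string]bool{}
	for _, k := range c.FullAccessKeys {
		full[k] = true
	}
	for _, k := range c.ReadOnlyAccessKeys {
		if full[k] {
			return fmt.Errorf("access key %q is listed in both role lists", k)
		}
	}
	return nil
}

func main() {
	fmt.Println(validate(adminConfig{Listen: "127.0.0.1:8080"}))
	fmt.Println(validate(adminConfig{Listen: "0.0.0.0:8080"}))
}
```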
73+
## TLS setup
74+
75+
Two supported topologies:
76+
77+
### A. Loopback only (`127.0.0.1` / `::1`)
78+
79+
No TLS required. By default the dashboard mints cookies with
80+
`Secure=true`, which most modern browsers accept on the loopback
81+
origin even without TLS (the loopback-is-trusted policy). If a
82+
specific browser refuses the cookie in this configuration, set
83+
`-adminAllowInsecureDevCookie` to mint without `Secure` — the flag
84+
is intentionally distinct from `-adminAllowPlaintextNonLoopback`
85+
because the listener can be plaintext for entirely separate
86+
reasons (loopback) than the cookie needing to drop `Secure`.
87+
88+
### B. Reachable address with TLS
89+
90+
Set `-adminListen` to the public bind, plus `-adminTLSCertFile` and
91+
`-adminTLSKeyFile`. TLS 1.2+ is enforced. Cookies are issued with
92+
`Secure; SameSite=Strict; HttpOnly`.
93+
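For readers who want to see what "TLS 1.2+" and `Secure; SameSite=Strict; HttpOnly` look like in Go's standard library, here is a minimal sketch. The function names and the cookie `Path` are assumptions made for illustration; only the cookie name `admin_session` and the attribute set come from this guide.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// tlsConfig returns a config that refuses anything older than TLS 1.2.
func tlsConfig() *tls.Config {
	return &tls.Config{MinVersion: tls.VersionTLS12}
}

// sessionCookie (hypothetical) builds the session cookie. insecureDev mirrors
// the effect of -adminAllowInsecureDevCookie: it drops the Secure attribute.
func sessionCookie(token string, insecureDev bool) *http.Cookie {
	return &http.Cookie{
		Name:     "admin_session",
		Value:    token,
		Path:     "/admin/", // assumed path, not confirmed by this guide
		HttpOnly: true,
		SameSite: http.SameSiteStrictMode,
		Secure:   !insecureDev,
	}
}

func main() {
	fmt.Println(tlsConfig().MinVersion == tls.VersionTLS12)
	fmt.Println(sessionCookie("example-token", false).String())
}
```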
94+
Cert renewal: the listener picks up the cert files at startup only;
95+
restart the process after rotating certs. Hot-reload is not
96+
implemented (out of scope for the dashboard's maintenance model).
97+
98+
### Discouraged: plaintext non-loopback
99+
100+
`-adminAllowPlaintextNonLoopback` exists as an escape hatch for
101+
short-lived test deployments. The session JWT and its bearer cookie
102+
travel in clear text in this mode; anyone on the path can replay
103+
the token until it expires. Do not enable on a long-running
104+
deployment.
105+
106+
A working plaintext rig also needs `-adminAllowInsecureDevCookie`;
107+
otherwise the dashboard mints cookies with `Secure=true` and the
108+
browser refuses to send them back over plaintext, so login appears
109+
to succeed but every subsequent request 401s. The two flags are
110+
deliberately separate so a misconfigured deployment fails closed
111+
on either axis (TLS guard or cookie attribute) rather than
112+
silently downgrading both at once.
113+
114+
## Roles
115+
116+
Two roles, both checked against the live `-adminFullAccessKeys` /
117+
`-adminReadOnlyAccessKeys` lists on **every** state-changing
118+
request (not just at login):
119+
120+
- **read-only** — may list / describe Dynamo tables and S3 buckets, view cluster status. Cannot create, mutate ACL, or delete.
121+
- **full** — adds POST / PUT / DELETE on `/dynamo/tables` and `/s3/buckets`.
122+
123+
A key removed from `-adminFullAccessKeys` loses write access on the
124+
first request after the node restarts with the updated flag; the dashboard does not wait for
125+
the token to expire. The token's role claim is treated as a hint;
126+
the live role index is authoritative.
127+
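The "claim is a hint, live index is authoritative" rule can be sketched as follows. The `roleIndex` type and `authorizeWrite` method are hypothetical names chosen for this illustration; the actual gates are the `principalForWrite` functions named in the commit description.

```go
package main

import "fmt"

type role int

const (
	roleNone role = iota
	roleReadOnly
	roleFull
)

// roleIndex maps access keys to the role configured on this node.
type roleIndex map[string]role

// newRoleIndex builds the index from the two allow-lists.
func newRoleIndex(full, readOnly []string) roleIndex {
	idx := roleIndex{}
	for _, k := range readOnly {
		idx[k] = roleReadOnly
	}
	for _, k := range full {
		idx[k] = roleFull
	}
	return idx
}

// authorizeWrite deliberately ignores the role claimed by the token and
// consults only the node's current index.
func (idx roleIndex) authorizeWrite(accessKey string, claimed role) error {
	if idx[accessKey] != roleFull {
		return fmt.Errorf("forbidden: %q has no full-access role on this node", accessKey)
	}
	return nil
}

func main() {
	idx := newRoleIndex([]string{"AKIA_ADMIN"}, []string{"AKIA_VIEWER"})
	fmt.Println(idx.authorizeWrite("AKIA_ADMIN", roleFull))
	// A token still claiming "full" does not help a key that is now read-only.
	fmt.Println(idx.authorizeWrite("AKIA_VIEWER", roleFull))
}
```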
128+
## API surface
129+
130+
All endpoints are under `/admin/api/v1/`. Authentication: cookie
131+
session minted by `POST /auth/login`; CSRF: double-submit token in
132+
`admin_csrf` cookie + `X-Admin-CSRF` header on every state-changing
133+
method.
134+
135+
| Method | Path | Role | Notes |
136+
|---|---|---|---|
137+
| `POST` | `/auth/login` | none | Body `{access_key, secret_key}`. Sets `admin_session` and `admin_csrf` cookies. |
138+
| `POST` | `/auth/logout` | any | Invalidates the session cookie. |
139+
| `GET` | `/cluster` | any | Node ID, Raft leader, version. |
140+
| `GET` | `/dynamo/tables` | any | Paginated list. `?limit=` (default 100, max 1000). |
141+
| `POST` | `/dynamo/tables` | full | Body schema in design 4.2. |
142+
| `GET` | `/dynamo/tables/{name}` | any | Schema + GSI summary. |
143+
| `DELETE` | `/dynamo/tables/{name}` | full | 204 on success. |
144+
| `GET` | `/s3/buckets` | any | Paginated list with the same `?limit=` semantics. |
145+
| `POST` | `/s3/buckets` | full | Body `{bucket_name, acl?}`. ACL omitted defaults to `private`. |
146+
| `GET` | `/s3/buckets/{name}` | any | Bucket meta + ACL. |
147+
| `PUT` | `/s3/buckets/{name}/acl` | full | Body `{acl}`. Only `private` and `public-read` are accepted. |
148+
| `DELETE` | `/s3/buckets/{name}` | full | 204 on success. The bucket must be empty (no objects); a non-empty bucket returns 409 `bucket_not_empty`. |
149+
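An operator script that drives the JSON API directly has to reproduce the double-submit step itself. The sketch below builds a state-changing request that sends the session cookies and copies the `admin_csrf` cookie value into the `X-Admin-CSRF` header. The helper name, the map-based cookie argument, and the `Content-Type` header are assumptions for illustration; in practice the cookie values would come from the `POST /auth/login` response.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"net/http"
)

// newWriteRequest (hypothetical) prepares a POST/PUT/DELETE for the admin API.
// cookies maps cookie names to the values returned by the login call.
func newWriteRequest(method, url string, body []byte, cookies map[string]string) (*http.Request, error) {
	csrf := cookies["admin_csrf"]
	if csrf == "" {
		return nil, errors.New("no admin_csrf cookie: log in first")
	}
	req, err := http.NewRequest(method, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	for name, value := range cookies {
		req.AddCookie(&http.Cookie{Name: name, Value: value})
	}
	// Double submit: the header must carry the same token as the cookie.
	req.Header.Set("X-Admin-CSRF", csrf)
	req.Header.Set("Content-Type", "application/json") // assumed content type
	return req, nil
}

func main() {
	cookies := map[string]string{"admin_session": "example-jwt", "admin_csrf": "example-token"}
	req, err := newWriteRequest(http.MethodPost,
		"http://127.0.0.1:8080/admin/api/v1/s3/buckets",
		[]byte(`{"bucket_name":"my-bucket"}`), cookies)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("X-Admin-CSRF"))
}
```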
150+
## Follower → leader forwarding
151+
152+
Writes (`POST` / `PUT` / `DELETE`) require the local node to be the
153+
Raft leader. When the SPA's request hits a follower, the dashboard
154+
transparently forwards the call to the leader over an internal
155+
gRPC service (`AdminForward`). The leader re-validates the
156+
principal against its own `-adminFullAccessKeys` list before
157+
acting — a follower cannot smuggle a downgraded key past the
158+
leader's view.
159+
160+
This means there is **no need to point the SPA at a specific
161+
node**: any node with `-adminEnabled` can serve the dashboard.
162+
Operators that fan out behind a load balancer get the same
163+
behaviour as a single-node cluster, with one caveat below.
164+
165+
### Follower forwarding caveat: rolling configuration changes
166+
167+
A configuration change (e.g. adding `AKIA_NEW` to
168+
`-adminFullAccessKeys`) must propagate to **every node** before
169+
the new key works against any follower's dashboard. During the
170+
rollout window:
171+
172+
- A login against a node that has not yet been restarted with the new flags fails with 403.
173+
- A token minted by an updated node, replayed against a not-yet-updated node, will be re-validated against that node's stale role list. If the key is missing on the older node, the request fails with 403 even though the token is structurally valid.
174+
175+
The dashboard does not have an automatic role-refresh path — restart
176+
each node after editing the access-key flags.
177+
178+
### Election-period 503
179+
180+
When the leader steps down mid-write (or has not yet been elected
181+
after a fresh start), the forwarder cannot reach a leader and the
182+
SPA receives `503 Service Unavailable` with a `Retry-After: 1`
183+
header. The current SPA client (`web/admin/src/api/client.ts`)
184+
makes a single `fetch` call with no automatic retry, so the user
185+
sees the 503 surfaced directly and must re-issue the action. The
186+
`Retry-After: 1` header is still emitted so a future client (or an
187+
external operator script driving the JSON API) can implement the
188+
one-second back-off the server is asking for. Operators
189+
investigating "intermittent 503s" should look at Raft leader-churn
190+
logs first.
191+
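For an external script that wants the back-off the server asks for, a retry loop along the following lines may be enough. The helper names are hypothetical, the one-second fallback for a missing or unparsable header is an assumption, and the caller remains responsible for deciding whether the action is safe to repeat.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryDelay reports whether a response is a retryable 503 and how long to
// wait. It falls back to one second when Retry-After is absent or malformed.
func retryDelay(status int, retryAfter string) (time.Duration, bool) {
	if status != http.StatusServiceUnavailable {
		return 0, false
	}
	secs, err := strconv.Atoi(retryAfter)
	if err != nil || secs < 0 {
		return time.Second, true
	}
	return time.Duration(secs) * time.Second, true
}

// doWithRetry issues the request built by newReq at most attempts times,
// sleeping between tries for as long as the server requested.
func doWithRetry(c *http.Client, newReq func() (*http.Request, error), attempts int) (*http.Response, error) {
	for i := 0; ; i++ {
		req, err := newReq()
		if err != nil {
			return nil, err
		}
		resp, err := c.Do(req)
		if err != nil {
			return nil, err
		}
		delay, retry := retryDelay(resp.StatusCode, resp.Header.Get("Retry-After"))
		if !retry || i+1 >= attempts {
			return resp, nil // the caller closes the body
		}
		resp.Body.Close()
		time.Sleep(delay)
	}
}

func main() {
	delay, retry := retryDelay(http.StatusServiceUnavailable, "1")
	fmt.Println(delay, retry)
}
```

Taking a request factory instead of a single `*http.Request` keeps the body readable on each attempt.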
192+
## Audit log
193+
194+
Every state-changing admin request emits structured slog lines at
195+
`INFO` level under the `admin_audit` key on the leader's stdout (or
196+
wherever the process slog handler is wired). A protected-chain
197+
mutation (Dynamo / S3 / cluster / keyviz writes) typically produces
198+
**two** audit lines: one operation-specific line from the source
199+
that performed the mutation, plus one generic HTTP-shaped line from
200+
the `Audit` middleware. Auth endpoints (`/auth/login`, `/auth/logout`)
201+
produce **one** line, the action-specific one from `AuthService`,
202+
because the generic middleware is intentionally not wrapped around
203+
them (see the per-shape section below for why). The shapes differ
204+
by source — log parsers should treat the `admin_audit` key as a
205+
union and dispatch on the fields present.
206+
207+
**`Audit` middleware** — emitted for non-GET/HEAD/OPTIONS requests
208+
on the **protected mux chain** (Dynamo, S3, cluster, keyviz) after
209+
`SessionAuth` accepts the session, but **before** `CSRFDoubleSubmit`
210+
runs. That ordering is deliberate: a CSRF-rejected protected
211+
request still produces an audit line because the actor is already
212+
known, but an unauthenticated request (no / invalid session) is
213+
rejected at `SessionAuth` and never reaches the middleware. The
214+
following endpoints are **not** wrapped by this middleware and rely
215+
on their own `admin_audit` emission instead:
216+
217+
- `/auth/login` — runs without a pre-existing session, so the
218+
generic middleware cannot identify the actor; `AuthService`
219+
emits `admin_audit action=login` (success and failure) directly.
220+
- `/auth/logout` — runs through `protectNoAudit` so logout produces
221+
exactly one `admin_audit action=logout` line from `AuthService`
222+
rather than two (a generic line plus the action-specific one).
223+
224+
For requests that *do* reach the middleware, the line is always
225+
present on the node that received the HTTP request — which may be
226+
a follower if the request was then forwarded:
227+
228+
```
229+
admin_audit actor=AKIA_ADMIN role=full method=POST path=/admin/api/v1/s3/buckets status=201 remote=10.0.0.7:51234 duration=8.2ms
230+
```
231+
232+
**`S3Handler` operation line** — emitted on the leader after a
233+
successful bucket mutation. Only the S3 admin path emits these; the
234+
DynamoDB admin path relies on the middleware line plus the forwarded
235+
line below for its audit trail:
236+
237+
```
238+
admin_audit actor=AKIA_ADMIN role=full operation=create_bucket bucket=my-bucket
239+
admin_audit actor=AKIA_ADMIN role=full operation=put_bucket_acl bucket=my-bucket acl=public-read
240+
admin_audit actor=AKIA_ADMIN role=full operation=delete_bucket bucket=my-bucket
241+
```
242+
243+
**`ForwardServer` operation line** — emitted on the leader when a
244+
follower forwarded the request via `AdminForward`. Carries the
245+
originating follower's node ID in `forwarded_from`. Covers both
246+
DynamoDB and S3 admin operations:
247+
248+
```
249+
admin_audit actor=AKIA_ADMIN role=full forwarded_from=n2 operation=create_table table=orders
250+
admin_audit actor=AKIA_ADMIN role=full forwarded_from=n2 operation=delete_table table=orders
251+
admin_audit actor=AKIA_ADMIN role=full forwarded_from=n2 operation=put_bucket_acl bucket=my-bucket acl=public-read
252+
```
253+
254+
CR and LF in `forwarded_from` are stripped at the entry point — a
255+
hostile follower cannot split a single audit line into two by
256+
smuggling control characters into its node ID.
257+
258+
Login and logout emit their own `admin_audit` lines so the JWT's
259+
lifetime can be correlated with the mutations it authorised. The
260+
two shapes differ on a single field — login carries `claimed_actor`
261+
because the access key the operator typed is distinct from the
262+
authenticated `actor` (a successful login proves they match; a
263+
failed login records what was claimed), while logout has no claim
264+
to verify and omits the field:
265+
266+
```
267+
admin_audit action=login actor=AKIA_ADMIN claimed_actor=AKIA_ADMIN remote=10.0.0.7:51234 status=200
268+
admin_audit action=logout actor=AKIA_ADMIN remote=10.0.0.7:51234 status=200
269+
```
270+
271+
Log parsers consuming this shape should treat `claimed_actor` as
272+
present-only-on-login.
273+
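The advice to treat `admin_audit` as a union and dispatch on the fields present can be sketched as a small classifier. It assumes the space-separated `key=value` rendering used in the examples above, with no quoted values containing spaces; a JSON slog handler would need a different parser. The function names and the category labels are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFields splits a line into key=value pairs and ignores bare words
// such as the leading "admin_audit".
func parseFields(line string) map[string]string {
	fields := map[string]string{}
	for _, tok := range strings.Fields(line) {
		if k, v, ok := strings.Cut(tok, "="); ok {
			fields[k] = v
		}
	}
	return fields
}

// classify dispatches on which fields are present, most specific first.
func classify(f map[string]string) string {
	switch {
	case f["action"] != "":
		return "auth" // login or logout line from AuthService
	case f["forwarded_from"] != "":
		return "forwarded" // ForwardServer line on the leader
	case f["operation"] != "":
		return "operation" // S3Handler operation line
	case f["method"] != "":
		return "http" // generic Audit middleware line
	default:
		return "unknown"
	}
}

func main() {
	lines := []string{
		"admin_audit action=login actor=AKIA_ADMIN claimed_actor=AKIA_ADMIN remote=10.0.0.7:51234 status=200",
		"admin_audit actor=AKIA_ADMIN role=full forwarded_from=n2 operation=create_table table=orders",
		"admin_audit actor=AKIA_ADMIN role=full operation=delete_bucket bucket=my-bucket",
	}
	for _, l := range lines {
		f := parseFields(l)
		fmt.Println(classify(f), f["actor"])
	}
}
```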
274+
## Troubleshooting
275+
276+
### "admin listener is enabled but no static credentials are configured"
277+
278+
Either `-s3CredentialsFile` is unset or the file parses to an empty
279+
map. Check the file exists and contains at least one entry:
280+
```json
281+
{"credentials":[{"access_key_id":"AKIA_ADMIN","secret_access_key":"..."}]}
282+
```
283+
284+
### "is not loopback but TLS is not configured"
285+
286+
Default-deny safety net. Either set `-adminTLSCertFile` +
287+
`-adminTLSKeyFile`, or pass `-adminAllowPlaintextNonLoopback` (and
288+
read the TLS section above before doing so).
289+
290+
### Login returns 401 invalid_credentials
291+
292+
The access key + secret pair did not match an entry in
293+
`-s3CredentialsFile`. Either the access key is unknown or the secret
294+
is wrong. Verify the credentials file is the one the running process
295+
loaded (it is read once at startup) and that the secret matches
296+
exactly: secrets are compared byte-for-byte with `subtle.ConstantTimeCompare`, so
297+
trailing whitespace counts.
298+
299+
### Login returns 403 forbidden
300+
301+
The credentials matched, but the access key is not listed in either
302+
`-adminFullAccessKeys` or `-adminReadOnlyAccessKeys`. This is a
303+
distinct case from the 401 above: the operator has valid SigV4
304+
credentials for the data plane but no admin role assignment. Add the
305+
key to one of the role flags and **restart every node** so each
306+
node's live role index picks up the change.
307+
308+
### Write returns 403 forbidden
309+
310+
The principal's role is read-only. Move the access key into
311+
`-adminFullAccessKeys` (and remove it from
312+
`-adminReadOnlyAccessKeys`), then **restart every node** so each
313+
node's live role index picks up the change.
314+
315+
### Write returns 503 leader_unavailable
316+
317+
The Raft cluster is mid-election. Wait the one second requested by the
318+
`Retry-After: 1` header, then re-issue the request. If the 503 persists past one or
319+
two seconds, check Raft leader status via the admin
320+
`/admin/api/v1/cluster` endpoint or `cmd/elastickv-admin`.
321+
322+
### `bucket_not_empty` on DELETE
323+
324+
The dashboard cannot force a recursive delete by design — the
325+
SPA's job is to surface the error and guide the operator to clean
326+
up first. Use the SigV4 S3 path (`aws s3 rm s3://<bucket> --recursive`)
327+
to drain the bucket, then retry the DELETE on the dashboard.
328+
329+
### Stuck SPA / blank screen
330+
331+
The dashboard ships a placeholder `internal/admin/dist/index.html`
332+
that renders a "bundle missing" page when `make` was run without
333+
the SPA build step. Run `cd web/admin && npm install && npm run build`
334+
to populate the embedded `dist` directory, then rebuild the binary.
335+
336+
## Cross-references
337+
338+
- Design rationale: [docs/design/2026_04_24_proposed_admin_dashboard.md](design/2026_04_24_proposed_admin_dashboard.md) (renamed to `_partial_` in PR #675; this link will follow once that lands)
339+
- Architecture overview: [docs/architecture_overview.md](architecture_overview.md)
340+
- AdminForward RPC contract: `proto/admin_forward.proto`
