Commit 01f29fe

docs: add admin dashboard operator guide (P4)
P4 deliverable from docs/design/2026_04_24_proposed_admin_dashboard.md Section 8: a single self-contained operator-facing reference for the admin HTTP listener.

Covers:
- Quick-start invocation for a loopback dev cluster
- Required + optional flag reference, with explanations of why each guard exists (TLS hard-error, rolling-update caveats, etc.)
- TLS topologies (loopback / TLS / discouraged plaintext-non-loopback)
- Role model + how live role re-validation works on every state-changing request
- The full /admin/api/v1/* surface (auth + cluster + dynamo + s3, including the slice 2 write paths and the AdminForward forwarding contract)
- forwarded_from audit log shape and why it carries the follower's node ID
- Troubleshooting guide for the common failure modes operators hit during initial bring-up (missing credentials, TLS hard-error, 401 ambiguity, 503 leader_unavailable, bucket_not_empty, blank-screen / placeholder bundle)
- Cross-references to the design doc and architecture overview

The Section 8 P4 plan also called out "TLS, read-only role, CSRF" as deliverables; those are already implemented (see config.go's validateTLS / validateAccessKeyRoles, the role gates in DynamoHandler.principalForWrite + S3Handler.principalForWrite, and the CSRFDoubleSubmit middleware). This doc stitches them into a single reference operators can land on without reading code.

Stacked on #669 (P2 slice 2a) + #673 (P2 slice 2b) so the API-surface table can describe S3 write endpoints as shipped. Once both land in main, this rebases cleanly.
1 parent f3e9278 commit 01f29fe

1 file changed

Lines changed: 251 additions & 0 deletions

docs/admin.md

@@ -0,0 +1,251 @@
1+
# elastickv admin dashboard — operator guide
2+
3+
This document covers configuration and day-2 operation of the admin
4+
HTTP listener. Architecture and design rationale live in
5+
[docs/design/2026_04_24_proposed_admin_dashboard.md](design/2026_04_24_proposed_admin_dashboard.md);
6+
read that first if you're touching the code.
7+
8+
## What the admin dashboard is
9+
10+
A separate HTTP listener (default `127.0.0.1:8080`) that exposes a
11+
React SPA + JSON API for inspecting the cluster and managing
12+
DynamoDB tables / S3 buckets without having to construct SigV4
13+
requests. It is **disabled by default**: set `-adminEnabled` to turn
14+
it on.
15+
16+
The listener is independent of the data-plane DynamoDB
17+
(`-dynamoAddress`) and S3 (`-s3Address`) endpoints — credentials,
18+
TLS, and auth are configured separately.
19+
20+
## Quick start (loopback dev)
21+
22+
The minimum invocation that produces a working dashboard:
23+
24+
```sh
25+
./elastickv \
26+
-raftId=n1 -raftBootstrap \
27+
-dynamoAddress=127.0.0.1:8000 \
28+
-s3Address=127.0.0.1:9000 \
29+
-s3CredentialsFile=/path/to/creds.json \
30+
-adminEnabled \
31+
-adminSessionSigningKeyFile=/path/to/admin-hs256.b64 \
32+
-adminFullAccessKeys=AKIA_ADMIN
33+
```
34+
35+
Then open `http://127.0.0.1:8080/admin/` in a browser and log in
36+
with the access key + secret pair from the credentials file.
37+
38+
## Configuration reference
39+
40+
### Required when `-adminEnabled=true`
41+
42+
| Flag | Description |
43+
|---|---|
44+
| `-adminEnabled` | Master on/off switch. Default `false`. |
45+
| `-adminSessionSigningKey` *or* `-adminSessionSigningKeyFile` *or* `ELASTICKV_ADMIN_SESSION_SIGNING_KEY` | Cluster-shared base64-encoded HS256 key (≥ 32 raw bytes / 44 base64 chars). **Must be the same on every node**: JWTs minted by node A are verified by node B during follower→leader forwarding, so a mismatch breaks the dashboard's write paths on follower nodes. The `*File` / env-var forms keep the secret out of `/proc/<pid>/cmdline`. |
46+
| `-s3CredentialsFile` | JSON file with at least one access key + secret key pair. Same file the S3 adapter uses for SigV4; the admin dashboard reuses it for login authentication. |
47+
| `-adminFullAccessKeys` *and/or* `-adminReadOnlyAccessKeys` | Comma-separated allow-lists. Only access keys listed here may log into the dashboard, even if their SigV4 secret validates against the credentials file. Keys must not appear in both lists. |
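One way to produce a key that meets the length requirement above is sketched here. It assumes a POSIX shell with `head`, `base64`, and `tr` available. The output path `./admin-hs256.b64` and the `KEY_FILE` variable are illustrative, and this guide does not say whether the loader tolerates a trailing newline, so the sketch strips it.

```sh
# Hypothetical output path; pass the same file to -adminSessionSigningKeyFile
# on every node.
key_file="${KEY_FILE:-./admin-hs256.b64}"

# Make sure the file is created readable by the owner only.
umask 077

# 32 random bytes encode to 44 base64 characters (43 data chars + 1 '=' pad).
head -c 32 /dev/urandom | base64 | tr -d '\n' > "$key_file"

# Print the encoded length as a sanity check.
wc -c < "$key_file"
```

Distribute the resulting file to every node through whatever secret-management channel the cluster already uses, rather than regenerating it per node.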
48+
49+
### Optional
50+
51+
| Flag | Description |
52+
|---|---|
53+
| `-adminListen` | host:port for the admin listener. Defaults to `127.0.0.1:8080`. |
54+
| `-adminTLSCertFile` / `-adminTLSKeyFile` | PEM cert + key. Both must be set together; a partial config fails validation at startup. |
55+
| `-adminAllowPlaintextNonLoopback` | Explicit opt-out for the non-loopback-without-TLS startup hard-error. **Strongly discouraged** — enables the dashboard to mint cookies without the `Secure` attribute and ship session JWTs over plaintext. Use only for short-lived test rigs you control. |
56+
| `-adminSessionSigningKeyPrevious` *or* `-adminSessionSigningKeyPreviousFile` *or* `ELASTICKV_ADMIN_SESSION_SIGNING_KEY_PREVIOUS` | Previous HS256 key used only for verification during a rotation window. New tokens always use the primary key; existing tokens minted under the previous key continue to verify until they expire. |
57+
| `-adminAllowInsecureDevCookie` | Mints session cookies without `Secure` for local plaintext development. Do not set on any deployment that touches a network. |
58+
59+
### Hard-error startup conditions
60+
61+
The process fails to start (non-zero exit) when:
62+
63+
- `-adminEnabled=true` but `-s3CredentialsFile` is empty or missing, or its parsed map has zero entries. Without credentials every login would be rejected; the supported way to run a locked-down node is `-adminEnabled=false`.
64+
- `-adminEnabled=true` but `-adminSessionSigningKey`, its `*File` form, and the env var all decode to an empty key.
65+
- `-adminEnabled=true` but `-adminListen` is empty or not a valid host:port.
66+
- Exactly one of `-adminTLSCertFile` and `-adminTLSKeyFile` is set (partial TLS config).
67+
- `-adminListen` is bound to a non-loopback address, TLS is not configured, **and** `-adminAllowPlaintextNonLoopback` is not set. The error message names the flag combinations that resolve it.
68+
- `-adminFullAccessKeys` and `-adminReadOnlyAccessKeys` overlap (the same access key listed in both).
69+
70+
These are deliberate — silent fallbacks to "auth disabled" or "TLS
71+
off" would downgrade security guarantees the operator is unaware of.
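Two of the checks above are easy to reproduce before a rollout. The following is a rough pre-flight sketch, not the project's validation code: the function names `overlapping_keys` and `tls_pair_ok` are made up for illustration, the paths and key names are placeholders, and it assumes access keys contain no whitespace.

```sh
# overlapping_keys FULL_LIST RO_LIST
# Prints every access key that appears in both comma-separated lists.
overlapping_keys() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r key; do
    [ -n "$key" ] || continue
    case ",$2," in
      *",$key,"*) printf '%s\n' "$key" ;;
    esac
  done
}

# tls_pair_ok CERT_FILE KEY_FILE
# Succeeds when both values are set or both are empty (no partial TLS config).
tls_pair_ok() {
  if [ -n "$1" ] && [ -n "$2" ]; then return 0; fi
  if [ -z "$1" ] && [ -z "$2" ]; then return 0; fi
  return 1
}

# Example with placeholder values: AKIA_OPS is listed twice, and only a cert is set.
overlapping_keys 'AKIA_ADMIN,AKIA_OPS' 'AKIA_OPS,AKIA_VIEW'
tls_pair_ok '/etc/elastickv/admin.crt' '' || echo 'partial TLS config'
```

Running a check like this in a deploy pipeline surfaces the misconfiguration before the node refuses to start.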
72+
73+
## TLS setup
74+
75+
Two supported topologies:
76+
77+
### A. Loopback only (`127.0.0.1` / `::1`)
78+
79+
No TLS required. The dashboard cookies carry `Secure=false` only
80+
when `-adminAllowInsecureDevCookie` is set; in normal loopback
81+
operation cookies are minted with `Secure` regardless and rely on
82+
the browser's loopback-is-trusted policy.
83+
84+
### B. Reachable address with TLS
85+
86+
Set `-adminListen` to the public bind, plus `-adminTLSCertFile` and
87+
`-adminTLSKeyFile`. TLS 1.2+ is enforced. Cookies are issued with
88+
`Secure; SameSite=Strict; HttpOnly`.
89+
90+
Cert renewal: the listener picks up the cert files at startup only;
91+
restart the process after rotating certs. Hot-reload is not
92+
implemented (out of scope for the dashboard's maintenance model).
93+
94+
### Discouraged: plaintext non-loopback
95+
96+
`-adminAllowPlaintextNonLoopback` exists as an escape hatch for
97+
short-lived test deployments. The session JWT and its bearer cookie
98+
travel in clear text in this mode; anyone on the path can replay
99+
the token until it expires. Do not enable on a long-running
100+
deployment.
101+
102+
## Roles
103+
104+
Two roles, both checked against the live `-adminFullAccessKeys` /
105+
`-adminReadOnlyAccessKeys` lists on **every** state-changing
106+
request (not just at login):
107+
108+
- **read-only** — may list / describe Dynamo tables and S3 buckets, view cluster status. Cannot create, mutate ACL, or delete.
109+
- **full** — adds POST / PUT / DELETE on `/dynamo/tables` and `/s3/buckets`.
110+
111+
A key revoked from `-adminFullAccessKeys` immediately loses
112+
write access on the next request — the dashboard does not wait for
113+
the token to expire. The token's role claim is treated as a hint;
114+
the live role index is authoritative.
115+
116+
## API surface
117+
118+
All endpoints are under `/admin/api/v1/`. Authentication: cookie
119+
session minted by `POST /auth/login`; CSRF: double-submit token in
120+
`admin_csrf` cookie + `X-Admin-CSRF` header on every state-changing
121+
method.
122+
123+
| Method | Path | Role | Notes |
124+
|---|---|---|---|
125+
| `POST` | `/auth/login` | none | Body `{access_key, secret_key}`. Sets `admin_session` and `admin_csrf` cookies. |
126+
| `POST` | `/auth/logout` | any | Invalidates the session cookie. |
127+
| `GET` | `/cluster` | any | Node ID, Raft leader, version. |
128+
| `GET` | `/dynamo/tables` | any | Paginated list. `?limit=` (default 100, max 1000). |
129+
| `POST` | `/dynamo/tables` | full | Body schema in design 4.2. |
130+
| `GET` | `/dynamo/tables/{name}` | any | Schema + GSI summary. |
131+
| `DELETE` | `/dynamo/tables/{name}` | full | 204 on success. |
132+
| `GET` | `/s3/buckets` | any | Paginated list with the same `?limit=` semantics. |
133+
| `POST` | `/s3/buckets` | full | Body `{bucket_name, acl?}`. ACL omitted defaults to `private`. |
134+
| `GET` | `/s3/buckets/{name}` | any | Bucket meta + ACL. |
135+
| `PUT` | `/s3/buckets/{name}/acl` | full | Body `{acl}`. Only `private` and `public-read` are accepted. |
136+
| `DELETE` | `/s3/buckets/{name}` | full | 204 on success. The bucket must be empty (no objects); a non-empty bucket returns 409 `bucket_not_empty`. |
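As a rough illustration of the cookie + CSRF flow described above, the sketch below drives the API with `curl`. It assumes the double-submit check expects the `X-Admin-CSRF` header to carry the `admin_csrf` cookie's value and that request bodies are sent as `application/json`; neither detail is spelled out in this guide. It also assumes a TLS listener, because `curl` only replays `Secure` cookies over HTTPS. The helper names, `ADMIN_BASE`, `ADMIN_COOKIE_JAR`, and the default host are hypothetical.

```sh
# Placeholder endpoint and cookie-jar path; override via the environment.
BASE="${ADMIN_BASE:-https://admin.example.internal:8443/admin/api/v1}"
JAR="${ADMIN_COOKIE_JAR:-./admin-cookies.txt}"

# csrf_token JAR_FILE
# Reads the admin_csrf value from a curl (Netscape-format, tab-separated) jar.
csrf_token() {
  awk -F '\t' '$6 == "admin_csrf" { print $7 }' "$1"
}

# admin_login ACCESS_KEY SECRET_KEY
# Stores the session and CSRF cookies in $JAR. Values are interpolated into
# JSON without escaping, so they must not contain quotes or backslashes.
admin_login() {
  curl -sS -c "$JAR" -H 'Content-Type: application/json' \
    -d "{\"access_key\":\"$1\",\"secret_key\":\"$2\"}" \
    "$BASE/auth/login"
}

# admin_create_bucket BUCKET_NAME
# State-changing call: sends the cookies plus the matching CSRF header.
admin_create_bucket() {
  curl -sS -b "$JAR" -H 'Content-Type: application/json' \
    -H "X-Admin-CSRF: $(csrf_token "$JAR")" \
    -d "{\"bucket_name\":\"$1\"}" \
    "$BASE/s3/buckets"
}

# Usage against a running node (not executed here):
#   admin_login AKIA_ADMIN "$ADMIN_SECRET"
#   admin_create_bucket demo-bucket
```

The cookie jar holds a live session token, so treat it like a credential and delete it when the script finishes.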
137+
138+
## Follower → leader forwarding
139+
140+
Writes (`POST` / `PUT` / `DELETE`) require the local node to be the
141+
Raft leader. When the SPA's request hits a follower, the dashboard
142+
transparently forwards the call to the leader over an internal
143+
gRPC service (`AdminForward`). The leader re-validates the
144+
principal against its own `-adminFullAccessKeys` list before
145+
acting — a follower cannot smuggle a downgraded key past the
146+
leader's view.
147+
148+
This means there is **no need to point the SPA at a specific
149+
node**: any node with `-adminEnabled` can serve the dashboard.
150+
Operators that fan out behind a load balancer get the same
151+
behaviour as a single-node cluster, with one caveat below.
152+
153+
### Follower forwarding caveat: rolling configuration changes
154+
155+
A configuration change (e.g. adding `AKIA_NEW` to
156+
`-adminFullAccessKeys`) must propagate to **every node** before
157+
the new key works against any follower's dashboard. During the
158+
rollout window:
159+
160+
- A login against a node that has not yet been restarted with the new flags fails with 403.
161+
- A token minted by an updated node, replayed against a not-yet-updated node, will be re-validated against that node's stale role list. If the key is missing on the older node, the request fails with 403 even though the token is structurally valid.
162+
163+
The dashboard does not have an automatic role-refresh path — restart
164+
each node after editing the access-key flags.
165+
166+
### Election-period 503
167+
168+
When the leader steps down mid-write (or has not yet been elected
169+
after a fresh start), the forwarder cannot reach a leader and the
170+
SPA receives `503 Service Unavailable` with a `Retry-After: 1`
171+
header. The SPA's API client honours `Retry-After` and re-issues
172+
the request once. Operators investigating "intermittent 503s"
173+
should look at Raft leader-churn logs first.
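For scripted writes, the same retry-once behaviour can be approximated outside the SPA. This is a sketch under assumptions: the helper names are invented, nothing here comes from the project's own client, and it only understands the delta-seconds form of `Retry-After` (the form this guide documents).

```sh
# retry_after_seconds HEADER_FILE
# Prints the Retry-After value from a saved response-header dump (curl -D),
# matching the header name case-insensitively and dropping the trailing CR.
retry_after_seconds() {
  awk -F ': *' 'tolower($1) == "retry-after" { sub(/\r$/, "", $2); print $2; exit }' "$1"
}

# write_with_retry CURL_ARGS...
# Runs curl once; on a 503 it sleeps for Retry-After seconds (default 1)
# and re-issues the identical request a single time. Prints the final status.
write_with_retry() {
  hdrs="$(mktemp)"
  status="$(curl -sS -o /dev/null -D "$hdrs" -w '%{http_code}' "$@")"
  if [ "$status" = "503" ]; then
    delay="$(retry_after_seconds "$hdrs")"
    sleep "${delay:-1}"
    status="$(curl -sS -o /dev/null -D "$hdrs" -w '%{http_code}' "$@")"
  fi
  rm -f "$hdrs"
  printf '%s\n' "$status"
}
```

Retrying only once keeps a script from masking sustained leader churn; a second 503 is a signal to look at the Raft logs rather than to keep looping.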
174+
175+
## Audit log
176+
177+
Every state-changing admin request emits a structured slog line at
178+
`INFO` level on the leader's stdout (or wherever the process slog
179+
handler is wired):
180+
181+
```
182+
admin_audit actor=AKIA_ADMIN role=full method=POST path=/admin/api/v1/dynamo/tables status=201 duration=8.2ms
183+
```
184+
185+
For forwarded requests, an extra `forwarded_from=<node-id>` field
186+
identifies the follower that received the original HTTP call. CR
187+
and LF in the field are stripped at the entry point — a hostile
188+
follower cannot split a single audit line into two by smuggling
189+
control characters into its node ID.
190+
191+
Login and logout emit their own audit lines (`action=login` /
192+
`action=logout`) so the JWT's lifetime can be correlated with the
193+
mutations it authorised.
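Because the audit line is flat `key=value` text, ordinary shell tools are enough to slice it. The sketch below assumes values contain no spaces (true of the sample above, but not guaranteed by this guide) and that the logs have been captured to a file. `audit_field` and `forwarded_writes` are illustrative names.

```sh
# audit_field KEY
# Reads one audit line on stdin and prints the value of KEY, if present.
audit_field() {
  tr ' ' '\n' | awk -v k="$1" 'index($0, k "=") == 1 { print substr($0, length(k) + 2) }'
}

# forwarded_writes LOG_FILE
# Prints only the audit lines that were forwarded from a follower.
forwarded_writes() {
  grep 'admin_audit' "$1" | grep ' forwarded_from='
}

# Sample line taken from this guide.
line='admin_audit actor=AKIA_ADMIN role=full method=POST path=/admin/api/v1/dynamo/tables status=201 duration=8.2ms'
printf '%s\n' "$line" | audit_field actor
printf '%s\n' "$line" | audit_field status
```

If the process is wired to a JSON slog handler instead of the text form shown here, a JSON-aware tool would be the better fit.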
194+
195+
## Troubleshooting
196+
197+
### "admin listener is enabled but no static credentials are configured"
198+
199+
Either `-s3CredentialsFile` is unset or the file parses to an empty
200+
map. Check the file exists and contains at least one entry:
201+
```json
202+
{"credentials":[{"access_key_id":"AKIA_ADMIN","secret_access_key":"..."}]}
203+
```
204+
205+
### "is not loopback but TLS is not configured"
206+
207+
Default-deny safety net. Either set `-adminTLSCertFile` +
208+
`-adminTLSKeyFile`, or pass `-adminAllowPlaintextNonLoopback` (and
209+
read the TLS section above before doing so).
210+
211+
### Login returns 401 invalid_credentials
212+
213+
The access key + secret pair did not match the credentials file, or
214+
the key is not listed in `-adminFullAccessKeys` /
215+
`-adminReadOnlyAccessKeys`. The dashboard does not distinguish the
216+
two cases on the wire — both produce 401 — but the leader's audit
217+
log shows the precise reason.
218+
219+
### Write returns 403 forbidden
220+
221+
The principal's role is read-only. Move the access key into
222+
`-adminFullAccessKeys` (and remove it from
223+
`-adminReadOnlyAccessKeys`), then **restart every node** so each
224+
node's live role index picks up the change.
225+
226+
### Write returns 503 leader_unavailable
227+
228+
The Raft cluster is mid-election. Re-issue the request after the
229+
`Retry-After: 1` header tells you to. If it persists past one or
230+
two seconds, check Raft leader status via the admin API's
231+
`/admin/api/v1/cluster` endpoint or `cmd/elastickv-admin`.
232+
233+
### `bucket_not_empty` on DELETE
234+
235+
The dashboard cannot force a recursive delete by design — the
236+
SPA's job is to surface the error and guide the operator to clean
237+
up first. Use the SigV4 S3 path (`aws s3 rm s3://<bucket> --recursive`, with `--endpoint-url` pointed at the node's `-s3Address`)
238+
to drain the bucket, then retry the DELETE on the dashboard.
239+
240+
### Stuck SPA / blank screen
241+
242+
The dashboard ships a placeholder `internal/admin/dist/index.html`
243+
that renders a "bundle missing" page when `make` was run without
244+
the SPA build step. Run `cd web/admin && npm install && npm run build`
245+
to populate the embedded `dist` directory, then rebuild the binary.
246+
247+
## Cross-references
248+
249+
- Design rationale: [docs/design/2026_04_24_proposed_admin_dashboard.md](design/2026_04_24_proposed_admin_dashboard.md)
250+
- Architecture overview: [docs/architecture_overview.md](architecture_overview.md)
251+
- AdminForward RPC contract: `proto/admin_forward.proto`
