feat(sqs): Phase 3 — Approximate counters + admin SPA queues pages#659
feat(sqs): Phase 3 — Approximate counters + admin SPA queues pages#659
Conversation
Phase P3 of docs/design/2026_04_24_proposed_admin_dashboard.md. The Go side (auth, JWT, cluster handler, dynamo handler, follower-> leader forwarding, embed.go skeleton) shipped in earlier PRs but StaticFS was wired as nil, leaving /admin/* answering 404 for any non-API path. This PR adds the React bundle + flips StaticFS on. Source layout (web/admin/): - React 18 + TypeScript + Vite per design doc Section 5.1 - Tailwind CSS for tokens; no UI framework dependency (kept the bundle at ~60 KB gzip per Section 7.2's ~150 KB target) - vite.config.ts writes the build output straight into internal/admin/dist so `npm run build && go build` produces a single binary with the latest SPA Pages (Section 5.2): - /login - SigV4 access-key + secret-key form, server-side JWT Cookie issued, status-specific error messages for 401/403/429 - / (dashboard) - cluster overview, raft groups, summary cards - /dynamo - table list + Create dialog (full role only) - /dynamo/:name - schema + GSIs + Delete confirmation modal - /s3 - bucket list + Create dialog (endpoint pending, renders graceful "endpoint pending" notice on 404) - /s3/:name - bucket meta + ACL toggle + Delete (same) Auth (Section 6): - Cookie-based session (admin_session, HttpOnly) handled by browser - CSRF double-submit: admin_csrf cookie value echoed in X-Admin-CSRF header on every POST/PUT/DELETE (apiFetch in src/api/client.ts) - Read-only sessions hide Create/Delete affordances and surface a RequireFullAccess notice; backend re-evaluates the role - Session expiry tracked in sessionStorage; expires_at trip clears in-memory state and redirects to /login Backend wiring: - main_admin.go: load admin.StaticFS() and pass to ServerDeps so /admin/assets/* and the SPA fallback start serving real files - internal/admin/embed.go: //go:embed all:dist on the build output - .gitignore: keep the placeholder index.html committed (so go:embed always has at least one file in a fresh clone), exclude hashed Vite assets and the npm/vite caches - committed bundles invariably drift from source Verification: - npm run build: 192 KB JS / 60 KB gzip + 14 KB CSS - npm run lint (tsc strict, noUnusedLocals, noUncheckedSideEffects) - go build ./... - go test -race ./internal/admin/... - golangci-lint run ./internal/admin/... S3 admin endpoints (CreateBucket / ListBuckets / PutBucketAcl / DeleteBucket) are not yet wired on the Go side; the SPA pages render a soft "endpoint pending" notice on the 404 they currently get and will populate transparently once the handlers ship.
Two converging review findings (Claude bot + Gemini Code Assist): 1. AbortSignal not plumbed through listTables / describeTable / listBuckets / describeBucket. useApiQuery created an AbortController and passed signal to the loader, but those four methods discarded it - so requests ran to completion after a component unmounted, wasting network and server connections. Now every read method mirrors cluster's signature (`signal?: AbortSignal`) and every page-level useApiQuery callsite passes it through. 2. The safe() wrapper in Dashboard.tsx was a no-op that received `signal` and never passed it on. The "404 doesn't cancel render" suppression it claimed to provide is already handled by SummaryCard testing `error?.status === 404 && pendingMessage`. Removed. Drive-by polish from the same reviews: - useApiQuery's effect deps no longer include `markUnauthorized`. Read it through markUnauthorizedRef so an AuthProvider remount (React Fast Refresh in dev, or any future wrapping change) does not invalidate every active query and trigger a fresh round-trip on every page. Mirrors the existing loaderRef pattern. - Login.tsx now trims `secretKey` in addition to `accessKey`. SigV4 secrets never contain whitespace, so trimming is safe and saves an operator from a cryptic 401 after pasting a key with a trailing newline. Modal accessibility (no focus trap, missing aria-modal / role=dialog) was flagged as low-priority polish for an internal operator tool. Tracked as a follow-up; not addressed here to keep this change focused on the functional bugs. Verified with `npm run build` and `npm run lint` (tsc --strict --noUnusedLocals --noUncheckedSideEffectImports), and `go build ./...`.
Two follow-ups from the review of #649: 1. useApi.ts: removed the `if (ctrl.signal.aborted) return;` check inside the catch handler. The cleanup function sets `cancelled = true` *before* calling `ctrl.abort()`, so any abort path is already covered by the preceding `cancelled` check. The second guard was harmless belt-and-suspenders dead code; the comment now records why one check is enough. 2. Modal.tsx: added the WAI-ARIA dialog contract (`role="dialog"`, `aria-modal="true"`, `aria-labelledby` pointing at the title) and a focus trap: - On open: store the previously-focused element, then focus the first focusable inside the dialog so keyboard users do not stay stranded on the page underneath. Empty dialogs fall back to focusing the dialog container itself with tabindex=-1. - Tab / Shift+Tab cycle within the dialog instead of escaping to the page underneath. - On close: restore focus to whoever opened the dialog (skipped when that element has been unmounted, e.g. the dialog deleted the row whose button opened it). - Escape and backdrop click stay wired through the existing `busy` gate so a half-applied operation cannot be dismissed. Implementation is vanilla DOM querying — focus-trap-react / react-focus-lock would have added 5–7 KB gzip for what is a ~40 line vanilla solution. Bundle delta from this PR is +0.4 KB gzip (60.23 -> 60.59 KB). Verified with `npm run build` and `npm run lint` (tsc --strict --noUnusedLocals --noUncheckedSideEffectImports), and `go build ./...`. Stacked on top of #649 (feat/admin-dashboard-spa); merge that one first, then this rebases cleanly onto main.
Phase 3.A and 3.E from the SQS design doc, lifecycle-bumped from
proposed_ to partial_. Phase 3.B / 3.C / 3.D are recorded as TODOs
in the same doc with the rough scope each needs (XML query protocol,
per-queue throttling, split-queue FIFO).
3.A — ApproximateNumberOfMessages{,NotVisible,Delayed} on
GetQueueAttributes. The handler scans the visibility index at the
read TS and classifies each entry by (visible_at, available_at) vs
now. Visible (available and vis<=now), NotVisible (in flight after
receive bumped vis), Delayed (available_at>now, never received).
Bounded at 5000 vis-index entries per call so a runaway queue
cannot turn GetQueueAttributes into an O(stream) scan; the
"truncated" signal is reported back so callers can log.
The counters fire only when the caller asks for any of the three
or for "All" — a config-only GetQueueAttributes call still pays
zero scan cost. Two regression tests pin both behaviors:
- TestSQSServer_GetQueueAttributesApproxCounters
- TestSQSServer_GetQueueAttributesApproxCountersOnlyWhenSelected
3.E — Admin SPA queues pages. The original Section 13 console UI
(separate listener, bearer token, dedicated bundle) is superseded
by extending the existing admin dashboard SPA. Reuses the JWT
cookie session, CSRF double-submit, role model, and embed.FS that
the dashboard already enforces — no second listener, no second
auth surface.
Backend (adapter/sqs_admin.go + internal/admin/sqs_handler.go):
- SQSServer.AdminListQueues / AdminDescribeQueue / AdminDeleteQueue
are SigV4-bypass entrypoints, mirroring the DynamoDB pattern.
AdminDescribeQueue returns the queue meta plus the typed
counters from §3.A (no string round-trip).
- sqsQueuesBridge in main_admin.go re-shapes the adapter DTO
into the admin package's QueueSummary, keeping internal/admin
free of the heavy adapter dependency tree.
- admin.QueuesSource interface is opt-in; deployments that don't
run --sqsAddress leave /admin/api/v1/sqs/* off the wire and
the SPA renders a soft "endpoint pending" notice on the 404.
- Role re-evaluation against the live RoleStore on DELETE so a
downgraded key cannot keep mutating with a still-valid JWT.
- buildAPIMux dispatch loop refactored into apiRoutes.dispatch +
hasResourcePrefix helper to keep cyclop <= 10 as new resource
prefixes get added (3.E here, future endpoints later).
Frontend (web/admin/src/pages/Sqs{List,Detail}.tsx):
- /sqs queue list with refresh + per-row link to detail.
- /sqs/:name detail showing FIFO badge, counters card (visible /
in-flight / delayed) with a "truncated" pill when the scan
budget capped, raw attributes table, and a Delete confirmation
Modal gated by RequireFullAccess.
- api/client.ts gains listQueues / describeQueue / deleteQueue
with the same AbortSignal pattern used for cluster / dynamo
reads (Modal a11y / abort handling carry over from PR #650).
Out of scope (recorded in the design doc):
- PurgeQueue from the SPA — the underlying adapter PurgeQueue
method ships in PR #638 (still open). Trivial to add as a
follow-up once that lands.
- Send / Peek from the SPA — Send needs an SigV4-bypass
entrypoint; Peek is genuinely new functionality (SQS has no
peek primitive) and merits its own design discussion.
- CreateQueue from the SPA — modest backend addition deferred to
a separate PR to keep this one focused.
Verified with `go build ./...`, `go test -race ./internal/admin/...`,
`go test -race -run TestSQS ./adapter/`, `go test -run TestStartAdmin .`,
`golangci-lint run ./adapter/... ./internal/admin/... ./...`,
`cd web/admin && npm run build && npm run lint`.
Bundle delta: 199 KB JS / 61 KB gzip (+1 KB vs PR #650 baseline).
Stacked on PR #650 (admin SPA followups) which is itself stacked on
PR #649 (admin SPA). Merge order: #649 → #650 → this. Once the
chain merges this rebases onto main with no conflicts.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request implements Phase 3.A of the SQS-compatible adapter, introducing approximate message counters for the GetQueueAttributes API and the admin dashboard. It also adds the admin SPA frontend and wires up management endpoints for listing, describing, and deleting queues. Feedback highlights a performance risk in the counter computation logic, which may perform up to 5,000 random reads, and a consistency bug where message classification uses the current system time instead of a timestamp derived from the MVCC readTS.
| } | ||
| // visible_at > now: either Delayed or NotVisible. Load the | ||
| // data record so we can distinguish on available_at. | ||
| state, err := s.classifyHiddenCandidate(ctx, queueName, gen, string(kvp.Value), now, readTS) |
There was a problem hiding this comment.
The computeApproxCounters implementation performs a point lookup (GetAt) for every message in the visibility index that is not immediately visible (i.e., in-flight or delayed). With a budget of 5,000 entries, a single GetQueueAttributes call could trigger up to 5,000 random reads. While this is bounded, it can significantly increase latency for standard API callers and monitoring tools if the data records are not in the block cache. Consider if AvailableAtMillis can be included in the visibility index value to allow classification without point lookups, or document the potential latency impact for large queues.
| // bound legal. | ||
| func (s *SQSServer) computeApproxCounters(ctx context.Context, queueName string, gen uint64, readTS uint64) (sqsApproxCounters, error) { | ||
| var out sqsApproxCounters | ||
| now := time.Now().UnixMilli() |
There was a problem hiding this comment.
Using time.Now().UnixMilli() for bucket classification while scanning at a specific readTS can lead to inconsistencies. In an MVCC system, all operations within a single transaction or snapshot scan must share the same start timestamp to ensure atomicity and a consistent snapshot view. Using a timestamp derived from the HLC at readTS would ensure the classification is strictly consistent with the snapshot view.
References
- In an MVCC system, all operations within a single transaction must share the same start timestamp to ensure atomicity and a consistent snapshot view.
|
Superseded by #670. The PR #659 branch became unrebaseable after main moved (PR #649 squash-merged, PR #658 added S3 admin endpoints, and the Approximate counters now live in adapter/sqs_catalog.go directly via |
Replaces PR #659, which conflicted heavily after main moved (PR #649 squashed; PR #658 added S3 admin endpoints; the Approximate counters implementation now lives directly in adapter/sqs_catalog.go). This PR: Backend (adapter/sqs_admin.go + internal/admin/sqs_handler.go): - SQSServer.AdminListQueues / AdminDescribeQueue / AdminDeleteQueue are SigV4-bypass entrypoints, mirroring the AdminListTables / AdminListBuckets pattern. - AdminDescribeQueue uses the existing scanApproxCounters from sqs_catalog.go (already on main) so the admin path returns the same Visible / NotVisible / Delayed numbers as GetQueueAttributes("All") would, taken at one snapshot read TS. - sqsQueuesBridge in main_admin.go re-shapes adapter.AdminQueueSummary into admin.QueueSummary, keeping internal/admin free of the heavy adapter dependency tree — same pattern as dynamoTablesBridge / s3BucketsBridge. - admin.QueuesSource is opt-in; deployments that don't run --sqsAddress leave /admin/api/v1/sqs/* off the wire and the SPA renders a soft "endpoint pending" notice on the 404. - Role re-evaluation against the live RoleStore on DELETE so a downgraded key cannot keep mutating with a still-valid JWT. - apiRouteTable.dispatch refactored: resourceHandlerFor extracted so the dispatcher stays under cyclop=10 as new resources land (Dynamo, S3, SQS, future). Frontend (web/admin/src/pages/SqsList.tsx, SqsDetail.tsx): - /sqs queue list with refresh + per-row link to detail. - /sqs/:name detail showing FIFO badge, counters card (Visible / In-flight / Delayed), raw attributes table, and a Delete confirmation Modal gated by RequireFullAccess. - api/client.ts gains listQueues / describeQueue / deleteQueue with the same AbortSignal pattern used for cluster / dynamo / s3 reads. - Layout nav adds an SQS tab between DynamoDB and S3. Out of scope (recorded in the SQS partial design doc §16.2): - PurgeQueue from the SPA. Underlying purgeQueueWithRetry is on main; the admin entrypoint is a trivial follow-up. - Send / Peek / CreateQueue from the SPA. Each needs its own adapter entrypoint and form UX; deferred to keep this PR focused. Verified with go build ./..., go test -race ./internal/admin/..., go test -race -run TestSQS ./adapter/, go test -run TestStartAdmin ., golangci-lint run ./adapter/... ./internal/admin/... ./... (0 issues, no //nolint), and cd web/admin && npm run build.
## Summary Phase **3.B** of [`docs/design/2026_04_24_partial_sqs_compatible_adapter.md`](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_04_24_partial_sqs_compatible_adapter.md) §16.4. Adds the AWS SQS **query protocol** (form-encoded request, XML response) alongside the existing JSON protocol on the same listener — older `aws-sdk-java` v1, `boto < 1.34`, and AWS CLI clients can now talk to elastickv without modification. Detection happens per-request via `Content-Type` / `X-Amz-Target` / `Action` presence; no flag, no separate port. See the new proposal doc committed alongside. This is an **architectural proof PR**: dispatch + decoding + encoding + error envelope + the `*Core` refactor pattern are all in place, with **three verbs** wired end-to-end as concrete proof. Each follow-up verb is a parser + response struct + one switch arm — no further design work needed. ### Verbs in this PR | Verb | Why it's in the proof set | |---|---| | `CreateQueue` | Exercises the `Attribute.N.{Name,Value}` indexed-collection parser. | | `ListQueues` | Exercises the repeated-element XML response shape. | | `GetQueueUrl` | Exercises the `<ErrorResponse>` envelope path via `QueueDoesNotExist`. | Every other verb returns a structured **501 `NotImplementedYet`** XML envelope so operators see the gap explicitly. `SendMessage` / `ReceiveMessage` / `DeleteMessage` are the highest-priority follow-ups (they need the same `*Core` refactor on the FIFO send loop). ### Key design points - **No new listener / no flag.** `pickSqsProtocol(*http.Request)` decides per request. JSON and Query share the SQS port and the SigV4 path. - **Wire-format-free cores.** `createQueue` / `listQueues` / `getQueueUrl` are now `decode → core → encode` with `core(ctx, in) (out, error)`. The JSON wrappers are unchanged in behavior; existing JSON tests pass without modification. - **DoS protection inherited.** Body read is bounded by the same `sqsMaxRequestBodyBytes` the JSON path uses. - **SigV4 unchanged.** The signed canonical request includes the form-encoded body, so the existing SigV4 middleware verifies query requests without code changes. - **Error parity.** `<Code>` reuses the existing `sqsErr*` constants. HTTP status mirrors what the JSON path returns, so SDK retry classifiers work across protocols. - **Cyclomatic budget honoured.** `handle()` was refactored to extract `handleQueryProtocol` — `cyclop ≤ 10` per project rules, no `//nolint`. ### Known limitation (design §11.4) `proxyToLeader`'s error writer always emits the JSON envelope, so a query-protocol client hitting a follower during a leader flip sees one JSON error before retry lands on the new leader. Follow-up PR threads the detected protocol onto the request context so the proxy emits matching XML. ## Test plan - [x] `go build ./...` — clean - [x] `go test -count=1 -race -run "TestSQS|QueryProtocol|TestPickSqs|TestCollectIndexedKV|TestWriteSQSQueryError" ./adapter/` — passes - [x] `golangci-lint run ./adapter/...` — `0 issues.` - [x] `pickSqsProtocol` table tests cover documented edge cases (header precedence, charset suffix, GET with Action, missing Action, nil request). - [x] `collectIndexedKVPairs` tests cover happy path, orphan Name, empty input, unrelated prefix. - [x] End-to-end via the in-process listener: CreateQueue / ListQueues / GetQueueUrl round-trip on the query side. - [x] **Cross-protocol parity**: a queue created via Query is visible via JSON `GetQueueUrl` with the same URL. - [x] Error envelope: 4xx maps to `<Type>Sender</Type>`, 5xx to `<Type>Receiver</Type>`, namespace pinned, `x-amzn-ErrorType` header set. - [x] Unknown verb returns 501 with the `NotImplementedYet` XML envelope. - [x] Missing `Action` parameter returns 400 (per design §3). ## Self-review (5 lenses) 1. **Data loss** — Wire-format change only. Cores are byte-for-byte identical to the previous handler bodies; no Raft / OCC / MVCC code is touched. 2. **Concurrency** — No new shared state. Detection is request-local. Body parsing is bounded. 3. **Performance** — One additional `Content-Type` string compare per request on the dispatch hot path. Negligible. 4. **Data consistency** — `*Core` returns the same business-logic outputs as before; the JSON tests are the regression net for parity. Cross-protocol parity test pins behaviour. 5. **Test coverage** — 10 new test cases cover detection, parsing, envelope shape, and three end-to-end verbs. Existing `TestSQS*` race suite passes on the refactor. ## Stacking This PR is **independent** — branched from current `main` (which has #638 + #649 merged). It does not depend on PR #650 / PR #659. Merge whenever ready.
) ## Summary **Replaces #659**, which has unresolvable conflicts now that main has moved on (PR #649 squashed into main; PR #658 added the S3 admin endpoints; the Approximate counters implementation now lives directly in `adapter/sqs_catalog.go` via `scanApproxCounters`). Rather than a multi-day rebase, this PR re-applies the unique SQS admin code on a fresh branch off current main. What survived from #659: - `adapter/sqs_admin.go` — `SQSServer.AdminListQueues` / `AdminDescribeQueue` / `AdminDeleteQueue` (SigV4-bypass entrypoints, same shape as `AdminListTables` / `AdminListBuckets`). - `internal/admin/sqs_handler.go` — HTTP handler for `/admin/api/v1/sqs/queues{,/{name}}` with role re-evaluation on DELETE. - `web/admin/src/pages/SqsList.tsx` / `SqsDetail.tsx` — SPA pages for the queues view + delete confirmation. What changed during the re-apply: - `AdminQueueCounters` is now `int64` (matches `sqsApproxCounters` from main; bridge does no width conversion). - `AdminDescribeQueue` calls main's `scanApproxCounters` instead of the duplicate `computeApproxCounters` from the old branch — same numeric output, single implementation. - Dropped the `CountersTruncated` field; main's counter type doesn't expose truncation. SPA's "truncated" pill came out with it. - `apiRouteTable.dispatch` refactored to extract `resourceHandlerFor` so the dispatcher stays under cyclop=10 as new resources land. ## Backend - Re-evaluates the principal's role against the live `MapRoleStore` on every `DELETE` so a downgraded key cannot keep mutating with a still-valid JWT (Codex P1 pattern from earlier admin PRs). - `admin.QueuesSource` is **opt-in**: deployments without `--sqsAddress` leave `/admin/api/v1/sqs/*` off the wire; the SPA renders a soft "endpoint pending" notice on the 404, mirroring the Tables / Buckets `nil` contract. - The bridge in `main_admin.go` (`sqsQueuesBridge`, `convertAdminQueueSummary`, `translateAdminQueuesError`) keeps `internal/admin` free of the heavy adapter dependency tree, same architectural pattern as Dynamo and S3. ## Frontend - **/sqs** queue list with refresh + per-row link to detail. - **/sqs/:name** detail showing FIFO badge, counters card (Visible / In-flight / Delayed), raw attributes table, and a Delete confirmation `Modal` gated by `RequireFullAccess`. - `api/client.ts` gains `listQueues` / `describeQueue` / `deleteQueue` with the same `AbortSignal` pattern used for `cluster` / `dynamo` / `s3` reads. - Layout nav adds an SQS tab between DynamoDB and S3. ## Out of scope (recorded in `docs/design/2026_04_24_proposed_sqs_compatible_adapter.md` Section 14, deferred per the SQS partial doc §16.2) - **PurgeQueue from the SPA** — the underlying `purgeQueueWithRetry` adapter method is on main; the admin entrypoint is a trivial follow-up. - **Send / Peek / CreateQueue from the SPA** — each needs its own SigV4-bypass adapter entrypoint and form UX; deferred to keep this PR focused. ## Test plan - [x] `go build ./...` — clean - [x] `go test -race ./internal/admin/...` — passes - [x] `go test -race -run TestSQS ./adapter/` — passes - [x] `go test -run TestStartAdmin .` — passes - [x] `golangci-lint run ./adapter/... ./internal/admin/... ./...` — `0 issues.`, no `//nolint` - [x] `cd web/admin && npm run build` — 49 modules, 199 KB JS / 61 KB gzip + 14.7 KB CSS - [ ] Manual smoke (after PR lands): start a node with `--sqsAddress :4566 --adminEnabled --adminAllowInsecureDevCookie`, create a queue, send a few messages, hit `/admin/sqs/<name>` → counters match `GetQueueAttributes("All")`, Delete dialog returns to list. ## Self-review (5 lenses) 1. **Data loss** — Delete reuses the existing `deleteQueueWithRetry` OCC path; counters are read-only. No new write paths. 2. **Concurrency** — Per-request leader check on Delete; counters scan uses one snapshot read TS. 3. **Performance** — Counters bounded by main's existing `sqsApproxCounterScanLimit`; admin reads are cheap point lookups + one bounded scan. 4. **Data consistency** — `AdminDescribeQueue` and SigV4 `GetQueueAttributes` both call `scanApproxCounters` at a fresh `nextTxnReadTS`, so a single point in time produces the same counters via either surface. 5. **Test coverage** — Existing admin / SQS race suites stay green via the new `nil` Queues argument added to `startAdminServer` call sites; the new bridge is exercised by the cross-package build itself. ## Stacking This PR is **independent** — branched from current `main` (which has the merged versions of #649 / #658 / #650 / counter implementation). Closing #659 in favour of this clean rewrite.
…FO) (#664) ## Summary **Docs-only PR.** Two design proposals for the remaining Phase 3 SQS items per [`docs/design/2026_04_24_proposed_sqs_compatible_adapter.md`](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_04_24_proposed_sqs_compatible_adapter.md) Section 14 (Phase 3 bullets). Both items were explicitly called out as needing separate design docs before any implementation work; this PR lands those proposals so the implementation PRs have a reviewed architecture to build on. ### 3.C — Per-queue throttling and tenant fairness ([proposal](docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md)) Per-queue token-bucket throttling configured on queue meta (no separate keyspace), evaluated at the SQS adapter layer on the leader (no Raft per request), surfaced as the AWS `Throttling` error envelope so SDK retry/backoff just works. Key decisions: - **Default-off**. Existing queues are unaffected; operators opt in per queue via `SetQueueAttributes`. - **Per-action buckets** (Send / Receive / Default) so a slow consumer cannot pin the producer. - **Per-leader buckets, no replication**. Worst case on failover: one extra burst on the new leader. Acceptable per AWS-equivalent behaviour at region failover boundaries; replicating would cost a Raft commit per `SendMessage`. - **Batch verbs charge by entry count**, not call count, with all-or-nothing rejection (matches AWS). - **Admin-only configuration plane**. Standard SQS clients see `InvalidAttributeName` on the `Throttle*` attributes (matches AWS behaviour for unknown attributes); the data-plane enforcement runs for everyone. ### 3.D — Split-queue FIFO (high-throughput FIFO) ([proposal](docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md)) Per-`MessageGroupId` hash partitioning across multiple Raft groups, mirroring AWS High Throughput FIFO. Within-group ordering preserved; across-group throughput scales with the partition count. Key decisions: - **Existing single-partition FIFO queues stay byte-identical** (`PartitionCount = 0` path is the legacy layout; no migration runs implicitly). - **Power-of-two partition counts only** (1, 2, 4, 8, 16, 32) so the routing step is `hash & (N-1)` and future offline rebuilds stay tractable. - **Partition count is immutable after first SendMessage**. Live re-partitioning would break ordering for in-flight messages of every group whose hash bucket changed; out of scope. - **Multi-PR rollout plan** with an explicit "gate of no return" called out at PR 5 (the data-plane PR). PRs 1–4 are reversible no-ops on data layout; once a partitioned FIFO holds real data, rollback means draining and recreating the queue. - **FNV-1a hash** (deterministic across processes / Go versions / architectures). Risk of attacker-controlled `MessageGroupId` pinning all traffic to one partition is documented and accepted (the feature is for cooperative operators). ## Test plan - [x] Markdown renders correctly on GitHub (manually previewed). - [x] Cross-references resolve (the partial-doc rename in PR #659 is in flight; both proposals reference the partial filename — they will resolve once #659 merges, otherwise resolve to the current `_proposed_` filename for a 1-character mismatch that is fine to leave for now). - [x] No code changes; CI is irrelevant beyond the markdown lint pass. ## Self-review This is a docs-only PR; the 5-lens self-review collapses to: 1. **Data loss / Concurrency / Performance / Consistency** — N/A, no code touched. 2. **Test coverage** — N/A, no code touched. The proposals themselves include a Testing Strategy section (§6 / §9) so the implementation PRs have explicit acceptance criteria. ## Stacking This PR is **independent** of #650, #659, and #662. Branched from current `main`. Merge whenever ready — landing the proposal docs early lets reviewers comment on the architecture before the implementation PRs go up. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added detailed designs for optional per-queue token-bucket throttling (rules, batch semantics, AWS-shaped Throttling errors with Retry-After, metrics, rollout default-off) and for HT‑FIFO “split-queue” partitioning (routing, immutability, gating/rollout). * **New Features** * Optional queue-level send/receive throttling with batch-aware charging and Retry-After; HT‑FIFO partitioning with deterministic partition routing and immutable partition attributes. * **Tests** * Extensive unit and integration tests for throttling, HT‑FIFO partitioning, attribute validation, immutability, idempotency, and cache invalidation. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Phase 3.A and 3.E of
docs/design/2026_04_24_partial_sqs_compatible_adapter.md(renamed from_proposed_in this PR — Phase 1+2 already landed). Phase 3.B / 3.C / 3.D are recorded as TODOs in §16.4 / §16.5 / §16.6 of the same doc with the rough scope each needs (XML query protocol, per-queue throttling, split-queue FIFO).3.A — Approximate counters on
GetQueueAttributesGetQueueAttributesnow returnsApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible, andApproximateNumberOfMessagesDelayed(previously: none of the three). Implementation walks the visibility index at the read TS and bucket-classifies each entry by(visible_at, available_at)vsnow:available_at ≤ now ∧ visible_at ≤ now— ready for the nextReceiveMessage.available_at ≤ now ∧ visible_at > now— in flight after a receive bumped vis.available_at > now— sent withDelaySeconds, not yet eligible.Bounded at 5,000 vis-index entries per call so a runaway queue cannot turn
GetQueueAttributesinto an O(stream) scan; over the budget the counts are reported as a lower bound andTruncated=trueso callers can log. The counters fire only when the caller asks for them or forAll— config-only callers still pay zero scan cost (matches AWS's "approximate" contract and keeps the existingGetQueueAttributescallers cheap).3.E — Admin SPA queues pages (Section 13 superseded)
The original Section 13 console UI design (separate
--consoleAddresslistener, bearer token, dedicatedgo:embedbundle) is superseded — see §16.2 of the design doc for the rationale. Operator UI for SQS lives on the existing admin dashboard SPA (#649) instead, reusing its JWT cookie session, CSRF double-submit, role model, and embed.FS.Backend (
adapter/sqs_admin.go+internal/admin/sqs_handler.go):SQSServer.AdminListQueues/AdminDescribeQueue/AdminDeleteQueueare SigV4-bypass entrypoints, mirroring theAdminListTables/AdminDescribeTable/AdminDeleteTablepattern.sqsQueuesBridgeinmain_admin.gore-shapesadapter.AdminQueueSummary→admin.QueueSummary, keepinginternal/adminfree of the heavy adapter dependency tree.admin.QueuesSourceis opt-in; deployments that don't run--sqsAddressleave/admin/api/v1/sqs/*off the wire and the SPA renders a soft "endpoint pending" notice on the 404.RoleStoreonDELETEso a downgraded key cannot keep mutating with a still-valid JWT.buildAPIMuxdispatch refactored intoapiRoutes.dispatch+hasResourcePrefixto keepcyclop ≤ 10as new resource prefixes get added.Frontend (
web/admin/src/pages/Sqs{List,Detail}.tsx):/sqsqueue list with refresh + per-row link./sqs/:namedetail showing FIFO badge, counters card (visible / in flight / delayed) with a "truncated" pill when the scan budget capped, raw attributes table, and a Delete confirmationModalgated byRequireFullAccess.api/client.tsgainslistQueues/describeQueue/deleteQueuewith the sameAbortSignalpattern used forcluster/dynamoreads.Out of scope (recorded in
docs/design/...partial_sqs_compatible_adapter.md§16.2)purgeQueueWithRetryadapter method ships in feat(sqs): finish Milestone 1 - purge, batch, tags, redrive, FIFO, retention reaper #638. Trivial to add as a follow-up once feat(sqs): finish Milestone 1 - purge, batch, tags, redrive, FIFO, retention reaper #638 lands.AdminSendMessage; Peek is genuinely new functionality (SQS has no peek primitive) and merits its own design discussion.Test plan
go build ./...go test -race ./internal/admin/...go test -race -run TestSQS ./adapter/go test -run TestStartAdmin .golangci-lint run ./adapter/... ./internal/admin/... ./...—0 issues.cd web/admin && npm run build && npm run lint— strict TS, no//nolint--sqsAddress :4566 --adminEnabled --adminAllowInsecureDevCookie, create a queue, send N + delay one + receive K, hit/admin/sqs/<name>→ counters match, Delete dialog returns to list.Self-review (5 lenses)
deleteQueueWithRetryOCC path.AdminDeleteQueuere-checks leader before each attempt; admin role re-evaluated per request.nextTxnReadTSand bucket from onenow. The "approximate" contract matches AWS.GetQueueAttributescover both happy path and the "only when requested" property. Admin handler tests stay green via thenilQueuesargument added to the existingstartAdminServercall sites.Stacking
This PR is stacked on #650 (Modal a11y + abort guard cleanup), which is itself stacked on #649 (admin SPA). Merge order: #649 → #650 → this. Once the chain merges, this rebases onto main with no conflicts.