docs(sqs): proposals for Phase 3.C (throttling) + 3.D (split-queue FIFO) #664
Conversation
Two design proposals for the remaining Phase 3 SQS items per docs/design/2026_04_24_proposed_sqs_compatible_adapter.md (Section 14, Phase 3 bullets). Both are explicitly called out as needing separate design docs before implementation; this lands them so implementation work has a reviewed architecture to start from.

**3.C — per-queue token-bucket throttling**, configured on queue meta, evaluated at the SQS adapter layer on the leader. Default-off, AWS-shape Throttling error envelope, per-leader buckets (no Raft per request). Proposal covers: bucket model, charging per verb / batch entry, configuration validation rules, AWS-compatibility posture (admin-only attribute, hidden from standard principals), multi-shard correctness, observability, alternatives, open questions.

**3.D — high-throughput FIFO** via per-MessageGroupId hash partitioning across multiple Raft groups. Multi-PR rollout plan with an explicit "gate of no return" called out at the data-plane PR. Existing single-partition FIFO queues stay byte-identical (PartitionCount=0 path). Proposal covers: data model and key encoding, routing, SendMessage / ReceiveMessage / PurgeQueue / DeleteMessage flows, shard router config syntax, reaper implications, migration story (no live re-partitioning), failure modes, mixed-version cluster behaviour, alternatives, open questions.

No code changes; review-only PR.
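For readers skimming the 3.D routing story, a minimal sketch of the per-group partition mapping in its reviewed form. The struct, field names, and FNV hash choice here are illustrative stand-ins, not the shipped code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Minimal stand-in for the queue meta fields the routing needs.
type sqsQueueMeta struct {
	PartitionCount      uint32
	FifoThroughputLimit string
}

// partitionFor deterministically maps a MessageGroupId onto [0, PartitionCount).
func partitionFor(meta sqsQueueMeta, messageGroupID string) uint32 {
	if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
		return 0 // single-partition FIFOs and Standard queues keep today's layout
	}
	h := fnv.New32a()
	h.Write([]byte(messageGroupID))
	return h.Sum32() % meta.PartitionCount
}

func main() {
	meta := sqsQueueMeta{PartitionCount: 8, FifoThroughputLimit: "perMessageGroupId"}
	fmt.Println(partitionFor(meta, "order-42")) // the same group always lands on the same partition
}
```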
📝 Walkthrough: Adds two design proposals (per-queue token-bucket throttling and HT-FIFO split-queue partitioning) and implements runtime support: queue metadata parsing/validation, deterministic partition mapping, an in-memory per-queue token-bucket store with eviction/invalidation, handler-level charging for send/receive (including batch semantics), and extensive unit and integration tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Server as SQSServer
    participant Catalog as QueueMetaStore
    participant Bucket as BucketStore
    participant Handler as QueueHandler
    Client->>Server: SendMessage / ReceiveMessage
    Server->>Catalog: Resolve QueueUrl → meta (throttle config, incarnation)
    Catalog-->>Server: Queue meta (exists / not found)
    alt queue exists
        Server->>Bucket: charge(queue, action, count, incarnation)
        alt tokens available
            Bucket-->>Server: allowed
            Server->>Handler: Dispatch to queue handler
            Handler-->>Server: Handler result
            Server-->>Client: 200 OK / response
        else throttled
            Bucket-->>Server: denied + retryAfter
            Server-->>Client: 400 Throttling + Retry-After header
        end
    else queue missing
        Server-->>Client: QueueDoesNotExist error
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 5 passed ✅
Code Review
This pull request introduces two design documents for the SQS adapter: one for per-queue throttling and another for split-queue FIFO partitioning. The throttling proposal aims to ensure tenant fairness and prevent resource exhaustion, while the split-queue FIFO design targets high-throughput requirements by distributing queues across multiple Raft groups. Review feedback identifies a critical backward compatibility risk in the proposed keyspace schema that would break existing data access. Additionally, there are concerns regarding potential performance bottlenecks from coarse-grained locking in the throttling store and a design flaw in the partitioned ReceiveMessage logic that could lead to 'false empty' responses for clients.
| A partition is identified by the tuple `(queueName, partitionIndex)` where `partitionIndex ∈ [0, N)`. Every existing message-keyspace key gains the `partitionIndex` as a fixed-width segment immediately after the queue name. Concretely, today's `!sqs|msg|data|<queue>|<gen>|<msgID>` becomes `!sqs|msg|data|<queue>|<partition>|<gen>|<msgID>`. The `<partition>` segment is `appendU64(partitionIndex)` (fixed 8 bytes BE), so:
| - Single-partition FIFOs (and all Standard queues) write `partitionIndex = 0` and read with the same prefix as today — the schema is byte-identical for `N=1`.
The proposed key format !sqs|msg|data|<queue>|<partition>|<gen>|<msgID> is not backward compatible with the existing format. Inserting the 8-byte <partition> segment shifts the offsets for <gen> and <msgID>, meaning the new code will fail to find existing messages on disk. This introduces a significant operational risk and prevents rolling upgrades. Per repository rules, the design must detail mitigation strategies (e.g., a temporary 'bridge' or 'proxy' mode) and strategies for zero-downtime cutovers (e.g., dual-write proxy or logical replication) to avoid service interruption.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
| mu sync.Mutex | ||
| buckets map[bucketKey]*tokenBucket // (queueName, action) -> bucket |
Using a single sync.Mutex to protect the buckets map will likely become a performance bottleneck, as this lock must be acquired on every SendMessage and ReceiveMessage call. Per the general rules, frequently accessed fields on hot paths should prefer more granular locking or atomic operations to improve performance.
| - mu sync.Mutex
| - buckets map[bucketKey]*tokenBucket // (queueName, action) -> bucket
| + buckets sync.Map // map[bucketKey]*tokenBucket
References
- For frequently accessed fields that require monotonic updates, prefer atomic operations or granular locking over mutexes to improve performance on hot paths.
| A consumer making one ReceiveMessage call sees results from **only the partitions whose leader is the local node**. For a deployment where every partition has its own leader, a single client's ReceiveMessage hits one partition's worth of messages per call. This matches AWS HT-FIFO behaviour: clients are expected to spread their consumer pool across partitions, and an idle partition does not block a busy one.
| For deployments that co-locate partitions on one Raft group, all partitions' leaders are the same node and a single ReceiveMessage scans all partitions in sequence. The scan is bounded by `MaxNumberOfMessages` and the existing per-partition page limit — no separate budget.
Restricting ReceiveMessage to only scan partitions led by the local node is a significant departure from the SQS model. In a multi-node deployment, this can lead to "false empty" responses if a client hits a node that is not the leader for partitions containing messages. This forces clients to implement cluster-aware round-robin logic, which leaks internal sharding details and breaks the SQS abstraction.
💡 Codex Review (reviewed commit 341b0deb14)
| b. If this node is not the leader for that partition, skip (the
| consumer's next call will land on a different node and that
| leader will serve its partition).
Proxy non-local partitions during ReceiveMessage
Step 2b explicitly skips partitions whose leader is not the local node, which means a consumer pinned to one endpoint will only ever drain that node’s leader-owned partitions while messages on other partitions can backlog indefinitely. This turns ReceiveMessage into node-local visibility rather than queue-wide visibility and breaks normal single-endpoint client deployments unless they add explicit endpoint round-robin behavior.
| | `adapter/sqs.go` | After `authorizeSQSRequest`, call `bucketStore.charge(queueName, action, count)`. On reject, write the `Throttling` envelope and return. | | ||
| | `adapter/sqs_throttle_test.go` (new) | Unit tests for bucket math (edge cases: idle drift, burst, partial refill, batch over-charge, default-off). ~300 lines. | | ||
| | `adapter/sqs_throttle_integration_test.go` (new) | End-to-end: configure a queue with low limits, send N messages back-to-back, confirm the (N+1)th gets `Throttling` with `Retry-After`. ~150 lines. | | ||
| | `monitoring/registry.go` | Two new counter vectors: `sqs_throttled_requests_total{queue, action}` and `sqs_throttle_tokens_remaining{queue, action}`. | |
Define tokens_remaining as a gauge metric
This section defines sqs_throttle_tokens_remaining as a counter, but token-bucket remaining capacity must go up and down over time. If implemented as a counter, the value can only increase and any dashboard based on “trending toward zero” becomes meaningless, masking real throttling pressure; this metric needs to be specified as a gauge.
Four issues raised by Codex + Gemini Code Assist on PR #664:

3.D split-queue FIFO doc:

1. Backward-compatibility risk in keyspace (Codex P1 + Gemini high). The original §3.1 said "every existing key gains a `<partition>` segment" but also "the schema is byte-identical for N=1" — those are contradictory; inserting bytes shifts every downstream offset and breaks readback of existing data on disk. Rewrote §3.1 to make the key shape conditional on whether the queue is partitioned: legacy / Standard queues use today's keys byte-identically, partitioned queues use a new prefix (`!sqs|msg|data|p|<queue>|<partition>|...`) with an explicit `p|` discriminator that cannot collide with the legacy prefix even when partition=0. Constructor sketch updated.
2. ReceiveMessage "false empty" on non-local-leader partitions (Codex P1 + Gemini medium). Original §4.2 had clients pinned to one node only seeing partitions whose leader is local — messages on other partitions backlog forever and clients hit empty replies that look like "queue is drained". Rewrote §4.2 so ReceiveMessage proxies the per-partition fanout to the right leader via the existing leader-proxy machinery (extended to take a partition argument). Added a partitionOrder rotation seeded by RequestId so successive calls don't bias toward partition 0. Co-located deployments still pay nothing; spread deployments pay one extra hop per non-local partition only until MaxNumberOfMessages are collected.

3.C per-queue throttling doc:

3. Single sync.Mutex on the bucket store would be a hot-path contention point (Gemini medium). Switched §3.1 to a sync.Map for the bucket lookup (read-mostly access pattern) plus a per-bucket sync.Mutex for charge / refill. Cross-queue traffic never serialises on the same lock. Documented the three-step charge operation explicitly so readers see the locking story (a sketch of that charge path follows below).
4. sqs_throttle_tokens_remaining must be a gauge, not a counter (Codex P2). Token budgets go up and down; a counter would mask exactly the depletion operators need to see. Updated both §4.1 (files-touched table) and §7 (operational dashboard) to call the metric a gauge. Counter is correct for the rejections total.

Docs-only PR; no code changes, no implementation gate.
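A minimal sketch of that three-step charge. The `bucketKey` / `sync.Map` shapes follow the doc; the `tokenBucket` field names and the example values are illustrative only:

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

type bucketKey struct {
	queue  string
	action string // "Send" | "Receive" | "*"
}

type tokenBucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64
	refill   float64 // tokens per second
	last     time.Time
}

type bucketStore struct {
	buckets sync.Map // bucketKey -> *tokenBucket
}

// charge runs the three-step operation: lock-free lookup, LoadOrStore on miss,
// then a per-bucket mutex for refill/take. No global lock on the hot path.
func (s *bucketStore) charge(key bucketKey, n, capacity, refill float64) bool {
	// Step 1: lock-free lookup.
	v, ok := s.buckets.Load(key)
	if !ok {
		// Step 2: build from queue meta and LoadOrStore; a racing insert is
		// benign because both racers construct the same configuration.
		v, _ = s.buckets.LoadOrStore(key, &tokenBucket{
			tokens: capacity, capacity: capacity, refill: refill, last: time.Now(),
		})
	}
	b := v.(*tokenBucket)
	// Step 3: per-bucket mutex only — refill for elapsed time, then take or reject.
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens = math.Min(b.capacity, b.tokens+b.refill*now.Sub(b.last).Seconds())
	b.last = now
	if b.tokens < n {
		return false
	}
	b.tokens -= n
	return true
}

func main() {
	s := &bucketStore{}
	key := bucketKey{queue: "orders.fifo", action: "Send"}
	fmt.Println(s.charge(key, 1, 10, 1)) // first charge against a full bucket: allowed
}
```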
Pushed addressing all four review findings (Codex + Gemini converged): 3.D split-queue FIFO doc:
3.C per-queue throttling doc:
Docs-only changes; no implementation gate. @claude review
Claude finished @bootjp's task in 3m 1s —— View job Review
All four issues from the Codex/Gemini review have been correctly addressed. New issues found below, one of them medium-severity.

All four previous issues: resolved ✓
- 3.D — Keyspace backward-compat (Gemini high + Codex P1) ✓
- 3.D — ReceiveMessage false-empty (Gemini medium + Codex P1) ✓
- 3.C — Hot-path lock contention (Gemini medium) ✓
- 3.C — Metric type (Codex P2) ✓

New issues in the updated doc: [Medium]
💡 Codex Review (reviewed commit 6fa02faa35)
| ## 8. Failure Modes and Edge Cases
| 1. **Consumer never sees a partition's leader**: a consumer that always lands on node A sees only the partitions whose leader is A. If partitions are spread evenly, this is `1/N` of the queue. Mitigation: AWS SDK consumers naturally distribute across endpoints; for elastickv, document that clients should round-robin endpoints when consuming a partitioned FIFO.
Remove node-local receive behavior from failure guidance
Fresh evidence: §8.1 explicitly says an endpoint-pinned consumer only sees 1/N of partitions, which contradicts §4.2’s required proxy fanout behavior. Keeping this guidance in the proposal makes the implementation ambiguous, and following it would regress ReceiveMessage to node-local visibility where messages on remote-leader partitions can backlog indefinitely for single-endpoint consumers.
| if meta.PartitionCount <= 1 { | ||
| return 0 | ||
| } |
Align partition routing with perQueue throughput mode
Section 3.2 says FifoThroughputLimit=perQueue should effectively collapse routing to one partition, but the partitionFor pseudocode only checks PartitionCount and will hash across all partitions whenever PartitionCount > 1. If this is implemented as written, perQueue mode cannot behave as documented and will still distribute groups across partitions.
| | JSON | HTTP 400, body `{"__type":"Throttling","message":"Rate exceeded for queue '<name>' action '<action>'"}`, header `x-amzn-ErrorType: Throttling`, header `Retry-After: 1` |
| | Query | HTTP 400, body `<ErrorResponse><Error><Type>Sender</Type><Code>Throttling</Code><Message>...</Message></Error><RequestId>...</RequestId></ErrorResponse>`, headers as above |
| `Retry-After: 1` is the conservative default — at the configured refill rate, one second is enough for at least one fresh token. A future iteration could compute the precise wait from `(1 - currentTokens) / refillRate` but the constant is enough for SDK backoff logic.
Compute Retry-After from refill rate
This claim assumes every bucket refills at least one token per second, but the proposed validator allows any positive refill rate. With configurations like RefillPerSecond=0.1, a Retry-After: 1 response guarantees another immediate throttle, creating avoidable retry pressure and misleading backoff behavior; the wait should be derived from the bucket’s refill rate (or the config should enforce a minimum of 1).
Five issues raised by Claude on PR #664 (one medium, four low — all docs-only edits to the partial doc trio):

3.D split-queue FIFO doc:

1. (Medium) §8.1 stale failure-mode description contradicted §4.2. Original §8.1 said "Consumer never sees a partition's leader → mitigation: round-robin endpoints" — but §4.2 was rewritten earlier in this PR to proxy non-local partitions server-side, eliminating the false-empty risk entirely. Implementors reading §8.1 in isolation would have built the wrong mental model and possibly suppressed the proxy logic. §8.1 now describes the actual remaining failure mode — proxy round-trip latency under spread deployments — and explains that single-partition / co-located deployments pay nothing.
2. (Low) §3.1 variadic sqsMsgDataKey was a footgun. The variadic form would silently accept extra args without compiler diagnosis. Replaced with two named constructors (legacyMsgDataKey / partitionedMsgDataKey) and an explicit dispatch at the call site based on meta.PartitionCount > 1 (a sketch of the two constructors follows below).
3. (Low) §3.1 p| discriminator safety constraint was implicit. Added an explicit paragraph explaining why the prefix cannot collide with the legacy keyspace: queue names admit only [A-Za-z0-9_-] (+ optional .fifo suffix), the validator must continue to reject |, and base32 segments never start with the literal ASCII p. Future relaxations of the name validator have to revisit this prefix scheme.
4. (Low) §8.5 upgrade-gate mechanism was underspecified. Added the concrete protocol: each node's /sqs_health endpoint advertises capabilities: ["htfifo"] once this PR's code is in the binary; the catalog reads the live node set from the distribution layer's node registry, polls /sqs_health on each, and gates PartitionCount > 1 on every node reporting htfifo. Fail-closed on timeouts. References PR #644 as the prior art for the same pattern.

3.C per-queue throttling doc:

5. (Low) §3.4 Retry-After: 1 lied for slow-refill queues. A queue with SendRefillPerSecond = 0.1 needs 10 s for the next token; returning 1 burned 9 unnecessary client retries before the request would succeed. Replaced with a per-queue computation: ceil((1 - currentTokens) / refillRate), floored at 1 s to match Retry-After's integer-second granularity. Fast-refill queues (>= 1 RPS) preserve the existing Retry-After: 1 behaviour.

Docs-only PR; no code changes, no implementation gate.
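A rough sketch of the two constructors as described above. The `gen` / `msgID` segments are simplified to plain strings here; the real segment encodings live in §3.1:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// legacyMsgDataKey keeps today's byte-identical layout for Standard and
// single-partition FIFO queues.
func legacyMsgDataKey(queue, gen, msgID string) []byte {
	return []byte(fmt.Sprintf("!sqs|msg|data|%s|%s|%s", queue, gen, msgID))
}

// partitionedMsgDataKey uses the new "p|" discriminator so the partitioned
// prefix cannot collide with the legacy one (see §3.1's constraints on queue
// names and segment encodings).
func partitionedMsgDataKey(queue string, partition uint64, gen, msgID string) []byte {
	key := []byte(fmt.Sprintf("!sqs|msg|data|p|%s|", queue))
	key = binary.BigEndian.AppendUint64(key, partition) // fixed 8-byte BE segment
	key = append(key, fmt.Sprintf("|%s|%s", gen, msgID)...)
	return key
}

func main() {
	fmt.Printf("%q\n", legacyMsgDataKey("orders.fifo", "g1", "m1"))
	fmt.Printf("%q\n", partitionedMsgDataKey("orders.fifo", 3, "g1", "m1"))
}
```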
Pushed addressing all five findings from your follow-up review (1 medium, 4 low): 3.D split-queue FIFO doc:
3.C per-queue throttling doc:
Docs-only changes; no implementation gate. @claude review
Claude finished @bootjp's task in 5m 50s —— View job Review
All five issues from the last Claude review are correctly resolved. One pre-existing Codex P2 finding from the second-pass review was not in the trigger's scope but remains open; surfacing it here. Two low-severity findings are new.

All five previous issues: resolved ✓
- 3.D — §8.1 stale failure-mode description (medium) ✓
- 3.D — §3.1 variadic sqsMsgDataKey replaced with named constructors (low) ✓
- 3.D — §3.1 p| discriminator constraint made explicit (low) ✓
- 3.D — §8.5 upgrade-gate mechanism (low) ✓
- 3.C — §3.4 Retry-After computation for slow-refill queues (low) ✓

New findings: [Medium]
💡 Codex Review (reviewed commit f289d6c291)
| 2. On miss, build the bucket from queue meta and `LoadOrStore` it (one-shot insert race tolerated — both racers will agree on the same configuration). | ||
| 3. Acquire the bucket's own `mu`, refill based on elapsed time, take or reject the requested tokens, release `mu`. |
Invalidate cached buckets on throttle config updates
The design caches each (queue, action) bucket on first use via LoadOrStore, but it never defines a refresh/invalidatation path when SetQueueAttributes changes throttle values. For active queues this means updated limits can stay stale until idle eviction or failover, so operators may keep over-throttling or under-throttling traffic long after they reconfigure the queue.
| secondsToNextToken := math.Ceil((1.0 - currentTokens) / refillRate) | ||
| retryAfter := max(1, int(secondsToNextToken)) // never less than 1 |
Base Retry-After on requested token cost
The proposed Retry-After calculation always targets the next single token, but this proposal charges batch requests by entry count. When a large SendMessageBatch is throttled, clients can be told to retry in 1 second even though many more seconds are needed to refill enough tokens for that same batch, which guarantees repeated throttles and adds avoidable retry pressure.
| `kv/shard_router.go` today routes by queue name. With partitions, the routing key becomes `(queueName, partitionIndex)`. The existing `--raftSqsMap` flag accepts entries keyed by queue name; the syntax extends to accept a partition suffix:
| ```
| --raftSqsMap "orders.fifo:8=group-7,group-8,group-9,group-10,group-11,group-12,group-13,group-14"
Split partition routing from raftSqsMap endpoint mapping
This section assigns queue/partition placement to --raftSqsMap, but that flag is currently defined as raftAddr=sqsAddr endpoint mapping for leader proxying. Reusing it for partition-to-group routing makes the config surface ambiguous and can break proxy address resolution, especially in the receive fanout flow that depends on correct leader endpoint lookup.
Three findings (1 medium, 2 low):
3.D split-queue FIFO doc:
1. (Medium) §3.3 partitionFor missed FifoThroughputLimit=perQueue.
§3.2 documents that perQueue "reduces routing to a single partition
regardless of PartitionCount" but the §3.3 pseudocode only checked
PartitionCount <= 1. A queue created with PartitionCount=8 +
FifoThroughputLimit=perQueue would hash across all 8 partitions,
directly contradicting the documented semantics. Updated the early-
return guard to:
if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
return 0
}
Codex P2 caught this in the second-pass review but it was outside
the scope of the previous fix iteration.
2. (Low) §4.1 step 5 still referenced the old variadic
sqsMsgDataKey(queue, partitionIndex, gen, msgID). §3.1 had already
moved to two named constructors (legacyMsgDataKey /
partitionedMsgDataKey) but §4.1's pseudocode still used the
variadic form, contradicting §3.1. Replaced with explicit if/else
dispatch through the named constructors so an implementor reading
§4.1 in isolation gets the same naming as §3.1.
3.C per-queue throttling doc:
3. (Low) §3.4 Retry-After formula's numerator was hardcoded to 1.0
regardless of charge size. SendMessageBatch of 10 entries against
a 0-token bucket with refillRate = 1.0 would have returned
Retry-After: 1 (insufficient) instead of 10. Updated formula to:
needed := float64(requestedCount) - currentTokens
secondsToNextRefill := math.Ceil(needed / refillRate)
retryAfter := max(1, int(secondsToNextRefill))
requestedCount is the same value §3.3's charge step uses
(1 for single, len(Entries) for batch).
Docs-only PR; no code changes, no implementation gate.
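Putting the corrected formula together as one runnable sketch (illustrative only — the PR itself stays docs-only; variable names follow the fragments above):

```go
package main

import (
	"fmt"
	"math"
)

// retryAfterSeconds waits long enough for the *requested* charge (1 for single
// verbs, len(Entries) for batches), never less than 1 second.
func retryAfterSeconds(requestedCount int, currentTokens, refillRate float64) int {
	needed := float64(requestedCount) - currentTokens
	if needed <= 0 {
		return 1
	}
	secs := int(math.Ceil(needed / refillRate))
	if secs < 1 {
		secs = 1
	}
	return secs
}

func main() {
	fmt.Println(retryAfterSeconds(10, 0, 1.0))  // full batch against an empty bucket: 10
	fmt.Println(retryAfterSeconds(1, 0.4, 0.1)) // slow-refill queue: 6
}
```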
Pushed addressing all three findings from your third review (1 medium, 2 low): 3.D split-queue FIFO doc:
3.C per-queue throttling doc:
Docs-only changes; @claude review
Claude finished @bootjp's task in 2m 9s —— View job Review
All three issues from the trigger are correctly resolved. Two Codex P1 findings from the previous review remain open and are surfaced below (they were not in scope of this push). Two new low-severity issues found.

All three previous issues: resolved ✓

3.D — §3.3 guard ✓:

```go
if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
    return 0
}
```

The comment explicitly documents that perQueue collapses routing to partition 0. 3.D — §4.1 step 5 named constructor dispatch (low) ✓. 3.C — §3.4: both cases are correctly handled.

Open Codex P1 findings from the previous round (not addressed in this push): these were raised against [Medium]
💡 Codex Review (reviewed commit d424d18d7c)
| queue string | ||
| action string // "Send" | "Receive" | "*" |
Key throttle buckets by queue generation
The bucket key is defined as only (queue, action), so a deleted/recreated queue with the same name will reuse the old in-memory bucket state until idle eviction, inheriting stale token balance and potentially stale limits. In practice this can throttle a brand-new queue immediately (or let it burst unexpectedly) after DeleteQueue/recreate, which breaks queue incarnation isolation; include generation (or another meta version) in the key or define explicit invalidation on queue lifecycle changes.
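A type-level sketch of the keying this finding asks for — the later commits settle on the queue's Incarnation (assigned at CreateQueue only) rather than Generation as the third field:

```go
// bucketKey extended with the queue's lifecycle identity so a DeleteQueue /
// CreateQueue cycle under the same name never reuses stale token state.
type bucketKey struct {
	queue       string
	action      string // "Send" | "Receive" | "*"
	incarnation uint64 // assigned at CreateQueue; PurgeQueue leaves it unchanged
}
```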
| 3. For each partitionIndex in partitionOrder, until MaxNumberOfMessages | ||
| are collected or every partition has been tried: | ||
| a. Resolve the leader for (queue, partitionIndex). | ||
| b. If this node is the leader: scan locally, deliver candidates. | ||
| c. Otherwise: forward the request to the leader-of-partition via |
Bound per-partition proxy calls by shared wait budget
This flow fans out ReceiveMessage across partitions by forwarding requests partition-by-partition, but it does not require reducing WaitTimeSeconds across hops. If each forwarded call long-polls independently, an empty queue can take roughly PartitionCount × WaitTimeSeconds, which can exceed SQS-style wait expectations and cause client/LB timeouts; the design should require a single request-level deadline and pass only remaining budget to each partition probe.
Four findings (2 medium, 2 low):

3.C per-queue throttling doc:

1. (Medium) §3.1 missed bucket cache invalidation on SetQueueAttributes. The doc described LoadOrStore on first use but no path for refreshing the in-memory bucket when SetQueueAttributes changes the throttle config. An operator lowering a limit to stop a noisy tenant would wait up to the idle-eviction window (1h default) before the change took effect. Added a paragraph: SetQueueAttributes calls buckets.Delete(key) for every action key after the Raft commit so the next request rebuilds from the new meta. The race with LoadOrStore is benign (the rebuilt bucket starts at full capacity, same semantics as the failover case).
2. (Low) §4.3 forward-compat sentence used the wrong key for the per-partition bucket. "sharded by MessageGroupId" was wrong — MessageGroupId is the input to partitionFor, not the bucket key. Using it directly would create one bucket per unique group value (unbounded, attacker-amplifiable). The bucket key is (queueName, partitionID) which is bounded by PartitionCount. Replaced the sentence with the correct keying and added the bucketKey-grows-a-partition-field note.

3.D split-queue FIFO doc:

3. (Medium) §5 reused --raftSqsMap for partition-to-Raft-group assignment. That flag already maps raftAddr=sqsAddr for proxyToLeader's endpoint resolution; overloading the same parser with partition assignments creates a parsing ambiguity that could silently produce the wrong proxy target in §4.2's fanout. Replaced with a new dedicated --sqsFifoPartitionMap flag. The existing --raftSqsMap is unchanged. Updated §11's rollout table accordingly.
4. (Low) §9 unit-test list was missing a case for FifoThroughputLimit=perQueue routing. The §3.3 short-circuit guard (added in the previous fix iteration) would silently regress without a test. Added: "FifoThroughputLimit=perQueue with PartitionCount=8 routes every group ID to partition 0."

Docs-only PR; no code changes, no implementation gate.
Pushed addressing all four findings from your fourth review (2 medium, 2 low): 3.C per-queue throttling doc:
3.D split-queue FIFO doc:
Docs-only changes; @claude review
Claude finished @bootjp's task in 3m 26s —— View job Review
All four issues from the trigger are correctly resolved. Two Codex P1 findings first surfaced in my fourth-round review — and noted there as "not in scope of this push" — remain open in the current doc. One new low-severity finding. All four trigger fixes: resolved ✓3.C — §3.1 3.C — §4.3 bucket key 3.D — §5 3.D — §9 Still-open Codex P1 findings from
💡 Codex Review (reviewed commit f6bd33ab70)
| - All four `Send*` / `Recv*` fields must be either both zero (disabled) or both positive. | ||
| - Capacity ≥ refill (otherwise the bucket can never burst above the steady state). | ||
| - A hard ceiling per queue (e.g. 100,000 RPS) so a typo (`SendCapacity = 1e9`) does not silently mean "no limit at all" but rejects with `InvalidAttributeValue`. |
Require bucket capacity to cover max batch token cost
The validator currently only enforces paired positivity and capacity >= refill, but §3.3 charges batch APIs by len(Entries) (up to 10). With configs like SendCapacity=5, a valid SendMessageBatch of 10 can never succeed because the bucket can never accumulate 10 tokens, so clients get permanent throttling and misleading Retry-After values. Please add validation that per-action capacity is at least the maximum single-request charge (or change the charging rule) to avoid creating unrecoverable throttling states.
| 5. **Mixed-version cluster**: a rolling upgrade where some nodes have HT-FIFO and others don't. The new feature gates on the queue's `PartitionCount > 1` field, which is set at create time; old nodes that try to scan a partitioned queue's keyspace will simply not find anything (the prefix has changed). The catalog rejects `CreateQueue` with `PartitionCount > 1` until every node in the cluster reports the new feature flag.
| **The capability advertisement mechanism**: each node's existing `/sqs_health` endpoint (`adapter/sqs.go: serveSQSHealthz`) gains a new field in its JSON body — `capabilities: ["htfifo"]` once this PR's code is in the binary. The catalog's CreateQueue handler reads the live node set from the distribution layer's node registry (the same registry used by `proxyToLeader` to locate leaders), polls `/sqs_health` on each, and gates `PartitionCount > 1` on every node reporting the `htfifo` capability. Nodes that don't respond within a short timeout are treated as not-yet-upgraded — a deliberate fail-closed default so a network blip does not let a partitioned queue land in a partially-upgraded cluster. This mirrors the §3.3.2 admin-forwarding upgrade gate from the admin dashboard design (PR #644), which uses the same "all-nodes-must-report" pattern for `AdminForward`.
Enforce HT-FIFO capability beyond queue creation
This guard only blocks CreateQueue when not all nodes advertise htfifo, but the same section notes older nodes will not see partitioned-keyspace data. That leaves a failure mode where a downgraded/replaced old node joins after creation, becomes leader for a partitioned queue, and serves false-empty reads/writes against the wrong prefix. The proposal should require a runtime safeguard (startup or leadership admission) that prevents non-HT-FIFO binaries from serving once any PartitionCount > 1 queue exists.
…conds, §3.4) Three findings from the fifth-round Claude review on PR #664: 3.C per-queue throttling — DeleteQueue bucket lifecycle (medium): §3.1 now requires DeleteQueue to call buckets.Delete(key) for every bucketKey of the deleted queue (Send, Recv, Default), so a same-name recreation does not inherit stale token balance/limits for up to 1h. Alternative (generation in bucketKey) considered and rejected as more expensive than an explicit Delete. 3.D split-queue FIFO — WaitTimeSeconds budget across fanout (medium): §4.2 now threads a shared deadline through the partition fanout (step 3) and passes remainingWait per sub-call (step 4a/c/d). Without this, an empty queue with WaitTimeSeconds=20 + PartitionCount=8 would hold the connection for up to 160s. §8.1 also drops the contradictory "long-poll mitigates the cost" sentence — long-poll was actually the trigger for the bug. 3.D split-queue FIFO — §3.4 --raftSqsMap reference (low): rephrased to distinguish the router infrastructure (kv/shard_router.go, which already supports multi-group routing by key) from the configuration surface (the new --sqsFifoPartitionMap flag in §5). Eliminates the contradiction with §5.
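A sketch of the shared-deadline fanout described above. The `receiveFn` type stands in for the partition-aware leader proxy; its signature and the `Message` type are assumptions of this sketch:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Message is a stand-in for the adapter's message type.
type Message struct{ ID, Body string }

// receiveFn is the assumed shape of the per-partition receive (local scan or proxy).
type receiveFn func(ctx context.Context, queue string, partition uint32, max int, wait time.Duration) ([]Message, error)

// receiveAcrossPartitions threads one request-level deadline through the partition
// fanout so an empty queue never long-polls PartitionCount × WaitTimeSeconds.
func receiveAcrossPartitions(ctx context.Context, queue string, partitionOrder []uint32,
	maxMessages int, wait time.Duration, receive receiveFn) []Message {
	deadline := time.Now().Add(wait)
	var collected []Message
	for _, p := range partitionOrder {
		remaining := time.Until(deadline)
		if remaining <= 0 || len(collected) >= maxMessages {
			break // shared budget exhausted or enough messages gathered
		}
		msgs, err := receive(ctx, queue, p, maxMessages-len(collected), remaining)
		if err != nil {
			continue // an unreachable partition forfeits only its own probe
		}
		collected = append(collected, msgs...)
	}
	return collected
}

func main() {
	fake := func(ctx context.Context, q string, p uint32, max int, wait time.Duration) ([]Message, error) {
		return []Message{{ID: fmt.Sprintf("p%d-m1", p)}}, nil
	}
	msgs := receiveAcrossPartitions(context.Background(), "orders.fifo", []uint32{2, 0, 1, 3}, 3, 20*time.Second, fake)
	fmt.Println(len(msgs)) // 3 — stops once MaxNumberOfMessages are collected
}
```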
Pushed addressing all three findings from your fifth-round review (2 medium, 1 low): 3.C per-queue throttling — §3.1
3.D split-queue FIFO — §4.2
3.D split-queue FIFO — §3.4
@claude review
Claude finished @bootjp's task in 3m 20s —— View job Review
All three fixes from the trigger are correctly applied. Two Codex P1 items raised in the sixth-round review remain open and are the only findings I would consider blocking before implementation PRs. One new low-severity finding. All three trigger fixes: resolved ✓3.C — §3.1 3.D — §4.2 3.D — §3.4 Still-open Codex P1 findings from
💡 Codex Review (reviewed commit 624d411a93)
| No global lock is held during step 3; concurrent traffic on different queues runs in parallel.
| **Cache invalidation on `SetQueueAttributes`**: when an operator updates the throttle config via `SetQueueAttributes`, the handler — *after* the Raft commit that persists the new `sqsQueueThrottle` — calls `buckets.Delete(key)` for every `bucketKey` belonging to the updated queue (`Send`, `Recv`, `Default`). Without this step the in-memory bucket would keep enforcing the old limits until the idle-eviction sweep removes the stale entry (default 1 h window), defeating the operator's intent to throttle a noisy tenant in real time. The `LoadOrStore` race with the `Delete` call is benign: the next request rebuilds from the freshly-committed meta, and the rebuilt bucket starts at full capacity (same semantics as the failover case documented below). Claude P1 on PR #664 caught the gap.
Use canonical action keys when deleting throttle buckets
The design defines bucketKey.action as "Send" | "Receive" | "*", but this invalidation step deletes ("Send", "Recv", "Default"). If implemented as written, Receive and default buckets will not be removed on SetQueueAttributes, so updated limits can remain stale until idle eviction (and the same mismatch appears in the DeleteQueue path), which directly defeats the real-time reconfiguration behavior this section is trying to guarantee.
| }
| ```
| `PartitionCount` is **immutable after first SendMessage**. The validator on `SetQueueAttributes` rejects any change; operators who want a different partition count create a new queue. Why immutable: changing it would require re-hashing every existing message into a new partition, which (a) breaks ordering for in-flight messages of every group whose hash bucket changed, and (b) is a multi-second / multi-minute operation that cannot be expressed as one OCC transaction.
Keep FIFO routing attributes immutable after first send
This section only freezes PartitionCount, but routing also depends on FifoThroughputLimit (perQueue forces partition 0 in partitionFor). Allowing SetQueueAttributes to flip throughput mode after messages exist can move the same MessageGroupId to a different partition mid-stream, splitting one group across partitions and violating FIFO within-group ordering; the proposal should lock routing-affecting attributes after first send (or require queue recreation).
…rade, key) Three findings from the sixth-round review on PR #664: 3.C per-queue throttling §3.2 — capacity bound on max batch charge (Codex P1, medium): SetQueueAttributes validator now requires capacity >= max single-request charge. Send/Recv buckets must be >= 10 when non-zero (full SendMessageBatch / DeleteMessageBatch is 10 entries). Without this rule, SendCapacity=5 makes every full batch permanently unserviceable — the bucket can never accumulate enough tokens. 3.D split-queue FIFO §8 item 5 — runtime safeguard for downgraded leaders (Codex P1, medium): create-time gate only blocks new partitioned queues. Once one exists, a downgraded node lacking htfifo capability could be elected leader and produce false-empty reads + key-prefix corruption. Adds a node-admission rule: on startup + leadership acquisition, enumerate local-shard queues with PartitionCount > 1; if any and the binary lacks htfifo, refuse leadership and step down via TransferLeadership. 3.C per-queue throttling §3.1 — DeleteQueue paragraph bucketKey text (Claude low): "(queueName, partitionID, action)" contradicted the struct definition above it (no partitionID until Phase 3.D §4.3). Corrected to "(queueName, action) for Phase 3.C".
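A sketch of the amended §3.2 validator rule for one (capacity, refill) pair; names, constants, and error strings are illustrative rather than the shipped validator:

```go
package main

import "fmt"

const maxBatchEntries = 10 // SendMessageBatch / DeleteMessageBatch entry cap

// validateThrottlePair sketches the §3.2 rules as amended above: paired positivity,
// capacity >= refill, capacity large enough for a full batch, and a sanity ceiling.
func validateThrottlePair(name string, capacity, refillPerSecond float64) error {
	switch {
	case capacity == 0 && refillPerSecond == 0:
		return nil // throttling disabled for this action
	case capacity <= 0 || refillPerSecond <= 0:
		return fmt.Errorf("InvalidAttributeValue: %sCapacity and %sRefillPerSecond must both be positive or both zero", name, name)
	case capacity < refillPerSecond:
		return fmt.Errorf("InvalidAttributeValue: %sCapacity must be >= %sRefillPerSecond", name, name)
	case capacity < maxBatchEntries:
		return fmt.Errorf("InvalidAttributeValue: %sCapacity must be >= %d so a full batch can ever succeed", name, maxBatchEntries)
	case refillPerSecond > 100000:
		return fmt.Errorf("InvalidAttributeValue: %sRefillPerSecond exceeds the per-queue ceiling", name)
	}
	return nil
}

func main() {
	fmt.Println(validateThrottlePair("Send", 5, 1))  // rejected: a 10-entry batch could never succeed
	fmt.Println(validateThrottlePair("Send", 10, 1)) // ok
}
```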
Pushed addressing all three findings from your sixth-round review (2 medium Codex P1 + 1 low Claude): 3.C per-queue throttling — §3.2 capacity ≥ max batch charge (Codex P1, medium):
3.D split-queue FIFO — §8 item 5 runtime safeguard for downgraded leaders (Codex P1, medium):
3.C per-queue throttling — §3.1
@claude review
Claude finished @bootjp's task in 3m 10s —— View job Review
All three fixes from this push are correctly applied. Two Codex P1 findings from the previous round's Codex review (on commit All three trigger fixes: resolved ✓3.C — §3.2 capacity ≥ max batch charge (Codex P1, medium) ✓ 3.D — §8 item 5 runtime safeguard for downgraded leaders (Codex P1, medium) ✓ 3.C — §3.1 Still-open Codex P1 findings from the commit
…immutability, test capacity) Three findings from the seventh-round review on PR #664: 3.C per-queue throttling §3.1 -- action key naming consistency (Codex P1, medium): bucketKey.action canonical values are "Send"|"Receive"|"*" but cache invalidation paragraphs were saying "Send","Recv","Default" (the JSON config field-name prefixes). An implementation following the struct would never match the wrong key names, leaving Receive and "*" buckets stale after config changes. Changed both invalidation paragraphs to use the struct vocabulary, and added a "Config-field -> bucket-action mapping" paragraph after the sqsQueueThrottle struct so the short-form / canonical-form distinction is explicit. 3.D split-queue FIFO §3.2 -- FifoThroughputLimit and DeduplicationScope must also be immutable (Codex P1, medium): prior text only marked PartitionCount as immutable. But changing FifoThroughputLimit (perMessageGroupId<->perQueue) flips the §3.3 short-circuit on partition routing -- in-flight messages stay where they are, new sends route differently, within-group FIFO ordering silently breaks. Changing DeduplicationScope reshapes the dedup key so the prior scope's de-duped messages can resurrect or new sends can be incorrectly suppressed. Extended the immutability sentence to cover all three with one bullet per attribute explaining the failure mode. 3.C per-queue throttling §6 item 2 -- test capacity inconsistent with new validator (Claude low): integration test used SendCapacity=5 but §3.2's new fourth bullet requires SendCapacity >= 10. Test would have failed at setup, not on the 6th send. Bumped to SendCapacity=10 / 10 messages back-to-back / 11th throttled, with an inline note explaining why the value changed.
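A sketch of the extended immutability guard; the struct is a stand-in for the adapter's real queue meta and the error strings are illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// routingAttrs collects the three routing-affecting FIFO attributes the doc now freezes.
type routingAttrs struct {
	PartitionCount      uint32
	FifoThroughputLimit string
	DeduplicationScope  string
}

// validateImmutableFifoAttrs rejects any SetQueueAttributes change to a frozen attribute.
func validateImmutableFifoAttrs(current, proposed routingAttrs) error {
	switch {
	case proposed.PartitionCount != current.PartitionCount:
		return errors.New("InvalidAttributeValue: PartitionCount is immutable; recreate the queue")
	case proposed.FifoThroughputLimit != current.FifoThroughputLimit:
		return errors.New("InvalidAttributeValue: FifoThroughputLimit is immutable; recreate the queue")
	case proposed.DeduplicationScope != current.DeduplicationScope:
		return errors.New("InvalidAttributeValue: DeduplicationScope is immutable; recreate the queue")
	}
	return nil
}

func main() {
	cur := routingAttrs{PartitionCount: 8, FifoThroughputLimit: "perMessageGroupId", DeduplicationScope: "messageGroup"}
	next := cur
	next.FifoThroughputLimit = "perQueue"
	fmt.Println(validateImmutableFifoAttrs(cur, next)) // rejected: would re-route existing groups
}
```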
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
Pushed addressing all three findings from your seventh-round review (2 medium Codex P1 + 1 low Claude): 3.C per-queue throttling — §3.1 action key naming consistency (Codex P1, medium):
3.D split-queue FIFO — §3.2 immutability extended to FifoThroughputLimit + DeduplicationScope (Codex P1, medium):
3.C per-queue throttling — §6 item 2 test capacity (Claude low):
@claude review
@claude review
Claude encountered an error —— View job I'll analyze this and get back to you.
Pull request overview
This PR adds SQS Phase 3 design proposals and (despite the PR’s “docs-only” framing) lands significant Go implementation + test changes to introduce per-queue throttling and HT-FIFO (split-queue FIFO) schema/validation scaffolding in the SQS adapter.
Changes:
- Add design proposal docs for per-queue throttling (Phase 3.C) and HT-FIFO partitioning (Phase 3.D).
- Implement per-queue token-bucket throttling (bucket store, config parsing/validation, request-path charging, cache invalidation) plus integration/unit tests.
- Introduce HT-FIFO queue meta fields and validation (PartitionCount / FifoThroughputLimit / DeduplicationScope), including a temporary “dormancy gate”, with unit/integration tests.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md | Proposal doc for per-queue throttling behavior, config surface, and rollout/testing strategy. |
| docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md | Proposal doc for HT-FIFO partitioning model, routing, keyspaces, rollout plan, and failure modes. |
| adapter/sqs_throttle.go | New throttling implementation (bucket store, charging, invalidation, sweep loop, helpers). |
| adapter/sqs_throttle_test.go | Unit tests for bucket math, eviction/reconciliation races, config validation, and helpers. |
| adapter/sqs_throttle_integration_test.go | End-to-end tests for throttling behavior and SetQueueAttributes invalidation semantics. |
| adapter/sqs_partitioning.go | New HT-FIFO vocabulary, routing primitive (partitionFor), validators, and attribute rendering helpers. |
| adapter/sqs_partitioning_test.go | Unit tests for partition routing, validation, immutability, and idempotency helpers. |
| adapter/sqs_partitioning_integration_test.go | Wire-level integration tests for HT-FIFO validation/immutability and the temporary dormancy gate. |
| adapter/sqs_messages.go | Threads throttling checks into Send/Receive/Delete/ChangeVisibility paths; refactors SendMessage decode/validation. |
| adapter/sqs_messages_batch.go | Adds throttling checks for batch verbs (SendMessageBatch/DeleteMessageBatch/ChangeMessageVisibilityBatch). |
| adapter/sqs_catalog.go | Extends queue meta (Throttle + HT-FIFO fields), parsing/validation, attribute rendering, idempotency checks, and invalidation plumbing. |
| adapter/sqs_catalog_test.go | Extends CreateQueue idempotency tests to include Throttle* differences. |
| adapter/sqs.go | Adds server-owned bucket store lifecycle and JSON throttling error helper. |
| // queueThrottleConfig loads the Throttle config and Generation off a | ||
| // queue's meta record. Generation participates in the bucket key, so | ||
| // it must travel with the throttle snapshot to avoid a stale-meta | ||
| // read that mints a fresh bucket under the wrong incarnation. | ||
| // | ||
| // Returns: | ||
| // - (cfg, gen, nil) on a successful read of an existing queue. |
queueThrottleConfig’s comment refers to “Generation participates in the bucket key” and says it loads Generation, but it returns meta.Incarnation. Please update the comment to match the actual return values and the keying scheme.
| - // queueThrottleConfig loads the Throttle config and Generation off a
| - // queue's meta record. Generation participates in the bucket key, so
| - // it must travel with the throttle snapshot to avoid a stale-meta
| - // read that mints a fresh bucket under the wrong incarnation.
| - //
| - // Returns:
| - //   - (cfg, gen, nil) on a successful read of an existing queue.
| + // queueThrottleConfig loads the Throttle config and Incarnation off a
| + // queue's meta record. Incarnation participates in the bucket key, so
| + // it must travel with the throttle snapshot to avoid a stale-meta
| + // read that mints a fresh bucket under the wrong incarnation.
| + //
| + // Returns:
| + //   - (cfg, incarnation, nil) on a successful read of an existing queue.
| # Per-Queue Throttling and Tenant Fairness for the SQS Adapter
| **Status:** Proposed
| **Author:** bootjp
| **Date:** 2026-04-26
The PR metadata/description says this is a docs-only PR with CI “irrelevant”, but this diff also adds/changes multiple Go implementation + test files (e.g., new throttling + HT-FIFO schema/validation code). Please update the PR title/description and test plan to reflect the code changes (or split docs vs implementation into separate PRs) so reviewers and release notes aren’t misled.
| // | ||
| // If meta.Throttle is empty (the IsEmpty short-circuit) the function | ||
| // also drops the empty struct so a round-trip GetQueueAttributes | ||
| // reports the queue as untrothttled rather than zero-valued. Mirrors |
Typo in comment: “untrothttled” should be “unthrottled”.
| - // reports the queue as untrothttled rather than zero-valued. Mirrors
| + // reports the queue as unthrottled rather than zero-valued. Mirrors
| // incarnation is sqsQueueMeta.Generation so a same-name recreate or | ||
| // cross-leader failover never reuses an older incarnation's tokens. |
The doc comment for bucketStore.charge says “incarnation is sqsQueueMeta.Generation”, but the code and the rest of the design use Incarnation (CreateQueue-only) specifically to avoid PurgeQueue resetting buckets. Please update this comment to avoid future callers passing Generation by mistake.
| - // incarnation is sqsQueueMeta.Generation so a same-name recreate or
| - // cross-leader failover never reuses an older incarnation's tokens.
| + // incarnation is the queue's Incarnation value (the CreateQueue-only
| + // lifecycle identity), not sqsQueueMeta.Generation, so PurgeQueue does
| + // not reset buckets while same-name recreate or cross-leader failover
| + // still never reuses an older incarnation's tokens.
| // Generation participates in the key (Codex P1 on PR #664): we do | ||
| // not know which incarnations have buckets cached, so range the map | ||
| // and remove any entry whose queue matches. A SetQueueAttributes | ||
| // invalidation on the same incarnation must drop the same-gen | ||
| // bucket so the new throttle config takes effect; a DeleteQueue / | ||
| // CreateQueue cycle would also drop any pre-existing incarnation's | ||
| // bucket here, although those entries also fall out via idle | ||
| // eviction since the new incarnation lands under a different key | ||
| // and the old key never sees traffic again. |
invalidateQueue’s comment says “Generation participates in the key”, but bucketKey is (queue, action, incarnation). Please adjust the comment to refer to Incarnation to match the actual keying behavior.
@claude review
Claude finished @bootjp's task in 6m 2s —— View job Review
Pushed Fixes pushed (5 files,
…s (PR #664) - chargeQueue: use writeSQSErrorFromErr instead of leaking err.Error() in body - bucketStore.charge comment: "sqsQueueMeta.Generation" → "sqsQueueMeta.Incarnation" - invalidateQueue comment: "Generation participates" → "Incarnation participates" - queueThrottleConfig comment: "Throttle config and Generation" → "Throttle config and Incarnation" - sqs_throttle_test.go: remove stale TestBucketStore_InvalidateQueueDropsAllActions header merged into the IncarnationKeyed test comment block - sqs_catalog.go: typo "untrothttled" → "unthrottled" - docs(3.C §4.3): document per-partition budget multiplication semantics explicitly (each partition gets full SendCapacity; aggregate = N × limit; rationale + escape hatch) - docs(3.D §7.2): require explicit PartitionCount for multi-partition FIFO; drop server-side default inference (idempotency hazard per CodeRabbit major) Co-authored-by: Yoshiaki Ueda (bootjp) <bootjp@users.noreply.github.com>
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@adapter/sqs_catalog.go`:
- Around line 487-494: The PartitionCount attribute parser (the
"PartitionCount": func in the sqs catalog that sets sqsQueueMeta.PartitionCount)
should remove the //nolint:gosec suppression and instead parse with
strconv.ParseUint using a 64-bit size, then explicitly check that the parsed
value <= math.MaxUint32; if out of range return the same newSQSAPIError with
invalid attribute message, otherwise safely cast to uint32 and assign to
m.PartitionCount. Ensure you keep existing ParseUint error handling and return
the same HTTP bad request error for non-numeric input.
In `@docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md`:
- Around line 115-117: The doc currently suggests using a raw
buckets.Delete(key) for throttle invalidation which reintroduces the
orphan-bucket race; change the text to require the same
lock-then-delete-then-flag semantics as bucketStore.invalidateQueue (and
reference bucketKey/incarnation) instead of per-key Delete, i.e., require
implementers to call the invalidateQueue-style path that takes the lock, marks
the bucket invalid, and then removes entries so a pre-existing in-memory bucket
cannot be spent after invalidation; mention the exact symbols
buckets.Delete(key), bucketStore.invalidateQueue, and bucketKey so readers
locate the correct code paths.
In `@docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md`:
- Around line 227-234: The doc must state that X-Elastickv-Receive-Partition is
an internal-only hop hint: ensure the ingress path strips or ignores any
client-supplied X-Elastickv-Receive-Partition header and only the trusted proxy
path (proxyToLeader, extended to accept a partition argument) may set this
header when forwarding ReceiveMessage calls; update the leader-proxy machinery
description to show proxyToLeader adds X-Elastickv-Receive-Partition and sets
WaitTimeSeconds from remainingWait, and clarify that downstream handlers must
treat any externally provided header as untrusted (ignore/reject) so callers
cannot force single-partition reads or observe partial/false-empty
ReceiveMessage results.
- Around line 370-373: Remove the "Default `PartitionCount` for AWS-shape
callers" and the "`PartitionCount` per-shard vs per-queue" open-question
bullets: delete the text referencing a server-side default for PartitionCount
and the suggestion of choosing 8/16/32 so the doc no longer proposes a default
when FifoThroughputLimit = perMessageGroupId; instead leave the explicit
requirement from §7.2 that PartitionCount must be provided and omissions are
rejected, and retain the `--sqsDefaultFifoPartitionCount` flag discussion only
if it is framed as an operator-side convenience (not a server default).
📒 Files selected for processing (5)
- adapter/sqs_catalog.go
- adapter/sqs_throttle.go
- adapter/sqs_throttle_test.go
- docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md
- docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md
🚧 Files skipped from review as they are similar to previous changes (1)
- adapter/sqs_throttle.go
…tionCount validator (PR #664 round 12 follow-up) Two findings from the round-11 CodeRabbit review that were not in the user's 642ec0d commit: (1) CodeRabbit Major (sqs_throttle.go:221) — chargeBucket retry loop exhaustion was fail-open. The 4-iteration retry loop's terminal fallback returned chargeOutcome{allowed: true}, turning the invalidate/reconcile race into a throttle bypass on the exact path meant to enforce limits. Fixed: terminal fallback now returns chargeOutcome{allowed: false, retryAfter: 1s, bucketPresent: false} so the client gets a normal back-off cue. (2) CodeRabbit Major (split-queue-fifo §7.2) — the docs already require an explicit PartitionCount for HT-FIFO (landed in 642ec0d), but the validator did not enforce that on the wire. Added validatePartitionConfig rule that rejects FifoThroughputLimit=perMessageGroupId on PartitionCount<=1 with InvalidAttributeValue. Refactored validatePartitionConfig into validatePartitionShape + validateStandardQueueRejectsHTFIFO helpers to keep cyclop ≤ 10 once the new branch landed. New unit test TestValidatePartitionConfig_PerMessageGroupIDRequiresExplicitPartitionCount covers the (PartitionCount=0, PartitionCount=1, PartitionCount=8) × FifoThroughputLimit accept/reject matrix. Existing integration tests (Immutability / AllOrNothing / GetQueueAttributesRoundTrip) updated to use FifoThroughputLimit=perQueue rather than perMessageGroupId — perMessageGroupId is now reachable only when paired with PartitionCount > 1, which the §11 PR 2 dormancy gate currently rejects on the wire. perQueue is a legitimate single-partition FIFO setting and exercises the same immutability/roundtrip paths. go test -race ./adapter/... pass; golangci-lint ./... clean.
Round 12 follow-up: (1) CodeRabbit Major (sqs_throttle.go:221, chargeBucket fail-open fallback): as pointed out, the retry loop's terminal fallback let the request through. Fix: the terminal fallback now returns chargeOutcome{allowed: false, retryAfter: 1s, bucketPresent: false}. (2) CodeRabbit Major (split-queue-fifo §7.2 PartitionCount implicit default): the validator now rejects FifoThroughputLimit=perMessageGroupId with PartitionCount <= 1.
New unit test: TestValidatePartitionConfig_PerMessageGroupIDRequiresExplicitPartitionCount. Existing integration tests (Immutability / AllOrNothing / GetQueueAttributesRoundTrip) now use FifoThroughputLimit=perQueue.
@claude review
Claude finished @bootjp's task in 8m 2s —— View job Review
Both trigger fixes from round 12 are correctly applied. Three still-open CodeRabbit findings from the latest round (posted concurrently with Both trigger fixes: resolved ✓CodeRabbit Major — CodeRabbit Major —
The refactor into Integration tests updated to New unit test ✓ Still-open CodeRabbit findings from the latest round (posted concurrently with
…review) Four findings from the round-12 Claude review on commit 55297f6: (1) Medium (sqs_catalog.go:493) — //nolint:gosec on the PartitionCount uint64→uint32 narrowing. CLAUDE.md requires refactor over suppression. Reparsed at uint64 width with an explicit math.MaxUint32 bounds check before the narrowing; gosec is happy without the comment. (2) Medium (per-queue-throttling §3.1 line 115) — doc said SetQueueAttributes 'calls buckets.Delete(key) for every bucketKey' while the implementation calls invalidateQueue (the lock-then-CompareAndDelete-then-flag protocol that closes the orphan-bucket race). Updated the paragraph to describe invalidateQueue semantics and explain why per-key Delete would reintroduce the race that round 5 / round 6.1 closed. Also called out the no-op-write gate added in round 12. (3) Medium (split-queue-fifo §4.2) — X-Elastickv-Receive-Partition was described as a header proxyToLeader sets but not as internal-only. Added an explicit security note: ingress strips/ignores any client-supplied X-Elastickv-Receive-Partition before fan-out runs; only the trusted proxyToLeader path may set it. Without that strip, a malicious client could force single-partition reads, observe false-empty results for groups that hash to other partitions, or repeatedly drain a targeted partition. Phase 3.D PR 5 lands the strip alongside the fan-out. (4) Low (split-queue-fifo §10) — Open Questions items 3 ('default PartitionCount?') and 4 ('--sqsDefaultFifoPartitionCount flag?') directly contradict §7.2's 'PartitionCount required, no server-side default' resolution. Both items removed; the slot replaced with an explicit 'these were resolved' parenthetical pointing at §7.2 and the validator. The cross-partition-ordering item (was #5) is renumbered to #3. go test -race ./adapter/... pass; golangci-lint ./... clean.
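The fix shape for the PartitionCount parser, sketched. Error handling is simplified relative to the adapter's real SQS API error path:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parsePartitionCount parses at 64-bit width and bound-checks explicitly so the
// uint32 narrowing below needs no gosec suppression.
func parsePartitionCount(raw string) (uint32, error) {
	v, err := strconv.ParseUint(raw, 10, 64)
	if err != nil || v > math.MaxUint32 {
		return 0, fmt.Errorf("InvalidAttributeValue: PartitionCount %q is not a valid uint32", raw)
	}
	return uint32(v), nil // provably in range
}

func main() {
	fmt.Println(parsePartitionCount("8"))
	fmt.Println(parsePartitionCount("99999999999"))
}
```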
Round 13 fixes.

(1) Medium (sqs_catalog.go PartitionCount narrowing): as pointed out, CLAUDE.md requires refactoring over suppression, so the //nolint:gosec is gone. The value is now parsed at uint64 width with an explicit math.MaxUint32 bounds check before the uint32 narrowing.

(2) Medium (per-queue-throttling §3.1): as pointed out, §3.1 described the SetQueueAttributes cache invalidation as a per-key Delete while the implementation calls invalidateQueue. The paragraph now describes the invalidateQueue semantics and why per-key Delete would reintroduce the orphan-bucket race; the round-12 no-op-write gate is called out as well.

(3) Medium (split-queue-fifo §4.2): as pointed out, X-Elastickv-Receive-Partition was not marked internal-only. A security note now states that ingress strips any client-supplied copy before fan-out; only the trusted proxyToLeader path may set it.

(4) Low (split-queue-fifo §10): as pointed out, §10 item 3 ("default PartitionCount?") and item 4 ("--sqsDefaultFifoPartitionCount flag?") directly contradicted the §7.2 decision that PartitionCount is required with no server-side default. Both items were removed; the slot now carries a parenthetical noting they were resolved in §7.2, with a pointer to the rationale (the hazard of CreateQueue idempotency depending on hidden deployment state). The former item 5 (cross-partition ordering) is renumbered to item 3.
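A possible shape for the ingress strip described in item (3), assuming a plain net/http middleware. The header name comes from the design doc; the middleware wiring and function name are hypothetical until the Phase 3.D fan-out PR lands the real strip.

```go
package sqs

import "net/http"

// receivePartitionHeader is the internal fan-out header named in §4.2.
const receivePartitionHeader = "X-Elastickv-Receive-Partition"

// stripInternalHeaders drops any client-supplied copy of the internal
// partition header before routing, so only the trusted proxyToLeader
// path can set it on the hop to a follower.
func stripInternalHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Header.Del(receivePartitionHeader)
		next.ServeHTTP(w, r)
	})
}
```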
@claude review
Claude finished @bootjp's task in 3m 0s. Review
All four fixes from this push are correctly applied. Two new low-severity findings.

All four trigger fixes: resolved ✓

New findings

[Low] split-queue-fifo §4.2 says the §9 integration-test plan covers a "client-set internal header is ignored" case, but §9 lists no such test.
[Low] The sqs_catalog.go PartitionCount parser comment still references a specific PR and review round, which CLAUDE.md disallows in code comments.
…t (PR #664 round 13.1) Two low-severity findings from the round-13 Claude review: (1) §4.2 said the integration-test plan in §9 'covers a client-set internal header is ignored case', but §9's test list did not actually include one. Dangling pointer that would have given false confidence to the implementation PR author. Added test item 6 (TestSQSServer_HTFIFO_Internal*HeaderStrip): create a 4-partition queue with one message hashed to each partition, send a ReceiveMessage with X-Elastickv-Receive-Partition: 0 from a direct client, assert all 4 messages come back (full fanout ran). Renumbered Jepsen item 6 → 7 and metrics item 7 → 8. (2) sqs_catalog.go PartitionCount parser comment referenced 'Claude medium on PR #664 round 12'. CLAUDE.md is explicit that comments should not reference task / PR numbers — they rot as the codebase evolves. Trimmed to the substantive WHY (uint64 parse + bounds check keeps the narrowing gosec-clean) and dropped the parenthetical. go test -race ./adapter/... pass; golangci-lint ./... clean.
Round 13.1 fixes.

(1) Low (split-queue-fifo §4.2 / §9): as pointed out, the §4.2 security note said "the integration-test plan in §9 covers a 'client-set internal header is ignored' case", but the §9 test list had no such item, leaving a dangling pointer. An implementation-PR author reading §9 for guidance would have gained false confidence. Fix: added new item 6 to §9 (TestSQSServer_HTFIFO_Internal*HeaderStrip: a 4-partition queue with one message hashed to each partition, a direct-client ReceiveMessage carrying X-Elastickv-Receive-Partition: 0, and an assertion that all 4 messages come back, proving the full fan-out ran). Renumbered the Jepsen item 6 → 7 and the metrics item 7 → 8.

(2) Low (sqs_catalog.go comment): as pointed out, CLAUDE.md forbids referencing task / PR numbers in code comments ("they rot as the codebase evolves"). The previous round's comment included "(CLAUDE.md asks us to refactor rather than suppress; Claude medium on PR #664 round 12 flagged the suppression)", so that parenthetical was removed. What remains is only the substantive WHY: "Parse at uint64 width and bound-check explicitly so the uint32 narrowing below is gosec-clean."
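A self-contained sketch of the "client-set internal header is ignored" check in miniature. It does not stand up the 4-partition queue from §9 item 6; it only demonstrates the assertion style against a hypothetical ingress guard, so the test name and handler wiring are assumptions.

```go
package sqs

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestReceivePartitionHeaderStrippedAtIngress checks that a client-supplied
// copy of the internal partition header never reaches the queue handler.
func TestReceivePartitionHeaderStrippedAtIngress(t *testing.T) {
	const header = "X-Elastickv-Receive-Partition"

	var seen string
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = r.Header.Get(header)
	})
	guard := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Header.Del(header) // ingress strip: only proxyToLeader may set this
		inner.ServeHTTP(w, r)
	})

	req := httptest.NewRequest(http.MethodPost, "/", nil)
	req.Header.Set(header, "0") // malicious client tries to pin partition 0
	guard.ServeHTTP(httptest.NewRecorder(), req)

	if seen != "" {
		t.Fatalf("internal header leaked through ingress: %q", seen)
	}
}
```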
@claude review
Claude finished @bootjp's task in 2m 24s. Review
Both Round 13.1 fixes are correctly applied. One new low-severity finding.

Both trigger fixes: resolved ✓

3.D — §9 item 6 ✓
The test spec is concrete enough for an implementation PR author to write — it names the regression scenario (a naive implementation would return at most 1 message), describes both the security path and the non-regression companion. The §4.2 cross-reference at line 240 now has a real test to point to.
// Parse at uint64 width and bound-check explicitly so the
// uint32 narrowing below is gosec-clean.

No PR number, no task reference. The substantive WHY (why the 64-bit parse + bounds check) is preserved; the parenthetical that attributed it to a specific review round is gone. Clean per CLAUDE.md.

New finding

[Low] ~25+ other comments across the SQS Go files still reference review attributions ("Codex P1 on PR #679", "CodeRabbit Major on PR #664 round 11", and similar), contrary to CLAUDE.md.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d6e2143ff
const capSecs = float64(throttleRetryAfterCap / time.Second)
if secs > capSecs {
	secs = capSecs
}
Return uncapped Retry-After for slow refill buckets
computeRetryAfter hard-caps waits at 1 hour, which underestimates recovery time for valid low-refill configs and larger token costs. For example, with capacity=10, refill=0.001, and a 10-entry batch charge, the true wait is 10,000s but the response says 3,600s, so compliant clients will retry too early and get throttled again repeatedly. This also defeats the comment’s eviction rationale because each retry updates lastRefill, so the bucket does not idle-evict. Please compute Retry-After from the actual refill math (or constrain config so the cap is never below the true wait).
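For comparison, a sketch of the uncapped computation the finding suggests: derive the wait from the token deficit and the refill rate instead of clamping to a fixed ceiling. Field and parameter names here are illustrative, not the adapter's.

```go
package throttle

import (
	"math"
	"time"
)

// retryAfterFromRefill returns the time until the bucket has refilled
// enough to cover cost, given the current token balance and refill rate.
func retryAfterFromRefill(tokens, cost, refillPerSec float64) time.Duration {
	deficit := cost - tokens
	if deficit <= 0 {
		return 0 // enough tokens already; no wait needed
	}
	if refillPerSec <= 0 {
		return time.Duration(math.MaxInt64) // never refills; effectively forever
	}
	secs := deficit / refillPerSec
	return time.Duration(secs * float64(time.Second))
}

// Example from the finding: an empty bucket with refill 0.001 tokens/s and
// a 10-entry batch charge gives 10 / 0.001 = 10,000s, far above the 1h cap
// the current code returns.
```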
…md (PR #664 round 14) Claude low on round 13.1: ~25+ comments across the SQS Go files reference review attributions ('Codex P1 on PR #679', 'CodeRabbit Major on PR #664 round 11', etc.). CLAUDE.md is explicit: 'Don't reference the current task, fix, or callers ... since those belong in the PR description and rot as the codebase evolves.' Sweep across:

- adapter/sqs_throttle.go
- adapter/sqs_catalog.go
- adapter/sqs_partitioning.go
- adapter/sqs_messages.go
- adapter/sqs_throttle_test.go
- adapter/sqs_partitioning_test.go
- adapter/sqs_throttle_integration_test.go
- adapter/sqs_partitioning_integration_test.go
- adapter/sqs_catalog_test.go
- adapter/sqs_query_protocol.go
- adapter/sqs_query_protocol_test.go

Each comment now retains the substantive WHY (why the design choice was made) and drops the parenthetical attribution to a specific reviewer / PR number / round. Design docs (.md files) are exempt — those are archival records where review attribution is acceptable practice. go test -race ./adapter/... pass; golangci-lint ./... clean. No behaviour change.
Round 14 fix.

As pointed out, CLAUDE.md explicitly forbids referencing task / PR numbers / review attributions in code comments ("they rot as the codebase evolves"), and ~25+ comments such as "Codex P1 on PR #679" and "CodeRabbit Major on PR #664 round 11" were scattered across the SQS Go files.

Fix: swept across the 11 files listed in the commit message. Each comment keeps the substantive WHY (why that design decision was made); only the parenthetical attribution to a reviewer / PR number / round was dropped. Design docs (.md files) are exempt, as archival records where review attribution is acceptable practice.
@claude review
Claude finished @bootjp's task in 4m 19s. Review
All eleven files swept in Round 14 are correctly updated. No PR numbers or round references remain in any Go file. One small residue finding noted below.

Round 14 trigger fix: resolved ✓

PR-number and round attribution sweep (low) ✓
The substantive WHY is preserved in every case. The technical reasoning travels in the surrounding sentences, not in the attribution parenthetical. Representative example:

-// state via a no-op SetQueueAttributes (Codex P1 on PR #679).
+// state via a no-op SetQueueAttributes.

The preceding sentence explains the exact failure mode; the attribution was redundant. Design docs correctly exempted ✓

New finding

[Low] Four comments in sqs_query_protocol.go and sqs_query_protocol_test.go still attribute reasoning to reviewer tools:
| File | Line | Text |
|---|---|---|
| sqs_query_protocol.go | 232 | SQS query reference is explicit on this and CodexP1 / Gemini |
| sqs_query_protocol.go | 284 | make tests stable. (CodexP2 + Gemini high.) |
| sqs_query_protocol_test.go | 90 | SQS query reference. CodexP1 + Gemini both flagged the prior |
| sqs_query_protocol_test.go | 117 | (lower index wins). CodexP2 flagged the previous map iteration as |
These don't have PR numbers, but they still attribute to external reviewers — which "rot as the codebase evolves" for the same reasons PR numbers do. They fall under the spirit of the CLAUDE.md prohibition even without a PR number. The technical reasoning in each case stands without the attribution.
[Cosmetic] adapter/sqs_throttle.go:224-225 — sentence split at "a" is awkward post-sweep
Before Round 14:
// path that is supposed to enforce limits (CodeRabbit Major on
// PR #664 round 11). Returning a Throttling response with a
// non-zero retryAfter gives the client a normal back-off cue
After Round 14:
// path that is supposed to enforce limits. Returning a Throttling
// response with a
// non-zero retryAfter gives the client a normal back-off cue
The sentence is grammatically correct, but the line break at "response with a" is awkward: "a" dangles at end-of-line before "non-zero" on the next. Functionally harmless; worth reflowing in the same follow-up pass.
Summary
Round 14 is clean — all PR-number and PR/round attribution references are removed from the Go implementation files. The four remaining reviewer-tool-name references in sqs_query_protocol.go and sqs_query_protocol_test.go are the only missed items; none block merging. Recommend a minimal follow-up pass to clear those four lines (a one-commit change), then this PR is ready.
Summary
Docs-only PR. Two design proposals for the remaining Phase 3 SQS items per docs/design/2026_04_24_proposed_sqs_compatible_adapter.md, Section 14 (Phase 3 bullets). Both items were explicitly called out as needing separate design docs before any implementation work; this PR lands those proposals so the implementation PRs have a reviewed architecture to build on.

3.C — Per-queue throttling and tenant fairness (proposal)
Per-queue token-bucket throttling configured on queue meta (no separate keyspace), evaluated at the SQS adapter layer on the leader (no Raft per request), surfaced as the AWS Throttling error envelope so SDK retry/backoff just works.

Key decisions (a minimal bucket sketch follows this list):

- Throttle attributes are configured on queue meta via SetQueueAttributes.
- Charging is per verb and per batch entry, e.g. one token per SendMessage or per batch entry.
- Standard principals get InvalidAttributeName on the Throttle* attributes (matches AWS behaviour for unknown attributes); the data-plane enforcement runs for everyone.
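A minimal token-bucket sketch matching the charging model above. Field names, the Allow signature, and the lazy-refill-on-charge shape are assumptions for illustration, not the proposal's concrete API.

```go
package throttle

import (
	"math"
	"sync"
	"time"
)

// bucket is a per-queue token bucket of the shape the proposal describes:
// refilled lazily on each charge and kept on the leader only, never
// persisted through Raft.
type bucket struct {
	mu         sync.Mutex
	capacity   float64
	refillPerS float64
	tokens     float64
	lastRefill time.Time
}

// Allow charges cost tokens (one per verb, or one per batch entry) and
// reports whether the request may proceed.
func (b *bucket) Allow(now time.Time, cost float64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	elapsed := now.Sub(b.lastRefill).Seconds()
	b.tokens = math.Min(b.capacity, b.tokens+elapsed*b.refillPerS)
	b.lastRefill = now

	if b.tokens < cost {
		return false // caller answers with the AWS Throttling envelope
	}
	b.tokens -= cost
	return true
}
```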
3.D — Split-queue FIFO (high-throughput FIFO) (proposal)

Per-MessageGroupId hash partitioning across multiple Raft groups, mirroring AWS High Throughput FIFO. Within-group ordering preserved; across-group throughput scales with the partition count.

Key decisions (a routing sketch follows this list):

- Existing single-partition FIFO queues keep working unchanged (the PartitionCount = 0 path is the legacy layout; no migration runs implicitly).
- Partition selection is hash & (N-1), so routing stays cheap and future offline rebuilds stay tractable.
- A hot MessageGroupId pinning all traffic to one partition is documented and accepted (the feature is for cooperative operators).
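A minimal routing sketch under stated assumptions: the hash & (N-1) mask is from the proposal, while the FNV-1a hash choice and the function name are placeholders rather than what the design doc specifies.

```go
package fifo

import "hash/fnv"

// partitionFor sketches the routing rule: hash the MessageGroupId and mask
// with partitionCount-1 (partition counts are powers of two, so the mask is
// a cheap modulo). PartitionCount <= 1 falls back to the legacy layout.
func partitionFor(messageGroupID string, partitionCount uint32) uint32 {
	if partitionCount <= 1 {
		return 0 // legacy single-partition path (PartitionCount = 0)
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(messageGroupID))
	return h.Sum32() & (partitionCount - 1)
}
```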
Test plan

Docs only. (The _proposed_ filename has a 1-character mismatch that is fine to leave for now.)

Self-review
This is a docs-only PR; the 5-lens self-review collapses to:
Stacking
This PR is independent of #650, #659, and #662. Branched from current main. Merge whenever ready — landing the proposal docs early lets reviewers comment on the architecture before the implementation PRs go up.

Summary by CodeRabbit