Commit 166e079
sqs(catalog,capability): idempotent gate placement + sanitized public message (PR #734, round 1)
Two PR #734 review findings addressed in one commit because they
touch the same code path:
1. Codex P1 — Check existing queue before enforcing HTFIFO peer gate
The previous placement of validateHTFIFOCapability inside
createQueueCore (before createQueueWithRetry) ran the gate on
EVERY CreateQueue call with PartitionCount > 1, including
idempotent retries on an already-existing queue with identical
attributes. A transient peer poll failure (timeout / unreachable
/ malformed health) during such a retry would then return
InvalidAttributeValue instead of the AWS-correct 200 OK, breaking
create-or-get clients during partial outages or rolling upgrades.
Fix: move the gate INTO tryCreateQueueOnce after the existence
check and BEFORE the OCC dispatch. The order in tryCreateQueueOnce
is now:
1. loadQueueMetaAt — check if queue exists at the snapshot
2. exists + matching attrs → return (true, nil) idempotent OK
3. exists + different attrs → return QueueNameExists
4. validateHTFIFOCapability — runs ONLY on the genuine create
path
5. loadQueueGenerationAt + dispatch the create
The gate may run more than once across OCC retries (each retry
that gets to "queue still missing" re-polls), but every retry
that hits an existing-queue match short-circuits before the gate
runs — so idempotent CreateQueue under a partial cluster outage
stays AWS-correct.
Caller audit: validateHTFIFOCapability has exactly one production
caller (now tryCreateQueueOnce); both the JSON handler
(createQueue → createQueueCore → createQueueWithRetry →
tryCreateQueueOnce) and the query-protocol handler
(sqs_query_protocol.go:182 → createQueueCore → …) reach it
through that one path. Move is symmetric — no caller observes a
semantic change for a queue that genuinely needs to be created;
only the existing-queue path stops paying the gate cost.
2. CodeRabbit major — Don't send raw peer poll details back to caller
buildHTFIFOCapabilityRejection's output (peer addresses + raw
poller error text) was returned to the wire layer as the
InvalidAttributeValue message, leaking cluster topology to any
authenticated CreateQueue caller. This conflicts with the
stricter error-redaction policy already used elsewhere in
sqs_catalog.go.
Fix: the wire-level rejection is now the sanitized constant
htfifoCapabilityRejectionPublic ("PartitionCount > 1 requires
every cluster peer to advertise the htfifo capability via
/sqs_health; one or more peers did not — see server logs for
details"). The full per-peer detail goes to slog.Warn with
structured fields (queueName, partitionCount, peerCount, detail)
so an operator triaging the rolling upgrade can read the failing
peer addresses + reasons from the server logs without rerunning
the poll out-of-band.
Renamed buildHTFIFOCapabilityRejection →
formatHTFIFOCapabilityReportForLog to make its server-side-only
contract obvious at the call site.
Test changes:
- New TestSQSServer_HTFIFO_CapabilityGate_IsIdempotentOnExistingQueue
(wire-level): creates a partitioned queue on a single-node
cluster (gate vacuously passes), poisons leaderSQS with an
unreachable address, then re-creates the same queue with
identical attrs and expects 200; finally creates a NEW queue
with the poisoned peer map and expects the 400 (proves the
gate is still in effect for genuine creates).
- New TestValidateHTFIFOCapability_PublicMessageDoesNotLeakPeerDetails:
pins the sanitization contract — the wire-level message must
equal htfifoCapabilityRejectionPublic exactly, never contain
a peer host:port.
- Updated TestValidateHTFIFOCapability_RejectsWhenOnePeerLacksCapability
and TestValidateHTFIFOCapability_RejectsWhenPeerUnreachable
to assert the sanitized constant + NotContains on the peer
address.
- Renamed TestBuildHTFIFOCapabilityRejection_ShapesOperatorMessage
→ TestFormatHTFIFOCapabilityReportForLog_ShapesServerSideDetail
to match the renamed helper; it now asserts only that the helper
output is server-side detail (no client wire-surface assertion here).
Below threshold (intentionally not addressed in this round):
- Gemini medium on collectSQSPeers concurrency: leaderSQS is
only mutated at SQSServer construction (WithSQSLeaderMap), not
at request time. Gemini's own comment acknowledges this.
- Gemini medium on caching the capability status: CreateQueue is
a rare control-plane operation; caching adds a stale-window
failure mode (a cluster that already lost a peer would still
accept a partitioned queue while the cache is warm). Pure
performance suggestion, not correctness.

1 parent df11d2b commit 166e079
4 files changed
Lines changed: 180 additions & 43 deletions