From 1d7c859535b78067cf94dde65f40eadb2f534810 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Tue, 5 May 2026 17:42:40 +0900
Subject: [PATCH] docs(sqs): promote split-queue-fifo from proposed to partial (Phase 3.D PR 8)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

§11 of docs/design/.../sqs_split_queue_fifo.md (the rollout plan) has
shipped through PR 7b: schema (#681), keyspace (#703), routing layer
(#704, #708, #715, #721, #723), data-plane fanout (#724, #731, #732,
#734), reaper (#735, #736), Jepsen workload + metrics (#737, #738).
PR 8 — the doc-lifecycle bump itself — is this PR.

Per docs/design/README.md the lifecycle marker is the filename token,
promoted by `git mv`:

- proposed: no matching code on main, OR doc declares itself a
  proposal.
- partial: some components exist but the full scope is not yet merged.
- implemented: concrete Go code matches the design's central
  subsystem.

Choosing "partial" rather than "implemented": every milestone that
produces shippable code in the rollout plan has landed, but §10 (open
questions) and §12 (alternatives considered) explicitly note follow-on
work that would extend the same surface — operator-configurable hash,
online resharding, cross-partition transactional admin. Each is out of
scope for this proposal but each would be an extension of the design,
warranting "partial" until that follow-on tracks to a separate
proposal or is explicitly closed.

Changes:

- git mv 2026_04_26_proposed_sqs_split_queue_fifo.md →
  2026_04_26_partial_sqs_split_queue_fifo.md (preserves history via
  similarity-based rename detection).
- Update the Status header from Proposed to Partial.
- Annotate §11 rollout table with shipped status anchored to merge PR
  numbers, plus a "Status as of 2026-05-04" header paragraph that
  explains the partial-vs-implemented choice.
- Update in-tree source-comment cross-references from the proposed_
  filename to the partial_ filename across:
    main_sqs_leadership_refusal.go
    shard_config.go
    adapter/sqs_keys.go
    adapter/sqs_partitioning.go
    adapter/sqs_catalog.go
  These are doc-only comments; no behaviour change.

Self-review (5 lenses, abbreviated since this is doc-only):

1. Data loss — N/A.
2. Concurrency — N/A.
3. Performance — N/A.
4. Data consistency — N/A.
5. Test coverage — no tests added; doc-and-comment change only.

Refs: docs/design/2026_04_26_partial_sqs_split_queue_fifo.md §11.
---
 adapter/sqs_catalog.go                        |  4 ++--
 adapter/sqs_keys.go                           |  2 +-
 adapter/sqs_partitioning.go                   |  2 +-
 ...026_04_26_partial_sqs_split_queue_fifo.md} | 24 ++++++++++---------
 main_sqs_leadership_refusal.go                |  2 +-
 shard_config.go                               |  2 +-
 6 files changed, 19 insertions(+), 17 deletions(-)
 rename docs/design/{2026_04_26_proposed_sqs_split_queue_fifo.md => 2026_04_26_partial_sqs_split_queue_fifo.md} (97%)

diff --git a/adapter/sqs_catalog.go b/adapter/sqs_catalog.go
index dbca9927..14b87c12 100644
--- a/adapter/sqs_catalog.go
+++ b/adapter/sqs_catalog.go
@@ -118,7 +118,7 @@ type sqsQueueMeta struct {
 	// along with the rest of the queue.
 	Throttle *sqsQueueThrottle `json:"throttle,omitempty"`
 	// PartitionCount is the number of FIFO partitions for this queue
-	// (Phase 3.D HT-FIFO, see docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md).
+	// (Phase 3.D HT-FIFO, see docs/design/2026_04_26_partial_sqs_split_queue_fifo.md).
 	// Zero or 1 means the legacy single-partition layout — no schema
 	// change. Greater than 1 enables HT-FIFO. Set at CreateQueue time
 	// and immutable thereafter (SetQueueAttributes rejects any change).
@@ -478,7 +478,7 @@ var sqsAttributeAppliers = map[string]attributeApplier{
 		return nil
 	},
 	// PartitionCount enables HT-FIFO when > 1 (Phase 3.D, see
-	// docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md). Set
+	// docs/design/2026_04_26_partial_sqs_split_queue_fifo.md). Set
 	// at CreateQueue time; SetQueueAttributes attempts to change it
 	// reject via the immutability check in trySetQueueAttributesOnce.
 	// PartitionCount > 1 is gated by validateHTFIFOCapability (the
diff --git a/adapter/sqs_keys.go b/adapter/sqs_keys.go
index 8614fe66..53f40919 100644
--- a/adapter/sqs_keys.go
+++ b/adapter/sqs_keys.go
@@ -50,7 +50,7 @@ const (
 )
 
 // HT-FIFO partitioned-keyspace discriminator. Per the §3.1 design in
-// docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md, partitioned
+// docs/design/2026_04_26_partial_sqs_split_queue_fifo.md, partitioned
 // FIFO queues live in a separate keyspace so the legacy single-
 // partition layout can stay byte-identical on disk:
 //
diff --git a/adapter/sqs_partitioning.go b/adapter/sqs_partitioning.go
index bce51e21..63c0fc76 100644
--- a/adapter/sqs_partitioning.go
+++ b/adapter/sqs_partitioning.go
@@ -7,7 +7,7 @@ import (
 
 // HT-FIFO (Phase 3.D split-queue FIFO) configuration vocabulary and
 // the routing primitive partitionFor. See the design doc at
-// docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md.
+// docs/design/2026_04_26_partial_sqs_split_queue_fifo.md.
 //
 // PR 2 of the §11 rollout introduces the schema fields plus the
 // validation surface — including the temporary dormancy gate that
diff --git a/docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md b/docs/design/2026_04_26_partial_sqs_split_queue_fifo.md
similarity index 97%
rename from docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md
rename to docs/design/2026_04_26_partial_sqs_split_queue_fifo.md
index e610e798..5eec04d1 100644
--- a/docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md
+++ b/docs/design/2026_04_26_partial_sqs_split_queue_fifo.md
@@ -1,6 +1,6 @@
 # Split-Queue FIFO for the SQS Adapter
 
-**Status:** Proposed
+**Status:** Partial
 **Author:** bootjp
 **Date:** 2026-04-26
 
@@ -387,16 +387,18 @@ This is out of scope here.
 
 ## 11. Rollout Plan (Multi-PR)
 
-| PR | Content | Reviewable in isolation? |
-|---|---|---|
-| 1 | This proposal doc lands. Operators have time to flag concerns. | Yes |
-| 2 | Schema: `sqsQueueMeta.PartitionCount`, `DeduplicationScope`, `FifoThroughputLimit`. Routing function `partitionFor`. CreateQueue / SetQueueAttributes validation including the §3.2 cross-attribute rules. **Temporary feature gate** (see below): `CreateQueue` rejects `PartitionCount > 1` with `InvalidAttributeValue` ("PartitionCount > 1 requires HT-FIFO data plane — not yet enabled") so the schema field exists in the meta type but cannot land in production data. | Yes (catalog only) |
-| 3 | Keyspace: thread `partitionIndex` through every `sqsMsg*Key` constructor, defaulting to 0 so existing queues stay byte-identical. Gate from PR 2 still in place — `PartitionCount > 1` remains rejected. | Yes (mechanical) |
-| 4 | Routing layer: `kv/shard_router.go` accepts the `(queue, partition)` key. New `--sqsFifoPartitionMap` flag (separate from the existing `--raftSqsMap` endpoint-mapping flag). Mixed-version gate (§8.5 capability advertisement via `/sqs_health` + catalog polling for `CreateQueue` gating, **and** the §8 leadership-refusal hook in `kv/lease_state.go` that calls `TransferLeadership` when a non-`htfifo` binary discovers a partitioned queue in its shard on startup or leadership acquisition — both components are required before the binary is marked `htfifo`-eligible). PR 2's temporary `PartitionCount > 1` rejection still in place. | Yes (operator-config) |
-| 5 | Send / Receive partition fanout. Receipt-handle v2 codec. **Removes the PR 2 `PartitionCount > 1` rejection** in the same commit that wires the data-plane fanout — the gate and its lift land atomically so a half-deployed cluster can never accept a partitioned queue without the data plane to serve it. | Yes (data-plane) |
-| 6 | PurgeQueue / DeleteQueue partition iteration. Tombstone schema update. Reaper update. | Yes (control-plane) |
-| 7 | Jepsen HT-FIFO workload. Metrics. | Yes (testing) |
-| 8 | Partial-doc lifecycle bump: 3.D moves from TODO to Landed. Section 13 from §16.6 of the partial doc gets the as-built record. | Yes (docs) |
+**Status as of 2026-05-04**: PRs 1–7 are merged on `main`. The doc is being moved from `proposed` to `partial` in PR 8 (this rename) because every milestone in the rollout plan that produces shippable code has landed. The "partial" classification rather than "implemented" leaves room for future work tracked in §10 / §12 (e.g. operator-configurable hash, online resharding, cross-partition transactional admin) — none of which are in this proposal's scope but each of which would be an extension to the same surface.
+
+| PR | Content | Reviewable in isolation? | Status |
+|---|---|---|---|
+| 1 | This proposal doc lands. Operators have time to flag concerns. | Yes | ✅ Merged (#664) |
+| 2 | Schema: `sqsQueueMeta.PartitionCount`, `DeduplicationScope`, `FifoThroughputLimit`. Routing function `partitionFor`. CreateQueue / SetQueueAttributes validation including the §3.2 cross-attribute rules. **Temporary feature gate** (see below): `CreateQueue` rejects `PartitionCount > 1` with `InvalidAttributeValue` ("PartitionCount > 1 requires HT-FIFO data plane — not yet enabled") so the schema field exists in the meta type but cannot land in production data. | Yes (catalog only) | ✅ Merged (#681) |
+| 3 | Keyspace: thread `partitionIndex` through every `sqsMsg*Key` constructor, defaulting to 0 so existing queues stay byte-identical. Gate from PR 2 still in place — `PartitionCount > 1` remains rejected. | Yes (mechanical) | ✅ Merged (#703) |
+| 4 | Routing layer: `kv/shard_router.go` accepts the `(queue, partition)` key. New `--sqsFifoPartitionMap` flag (separate from the existing `--raftSqsMap` endpoint-mapping flag). Mixed-version gate (§8.5 capability advertisement via `/sqs_health` + catalog polling for `CreateQueue` gating, **and** the §8 leadership-refusal hook in `kv/lease_state.go` that calls `TransferLeadership` when a non-`htfifo` binary discovers a partitioned queue in its shard on startup or leadership acquisition — both components are required before the binary is marked `htfifo`-eligible). PR 2's temporary `PartitionCount > 1` rejection still in place. | Yes (operator-config) | ✅ Merged across 4-A / 4-B-1 / 4-B-2 / 4-B-3a / 4-B-3b (#704, #708, #715, #721, #723) |
+| 5 | Send / Receive partition fanout. Receipt-handle v2 codec. **Removes the PR 2 `PartitionCount > 1` rejection** in the same commit that wires the data-plane fanout — the gate and its lift land atomically so a half-deployed cluster can never accept a partitioned queue without the data plane to serve it. | Yes (data-plane) | ✅ Merged across 5a / 5b-1 / 5b-2 / 5b-3 (#724, #731, #732, #734) |
+| 6 | PurgeQueue / DeleteQueue partition iteration. Tombstone schema update. Reaper update. | Yes (control-plane) | ✅ Merged across 6a / 6b (#735, #736) |
+| 7 | Jepsen HT-FIFO workload. Metrics. | Yes (testing) | ✅ Merged across 7a / 7b (#737, #738) |
+| 8 | Partial-doc lifecycle bump: rename `proposed` → `partial`, annotate §11 with shipped PR anchors, update in-tree source references that point at the proposed-stage filename. | Yes (docs) | 🟡 In flight (this PR) |
 
 **Why the temporary gate** (Codex P1 on PR #664 tenth-round Codex review): without it, a cluster running PR 2–4 would accept a `CreateQueue` with `PartitionCount = 4` (the schema is in place, the validator only checks per-attribute validity) and then dispatch every subsequent `SendMessage` against the **legacy single-partition keyspace** with `partitionIndex = 0` — silently writing all messages under `!sqs|msg|data||…` regardless of `PartitionCount`. When PR 5 lands and the new fanout reader looks for messages under the partitioned prefix `!sqs|msg|data|p|||…`, every message written during the PR 2–4 window is invisible to it and to the partition-aware reaper scan. The gate-and-lift pattern (PR 2 rejects, PR 5 lifts in the same commit as the data-plane fanout) makes it impossible to land data under the wrong layout: any cluster that accepts `PartitionCount > 1` is, by construction, also running the partition-aware send path.
 
diff --git a/main_sqs_leadership_refusal.go b/main_sqs_leadership_refusal.go
index aa6bfeec..5caa242d 100644
--- a/main_sqs_leadership_refusal.go
+++ b/main_sqs_leadership_refusal.go
@@ -23,7 +23,7 @@ type sqsLeadershipController interface {
 // observer that refuses leadership of any Raft group hosting a
 // partitioned FIFO queue when this binary does NOT advertise the
 // htfifo capability. Implements §8 of
-// docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md.
+// docs/design/2026_04_26_partial_sqs_split_queue_fifo.md.
 //
 // # What it protects against
 //
diff --git a/shard_config.go b/shard_config.go
index ba08231a..d425fc39 100644
--- a/shard_config.go
+++ b/shard_config.go
@@ -41,7 +41,7 @@ var (
 
 // sqsFifoPartitionMaxPartitions caps the per-queue partition count so
 // the partitionFor mask + bucket-store sizing arguments in
-// docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.1 stay
+// docs/design/2026_04_26_partial_sqs_split_queue_fifo.md §3.1 stay
 // honest: 32 partitions × ~1k RPS per shard ≈ 30k aggregate RPS per
 // queue, which matches the design's stated ceiling. Operators who
 // need more should split the workload across queues rather than