You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
docs(design): address Claude round-5 review on logical backup proposal
Important (correctness gap, verified against main.go):
- BeginBackup pin was node-local. Each node owns its own
ActiveTimestampTracker (main.go:294) and only its compactor
consults it (main.go:335). A pin on the receiving node would not
protect group leaders on other nodes; their compactors would be
free to retire MVCC versions at read_ts mid-dump.
Fix: pins propagate through each Raft group's log as
BackupPin / BackupExtend / BackupRelease FSM commands. Every replica
applies them to its local tracker, so the pin set is replicated and
durable across leader changes (a new leader inherits the pin from
the same log entry it applied). BeginBackup fans the propose out
per group; if any group fails to commit within
--begin-backup-deadline, BeginBackup proposes BackupRelease on
groups that did commit and returns Unavailable so retry is clean.
RenewBackup / EndBackup fan out the same way; releases are
idempotent so partial state is tolerated.
Minor:
- Specified _id_remap.jsonl path: redis/db_<n>/streams/<key>._id_remap.jsonl
(next to source) so tools can find it on disk.
- TestRedisStreamSkipExistingHandlesERR description: "the dump"
-> "the restore" (it's a restore-side test).
- Added KEYMAP.jsonl to every per-adapter scope in the layout tree;
omitted when empty.
Tests added: TestBeginBackupPinFanOutAllNodes,
TestBeginBackupPinSurvivesLeaderChange,
TestBeginBackupGroupUnreachable.
@@ -754,6 +758,25 @@ The sweeper logs a structured warning (`backup_pin_expired`) when it
754
758
drops a stuck registration so operators see crashed-producer cases in
755
759
their existing log pipeline.
756
760
761
+
**Cluster-wide propagation.** Each elastickv node owns its own
762
+
`ActiveTimestampTracker` (`main.go:294`) and its compactor only
763
+
consults the local instance (`main.go:335`). For a multi-node
764
+
deployment, a pin recorded on the node receiving the admin RPC is not
765
+
sufficient — the producer's `BackupScanner` reads from group leaders
766
+
that may live on different nodes whose compactors are oblivious to the
767
+
local pin. Pins are therefore **propagated through each Raft group's
768
+
log** as `BackupPin{pin_id, read_ts, deadline}` / `BackupExtend` /
769
+
`BackupRelease` FSM commands. Every replica applies these to its
770
+
local tracker on log apply, so the pin set is replicated and durable
771
+
across leader changes (a newly-elected leader inherits the pin from
772
+
the same log it just applied). Compaction on each replica continues to
773
+
consult only the local tracker; the only new ingredient is the FSM
774
+
plumbing that keeps every replica's tracker in sync. The backup
775
+
admin server (the node receiving `BeginBackup`) is responsible for
776
+
fan-out: it submits the per-group entry through the existing
777
+
`ShardGroup.Propose` path and waits for commit before returning to the
778
+
producer.
779
+
757
780
**Bound on concurrent active backup pins.** To prevent a misbehaving
758
781
or malicious caller from issuing unbounded `BeginBackup` requests and
759
782
holding the compactor open across the whole MVCC retention horizon,
@@ -794,22 +817,51 @@ dumps with retention pressure.
794
817
reach the threshold within the deadline, `BeginBackup` returns
795
818
`FailedPrecondition` and the producer aborts — the dump is
796
819
not started until every group can serve `read_ts` consistently.
797
-
3. **Pin `read_ts`** with `PinWithDeadline(read_ts, now+ttl_ms)`; return
798
-
the resulting `pin_token` to the producer.
820
+
3. **Pin `read_ts` cluster-wide**, not just on the node that received
821
+
the RPC. `kv/active_timestamp_tracker.go` is per-process
822
+
(`main.go:294` constructs one tracker per node and wires it
823
+
exclusively to that node's compactor at `main.go:335`). A pin
824
+
recorded on the receiving node would gate only its own compactor;
825
+
compactors on other nodes — including any group leader serving the
826
+
producer's `BackupScanner` — would be free to retire MVCC versions
827
+
at `read_ts`. The pin therefore fans out to **every replica that
828
+
could compact**, propagated via the Raft log of each group:
829
+
`BeginBackup` proposes a `BackupPin{pin_id, read_ts, deadline}`
830
+
entry through every group involved in the dump; each replica's
831
+
FSM, on apply, calls `PinWithDeadline(read_ts, deadline)` on its
832
+
local tracker. The pin is therefore replicated and durable across
833
+
leader changes — a new leader applies the same entry from the log
834
+
and inherits the pin. `BeginBackup` returns the resulting
835
+
`pin_token = (pin_id, []group_id)` to the producer only after every
836
+
group has committed the entry; if any group's leader is unreachable
837
+
or fails to commit within `--begin-backup-deadline`, `BeginBackup`
838
+
proposes a matching `BackupRelease{pin_id}` to every group that
839
+
*did* commit and returns `Unavailable` so the operator retries
840
+
cleanly without leaving stranded pins.
799
841
4. **Producer scans** all configured adapter scopes via
800
842
`BackupScanner.Next(at_ts=read_ts)`.
801
843
5. **Renew on long dumps**: the producer calls
802
-
`RenewBackup(pin_token, ttl_ms)` every `ttl_ms / 3` to extend the
803
-
deadline. The `read_ts` is preserved across renewals; only the
804
-
deadline shifts. A multi-hour dump never relies on a single
805
-
30-minute pin. Renewals are cheap (in-memory map update), and the
844
+
`RenewBackup(pin_token, ttl_ms)` every `ttl_ms / 3`. The admin
845
+
server proposes `BackupExtend{pin_id, deadline}` on every group
846
+
recorded in `pin_token`. The `read_ts` is preserved across
847
+
renewals; only the deadline shifts. A multi-hour dump never relies
848
+
on a single 30-minute pin. Renewals are cheap (one Raft entry per
849
+
group per renewal — at `ttl_ms/3 = 10 min`, that is 6 entries per
850
+
hour per group, negligible alongside production traffic). The
806
851
producer's renewal goroutine logs a critical alert and aborts the
807
852
dump if a renewal call fails — letting the dump continue past the
808
853
TTL would silently produce a corrupted artifact (the compactor
809
854
would have already retired versions the in-flight scan still
810
-
depends on).
811
-
6. **`EndBackup(pin_token)`** releases the tracker entry. A producer
812
-
crash before EndBackup leaves the entry to be reaped by the sweeper.
855
+
depends on). If renewal succeeds on some groups but fails on
856
+
others, the producer aborts and issues `EndBackup` (which itself
857
+
tolerates partial state — see step 6).
858
+
6. **`EndBackup(pin_token)`** proposes `BackupRelease{pin_id}` on
859
+
every group recorded in `pin_token`. The release is idempotent: a
860
+
group that has already swept the pin via deadline expiry treats
861
+
the release as a no-op. A producer crash before EndBackup leaves
862
+
the pin to be reaped by each replica's sweeper goroutine
863
+
independently — the per-replica deadline ensures liveness even if
864
+
no group ever sees a `BackupRelease`.
813
865
814
866
The dump is therefore a **point-in-time snapshot** of the user-visible
815
867
keyspace, not a streaming tail.
@@ -1005,9 +1057,11 @@ parser, the format has failed its goal.
1005
1057
sound way to splice mid-stream without losing the original IDs.
1006
1058
Operators who need merge-into-non-empty pass
1007
1059
`--stream-merge-strategy=auto-id`, which falls back to `XADD *`
1008
-
and records the original-to-new ID mapping in a
1009
-
`_id_remap.jsonl` log so referential integrity can be patched
1010
-
out-of-band.
1060
+
and records the original-to-new ID mapping next to the source
1061
+
file at `redis/db_<n>/streams/<key>._id_remap.jsonl` (one
1062
+
`{"original_id":"…","new_id":"…"}` line per entry) so referential
1063
+
integrity can be patched out-of-band by tools that find the dump
1064
+
on disk.
1011
1065
1012
1066
### Risks and Mitigations
1013
1067
@@ -1093,6 +1147,9 @@ Scope: out of this proposal; mentioned only to draw the boundary.
1093
1147
|`TestSQSDumpFifoOrderPreserved`| Messages with interleaved `MessageGroupId` are emitted in `(send_ts, sequence_number, message_id)` order; visibility-state fields zeroed by default |
1094
1148
|`TestManifestVersionGate`| Restore with `format_version > current` fails fast with a typed error; same-major-newer-minor allowed; older-major refused with a clear message |
1095
1149
|`TestBeginBackupBlocksCompactor`| Open a `BeginBackup`, force a compaction round, confirm MVCC versions for the registered `read_ts` are retained |
1150
+
|`TestBeginBackupPinFanOutAllNodes`| A 3-node cluster: `BeginBackup` issued to node A; verify nodes B and C have applied the `BackupPin` Raft entry and their compactors retain MVCC versions at `read_ts`. Compactor on B forced to run mid-dump must not retire pinned versions |
1151
+
|`TestBeginBackupPinSurvivesLeaderChange`| After `BeginBackup` on node A, force a leadership change on a group; the new leader still honors the pin (its FSM applied the same entry); subsequent `BackupScanner.Next` calls succeed |
1152
+
|`TestBeginBackupGroupUnreachable`| If one group cannot commit `BackupPin` within `--begin-backup-deadline`, `BeginBackup` returns `Unavailable` and proposes `BackupRelease` on every group that did commit; no stranded pins remain |
1096
1153
|`TestPinWithDeadlineExpiry`|`PinWithDeadline(ts, now+100ms)` is auto-released by the sweeper after the deadline; compactor unblocked; `backup_pin_expired` log emitted |
1097
1154
|`TestBeginBackupWaitsForLaggingShard`| Force shard B's `applied_index` to lag; `BeginBackup` polls until it catches up or times out with `FailedPrecondition`; no scan starts in the timeout case |
1098
1155
|`TestBackupScannerPaging`| A range with > pageSize keys is returned across multiple `ScanAt` pages with no overlap, no gaps; iteration tolerates concurrent writes by completing at the pinned `read_ts`|
@@ -1117,7 +1174,7 @@ Scope: out of this proposal; mentioned only to draw the boundary.
1117
1174
|`TestRedisStreamMergeRejectsNonEmpty`|`--mode merge` on a non-empty target stream errors out; `--stream-merge-strategy=auto-id` falls back to `XADD *` and writes `_id_remap.jsonl`|
1118
1175
|`TestRedisStreamTTLRoundTrip`| A stream with `PEXPIREAT` round-trips through dump and restore: `expire_at_ms` is captured in the `_meta` line and re-applied so the restored stream expires at the original epoch |
1119
1176
|`TestRedisHLLTTLRoundTrip`| A TTL'd HLL key surfaces in `hll_ttl.jsonl` and is re-applied on restore via the same `EXPIREAT` path used for strings; no-TTL HLLs leave `hll_ttl.jsonl` absent |
1120
-
| `TestRedisStreamSkipExistingHandlesERR` | The restore tool's `skip-existing` path tolerates `ERR The ID specified in XADD is equal or smaller` without aborting the dump; entries with non-conflicting IDs are still applied |
1177
+
|`TestRedisStreamSkipExistingHandlesERR`| The restore tool's `skip-existing` path tolerates `ERR The ID specified in XADD is equal or smaller` without aborting the restore; entries with non-conflicting IDs are still applied |
0 commit comments