Skip to content

Commit 2c00d56

Browse files
committed
docs(design): address Claude round-5 review on logical backup proposal
Important (correctness gap, verified against main.go): - BeginBackup pin was node-local. Each node owns its own ActiveTimestampTracker (main.go:294) and only its compactor consults it (main.go:335). A pin on the receiving node would not protect group leaders on other nodes; their compactors would be free to retire MVCC versions at read_ts mid-dump. Fix: pins propagate through each Raft group's log as BackupPin / BackupExtend / BackupRelease FSM commands. Every replica applies them to its local tracker, so the pin set is replicated and durable across leader changes (a new leader inherits the pin from the same log entry it applied). BeginBackup fans the propose out per group; if any group fails to commit within --begin-backup-deadline, BeginBackup proposes BackupRelease on groups that did commit and returns Unavailable so retry is clean. RenewBackup / EndBackup fan out the same way; releases are idempotent so partial state is tolerated. Minor: - Specified _id_remap.jsonl path: redis/db_<n>/streams/<key>._id_remap.jsonl (next to source) so tools can find it on disk. - TestRedisStreamSkipExistingHandlesERR description: "the dump" -> "the restore" (it's a restore-side test). - Added KEYMAP.jsonl to every per-adapter scope in the layout tree; omitted when empty. Tests added: TestBeginBackupPinFanOutAllNodes, TestBeginBackupPinSurvivesLeaderChange, TestBeginBackupGroupUnreachable.
1 parent 797d571 commit 2c00d56

1 file changed

Lines changed: 70 additions & 13 deletions

File tree

docs/design/2026_04_29_proposed_logical_backup.md

Lines changed: 70 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -125,15 +125,18 @@ backup-<utc-timestamp>-<cluster-id>-<commit-ts>/
125125
├── dynamodb/
126126
│ └── <table-name>/
127127
│ ├── _schema.json
128+
│ ├── KEYMAP.jsonl # per-scope (omitted when empty)
128129
│ └── items/
129130
│ └── <pk-segment>/[<sk-segment>.]json
130131
├── s3/
131132
│ └── <bucket-name>/
132133
│ ├── _bucket.json
134+
│ ├── KEYMAP.jsonl # per-scope (omitted when empty)
133135
│ ├── <object-key-path> # original object bytes
134136
│ └── <object-key-path>.elastickv-meta.json # sidecar (reserved suffix)
135137
├── redis/
136138
│ └── db_<n>/
139+
│ ├── KEYMAP.jsonl # per-scope (omitted when empty)
137140
│ ├── strings/<key>.bin
138141
│ ├── strings_ttl.jsonl
139142
│ ├── hashes/<key>.json
@@ -146,6 +149,7 @@ backup-<utc-timestamp>-<cluster-id>-<commit-ts>/
146149
└── sqs/
147150
└── <queue-name>/
148151
├── _queue.json
152+
├── KEYMAP.jsonl # per-scope (omitted when empty)
149153
└── messages.jsonl
150154
```
151155

@@ -754,6 +758,25 @@ The sweeper logs a structured warning (`backup_pin_expired`) when it
754758
drops a stuck registration so operators see crashed-producer cases in
755759
their existing log pipeline.
756760

761+
**Cluster-wide propagation.** Each elastickv node owns its own
762+
`ActiveTimestampTracker` (`main.go:294`) and its compactor only
763+
consults the local instance (`main.go:335`). For a multi-node
764+
deployment, a pin recorded on the node receiving the admin RPC is not
765+
sufficient — the producer's `BackupScanner` reads from group leaders
766+
that may live on different nodes whose compactors are oblivious to the
767+
local pin. Pins are therefore **propagated through each Raft group's
768+
log** as `BackupPin{pin_id, read_ts, deadline}` / `BackupExtend` /
769+
`BackupRelease` FSM commands. Every replica applies these to its
770+
local tracker on log apply, so the pin set is replicated and durable
771+
across leader changes (a newly-elected leader inherits the pin from
772+
the same log it just applied). Compaction on each replica continues to
773+
consult only the local tracker; the only new ingredient is the FSM
774+
plumbing that keeps every replica's tracker in sync. The backup
775+
admin server (the node receiving `BeginBackup`) is responsible for
776+
fan-out: it submits the per-group entry through the existing
777+
`ShardGroup.Propose` path and waits for commit before returning to the
778+
producer.
779+
757780
**Bound on concurrent active backup pins.** To prevent a misbehaving
758781
or malicious caller from issuing unbounded `BeginBackup` requests and
759782
holding the compactor open across the whole MVCC retention horizon,
@@ -794,22 +817,51 @@ dumps with retention pressure.
794817
reach the threshold within the deadline, `BeginBackup` returns
795818
`FailedPrecondition` and the producer aborts — the dump is
796819
not started until every group can serve `read_ts` consistently.
797-
3. **Pin `read_ts`** with `PinWithDeadline(read_ts, now+ttl_ms)`; return
798-
the resulting `pin_token` to the producer.
820+
3. **Pin `read_ts` cluster-wide**, not just on the node that received
821+
the RPC. `kv/active_timestamp_tracker.go` is per-process
822+
(`main.go:294` constructs one tracker per node and wires it
823+
exclusively to that node's compactor at `main.go:335`). A pin
824+
recorded on the receiving node would gate only its own compactor;
825+
compactors on other nodes — including any group leader serving the
826+
producer's `BackupScanner` — would be free to retire MVCC versions
827+
at `read_ts`. The pin therefore fans out to **every replica that
828+
could compact**, propagated via the Raft log of each group:
829+
`BeginBackup` proposes a `BackupPin{pin_id, read_ts, deadline}`
830+
entry through every group involved in the dump; each replica's
831+
FSM, on apply, calls `PinWithDeadline(read_ts, deadline)` on its
832+
local tracker. The pin is therefore replicated and durable across
833+
leader changes — a new leader applies the same entry from the log
834+
and inherits the pin. `BeginBackup` returns the resulting
835+
`pin_token = (pin_id, []group_id)` to the producer only after every
836+
group has committed the entry; if any group's leader is unreachable
837+
or fails to commit within `--begin-backup-deadline`, `BeginBackup`
838+
proposes a matching `BackupRelease{pin_id}` to every group that
839+
*did* commit and returns `Unavailable` so the operator retries
840+
cleanly without leaving stranded pins.
799841
4. **Producer scans** all configured adapter scopes via
800842
`BackupScanner.Next(at_ts=read_ts)`.
801843
5. **Renew on long dumps**: the producer calls
802-
`RenewBackup(pin_token, ttl_ms)` every `ttl_ms / 3` to extend the
803-
deadline. The `read_ts` is preserved across renewals; only the
804-
deadline shifts. A multi-hour dump never relies on a single
805-
30-minute pin. Renewals are cheap (in-memory map update), and the
844+
`RenewBackup(pin_token, ttl_ms)` every `ttl_ms / 3`. The admin
845+
server proposes `BackupExtend{pin_id, deadline}` on every group
846+
recorded in `pin_token`. The `read_ts` is preserved across
847+
renewals; only the deadline shifts. A multi-hour dump never relies
848+
on a single 30-minute pin. Renewals are cheap (one Raft entry per
849+
group per renewal — at `ttl_ms/3 = 10 min`, that is 6 entries per
850+
hour per group, negligible alongside production traffic). The
806851
producer's renewal goroutine logs a critical alert and aborts the
807852
dump if a renewal call fails — letting the dump continue past the
808853
TTL would silently produce a corrupted artifact (the compactor
809854
would have already retired versions the in-flight scan still
810-
depends on).
811-
6. **`EndBackup(pin_token)`** releases the tracker entry. A producer
812-
crash before EndBackup leaves the entry to be reaped by the sweeper.
855+
depends on). If renewal succeeds on some groups but fails on
856+
others, the producer aborts and issues `EndBackup` (which itself
857+
tolerates partial state — see step 6).
858+
6. **`EndBackup(pin_token)`** proposes `BackupRelease{pin_id}` on
859+
every group recorded in `pin_token`. The release is idempotent: a
860+
group that has already swept the pin via deadline expiry treats
861+
the release as a no-op. A producer crash before EndBackup leaves
862+
the pin to be reaped by each replica's sweeper goroutine
863+
independently — the per-replica deadline ensures liveness even if
864+
no group ever sees a `BackupRelease`.
813865
814866
The dump is therefore a **point-in-time snapshot** of the user-visible
815867
keyspace, not a streaming tail.
@@ -1005,9 +1057,11 @@ parser, the format has failed its goal.
10051057
sound way to splice mid-stream without losing the original IDs.
10061058
Operators who need merge-into-non-empty pass
10071059
`--stream-merge-strategy=auto-id`, which falls back to `XADD *`
1008-
and records the original-to-new ID mapping in a
1009-
`_id_remap.jsonl` log so referential integrity can be patched
1010-
out-of-band.
1060+
and records the original-to-new ID mapping next to the source
1061+
file at `redis/db_<n>/streams/<key>._id_remap.jsonl` (one
1062+
`{"original_id":"…","new_id":"…"}` line per entry) so referential
1063+
integrity can be patched out-of-band by tools that find the dump
1064+
on disk.
10111065

10121066
### Risks and Mitigations
10131067

@@ -1093,6 +1147,9 @@ Scope: out of this proposal; mentioned only to draw the boundary.
10931147
| `TestSQSDumpFifoOrderPreserved` | Messages with interleaved `MessageGroupId` are emitted in `(send_ts, sequence_number, message_id)` order; visibility-state fields zeroed by default |
10941148
| `TestManifestVersionGate` | Restore with `format_version > current` fails fast with a typed error; same-major-newer-minor allowed; older-major refused with a clear message |
10951149
| `TestBeginBackupBlocksCompactor` | Open a `BeginBackup`, force a compaction round, confirm MVCC versions for the registered `read_ts` are retained |
1150+
| `TestBeginBackupPinFanOutAllNodes` | A 3-node cluster: `BeginBackup` issued to node A; verify nodes B and C have applied the `BackupPin` Raft entry and their compactors retain MVCC versions at `read_ts`. Compactor on B forced to run mid-dump must not retire pinned versions |
1151+
| `TestBeginBackupPinSurvivesLeaderChange` | After `BeginBackup` on node A, force a leadership change on a group; the new leader still honors the pin (its FSM applied the same entry); subsequent `BackupScanner.Next` calls succeed |
1152+
| `TestBeginBackupGroupUnreachable` | If one group cannot commit `BackupPin` within `--begin-backup-deadline`, `BeginBackup` returns `Unavailable` and proposes `BackupRelease` on every group that did commit; no stranded pins remain |
10961153
| `TestPinWithDeadlineExpiry` | `PinWithDeadline(ts, now+100ms)` is auto-released by the sweeper after the deadline; compactor unblocked; `backup_pin_expired` log emitted |
10971154
| `TestBeginBackupWaitsForLaggingShard` | Force shard B's `applied_index` to lag; `BeginBackup` polls until it catches up or times out with `FailedPrecondition`; no scan starts in the timeout case |
10981155
| `TestBackupScannerPaging` | A range with > pageSize keys is returned across multiple `ScanAt` pages with no overlap, no gaps; iteration tolerates concurrent writes by completing at the pinned `read_ts` |
@@ -1117,7 +1174,7 @@ Scope: out of this proposal; mentioned only to draw the boundary.
11171174
| `TestRedisStreamMergeRejectsNonEmpty` | `--mode merge` on a non-empty target stream errors out; `--stream-merge-strategy=auto-id` falls back to `XADD *` and writes `_id_remap.jsonl` |
11181175
| `TestRedisStreamTTLRoundTrip` | A stream with `PEXPIREAT` round-trips through dump and restore: `expire_at_ms` is captured in the `_meta` line and re-applied so the restored stream expires at the original epoch |
11191176
| `TestRedisHLLTTLRoundTrip` | A TTL'd HLL key surfaces in `hll_ttl.jsonl` and is re-applied on restore via the same `EXPIREAT` path used for strings; no-TTL HLLs leave `hll_ttl.jsonl` absent |
1120-
| `TestRedisStreamSkipExistingHandlesERR` | The restore tool's `skip-existing` path tolerates `ERR The ID specified in XADD is equal or smaller` without aborting the dump; entries with non-conflicting IDs are still applied |
1177+
| `TestRedisStreamSkipExistingHandlesERR` | The restore tool's `skip-existing` path tolerates `ERR The ID specified in XADD is equal or smaller` without aborting the restore; entries with non-conflicting IDs are still applied |
11211178

11221179
### P2
11231180

0 commit comments

Comments
 (0)