Skip to content

Commit bbd1bb8

Browse files
committed
docs(design): address Gemini review on logical backup proposal
HIGH: - TTL renewal made explicit. New RenewBackup RPC; producer renews every ttl_ms/3 and aborts the dump on renewal failure (continuing past TTL would silently corrupt the artifact). backup_max_ttl_ms caps a single window; multi-hour dumps work via repeated renewals. MEDIUM: - DynamoDB scalability. Default stays per-item (the user's stated requirement: PK名.json で1レコード1ファイル). Added opt-in --dynamodb-bundle-mode jsonl for tables where inode count is the binding constraint; structured warning emitted when an unbundled scope exceeds 1M items. - S3 file-vs-directory collisions. Producer detects keys like "path/to" and "path/to/obj" coexisting and renames the shorter to "path/to.elastickv-leaf-data" with KEYMAP.jsonl + manifest record. - KEYMAP scaled by switching to KEYMAP.jsonl (line-streamable). - Redis strings_ttl.json -> strings_ttl.jsonl with one TTL record per line so TTL count is unbounded. - max_active_backup_pins (default 4) caps concurrent BeginBackup registrations; tracker returns ErrTooManyActiveBackups, RPC surfaces ResourceExhausted. Prevents a misbehaving caller from holding the compactor open across the entire MVCC retention horizon. Manifest fields added: s3_collision_strategy, dynamodb_layout, max_active_backup_pins. CLI flags added: --ttl-ms, --scan-page-size, --dynamodb-bundle-mode/--dynamodb-bundle-size, --rename-collisions. Tests added: TestS3PathFileVsDirectoryCollision, TestBeginBackupTooManyActiveBackups, TestRenewBackupExtendsDeadline.
1 parent 5719128 commit bbd1bb8

1 file changed

Lines changed: 143 additions & 21 deletions

File tree

docs/design/2026_04_29_proposed_logical_backup.md

Lines changed: 143 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -135,7 +135,7 @@ backup-<utc-timestamp>-<cluster-id>-<commit-ts>/
135135
├── redis/
136136
│ └── db_<n>/
137137
│ ├── strings/<key>.bin
138-
│ ├── strings_ttl.json
138+
│ ├── strings_ttl.jsonl
139139
│ ├── hashes/<key>.json
140140
│ ├── lists/<key>.json
141141
│ ├── sets/<key>.json
@@ -192,11 +192,37 @@ Rules:
192192
per message (would be millions of small files); they live in a single
193193
`messages.jsonl`. Same rationale for Redis stream entries.
194194

195-
A single `KEYMAP` file at each adapter scope root translates the encoded
196-
filename back to the exact original bytes, in case the user needs to feed
197-
the data into a system that requires the verbatim key. The translation is
198-
also losslessly recoverable from the encoded filename alone — `KEYMAP` is a
199-
convenience, not a correctness dependency.
195+
`KEYMAP.jsonl` at each adapter scope root translates the encoded
196+
filename back to the exact original bytes, in case the user needs to
197+
feed the data into a system that requires the verbatim key. JSONL
198+
(one mapping per line, never loaded fully) is used so the file scales
199+
to millions of keys without becoming a memory bottleneck on either the
200+
producer or a downstream consumer. The translation is also losslessly
201+
recoverable from the encoded filename alone — `KEYMAP.jsonl` is a
202+
convenience, not a correctness dependency. A consumer that does not
203+
need it can ignore it entirely.
204+
205+
### S3 path collisions (file vs. directory)
206+
207+
S3 permits two objects whose keys are `path/to` and `path/to/obj`
208+
simultaneously. POSIX filesystems cannot represent both — `path/to`
209+
cannot be both a regular file and a directory. The dump producer
210+
detects this case at scan time:
211+
212+
- **Pure-leaf case** (the more common case): the key without `/obj`
213+
suffix is the only object → write at the natural path, no collision.
214+
- **Collision case**: both `path/to` and `path/to/obj` exist in the
215+
same bucket at the same generation. The shorter key (the one that
216+
would force a regular file on a path that must also be a directory)
217+
is renamed to `path/to.elastickv-leaf-data`, with the rename
218+
recorded in `s3/<bucket>/KEYMAP.jsonl`. The sidecar follows the
219+
same renamed base: `path/to.elastickv-leaf-data.elastickv-meta.json`.
220+
- The collision-handling rule is documented at dump time in
221+
`MANIFEST.json` (`s3_collision_strategy: "leaf-data-suffix"`) so a
222+
restore tool reverses it without guessing.
223+
224+
The producer emits a structured warning per collision so operators
225+
can spot dataset shapes that benefit from object-key normalization.
200226

201227
## Per-Adapter Format
202228

@@ -274,6 +300,44 @@ to a real DynamoDB SDK. Restoring into AWS DynamoDB is a literal
274300
`PutItem` per file. A failed table can be partially recovered by hand-
275301
editing one item and re-feeding it.
276302

303+
#### One-file-per-item vs. bundle mode
304+
305+
The default — one item per file under `items/<pk>/<sk>.json` — is a
306+
**deliberate, requirement-driven choice**, not an oversight. It is
307+
what makes "recover one row by editing one file and running one
308+
PutItem" trivially true, which is the whole point of a
309+
vendor-independent dump. The cost is filesystem overhead: a 50-million-
310+
item table emits 50 million inodes. On most modern filesystems
311+
(ext4/xfs/zfs/apfs) this is fine for write-once-and-tar dumps but
312+
slow for live filesystem operations.
313+
314+
For tables where the inode count is the binding constraint, the
315+
producer accepts an opt-in `--dynamodb-bundle-mode jsonl[:size=64MiB]`
316+
that emits items as `items/data-<part-id>.jsonl` instead, packed up
317+
to a configurable per-file size budget:
318+
319+
```
320+
dynamodb/orders/
321+
├── _schema.json
322+
└── items/
323+
├── data-00001.jsonl # one item per line, items/data-00001.jsonl ≤ 64 MiB
324+
├── data-00002.jsonl
325+
└── data-00003.jsonl
326+
```
327+
328+
`MANIFEST.json` records the choice (`dynamodb_layout: "per-item" |
329+
"jsonl"`) so a restore tool dispatches the right reader. The bundle
330+
mode is an explicit operational decision; the default stays per-file
331+
because that is the layout the user asked for and the layout that
332+
makes one-line recovery scripts trivial. Operators are also free to
333+
post-process a per-item dump into bundles (`find … | xargs cat`)
334+
without losing any information, so the choice is not load-bearing on
335+
the format itself.
336+
337+
The producer emits a structured warning when an unbundled scope
338+
exceeds 1 million items so operators can decide whether to switch
339+
modes for that table.
340+
277341
GSIs are **not materialized** under the dump because they are derivable
278342
from `_schema.json` plus the base item set. Re-creating the table from
279343
`_schema.json` and replaying the items rebuilds the GSI; this is what AWS
@@ -362,7 +426,7 @@ redis/
362426
├── strings/
363427
│ ├── session%3Aabc123.bin # SET session:abc123 …
364428
│ └── counter%3Aglobal.bin # SET counter:global "42"
365-
├── strings_ttl.json # { "session%3Aabc123": 1735689600000, ... }
429+
├── strings_ttl.jsonl # one line per TTL'd key
366430
├── hashes/
367431
│ └── user%3A1.json
368432
├── lists/
@@ -381,9 +445,17 @@ any client library) can replay them without elastickv.
381445
- `strings/<key>.bin` is the **raw value bytes** — Redis strings are
382446
binary-safe, so JSON wrapping would force base64 and lose the property
383447
that `cat strings/<key>.bin` produces the original payload. TTL is in
384-
the sidecar `strings_ttl.json` keyed by the same encoded filename;
385-
values are absolute Unix-millis expirations to avoid clock skew on
386-
restore.
448+
the sidecar `strings_ttl.jsonl`, one record per line:
449+
```
450+
{"key":"session%3Aabc123","expire_at_ms":1735689600000}
451+
{"key":"cache%3Akv99","expire_at_ms":1735689610500}
452+
```
453+
Values are absolute Unix-millis expirations to avoid clock skew on
454+
restore. JSONL (not a single JSON object) is used so producers and
455+
consumers stream the file line-by-line; a Redis instance with tens
456+
of millions of TTL'd strings would exceed practical memory limits
457+
for a single-object form, and Redis itself has no upper bound on
458+
the number of TTL'd keys.
387459
- `hashes/<key>.json`:
388460
```json
389461
{"format_version": 1, "fields": {"name": "alice", "age": "30"}, "expire_at_ms": null}
@@ -505,7 +577,10 @@ emits an `_internals/` subdirectory of newline-delimited records.
505577
"encoded_filename_charset": "rfc3986-unreserved-plus-percent",
506578
"key_segment_max_bytes": 240,
507579
"backup_ts_ttl_ms": 1800000,
508-
"s3_meta_suffix": ".elastickv-meta.json"
580+
"s3_meta_suffix": ".elastickv-meta.json",
581+
"s3_collision_strategy": "leaf-data-suffix",
582+
"dynamodb_layout": "per-item",
583+
"max_active_backup_pins": 4
509584
}
510585
```
511586

@@ -563,23 +638,31 @@ Per-adapter encoders consume `BackupScanner` and emit one record per
563638
service Admin {
564639
rpc BeginBackup(BeginBackupRequest) returns (BeginBackupResponse) {}
565640
rpc EndBackup(EndBackupRequest) returns (EndBackupResponse) {}
641+
rpc RenewBackup(RenewBackupRequest) returns (RenewBackupResponse) {}
566642
rpc ListAdaptersAndScopes(ListAdaptersAndScopesRequest)
567643
returns (ListAdaptersAndScopesResponse) {}
568644
}
569645

570646
message BeginBackupRequest {
571647
// Time-to-live for the read_ts pin on the active timestamp tracker.
572-
// If EndBackup is not called within this window the pin is auto-released.
573-
// Range: 60s–24h. Default: 30m.
648+
// If neither EndBackup nor RenewBackup is called within this window
649+
// the pin is auto-released. Range: 60s–24h. Default: 30m.
574650
uint64 ttl_ms = 1;
575651
}
576652
message BeginBackupResponse {
577653
uint64 read_ts = 1;
578-
bytes pin_token = 2; // opaque, must be passed back to EndBackup
654+
bytes pin_token = 2; // opaque, must be passed back to RenewBackup / EndBackup
579655
uint64 ttl_ms_effective = 3;
580656
repeated ShardApplied shards = 4; // group_id, applied_index at pin time
581657
}
582658

659+
// RenewBackup extends the deadline for an existing pin. A long-running
660+
// dump calls this every ttl_ms/3 (producer default) so a multi-hour
661+
// scan never relies on a single TTL window. The read_ts is preserved
662+
// across renewals; only the deadline shifts.
663+
message RenewBackupRequest { bytes pin_token = 1; uint64 ttl_ms = 2; }
664+
message RenewBackupResponse { uint64 ttl_ms_effective = 1; }
665+
583666
message EndBackupRequest { bytes pin_token = 1; }
584667
message EndBackupResponse {}
585668

@@ -602,7 +685,8 @@ extends the tracker:
602685

603686
```go
604687
// kv/active_timestamp_tracker.go (extended)
605-
func (t *ActiveTimestampTracker) PinWithDeadline(ts uint64, deadline time.Time) *ActiveTimestampToken
688+
func (t *ActiveTimestampTracker) PinWithDeadline(ts uint64, deadline time.Time) (*ActiveTimestampToken, error)
689+
func (t *ActiveTimestampToken) Extend(newDeadline time.Time) error
606690
```
607691

608692
`PinWithDeadline` records `(id → ts, deadline)`. A single sweeper
@@ -614,6 +698,30 @@ The sweeper logs a structured warning (`backup_pin_expired`) when it
614698
drops a stuck registration so operators see crashed-producer cases in
615699
their existing log pipeline.
616700

701+
**Bound on concurrent active backup pins.** To prevent a misbehaving
702+
or malicious caller from issuing unbounded `BeginBackup` requests and
703+
holding the compactor open across the whole MVCC retention horizon,
704+
the tracker carries a hard cap (default `max_active_backup_pins = 4`,
705+
configurable). When the cap is reached, `PinWithDeadline` returns
706+
`ErrTooManyActiveBackups` and the admin RPC surfaces it as
707+
`ResourceExhausted`. Operators raising this cap should size it
708+
against the GC/compaction headroom — each held pin clamps the
709+
compactor's `Oldest()` timestamp until released. The cap is
710+
intentionally small because backups are an operator action, not
711+
end-user traffic; if four are not enough, something is wrong with
712+
the orchestration layer and adding pins compounds the underlying
713+
problem rather than fixing it.
714+
715+
**TTL ceiling and back-pressure.** `ttl_ms` is bounded above by
716+
`backup_max_ttl_ms` (default 1 h) — a single pin cannot block
717+
compaction beyond that window even if a buggy caller asks for it.
718+
For dumps that legitimately need to run longer (e.g. a 50 TiB
719+
warehouse), the producer renews via `RenewBackup` every `ttl_ms/3`,
720+
so total dump duration is bounded only by overall MVCC retention
721+
budget, not by any single TTL choice. The producer surfaces a
722+
`pin_renewals_total` metric so operators can correlate long-running
723+
dumps with retention pressure.
724+
617725
### BeginBackup → EndBackup flow
618726

619727
1. **Pick `read_ts`**: `BeginBackup` reads the lease-read timestamp
@@ -634,10 +742,16 @@ their existing log pipeline.
634742
the resulting `pin_token` to the producer.
635743
4. **Producer scans** all configured adapter scopes via
636744
`BackupScanner.Next(at_ts=read_ts)`.
637-
5. **Renew on long dumps**: the producer calls `BeginBackup` again with
638-
the same `read_ts` (carrying the existing token) every `ttl_ms / 3`
639-
to extend the deadline. A multi-hour dump never relies on a single
640-
30-minute pin.
745+
5. **Renew on long dumps**: the producer calls
746+
`RenewBackup(pin_token, ttl_ms)` every `ttl_ms / 3` to extend the
747+
deadline. The `read_ts` is preserved across renewals; only the
748+
deadline shifts. A multi-hour dump never relies on a single
749+
30-minute pin. Renewals are cheap (in-memory map update), and the
750+
producer's renewal goroutine logs a critical alert and aborts the
751+
dump if a renewal call fails — letting the dump continue past the
752+
TTL would silently produce a corrupted artifact (the compactor
753+
would have already retired versions the in-flight scan still
754+
depends on).
641755
6. **`EndBackup(pin_token)`** releases the tracker entry. A producer
642756
crash before EndBackup leaves the entry to be reaped by the sweeper.
643757

@@ -688,7 +802,12 @@ elastickv-backup dump \
688802
[--include-orphans] \
689803
[--preserve-sqs-visibility] \
690804
[--include-sqs-side-records] \
691-
[--checksums sha256]
805+
[--checksums sha256] \
806+
[--ttl-ms 1800000] \
807+
[--scan-page-size 1024] \
808+
[--dynamodb-bundle-mode per-item|jsonl] \
809+
[--dynamodb-bundle-size 64MiB] \
810+
[--rename-collisions]
692811
```
693812

694813
Internally it runs:
@@ -908,7 +1027,10 @@ Scope: out of this proposal; mentioned only to draw the boundary.
9081027
| `TestPinWithDeadlineExpiry` | `PinWithDeadline(ts, now+100ms)` is auto-released by the sweeper after the deadline; compactor unblocked; `backup_pin_expired` log emitted |
9091028
| `TestBeginBackupWaitsForLaggingShard` | Force shard B's `applied_index` to lag; `BeginBackup` polls until it catches up or times out with `FailedPrecondition`; no scan starts in the timeout case |
9101029
| `TestBackupScannerPaging` | A range with > pageSize keys is returned across multiple `ScanAt` pages with no overlap, no gaps; iteration tolerates concurrent writes by completing at the pinned `read_ts` |
911-
| `TestS3SidecarSuffixCollision` | A user S3 object key ending in `.elastickv-meta.json` is rejected without `--rename-collisions`; with the flag, the rename is recorded in `KEYMAP` |
1030+
| `TestS3SidecarSuffixCollision` | A user S3 object key ending in `.elastickv-meta.json` is rejected without `--rename-collisions`; with the flag, the rename is recorded in `KEYMAP.jsonl` |
1031+
| `TestS3PathFileVsDirectoryCollision` | Bucket holds both `path/to` (object) and `path/to/obj`; producer renames the shorter key to `path/to.elastickv-leaf-data` and records it in `KEYMAP.jsonl`; restore tool reverses it via `MANIFEST.s3_collision_strategy` |
1032+
| `TestBeginBackupTooManyActiveBackups` | Reaching `max_active_backup_pins` returns `ResourceExhausted`; releasing one pin frees a slot for the next request |
1033+
| `TestRenewBackupExtendsDeadline` | `RenewBackup` shifts the deadline; producer's failed-renewal path aborts the dump with a critical log line rather than continuing past the TTL |
9121034

9131035
### P1
9141036

0 commit comments

Comments
 (0)