
Commit 9b1be20

backup: KEYMAP.jsonl writer/reader and MANIFEST.json schema (Phase 0a) (#712)
## Summary

Stacked on top of #711 (filename encoding). Adds two more foundation pieces of the Phase 0 logical-backup decoder.

**KEYMAP.jsonl** (`internal/backup/keymap.go`)

- Append-only JSONL stream of `{encoded, original (b64url), kind}` records.
- Records exist only when the original bytes are NOT recoverable from the encoded filename alone:
  - `KindSHAFallback`: segments rendered as `<sha32>__<truncated>`
  - `KindS3LeafData`: S3 path collisions renamed to `.elastickv-leaf-data`
  - `KindMetaCollision`: user object key ending in `.elastickv-meta.json`
- `KeymapWriter`: streaming append, with the JSON encoder configured to skip HTML escapes so user-key bytes round-trip cleanly. Refuses empty `encoded` or `kind` so producer bugs surface loudly. `Count()` is exposed for the "omit empty file" decision.
- `KeymapReader`: line-by-line scanner with a bounded buffer (4 MiB, per `keymapBufSizeReader`); blank lines surface as `ErrInvalidKeymapRecord` rather than being silently skipped, so truncated dumps are recognised.
- `LoadKeymap`: convenience helper that materialises the file as a map (last-wins on duplicates).

**MANIFEST.json** (`internal/backup/manifest.go`)

- Structs matching the schema in `docs/design/2026_04_29_proposed_snapshot_logical_decoder.md`.
- `CurrentFormatVersion = 1`; `ReadManifest` refuses `format_version > current` and `format_version == 0` (`ErrUnsupportedFormatVersion`).
- Phase discriminator: Phase 0 must not set `Live`, Phase 1 must not set `Source`; both are validated at write and read time.
- `DisallowUnknownFields` on read so format drift surfaces loudly.
- Pretty-printed output (2-space indent, no HTML escapes) since `MANIFEST.json` is operator-facing.
- `NewPhase0SnapshotManifest` seeds policy fields with the documented defaults.

## Test plan

- [x] `go test -race ./internal/backup/...`: pass.
- [x] `golangci-lint run ./internal/backup/...`: clean.
- [x] Tests cover round-trip, sticky-error semantics, last-wins dedup, HTML-escape suppression, future-version refusal, unknown-field refusal, unknown-phase refusal, cross-phase `Source`/`Live` exclusion.

## Self-review

- **Data loss**: N/A (read/write helpers). `KeymapReader` returns sticky errors so partial reads cannot be silently treated as success.
- **Concurrency**: `KeymapWriter`/`KeymapReader` are not goroutine-safe (per-scope use); manifest helpers are pure. `-race` clean.
- **Performance**: `bufio.Writer` (64 KiB) for the JSONL stream; bounded scanner buffer (4 MiB) on read.
- **Data consistency**: `DisallowUnknownFields` plus the format-version gate prevent silent drift. The phase discriminator's structural rules are enforced symmetrically at write and read.
- **Test coverage**: 7 keymap tests + 10 manifest tests covering the documented happy/sad paths.

## Stacking

Base branch is `feat/backup-phase0a-filename` (PR #711). When that lands, this PR's base will switch to `main` automatically.
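The read-side gate described for `MANIFEST.json` can be illustrated with a minimal sketch. It uses only the standard library and is not the code in `internal/backup/manifest.go`: the `miniManifest` struct, the `phase` field name, and `readMiniManifest` are hypothetical stand-ins, and only `format_version`, the version rule, and the use of `DisallowUnknownFields` come from the summary above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// currentFormatVersion mirrors CurrentFormatVersion = 1 from the summary.
const currentFormatVersion = 1

// errUnsupportedFormatVersion stands in for ErrUnsupportedFormatVersion.
var errUnsupportedFormatVersion = errors.New("unsupported format_version")

// miniManifest is a hypothetical two-field stand-in for the real schema.
type miniManifest struct {
	FormatVersion int    `json:"format_version"`
	Phase         string `json:"phase"` // assumed field name, illustration only
}

// readMiniManifest decodes strictly, then applies the version gate:
// version 0 (absent) and versions newer than current are both refused.
func readMiniManifest(data []byte) (miniManifest, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields() // unknown keys become a decode error
	var m miniManifest
	if err := dec.Decode(&m); err != nil {
		return miniManifest{}, fmt.Errorf("decode manifest: %w", err)
	}
	if m.FormatVersion == 0 || m.FormatVersion > currentFormatVersion {
		return miniManifest{}, fmt.Errorf("%w: %d", errUnsupportedFormatVersion, m.FormatVersion)
	}
	return m, nil
}

func main() {
	inputs := []string{
		`{"format_version":1,"phase":"phase0"}`,              // accepted
		`{"format_version":2,"phase":"phase0"}`,              // future version
		`{"phase":"phase0"}`,                                 // version missing, reads as 0
		`{"format_version":1,"phase":"phase0","extra":true}`, // unknown field
	}
	for _, in := range inputs {
		_, err := readMiniManifest([]byte(in))
		fmt.Printf("accepted=%v input=%s\n", err == nil, in)
	}
}
```

Treating a missing `format_version` the same as an explicit `0` falls out of Go's zero values, which is presumably why the summary lists `format_version == 0` as a refusal case alongside future versions.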
2 parents 3d71d1f + f13cd1b commit 9b1be20

4 files changed

Lines changed: 1567 additions & 0 deletions


internal/backup/keymap.go

Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
package backup

import (
	"bufio"
	"bytes"
	"encoding/base64"
	"encoding/json"
	"io"

	"github.com/cockroachdb/errors"
)

// jsonNullLiteral is the byte-for-byte JSON null token. We compare raw
// json.RawMessage values against this rather than relying on
// post-Unmarshal string emptiness, because `null` and `""` collapse to
// the same Go-side value once Unmarshal'd into a typed field.
var jsonNullLiteral = []byte("null")

// KEYMAP.jsonl shape (one record per line):
//
//	{"encoded":"<encoded-segment>","original":"<base64url-no-padding>","kind":"sha-fallback"}
//
// Records are written in encounter order (the order the encoder produced
// them) and never modified after write. The file is append-only; if the same
// encoded segment is written twice the reader keeps the last entry, but the
// encoder is expected not to emit duplicates within a single dump.
//
// Records exist only for entries whose original bytes are NOT recoverable
// from the encoded filename alone:
//
//   - KindSHAFallback: segment is `<sha-prefix-32>__<truncated-original>`
//     (filename length exceeded EncodeSegment's 240-byte ceiling).
//   - KindS3LeafData: S3 object renamed to `<obj>.elastickv-leaf-data`
//     because both `<obj>` and `<obj>/...` existed in the same bucket.
//   - KindMetaCollision: user S3 object key happened to end in
//     `.elastickv-meta.json`; renamed under --rename-collisions.
//
// A consumer that does not care about reversing these to original bytes can
// ignore KEYMAP.jsonl entirely.
const (
	KindSHAFallback   = "sha-fallback"
	KindS3LeafData    = "s3-leaf-data"
	KindMetaCollision = "meta-suffix-rename"
)

// keymapBufSizeWriter is the bufio.Writer buffer size for the JSONL writer.
// 64 KiB amortises the per-syscall cost across hundreds of small records
// without holding pathological amounts of memory.
const keymapBufSizeWriter = 64 << 10

// keymapBufSizeReader bounds bufio.Scanner's per-line buffer. KEYMAP
// records carry a ~240-byte encoded segment plus a base64url-encoded
// original key. The source store (store/mvcc_store.go
// maxSnapshotKeySize) caps a single key at 1 MiB; base64url expansion
// is ~4/3 (1 MiB → ~1.33 MiB), and the surrounding JSON object adds a
// fixed ~80 bytes of field names / brackets / commas. A 1 MiB cap was
// therefore not enough to cover a maximum-sized valid key (Codex P1
// round 6, commit 2cd58a93). 4 MiB leaves roughly 3× margin over the
// theoretical worst case while still bounding pathological lines, and
// matches the doubling cadence we'd want if the upstream key cap were
// ever raised.
const keymapBufSizeReader = 4 << 20

// ErrInvalidKeymapRecord is returned by KeymapReader.Next when a line does
// not parse as a KeymapRecord (malformed JSON, missing field, malformed
// base64, etc.).
var ErrInvalidKeymapRecord = errors.New("backup: invalid KEYMAP.jsonl record")

// KeymapRecord is a single mapping from encoded filename component back to
// the original key bytes. Original bytes are arbitrary (binary safe), so
// they are encoded as base64url-no-padding for transport in JSON.
type KeymapRecord struct {
	// Encoded is the filename segment as it appears in the dump tree.
	Encoded string `json:"encoded"`
	// OriginalB64 is base64url-no-padding of the original key bytes.
	OriginalB64 string `json:"original"`
	// Kind classifies why this record exists; see Kind* constants.
	Kind string `json:"kind"`
}

// Original returns the decoded original key bytes from r.OriginalB64.
func (r KeymapRecord) Original() ([]byte, error) {
	out, err := base64.RawURLEncoding.DecodeString(r.OriginalB64)
	if err != nil {
		return nil, errors.Wrap(ErrInvalidKeymapRecord, err.Error())
	}
	return out, nil
}

// KeymapWriter appends records to a KEYMAP.jsonl stream. It is not safe for
// concurrent use: neither bufio.Writer nor the record counter is
// synchronised, so the caller is expected to use a single writer per scope
// from one goroutine.
type KeymapWriter struct {
	bw  *bufio.Writer
	enc *json.Encoder
	// count tracks how many records have been written; exposed so the caller
	// can decide to omit an empty KEYMAP.jsonl file (per the spec, the file
	// is omitted when no entries exist).
	count int
}

// NewKeymapWriter returns a writer that appends JSONL records to w. Close
// must be called to flush.
func NewKeymapWriter(w io.Writer) *KeymapWriter {
	bw := bufio.NewWriterSize(w, keymapBufSizeWriter)
	enc := json.NewEncoder(bw)
	enc.SetEscapeHTML(false) // we never embed user keys in HTML; preserve `<>&`
	return &KeymapWriter{bw: bw, enc: enc}
}

// Write appends one KeymapRecord. The record is JSON-serialised with a
// trailing newline (json.Encoder behavior), giving the JSONL contract.
func (w *KeymapWriter) Write(rec KeymapRecord) error {
	if rec.Encoded == "" {
		return errors.WithStack(errors.New("backup: KEYMAP record encoded must be non-empty"))
	}
	if rec.Kind == "" {
		return errors.WithStack(errors.New("backup: KEYMAP record kind must be non-empty"))
	}
	if err := w.enc.Encode(rec); err != nil {
		return errors.WithStack(err)
	}
	w.count++
	return nil
}

// WriteOriginal is a convenience wrapper that base64-encodes raw original
// bytes for the caller.
func (w *KeymapWriter) WriteOriginal(encoded string, original []byte, kind string) error {
	return w.Write(KeymapRecord{
		Encoded:     encoded,
		OriginalB64: base64.RawURLEncoding.EncodeToString(original),
		Kind:        kind,
	})
}

// Count returns the number of records written so far. Useful for the
// "omit empty KEYMAP file" decision after the dump completes.
func (w *KeymapWriter) Count() int { return w.count }

// Close flushes any buffered records to the underlying writer.
func (w *KeymapWriter) Close() error {
	if w.bw == nil {
		return nil
	}
	if err := w.bw.Flush(); err != nil {
		return errors.WithStack(err)
	}
	return nil
}

// KeymapReader iterates JSONL records line-by-line. Memory footprint is
// bounded by keymapBufSizeReader regardless of file size.
type KeymapReader struct {
	sc  *bufio.Scanner
	err error
}

// NewKeymapReader wraps r so the caller can iterate records via Next.
func NewKeymapReader(r io.Reader) *KeymapReader {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, keymapBufSizeReader), keymapBufSizeReader)
	return &KeymapReader{sc: sc}
}

// Next decodes the next record. It returns (rec, true, nil) on success,
// (zero, false, nil) at end of stream, and (zero, false, err) on parse
// failure or I/O error. Once an error is returned the reader is sticky:
// subsequent calls return the same error.
//
// The base64-encoded `original` field is validated at parse time rather
// than lazily: a malformed dump must surface on the first read of the
// affected line, not propagate silently until a much later
// rec.Original() call. Same error class either way.
func (r *KeymapReader) Next() (KeymapRecord, bool, error) {
	if r.err != nil {
		return KeymapRecord{}, false, r.err
	}
	if !r.sc.Scan() {
		if err := r.sc.Err(); err != nil {
			r.err = errors.WithStack(err)
			return KeymapRecord{}, false, r.err
		}
		return KeymapRecord{}, false, nil
	}
	line := r.sc.Bytes()
	rec, err := decodeKeymapLine(line)
	if err != nil {
		r.err = err
		return KeymapRecord{}, false, r.err
	}
	return rec, true, nil
}

// decodeKeymapLine parses one JSONL record. It enforces three properties:
//
//  1. The record must contain `encoded`, `original`, and `kind` fields,
//     and none of them may be the JSON literal `null`: Go unmarshals
//     a null string field into "", and base64.DecodeString("") would
//     silently accept it as an empty original key, rewriting the
//     mapping. Codex P2 round 5 + P1 round 7-follow-up.
//  2. `encoded` and `kind` must be non-empty strings.
//  3. `original` (the base64) must be parseable at parse time so a
//     corrupted dump fails on first read rather than at a later
//     Original() call. Codex P1 #179.
func decodeKeymapLine(line []byte) (KeymapRecord, error) {
	// Two-phase decode: first into a presence-aware map so we can
	// distinguish "field absent" from "field present and empty
	// string"; then into the typed struct for value extraction.
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(line, &fields); err != nil {
		return KeymapRecord{}, errors.Wrap(ErrInvalidKeymapRecord, err.Error())
	}
	for _, name := range [...]string{"encoded", "original", "kind"} {
		raw, ok := fields[name]
		if !ok {
			return KeymapRecord{}, errors.Wrapf(ErrInvalidKeymapRecord, "missing field %q", name)
		}
		// `"original": null` round-trips to "" through json.Unmarshal
		// into a `string` target, and base64.DecodeString("") would
		// then silently accept it. Reject the JSON null literal
		// explicitly so corrupted/truncated records don't slip
		// through with empty-bytes mappings.
		if bytes.Equal(raw, jsonNullLiteral) {
			return KeymapRecord{}, errors.Wrapf(ErrInvalidKeymapRecord, "field %q is null", name)
		}
	}
	var rec KeymapRecord
	if err := json.Unmarshal(line, &rec); err != nil {
		return KeymapRecord{}, errors.Wrap(ErrInvalidKeymapRecord, err.Error())
	}
	if rec.Encoded == "" || rec.Kind == "" {
		return KeymapRecord{}, errors.Wrap(ErrInvalidKeymapRecord, "missing encoded or kind")
	}
	if _, err := base64.RawURLEncoding.DecodeString(rec.OriginalB64); err != nil {
		return KeymapRecord{}, errors.Wrap(ErrInvalidKeymapRecord, err.Error())
	}
	return rec, nil
}

// LoadKeymap reads every record from r into an in-memory map keyed by
// encoded segment. The last record wins on duplicates. Suitable for
// scopes where the keymap fits comfortably in memory; for large scopes
// callers should use KeymapReader directly.
func LoadKeymap(r io.Reader) (map[string]KeymapRecord, error) {
	out := make(map[string]KeymapRecord)
	rd := NewKeymapReader(r)
	for {
		rec, ok, err := rd.Next()
		if err != nil {
			return nil, err
		}
		if !ok {
			return out, nil
		}
		out[rec.Encoded] = rec
	}
}
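
The record format and the null check in `keymap.go` can be exercised outside the package with a rough, standalone sketch. It relies only on the standard library rather than `KeymapWriter`/`KeymapReader`, and the names `record`, `encodeLine`, `hasNullOrMissing`, and `decodeOriginals` are hypothetical. It shows why the presence-aware `json.RawMessage` pass is needed: without it, `"original": null` would decode to an empty string and then to zero original bytes.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// record mirrors the three-field KEYMAP.jsonl line shape.
type record struct {
	Encoded  string `json:"encoded"`
	Original string `json:"original"` // base64url, no padding
	Kind     string `json:"kind"`
}

// encodeLine renders one JSONL line (newline included) without HTML escaping.
func encodeLine(encoded string, original []byte, kind string) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	err := enc.Encode(record{
		Encoded:  encoded,
		Original: base64.RawURLEncoding.EncodeToString(original),
		Kind:     kind,
	})
	return buf.String(), err
}

// hasNullOrMissing reports whether a required field is absent or JSON null.
func hasNullOrMissing(line []byte) (bool, error) {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(line, &fields); err != nil {
		return false, err
	}
	for _, name := range []string{"encoded", "original", "kind"} {
		raw, ok := fields[name]
		if !ok || bytes.Equal(raw, []byte("null")) {
			return true, nil
		}
	}
	return false, nil
}

// decodeOriginals reads JSONL text into encoded -> original bytes.
// The last record wins on duplicates, as in LoadKeymap.
func decodeOriginals(jsonl string) (map[string][]byte, error) {
	out := make(map[string][]byte)
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	sc.Buffer(make([]byte, 0, 64<<10), 4<<20) // 4 MiB cap, as keymapBufSizeReader
	for sc.Scan() {
		bad, err := hasNullOrMissing(sc.Bytes())
		if err != nil {
			return nil, err
		}
		if bad {
			return nil, fmt.Errorf("null or missing field in %q", sc.Text())
		}
		var rec record
		if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
			return nil, err
		}
		orig, err := base64.RawURLEncoding.DecodeString(rec.Original)
		if err != nil {
			return nil, err
		}
		out[rec.Encoded] = orig
	}
	return out, sc.Err()
}

func main() {
	line, err := encodeLine("a<b__seg", []byte("k<e>y&\x00\xff"), "sha-fallback")
	if err != nil {
		panic(err)
	}
	fmt.Print(line) // the `<` in the encoded segment is written literally
	m, err := decodeOriginals(line)
	fmt.Printf("%q %v\n", m["a<b__seg"], err)

	_, err = decodeOriginals(`{"encoded":"x","original":null,"kind":"sha-fallback"}` + "\n")
	fmt.Println("null original rejected:", err != nil)
}
```

A blank line fails the first `json.Unmarshal` in this sketch too, which matches the summary's point that blank lines are reported rather than skipped.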
