s3keys: export ParseBlobKey for offline blob-key consumers#717
Conversation
Implements internal/backup/{filename.go,filename_test.go} per the Phase 0
design doc (docs/design/2026_04_29_proposed_snapshot_logical_decoder.md).
Encoding rules (deterministic, reversible from the filename alone except
for the SHA fallback):
- RFC3986 unreserved set [A-Za-z0-9._-] passes through.
- Other bytes -> %HH (uppercase hex), like form-urlencoded but applied
to every non-allowlisted byte.
- Segments exceeding 240 bytes after percent-encoding render as
<sha256-hex-prefix-32>__<truncated-original> with the full original
recorded in KEYMAP.jsonl by the caller. DecodeSegment refuses these
with ErrShaFallbackNeedsKeymap so callers cannot fabricate the
original bytes from the filename alone.
- DynamoDB B-attribute (binary) keys take a separate "b64.<base64url>"
path so a binary key never collides with a hex-shaped string key.
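For orientation, a minimal sketch of the percent-encoding rule above; the function name is hypothetical and the SHA fallback and binary b64 path are deliberately omitted, so this is not the package's actual EncodeSegment:

```go
package main

import (
	"fmt"
	"strings"
)

// encodeSegmentSketch shows only the percent-encoding rule: every byte outside
// the RFC 3986 unreserved set [A-Za-z0-9._-] becomes %HH with uppercase hex.
// The real EncodeSegment additionally applies the 240-byte SHA fallback and the
// b64.<base64url> path for binary keys described above.
func encodeSegmentSketch(raw []byte) string {
	var b strings.Builder
	for _, c := range raw {
		switch {
		case c >= 'A' && c <= 'Z', c >= 'a' && c <= 'z', c >= '0' && c <= '9',
			c == '.', c == '_', c == '-':
			b.WriteByte(c)
		default:
			fmt.Fprintf(&b, "%%%02X", c)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(encodeSegmentSketch([]byte("user:42/profile"))) // user%3A42%2Fprofile
}
```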
Tests cover passthrough, percent escaping, hex-uppercase invariant,
SHA-fallback firing on long inputs (raw or post-expansion), binary
round-trip, malformed-input rejection, output-length bound under
adversarial inputs, and rapid-driven property tests for round-trip on
both percent and binary paths.
Builds on PR #711 (filename encoding). Adds two more foundation pieces of the Phase 0 logical-backup decoder.
KEYMAP.jsonl (internal/backup/keymap.go)
- Append-only JSONL stream of {encoded, original (b64url), kind} records.
- Records exist only for entries whose original bytes are NOT recoverable from the encoded filename alone:
  - KindSHAFallback (segments rendered as <sha32>__<truncated>)
  - KindS3LeafData (S3 path collisions renamed to .elastickv-leaf-data)
  - KindMetaCollision (user object key ending in .elastickv-meta.json)
- KeymapWriter: streaming append; json encoder configured to skip HTML escapes so user-key bytes round-trip cleanly. Refuses empty encoded or kind so producer bugs surface loudly. Count() exposed for the "omit empty file" decision.
- KeymapReader: line-by-line scanner with bounded buffer (1 MiB); blank lines surface as ErrInvalidKeymapRecord rather than being silently skipped so truncated dumps are recognised.
- LoadKeymap: convenience helper that materialises the file as a map (last-wins on duplicates).
MANIFEST.json (internal/backup/manifest.go)
- Manifest, Source, Live, Adapters, Adapter, Exclusions structs matching the schema in docs/design/2026_04_29_proposed_snapshot_logical_decoder.md.
- CurrentFormatVersion = 1; ReadManifest refuses format_version > current and format_version == 0 (ErrUnsupportedFormatVersion).
- Phase discriminator constants for Phase 0 ("phase0-snapshot-decode") and Phase 1 ("phase1-live-pinned"); Phase 0 manifests must not set Live, Phase 1 must not set Source -- both validated at write and read time.
- DisallowUnknownFields on read so format drift surfaces loudly.
- Pretty-printed output (2-space indent, no HTML escapes) since MANIFEST.json is operator-facing.
- NewPhase0SnapshotManifest seeds the policy fields with the documented defaults so callers can focus on populating per-dump metadata.
Tests cover round-trip, sticky-error semantics, last-wins dedup, HTML-escape suppression, future-version refusal, unknown-field refusal, unknown-phase refusal, and the cross-phase Source/Live exclusion rules.
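To make the record shape concrete, a minimal sketch of writing one KEYMAP.jsonl line with HTML escaping disabled; the Go struct, JSON keys, and kind string here are assumptions based on the description above, not the package's actual types:

```go
package main

import (
	"encoding/json"
	"os"
)

// keymapRecord mirrors the {encoded, original (b64url), kind} shape described
// above; the field names and the kind label are assumptions, not the package's
// actual definitions.
type keymapRecord struct {
	Encoded  string `json:"encoded"`
	Original string `json:"original"` // base64url of the original bytes
	Kind     string `json:"kind"`
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	// Disable HTML escaping so bytes like '&', '<', '>' in user keys
	// round-trip unchanged, matching the KeymapWriter behaviour above.
	enc.SetEscapeHTML(false)
	_ = enc.Encode(keymapRecord{
		Encoded:  "3f1a...__trunc",       // hypothetical <sha32>__<truncated> filename
		Original: "dXNlcjo0Mi9wcm9maWxl", // base64url("user:42/profile")
		Kind:     "sha_fallback",         // hypothetical kind label
	})
}
```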
Builds on PR #712 (KEYMAP/MANIFEST). Adds the first per-adapter encoder of the Phase 0 logical-backup decoder. Covers the three "simple" Redis prefixes that map ONE snapshot record to one user key — strings, HLLs, and the TTL scan index. Wide-column types (hash/list/set/zset/stream) ship in follow-up PRs.
Snapshot key prefixes handled:
- !redis|str|<userKey> -> strings/<encoded>.bin (+ inline TTL stripped into strings_ttl.jsonl). New magic-prefix format and legacy raw bytes both supported.
- !redis|hll|<userKey> -> hll/<encoded>.bin (raw HLL sketch bytes).
- !redis|ttl|<userKey> -> routed by prior type observation:
  - redisKindString -> strings_ttl.jsonl (legacy strings)
  - redisKindHLL -> hll_ttl.jsonl
  - redisKindUnknown -> buffered for wide-column types (warning at Finalize until those encoders land).
Routing relies on Pebble's lex order: !redis|hll|* < !redis|str|* < !redis|ttl|* (h < s < t), so by the time HandleTTL is called the type tracker already has the answer.
Filenames go through EncodeSegment for safety; values are written atomically (tmp + rename), and writeFileAtomic refuses to overwrite a symlink so a malicious or careless link-in-the-output-tree cannot redirect bytes.
Tests cover: new-format with/without inline TTL, legacy format treated as raw, HLL raw round-trip, TTL routing for both kinds, orphan-TTL warning emission, truncated header / TTL section rejection, malformed TTL value rejection, EncodeSegment integration, symlink-overwrite refusal, HasInlineTTL helper.
The package keeps these constants local rather than importing from adapter/redis_compat_types.go because the backup decoder is intentionally adapter-independent (it must run as an offline tool with no live cluster). A doc comment in redis_string.go points back to the live source so the next renamer is reminded to update both.
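A minimal sketch of the tmp+rename write with the symlink refusal described above; the function name and exact error handling are assumptions, not the actual internal/backup implementation:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomicSketch illustrates the tmp+rename pattern plus the symlink
// refusal described above; names and error handling are hypothetical.
func writeFileAtomicSketch(path string, data []byte) error {
	// Refuse to replace a symlink so a planted link cannot redirect bytes
	// outside the output tree.
	if fi, err := os.Lstat(path); err == nil && fi.Mode()&os.ModeSymlink != 0 {
		return fmt.Errorf("refusing to overwrite symlink %s", path)
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	if err := writeFileAtomicSketch("example.bin", []byte("payload")); err != nil {
		fmt.Println(err)
	}
}
```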
Builds on PR #713. Adds the SQS encoder for the Phase 0 logical-backup decoder.
Snapshot prefixes handled:
- !sqs|queue|meta|<base64url(queue)> -> sqs/<queue>/_queue.json (dump-format projection of the live sqsQueueMeta with AWS-style field names; FormatVersion stamped, throttle / partition / dedup-scope fields elided -- cluster-internal state, not user-visible config).
- !sqs|msg|data|<base64url(queue)><gen 8B BE><base64url(msgID)> -> sqs/<queue>/messages.jsonl (one record per line, sorted at Finalize-time by (SendTimestampMillis, SequenceNumber, MessageID)).
- !sqs|msg|vis | byage | dedup | group, !sqs|queue|tombstone: excluded by default; --include-sqs-side-records routes them to sqs/<queue>/_internals/side_records.jsonl as a structured bag.
- !sqs|queue|gen, !sqs|queue|seq: not handled by Phase 0 (operational counters, not user-visible state).
Implementation choices:
- Lex order for the snapshot is m < q < ..., so msg|data records arrive BEFORE queue|meta. The encoder buffers per encoded-queue-prefix and resolves the human-readable queue name at Finalize via the queue|meta records that arrive later.
- Boundary detection: the encoded queue segment is base64url-no-padding (alphabet [A-Za-z0-9-_]); the first byte of the 8-byte BE gen is 0x00 for any production gen value (< 2^56), so the first non-alphabet byte is the queue/gen boundary. The msgID segment is also validated by attempting a base64url decode -- a failed decode is surfaced as ErrSQSMalformedKey rather than routed to the wrong queue.
- Visibility state on emitted messages is zeroed by default; --preserve-visibility passes the live values through.
- Orphan messages (data records with no matching queue meta) emit a structured warning at Finalize and are dropped from the dump. Restoring orphans without a queue config would silently create a default-attribute queue, which is rarely what the operator wants.
- Memory: O(messages-per-queue) buffer at Finalize for the sort pass. Queues with hundreds of millions of messages will need a future stream-and-merge variant; documented as a known limit.
Tests: queue meta round-trip, message ordering by (ts, seq, id) with ties resolved by message_id, default visibility-state zeroing, --preserve-visibility round-trip, orphan-message warning, magic-prefix rejection, JSON-decode rejection, wrong-prefix rejection, peekMsgDataKey component round-trip, side-records include/exclude.
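A minimal sketch of the queue/gen boundary rule described above, assuming the gen is the 8 bytes starting at the first non-base64url byte; the function name and return shape are hypothetical:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// splitQueueGen illustrates the boundary rule: the encoded queue segment uses
// the base64url alphabet [A-Za-z0-9-_], and the first byte of the 8-byte
// big-endian gen is 0x00 for any value below 2^56, so the first non-alphabet
// byte marks the queue/gen boundary (the msgID segment follows the gen).
func splitQueueGen(rest []byte) (queue string, gen []byte, err error) {
	isB64URL := func(c byte) bool {
		return c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z' ||
			c >= '0' && c <= '9' || c == '-' || c == '_'
	}
	i := 0
	for i < len(rest) && isB64URL(rest[i]) {
		i++
	}
	if len(rest)-i < 8 {
		return "", nil, fmt.Errorf("truncated gen segment")
	}
	name, err := base64.RawURLEncoding.DecodeString(string(rest[:i]))
	if err != nil {
		return "", nil, err
	}
	return string(name), rest[i : i+8], nil
}

func main() {
	key := append([]byte(base64.RawURLEncoding.EncodeToString([]byte("orders"))),
		0, 0, 0, 0, 0, 0, 0, 3) // gen = 3, big-endian
	q, gen, _ := splitQueueGen(key)
	fmt.Println(q, gen) // orders [0 0 0 0 0 0 0 3]
}
```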
Builds on PR #714. Adds the DynamoDB encoder for the Phase 0 logical-backup decoder.
Snapshot prefixes handled:
- !ddb|meta|table|<base64url(table)> -> dynamodb/<table>/_schema.json (DynamoTableSchema proto -> DescribeTable-shaped JSON; cluster-internal fields like key_encoding_version and migrating_from_generation stripped from the public projection because they're not user-visible).
- !ddb|item|<base64url(table)>|<gen>|<rest> -> per-item JSON files under dynamodb/<table>/items/. Hash-only tables emit items/<pk>.json; composite-key tables emit items/<pk>/<sk>.json.
- !ddb|gsi|...: ignored (derivable from items + schema; replaying GSI rows on restore would conflict with the destination's own index maintenance).
- !ddb|meta|gen|...: ignored (operational counter, not user state).
Implementation choices:
- Lex order ('i' < 'm') means items arrive before the table schema. The encoder buffers per encoded-table-segment and emits at Finalize once the schema is known, parallel to the SQS encoder's strategy.
- Table-segment parsing is unambiguous: the base64url alphabet contains no '|', so the first '|' after the prefix is the table/gen separator. No heuristic boundary detection needed.
- Item filename derivation reads the hash and range key NAMES from the schema, then looks them up in the item's attributes map. A missing required-key attribute on an item is a structural error (it could never have been GetItem-able) and surfaces as ErrDDBInvalidItem at Finalize.
- B-attribute (binary) primary keys take EncodeBinarySegment so they cannot collide with hex-shaped string keys; matches the design's "binary keys take b64.<base64url>" rule.
- All 10 documented attribute kinds (S, N, B, BOOL, NULL, SS, NS, BS, L, M) are translated to their AWS-DynamoDB-JSON shapes. An empty oneof (a malformed proto value) surfaces as NULL=true so the dump remains deserialisable.
- attributeValueToPublic is split into scalar / set / composite helpers so the cyclomatic complexity stays under the package cap.
- Bundle mode (--dynamodb-bundle-mode jsonl) is left as a stub: WithBundleJSONL(true) makes Finalize return a clear "not implemented in this PR" error so the master pipeline can surface the capability gap. Per-item layout is the documented default.
Tests cover hash-only and composite-key round-trips, binary primary key rendering, orphan items without schema, value-magic rejection on schema and item paths, missing-key-attribute rejection at Finalize, GSI rows ignored, all 10 attribute kinds round-trip through JSON, empty-oneof -> NULL, bundle-mode stub, malformed table-segment key rejection.
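A minimal sketch of the item filename rule described above; encodeSegment stands in for the package's EncodeSegment, and the function name is hypothetical:

```go
package main

import (
	"fmt"
	"path"
)

// itemRelPath illustrates the layout rule: hash-only tables emit
// items/<pk>.json, composite-key tables emit items/<pk>/<sk>.json, with each
// key value passed through the filename encoder first.
func itemRelPath(encodeSegment func(string) string, pk, sk string, hasRangeKey bool) string {
	if !hasRangeKey {
		return path.Join("items", encodeSegment(pk)+".json")
	}
	return path.Join("items", encodeSegment(pk), encodeSegment(sk)+".json")
}

func main() {
	ident := func(s string) string { return s } // placeholder for EncodeSegment
	fmt.Println(itemRelPath(ident, "user-42", "", false))          // items/user-42.json
	fmt.Println(itemRelPath(ident, "user-42", "2024-01-01", true)) // items/user-42/2024-01-01.json
}
```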
The Phase 0a logical-backup decoder (internal/backup) needs to route each !s3|blob| record to its assembled object body without holding a live cluster. Today the package only exports parsers for object manifests and upload parts; blob keys are constructable via BlobKey / VersionedBlobKey but not parseable. ParseBlobKey decodes the 6-component form (BlobKey output) and the 7-component form (VersionedBlobKey output with partVersion != 0) into all components: bucket, generation, object, uploadID, partNo, chunkNo, partVersion. Truncated keys, malformed segments, and trailers that aren't either zero bytes or exactly one u64 are rejected with ok=false. Implementation is split into parseBlobKeyHead (the 6-component head) and parseOptionalPartVersion (the trailer) so cyclomatic complexity stays under the package cap without a //nolint marker. Tests cover the un-versioned round-trip, versioned round-trip, partVersion=0 fallback to un-versioned shape (matching VersionedBlobKey's documented behaviour), rejection of non-blob keys (bucket meta, object manifest, upload part, junk), and rejection of trailing-garbage keys.
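As an illustration of the trailer rule ("zero bytes or exactly one u64"), a hedged sketch of what parseOptionalPartVersion plausibly does; the name, layout, and big-endian assumption are guesses from the description above, not the PR's actual code:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// parseOptionalPartVersionSketch mirrors the trailer rule: an empty trailer
// means the un-versioned 6-component form (partVersion 0), exactly one u64
// (8 bytes, assumed big-endian here) carries the partVersion, and anything
// else is rejected.
func parseOptionalPartVersionSketch(trailer []byte) (partVersion uint64, ok bool) {
	switch len(trailer) {
	case 0:
		return 0, true
	case 8:
		return binary.BigEndian.Uint64(trailer), true
	default:
		return 0, false
	}
}

func main() {
	fmt.Println(parseOptionalPartVersionSketch(nil))                            // 0 true
	fmt.Println(parseOptionalPartVersionSketch([]byte{0, 0, 0, 0, 0, 0, 0, 7})) // 7 true
	fmt.Println(parseOptionalPartVersionSketch([]byte("junk")))                 // 0 false
}
```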
Code Review
This pull request introduces a comprehensive logical backup framework for DynamoDB, Redis, and SQS adapters, featuring filename encoding with SHA-fallback for long keys, manifest management, and a key mapping system for original key recovery. The review feedback highlights opportunities to improve maintainability by centralizing shared utility functions—such as atomic file writing and JSONL handling—into a common utility file to reduce coupling across encoder-specific implementations. Additionally, it is recommended to remove or correct the unused and inconsistently documented retry logic found in the Redis encoder.
@@ -0,0 +1,383 @@
package backup
General-purpose helper functions (e.g., writeFileAtomic, openJSONL, stripPrefixSegment, and mustMarshalIndent) are currently spread across encoder-specific files like redis_string.go and sqs.go. This creates unnecessary coupling. These shared utilities should be centralized into a common file within the backup package, such as util.go or fs.go, to improve maintainability and follow the repository's practice of centralizing common logic.
References
- Centralize common complex logic into shared helper functions to avoid code duplication and improve maintainability.
// IsBlobAtomicWriteRetriable reports whether err from writeFileAtomic is
// a retriable I/O failure (no-space, transient FS error). Today this is a
// stub that returns false for any error; exposed so the master decoder
// loop can decide whether to abort the whole dump on encountering one.
func IsBlobAtomicWriteRetriable(err error) bool {
	if err == nil {
		return false
	}
	// errors.Is handles wrapped paths; both sentinel checks are stable
	// for now because we never wrap them ourselves.
	return errors.Is(err, io.ErrShortWrite)
}
The function IsBlobAtomicWriteRetriable should be addressed for the following reasons:
- Dead Code: The function is currently unused in this PR and should be removed to improve code clarity.
- Centralization: If this logic were required, it should be moved to a shared utility file (e.g., util.go) rather than residing in an encoder-specific file.
- Logic/Documentation: The implementation returns true for io.ErrShortWrite while the comment states it returns false for any error. Note that io.ErrShortWrite is typically not a transient error suitable for automatic retries.
References
- Remove dead or unreachable code paths to improve code clarity and maintainability.
- Centralize common complex logic into shared helper functions to avoid code duplication.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0e15f02c18
func parseDDBItemKey(key []byte) (string, error) {
	rest, err := stripPrefixSegment(key, []byte(DDBItemPrefix))
	if err != nil {
		return "", errors.Wrap(ErrDDBMalformedKey, err.Error())
	}
Validate and retain DynamoDB item generation
parseDDBItemKey only extracts the encoded table segment and discards the <gen> field from !ddb|item|<encTable>|<gen>|..., so later stages cannot filter items to the active schema generation. If a snapshot still contains rows from older table generations (e.g., after recreate/purge before cleanup), those stale rows will be mixed into the export and can overwrite valid items or fail encoding against the current key schema. Parse the generation and gate buffered items against the table schema generation at finalize time.
	if len(st.messages) == 0 {
		return nil
Flush SQS side records for queues without messages
When WithIncludeSideRecords(true) is enabled, flushQueue returns immediately when len(st.messages) == 0, which skips the _internals/side_records.jsonl write path entirely. Queues that have side records but no message payloads in this snapshot lose the opted-in internal data, so the dump is incomplete for the side-record mode.
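A minimal, self-contained illustration of the suggested control-flow fix, with all names hypothetical: emit the side-records file before the empty-messages early return.

```go
package main

import "fmt"

// queueState and flushQueue are stand-ins showing the suggested ordering:
// write _internals/side_records.jsonl before the empty-messages early return
// so queues with side records but no messages are not silently dropped.
type queueState struct {
	messages    []string
	sideRecords []string
}

func flushQueue(st queueState, includeSideRecords bool) []string {
	var written []string
	if includeSideRecords && len(st.sideRecords) > 0 {
		written = append(written, "_internals/side_records.jsonl")
	}
	if len(st.messages) == 0 {
		return written // the current code returns here before side records are written
	}
	written = append(written, "messages.jsonl")
	return written
}

func main() {
	fmt.Println(flushQueue(queueState{sideRecords: []string{"vis"}}, true)) // [_internals/side_records.jsonl]
}
```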
PR #717 itself is unaffected by the feedback above, which targets the earlier PRs in the stack. PR #717's actual code (s3keys.ParseBlobKey + 5 tests) has no outstanding feedback. Will land a follow-up. @claude review
Claude finished @bootjp's task in 2m 38s. Review of PR #717:
Summary
Tiny addition to internal/s3keys. Exports ParseBlobKey so the offline Phase 0a logical-backup decoder (next stacked PR) can route !s3|blob| records to their assembled object without holding a live cluster. The package already exports parsers for object manifests and upload parts; blob keys are constructable via BlobKey / VersionedBlobKey but were not parseable from the outside.
What ParseBlobKey does
Returns all components: bucket, generation, object, uploadID, partNo, chunkNo, partVersion. Both the 6-component (un-versioned) and 7-component (versioned) shapes are decoded. partVersion=0 reverts to the un-versioned shape, matching VersionedBlobKey's documented fallback. Truncated keys, malformed segments, and trailers that aren't either zero bytes or exactly one u64 are rejected with ok=false.
Test plan
- go test -race ./internal/s3keys/... — pass.
- golangci-lint run ./internal/s3keys/... — clean.
- Tests cover the un-versioned and versioned round-trips, the partVersion=0 fallback, rejection of bucket-meta / object-manifest / upload-part / junk inputs, and rejection of trailing-garbage keys.
- Implementation is split into parseBlobKeyHead + parseOptionalPartVersion so cyclomatic complexity stays under the package cap (no //nolint).
Why this is its own PR
Branch stack discipline. The S3 backup encoder uses this; landing the parser separately keeps the encoder PR focused on the dump-format mapping.