backup: compaction silently corrupts BackupManifest_File EndKeys via slice aliasing #170895
Closed
Labels
A-disaster-recovery
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-agent: Filed by an AI agent; usually the result of a human/agent investigation session
P-0: Issues/test failures with a fix SLA of 2 weeks
T-disaster-recovery
branch-release-26.2: Used to mark GA and release blockers, technical advisories, and bugs for 26.2
target-release-26.3.0
Describe the problem
`SSTSinkKeyWriter.maybeDoSizeFlush` in `pkg/backup/backupsink/sst_sink_key_writer.go` assigns `lastFile.Span.EndKey = newSpan.Key` without cloning. `newSpan.Key` aliases the caller's reused scratch buffer in `compactSpanEntry` (`pkg/backup/compaction_processor.go:469-470`). Subsequent `WriteKey` calls overwrite scratch in place, silently mutating the already-recorded `EndKey` on the just-shrunk manifest entry. By the time `Flush` serializes the entry, its `EndKey` holds whatever value scratch contained when the SST was finalized, typically a key far past the intended split boundary.

The result is multiple `BackupManifest_File` entries in the same compacted layer pointing at the same physical SST with overlapping `[start_key, end_key)` spans. Two entries share an `EndKey` (often differing only by a `/0` family suffix that arrives in scratch via the iteration sequence), with the later entry's `StartKey` falling strictly inside the earlier entry's span.

Sibling `Reset` paths in the same file clone correctly (lines 164, 181) with a comment at lines 178-180 calling out exactly this aliasing concern. `maybeDoSizeFlush` was missed.

Counts observed in a production fixture

Counted physical files with overlapping `BackupManifest_File` entries (same `path`, overlapping `[start_key, end_key)`) per layer in the `tpcc-5k` fixture at `gs://cockroach-fixtures-us-east1/roachtest/master/tpcc-5k/20260522-090958.790`:

Example entry pair on one physical file:
Reproduction
A direct unit reproduction against `SSTSinkKeyWriter`:

1. Set `fileSpanByteLimit` to a small value (e.g. 8 KiB) so soft-flush fires on tiny data.
2. Mirror `compactSpanEntry`'s scratch-reuse pattern: a single `[]byte` whose contents are rewritten in place for every `WriteKey` call.
3. Write keys until `len(flushedFiles) == 2` and the shrunk entry's `EndKey` equals the split key.
4. Continue writing keys through the same scratch buffer; the shrunk entry's `EndKey` mutates in place.

Assert that the shrunk entry's `EndKey` is byte-stable across step 4.

Code references

`Reset`

Fix
Clone the boundary key before assigning, e.g. `lastFile.Span.EndKey = newSpan.Key.Clone()`. This corrects the writer going forward but does not retroactively repair manifests already produced.

Related: #170225.
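The aliasing and the clone-based fix described in this issue can be illustrated with a small standalone Go sketch. It does not use the real `SSTSinkKeyWriter` or `roachpb.Key`: the `file` type and the `recordEndKey` and `endKeyAfterReuse` functions are hypothetical stand-ins, and the copy is done with a plain `append` instead of `Key.Clone()`.

```go
package main

import "fmt"

// file is a hypothetical stand-in for a manifest entry; only EndKey matters here.
type file struct {
	EndKey []byte
}

// recordEndKey models the split step: it stores key as the entry's EndKey.
// With clone == false it keeps the caller's slice (the buggy behavior);
// with clone == true it stores an independent copy (the proposed fix).
func recordEndKey(f *file, key []byte, clone bool) {
	if clone {
		key = append([]byte(nil), key...)
	}
	f.EndKey = key
}

// endKeyAfterReuse mimics a caller that reuses one scratch buffer for every
// key, and returns the EndKey observed after the buffer has been overwritten.
func endKeyAfterReuse(clone bool) string {
	scratch := make([]byte, 0, 16)
	var f file

	// The split boundary is written into scratch and recorded.
	scratch = append(scratch[:0], "k05"...)
	recordEndKey(&f, scratch, clone)

	// A later key overwrites the same backing array in place.
	scratch = append(scratch[:0], "k99"...)

	return string(f.EndKey)
}

func main() {
	fmt.Println(endKeyAfterReuse(false)) // prints k99: the recorded boundary was mutated
	fmt.Println(endKeyAfterReuse(true))  // prints k05: the cloned boundary is stable
}
```

The uncloned case changes because the recorded slice and the scratch buffer share one backing array; copying at the point of assignment breaks that sharing.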
Jira issue: CRDB-64224