Skip to content

Commit 5a2f214

Browse files
committed
backup: #904 v12 - remove stale .fsm on self-test mismatch
## Codex P2 v10: self-test mismatch leaves orphaned stale .fsm When --output already exists from a prior successful run and the new --self-test invocation detects a mismatch, writeAndPublish returned errSelfTestMismatch without removing the stale <output>.fsm. encodeOne then wrote a fresh <output>.encode_info.json with self_test.matched=false and the NEW SHA pointing to the unpublished temp snapshot, leaving: - <output>.fsm: STALE bytes from the prior successful run - <output>.encode_info.json: NEW SHA + matched=false - <output>.mismatch.txt: from the new run The sidecar describes an FSM that does not exist on disk. This violates the CLI contract that "a self-test failure leaves no restore-visible FSM" and breaks runbooks that consume the sidecar's SHA to verify FSM integrity (they would find the stale FSM's SHA does not match the sidecar's NEW SHA). Fix: writeAndPublish, on the self-test-mismatch branch, now os.Remove(cfg.outputPath) before returning. After cleanup the artifacts are internally consistent: the sidecar describes a FAILED encode attempt, mismatch.txt has the diff, and the FSM path is absent (matching the sidecar's intent). errors.Is(rerr, os.ErrNotExist) is treated as success — a self-test mismatch from a first-ever encode (no prior FSM) leaves the path absent, which is the same end state. ## Surgical scope — only self-test mismatch wipes stale .fsm Other exit-2 paths (manifest-floor regression, adapter-data error, unsupported DDB layout) preserve any prior <output>.fsm because those failures occur BEFORE writeAndPublish runs. This preserves the runbook ergonomic of "operator typo'd --last-commit-ts; last good FSM still on disk to retry against." ## Test infrastructure To drive a deterministic self-test mismatch end-to-end through the CLI's writeAndPublish, a minimal test seam was added to package backup: EncodeOptions.SetSelfTestCorruptHookForTest(func(*os.File)) exposes the previously-package-private corruptBufferForTest field to callers outside package backup. Production code MUST NOT call this; the godoc explicitly names cmd/elastickv-snapshot-encode tests as the sole intended caller. Pinned by: - TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch: pre-places a stale .fsm, injects corruption via the new seam, asserts publishErr == errSelfTestMismatch, the stale .fsm is GONE, and mismatch.txt is present. - TestCLINonSelfTestExitTwoPreservesPriorFSM: pre-places a stale .fsm, drives a manifest-floor regression (exit-2 BEFORE writeAndPublish), asserts the stale .fsm is byte-for-byte preserved. Pins the surgical scope of the cleanup. ## Caller audit per CLAUDE.md semantic-change rule - writeAndPublish: sole production caller is encodeOne. On self-test mismatch, encodeOne still writes the sidecar (publishErr matches errSelfTestMismatch) and returns publishErr. The new os.Remove happens before that, so the caller sees IDENTICAL error semantics and the same sidecar-write path. The only difference is on-disk state for <output>.fsm: was stale, is now absent. - EncodeOptions.SetSelfTestCorruptHookForTest: new exported method. No production callers anywhere; one test caller in cmd/elastickv-snapshot-encode/main_test.go. In-package backup tests continue to set corruptBufferForTest directly. Tests + lint green.
1 parent f075610 commit 5a2f214

3 files changed

Lines changed: 150 additions & 0 deletions

File tree

cmd/elastickv-snapshot-encode/main.go

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -335,6 +335,16 @@ func writeAndPublish(cfg *config, encodeOpts backup.EncodeOptions, mismatchPath
335335
if werr := os.WriteFile(mismatchPath, result.SelfTestMismatchTxt, mismatchTxtPerm); werr != nil {
336336
logger.Warn("write mismatch.txt", "err", werr)
337337
}
338+
// Remove the stale <output>.fsm if one exists from a prior
339+
// successful run. encodeOne is about to write a fresh
340+
// <output>.encode_info.json with self_test.matched=false and
341+
// a NEW SHA pointing to the unpublished temp snapshot; leaving
342+
// the old bytes on disk would make the sidecar describe an
343+
// FSM that does not exist and violate the "self-test failure
344+
// leaves no restore-visible FSM" contract (codex P2 v10 #904).
345+
if rerr := os.Remove(cfg.outputPath); rerr != nil && !errors.Is(rerr, os.ErrNotExist) {
346+
logger.Warn("remove stale .fsm on self-test mismatch", "err", rerr)
347+
}
338348
return result, errors.Wrap(errSelfTestMismatch, "self-test diff (see "+mismatchPath+")")
339349
}
340350
if err := os.Rename(tempPath, cfg.outputPath); err != nil {

cmd/elastickv-snapshot-encode/main_test.go

Lines changed: 125 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ import (
1212
"time"
1313

1414
"github.com/bootjp/elastickv/internal/backup"
15+
"github.com/cockroachdb/errors"
1516
)
1617

1718
// isWindows is true on Windows builds; perm-bit tests skip on Windows
@@ -300,6 +301,130 @@ func readSidecar(t *testing.T, output string) backup.EncodeInfo {
300301
return info
301302
}
302303

304+
// TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch pins codex
305+
// P2 v10 #904: when a prior successful run left an <output>.fsm on
306+
// disk and a new --self-test invocation produces a mismatch,
307+
// writeAndPublish must remove that stale .fsm. Otherwise encodeOne
308+
// writes a fresh sidecar (matched=false, NEW SHA) alongside the OLD
309+
// bytes — violating the CLI contract that a self-test failure leaves
310+
// no restore-visible FSM, and making the sidecar describe an FSM that
311+
// is not on disk.
312+
//
313+
// To drive a deterministic self-test mismatch end-to-end through the
314+
// CLI's writeAndPublish, the test uses the backup package's exported
315+
// test seam (SetSelfTestCorruptHookForTest) to flip bytes in the
316+
// disk-backed self-test buffer between WriteTo and the re-decode.
317+
func TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch(t *testing.T) {
318+
t.Parallel()
319+
rawIn := t.TempDir()
320+
writeSQSFixture(t, rawIn)
321+
emitMinimalManifest(t, rawIn, 7000)
322+
canonicalIn := canonicalizeInput(t, rawIn, 7000)
323+
324+
out := filepath.Join(t.TempDir(), "out.fsm")
325+
// Pre-place a stale .fsm — what a prior successful run would have
326+
// left behind. The codex P2 v10 contract is that a subsequent
327+
// self-test mismatch invalidates this file.
328+
stalePayload := []byte("STALE FSM FROM PRIOR SUCCESSFUL RUN")
329+
if err := os.WriteFile(out, stalePayload, 0o600); err != nil {
330+
t.Fatalf("WriteFile stale: %v", err)
331+
}
332+
333+
scratchBase := t.TempDir()
334+
encodeOpts := backup.EncodeOptions{
335+
InputRoot: canonicalIn,
336+
Adapters: backup.AdapterSet{SQS: true},
337+
LastCommitTS: 7000,
338+
ManifestLastCommitTS: 7000,
339+
SelfTest: true,
340+
SelfTestDecodeOptions: backup.DecodeOptions{
341+
OutRoot: scratchBase,
342+
Adapters: backup.AdapterSet{SQS: true},
343+
},
344+
}
345+
// Flip bytes past the EKVPBBL1 header so the re-decode trips on
346+
// a malformed entry length and the self-test returns matched=false.
347+
// Pattern mirrors flipBytesPastHeaderHelper in the library test.
348+
encodeOpts.SetSelfTestCorruptHookForTest(func(f *os.File) {
349+
info, ferr := f.Stat()
350+
if ferr != nil {
351+
t.Fatalf("temp Stat: %v", ferr)
352+
}
353+
const headerSkip = 200
354+
if info.Size() <= headerSkip {
355+
t.Fatalf("temp file too small to corrupt past header: %d bytes", info.Size())
356+
}
357+
buf := make([]byte, info.Size()-headerSkip)
358+
if _, rerr := f.ReadAt(buf, headerSkip); rerr != nil {
359+
t.Fatalf("ReadAt: %v", rerr)
360+
}
361+
for i := 0; i < len(buf); i += 13 {
362+
buf[i] ^= 0xFF
363+
}
364+
if _, werr := f.WriteAt(buf, headerSkip); werr != nil {
365+
t.Fatalf("WriteAt: %v", werr)
366+
}
367+
})
368+
369+
cfg := &config{
370+
inputPath: canonicalIn,
371+
outputPath: out,
372+
adapters: backup.AdapterSet{SQS: true},
373+
selfTest: true,
374+
}
375+
mismatchPath := out + ".mismatch.txt"
376+
377+
_, publishErr := writeAndPublish(cfg, encodeOpts, mismatchPath, quietLogger())
378+
if !errors.Is(publishErr, errSelfTestMismatch) {
379+
t.Fatalf("publishErr = %v, want errSelfTestMismatch", publishErr)
380+
}
381+
if _, statErr := os.Stat(out); !os.IsNotExist(statErr) {
382+
t.Errorf("stale .fsm at %s not removed after self-test mismatch (codex P2 v10)", out)
383+
}
384+
// The mismatch.txt should be present as the operator-visible
385+
// record of the failed encode attempt.
386+
if _, statErr := os.Stat(mismatchPath); statErr != nil {
387+
t.Errorf("mismatch.txt missing after self-test mismatch: %v", statErr)
388+
}
389+
}
390+
391+
// TestCLINonSelfTestExitTwoPreservesPriorFSM pins the surgical scope
392+
// of the codex P2 v10 fix: non-self-test exit-2 paths (e.g. the
393+
// manifest-floor HLC regression that fails BEFORE writeAndPublish)
394+
// must NOT remove a prior <output>.fsm. Only self-test mismatch
395+
// triggers the cleanup; a runbook recovering from a typo'd
396+
// --last-commit-ts still has its last good FSM on disk.
397+
func TestCLINonSelfTestExitTwoPreservesPriorFSM(t *testing.T) {
398+
t.Parallel()
399+
in := t.TempDir()
400+
emitMinimalManifest(t, in, 1000)
401+
out := filepath.Join(t.TempDir(), "out.fsm")
402+
stalePayload := []byte("STALE FSM FROM PRIOR SUCCESSFUL RUN")
403+
if err := os.WriteFile(out, stalePayload, 0o600); err != nil {
404+
t.Fatalf("WriteFile stale: %v", err)
405+
}
406+
// Manifest-floor regression → exit-2 from resolveLastCommitTS,
407+
// before writeAndPublish runs. Stale .fsm should be preserved.
408+
code, err := run([]string{
409+
"--input", in,
410+
"--output", out,
411+
"--last-commit-ts", "500", // below manifest 1000
412+
}, quietLogger())
413+
if err == nil {
414+
t.Fatalf("run succeeded; want manifest-floor regression")
415+
}
416+
if code != exitDataErr {
417+
t.Errorf("exit code = %d, want %d", code, exitDataErr)
418+
}
419+
body, rerr := os.ReadFile(out)
420+
if rerr != nil {
421+
t.Fatalf("read stale .fsm: %v (must be preserved on non-self-test exit-2)", rerr)
422+
}
423+
if !bytes.Equal(body, stalePayload) {
424+
t.Errorf("stale .fsm mutated; want preserved on manifest-floor regression")
425+
}
426+
}
427+
303428
// TestCLIRoundTripSelfTestAllAdapters is the gold-standard CLI-level
304429
// end-to-end test: a real adapter fixture, encoder runs with
305430
// --self-test, exit 0, matched:true in the sidecar.

internal/backup/encode_snapshot.go

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -135,6 +135,21 @@ type EncodeOptions struct {
135135
corruptBufferForTest func(*os.File)
136136
}
137137

138+
// SetSelfTestCorruptHookForTest installs a same-process hook that
139+
// fires against the on-disk self-test buffer between WriteTo and the
140+
// re-decode call. The hook can WriteAt into the file to inject
141+
// corruption so the subsequent self-test mismatches deterministically.
142+
//
143+
// Production code MUST NOT call this; it is exclusively a test seam
144+
// for callers OUTSIDE package backup (specifically the
145+
// cmd/elastickv-snapshot-encode CLI tests, which need to drive a real
146+
// end-to-end self-test mismatch to verify the stale-.fsm cleanup
147+
// path — codex P2 v10 #904). In-package tests should set
148+
// EncodeOptions.corruptBufferForTest directly.
149+
func (o *EncodeOptions) SetSelfTestCorruptHookForTest(hook func(*os.File)) {
150+
o.corruptBufferForTest = hook
151+
}
152+
138153
// EncodeResult is the public return value from EncodeSnapshot. Mirrors
139154
// the decoder's DecodeResult shape.
140155
type EncodeResult struct {

0 commit comments

Comments
 (0)