Skip to content

Commit 48bbcc4

Browse files
committed
kv(composed1): ABORT bypass + happy-path test + doc fix (round-3 on PR #895)
Three remaining items from claude's round-2 follow-up review + the gemini HIGH-priority finding that was missed in round 2: * gemini HIGH (kv/fsm.go) — ABORT requests MUST bypass the Composed-1 gate. If a route shifted between PREPARE and ABORT (or the observed version was evicted), a rejected ABORT would leave the txn's intent locks pinned in MVCC until LockResolver's TTL — minutes of write-blocked keys for what should be a one-RPC cleanup. Production callers (transaction.go, sharded_coordinator.go) happen to leave ObservedRouteVersion=0 on ABORT requests, so today the existing observedVer==0 short-circuit already protects them. But that is enforcement-by-convention; a future ABORT construction site that copies a COMMIT struct by mistake would silently start failing aborts. Fix: add `if r.GetPhase() == pb.Phase_ABORT { return nil }` as the FIRST guard in verifyComposed1, before the routes/shardGroupID check. Enforcement-in-code rather than enforcement-by-convention. Caller audit (the loop directive's semantic-change check): verifyComposed1 is only called from handleTxnRequest, which treats nil as success. The ABORT bypass returns nil, so the caller's semantics are preserved. No further wiring change. * claude follow-up #1 (distribution/engine.go) — The SetHistoryDepthForTest doc comment said "no internal synchronisation here" even though the function acquired e.mu.Lock. Future readers would see the contradiction and not know which was authoritative. Rewrote the comment to describe the actual contract: the write is lock-protected, but callers should still set depth before sharing the Engine with concurrent SnapshotAt/Current readers to avoid interleaving surprises around the eviction watermark. * claude follow-up #2 (kv/fsm_composed1_test.go) — The gate suite had no happy-path test. An off-by-one in OwnerOf, a reversed comparison, or a sign-flip in the group-ID match would silently false-reject every valid commit and the failure-only suite would not catch it. Added TestVerifyComposed1_ValidOwnershipPassesGate: this FSM serves the group that owns the key at BOTH the observed version AND the current catalog version, so the gate must return nil. * claude follow-up nit (kv/fsm_composed1_test.go) — The `//nolint:unparam` annotation on newComposed1FSM is now redundant: the new ABORT and shardGroupID=0 tests pass distinct shardGroupID values, so golangci-lint no longer flags the parameter as constant. Removed (verified with `make lint` → 0 issues). Also added TestVerifyComposed1_AbortPhaseSkipsGate as the regression for the ABORT bypass: deliberately sets ObservedRouteVersion=1 (non-zero) on the ABORT to defeat the existing observedVer==0 short-circuit and isolate the new Phase_ABORT guard. Verification: go test -race -count=1 ./kv ./distribution → pass. make lint → 0 issues.
1 parent 6108626 commit 48bbcc4

3 files changed

Lines changed: 93 additions & 12 deletions

File tree

distribution/engine.go

Lines changed: 9 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -209,13 +209,15 @@ func (e *Engine) SnapshotAt(v uint64) (RouteHistorySnapshot, bool) {
209209
}
210210

211211
// SetHistoryDepthForTest overrides the FIFO ring depth from outside
212-
// the package. Test-only — callers MUST set the depth before the
213-
// Engine is shared with any concurrent reader (no internal
214-
// synchronisation here for the same reason TestEngineSnapshotAt_FIFOEviction
215-
// does the direct field write in-package; this seam exposes the
216-
// equivalent capability to external test packages that need a
217-
// small depth to exercise eviction without overwhelming TLC-style
218-
// bounded scenarios — claude review on PR #894).
212+
// the package. Test-only. Callers should set the depth before
213+
// sharing the Engine with concurrent SnapshotAt/Current readers to
214+
// avoid interleaving surprises around the eviction watermark, but
215+
// the write itself is lock-protected (e.mu.Lock below) so it is
216+
// safe to call from any goroutine that does not also expect a
217+
// consistent SnapshotAt view across the depth change. Exists so
218+
// tests in the kv package can drive eviction-trigger scenarios
219+
// without adding a constructor option just for tests (claude
220+
// review on PR #894).
219221
//
220222
// Production code must use DefaultRouteHistoryDepth (32) or a
221223
// future operator-exposed config knob; this seam is build-time

kv/fsm.go

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -580,6 +580,20 @@ func (f *kvFSM) handleTxnRequest(ctx context.Context, r *pb.Request, commitTS ui
580580
// outside the ring (M4 retry), and ErrComposed1Violation wrapped
581581
// with per-key context otherwise.
582582
func (f *kvFSM) verifyComposed1(r *pb.Request) error {
583+
// ABORT requests MUST always reach the abort handler so the
584+
// txn's intent locks get released. If a route shifted between
585+
// PREPARE and ABORT (or the observed version was evicted), a
586+
// rejected ABORT would leave the locks pinned until LockResolver
587+
// noticed at TTL expiry — minutes of write-blocked keys for
588+
// what should be a one-RPC cleanup. Production callers carry
589+
// ObservedRouteVersion=0 on ABORT (the existing observedVer==0
590+
// check below handles that case), so this guard is defensive
591+
// belt-and-suspenders: any future ABORT construction site that
592+
// inadvertently sets the version still bypasses the gate
593+
// (gemini HIGH on PR #895 — fail-open on ABORT).
594+
if r.GetPhase() == pb.Phase_ABORT {
595+
return nil
596+
}
583597
// Bypass the gate when EITHER the route history is unwired OR
584598
// the FSM's shardGroupID is the unset sentinel (0). Both
585599
// fields are documented as "unset ⇒ short-circuit" but the

kv/fsm_composed1_test.go

Lines changed: 70 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -28,11 +28,6 @@ func applyComposed1Snapshot(t *testing.T, e *distribution.Engine, version uint64
2828
// shard group ID the gate compares against. Production wiring lives
2929
// in main.go's buildShardGroups; this helper short-circuits to the
3030
// test-only fixture without spinning up a real Raft group.
31-
//
32-
// the helper keeps it as a parameter so a future test can exercise
33-
// the "wrong-group" case without re-deriving the boilerplate.
34-
//
35-
//nolint:unparam // shardGroupID is currently always 1 in tests but
3631
func newComposed1FSM(t *testing.T, e *distribution.Engine, shardGroupID uint64) *kvFSM {
3732
t.Helper()
3833
fsmIface := NewKvFSMWithHLC(store.NewMVCCStore(), NewHLC(),
@@ -256,3 +251,73 @@ func TestVerifyComposed1_ShardGroupIDZeroSkipsGate(t *testing.T) {
256251
"shardGroupID=0 must bypass before any ring lookup, so ErrComposed1VersionGCd must not surface either")
257252
}
258253
}
254+
255+
// TestVerifyComposed1_ValidOwnershipPassesGate is the happy-path
256+
// witness the gate suite was missing (claude follow-up review on
257+
// PR #895): when this FSM serves the group that owns the key at
258+
// BOTH the txn's observed version AND the current catalog version,
259+
// verifyComposed1 returns nil. Without this test an off-by-one
260+
// in OwnerOf, a reversed comparison, or an inadvertent sign-flip
261+
// in the group-ID match would silently false-reject every valid
262+
// commit and the failure-only suite would not catch it.
263+
func TestVerifyComposed1_ValidOwnershipPassesGate(t *testing.T) {
264+
t.Parallel()
265+
266+
// At v=1 AND v=2, key "k" is owned by group 1. An FSM serving
267+
// group 1 must allow a commit pinned at v=1 to pass both the
268+
// observed-version check (routes[1][k] = g1) and the
269+
// current-version fence (routes[2][k] = g1).
270+
e := distribution.NewEngine()
271+
applyComposed1Snapshot(t, e, 1, []distribution.RouteDescriptor{
272+
{RouteID: 100, Start: []byte(""), End: nil, GroupID: 1, State: distribution.RouteStateActive},
273+
})
274+
applyComposed1Snapshot(t, e, 2, []distribution.RouteDescriptor{
275+
{RouteID: 101, Start: []byte(""), End: nil, GroupID: 1, State: distribution.RouteStateActive},
276+
})
277+
fsm := newComposed1FSM(t, e, 1)
278+
279+
require.NoError(t, fsm.verifyComposed1(commitTxnRequest(1, "k")),
280+
"verifyComposed1 must return nil when this FSM's shardGroupID matches the owner at BOTH the observed version and the current catalog version — covers the happy path neither the failure tests nor the bypass tests exercise")
281+
}
282+
283+
// TestVerifyComposed1_AbortPhaseSkipsGate is the regression for the
284+
// gemini HIGH finding on PR #895: ABORT requests MUST bypass the
285+
// Composed-1 gate so that a route shift between PREPARE and ABORT
286+
// cannot strand the txn's intent locks. Without the early
287+
// Phase_ABORT short-circuit, an ABORT that arrives at the wrong
288+
// group after a MoveRange would fail with ErrComposed1Violation and
289+
// the locks would remain pinned until LockResolver's TTL — minutes
290+
// of write-blocked keys for what should be a one-RPC cleanup.
291+
//
292+
// The production callers (transaction.go and sharded_coordinator.go)
293+
// happen to leave ObservedRouteVersion=0 on ABORT, so today the
294+
// existing observedVer==0 branch already protects them, but that's
295+
// enforcement-by-convention. The Phase_ABORT guard is
296+
// enforcement-in-code: a future ABORT construction site that
297+
// inadvertently copies a COMMIT request struct would still bypass.
298+
func TestVerifyComposed1_AbortPhaseSkipsGate(t *testing.T) {
299+
t.Parallel()
300+
301+
// Wire the engine so that, absent the ABORT bypass, the gate
302+
// would reject: key "k" is owned by group 5 at v=1 but the
303+
// FSM serves group 1. The test deliberately sets
304+
// ObservedRouteVersion=1 (non-zero) on the ABORT to defeat the
305+
// existing observedVer==0 short-circuit and isolate the
306+
// Phase_ABORT guard.
307+
e := distribution.NewEngine()
308+
applyComposed1Snapshot(t, e, 1, []distribution.RouteDescriptor{
309+
{RouteID: 100, Start: []byte(""), End: nil, GroupID: 5, State: distribution.RouteStateActive},
310+
})
311+
fsm := newComposed1FSM(t, e, 1)
312+
313+
abortReq := &pb.Request{
314+
IsTxn: true,
315+
Phase: pb.Phase_ABORT,
316+
ObservedRouteVersion: 1,
317+
Mutations: []*pb.Mutation{
318+
{Op: pb.Op_DEL, Key: []byte("k")},
319+
},
320+
}
321+
require.NoError(t, fsm.verifyComposed1(abortReq),
322+
"ABORT requests must bypass the Composed-1 gate even with a non-zero ObservedRouteVersion pinned, so a route shift between PREPARE and ABORT cannot strand the txn's intent locks past lock-resolver TTL")
323+
}

0 commit comments

Comments
 (0)