Skip to content

Commit 128e16c

Browse files
authored
feat(redis): flip one-phase txn dedup default to on (closes #937) (#943)
## Summary Flips `adapter.RedisServer.onePhaseTxnDedup` from default-off to default-on, closing the rollout of the option-2 idempotency design landed by [`docs/design/2026_05_21_proposed_txn_secondary_idempotency.md`](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md). No new mechanism — the FSM probe (`dedupProbeOnePhase`), the wire change (`TxnMeta.PrevCommitTS`, V2-only when non-zero), and every adapter-side reuse path already ship and are exercised every day by the dedup-mode Jepsen workflow. Closes #937. ## Authorization The parent design's `M4` 7-day criterion ([source](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_21_proposed_txn_secondary_idempotency.md#m4--validation)) reads: > 7 consecutive days without `:duplicate-elements` / `:G-single-item-realtime` in the dedup-mode workflow. [`jepsen-test-scheduled-dedup.yml`](https://github.com/bootjp/elastickv/blob/main/.github/workflows/jepsen-test-scheduled-dedup.yml) (Redis stress profile, `ELASTICKV_REDIS_ONEPHASE_DEDUP=1`) has produced **12 consecutive green runs over 10 calendar days, 2026-05-31 → 2026-06-10**. Run IDs are listed in the new design doc; the most recent are: | Date (UTC) | Run | Conclusion | |---|---|---| | 2026-06-10 04:50 | [27253991270](https://github.com/bootjp/elastickv/actions/runs/27253991270) | success | | 2026-06-09 04:44 | [27184402445](https://github.com/bootjp/elastickv/actions/runs/27184402445) | success | | 2026-06-08 04:57 | [27116904871](https://github.com/bootjp/elastickv/actions/runs/27116904871) | success | | 2026-06-07 04:54 | [27083142341](https://github.com/bootjp/elastickv/actions/runs/27083142341) | success | | 2026-06-06 04:40 | [27052820868](https://github.com/bootjp/elastickv/actions/runs/27052820868) | success | | 2026-06-05 04:49 | [26996058409](https://github.com/bootjp/elastickv/actions/runs/26996058409) | success | | 2026-06-04 04:57 | [26931749014](https://github.com/bootjp/elastickv/actions/runs/26931749014) | success | | 2026-06-04 04:16 | [26930308744](https://github.com/bootjp/elastickv/actions/runs/26930308744) | success | | 2026-06-03 04:58 | [26864667493](https://github.com/bootjp/elastickv/actions/runs/26864667493) | success | | ...(2 more) | | success | Exceeds the 7-day threshold. The parent's R5 (FSM determinism across a rolling upgrade) is discharged — the probe code has shipped on every production node for months. ## Change | File | Change | |---|---| | `docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md` | New design doc (commit 1) recording M4 evidence and the rollout. | | `adapter/redis.go` | Flips env-var sense from `== "1"` to `!= "0"`; updates the comment, godoc, and field comment. | | `.github/workflows/jepsen-test-scheduled.yml` | Adds explicit `ELASTICKV_REDIS_ONEPHASE_DEDUP: "0"` to the job env so the control baseline keeps its meaning across the default change. Retirement of the control workflow is a 30-day follow-up. | ## Behaviour change - **Default**: dedup gate is now **on** for any process that does not opt out. - **Opt-out**: `ELASTICKV_REDIS_ONEPHASE_DEDUP=0` (single-env-var rollback, no binary revert). The constructor option `WithOnePhaseTxnDedup(false)` still trumps the env. - **Existing opt-in users**: `=1` (the dedup-mode workflow's existing setting) continues to resolve to `true`. No breakage. ## Risk | Concern | Disposition | |---|---| | Data loss | None — the dedup path only short-circuits on a confirmed prior commit at the exact `prev_commit_ts` (the parent design's R1 + the existing `kv/fsm.go` probe). No write is dropped. | | Concurrency / distributed failures | None — every node already runs the probe code; no new fanout, no new round-trip. R5 discharged as above. | | Performance | Default-on adds the `PrevCommitTS` probe on each retryable attempt. Cost is one `CommittedVersionAt` MVCC lookup at the primary key + stale `commit_ts`; this is the same cost the dedup-mode workflow has been paying for 12 days without regression. No hot-path allocation change. | | Data consistency | Improved — the failure modes the parent design predicted (`:duplicate-elements`, `:future-read`, `:G-single-item-realtime`) become unreachable on the protected path. Issue #937 is the most recent control-baseline reproduction. | | Test coverage | No new branches added; existing `adapter/redis_*_dedup_test.go` suites cover both gate states. Tests that construct `&RedisServer{...}` directly (zero-value `onePhaseTxnDedup`) still observe the legacy path, so the off-path remains covered at unit level. | ## Verification - `go vet ./...` — clean - `go build ./...` — clean - `go test -run 'Dedup|OnePhase|PrevCommit|Idempot' ./adapter/` — pass (1.2s) - `go test -run 'TestRedis|TestList|TestSet|TestZSet|TestStream|TestExec|TestMulti' ./adapter/` — pass (169s) - `golangci-lint run ./adapter/...` — 0 issues ## Follow-ups - **30 days post-merge** — review whether `jepsen-test-scheduled.yml` (now explicit dedup-off) still adds signal. If not, retire it. - **DynamoDB default-on** — parallel work tracked by [`2026_06_03_partial_dynamodb_onephase_dedup.md`](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_06_03_partial_dynamodb_onephase_dedup.md)'s `M2`. Not part of this PR. - **S3, SQS dedup paths** — not yet routed through one-phase reuse; out of scope for the parent design. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added a design doc describing a planned default-on change for Redis transaction dedup and rollout/verification plans. * **Chores** * Flipped adapter default behavior so transaction dedup is enabled by default, with an environment opt-out preserved. * Introduced a separate opt-in gate for standalone SET dedup to avoid unintended behavior changes. * **Tests** * Updated standalone SET tests and added a regression test to verify overwrite semantics remain correct under current defaults. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
2 parents bd597b5 + a5909eb commit 128e16c

5 files changed

Lines changed: 395 additions & 21 deletions

File tree

.github/workflows/jepsen-test-scheduled.yml

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,19 @@ jobs:
3939
runs-on: ubuntu-latest
4040
env:
4141
GOCACHE: /tmp/go-build
42+
# Explicit dedup-OFF control baseline. The Redis adapter's
43+
# onePhaseTxnDedup flipped to default-on in
44+
# docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md;
45+
# this workflow is preserved as the legacy-path coverage so anomalies
46+
# the dedup gate prevents (`:duplicate-elements`, `:future-read`,
47+
# `:G-single-item-realtime`) continue to be measured against an
48+
# unprotected build. Pair with the dedup-ON workflow
49+
# (.github/workflows/jepsen-test-scheduled-dedup.yml) which sets
50+
# ELASTICKV_REDIS_ONEPHASE_DEDUP=1 explicitly. Retirement of this
51+
# workflow is a follow-up after 30 days of post-flip data; until
52+
# then, do NOT remove this env var — without it the two workflows
53+
# would exercise the same path under the new default.
54+
ELASTICKV_REDIS_ONEPHASE_DEDUP: "0"
4255
steps:
4356
- uses: actions/checkout@v6
4457
with:

adapter/redis.go

Lines changed: 61 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -243,27 +243,59 @@ type RedisServer struct {
243243
// retryable write error, list-push retries reuse the failed attempt's
244244
// write set and carry prev_commit_ts so the FSM can dedup a commit that
245245
// landed under leadership churn (see
246-
// docs/design/2026_05_21_proposed_txn_secondary_idempotency.md). It
247-
// MUST stay off until every node runs a probe-aware binary — see R5
248-
// (FSM determinism across a rolling upgrade). Default off; enabled via
249-
// WithOnePhaseTxnDedup / the ELASTICKV_REDIS_ONEPHASE_DEDUP env var
250-
// after a full rollout.
246+
// docs/design/2026_05_21_proposed_txn_secondary_idempotency.md). The
247+
// FSM probe ships on every node in production, satisfying R5 (FSM
248+
// determinism across a rolling upgrade), so the gate now defaults on
249+
// per docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md.
250+
// Set ELASTICKV_REDIS_ONEPHASE_DEDUP=0 (or WithOnePhaseTxnDedup(false))
251+
// to opt out — kept as a one-env-var operator rollback.
251252
onePhaseTxnDedup bool
252253

254+
// standaloneSetDedup gates whether the *standalone* SET command (not SET
255+
// inside MULTI/EXEC) routes through runTransactionWithDedup. Default off
256+
// because the dedup path's applySet does not match the legacy
257+
// executeSet/replaceWithStringTxn semantics for SET-over-collection: a
258+
// `SET k v` after `RPUSH k x` returns WRONGTYPE on the dedup path but
259+
// overwrites correctly on the legacy path (PR #943 round-1 codex P1).
260+
// Bringing applySet to parity (collection-deletion + string write inside
261+
// the dedup txn) is tracked as a follow-up; until that lands, the
262+
// standalone SET path stays on the legacy default-on-flip code path
263+
// regardless of onePhaseTxnDedup's value. Enable explicitly via
264+
// WithStandaloneSetDedup(true) or ELASTICKV_REDIS_ONEPHASE_DEDUP_SET=1
265+
// only when applySet's parity is verified for the workload at hand.
266+
standaloneSetDedup bool
267+
253268
route map[string]func(conn redcon.Conn, cmd redcon.Command)
254269
}
255270

256271
type RedisServerOption func(*RedisServer)
257272

258-
// WithOnePhaseTxnDedup enables the option-2 one-phase idempotency dedup on
259-
// list-push retries (see RedisServer.onePhaseTxnDedup). Off by default;
260-
// enable only after the whole cluster runs a probe-aware binary.
273+
// WithOnePhaseTxnDedup enables (or disables) the option-2 one-phase
274+
// idempotency dedup on list-push and MULTI/EXEC retries
275+
// (see RedisServer.onePhaseTxnDedup). On by default since the rollout
276+
// recorded in docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md;
277+
// pass false to opt out from code, or set ELASTICKV_REDIS_ONEPHASE_DEDUP=0
278+
// to opt out from the environment. The constructor option trumps the env var.
279+
// Standalone SET requires the separate WithStandaloneSetDedup gate; see
280+
// RedisServer.standaloneSetDedup.
261281
func WithOnePhaseTxnDedup(enabled bool) RedisServerOption {
262282
return func(r *RedisServer) {
263283
r.onePhaseTxnDedup = enabled
264284
}
265285
}
266286

287+
// WithStandaloneSetDedup enables the option-2 dedup path on the *standalone*
288+
// SET command (not SET inside MULTI/EXEC). Off by default because the dedup
289+
// path's applySet does not yet match the legacy executeSet semantics for
290+
// SET-over-collection — see RedisServer.standaloneSetDedup. Enable only
291+
// after verifying applySet parity for the workload (no SET-over-list /
292+
// SET-over-hash / SET-over-set / SET-over-zset / SET-over-stream issued).
293+
func WithStandaloneSetDedup(enabled bool) RedisServerOption {
294+
return func(r *RedisServer) {
295+
r.standaloneSetDedup = enabled
296+
}
297+
}
298+
267299
func WithRedisActiveTimestampTracker(tracker *kv.ActiveTimestampTracker) RedisServerOption {
268300
return func(r *RedisServer) {
269301
r.readTracker = tracker
@@ -495,15 +527,20 @@ func NewRedisServer(listen net.Listener, redisAddr string, store store.MVCCStore
495527
// getLuaPool, which honors luaPoolMaxIdle the same way.
496528
luaPool: nil,
497529
traceCommands: os.Getenv("ELASTICKV_REDIS_TRACE") == "1",
498-
// onePhaseTxnDedup honors the documented opt-in env var; the
499-
// WithOnePhaseTxnDedup option below can still override either way.
500-
// Default off — see R5 in the design doc (the writer must not be
501-
// enabled until the whole cluster runs a probe-aware binary).
502-
onePhaseTxnDedup: os.Getenv("ELASTICKV_REDIS_ONEPHASE_DEDUP") == "1",
503-
baseCtx: baseCtx,
504-
baseCancel: baseCancel,
505-
streamWaiters: newKeyWaiterRegistry(),
506-
zsetWaiters: newKeyWaiterRegistry(),
530+
// onePhaseTxnDedup defaults on — the parent design's R5 rolling-upgrade
531+
// constraint is discharged (FSM probe shipped on every node months ago,
532+
// 12 consecutive green dedup-mode Jepsen runs 2026-05-31 → 2026-06-10).
533+
// See docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md.
534+
// ELASTICKV_REDIS_ONEPHASE_DEDUP=0 opts out; the WithOnePhaseTxnDedup
535+
// constructor option still trumps the env var.
536+
onePhaseTxnDedup: os.Getenv("ELASTICKV_REDIS_ONEPHASE_DEDUP") != "0",
537+
// standaloneSetDedup defaults off; see field comment for the
538+
// applySet-vs-executeSet parity gap that gates this separately.
539+
standaloneSetDedup: os.Getenv("ELASTICKV_REDIS_ONEPHASE_DEDUP_SET") == "1",
540+
baseCtx: baseCtx,
541+
baseCancel: baseCancel,
542+
streamWaiters: newKeyWaiterRegistry(),
543+
zsetWaiters: newKeyWaiterRegistry(),
507544
}
508545
r.relay.Bind(r.publishLocal)
509546

@@ -1131,7 +1168,13 @@ func (r *RedisServer) set(conn redcon.Conn, cmd redcon.Command) {
11311168
// SET there is exactly one element with the same redisResult shape as
11321169
// the standalone reply (resultString OK / resultNil for NX/XX miss /
11331170
// resultBulk for GET).
1134-
if r.onePhaseTxnDedup {
1171+
// Both gates must be on to route standalone SET through the dedup path.
1172+
// onePhaseTxnDedup covers the MULTI/EXEC and list-push retries that the
1173+
// parent design's M4 validated; standaloneSetDedup is a separate sub-gate
1174+
// (default off) because applySet diverges from executeSet on SET-over-
1175+
// collection — flipping onePhaseTxnDedup default-on without this guard
1176+
// would change normal Redis overwrite behaviour (PR #943 round-1 codex P1).
1177+
if r.onePhaseTxnDedup && r.standaloneSetDedup {
11351178
// Call runTransactionWithDedup directly instead of going through
11361179
// runTransaction. runTransaction re-checks the same
11371180
// r.onePhaseTxnDedup gate and routes here anyway; the indirection

adapter/redis_set_dedup_test.go

Lines changed: 18 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,12 @@ func TestStandaloneSetDedup_LandedPriorAttempt_ReturnsOK(t *testing.T) {
2828
ctx := context.Background()
2929
st := store.NewMVCCStore()
3030
coord := newDedupTestCoordinator(st, 1, true) // attempt 1 lands then errors
31-
srv := &RedisServer{store: st, coordinator: coord, scriptCache: map[string]string{}, onePhaseTxnDedup: true}
31+
// Both gates: onePhaseTxnDedup is the umbrella; standaloneSetDedup is
32+
// the per-path sub-gate that opts the *standalone* SET branch into
33+
// the dedup routing. The sub-gate defaults off (PR #943 round-1 codex
34+
// P1 — applySet diverges from executeSet on SET-over-collection).
35+
// Tests that pin the dedup-on routing must explicitly enable both.
36+
srv := &RedisServer{store: st, coordinator: coord, scriptCache: map[string]string{}, onePhaseTxnDedup: true, standaloneSetDedup: true}
3237

3338
conn := &recordingConn{}
3439
cmd := redcon.Command{Args: [][]byte{[]byte(cmdSet), []byte("k"), []byte("v1")}}
@@ -61,7 +66,12 @@ func TestStandaloneSetDedup_NXMissReturnsNil(t *testing.T) {
6166
require.NoError(t, st.PutAt(ctx, redisStrKey([]byte("k")), encodeRedisStr([]byte("seed"), nil), 5, 0))
6267

6368
coord := newDedupTestCoordinator(st, 1, true) // attempt 1 lands then errors
64-
srv := &RedisServer{store: st, coordinator: coord, scriptCache: map[string]string{}, onePhaseTxnDedup: true}
69+
// Both gates: onePhaseTxnDedup is the umbrella; standaloneSetDedup is
70+
// the per-path sub-gate that opts the *standalone* SET branch into
71+
// the dedup routing. The sub-gate defaults off (PR #943 round-1 codex
72+
// P1 — applySet diverges from executeSet on SET-over-collection).
73+
// Tests that pin the dedup-on routing must explicitly enable both.
74+
srv := &RedisServer{store: st, coordinator: coord, scriptCache: map[string]string{}, onePhaseTxnDedup: true, standaloneSetDedup: true}
6575

6676
conn := &recordingConn{}
6777
// SET k v1 NX -- attempt 1 records resultNil because NX miss.
@@ -99,7 +109,12 @@ func TestStandaloneSetDedup_GETOptionReturnsOldBulk(t *testing.T) {
99109
require.NoError(t, st.PutAt(ctx, redisStrKey([]byte("k")), encodeRedisStr([]byte("prior"), nil), 5, 0))
100110

101111
coord := newDedupTestCoordinator(st, 1, true) // attempt 1 lands then errors
102-
srv := &RedisServer{store: st, coordinator: coord, scriptCache: map[string]string{}, onePhaseTxnDedup: true}
112+
// Both gates: onePhaseTxnDedup is the umbrella; standaloneSetDedup is
113+
// the per-path sub-gate that opts the *standalone* SET branch into
114+
// the dedup routing. The sub-gate defaults off (PR #943 round-1 codex
115+
// P1 — applySet diverges from executeSet on SET-over-collection).
116+
// Tests that pin the dedup-on routing must explicitly enable both.
117+
srv := &RedisServer{store: st, coordinator: coord, scriptCache: map[string]string{}, onePhaseTxnDedup: true, standaloneSetDedup: true}
103118

104119
conn := &recordingConn{}
105120
// SET k v1 GET -- attempt 1 records resultBulk("prior").
Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
package adapter
2+
3+
import (
4+
"context"
5+
"testing"
6+
7+
"github.com/redis/go-redis/v9"
8+
"github.com/stretchr/testify/require"
9+
)
10+
11+
// TestRedis_SET_OverwritesList_UnderDefaultGate locks in the legacy
12+
// SET-over-collection overwrite semantics under the new default-on dedup
13+
// gate landed in PR #943. The dedup path's applySet returns WRONGTYPE on
14+
// a SET that hits a key already holding a list/hash/set/zset/stream,
15+
// while the legacy setLegacy / executeSet / replaceWithStringTxn path
16+
// deletes the collection's logical elements and writes the string.
17+
// Codex flagged this as a P1 regression on PR #943 round-1: flipping
18+
// onePhaseTxnDedup default-on without a separate gate on the standalone
19+
// SET path would have changed normal Redis overwrite behaviour. The fix
20+
// is the standaloneSetDedup sub-gate, which defaults off — so
21+
// `SET k v` after `RPUSH k x` must still return OK and let the next
22+
// GET observe the string value, not WRONGTYPE.
23+
//
24+
// This is a regression test (CLAUDE.md self-review §5 + the
25+
// "when code review surfaces a defect, first add a failing test"
26+
// convention). If a future change re-enables standaloneSetDedup as
27+
// default-on without bringing applySet to parity with executeSet, this
28+
// test must fail.
29+
func TestRedis_SET_OverwritesList_UnderDefaultGate(t *testing.T) {
30+
t.Parallel()
31+
nodes, _, _ := createNode(t, 3)
32+
defer shutdown(nodes)
33+
34+
ctx := context.Background()
35+
rdb := redis.NewClient(&redis.Options{Addr: nodes[0].redisAddress})
36+
defer func() { _ = rdb.Close() }()
37+
38+
// Seed the key as a list — the encoding that the dedup path's
39+
// applySet hard-fails against.
40+
require.NoError(t, rdb.Do(ctx, "RPUSH", "setover:list", "elem-a").Err())
41+
42+
// Real Redis behaviour: SET unconditionally replaces the value,
43+
// dropping the previous type. The legacy path implements this; the
44+
// dedup path returns WRONGTYPE. With the default config
45+
// (onePhaseTxnDedup on, standaloneSetDedup off) we must take the
46+
// legacy path and observe OK + string value.
47+
res, err := rdb.Do(ctx, "SET", "setover:list", "replaced").Result()
48+
require.NoError(t, err, "SET must overwrite an existing list under default config")
49+
require.Equal(t, "OK", res)
50+
51+
got, err := rdb.Get(ctx, "setover:list").Result()
52+
require.NoError(t, err, "GET after SET-over-list must succeed")
53+
require.Equal(t, "replaced", got)
54+
}

0 commit comments

Comments
 (0)