Skip to content

Commit 5c107be

Browse files
committed
perf(redis): bounded Lua VM pool — r2 doc + stale comment cleanup
Round-2 Claude bot review fixes on commit 03736fb: 1. medium: DefaultLuaPoolMaxIdle godoc mis-attribution The luaStatePool security-invariant narrative ran contiguous into the DefaultLuaPoolMaxIdle const doc with no blank line separator, so godoc / gopls attributed the entire ~60-line narrative to the const. Deleted the duplicated narrative (the const now has only its own short doc) and folded the security- invariant content into the type's own doc block instead. Pure docs change; no semantic difference. 2. nit: explicit "intentionally unreachable" on the nil-element guard in get() put() guards `pls == nil || pls.state == nil` before sending, so the channel cannot carry a nil. The defensive nil branch in get() is meant only to convert a hypothetical future refactor's bug into a miss + fresh-alloc rather than a panic — restate the comment so a reader does not try to hit it. 3. low: stale sync.Pool references in test comments redis_lua_pool_test.go had four spots still describing sync.Pool's GC-driven eviction / -race per-P caching as the reason for not asserting pointer identity on a single round- trip. The channel pool is deterministic on a single goroutine; the actual non-determinism is t.Parallel sibling scheduling between put and get. Updated each comment cluster to reflect the new pool's semantics (and dropped the no-longer-meaningful "sync.Pool.New reintroduced" regression hook in favour of the real invariant: "get() must never auto-fill on empty"). Tests still pass; no behaviour change. Test: go test -race -count=1 -run "TestLua_Pool|TestLua_VMReuse" ./adapter -- green. golangci-lint run --config=.golangci.yaml ./adapter/ . -- 0 issues. go vet ./... -- clean.
1 parent 03736fb commit 5c107be

2 files changed

Lines changed: 83 additions & 88 deletions

File tree

adapter/redis_lua_pool.go

Lines changed: 45 additions & 57 deletions
Original file line numberDiff line numberDiff line change
@@ -33,61 +33,6 @@ const luaResetKeySlack = 8
3333
// table, redis, cjson, cmsgpack).
3434
const luaWhitelistedTableHint = 8
3535

36-
// luaStatePool pools *lua.LState instances to cut heap/GC pressure on
37-
// high-rate EVAL / EVALSHA workloads (e.g. BullMQ ~10 scripts/s, where
38-
// each fresh state allocs ~34% of in-use heap via newFuncContext,
39-
// newRegistry, newFunctionProto).
40-
//
41-
// Security invariant: no state must leak between scripts. Each pooled
42-
// state is initialised with a fixed set of base globals (redis, cjson,
43-
// cmsgpack, table/string/math + base lib helpers, and nil-ed loaders).
44-
// Three snapshots are captured at construction time:
45-
//
46-
// - globalsSnapshot: the full (*any*-keyed) _G map at init. Using an
47-
// LValue-keyed map lets the reset path catch non-string-keyed
48-
// leaks like `_G[42] = "secret"` or `_G[true] = "bad"`, which
49-
// would otherwise survive a naive string-only wipe.
50-
// - tableSnapshots: a shallow map from each whitelisted nested
51-
// table (string, math, table, redis, cjson, cmsgpack) to its
52-
// init-time field set. This is what blocks table-poisoning
53-
// attacks such as `string.upper = function() return "pwned" end`
54-
// -- merely restoring the `string` *reference* on _G would leave
55-
// the shared table's fields still mutated.
56-
// - metatableSnapshots: the init-time raw metatable of _G plus of
57-
// every whitelisted nested table. Without this, a script calling
58-
// `setmetatable(_G, { __index = function() return "pwned" end })`
59-
// could leak a poisoned fallback into the next pooled eval via
60-
// any undefined-global access. Same risk for `setmetatable(string,
61-
// ...)` etc.
62-
//
63-
// On release, the reset routine
64-
//
65-
// 1. restores the raw metatable of _G and every whitelisted table
66-
// (LNil if there was none originally), neutering setmetatable
67-
// poisoning,
68-
// 2. walks each snapshotted nested table and restores its contents
69-
// (deletes script-added fields, rebinds original fields),
70-
// 3. walks the current global table and deletes every key -- of any
71-
// type -- that is not present in the globals snapshot (removes
72-
// user-added globals such as KEYS, ARGV, GLOBAL_LEAK, _G[42]),
73-
// and
74-
// 4. restores every globals-snapshot key to its original value (so a
75-
// script that did `table = nil` or `redis = evil` cannot poison
76-
// the next script).
77-
//
78-
// Additionally the value stack is truncated to 0 and the script
79-
// context binding is cleared so the redis.call/pcall closures cannot
80-
// be invoked against a stale context.
81-
//
82-
// The redis / cjson / cmsgpack closures are registered ONCE at pool
83-
// fill time and read the per-eval *luaScriptContext out of each
84-
// state's own Lua registry (see luaCtxRegistryKey / ctxBinding),
85-
// which is set on acquire and cleared on release. Closures that
86-
// would otherwise capture a fresh context per eval no longer need
87-
// to be re-registered, which is what makes pooling safe and cheap.
88-
// The registry-backed binding is also the reason redis.call is
89-
// lock-free in the hot path, unlike the first iteration which used
90-
// a package-level map guarded by sync.RWMutex.
9136
// DefaultLuaPoolMaxIdle is the default upper bound on idle pooled
9237
// *lua.LState instances retained for reuse. Each pooled state holds
9338
// the base stdlib + redis/cjson/cmsgpack closures + per-state
@@ -125,6 +70,44 @@ const DefaultLuaPoolMaxIdle = 64
12570
// comparable to sync.Pool's per-P slabs and well below the cost of
12671
// a single Lua eval. See TestLua_PoolBoundedOverflow for the
12772
// invariants.
73+
//
74+
// Security invariant: no state must leak between scripts. Each
75+
// pooled state is initialised with a fixed set of base globals
76+
// (redis, cjson, cmsgpack, table/string/math + base lib helpers,
77+
// and nil-ed loaders). Three snapshots — captured per-state at
78+
// construction and stored on pooledLuaState — back the reset path:
79+
// globalsSnapshot (the full LValue-keyed _G map, so non-string-keyed
80+
// leaks like `_G[42] = "secret"` are caught), tableSnapshots
81+
// (shallow field sets of the whitelisted nested tables, so
82+
// `string.upper = function() return "pwned" end` cannot poison
83+
// reuse), and metatableSnapshots (the init-time raw metatable of
84+
// _G plus each whitelisted nested table, so a script-installed
85+
// `setmetatable(_G, { __index = function() … end })` does not
86+
// leak across evals).
87+
//
88+
// On release, the reset routine
89+
//
90+
// 1. restores the raw metatable of _G and every whitelisted table
91+
// (LNil if there was none originally), neutering setmetatable
92+
// poisoning,
93+
// 2. walks each snapshotted nested table and restores its contents
94+
// (deletes script-added fields, rebinds original fields),
95+
// 3. walks the current global table and deletes every key — of any
96+
// type — that is not present in the globals snapshot (removes
97+
// user-added globals such as KEYS, ARGV, GLOBAL_LEAK, _G[42]),
98+
// and
99+
// 4. restores every globals-snapshot key to its original value (so
100+
// a script that did `table = nil` or `redis = evil` cannot
101+
// poison the next script).
102+
//
103+
// The value stack is also truncated to 0 and the script-context
104+
// binding is cleared so the redis.call/pcall closures cannot be
105+
// invoked against a stale context. Those closures are registered
106+
// ONCE at pool fill time and read the per-eval *luaScriptContext
107+
// out of each state's own Lua registry (see luaCtxRegistryKey /
108+
// pooledLuaState.ctxBinding); this is what keeps redis.call
109+
// lock-free on the hot path, unlike the first iteration which used
110+
// a package-level map guarded by sync.RWMutex.
128111
type luaStatePool struct {
129112
idle chan *pooledLuaState
130113
maxIdle int
@@ -548,8 +531,13 @@ func (p *luaStatePool) get(ctx *luaScriptContext) *pooledLuaState {
548531
if pls != nil {
549532
p.hits.Add(1)
550533
} else {
551-
// Defence in depth: a nil element on the channel is
552-
// treated as an allocation miss rather than a panic.
534+
// Intentionally-unreachable defence in depth:
535+
// put() above already guards `if pls == nil ||
536+
// pls.state == nil { return }`, so no caller can
537+
// enqueue a nil. If a future refactor breaks that
538+
// invariant the nil arrives here as a miss + fresh
539+
// allocation rather than a runtime panic. Not a
540+
// branch a reader should try to hit.
553541
p.misses.Add(1)
554542
pls = newPooledLuaState()
555543
}

adapter/redis_lua_pool_test.go

Lines changed: 38 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -78,12 +78,14 @@ func TestLua_VMReuseDoesNotLeakGlobals(t *testing.T) {
7878
pool.put(plsA)
7979

8080
// --- Script B: same pool, no leak -----------------------------
81-
// sync.Pool is free to allocate a fresh item even immediately
82-
// after a put under race/GC, so we do not assert pointer
83-
// identity here. To assert the pool is effective at all, see
84-
// TestLua_PoolRecordsReuseVsAllocation which uses the hit counter.
85-
// What we DO assert is the security invariant: whichever state
86-
// we got, it must not observe the leaked globals from script A.
81+
// We avoid asserting pointer identity here because the channel
82+
// pool's hit/miss outcome under concurrent get/put can race
83+
// with other parallel sub-tests scheduled by t.Parallel; for
84+
// the deterministic effectiveness check see
85+
// TestLua_PoolRecordsReuseVsAllocation, which uses the hit
86+
// counter directly. What we DO assert is the security
87+
// invariant: whichever state we got, it must not observe the
88+
// leaked globals from script A.
8789
_ = ptrA
8890
plsB := pool.get(nil)
8991
stateB := plsB.state
@@ -109,10 +111,10 @@ func TestLua_VMReuseDoesNotLeakGlobals(t *testing.T) {
109111
pool.put(plsB)
110112

111113
// NOTE: we intentionally do NOT assert pool.Hits() >= 1 here.
112-
// As noted at line 81, sync.Pool may evict items under GC pressure,
113-
// making a single-iteration hit assertion non-deterministic.
114-
// Pool effectiveness is covered by TestLua_PoolRecordsReuseVsAllocation,
115-
// which uses a loop to ensure reuse occurs.
114+
// As noted above, parallel sub-test scheduling can racily steer
115+
// the second get() to a fresh allocation; pool effectiveness is
116+
// covered by TestLua_PoolRecordsReuseVsAllocation, which uses
117+
// a loop to ensure reuse occurs.
116118
}
117119

118120
// TestLua_VMReuseRestoresRebindsWhitelistedGlobals guards against a
@@ -140,9 +142,11 @@ func TestLua_VMReuseRestoresRebindsWhitelistedGlobals(t *testing.T) {
140142

141143
// TestLua_PoolSerialAcquireReusesState verifies the pool serves
142144
// existing *lua.LState instances in sequential acquire/release cycles
143-
// -- the knob we care about for the heap-pressure win. sync.Pool is
144-
// free to reclaim under GC pressure, so we cannot assert on the exact
145-
// pointer; instead we count hits vs misses via the test hook.
145+
// — the knob we care about for the heap-pressure win. The channel
146+
// pool is deterministic on a single goroutine, but the test runs
147+
// under t.Parallel, so we assert via the hit counter rather than
148+
// pointer identity (a sibling test could in principle pre-empt
149+
// between get and put).
146150
func TestLua_PoolSerialAcquireReusesState(t *testing.T) {
147151
t.Parallel()
148152

@@ -159,28 +163,31 @@ func TestLua_PoolSerialAcquireReusesState(t *testing.T) {
159163
// At least one hit proves the pool is actually handing back an
160164
// existing VM rather than minting a new one every time.
161165
require.GreaterOrEqual(t, pool.Hits(), uint64(1),
162-
"pool never reported a hit; sync.Pool reuse not happening")
166+
"pool never reported a hit; channel-pool reuse not happening")
163167
}
164168

165169
// TestLua_PoolRecordsReuseVsAllocation pins down the "is the pool
166170
// actually doing anything?" question via the hit/miss counters. The
167-
// test guards against the subtle regression where sync.Pool.New is
168-
// (re-)configured: with a New func set, p.pool.Get() on an empty
169-
// pool would auto-construct and never return nil, so hit/miss
170-
// tracking would be meaningless. Two sub-scenarios are exercised:
171+
// test guards against a subtle regression where get() auto-fills on
172+
// empty (e.g. a hypothetical "warm pool eagerly" refactor): with
173+
// auto-fill, the first get on a brand-new pool would never record a
174+
// miss and the hit/miss accounting would silently break. Two
175+
// sub-scenarios are exercised:
171176
//
172177
// 1. Miss branch: a get() on a brand-new pool has nothing to hand
173178
// out. It must increment the miss counter (fresh allocation) and
174-
// leave hits at zero. This is deterministic -- sync.Pool's own
175-
// scheduling cannot turn an empty pool into a non-empty one.
179+
// leave hits at zero. Channel recv on an empty buffered channel
180+
// fires the select default deterministically, so this branch is
181+
// race-free.
176182
// 2. Hit branch: after many put/get cycles at least one acquire
177-
// must actually be served from the pool. sync.Pool under -race
178-
// randomises per-P caching and can drop items, so we cannot
179-
// assert on a single put/get round-trip; instead we run a loop
180-
// large enough that the probability of zero reuse is negligible.
183+
// must actually be served from the pool. The pool is bounded
184+
// and deterministic on a single goroutine, but t.Parallel can
185+
// race concurrent sub-tests between put and get; the loop
186+
// amortises that to a near-certainty rather than relying on a
187+
// single round-trip.
181188
//
182-
// If sync.Pool.New were accidentally re-introduced, the miss branch
183-
// (step 1) would fail immediately: Misses would be 0, Hits would be 1.
189+
// If get() ever started auto-filling, the miss-branch assertion
190+
// would fail immediately (Misses == 0, Hits == 1).
184191
func TestLua_PoolRecordsReuseVsAllocation(t *testing.T) {
185192
t.Parallel()
186193

@@ -190,16 +197,16 @@ func TestLua_PoolRecordsReuseVsAllocation(t *testing.T) {
190197
plsA := pool.get(nil)
191198
require.NotNil(t, plsA, "get on empty pool must allocate a fresh state, not return nil")
192199
require.Equal(t, uint64(0), pool.Hits(),
193-
"empty pool must not record a hit on first acquire -- sync.Pool.New likely reintroduced")
200+
"empty pool must not record a hit on first acquire — auto-fill leaked into get()")
194201
require.Equal(t, uint64(1), pool.Misses(),
195202
"empty pool must record exactly one miss on first acquire")
196203
pool.put(plsA)
197204

198205
// Scenario 2: with the state now available, a loop of get/put
199-
// cycles must observe at least one genuine reuse. We cannot
200-
// assert on a single round-trip because sync.Pool under -race
201-
// may drop the freshly-put item from the local P cache; over
202-
// many iterations, however, at least one must be served.
206+
// cycles must observe at least one genuine reuse. Single-
207+
// goroutine reuse is deterministic for the channel pool, but
208+
// t.Parallel can race sibling sub-tests between our put and
209+
// get; the loop amortises that effect over many iterations.
203210
const iters = 500
204211
for i := 0; i < iters; i++ {
205212
pool.put(pool.get(nil))

0 commit comments

Comments
 (0)