perf(redis): bounded Lua VM pool — r2 doc + stale comment cleanup

bootjp · bootjp · commit 5c107be0c814 · 2026-05-18T15:51:27.000+09:00
Round-2 Claude bot review fixes on commit 03736fb:

1. medium: DefaultLuaPoolMaxIdle godoc mis-attribution

   The luaStatePool security-invariant narrative ran contiguous into
   the DefaultLuaPoolMaxIdle const doc with no blank line separator,
   so godoc / gopls attributed the entire ~60-line narrative to the
   const. Deleted the duplicated narrative (the const now has only its
   own short doc) and folded the security-invariant content into the
   type's own doc block instead. Pure docs change; no semantic
   difference.

2. nit: explicit "intentionally unreachable" on the nil-element guard
   in get()

   put() guards `pls == nil || pls.state == nil` before sending, so
   the channel cannot carry a nil. The defensive nil branch in get()
   is meant only to convert a hypothetical future refactor's bug into
   a miss + fresh-alloc rather than a panic; restate the comment so a
   reader does not try to hit it.

3. low: stale sync.Pool references in test comments

   redis_lua_pool_test.go had four spots still describing sync.Pool's
   GC-driven eviction / -race per-P caching as the reason for not
   asserting pointer identity on a single round-trip. The channel pool
   is deterministic on a single goroutine; the actual non-determinism
   is t.Parallel sibling scheduling between put and get. Updated each
   comment cluster to reflect the new pool's semantics (and dropped
   the no-longer-meaningful "sync.Pool.New reintroduced" regression
   hook in favour of the real invariant: "get() must never auto-fill
   on empty"). Tests still pass; no behaviour change.

Test:
  go test -race -count=1 -run "TestLua_Pool|TestLua_VMReuse" ./adapter -- green.
  golangci-lint run --config=.golangci.yaml ./adapter/ . -- 0 issues.
  go vet ./... -- clean.
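
For readers unfamiliar with the pattern, the channel-pool semantics that items 2 and 3 rely on (non-blocking receive, a miss on an empty pool, a put() that refuses nil) can be sketched roughly as follows. This is an illustration only, not the code in adapter/redis_lua_pool.go: vm, boundedPool and newBoundedPool are hypothetical names, and silently dropping a state when the channel is full is an assumption about how the idle bound is enforced.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// vm is a hypothetical stand-in for the pooled Lua state wrapper.
type vm struct{ id int64 }

// boundedPool is an illustrative sketch of a channel-backed pool with
// hit/miss counters; it is not the adapter's luaStatePool.
type boundedPool struct {
	idle   chan *vm
	hits   atomic.Uint64
	misses atomic.Uint64
	nextID atomic.Int64
}

func newBoundedPool(maxIdle int) *boundedPool {
	return &boundedPool{idle: make(chan *vm, maxIdle)}
}

// get never blocks and never pre-fills. On an empty channel the select
// falls through to default, which is counted as a miss.
func (p *boundedPool) get() *vm {
	select {
	case v := <-p.idle:
		if v != nil {
			p.hits.Add(1)
			return v
		}
		// A nil element should be impossible because put() rejects
		// nil; if it ever happens, fall through and count a miss.
	default:
	}
	p.misses.Add(1)
	return &vm{id: p.nextID.Add(1)}
}

// put rejects nil and (by assumption) drops the state when the pool
// already holds maxIdle entries, leaving it to the garbage collector.
func (p *boundedPool) put(v *vm) {
	if v == nil {
		return
	}
	select {
	case p.idle <- v:
	default:
	}
}

func main() {
	p := newBoundedPool(1)
	a := p.get() // empty pool: miss
	p.put(a)
	b := p.get() // same goroutine, nothing else touching p: hit
	fmt.Println(a == b, p.hits.Load(), p.misses.Load())
}
```

On a single goroutine with a private pool the second get() returns the state that was just put back, which is why the first-acquire assertion (Hits == 0, Misses == 1) can be exact.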
diff --git a/adapter/redis_lua_pool.go b/adapter/redis_lua_pool.go
@@ -33,61 +33,6 @@ const luaResetKeySlack = 8
 // table, redis, cjson, cmsgpack).
 const luaWhitelistedTableHint = 8
 
-// luaStatePool pools *lua.LState instances to cut heap/GC pressure on
-// high-rate EVAL / EVALSHA workloads (e.g. BullMQ ~10 scripts/s, where
-// each fresh state allocs ~34% of in-use heap via newFuncContext,
-// newRegistry, newFunctionProto).
-//
-// Security invariant: no state must leak between scripts. Each pooled
-// state is initialised with a fixed set of base globals (redis, cjson,
-// cmsgpack, table/string/math + base lib helpers, and nil-ed loaders).
-// Three snapshots are captured at construction time:
-//
-//   - globalsSnapshot: the full (*any*-keyed) _G map at init. Using an
-//     LValue-keyed map lets the reset path catch non-string-keyed
-//     leaks like `_G[42] = "secret"` or `_G[true] = "bad"`, which
-//     would otherwise survive a naive string-only wipe.
-//   - tableSnapshots: a shallow map from each whitelisted nested
-//     table (string, math, table, redis, cjson, cmsgpack) to its
-//     init-time field set. This is what blocks table-poisoning
-//     attacks such as `string.upper = function() return "pwned" end`
-//     -- merely restoring the `string` *reference* on _G would leave
-//     the shared table's fields still mutated.
-//   - metatableSnapshots: the init-time raw metatable of _G plus of
-//     every whitelisted nested table. Without this, a script calling
-//     `setmetatable(_G, { __index = function() return "pwned" end })`
-//     could leak a poisoned fallback into the next pooled eval via
-//     any undefined-global access. Same risk for `setmetatable(string,
-//     ...)` etc.
-//
-// On release, the reset routine
-//
-//  1. restores the raw metatable of _G and every whitelisted table
-//     (LNil if there was none originally), neutering setmetatable
-//     poisoning,
-//  2. walks each snapshotted nested table and restores its contents
-//     (deletes script-added fields, rebinds original fields),
-//  3. walks the current global table and deletes every key -- of any
-//     type -- that is not present in the globals snapshot (removes
-//     user-added globals such as KEYS, ARGV, GLOBAL_LEAK, _G[42]),
-//     and
-//  4. restores every globals-snapshot key to its original value (so a
-//     script that did `table = nil` or `redis = evil` cannot poison
-//     the next script).
-//
-// Additionally the value stack is truncated to 0 and the script
-// context binding is cleared so the redis.call/pcall closures cannot
-// be invoked against a stale context.
-//
-// The redis / cjson / cmsgpack closures are registered ONCE at pool
-// fill time and read the per-eval *luaScriptContext out of each
-// state's own Lua registry (see luaCtxRegistryKey / ctxBinding),
-// which is set on acquire and cleared on release. Closures that
-// would otherwise capture a fresh context per eval no longer need
-// to be re-registered, which is what makes pooling safe and cheap.
-// The registry-backed binding is also the reason redis.call is
-// lock-free in the hot path, unlike the first iteration which used
-// a package-level map guarded by sync.RWMutex.
 // DefaultLuaPoolMaxIdle is the default upper bound on idle pooled
 // *lua.LState instances retained for reuse. Each pooled state holds
 // the base stdlib + redis/cjson/cmsgpack closures + per-state
@@ -125,6 +70,44 @@ const DefaultLuaPoolMaxIdle = 64
 // comparable to sync.Pool's per-P slabs and well below the cost of
 // a single Lua eval. See TestLua_PoolBoundedOverflow for the
 // invariants.
+//
+// Security invariant: no state must leak between scripts. Each
+// pooled state is initialised with a fixed set of base globals
+// (redis, cjson, cmsgpack, table/string/math + base lib helpers,
+// and nil-ed loaders). Three snapshots — captured per-state at
+// construction and stored on pooledLuaState — back the reset path:
+// globalsSnapshot (the full LValue-keyed _G map, so non-string-keyed
+// leaks like `_G[42] = "secret"` are caught), tableSnapshots
+// (shallow field sets of the whitelisted nested tables, so
+// `string.upper = function() return "pwned" end` cannot poison
+// reuse), and metatableSnapshots (the init-time raw metatable of
+// _G plus each whitelisted nested table, so a script-installed
+// `setmetatable(_G, { __index = function() … end })` does not
+// leak across evals).
+//
+// On release, the reset routine
+//
+//  1. restores the raw metatable of _G and every whitelisted table
+//     (LNil if there was none originally), neutering setmetatable
+//     poisoning,
+//  2. walks each snapshotted nested table and restores its contents
+//     (deletes script-added fields, rebinds original fields),
+//  3. walks the current global table and deletes every key — of any
+//     type — that is not present in the globals snapshot (removes
+//     user-added globals such as KEYS, ARGV, GLOBAL_LEAK, _G[42]),
+//     and
+//  4. restores every globals-snapshot key to its original value (so
+//     a script that did `table = nil` or `redis = evil` cannot
+//     poison the next script).
+//
+// The value stack is also truncated to 0 and the script-context
+// binding is cleared so the redis.call/pcall closures cannot be
+// invoked against a stale context. Those closures are registered
+// ONCE at pool fill time and read the per-eval *luaScriptContext
+// out of each state's own Lua registry (see luaCtxRegistryKey /
+// pooledLuaState.ctxBinding); this is what keeps redis.call
+// lock-free on the hot path, unlike the first iteration which used
+// a package-level map guarded by sync.RWMutex.
 type luaStatePool struct {
 	idle    chan *pooledLuaState
 	maxIdle int
@@ -548,8 +531,13 @@ func (p *luaStatePool) get(ctx *luaScriptContext) *pooledLuaState {
 		if pls != nil {
 			p.hits.Add(1)
 		} else {
-			// Defence in depth: a nil element on the channel is
-			// treated as an allocation miss rather than a panic.
+			// Intentionally-unreachable defence in depth:
+			// put() above already guards `if pls == nil ||
+			// pls.state == nil { return }`, so no caller can
+			// enqueue a nil. If a future refactor breaks that
+			// invariant the nil arrives here as a miss + fresh
+			// allocation rather than a runtime panic. Not a
+			// branch a reader should try to hit.
 			p.misses.Add(1)
 			pls = newPooledLuaState()
 		}
diff --git a/adapter/redis_lua_pool_test.go b/adapter/redis_lua_pool_test.go
@@ -78,12 +78,14 @@ func TestLua_VMReuseDoesNotLeakGlobals(t *testing.T) {
 	pool.put(plsA)
 
 	// --- Script B: same pool, no leak -----------------------------
-	// sync.Pool is free to allocate a fresh item even immediately
-	// after a put under race/GC, so we do not assert pointer
-	// identity here. To assert the pool is effective at all, see
-	// TestLua_PoolRecordsReuseVsAllocation which uses the hit counter.
-	// What we DO assert is the security invariant: whichever state
-	// we got, it must not observe the leaked globals from script A.
+	// We avoid asserting pointer identity here because the channel
+	// pool's hit/miss outcome under concurrent get/put can race
+	// with other parallel sub-tests scheduled by t.Parallel; for
+	// the deterministic effectiveness check see
+	// TestLua_PoolRecordsReuseVsAllocation, which uses the hit
+	// counter directly. What we DO assert is the security
+	// invariant: whichever state we got, it must not observe the
+	// leaked globals from script A.
 	_ = ptrA
 	plsB := pool.get(nil)
 	stateB := plsB.state
@@ -109,10 +111,10 @@ func TestLua_VMReuseDoesNotLeakGlobals(t *testing.T) {
 	pool.put(plsB)
 
 	// NOTE: we intentionally do NOT assert pool.Hits() >= 1 here.
-	// As noted at line 81, sync.Pool may evict items under GC pressure,
-	// making a single-iteration hit assertion non-deterministic.
-	// Pool effectiveness is covered by TestLua_PoolRecordsReuseVsAllocation,
-	// which uses a loop to ensure reuse occurs.
+	// As noted above, parallel sub-test scheduling can racily steer
+	// the second get() to a fresh allocation; pool effectiveness is
+	// covered by TestLua_PoolRecordsReuseVsAllocation, which uses
+	// a loop to ensure reuse occurs.
 }
 
 // TestLua_VMReuseRestoresRebindsWhitelistedGlobals guards against a
@@ -140,9 +142,11 @@ func TestLua_VMReuseRestoresRebindsWhitelistedGlobals(t *testing.T) {
 
 // TestLua_PoolSerialAcquireReusesState verifies the pool serves
 // existing *lua.LState instances in sequential acquire/release cycles
-// -- the knob we care about for the heap-pressure win. sync.Pool is
-// free to reclaim under GC pressure, so we cannot assert on the exact
-// pointer; instead we count hits vs misses via the test hook.
+// — the knob we care about for the heap-pressure win. The channel
+// pool is deterministic on a single goroutine, but the test runs
+// under t.Parallel, so we assert via the hit counter rather than
+// pointer identity (a sibling test could in principle pre-empt
+// between get and put).
 func TestLua_PoolSerialAcquireReusesState(t *testing.T) {
 	t.Parallel()
 
@@ -159,28 +163,31 @@ func TestLua_PoolSerialAcquireReusesState(t *testing.T) {
 	// At least one hit proves the pool is actually handing back an
 	// existing VM rather than minting a new one every time.
 	require.GreaterOrEqual(t, pool.Hits(), uint64(1),
-		"pool never reported a hit; sync.Pool reuse not happening")
+		"pool never reported a hit; channel-pool reuse not happening")
 }
 
 // TestLua_PoolRecordsReuseVsAllocation pins down the "is the pool
 // actually doing anything?" question via the hit/miss counters. The
-// test guards against the subtle regression where sync.Pool.New is
-// (re-)configured: with a New func set, p.pool.Get() on an empty
-// pool would auto-construct and never return nil, so hit/miss
-// tracking would be meaningless. Two sub-scenarios are exercised:
+// test guards against a subtle regression where get() auto-fills on
+// empty (e.g. a hypothetical "warm pool eagerly" refactor): with
+// auto-fill, the first get on a brand-new pool would never record a
+// miss and the hit/miss accounting would silently break. Two
+// sub-scenarios are exercised:
 //
 //  1. Miss branch: a get() on a brand-new pool has nothing to hand
 //     out. It must increment the miss counter (fresh allocation) and
-//     leave hits at zero. This is deterministic -- sync.Pool's own
-//     scheduling cannot turn an empty pool into a non-empty one.
+//     leave hits at zero. Channel recv on an empty buffered channel
+//     fires the select default deterministically, so this branch is
+//     race-free.
 //  2. Hit branch: after many put/get cycles at least one acquire
-//     must actually be served from the pool. sync.Pool under -race
-//     randomises per-P caching and can drop items, so we cannot
-//     assert on a single put/get round-trip; instead we run a loop
-//     large enough that the probability of zero reuse is negligible.
+//     must actually be served from the pool. The pool is bounded
+//     and deterministic on a single goroutine, but t.Parallel can
+//     race concurrent sub-tests between put and get; the loop
+//     amortises that to a near-certainty rather than relying on a
+//     single round-trip.
 //
-// If sync.Pool.New were accidentally re-introduced, the miss branch
-// (step 1) would fail immediately: Misses would be 0, Hits would be 1.
+// If get() ever started auto-filling, the miss-branch assertion
+// would fail immediately (Misses == 0, Hits == 1).
 func TestLua_PoolRecordsReuseVsAllocation(t *testing.T) {
 	t.Parallel()
 
@@ -190,16 +197,16 @@ func TestLua_PoolRecordsReuseVsAllocation(t *testing.T) {
 	plsA := pool.get(nil)
 	require.NotNil(t, plsA, "get on empty pool must allocate a fresh state, not return nil")
 	require.Equal(t, uint64(0), pool.Hits(),
-		"empty pool must not record a hit on first acquire -- sync.Pool.New likely reintroduced")
+		"empty pool must not record a hit on first acquire — auto-fill leaked into get()")
 	require.Equal(t, uint64(1), pool.Misses(),
 		"empty pool must record exactly one miss on first acquire")
 	pool.put(plsA)
 
 	// Scenario 2: with the state now available, a loop of get/put
-	// cycles must observe at least one genuine reuse. We cannot
-	// assert on a single round-trip because sync.Pool under -race
-	// may drop the freshly-put item from the local P cache; over
-	// many iterations, however, at least one must be served.
+	// cycles must observe at least one genuine reuse. Single-
+	// goroutine reuse is deterministic for the channel pool, but
+	// t.Parallel can race sibling sub-tests between our put and
+	// get; the loop amortises that effect over many iterations.
 	const iters = 500
 	for i := 0; i < iters; i++ {
 		pool.put(pool.get(nil))