Add LRU capacity to ValidatingCache, remove sentinel pattern, add storage Update #4731
Conversation
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
```go
MakeSessionWithID(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).
	DoAndReturn(func(_ context.Context, id string, _ *auth.Identity, _ bool, _ []*vmcp.Backend) (vmcpsession.MultiSession, error) {
		createdSess = newMockSession(t, ctrl, id, tools)
		// Close() will be called exactly once during Terminate.
		// Close() is called eagerly by onEvict when Terminate removes
		// the entry from the node-local cache after storage.Delete.
		createdSess.EXPECT().Close().Return(nil).Times(1)
		return createdSess, nil
	}).Times(1)
```
🔴 The test 'closes MultiSession backend connections' sets createdSess.EXPECT().Close().Return(nil).Times(1) but never seeds MetadataKeyTokenHash into storage, causing Terminate to take the Phase 1 placeholder path instead of the Phase 2 path that calls sm.sessions.Remove() → onEvict → Close(); gomock will fail at teardown with 'expected call to Close() Times(1), called 0 times'. The fix is to capture storage from newTestSessionManager (currently discarded as _) and call storage.Upsert with MetadataKeyTokenHash before invoking Terminate, matching the adjacent 'removes MultiSession from storage on Terminate' test.
Extended reasoning...
What the bug is and how it manifests: The test sets createdSess.EXPECT().Close().Return(nil).Times(1) but never seeds sessiontypes.MetadataKeyTokenHash into storage, which is the key Terminate checks to distinguish Phase 2 full MultiSession from Phase 1 placeholder.
The specific code path that triggers it: After CreateSession completes, it calls sm.storage.Update(storeCtx, sessionID, sess.GetMetadata()). The mock session's GetMetadata() returns map[string]string{} — an empty map with no MetadataKeyTokenHash. Storage ends up holding {} for this session ID. When Terminate is called, it loads this empty map and evaluates: if _, isFullSession := metadata[sessiontypes.MetadataKeyTokenHash]; isFullSession. Because the key is absent, isFullSession is false.
Why existing code does not prevent it: The updated Terminate implementation (this PR) replaced the old sentinel-based cache check with a pure storage-metadata inspection. The new code relies on MetadataKeyTokenHash in the loaded metadata to classify session phase. Because CreateSession stores whatever the mock returns from GetMetadata() (an empty map), Terminate classifies it as a placeholder and takes the Phase 1 path: it calls storage.Update to mark terminated, and never calls sm.sessions.Remove(sessionID). Without Remove, the LRU cache onEvict callback is never fired, so sess.Close() is never called.
Impact: This is a CI test failure introduced by the PR. The test will fail at gomock teardown with 'expected call to Close() Times(1), but was called 0 times'. The adjacent test 'removes MultiSession from storage on Terminate' was correctly updated in this PR to seed MetadataKeyTokenHash via storage.Upsert, but the 'closes MultiSession backend connections' test at line 568 was overlooked — it discards the storage return value (sm, _ :=) so it cannot even perform the necessary upsert.
How to fix it: Change sm, _ := newTestSessionManager(t, factory, registry) to sm, storage := newTestSessionManager(t, factory, registry) in the 'closes MultiSession backend connections' test, then add before the Terminate call: require.NoError(t, storage.Upsert(context.Background(), sessionID, map[string]string{sessiontypes.MetadataKeyTokenHash: ''})).
Step-by-step proof: (1) newMockSession sets up sess.EXPECT().GetMetadata().Return(map[string]string{}).AnyTimes(). (2) CreateSession calls sm.storage.Update(storeCtx, sessionID, sess.GetMetadata()) which stores {} (no MetadataKeyTokenHash) in LocalSessionDataStorage. (3) sm.sessions.Set(sessionID, sess) places the session in the LRU cache. (4) sm.Terminate(sessionID) loads metadata, gets {}. (5) _, isFullSession := metadata[sessiontypes.MetadataKeyTokenHash] → isFullSession == false. (6) Terminate falls into Phase 1 branch, calls storage.Update with terminated=true. sm.sessions.Remove is never called. (7) Without Remove, the LRU eviction callback is never triggered. (8) sess.Close() is never called. (9) gomock fails at teardown: expected call to Close() Times(1), called 0 times.
```go
func (alwaysFailDataStorage) Create(_ context.Context, _ string, _ map[string]string) (bool, error) {
	return false, errors.New("storage unavailable")
}

func (alwaysFailDataStorage) Update(_ context.Context, _ string, _ map[string]string) (bool, error) {
	return false, errors.New("storage unavailable")
}

func (alwaysFailDataStorage) Delete(_ context.Context, _ string) error { return nil }

func (alwaysFailDataStorage) Close() error { return nil }
```
🔴 The test helper configurableFailDataStorage overrides Upsert, Create, and Delete — but not Update. This PR changed Terminate's Phase 1 path to call storage.Update instead of storage.Upsert, so the failure-injection counter (shouldFail) is never incremented for Update calls. Both fallback tests — "placeholder termination falls back to delete when upsert fails" and "placeholder termination fails when both upsert and delete fail" — silently pass the Update call through to the real storage, causing Terminate to succeed when the tests expect it to fail.
Extended reasoning...
What the bug is and how it manifests
configurableFailDataStorage (session_manager_test.go:108–150) is a test double that wraps a real LocalSessionDataStorage and intercepts Upsert, Create, and Delete via a storeCallCount/shouldFail() counter. It is used by two tests that verify Terminate's placeholder fallback logic. This PR changed Terminate's Phase 1 path (placeholder → set terminated flag) from storage.Upsert to storage.Update. Because Update is not overridden in configurableFailDataStorage, every Update call falls through to the embedded baseStorage.Update, which succeeds unconditionally. The shouldFail() counter is never incremented for Update calls, so failure injection never triggers.
The specific code path that triggers it
Both affected tests construct a configurableFailDataStorage{failStoreAfter: 1} and call sm.Generate() followed by sm.Terminate(). Generate calls storage.Create (count=1, succeeds because 1 is not > failStoreAfter=1). Terminate's Phase 1 path then calls storage.Update with the terminated flag — but configurableFailDataStorage has no Update override, so shouldFail() is never called for this operation. The baseStorage.Update finds the key (created by Generate) and returns (true, nil). Terminate returns (false, nil) with no error.
Why existing code does not prevent it
The DataStorage interface now includes Update (added by this PR), but configurableFailDataStorage was not updated to intercept it. Go's embedding means the embedded DataStorage field satisfies the Update method automatically, silently bypassing the injection logic. There is no compile-time check that a test double intercepts all interface methods.
Impact
The two fallback tests assert conditions that can never be reached: (1) "placeholder termination falls back to delete when upsert fails" calls assert.ErrorIs(loadErr, ErrSessionNotFound) after Terminate, but since Update succeeded and marked terminated=true without deleting the key, the entry still exists — Load returns the terminated entry, not ErrSessionNotFound, and the assertion fails. (2) "placeholder termination fails when both upsert and delete fail" calls require.Error(t, err) after Terminate, but Terminate returns nil (Update succeeded), so the assertion fails. The fallback code path — where Update fails and Terminate falls back to Delete — has zero test coverage after this PR.
How to fix it
Add an Update override to configurableFailDataStorage that calls shouldFail() and delegates to baseStorage.Update on success, mirroring the existing Upsert override:
```go
func (s *configurableFailDataStorage) Update(ctx context.Context, id string, metadata map[string]string) (bool, error) {
	if s.shouldFail() {
		return false, errors.New("injected Update failure")
	}
	return s.DataStorage.Update(ctx, id, metadata)
}
```
Step-by-step proof
Trace "placeholder termination falls back to delete when upsert fails" (failStoreAfter=1, failDelete=false):
- sm.Generate() → baseStorage.Create — storeCallCount=1, shouldFail()=(1>1)=false → key created
- sm.Terminate(sessionID) Phase 1 → storage.Update() called
- configurableFailDataStorage has no Update override → embedded baseStorage.Update() executes directly — storeCallCount stays at 1, shouldFail() never called
- baseStorage.Update finds the key and returns (true, nil) — terminated flag set, session NOT deleted
- sm.Terminate returns (false, nil)
- Test: assert.ErrorIs(loadErr, ErrSessionNotFound) — FAILS because baseStorage.Load returns the terminated entry (not ErrSessionNotFound)
Trace "placeholder termination fails when both upsert and delete fail" (failStoreAfter=1, failDelete=true):
- Same bypass: Update succeeds, Terminate returns (false, nil)
- Test: require.Error(t, err) — FAILS because err is nil
```go
winner, found := c.lruCache.Get(key)
if !found {
	// Winner was evicted between ContainsOrAdd and Get; keep our
	// freshly loaded value rather than returning a zero value.
	return result{v: v}, nil
}
```
🔴 In the !found branch of ValidatingCache.Get (pkg/cache/validating_cache.go:135-138), when a concurrent winner is evicted by LRU pressure between ContainsOrAdd and lruCache.Get, the freshly-loaded value v is returned to the caller without calling onEvict(key, v). For the session manager use case, onEvict is sess.Close() which releases backend HTTP connections — skipping it means those connections are never closed, leaking resources. The fix is to call c.onEvict(key, v) in the !found branch before returning.
Extended reasoning...
What the bug is and how it manifests
In ValidatingCache.Get (pkg/cache/validating_cache.go), after load(key) completes, the code calls c.lruCache.ContainsOrAdd(key, v). When ContainsOrAdd returns ok=true, a concurrent writer already stored a value for this key, so v should be discarded. The code then calls c.lruCache.Get(key) to retrieve the winner. If the winner was itself evicted by LRU pressure in the narrow window between these two independent lock acquisitions, found is false and the code takes the !found branch (lines 135-138), returning result{v: v} directly — without calling onEvict(key, v) and without storing v in the cache.
The specific code path that triggers it
At line 131: ok, _ := c.lruCache.ContainsOrAdd(key, v) — returns ok=true when a concurrent Set stored a value between load() returning and ContainsOrAdd running. At line 133: winner, found := c.lruCache.Get(key) — ContainsOrAdd and Get are separate mutex acquisitions; another goroutine can add a new key to a full cache between them, evicting the winner. At lines 135-138: if !found { return result{v: v}, nil } — returns v to the caller without: (1) storing v in the cache, (2) calling c.onEvict(key, v). The code comment at line 125 explicitly acknowledges the race: "ContainsOrAdd and Get are separate lock acquisitions". The design comment at line 27 states "the prior writer's value wins and the just-loaded value is discarded via onEvict" — this contract is violated in the !found sub-case.
Why existing code doesn't prevent it
The normal discard path at lines 141-143 correctly calls c.onEvict(key, v) when the winner is alive. The !found branch is an intentional fallback to avoid returning a zero value, but it omits the matching onEvict call. There is no other mechanism that cleans up v when the winner disappears between the two lock acquisitions.
What the impact would be
Two consequences: (1) v is not cached — the next Get for the same key triggers another full load() call; (2) more critically for the session manager, onEvict = sess.Close() is never called for v. Since v was never stored in sm.sessions, a subsequent sm.sessions.Remove(sessionID) in Terminate() is a no-op. The returned session holds open backend HTTP connections that are never explicitly closed, leaking resources until process restart.
Step-by-step proof (capacity=1)
1. Cache has capacity=1. Key `k` is absent. Goroutine A calls `Get("k")`, misses, enters singleflight; `load("k")` starts.
2. Goroutine B calls `Set("k", "winner")`. Cache: `{k: winner}`.
3. Goroutine C calls `Set("k2", "x")`. The cache is full, so `k` (the winner) is evicted via `onEvict`. Cache: `{k2: x}`.
4. A's `load` returns `"from-load"`. A calls `ContainsOrAdd("k", "from-load")`. Since `k` is now absent (evicted in step 3), `ContainsOrAdd` adds it and returns `ok=false`; this is the normal path.
5. Race variant: C's eviction in step 3 happens after `ContainsOrAdd` sees `k` present (`ok=true` from B's write) but before `lruCache.Get("k")` runs. Then `found=false`. A returns `result{v: "from-load"}` to its caller without calling `onEvict("k", "from-load")`. The caller holds a session with open HTTP connections; `onEvict`/`sess.Close()` is never called, leaking those connections.
How to fix it
Call c.onEvict(key, v) in the !found branch before returning:
```go
if !found {
	if c.onEvict != nil {
		c.onEvict(key, v)
	}
	return result{v: v}, nil
}
```

Alternatively, and arguably more correct: since the winner is gone and v is fresh from storage, store v in the cache (`c.lruCache.Add(key, v)`) so the next Get finds it without a redundant load() round-trip.
jerm-dro
left a comment
Some failing tests that seem minor.
I'm assuming there are no major changes here compared to the original, but LMK if that's not true.
Summary
Re-applies the changes from #4669 which were reverted in #4730.
- `pkg/cache.ValidatingCache[K,V]`: a fully-typed, capacity-bounded LRU cache using mutex + map + `container/list`
- Removes the `terminatedSentinel` / sentinel+rollback pattern
- Adds `DataStorage.Update` (SET XX semantics) to close the TOCTOU resurrection window
- Adds `FactoryConfig.CacheCapacity` to wire the LRU limit through `sessionmanager.New`
- Adds `pkg/cache/validating_cache_test.go`

Fixes #4494
Type of change
Test plan
- Unit tests (`task test`)
- Lint (`task lint-fix`)

🤖 Generated with Claude Code