Add LRU capacity to ValidatingCache, remove sentinel pattern, add storage Update #4731
Conversation
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
```go
MakeSessionWithID(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).
	DoAndReturn(func(_ context.Context, id string, _ *auth.Identity, _ bool, _ []*vmcp.Backend) (vmcpsession.MultiSession, error) {
		createdSess = newMockSession(t, ctrl, id, tools)
		// Close() will be called exactly once during Terminate.
		// Close() is called eagerly by onEvict when Terminate removes
		// the entry from the node-local cache after storage.Delete.
		createdSess.EXPECT().Close().Return(nil).Times(1)
		return createdSess, nil
	}).Times(1)
```
🔴 The test 'closes MultiSession backend connections' sets createdSess.EXPECT().Close().Return(nil).Times(1) but never seeds MetadataKeyTokenHash into storage, causing Terminate to take the Phase 1 placeholder path instead of the Phase 2 path that calls sm.sessions.Remove() → onEvict → Close(); gomock will fail at teardown with 'expected call to Close() Times(1), called 0 times'. The fix is to capture storage from newTestSessionManager (currently discarded as _) and call storage.Upsert with MetadataKeyTokenHash before invoking Terminate, matching the adjacent 'removes MultiSession from storage on Terminate' test.
Extended reasoning...
What the bug is and how it manifests: The test sets createdSess.EXPECT().Close().Return(nil).Times(1) but never seeds sessiontypes.MetadataKeyTokenHash into storage, which is the key Terminate checks to distinguish Phase 2 full MultiSession from Phase 1 placeholder.
The specific code path that triggers it: After CreateSession completes, it calls sm.storage.Update(storeCtx, sessionID, sess.GetMetadata()). The mock session's GetMetadata() returns map[string]string{} — an empty map with no MetadataKeyTokenHash. Storage ends up holding {} for this session ID. When Terminate is called, it loads this empty map and evaluates: if _, isFullSession := metadata[sessiontypes.MetadataKeyTokenHash]; isFullSession. Because the key is absent, isFullSession is false.
Why existing code does not prevent it: The updated Terminate implementation (this PR) replaced the old sentinel-based cache check with a pure storage-metadata inspection. The new code relies on MetadataKeyTokenHash in the loaded metadata to classify session phase. Because CreateSession stores whatever the mock returns from GetMetadata() (an empty map), Terminate classifies it as a placeholder and takes the Phase 1 path: it calls storage.Update to mark terminated, and never calls sm.sessions.Remove(sessionID). Without Remove, the LRU cache onEvict callback is never fired, so sess.Close() is never called.
Impact: This is a CI test failure introduced by the PR. The test will fail at gomock teardown with 'expected call to Close() Times(1), but was called 0 times'. The adjacent test 'removes MultiSession from storage on Terminate' was correctly updated in this PR to seed MetadataKeyTokenHash via storage.Upsert, but the 'closes MultiSession backend connections' test at line 568 was overlooked — it discards the storage return value (sm, _ :=) so it cannot even perform the necessary upsert.
How to fix it: Change sm, _ := newTestSessionManager(t, factory, registry) to sm, storage := newTestSessionManager(t, factory, registry) in the 'closes MultiSession backend connections' test, then add before the Terminate call: require.NoError(t, storage.Upsert(context.Background(), sessionID, map[string]string{sessiontypes.MetadataKeyTokenHash: ''})).
Step-by-step proof: (1) newMockSession sets up sess.EXPECT().GetMetadata().Return(map[string]string{}).AnyTimes(). (2) CreateSession calls sm.storage.Update(storeCtx, sessionID, sess.GetMetadata()) which stores {} (no MetadataKeyTokenHash) in LocalSessionDataStorage. (3) sm.sessions.Set(sessionID, sess) places the session in the LRU cache. (4) sm.Terminate(sessionID) loads metadata, gets {}. (5) _, isFullSession := metadata[sessiontypes.MetadataKeyTokenHash] → isFullSession == false. (6) Terminate falls into Phase 1 branch, calls storage.Update with terminated=true. sm.sessions.Remove is never called. (7) Without Remove, the LRU eviction callback is never triggered. (8) sess.Close() is never called. (9) gomock fails at teardown: expected call to Close() Times(1), called 0 times.
```go
func (alwaysFailDataStorage) Create(_ context.Context, _ string, _ map[string]string) (bool, error) {
	return false, errors.New("storage unavailable")
}

func (alwaysFailDataStorage) Update(_ context.Context, _ string, _ map[string]string) (bool, error) {
	return false, errors.New("storage unavailable")
}

func (alwaysFailDataStorage) Delete(_ context.Context, _ string) error { return nil }

func (alwaysFailDataStorage) Close() error { return nil }
```
🔴 The test helper configurableFailDataStorage overrides Upsert, Create, and Delete — but not Update. This PR changed Terminate's Phase 1 path to call storage.Update instead of storage.Upsert, so the failure-injection counter (shouldFail) is never incremented for Update calls. Both fallback tests — "placeholder termination falls back to delete when upsert fails" and "placeholder termination fails when both upsert and delete fail" — silently pass the Update call through to the real storage, causing Terminate to succeed when the tests expect it to fail.
Extended reasoning...
What the bug is and how it manifests
configurableFailDataStorage (session_manager_test.go:108–150) is a test double that wraps a real LocalSessionDataStorage and intercepts Upsert, Create, and Delete via a storeCallCount/shouldFail() counter. It is used by two tests that verify Terminate's placeholder fallback logic. This PR changed Terminate's Phase 1 path (placeholder → set terminated flag) from storage.Upsert to storage.Update. Because Update is not overridden in configurableFailDataStorage, every Update call falls through to the embedded baseStorage.Update, which succeeds unconditionally. The shouldFail() counter is never incremented for Update calls, so failure injection never triggers.
The specific code path that triggers it
Both affected tests construct a configurableFailDataStorage{failStoreAfter: 1} and call sm.Generate() followed by sm.Terminate(). Generate calls storage.Create (count=1, succeeds because 1 is not > failStoreAfter=1). Terminate's Phase 1 path then calls storage.Update with the terminated flag — but configurableFailDataStorage has no Update override, so shouldFail() is never called for this operation. The baseStorage.Update finds the key (created by Generate) and returns (true, nil). Terminate returns (false, nil) with no error.
Why existing code does not prevent it
The DataStorage interface now includes Update (added by this PR), but configurableFailDataStorage was not updated to intercept it. Go's embedding means the embedded DataStorage field satisfies the Update method automatically, silently bypassing the injection logic. There is no compile-time check that a test double intercepts all interface methods.
Impact
The two fallback tests assert conditions that can never be reached: (1) "placeholder termination falls back to delete when upsert fails" calls assert.ErrorIs(loadErr, ErrSessionNotFound) after Terminate, but since Update succeeded and marked terminated=true without deleting the key, the entry still exists — Load returns the terminated entry, not ErrSessionNotFound, and the assertion fails. (2) "placeholder termination fails when both upsert and delete fail" calls require.Error(t, err) after Terminate, but Terminate returns nil (Update succeeded), so the assertion fails. The fallback code path — where Update fails and Terminate falls back to Delete — has zero test coverage after this PR.
How to fix it
Add an Update override to configurableFailDataStorage that calls shouldFail() and delegates to baseStorage.Update on success, mirroring the existing Upsert override:
```go
func (s *configurableFailDataStorage) Update(ctx context.Context, id string, metadata map[string]string) (bool, error) {
	if s.shouldFail() {
		return false, errors.New("injected Update failure")
	}
	return s.DataStorage.Update(ctx, id, metadata)
}
```
Step-by-step proof
Trace "placeholder termination falls back to delete when upsert fails" (failStoreAfter=1, failDelete=false):
- sm.Generate() → baseStorage.Create — storeCallCount=1, shouldFail()=(1>1)=false → key created
- sm.Terminate(sessionID) Phase 1 → storage.Update() called
- configurableFailDataStorage has no Update override → embedded baseStorage.Update() executes directly — storeCallCount stays at 1, shouldFail() never called
- baseStorage.Update finds the key and returns (true, nil) — terminated flag set, session NOT deleted
- sm.Terminate returns (false, nil)
- Test: assert.ErrorIs(loadErr, ErrSessionNotFound) — FAILS because baseStorage.Load returns the terminated entry (not ErrSessionNotFound)
Trace "placeholder termination fails when both upsert and delete fail" (failStoreAfter=1, failDelete=true):
- Same bypass: Update succeeds, Terminate returns (false, nil)
- Test: require.Error(t, err) — FAILS because err is nil
```go
winner, found := c.lruCache.Get(key)
if !found {
	// Winner was evicted between ContainsOrAdd and Get; keep our
	// freshly loaded value rather than returning a zero value.
	return result{v: v}, nil
}
```
🔴 In the !found branch of ValidatingCache.Get (pkg/cache/validating_cache.go:135-138), when a concurrent winner is evicted by LRU pressure between ContainsOrAdd and lruCache.Get, the freshly-loaded value v is returned to the caller without calling onEvict(key, v). For the session manager use case, onEvict is sess.Close() which releases backend HTTP connections — skipping it means those connections are never closed, leaking resources. The fix is to call c.onEvict(key, v) in the !found branch before returning.
Extended reasoning...
What the bug is and how it manifests
In ValidatingCache.Get (pkg/cache/validating_cache.go), after load(key) completes, the code calls c.lruCache.ContainsOrAdd(key, v). When ContainsOrAdd returns ok=true, a concurrent writer already stored a value for this key, so v should be discarded. The code then calls c.lruCache.Get(key) to retrieve the winner. If the winner was itself evicted by LRU pressure in the narrow window between these two independent lock acquisitions, found is false and the code takes the !found branch (lines 135-138), returning result{v: v} directly — without calling onEvict(key, v) and without storing v in the cache.
The specific code path that triggers it
At line 131: ok, _ := c.lruCache.ContainsOrAdd(key, v) — returns ok=true when a concurrent Set stored a value between load() returning and ContainsOrAdd running. At line 133: winner, found := c.lruCache.Get(key) — ContainsOrAdd and Get are separate mutex acquisitions; another goroutine can add a new key to a full cache between them, evicting the winner. At lines 135-138: if !found { return result{v: v}, nil } — returns v to the caller without: (1) storing v in the cache, (2) calling c.onEvict(key, v). The code comment at line 125 explicitly acknowledges the race: "ContainsOrAdd and Get are separate lock acquisitions". The design comment at line 27 states "the prior writer's value wins and the just-loaded value is discarded via onEvict" — this contract is violated in the !found sub-case.
Why existing code doesn't prevent it
The normal discard path at lines 141-143 correctly calls c.onEvict(key, v) when the winner is alive. The !found branch is an intentional fallback to avoid returning a zero value, but it omits the matching onEvict call. There is no other mechanism that cleans up v when the winner disappears between the two lock acquisitions.
What the impact would be
Two consequences: (1) v is not cached — the next Get for the same key triggers another full load() call; (2) more critically for the session manager, onEvict = sess.Close() is never called for v. Since v was never stored in sm.sessions, a subsequent sm.sessions.Remove(sessionID) in Terminate() is a no-op. The returned session holds open backend HTTP connections that are never explicitly closed, leaking resources until process restart.
Step-by-step proof (capacity=1)
1. Cache has capacity=1. Key `k` is absent. Goroutine A calls `Get("k")`, misses, enters singleflight; `load("k")` starts.
2. Goroutine B calls `Set("k", "winner")`. Cache: `{k: winner}`.
3. Goroutine C calls `Set("k2", "x")`. The cache is full, so `k` (the winner) is evicted via `onEvict`. Cache: `{k2: x}`.
4. A's `load` returns `"from-load"`. A calls `ContainsOrAdd("k", "from-load")`. Since `k` is now absent (evicted in step 3), `ContainsOrAdd` adds it and returns `ok=false`; this is the normal path.
5. Race variant: C's eviction in step 3 happens after `ContainsOrAdd` sees `k` present (`ok=true` from B's write) but before `lruCache.Get("k")` runs. Then `found=false`. A returns `result{v: "from-load"}` to its caller without calling `onEvict("k", "from-load")`. The caller holds a session with open HTTP connections; `onEvict`/`sess.Close()` is never called, leaking those connections.
How to fix it
Call c.onEvict(key, v) in the !found branch before returning:
```go
if !found {
	if c.onEvict != nil {
		c.onEvict(key, v)
	}
	return result{v: v}, nil
}
```

Alternatively, and arguably more correct: since the winner is gone and v is fresh from storage, store v in the cache (`c.lruCache.Add(key, v)`) so the next Get finds it without a redundant load() round-trip.
jerm-dro
left a comment
Some failing tests that seem minor.
I'm assuming there are no major changes here compared to the original, but LMK if that's not true.
Summary
Re-applies the changes from #4669 which were reverted in #4730.
- `pkg/cache.ValidatingCache[K,V]`: a fully-typed, capacity-bounded LRU cache using mutex + map + `container/list`
- Removes the `terminatedSentinel` / sentinel+rollback pattern
- Adds `DataStorage.Update` (SET XX semantics) to close the TOCTOU resurrection window
- Adds `FactoryConfig.CacheCapacity` to wire the LRU limit through `sessionmanager.New`
- Adds `pkg/cache/validating_cache_test.go`

Fixes #4494
Type of change
Test plan
- Unit tests (`task test`)
- Lint (`task lint-fix`)

🤖 Generated with Claude Code