Skip to content

Commit 164c251

Browse files
ChrisJBurnsclaude
andauthored
Freeze MCPServer generation per pod via downward API (#5364)
* Add tests for MCPServer generation freeze invariant Lock-in test in pkg/container/kubernetes/client_test.go documenting the gate's blind spot when two callers pass equal MCPServerGeneration with different images. TDD test in cmd/thv-proxyrunner/app/run_test.go for the env-var override (THV_MCPSERVER_GENERATION) that the fix in #5360 will introduce. Fails today; goes green when the override is implemented. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Freeze proxyrunner MCPServerGeneration via env var The /etc/runconfig ConfigMap volume mounts live (no subPath), so a restarted old-RS proxyrunner pod re-reads runconfig.json after a helm upgrade and picks up the new MCPServer generation. Both old and new pods then call DeployWorkload with the same ourGen, defeating the strict-greater-than gate at shouldSkipStatefulSetApply. Honor THV_MCPSERVER_GENERATION (sourced via downward API in a follow-up operator change) as an override on the file value, freezing the generation per pod at creation time. Parallel to how the image is already frozen via the CLI positional arg. Part of #5360. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Project MCPServer generation into proxyrunner via downward API Stamp the MCPServer generation as a pod-template annotation on the proxy Deployment, and project that annotation into the proxyrunner container as the THV_MCPSERVER_GENERATION env var via the downward API. The env var is bound to the pod's own annotations at creation time, so a restarted old-RS pod cannot acquire the new generation by re-reading the live-mounted RunConfig ConfigMap. The proxyrunner already honors this env var as an override on the file value (see prior commit). With both sides wired, two coexisting proxyrunner pods during a rolling update carry distinct generations again and the strict-greater-than gate at shouldSkipStatefulSetApply fires correctly. Mirror the new env var and annotation in deploymentNeedsUpdate so drift detection stays stable. Closes #5360. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add envtest coverage for generation-freeze contract Two new integration specs assert the operator side of the freeze end-to-end against a real apiserver: - On initial reconcile, the proxy Deployment's pod template carries the mcpserver-generation annotation set to MCPServer.metadata.generation, and the proxyrunner container declares THV_MCPSERVER_GENERATION via a downward-API FieldRef pointing at that exact annotation key. - On a Spec.Image patch the controller bumps the annotation value to the new generation and keeps the FieldRef wired correctly. This is the rolling-update path: new pods get a strictly-greater frozen generation than any restarted old-RS pod. Catches typos in either the annotation key or the FieldRef path that would silently produce an empty env var and let the proxyrunner fall through to the (live-mounted) file value. Complements the unit-level TestDeploymentForMCPServer_MCPServerGenerationDownwardAPI and TestTryLoadConfigFromFile_MCPServerGenerationEnvOverride. Part of #5360. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address review feedback on #5360 freeze - Defend applyMCPServerGenerationOverride against negative env values. metadata.generation is monotonic non-negative per K8s convention, so a negative override cannot come from a legitimate downward-API projection and must not be allowed to silently disable the apply-gate stamp. Add a Debug log on the success path so the override is observable in pod logs when diagnosing future occurrences. - Add TestApplyMCPServerGenerationOverride exercising the empty, valid, zero, unparseable, and negative branches in isolation. - Soften the position-mirror comment in deploymentNeedsUpdate. The prior text referenced the Redis password slot, but deploymentNeedsUpdate does not include the Redis password env var (pre-existing latent drift bug, tracked separately). - Add a header comment to TestResourceOverrides explaining that the "0" mcpserver-generation annotation value comes from the fake client not auto-incrementing metadata.generation. Realistic generation tracking is exercised by the envtest in mcpserver_generation_freeze_integration_test.go. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix CI: codespell typo and gofmt alignment - "unparseable" → "unparsable" (codespell) in run.go, run_test.go. - Realign env-var map literals to gofmt's preferred column for the longest key (THV_MCPSERVER_GENERATION). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Set FieldRef.APIVersion=v1 to avoid perpetual drift ObjectFieldSelector.APIVersion is defaulted to "v1" by the API server on persistence. The drift comparator at deploymentNeedsUpdate was rebuilding the expected env var with an empty APIVersion, so equality.Semantic.DeepEqual returned false on every reconcile and the controller perpetually rewrote the proxy Deployment. That constant churn re-triggered the exact rolling- update race the freeze was supposed to close — confirmed by the failing VirtualMCPServer Optimizer + Circuit Breaker recovery spec in test-e2e-lifecycle: stale-image proxyrunner pods kept clobbering the new StatefulSet apply, leaving the backend mcp container in ImagePullBackOff. Explicitly set APIVersion: "v1" on both the construction site and the drift comparator so they match the API-server-defaulted value. Add unit-level assertion in TestDeploymentForMCPServer_MCPServerGenerationDownwardAPI and a new envtest spec "Does not flag spurious drift on a no-op reconcile" that watches the Deployment's resourceVersion across multiple reconciles and fails fast if drift detection misfires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * TEMP debug: promote gate skip log + override log to INFO Temporary diagnostic for #5360 v1.33/v1.34 E2E failure. Will be reverted before merge. * Revert debug INFO logs; port STS-template wait to optimizer test The debug commit (8ed6e39) confirmed the override + gate are working correctly end-to-end on a real cluster: pod env=3 (latest): file=3 env=3, apply proceeds — writes GOOD image pod env=2 (stale): file=3 env=2, gate skips (theirs=3 > ours=2) Revert the INFO promotion now that diagnosis is complete; both logs go back to Debug. The remaining E2E flake is a pre-existing race in virtualmcp_optimizer_circuit_breaker_test.go: the test deletes the Pending StatefulSet pod immediately after patching Spec.Image without waiting for the proxyrunner to apply the new template, so the pod can be recreated against the stale (bad) template and the StatefulSet controller may not re-roll an already-unhealthy pod afterwards. Mirrors the same guard added to virtualmcp_circuit_breaker_test.go in #5079. Before the apply-gate, the race "resolved" by the stale-image apply periodically resetting the template, letting the test catch a lucky good window. With the gate in place the apply order is deterministic, which makes the missing wait visible — hence the test fix lands here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4bb80f0 commit 164c251

8 files changed

Lines changed: 660 additions & 27 deletions

File tree

cmd/thv-operator/controllers/mcpserver_controller.go

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ import (
1111
"fmt"
1212
"maps"
1313
"os"
14+
"strconv"
1415
"strings"
1516
"time"
1617

@@ -1151,6 +1152,25 @@ func (r *MCPServerReconciler) deploymentForMCPServer(
11511152
// name collision (ResourceOverrides.Env only accepts plain strings).
11521153
env = append(env, r.buildRedisPasswordEnvVar(m)...)
11531154

1155+
// Project the MCPServer generation pod-template annotation into the
1156+
// proxyrunner container via the downward API. The proxyrunner uses this
1157+
// to override the value read from the live-mounted RunConfig ConfigMap,
1158+
// freezing it per pod at creation time. See #5360.
1159+
//
1160+
// APIVersion must be explicitly "v1" — the API server defaults it on
1161+
// persistence and equality.Semantic.DeepEqual treats "" != "v1" as drift,
1162+
// which would otherwise force a Deployment update on every reconcile.
1163+
env = append(env, corev1.EnvVar{
1164+
Name: kubernetes.EnvVarMCPServerGeneration,
1165+
ValueFrom: &corev1.EnvVarSource{
1166+
FieldRef: &corev1.ObjectFieldSelector{
1167+
APIVersion: "v1",
1168+
FieldPath: fmt.Sprintf("metadata.annotations['%s']",
1169+
kubernetes.RunConfigMCPServerGenerationAnnotation),
1170+
},
1171+
},
1172+
})
1173+
11541174
// Add volume mounts for user-defined volumes
11551175
for _, v := range m.Spec.Volumes {
11561176
volumeMounts = append(volumeMounts, corev1.VolumeMount{
@@ -1264,6 +1284,12 @@ func (r *MCPServerReconciler) deploymentForMCPServer(
12641284
// Add RunConfig checksum annotation to trigger pod rollout when config changes
12651285
deploymentTemplateAnnotations = checksum.AddRunConfigChecksumToPodTemplate(deploymentTemplateAnnotations, runConfigChecksum)
12661286

1287+
// Stamp the MCPServer generation on the proxy Deployment's pod template so the
1288+
// downward-API env var below resolves to a value that is frozen at pod creation
1289+
// time, not live-updated like the runconfig.json ConfigMap mount. See #5360.
1290+
deploymentTemplateAnnotations[kubernetes.RunConfigMCPServerGenerationAnnotation] =
1291+
strconv.FormatInt(m.Generation, 10)
1292+
12671293
if m.Spec.ResourceOverrides != nil && m.Spec.ResourceOverrides.ProxyDeployment != nil {
12681294
if m.Spec.ResourceOverrides.ProxyDeployment.Labels != nil {
12691295
deploymentLabels = ctrlutil.MergeLabels(ls, m.Spec.ResourceOverrides.ProxyDeployment.Labels)
@@ -1759,6 +1785,26 @@ func (r *MCPServerReconciler) deploymentNeedsUpdate(
17591785
}
17601786
}
17611787

1788+
// Project the MCPServer generation pod-template annotation into the
1789+
// proxyrunner container via the downward API. Position must come
1790+
// before the embedded-auth env vars below so the slice order matches
1791+
// deploymentForMCPServer and equality.Semantic.DeepEqual against
1792+
// container.Env succeeds.
1793+
//
1794+
// APIVersion must mirror the construction site at "v1" — the API
1795+
// server defaults it on persistence and an empty string here would
1796+
// produce false drift on every reconcile. See #5360.
1797+
expectedProxyEnv = append(expectedProxyEnv, corev1.EnvVar{
1798+
Name: kubernetes.EnvVarMCPServerGeneration,
1799+
ValueFrom: &corev1.EnvVarSource{
1800+
FieldRef: &corev1.ObjectFieldSelector{
1801+
APIVersion: "v1",
1802+
FieldPath: fmt.Sprintf("metadata.annotations['%s']",
1803+
kubernetes.RunConfigMCPServerGenerationAnnotation),
1804+
},
1805+
},
1806+
})
1807+
17621808
// Add embedded auth server environment variables. AuthServerRef takes precedence;
17631809
// externalAuthConfigRef is used as a fallback (legacy path).
17641810
if configName := ctrlutil.EmbeddedAuthServerConfigName(
@@ -1879,6 +1925,11 @@ func (r *MCPServerReconciler) deploymentNeedsUpdate(
18791925
// Check if pod template annotations have changed (including runconfig checksum)
18801926
expectedPodTemplateAnnotations := make(map[string]string)
18811927
expectedPodTemplateAnnotations = checksum.AddRunConfigChecksumToPodTemplate(expectedPodTemplateAnnotations, runConfigChecksum)
1928+
// Mirrors deploymentForMCPServer: stamp the MCPServer generation so the
1929+
// downward-API env var injected into the proxyrunner container resolves
1930+
// to a frozen-per-pod value (#5360).
1931+
expectedPodTemplateAnnotations[kubernetes.RunConfigMCPServerGenerationAnnotation] =
1932+
strconv.FormatInt(mcpServer.Generation, 10)
18821933

18831934
if mcpServer.Spec.ResourceOverrides != nil &&
18841935
mcpServer.Spec.ResourceOverrides.ProxyDeployment != nil &&

cmd/thv-operator/controllers/mcpserver_resource_overrides_test.go

Lines changed: 107 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -222,6 +222,13 @@ func TestResourceOverrides(t *testing.T) {
222222
scheme := runtime.NewScheme()
223223
require.NoError(t, mcpv1beta1.AddToScheme(scheme))
224224

225+
// Note: expectedPodTemplateAnns entries below carry
226+
// "toolhive.stacklok.dev/mcpserver-generation": "0" because the controller
227+
// stamps strconv.FormatInt(m.Generation, 10) and the fake client does not
228+
// auto-increment metadata.generation on Create (the real API server starts
229+
// at 1). Envtest coverage in
230+
// cmd/thv-operator/test-integration/mcp-server/mcpserver_generation_freeze_integration_test.go
231+
// exercises the realistic generation-tracking behavior.
225232
tests := []struct {
226233
name string
227234
mcpServer *mcpv1beta1.MCPServer
@@ -252,7 +259,8 @@ func TestResourceOverrides(t *testing.T) {
252259
},
253260
expectedDeploymentAnns: map[string]string{},
254261
expectedPodTemplateAnns: map[string]string{
255-
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
262+
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
263+
"toolhive.stacklok.dev/mcpserver-generation": "0",
256264
},
257265
expectedServiceLabels: map[string]string{
258266
"app": "mcpserver",
@@ -315,7 +323,8 @@ func TestResourceOverrides(t *testing.T) {
315323
"monitoring/scrape": "true",
316324
},
317325
expectedPodTemplateAnns: map[string]string{
318-
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
326+
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
327+
"toolhive.stacklok.dev/mcpserver-generation": "0",
319328
},
320329
expectedServiceLabels: map[string]string{
321330
"app": "mcpserver",
@@ -376,7 +385,8 @@ func TestResourceOverrides(t *testing.T) {
376385
},
377386
expectedDeploymentAnns: map[string]string{},
378387
expectedPodTemplateAnns: map[string]string{
379-
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
388+
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
389+
"toolhive.stacklok.dev/mcpserver-generation": "0",
380390
},
381391
expectedServiceLabels: map[string]string{
382392
"app": "mcpserver",
@@ -415,7 +425,8 @@ func TestResourceOverrides(t *testing.T) {
415425
},
416426
expectedDeploymentAnns: map[string]string{},
417427
expectedPodTemplateAnns: map[string]string{
418-
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
428+
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
429+
"toolhive.stacklok.dev/mcpserver-generation": "0",
419430
},
420431
expectedServiceLabels: map[string]string{
421432
"app": "mcpserver",
@@ -481,7 +492,8 @@ func TestResourceOverrides(t *testing.T) {
481492
"version": "v1.2.3",
482493
},
483494
expectedPodTemplateAnns: map[string]string{
484-
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
495+
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
496+
"toolhive.stacklok.dev/mcpserver-generation": "0",
485497
},
486498
expectedServiceLabels: map[string]string{
487499
"app": "mcpserver",
@@ -525,9 +537,10 @@ func TestResourceOverrides(t *testing.T) {
525537
},
526538
expectedDeploymentAnns: map[string]string{},
527539
expectedPodTemplateAnns: map[string]string{
528-
"vault.hashicorp.com/agent-inject": "true",
529-
"vault.hashicorp.com/role": "toolhive-mcp-workloads",
530-
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
540+
"vault.hashicorp.com/agent-inject": "true",
541+
"vault.hashicorp.com/role": "toolhive-mcp-workloads",
542+
"toolhive.stacklok.dev/runconfig-checksum": "test-checksum",
543+
"toolhive.stacklok.dev/mcpserver-generation": "0",
531544
},
532545
expectedServiceLabels: map[string]string{
533546
"app": "mcpserver",
@@ -582,30 +595,33 @@ func TestResourceOverrides(t *testing.T) {
582595
switch tt.name {
583596
case "with proxy environment variables":
584597
expectedEnvVars = map[string]string{
585-
"HTTP_PROXY": "http://proxy.example.com:8080",
586-
"NO_PROXY": "localhost,127.0.0.1",
587-
"CUSTOM_ENV": "custom-value",
588-
"XDG_CONFIG_HOME": "/tmp",
589-
"HOME": "/tmp",
590-
"TOOLHIVE_RUNTIME": "kubernetes",
591-
"UNSTRUCTURED_LOGS": "false",
598+
"HTTP_PROXY": "http://proxy.example.com:8080",
599+
"NO_PROXY": "localhost,127.0.0.1",
600+
"CUSTOM_ENV": "custom-value",
601+
"THV_MCPSERVER_GENERATION": "", // downward API; Value is empty, ValueFrom set
602+
"XDG_CONFIG_HOME": "/tmp",
603+
"HOME": "/tmp",
604+
"TOOLHIVE_RUNTIME": "kubernetes",
605+
"UNSTRUCTURED_LOGS": "false",
592606
}
593607
case "with debug logging via TOOLHIVE_DEBUG env var":
594608
expectedEnvVars = map[string]string{
595-
"TOOLHIVE_DEBUG": "true",
596-
"XDG_CONFIG_HOME": "/tmp",
597-
"HOME": "/tmp",
598-
"TOOLHIVE_RUNTIME": "kubernetes",
599-
"UNSTRUCTURED_LOGS": "false",
609+
"TOOLHIVE_DEBUG": "true",
610+
"THV_MCPSERVER_GENERATION": "", // downward API; Value is empty, ValueFrom set
611+
"XDG_CONFIG_HOME": "/tmp",
612+
"HOME": "/tmp",
613+
"TOOLHIVE_RUNTIME": "kubernetes",
614+
"UNSTRUCTURED_LOGS": "false",
600615
}
601616
default:
602617
expectedEnvVars = map[string]string{
603-
"LOG_LEVEL": "debug",
604-
"METRICS_ENABLED": "true",
605-
"XDG_CONFIG_HOME": "/tmp",
606-
"HOME": "/tmp",
607-
"TOOLHIVE_RUNTIME": "kubernetes",
608-
"UNSTRUCTURED_LOGS": "false",
618+
"LOG_LEVEL": "debug",
619+
"METRICS_ENABLED": "true",
620+
"THV_MCPSERVER_GENERATION": "", // downward API; Value is empty, ValueFrom set
621+
"XDG_CONFIG_HOME": "/tmp",
622+
"HOME": "/tmp",
623+
"TOOLHIVE_RUNTIME": "kubernetes",
624+
"UNSTRUCTURED_LOGS": "false",
609625
}
610626
}
611627

@@ -657,7 +673,10 @@ func TestDeploymentForMCPServer_PodTemplateOverridesPreserveRunConfigChecksum(t
657673
assert.Equal(t, "value",
658674
deployment.Spec.Template.Annotations["user.example.com/some-key"],
659675
"user override must survive")
660-
assert.Len(t, deployment.Spec.Template.Annotations, 2,
676+
assert.Contains(t, deployment.Spec.Template.Annotations,
677+
kubernetes.RunConfigMCPServerGenerationAnnotation,
678+
"mcpserver-generation must be stamped for the downward-API env var (#5360)")
679+
assert.Len(t, deployment.Spec.Template.Annotations, 3,
661680
"no extra keys should leak into the pod template")
662681
}
663682

@@ -1140,3 +1159,64 @@ func TestMCPServerServiceNeedsUpdate(t *testing.T) {
11401159
})
11411160
}
11421161
}
1162+
1163+
// TestDeploymentForMCPServer_MCPServerGenerationDownwardAPI verifies that the
1164+
// proxy Deployment stamps the MCPServer generation as a pod-template annotation
1165+
// AND projects that annotation into the proxyrunner container as the
1166+
// THV_MCPSERVER_GENERATION env var via the downward API. This is the
1167+
// frozen-per-pod path that closes the race described in #5360 — the env var's
1168+
// value is bound to the pod's own annotations at creation time, so a restarted
1169+
// old-RS pod cannot acquire the new generation by re-reading the live-mounted
1170+
// RunConfig ConfigMap.
1171+
func TestDeploymentForMCPServer_MCPServerGenerationDownwardAPI(t *testing.T) {
1172+
t.Parallel()
1173+
1174+
scheme := runtime.NewScheme()
1175+
require.NoError(t, mcpv1beta1.AddToScheme(scheme))
1176+
require.NoError(t, corev1.AddToScheme(scheme))
1177+
1178+
client := fake.NewClientBuilder().WithScheme(scheme).Build()
1179+
r := newTestMCPServerReconciler(client, scheme, kubernetes.PlatformKubernetes)
1180+
1181+
mcpServer := &mcpv1beta1.MCPServer{
1182+
ObjectMeta: metav1.ObjectMeta{
1183+
Name: "test-server",
1184+
Namespace: "default",
1185+
Generation: 7,
1186+
},
1187+
Spec: mcpv1beta1.MCPServerSpec{
1188+
Image: "test-image",
1189+
ProxyPort: 8080,
1190+
},
1191+
}
1192+
1193+
deployment, err := r.deploymentForMCPServer(t.Context(), mcpServer, "test-checksum")
1194+
require.NoError(t, err)
1195+
require.NotNil(t, deployment)
1196+
1197+
assert.Equal(t, "7",
1198+
deployment.Spec.Template.Annotations[kubernetes.RunConfigMCPServerGenerationAnnotation],
1199+
"pod template must stamp the MCPServer generation so the downward-API env var resolves")
1200+
1201+
require.Len(t, deployment.Spec.Template.Spec.Containers, 1)
1202+
var got *corev1.EnvVar
1203+
for i := range deployment.Spec.Template.Spec.Containers[0].Env {
1204+
if deployment.Spec.Template.Spec.Containers[0].Env[i].Name == kubernetes.EnvVarMCPServerGeneration {
1205+
got = &deployment.Spec.Template.Spec.Containers[0].Env[i]
1206+
break
1207+
}
1208+
}
1209+
require.NotNil(t, got, "container must declare the %s env var", kubernetes.EnvVarMCPServerGeneration)
1210+
require.NotNil(t, got.ValueFrom, "env var must use ValueFrom (downward API), not a literal Value")
1211+
require.NotNil(t, got.ValueFrom.FieldRef)
1212+
assert.Equal(t,
1213+
"metadata.annotations['"+kubernetes.RunConfigMCPServerGenerationAnnotation+"']",
1214+
got.ValueFrom.FieldRef.FieldPath,
1215+
"FieldRef must point at the mcpserver-generation pod annotation")
1216+
// APIVersion must be set explicitly so the drift comparator at
1217+
// deploymentNeedsUpdate matches the API-server-defaulted value. An empty
1218+
// APIVersion here results in equality.Semantic.DeepEqual returning false on
1219+
// every reconcile, causing perpetual Deployment rewrites. See #5360.
1220+
assert.Equal(t, "v1", got.ValueFrom.FieldRef.APIVersion,
1221+
"FieldRef.APIVersion must match the API server default of v1 to avoid false drift")
1222+
}

0 commit comments

Comments
 (0)