Skip to content

Commit 45a5417

Browse files
ChrisJBurnsclaude
andcommitted
Add upgrade walkthrough + CR fixtures to docs/operator/upgrade-guide
End-to-end reproducible verification for the StorageVersionMigrator: deploy v0.21.0 (last v1alpha1-only release), create one CR of each of the 12 toolhive kinds at v1alpha1, upgrade to multi-version CRDs + this branch's operator with the migrator explicitly disabled, re-apply at v1beta1, confirm storedVersions is stuck at [v1alpha1, v1beta1] on all 12 CRDs. Then enable the migrator and watch all 12 converge to [v1beta1] within ~1s. Includes an optional step 11 that simulates the next release dropping v1alpha1 from spec.versions — proving the migrator's storedVersions trim is what unblocks that change. This is what was actually run to validate PR #5011 against a real kind cluster. Putting it in the docs tree (not the worktree root) so it's discoverable and reproducible after the PR merges. CR fixtures (crs-v1alpha1.yaml and crs-v1beta1.yaml) originate from the graduate-crds upgrade-test fixtures; copying them in keeps this guide self-contained. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent c47904a commit 45a5417

3 files changed

Lines changed: 621 additions & 0 deletions

File tree

Lines changed: 323 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,323 @@
1+
# Upgrade walkthrough — v1alpha1 → v1beta1 with StorageVersionMigrator
2+
3+
End-to-end manual walkthrough that replays a real user upgrade against a local kind cluster: install the previous v0.21.0 release, create a CR of each of the 12 toolhive CRD kinds at `v1alpha1`, upgrade to the new multi-version chart, deploy this branch's operator with the migrator **disabled**, re-apply the CRs at `v1beta1`, and confirm `status.storedVersions` is stuck at `[v1alpha1, v1beta1]` on every CRD. Then enable the `StorageVersionMigrator` and confirm it converges every CRD to `[v1beta1]`.
4+
5+
Total run time: ~30 minutes. The slow part is the first `ko build` of the operator + proxyrunner + vmcp images (~3 min); subsequent runs use the build cache and finish in ~30s.
6+
7+
This guide is the canonical reproducible verification for the migrator. Companion reading:
8+
9+
- [`docs/operator/storage-version-migration.md`](../storage-version-migration.md) — reference docs for the controller itself (label contract, opt-out, mechanism).
10+
- [Issue #4969](https://github.com/stacklok/toolhive/issues/4969) — the motivating problem.
11+
12+
## Prerequisites
13+
14+
- `kind`, `kubectl`, `helm`, `ko`, `task` on PATH (`go install github.com/google/ko@latest` for ko)
15+
- Working directory: the repo root (`task operator-deploy-local` and the relative chart paths assume this).
16+
- Cluster name is `toolhive`. If you already have a cluster with that name, delete it first or run from a different worktree.
17+
18+
The CR fixtures used below ship alongside this doc:
19+
20+
- [`crs-v1alpha1.yaml`](./crs-v1alpha1.yaml) — one CR of each of the 12 kinds at `v1alpha1`
21+
- [`crs-v1beta1.yaml`](./crs-v1beta1.yaml) — byte-identical to the v1alpha1 file except for `apiVersion`, simulating what `sed -i 's/v1alpha1/v1beta1/g'` would produce
22+
23+
---
24+
25+
## 0 · Set up the cluster
26+
27+
```bash
28+
# If you already have a "toolhive" kind cluster from a previous run, delete it
29+
kind delete cluster --name toolhive 2>/dev/null
30+
31+
kind create cluster --name toolhive --wait 60s
32+
kind get kubeconfig --name toolhive > kconfig.yaml
33+
export KUBECONFIG=$(pwd)/kconfig.yaml
34+
```
35+
36+
## 1 · Install v0.21.0 (the last v1alpha1-only release)
37+
38+
```bash
39+
helm install toolhive-operator-crds \
40+
oci://ghcr.io/stacklok/toolhive/toolhive-operator-crds \
41+
--version 0.21.0 --wait
42+
43+
helm install toolhive-operator \
44+
oci://ghcr.io/stacklok/toolhive/toolhive-operator \
45+
--version 0.21.0 \
46+
--namespace toolhive-system --create-namespace --wait
47+
48+
kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.spec.versions[*].name}'
49+
# expected: v1alpha1 (only one version)
50+
51+
kubectl wait --for=condition=Available deployment -n toolhive-system --all --timeout=180s
52+
```
53+
54+
## 2 · Create one CR of each of the 12 kinds at v1alpha1
55+
56+
```bash
57+
kubectl create namespace upgrade-test
58+
kubectl apply -f docs/operator/upgrade-guide/crs-v1alpha1.yaml
59+
60+
# Confirm all 12 landed
61+
kubectl get \
62+
mcpservers,mcpremoteproxies,mcptoolconfigs,mcpgroups,embeddingservers,mcpregistries,virtualmcpservers,virtualmcpcompositetooldefinitions,mcpoidcconfigs,mcptelemetryconfigs,mcpexternalauthconfigs,mcpserverentries \
63+
-n upgrade-test --no-headers | wc -l
64+
# expected: 12
65+
```
66+
67+
## 3 · Let the old operator reconcile + capture the baseline
68+
69+
```bash
70+
sleep 60
71+
kubectl get deployments -n upgrade-test
72+
# expected: up to 5 Deployments — test-server, test-remote-proxy, test-virtual-server,
73+
# test-registry-api, and (sometimes) test-embedding shows as a StatefulSet not a Deployment
74+
75+
# Snapshot the UIDs for later comparison
76+
kubectl get deployments -n upgrade-test \
77+
-o jsonpath='{range .items[*]}{.metadata.name}={.metadata.uid}{"\n"}{end}' \
78+
| sort > /tmp/before-upgrade.txt
79+
cat /tmp/before-upgrade.txt
80+
```
81+
82+
These UIDs are the canary. If they change after the operator upgrade, a workload was recreated → downtime.
83+
84+
## 4 · Upgrade the CRDs chart to multi-version
85+
86+
```bash
87+
helm upgrade toolhive-operator-crds deploy/charts/operator-crds --wait --timeout 120s
88+
89+
kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.spec.versions[*].name}'
90+
# expected: v1alpha1 v1beta1
91+
92+
kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.spec.versions[?(@.storage==true)].name}'
93+
# expected: v1beta1
94+
```
95+
96+
## 5 · Build the new operator + deploy with the migrator DISABLED
97+
98+
This is the key deviation from `task operator-deploy-local` — we want to control the feature flag, so we run ko + helm manually rather than using the task.
99+
100+
```bash
101+
# ~3 minutes on first run, ~30s with build cache
102+
OP=$(KO_DOCKER_REPO=kind.local ko build --local -B ./cmd/thv-operator | tail -1)
103+
PR=$(KO_DOCKER_REPO=kind.local ko build --local -B ./cmd/thv-proxyrunner | tail -1)
104+
VM=$(KO_DOCKER_REPO=kind.local ko build --local -B ./cmd/vmcp | tail -1)
105+
106+
# Load all three into the kind cluster
107+
kind load docker-image --name toolhive "$OP"
108+
kind load docker-image --name toolhive "$PR"
109+
kind load docker-image --name toolhive "$VM"
110+
111+
# Persist for later steps
112+
echo "$OP" > /tmp/img-op
113+
echo "$PR" > /tmp/img-pr
114+
echo "$VM" > /tmp/img-vm
115+
116+
# Helm upgrade with migrator EXPLICITLY DISABLED
117+
helm upgrade toolhive-operator deploy/charts/operator \
118+
--set operator.image="$OP" \
119+
--set operator.toolhiveRunnerImage="$PR" \
120+
--set operator.vmcpImage="$VM" \
121+
--set operator.features.storageVersionMigrator=false \
122+
--namespace toolhive-system
123+
124+
kubectl rollout status deployment -n toolhive-system --timeout=180s
125+
126+
# Confirm flag took effect
127+
NEW_POD=$(kubectl get pods -n toolhive-system --no-headers | grep "toolhive-operator-" | awk '{print $1}' | head -1)
128+
kubectl get pod "$NEW_POD" -n toolhive-system -o yaml | grep -A1 ENABLE_STORAGE_VERSION_MIGRATOR
129+
# expected: value: "false"
130+
131+
kubectl logs "$NEW_POD" -n toolhive-system | grep "ENABLE_STORAGE_VERSION_MIGRATOR is disabled"
132+
# expected: one line — "ENABLE_STORAGE_VERSION_MIGRATOR is disabled, skipping StorageVersionMigrator controller"
133+
```
134+
135+
## 6 · Verify zero downtime — Deployment UIDs unchanged
136+
137+
```bash
138+
kubectl get deployments -n upgrade-test \
139+
-o jsonpath='{range .items[*]}{.metadata.name}={.metadata.uid}{"\n"}{end}' \
140+
| sort > /tmp/after-upgrade.txt
141+
142+
diff /tmp/before-upgrade.txt /tmp/after-upgrade.txt && echo "All Deployment UIDs unchanged"
143+
```
144+
145+
## 7 · Re-apply all 12 CRs at v1beta1 (the user migration step)
146+
147+
```bash
148+
kubectl apply -f docs/operator/upgrade-guide/crs-v1beta1.yaml
149+
# expected: 12 "configured" lines (not "created")
150+
151+
# Wait for the operator to observe each update
152+
sleep 10
153+
```
154+
155+
## 8 · Confirm storedVersions is stuck at `[v1alpha1, v1beta1]` on all 12 CRDs (migrator is OFF)
156+
157+
```bash
158+
echo "=== storedVersions on ALL 12 CRDs (migrator OFF) ==="
159+
for crd in $(kubectl get crd -o name | grep toolhive.stacklok.dev | sort); do
160+
name=$(echo $crd | sed 's|.*/||')
161+
stored=$(kubectl get $crd -o jsonpath='{.status.storedVersions}')
162+
printf " %-55s %s\n" "$name" "$stored"
163+
done
164+
```
165+
166+
Expected: every line ends with `["v1alpha1","v1beta1"]`. This is the "post-graduation, pre-migration" state every cluster lands in after the v0.21.0 → multi-version upgrade. **Without the migrator, this state is permanent** — any future operator release that drops `v1alpha1` from `spec.versions` would fail with:
167+
168+
```
169+
status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions
170+
```
171+
172+
## 9 · Enable the migrator + watch it converge
173+
174+
Helm upgrade to flip the feature flag:
175+
176+
```bash
177+
OP=$(cat /tmp/img-op)
178+
PR=$(cat /tmp/img-pr)
179+
VM=$(cat /tmp/img-vm)
180+
181+
helm upgrade toolhive-operator deploy/charts/operator \
182+
--set operator.image="$OP" \
183+
--set operator.toolhiveRunnerImage="$PR" \
184+
--set operator.vmcpImage="$VM" \
185+
--namespace toolhive-system
186+
187+
kubectl rollout status deployment -n toolhive-system --timeout=180s
188+
189+
# Confirm flag is now true
190+
NEW_POD=$(kubectl get pods -n toolhive-system --no-headers | grep "toolhive-operator-" | awk '{print $1}' | head -1)
191+
kubectl get pod "$NEW_POD" -n toolhive-system -o yaml | grep -A1 ENABLE_STORAGE_VERSION_MIGRATOR
192+
# expected: value: "true"
193+
```
194+
195+
Wait for convergence:
196+
197+
```bash
198+
echo "=== waiting up to 60s for all 12 CRDs to converge ==="
199+
for i in $(seq 1 60); do
200+
count=$(for c in $(kubectl get crd -o name | grep toolhive.stacklok.dev); do
201+
kubectl get $c -o jsonpath='{.status.storedVersions}'
202+
echo
203+
done | grep -c '^\["v1beta1"\]$')
204+
if [ "$count" = "12" ]; then
205+
echo "All 12 CRDs converged after ${i}s"
206+
break
207+
fi
208+
sleep 1
209+
done
210+
```
211+
212+
In practice this completes in ~1-2 seconds once the new pod is ready.
213+
214+
Verify:
215+
216+
```bash
217+
echo "=== storedVersions on ALL 12 CRDs (migrator ON, post-converge) ==="
218+
for crd in $(kubectl get crd -o name | grep toolhive.stacklok.dev | sort); do
219+
name=$(echo $crd | sed 's|.*/||')
220+
stored=$(kubectl get $crd -o jsonpath='{.status.storedVersions}')
221+
printf " %-55s %s\n" "$name" "$stored"
222+
done
223+
# expected: every line ends with ["v1beta1"]
224+
225+
echo "=== StorageVersionMigrationSucceeded events ==="
226+
kubectl get events -A --field-selector reason=StorageVersionMigrationSucceeded --no-headers | wc -l
227+
# expected: 12 — one event per CRD
228+
229+
echo "=== StorageVersionMigrationFailed events (should be 0) ==="
230+
kubectl get events -A --field-selector reason=StorageVersionMigrationFailed --no-headers | wc -l
231+
# expected: 0
232+
```
233+
234+
Confirm CRs still healthy (admission webhooks on MCPServer / MCPGroup / VirtualMCPServer all accepted the per-CR Updates):
235+
236+
```bash
237+
kubectl get \
238+
mcpservers,mcpremoteproxies,mcptoolconfigs,mcpgroups,embeddingservers,mcpregistries,virtualmcpservers,virtualmcpcompositetooldefinitions,mcpoidcconfigs,mcptelemetryconfigs,mcpexternalauthconfigs,mcpserverentries \
239+
-n upgrade-test --no-headers | wc -l
240+
# expected: 12
241+
```
242+
243+
## 10 · (Optional) Inspect migrator logs
244+
245+
```bash
246+
NEW_POD=$(kubectl get pods -n toolhive-system --no-headers | grep "toolhive-operator-" | awk '{print $1}' | head -1)
247+
kubectl logs "$NEW_POD" -n toolhive-system | grep "storage version migration complete" | wc -l
248+
# expected: 12 lines — one per CRD
249+
```
250+
251+
**If this prints 0**, the migration may have happened in a previous container instance (the operator pod can restart for unrelated reasons in a kind cluster — leases, OOM, etc.). Try the previous container's logs:
252+
253+
```bash
254+
kubectl logs "$NEW_POD" -n toolhive-system --previous | grep "storage version migration complete" | wc -l
255+
```
256+
257+
The migration is complete in either case — the events on the CRDs in step 9 are the authoritative signal.
258+
259+
## 11 · (Optional) Simulate the next release that drops v1alpha1
260+
261+
Once `storedVersions` is `[v1beta1]` everywhere, the apiserver will accept removal of `v1alpha1` from `spec.versions` — the safety interlock the migrator exists to satisfy. To demonstrate:
262+
263+
```bash
264+
# Direct apiserver patch — the same end state a future "drop v1alpha1" chart would produce.
265+
for crd in $(kubectl get crd -o name | grep toolhive.stacklok.dev); do
266+
name=$(echo $crd | sed 's|.*/||')
267+
newversions=$(kubectl get $crd -o json | jq '.spec.versions | map(select(.name != "v1alpha1"))')
268+
patch=$(jq -n --argjson v "$newversions" '{spec:{versions:$v}}')
269+
printf " %-55s " "$name"
270+
kubectl patch $crd --type=merge -p "$patch" 2>&1 | tail -1
271+
done
272+
```
273+
274+
Every line should say `... patched`. Verify v1alpha1 access now fails:
275+
276+
```bash
277+
kubectl get mcpgroups.v1alpha1.toolhive.stacklok.dev test-group -n upgrade-test
278+
# expected: error — the server doesn't have a resource type "mcpgroups"
279+
280+
kubectl get mcpgroups.v1beta1.toolhive.stacklok.dev test-group -n upgrade-test
281+
# expected: the resource
282+
```
283+
284+
**Negative test**: if you skip step 9 (the migrator pass) and jump straight to step 11, every `kubectl patch` will fail with:
285+
286+
```
287+
Invalid value: "v1alpha1": must appear in spec.versions
288+
```
289+
290+
That's the apiserver wall the migrator exists to clear.
291+
292+
## 12 · Cleanup
293+
294+
```bash
295+
kind delete cluster --name toolhive
296+
rm -f kconfig.yaml /tmp/before-upgrade.txt /tmp/after-upgrade.txt \
297+
/tmp/img-op /tmp/img-pr /tmp/img-vm
298+
```
299+
300+
---
301+
302+
## Summary of what's verified
303+
304+
| Check | Validates | Step |
305+
|---|---|---|
306+
| Operator startup logs show migrator skipped when disabled | The `setupStorageVersionMigrator` branch in `app/app.go` honours `ENABLE_STORAGE_VERSION_MIGRATOR=false` | 5 |
307+
| Zero-downtime upgrade across operator chart upgrade | The PR's operator changes don't recreate any managed workload | 6 |
308+
| `storedVersions` is `[v1alpha1, v1beta1]` after re-apply with migrator OFF | Migrator did not run; baseline state is correctly preserved | 8 |
309+
| Helm-upgrade flip enables the migrator | Feature flag round-trips correctly | 9 |
310+
| `storedVersions` converges to `[v1beta1]` on all 12 CRDs | The actual migration logic works against real ToolHive CRDs | 9 |
311+
| 12 `StorageVersionMigrationSucceeded` events on the CRDs | The Recorder is correctly wired and per-CRD migrations are observable | 9 |
312+
| 0 `StorageVersionMigrationFailed` events | No CRD's per-CR Update loop returned a non-retriable error | 9 |
313+
| All 12 CRs still healthy after migration | Validating admission webhooks (MCPServer / MCPGroup / VirtualMCPServer) tolerate the per-CR Updates | 9 |
314+
| RBAC permits storedVersions trim | The ClusterRole has the correct verbs for `customresourcedefinitions/status` and `*.toolhive.stacklok.dev/*` | implicit — trim succeeds in step 9 |
315+
| Apiserver permits `v1alpha1` removal from `spec.versions` once `storedVersions` is clean | The deprecation chain end-to-end works | 11 |
316+
317+
## What this does NOT cover
318+
319+
- **Mid-migration crash recovery**: no test forces the operator to crash mid-loop. Envtest covers the conflict-handling and re-encode-failure paths separately.
320+
- **Pagination at real-cluster scale**: 12 CRs is well below the 500-default page size. The envtest suite explicitly tests the continue-token loop with 7 CRs at `PageSize=3`.
321+
- **Operator restart resilience under load**: kind clusters are resource-limited and the operator pod sometimes restarts during tests for unrelated reasons (the `--previous` log fallback in step 10 covers this).
322+
323+
These are covered by the envtest suite at `cmd/thv-operator/test-integration/storageversionmigrator/`. This walkthrough covers what envtest can't: helm chart wiring, real ToolHive CRD schemas with their actual webhooks, and the full upgrade sequence a real user would run.

0 commit comments

Comments
 (0)