| 1 | +# Upgrade walkthrough — v1alpha1 → v1beta1 with StorageVersionMigrator |
| 2 | + |
| 3 | +End-to-end manual walkthrough that replays a real user upgrade against a local kind cluster: install the previous v0.21.0 release, create a CR of each of the 12 toolhive CRD kinds at `v1alpha1`, upgrade to the new multi-version chart, deploy this branch's operator with the migrator **disabled**, re-apply the CRs at `v1beta1`, and confirm `status.storedVersions` is stuck at `[v1alpha1, v1beta1]` on every CRD. Then enable the `StorageVersionMigrator` and confirm it converges every CRD to `[v1beta1]`. |
| 4 | + |
| 5 | +Total run time: ~30 minutes. The slow part is the first `ko build` of the operator + proxyrunner + vmcp images (~3 min); subsequent runs use the build cache and finish in ~30s. |
| 6 | + |
| 7 | +This guide is the canonical reproducible verification for the migrator. Companion reading: |
| 8 | + |
| 9 | +- [`docs/operator/storage-version-migration.md`](../storage-version-migration.md) — reference docs for the controller itself (label contract, opt-out, mechanism). |
| 10 | +- [Issue #4969](https://github.com/stacklok/toolhive/issues/4969) — the motivating problem. |
| 11 | + |
| 12 | +## Prerequisites |
| 13 | + |
| 14 | +- `kind`, `kubectl`, `helm`, `ko`, `task`, and `jq` on PATH (`go install github.com/google/ko@latest` installs ko; `jq` is only needed for the optional step 11) |
| 15 | +- Working directory: the repo root (`task operator-deploy-local` and the relative chart paths assume this). |
| 16 | +- Cluster name is `toolhive`. If you already have a cluster with that name, delete it first or run from a different worktree. |
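
A preflight check can catch a missing tool before the first slow step. The sketch below is only an illustration: `missing_tools` is a hypothetical helper name, not something shipped in the repo, and it relies on nothing beyond the shell builtin `command -v`.

```bash
# Hypothetical helper (not part of the repo): print the name of every
# required tool that is not on PATH. Prints nothing and returns 0 when
# all tools are present; returns 1 if at least one is missing.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (run before step 0):
#   missing_tools kind kubectl helm ko task jq || echo "install the tools listed above first"
```

Listing every missing tool in one pass, rather than stopping at the first, avoids repeated install-and-retry cycles.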
| 17 | + |
| 18 | +The CR fixtures used below ship alongside this doc: |
| 19 | + |
| 20 | +- [`crs-v1alpha1.yaml`](./crs-v1alpha1.yaml) — one CR of each of the 12 kinds at `v1alpha1` |
| 21 | +- [`crs-v1beta1.yaml`](./crs-v1beta1.yaml) — byte-identical to the v1alpha1 file except for `apiVersion`, simulating what `sed -i 's/v1alpha1/v1beta1/g'` would produce |
| 22 | + |
| 23 | +--- |
| 24 | + |
| 25 | +## 0 · Set up the cluster |
| 26 | + |
| 27 | +```bash |
| 28 | +# If you already have a "toolhive" kind cluster from a previous run, delete it |
| 29 | +kind delete cluster --name toolhive 2>/dev/null |
| 30 | + |
| 31 | +kind create cluster --name toolhive --wait 60s |
| 32 | +kind get kubeconfig --name toolhive > kconfig.yaml |
| 33 | +export KUBECONFIG="$(pwd)/kconfig.yaml" |
| 34 | +``` |
| 35 | + |
| 36 | +## 1 · Install v0.21.0 (the last v1alpha1-only release) |
| 37 | + |
| 38 | +```bash |
| 39 | +helm install toolhive-operator-crds \ |
| 40 | + oci://ghcr.io/stacklok/toolhive/toolhive-operator-crds \ |
| 41 | + --version 0.21.0 --wait |
| 42 | + |
| 43 | +helm install toolhive-operator \ |
| 44 | + oci://ghcr.io/stacklok/toolhive/toolhive-operator \ |
| 45 | + --version 0.21.0 \ |
| 46 | + --namespace toolhive-system --create-namespace --wait |
| 47 | + |
| 48 | +kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.spec.versions[*].name}' |
| 49 | +# expected: v1alpha1 (only one version) |
| 50 | + |
| 51 | +kubectl wait --for=condition=Available deployment -n toolhive-system --all --timeout=180s |
| 52 | +``` |
| 53 | + |
| 54 | +## 2 · Create one CR of each of the 12 kinds at v1alpha1 |
| 55 | + |
| 56 | +```bash |
| 57 | +kubectl create namespace upgrade-test |
| 58 | +kubectl apply -f docs/operator/upgrade-guide/crs-v1alpha1.yaml |
| 59 | + |
| 60 | +# Confirm all 12 landed |
| 61 | +kubectl get \ |
| 62 | + mcpservers,mcpremoteproxies,mcptoolconfigs,mcpgroups,embeddingservers,mcpregistries,virtualmcpservers,virtualmcpcompositetooldefinitions,mcpoidcconfigs,mcptelemetryconfigs,mcpexternalauthconfigs,mcpserverentries \ |
| 63 | + -n upgrade-test --no-headers | wc -l |
| 64 | +# expected: 12 |
| 65 | +``` |
| 66 | + |
| 67 | +## 3 · Let the old operator reconcile + capture the baseline |
| 68 | + |
| 69 | +```bash |
| 70 | +sleep 60 |
| 71 | +kubectl get deployments -n upgrade-test |
| 72 | +# expected: up to 5 workloads: Deployments test-server, test-remote-proxy, test-virtual-server, |
| 73 | +# and test-registry-api, plus test-embedding, which may be a StatefulSet and is then not listed here |
| 74 | + |
| 75 | +# Snapshot the UIDs for later comparison |
| 76 | +kubectl get deployments -n upgrade-test \ |
| 77 | + -o jsonpath='{range .items[*]}{.metadata.name}={.metadata.uid}{"\n"}{end}' \ |
| 78 | + | sort > /tmp/before-upgrade.txt |
| 79 | +cat /tmp/before-upgrade.txt |
| 80 | +``` |
| 81 | + |
| 82 | +These UIDs are the canary. If they change after the operator upgrade, a workload was recreated → downtime. |
| 83 | + |
| 84 | +## 4 · Upgrade the CRDs chart to multi-version |
| 85 | + |
| 86 | +```bash |
| 87 | +helm upgrade toolhive-operator-crds deploy/charts/operator-crds --wait --timeout 120s |
| 88 | + |
| 89 | +kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.spec.versions[*].name}' |
| 90 | +# expected: v1alpha1 v1beta1 |
| 91 | + |
| 92 | +kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.spec.versions[?(@.storage==true)].name}' |
| 93 | +# expected: v1beta1 |
| 94 | +``` |
| 95 | + |
| 96 | +## 5 · Build the new operator + deploy with the migrator DISABLED |
| 97 | + |
| 98 | +This is the key deviation from `task operator-deploy-local` — we want to control the feature flag, so we run ko + helm manually rather than using the task. |
| 99 | + |
| 100 | +```bash |
| 101 | +# ~3 minutes on first run, ~30s with build cache |
| 102 | +OP=$(KO_DOCKER_REPO=kind.local ko build --local -B ./cmd/thv-operator | tail -1) |
| 103 | +PR=$(KO_DOCKER_REPO=kind.local ko build --local -B ./cmd/thv-proxyrunner | tail -1) |
| 104 | +VM=$(KO_DOCKER_REPO=kind.local ko build --local -B ./cmd/vmcp | tail -1) |
| 105 | + |
| 106 | +# Load all three into the kind cluster |
| 107 | +kind load docker-image --name toolhive "$OP" |
| 108 | +kind load docker-image --name toolhive "$PR" |
| 109 | +kind load docker-image --name toolhive "$VM" |
| 110 | + |
| 111 | +# Persist for later steps |
| 112 | +echo "$OP" > /tmp/img-op |
| 113 | +echo "$PR" > /tmp/img-pr |
| 114 | +echo "$VM" > /tmp/img-vm |
| 115 | + |
| 116 | +# Helm upgrade with migrator EXPLICITLY DISABLED |
| 117 | +helm upgrade toolhive-operator deploy/charts/operator \ |
| 118 | + --set operator.image="$OP" \ |
| 119 | + --set operator.toolhiveRunnerImage="$PR" \ |
| 120 | + --set operator.vmcpImage="$VM" \ |
| 121 | + --set operator.features.storageVersionMigrator=false \ |
| 122 | + --namespace toolhive-system |
| 123 | + |
| 124 | +kubectl rollout status deployment -n toolhive-system --timeout=180s |
| 125 | + |
| 126 | +# Confirm flag took effect |
| 127 | +NEW_POD=$(kubectl get pods -n toolhive-system --no-headers | grep "toolhive-operator-" | awk '{print $1}' | head -1) |
| 128 | +kubectl get pod "$NEW_POD" -n toolhive-system -o yaml | grep -A1 ENABLE_STORAGE_VERSION_MIGRATOR |
| 129 | +# expected: value: "false" |
| 130 | + |
| 131 | +kubectl logs "$NEW_POD" -n toolhive-system | grep "ENABLE_STORAGE_VERSION_MIGRATOR is disabled" |
| 132 | +# expected: one line — "ENABLE_STORAGE_VERSION_MIGRATOR is disabled, skipping StorageVersionMigrator controller" |
| 133 | +``` |
| 134 | + |
| 135 | +## 6 · Verify zero downtime — Deployment UIDs unchanged |
| 136 | + |
| 137 | +```bash |
| 138 | +kubectl get deployments -n upgrade-test \ |
| 139 | + -o jsonpath='{range .items[*]}{.metadata.name}={.metadata.uid}{"\n"}{end}' \ |
| 140 | + | sort > /tmp/after-upgrade.txt |
| 141 | + |
| 142 | +diff /tmp/before-upgrade.txt /tmp/after-upgrade.txt && echo "All Deployment UIDs unchanged" |
| 143 | +``` |
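
When the `diff` is not empty, it helps to know which workload was recreated and which one disappeared. The following is a minimal sketch, assuming the `name=uid` snapshot format produced in steps 3 and 6; `report_recreated` is a hypothetical helper name introduced here for illustration.

```bash
# Hypothetical helper: compare two "name=uid" snapshot files and print one
# line per workload whose UID changed (RECREATED) or that no longer exists
# (MISSING). Returns 1 if anything differs, 0 otherwise.
# Workloads that appear only in the second file are ignored.
report_recreated() {
  local before="$1" after="$2"
  awk -F= '
    NR == FNR { old[$1] = $2; next }              # first file: UIDs before the upgrade
    { seen[$1] = 1 }                              # second file: mark every name present
    ($1 in old) && old[$1] != $2 { print "RECREATED " $1; bad = 1 }
    END {
      for (name in old) if (!(name in seen)) { print "MISSING " name; bad = 1 }
      exit bad
    }
  ' "$before" "$after"
}

# Usage:
#   report_recreated /tmp/before-upgrade.txt /tmp/after-upgrade.txt && echo "no workload recreated"
```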
| 144 | + |
| 145 | +## 7 · Re-apply all 12 CRs at v1beta1 (the user migration step) |
| 146 | + |
| 147 | +```bash |
| 148 | +kubectl apply -f docs/operator/upgrade-guide/crs-v1beta1.yaml |
| 149 | +# expected: 12 "configured" lines (not "created") |
| 150 | + |
| 151 | +# Wait for the operator to observe each update |
| 152 | +sleep 10 |
| 153 | +``` |
| 154 | + |
| 155 | +## 8 · Confirm storedVersions is stuck at `[v1alpha1, v1beta1]` on all 12 CRDs (migrator is OFF) |
| 156 | + |
| 157 | +```bash |
| 158 | +echo "=== storedVersions on ALL 12 CRDs (migrator OFF) ===" |
| 159 | +for crd in $(kubectl get crd -o name | grep toolhive.stacklok.dev | sort); do |
| 160 | + name=$(echo $crd | sed 's|.*/||') |
| 161 | + stored=$(kubectl get $crd -o jsonpath='{.status.storedVersions}') |
| 162 | + printf " %-55s %s\n" "$name" "$stored" |
| 163 | +done |
| 164 | +``` |
| 165 | + |
| 166 | +Expected: every line ends with `["v1alpha1","v1beta1"]`. This is the "post-graduation, pre-migration" state every cluster lands in after the v0.21.0 → multi-version upgrade. **Without the migrator, this state is permanent** — any future operator release that drops `v1alpha1` from `spec.versions` would fail with: |
| 167 | + |
| 168 | +``` |
| 169 | +status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions |
| 170 | +``` |
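
To turn the visual check in this step into a pass/fail assertion, the loop output can be piped through a small filter. This is a sketch under stated assumptions: `assert_stored_versions` is a hypothetical name, and it expects the `<crd-name> <storedVersions>` line layout printed by the loop above, with the JSON array containing no spaces.

```bash
# Hypothetical helper: read "<crd-name> <storedVersions>" lines on stdin and
# print every CRD whose value differs from the expected one.
# Lines starting with "===" (the section banners) are skipped.
# Returns 1 if any CRD differs, 0 otherwise.
assert_stored_versions() {
  local expected="$1"
  awk -v want="$expected" '
    /^===/ { next }
    NF >= 2 && $2 != want { print "UNEXPECTED " $1 " " $2; bad = 1 }
    END { exit bad }
  '
}

# Usage: pipe the output of the step 8 loop into
#   assert_stored_versions '["v1alpha1","v1beta1"]'
```

The same filter works after the migrator has run by passing `'["v1beta1"]'` as the expected value.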
| 171 | + |
| 172 | +## 9 · Enable the migrator + watch it converge |
| 173 | + |
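| 174 | +Run the Helm upgrade again, this time without the `operator.features.storageVersionMigrator=false` override. Because `--set` values are passed and `--reuse-values` is not, Helm resets to the chart defaults, so the flag falls back to its default, which this walkthrough expects to be enabled: |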
| 175 | + |
| 176 | +```bash |
| 177 | +OP=$(cat /tmp/img-op) |
| 178 | +PR=$(cat /tmp/img-pr) |
| 179 | +VM=$(cat /tmp/img-vm) |
| 180 | + |
| 181 | +helm upgrade toolhive-operator deploy/charts/operator \ |
| 182 | + --set operator.image="$OP" \ |
| 183 | + --set operator.toolhiveRunnerImage="$PR" \ |
| 184 | + --set operator.vmcpImage="$VM" \ |
| 185 | + --namespace toolhive-system |
| 186 | + |
| 187 | +kubectl rollout status deployment -n toolhive-system --timeout=180s |
| 188 | + |
| 189 | +# Confirm flag is now true |
| 190 | +NEW_POD=$(kubectl get pods -n toolhive-system --no-headers | grep "toolhive-operator-" | awk '{print $1}' | head -1) |
| 191 | +kubectl get pod "$NEW_POD" -n toolhive-system -o yaml | grep -A1 ENABLE_STORAGE_VERSION_MIGRATOR |
| 192 | +# expected: value: "true" |
| 193 | +``` |
| 194 | + |
| 195 | +Wait for convergence: |
| 196 | + |
| 197 | +```bash |
| 198 | +echo "=== waiting up to 60s for all 12 CRDs to converge ===" |
| 199 | +for i in $(seq 1 60); do |
| 200 | + count=$(for c in $(kubectl get crd -o name | grep toolhive.stacklok.dev); do |
| 201 | + kubectl get $c -o jsonpath='{.status.storedVersions}' |
| 202 | + echo |
| 203 | + done | grep -c '^\["v1beta1"\]$') |
| 204 | + if [ "$count" = "12" ]; then |
| 205 | + echo "All 12 CRDs converged after ${i}s" |
| 206 | + break |
| 207 | + fi |
| 208 | + sleep 1 |
| 209 | +done |
| 210 | +``` |
| 211 | + |
| 212 | +In practice this completes in ~1-2 seconds once the new pod is ready. |
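
The loop above exits silently if the CRDs never converge. A variant that reports the timeout explicitly might look like the sketch below; `wait_for_converged` is a hypothetical function name, and the only `kubectl` calls it makes are the two already used in this walkthrough.

```bash
# Hypothetical helper: poll until every ToolHive CRD reports
# storedVersions == ["v1beta1"]. Returns 0 on convergence and 1 if the
# timeout (in polling rounds of ~1s, default 60) expires.
# Note: if no ToolHive CRDs are installed, this reports convergence vacuously.
wait_for_converged() {
  local timeout="${1:-60}" i crd pending
  for i in $(seq 1 "$timeout"); do
    pending=0
    for crd in $(kubectl get crd -o name | grep 'toolhive\.stacklok\.dev'); do
      [ "$(kubectl get "$crd" -o jsonpath='{.status.storedVersions}')" = '["v1beta1"]' ] \
        || pending=$((pending + 1))
    done
    if [ "$pending" -eq 0 ]; then
      echo "converged after ~${i}s"
      return 0
    fi
    sleep 1
  done
  echo "timed out: ${pending} CRD(s) not converged" >&2
  return 1
}

# Usage:
#   wait_for_converged 60 || kubectl get events -A --field-selector reason=StorageVersionMigrationFailed
```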
| 213 | + |
| 214 | +Verify: |
| 215 | + |
| 216 | +```bash |
| 217 | +echo "=== storedVersions on ALL 12 CRDs (migrator ON, post-converge) ===" |
| 218 | +for crd in $(kubectl get crd -o name | grep toolhive.stacklok.dev | sort); do |
| 219 | + name=$(echo $crd | sed 's|.*/||') |
| 220 | + stored=$(kubectl get $crd -o jsonpath='{.status.storedVersions}') |
| 221 | + printf " %-55s %s\n" "$name" "$stored" |
| 222 | +done |
| 223 | +# expected: every line ends with ["v1beta1"] |
| 224 | + |
| 225 | +echo "=== StorageVersionMigrationSucceeded events ===" |
| 226 | +kubectl get events -A --field-selector reason=StorageVersionMigrationSucceeded --no-headers | wc -l |
| 227 | +# expected: 12 — one event per CRD |
| 228 | + |
| 229 | +echo "=== StorageVersionMigrationFailed events (should be 0) ===" |
| 230 | +kubectl get events -A --field-selector reason=StorageVersionMigrationFailed --no-headers | wc -l |
| 231 | +# expected: 0 |
| 232 | +``` |
| 233 | + |
| 234 | +Confirm CRs still healthy (admission webhooks on MCPServer / MCPGroup / VirtualMCPServer all accepted the per-CR Updates): |
| 235 | + |
| 236 | +```bash |
| 237 | +kubectl get \ |
| 238 | + mcpservers,mcpremoteproxies,mcptoolconfigs,mcpgroups,embeddingservers,mcpregistries,virtualmcpservers,virtualmcpcompositetooldefinitions,mcpoidcconfigs,mcptelemetryconfigs,mcpexternalauthconfigs,mcpserverentries \ |
| 239 | + -n upgrade-test --no-headers | wc -l |
| 240 | +# expected: 12 |
| 241 | +``` |
| 242 | + |
| 243 | +## 10 · (Optional) Inspect migrator logs |
| 244 | + |
| 245 | +```bash |
| 246 | +NEW_POD=$(kubectl get pods -n toolhive-system --no-headers | grep "toolhive-operator-" | awk '{print $1}' | head -1) |
| 247 | +kubectl logs "$NEW_POD" -n toolhive-system | grep "storage version migration complete" | wc -l |
| 248 | +# expected: 12 lines — one per CRD |
| 249 | +``` |
| 250 | + |
| 251 | +**If this prints 0**, the migration may have happened in a previous container instance (the operator pod can restart for unrelated reasons in a kind cluster — leases, OOM, etc.). Try the previous container's logs: |
| 252 | + |
| 253 | +```bash |
| 254 | +kubectl logs "$NEW_POD" -n toolhive-system --previous | grep "storage version migration complete" | wc -l |
| 255 | +``` |
| 256 | + |
| 257 | +The migration is complete in either case — the events on the CRDs in step 9 are the authoritative signal. |
| 258 | + |
| 259 | +## 11 · (Optional) Simulate the next release that drops v1alpha1 |
| 260 | + |
| 261 | +Once `storedVersions` is `[v1beta1]` everywhere, the apiserver will accept removal of `v1alpha1` from `spec.versions` — the safety interlock the migrator exists to satisfy. To demonstrate: |
| 262 | + |
| 263 | +```bash |
| 264 | +# Direct apiserver patch — the same end state a future "drop v1alpha1" chart would produce. |
| 265 | +for crd in $(kubectl get crd -o name | grep toolhive.stacklok.dev); do |
| 266 | + name=$(echo $crd | sed 's|.*/||') |
| 267 | + newversions=$(kubectl get $crd -o json | jq '.spec.versions | map(select(.name != "v1alpha1"))') |
| 268 | + patch=$(jq -n --argjson v "$newversions" '{spec:{versions:$v}}') |
| 269 | + printf " %-55s " "$name" |
| 270 | + kubectl patch $crd --type=merge -p "$patch" 2>&1 | tail -1 |
| 271 | +done |
| 272 | +``` |
| 273 | + |
| 274 | +Every line should say `... patched`. Verify v1alpha1 access now fails: |
| 275 | + |
| 276 | +```bash |
| 277 | +kubectl get mcpgroups.v1alpha1.toolhive.stacklok.dev test-group -n upgrade-test |
| 278 | +# expected: error — the server doesn't have a resource type "mcpgroups" |
| 279 | + |
| 280 | +kubectl get mcpgroups.v1beta1.toolhive.stacklok.dev test-group -n upgrade-test |
| 281 | +# expected: the resource |
| 282 | +``` |
| 283 | + |
| 284 | +**Negative test**: if you skip step 9 (the migrator pass) and jump straight to step 11, every `kubectl patch` will fail with: |
| 285 | + |
| 286 | +``` |
| 287 | +Invalid value: "v1alpha1": must appear in spec.versions |
| 288 | +``` |
| 289 | + |
| 290 | +That's the apiserver wall the migrator exists to clear. |
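
The same rule can be checked client-side before attempting the patch. The sketch below is illustrative only: `can_drop_version` is a hypothetical helper that inspects the `storedVersions` JSON array as a plain string, mirroring the apiserver requirement that every stored version must still appear in `spec.versions`.

```bash
# Hypothetical helper: return 0 if VERSION is absent from a storedVersions
# JSON array such as '["v1alpha1","v1beta1"]' (so removing it from
# spec.versions would not trip the storedVersions check), 1 otherwise.
can_drop_version() {
  local stored="$1" version="$2"
  case "$stored" in
    *"\"$version\""*) return 1 ;;   # still listed: objects may be stored at this version
    *) return 0 ;;
  esac
}

# Usage:
#   stored=$(kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.status.storedVersions}')
#   can_drop_version "$stored" v1alpha1 || echo "run the migrator first"
```

Matching on the quoted version name keeps `v1alpha1` from matching a longer name that merely starts with the same characters.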
| 291 | + |
| 292 | +## 12 · Cleanup |
| 293 | + |
| 294 | +```bash |
| 295 | +kind delete cluster --name toolhive |
| 296 | +rm -f kconfig.yaml /tmp/before-upgrade.txt /tmp/after-upgrade.txt \ |
| 297 | + /tmp/img-op /tmp/img-pr /tmp/img-vm |
| 298 | +``` |
| 299 | + |
| 300 | +--- |
| 301 | + |
| 302 | +## Summary of what's verified |
| 303 | + |
| 304 | +| Check | Validates | Step | |
| 305 | +|---|---|---| |
| 306 | +| Operator startup logs show migrator skipped when disabled | The `setupStorageVersionMigrator` branch in `app/app.go` honours `ENABLE_STORAGE_VERSION_MIGRATOR=false` | 5 | |
| 307 | +| Deployment UIDs unchanged across the operator chart upgrade (zero downtime) | The PR's operator changes don't recreate any managed workload | 6 |
| 308 | +| `storedVersions` is `[v1alpha1, v1beta1]` after re-apply with migrator OFF | Migrator did not run; baseline state is correctly preserved | 8 | |
| 309 | +| Helm-upgrade flip enables the migrator | Feature flag round-trips correctly | 9 | |
| 310 | +| `storedVersions` converges to `[v1beta1]` on all 12 CRDs | The actual migration logic works against real ToolHive CRDs | 9 | |
| 311 | +| 12 `StorageVersionMigrationSucceeded` events on the CRDs | The Recorder is correctly wired and per-CRD migrations are observable | 9 | |
| 312 | +| 0 `StorageVersionMigrationFailed` events | No CRD's per-CR Update loop returned a non-retriable error | 9 | |
| 313 | +| All 12 CRs still healthy after migration | Validating admission webhooks (MCPServer / MCPGroup / VirtualMCPServer) tolerate the per-CR Updates | 9 | |
| 314 | +| RBAC permits storedVersions trim | The ClusterRole has the correct verbs for `customresourcedefinitions/status` and `*.toolhive.stacklok.dev/*` | implicit — trim succeeds in step 9 | |
| 315 | +| Apiserver permits `v1alpha1` removal from `spec.versions` once `storedVersions` is clean | The deprecation chain end-to-end works | 11 | |
| 316 | + |
| 317 | +## What this does NOT cover |
| 318 | + |
| 319 | +- **Mid-migration crash recovery**: no test forces the operator to crash mid-loop. Envtest covers the conflict-handling and re-encode-failure paths separately. |
| 320 | +- **Pagination at real-cluster scale**: 12 CRs is well below the default page size of 500. The envtest suite explicitly tests the continue-token loop with 7 CRs at `PageSize=3`. |
| 321 | +- **Operator restart resilience under load**: kind clusters are resource-limited and the operator pod sometimes restarts during tests for unrelated reasons (the `--previous` log fallback in step 10 covers this). |
| 322 | + |
| 323 | +These are covered by the envtest suite at `cmd/thv-operator/test-integration/storageversionmigrator/`. This walkthrough covers what envtest can't: helm chart wiring, real ToolHive CRD schemas with their actual webhooks, and the full upgrade sequence a real user would run. |