diff --git a/docs/operator/storage-version-migration.md b/docs/operator/storage-version-migration.md new file mode 100644 index 0000000000..36e4cc6c84 --- /dev/null +++ b/docs/operator/storage-version-migration.md @@ -0,0 +1,134 @@ +# Storage Version Migration + +The ToolHive operator ships a `StorageVersionMigrator` controller that keeps every ToolHive CRD's `status.storedVersions` list clean, so a future operator release can drop deprecated API versions (e.g. `v1alpha1`) without orphaning objects in etcd. + +The controller is **opt-in** via the `operator.features.storageVersionMigrator` chart value (or the `TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR` environment variable if you inject it directly via `operator.env`). + +## Why this exists + +When a CRD graduates from, say, `v1alpha1` to `v1beta1` with both versions served and `v1beta1` as the storage version, existing objects continue to work — they are transparently converted on read/write. But the API server records every version that has ever been used for storage in `CustomResourceDefinition.status.storedVersions`. Until that list is trimmed, the API server refuses to let you remove a version from `spec.versions`, because doing so would orphan any etcd-stored objects encoded at that version. + +The cleanup is not automatic. Someone has to re-store every existing object at the current storage version, then explicitly patch `status.storedVersions` to drop the old entry. The `StorageVersionMigrator` controller does this for you, on every opted-in ToolHive CRD, continuously. See [upstream Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/) for the underlying mechanism. + +## How it works + +For each opted-in CRD the controller does: + +1. Live-reads the CRD via `APIReader` (bypassing the informer cache, so it sees the current `storedVersions`). +2. Reads `spec.versions` to find the entry with `storage: true`. +3. If `status.storedVersions` already equals `[]`, exits — nothing to do. +4. Otherwise, paginates through every Custom Resource of the kind and issues a plain `Get` + `Update` against the main resource. The API server re-encodes the request body at the current storage version and compares the resulting bytes to what's in etcd. If they differ, etcd is rewritten at the new storage version. If they match (already migrated), the API server elides the write. +5. Once every CR has been processed without errors, patches `CRD.status.storedVersions` to `[]` using an optimistic-lock merge — so concurrent API-server writes cause a clean retry rather than a silent overwrite. + +### Concurrent-write safety + +The migrator handles conflicts conservatively. If a per-CR `Update` returns `IsConflict` (another writer raced), the controller retries the per-CR Get + Update up to three attempts. If every retry conflicts, the migration pass returns a sentinel error and the controller requeues itself after 30 seconds without bumping the exponential backoff. `status.storedVersions` is only trimmed when every CR in a pass was successfully re-stored. + +This is the upstream `kube-storage-version-migrator` semantics — a Conflict means another writer succeeded, which itself re-encoded the CR at the storage version, so the migration is effectively done for that CR even if our own Update didn't land. + +### Admission webhook interaction + +Each per-CR `Update` goes through the API server's admission chain (mutating then validating webhooks) before reaching the storage-encoder elision check. **Only the etcd write and watch fanout are elided** when the encoded bytes match what's already stored — admission webhooks fire on every Update regardless. + +For ToolHive's own webhooks this is fine; they only reject changes that break spec invariants, and a same-spec round-trip Update cannot trigger those rejections. **If you run a cluster-wide admission policy engine** like Kyverno, Gatekeeper, or OPA, check that your policies don't reject same-spec round-trip Updates of ToolHive CRs before enabling the migrator. A policy that requires a `lastUpdatedBy` annotation, for example, would reject every migrator Update and the controller would never converge. + +## The opt-in label + +A CRD participates in migration only if it carries: + +```yaml +metadata: + labels: + toolhive.stacklok.dev/auto-migrate-storage-version: "true" +``` + +The label is set at CRD-generation time via a kubebuilder marker on each Go root type: + +```go +//+kubebuilder:object:root=true +//+kubebuilder:storageversion +//+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true +type MCPServer struct { ... } +``` + +`task operator-manifests` bakes the label into the generated CRD YAML. Every ToolHive CRD that carries `+kubebuilder:storageversion` ships with the marker today. + +A CI test (`TestStorageVersionRootMarkerCoverage` in `cmd/thv-operator/controllers/marker_coverage_test.go`) fails the build if a root type is added without the migrate marker. There is no self-serve "exclude" marker — every storage-version root type must opt in. If a future CRD legitimately should not be auto-migrated, the path is to add an entry to the `excludedRootTypes` allowlist in the test file via PR review, not via a self-serve marker. + +### Adding a new CRD + +Add the marker to the type alongside the existing kubebuilder markers: + +```go +//+kubebuilder:object:root=true +//+kubebuilder:storageversion +//+kubebuilder:subresource:status +//+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true +type NewShinyThing struct { ... } +``` + +Then `task operator-manifests` to regenerate the CRD YAML. CI verifies the marker is present. + +## Enabling the controller + +Set the Helm feature flag at install or upgrade time: + +```yaml +operator: + features: + storageVersionMigrator: true +``` + +This sets `TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR=true` on the operator Deployment and registers the reconciler with the manager. + +Once enabled, the controller is dormant on CRDs whose `storedVersions` already equals `[]` — most of the time, most CRDs. It only does meaningful work when a CRD's stored-versions list is dirty (typically right after a graduation release). + +## Per-CRD emergency escape hatch + +Removing the label on a live cluster excludes that single CRD from migration immediately: + +```bash +kubectl label crd/mcpservers.toolhive.stacklok.dev \ + toolhive.stacklok.dev/auto-migrate-storage-version- +``` + +Intended for incident response only. If you deploy the operator via GitOps (Argo CD, Flux) or `helm upgrade`, the chart will re-apply the chart-set label within seconds. For a long-term per-cluster opt-out, leave `storageVersionMigrator: false` and accept that you will need to handle storage-version cleanup yourself before any version-removal release. + +## Interaction with version-removal releases + +The `StorageVersionMigrator` must have run against your cluster *before* an operator release that drops a deprecated CRD version ships. The typical sequence is: + +1. **Release N**: both versions served, newer version is storage, `StorageVersionMigrator` available (opt-in via the chart flag). Operators that enable the migrator have their `storedVersions` trimmed during this deprecation window. +2. **Release N+1+**: the deprecated version is removed from `spec.versions`. Because every cluster that enabled the migrator already has clean `storedVersions`, the CRD update applies. + +> **⚠ Skip-a-version upgrade trap.** If your cluster upgrades directly from a pre-migrator release to the version-removal release without ever running an intermediate release that runs the migrator, the helm upgrade will **fail** when the API server refuses to remove the deprecated version from `spec.versions`. To recover: deploy [kube-storage-version-migrator](https://github.com/kubernetes-sigs/kube-storage-version-migrator) once to clean `storedVersions`, then retry the upgrade. To avoid the trap entirely, install each release in sequence, and enable `storageVersionMigrator: true` at least one release before any version-removal release. + +## Verification + +For any ToolHive CRD in a cluster where the controller has run successfully: + +```bash +kubectl get crd mcpservers.toolhive.stacklok.dev \ + -o jsonpath='{.status.storedVersions}' +# ["v1beta1"] +``` + +If the list contains more than one entry, the controller has not yet finished migrating — check operator logs for reconcile errors and the `StorageVersionMigrationFailed` event on the CRD. The controller will also INFO-log `storage version migration not converging — sustained concurrent writes` after five consecutive conflict-only passes against the same CRD, which is the signal to investigate sibling reconcilers or admission policies that may be racing with the migrator. + +## RBAC + +The operator ServiceAccount carries (generated from kubebuilder markers, applied by the operator Helm chart): + +- `customresourcedefinitions.apiextensions.k8s.io`: `get`, `list`, `watch` +- `customresourcedefinitions/status.apiextensions.k8s.io`: `update`, `patch` +- `*.toolhive.stacklok.dev`: `get`, `list`, `update` + +The wildcard on `toolhive.stacklok.dev` resources is intentional: the set of opted-in CRDs is a runtime label decision, not a codegen-time enumeration. The runtime gate (`isManagedCRD` requiring the opt-in label) ensures the controller only writes to CRDs that explicitly opted in, even though the RBAC bound is wider. + +## Related + +- Issue: [stacklok/toolhive#4969](https://github.com/stacklok/toolhive/issues/4969) +- PR-A — controller: [stacklok/toolhive#5362](https://github.com/stacklok/toolhive/pull/5362) +- PR-B — opt-in labels + marker-coverage CI: [stacklok/toolhive#5391](https://github.com/stacklok/toolhive/pull/5391) +- Kubernetes CRD versioning: [official docs](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/) +- Upstream reference: [`kube-storage-version-migrator`](https://github.com/kubernetes-sigs/kube-storage-version-migrator) diff --git a/docs/operator/upgrade-guide/README.md b/docs/operator/upgrade-guide/README.md new file mode 100644 index 0000000000..4b0b0e3911 --- /dev/null +++ b/docs/operator/upgrade-guide/README.md @@ -0,0 +1,369 @@ +# Upgrade walkthrough — v1alpha1 → v1beta1 with StorageVersionMigrator + +End-to-end manual walkthrough that replays a real user upgrade against a local kind cluster: install the previous v0.21.0 release, create a CR of each of the 12 toolhive CRD kinds that graduated from `v1alpha1` to `v1beta1`, upgrade to the new multi-version chart, deploy this branch's operator with the migrator **disabled**, re-apply the CRs at `v1beta1`, and confirm `status.storedVersions` is stuck at `[v1alpha1, v1beta1]` on every graduated CRD. Then enable the `StorageVersionMigrator` and confirm it converges every graduated CRD to `[v1beta1]`. + +> **Scope note**: the chart actually ships **13** labeled toolhive CRDs, but the 13th — `mcpwebhookconfigs` — never graduated (its only version is `v1alpha1`, which is also its storage version). The migrator's `isMigrationNeeded` check is a no-op on it. This walkthrough scopes every verification loop to the 12 graduated CRDs and excludes `mcpwebhookconfigs`; we'll note where this matters. + +Total run time: ~30 minutes. The slow part is the first `ko build` of the operator + proxyrunner + vmcp images (~3 min); subsequent runs use the build cache and finish in ~30s. + +This guide is the canonical reproducible verification for the migrator. Companion reading: + +- [`docs/operator/storage-version-migration.md`](../storage-version-migration.md) — reference docs for the controller itself (label contract, opt-in model, mechanism). +- [Issue #4969](https://github.com/stacklok/toolhive/issues/4969) — the motivating problem. + +## Prerequisites + +- `kind`, `kubectl`, `helm`, `ko`, `task` on PATH (`go install github.com/google/ko@latest` for ko) +- Working directory: the repo root (`task operator-deploy-local` and the relative chart paths assume this). +- Cluster name is `toolhive`. If you already have a cluster with that name, delete it first or run from a different worktree. + +The CR fixtures used below ship alongside this doc: + +- [`crs-v1alpha1.yaml`](./crs-v1alpha1.yaml) — one CR of each of the 12 graduated kinds at `v1alpha1` +- [`crs-v1beta1.yaml`](./crs-v1beta1.yaml) — byte-identical to the v1alpha1 file except for `apiVersion`, simulating what `sed -i 's/v1alpha1/v1beta1/g'` would produce + +--- + +## 0 · Set up the cluster + +```bash +# If you already have a "toolhive" kind cluster from a previous run, delete it +kind delete cluster --name toolhive 2>/dev/null + +kind create cluster --name toolhive --wait 60s +kind get kubeconfig --name toolhive > kconfig.yaml +export KUBECONFIG=$(pwd)/kconfig.yaml +``` + +## 1 · Install v0.21.0 (the last v1alpha1-only release) + +```bash +helm install toolhive-operator-crds \ + oci://ghcr.io/stacklok/toolhive/toolhive-operator-crds \ + --version 0.21.0 --wait + +helm install toolhive-operator \ + oci://ghcr.io/stacklok/toolhive/toolhive-operator \ + --version 0.21.0 \ + --namespace toolhive-system --create-namespace --wait + +kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.spec.versions[*].name}' +# expected: v1alpha1 (only one version) + +kubectl wait --for=condition=Available deployment -n toolhive-system --all --timeout=180s +``` + +## 2 · Create one CR of each of the 12 graduated kinds at v1alpha1 + +```bash +# The 12 graduated CRDs. Excludes mcpwebhookconfigs (single-version v1alpha1). +# Used by every verification loop in steps 8–11 so the walkthrough doesn't drift +# if a future PR adds a 14th CRD. +GRADUATED_CRDS=( + embeddingservers.toolhive.stacklok.dev + mcpexternalauthconfigs.toolhive.stacklok.dev + mcpgroups.toolhive.stacklok.dev + mcpoidcconfigs.toolhive.stacklok.dev + mcpregistries.toolhive.stacklok.dev + mcpremoteproxies.toolhive.stacklok.dev + mcpservers.toolhive.stacklok.dev + mcpserverentries.toolhive.stacklok.dev + mcptelemetryconfigs.toolhive.stacklok.dev + mcptoolconfigs.toolhive.stacklok.dev + virtualmcpcompositetooldefinitions.toolhive.stacklok.dev + virtualmcpservers.toolhive.stacklok.dev +) + +kubectl create namespace upgrade-test +kubectl apply -f docs/operator/upgrade-guide/crs-v1alpha1.yaml + +# Confirm all 12 landed +kubectl get \ + mcpservers,mcpremoteproxies,mcptoolconfigs,mcpgroups,embeddingservers,mcpregistries,virtualmcpservers,virtualmcpcompositetooldefinitions,mcpoidcconfigs,mcptelemetryconfigs,mcpexternalauthconfigs,mcpserverentries \ + -n upgrade-test --no-headers | wc -l +# expected: 12 +``` + +## 3 · Let the old operator reconcile + capture the baseline + +```bash +sleep 60 +kubectl get deployments -n upgrade-test +# expected: up to 5 Deployments — test-server, test-remote-proxy, test-virtual-server, +# test-registry-api, and (sometimes) test-embedding shows as a StatefulSet not a Deployment + +# Snapshot the UIDs for later comparison +kubectl get deployments -n upgrade-test \ + -o jsonpath='{range .items[*]}{.metadata.name}={.metadata.uid}{"\n"}{end}' \ + | sort > /tmp/before-upgrade.txt +cat /tmp/before-upgrade.txt +``` + +These UIDs are the canary. If they change after the operator upgrade, a workload was recreated → downtime. + +## 4 · Upgrade the CRDs chart to multi-version + +```bash +helm upgrade toolhive-operator-crds deploy/charts/operator-crds --wait --timeout 120s + +kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.spec.versions[*].name}' +# expected: v1alpha1 v1beta1 + +kubectl get crd mcpservers.toolhive.stacklok.dev -o jsonpath='{.spec.versions[?(@.storage==true)].name}' +# expected: v1beta1 +``` + +## 5 · Build the new operator + deploy with the migrator DISABLED + +This is the key deviation from `task operator-deploy-local` — we want to control the feature flag, so we run ko + helm manually rather than using the task. + +```bash +# ~3 minutes on first run, ~30s with build cache +OP=$(KO_DOCKER_REPO=kind.local ko build --local -B ./cmd/thv-operator | tail -1) +PR=$(KO_DOCKER_REPO=kind.local ko build --local -B ./cmd/thv-proxyrunner | tail -1) +VM=$(KO_DOCKER_REPO=kind.local ko build --local -B ./cmd/vmcp | tail -1) + +# Load all three into the kind cluster +kind load docker-image --name toolhive "$OP" +kind load docker-image --name toolhive "$PR" +kind load docker-image --name toolhive "$VM" + +# Persist for later steps +echo "$OP" > /tmp/img-op +echo "$PR" > /tmp/img-pr +echo "$VM" > /tmp/img-vm + +# Helm upgrade with migrator EXPLICITLY DISABLED +helm upgrade toolhive-operator deploy/charts/operator \ + --set operator.image="$OP" \ + --set operator.toolhiveRunnerImage="$PR" \ + --set operator.vmcpImage="$VM" \ + --set operator.features.storageVersionMigrator=false \ + --namespace toolhive-system + +kubectl rollout status deployment -n toolhive-system --timeout=180s + +# Confirm flag took effect — read off the Deployment spec directly, not a pod. +# A pod-based check races with the old pod's Terminating state and can return +# the pre-upgrade env value, which looks like a feature-flag bug but isn't. +kubectl get deploy -n toolhive-system toolhive-operator \ + -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR")].value}' +# expected: false + +# The "disabled" log is emitted at V(1); at default verbosity it is not +# visible. Setting LOG_LEVEL=DEBUG on the operator would surface: +# "StorageVersionMigrator disabled" envVar="TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR" +``` + +## 6 · Verify zero downtime — Deployment UIDs unchanged + +```bash +kubectl get deployments -n upgrade-test \ + -o jsonpath='{range .items[*]}{.metadata.name}={.metadata.uid}{"\n"}{end}' \ + | sort > /tmp/after-upgrade.txt + +diff /tmp/before-upgrade.txt /tmp/after-upgrade.txt && echo "All Deployment UIDs unchanged" +``` + +## 7 · Re-apply all 12 CRs at v1beta1 (the user migration step) + +```bash +kubectl apply -f docs/operator/upgrade-guide/crs-v1beta1.yaml +# expected: 12 "configured" lines (not "created") + +# Wait for the operator to observe each update +sleep 10 +``` + +## 8 · Confirm storedVersions is stuck at `[v1alpha1, v1beta1]` on the 12 graduated CRDs (migrator is OFF) + +```bash +echo "=== storedVersions on the 12 graduated CRDs (migrator OFF) ===" +for crd in "${GRADUATED_CRDS[@]}"; do + stored=$(kubectl get crd "$crd" -o jsonpath='{.status.storedVersions}') + printf " %-55s %s\n" "$crd" "$stored" +done +``` + +Expected: every line ends with `["v1alpha1","v1beta1"]`. This is the "post-graduation, pre-migration" state every cluster lands in after the v0.21.0 → multi-version upgrade. (The 13th labeled CRD, `mcpwebhookconfigs`, is excluded from this loop because it's single-version and reads `["v1alpha1"]` — see the scope note at the top.) **Without the migrator, this state is permanent** — any future operator release that drops `v1alpha1` from `spec.versions` would fail with: + +``` +status.storedVersions[0]: Invalid value: "v1alpha1": must appear in spec.versions +``` + +## 9 · Enable the migrator + watch it converge + +Helm upgrade to flip the feature flag: + +```bash +OP=$(cat /tmp/img-op) +PR=$(cat /tmp/img-pr) +VM=$(cat /tmp/img-vm) + +helm upgrade toolhive-operator deploy/charts/operator \ + --set operator.image="$OP" \ + --set operator.toolhiveRunnerImage="$PR" \ + --set operator.vmcpImage="$VM" \ + --set operator.features.storageVersionMigrator=true \ + --namespace toolhive-system + +kubectl rollout status deployment -n toolhive-system --timeout=180s + +# Confirm flag is now true — read off the Deployment spec directly (see step 5 +# for why a pod-based check races with the old pod's Terminating state). +# NOTE: the flag must be set explicitly. The chart default is +# storageVersionMigrator=false (the feature is opt-in), and `helm upgrade` does +# not reuse the previous release's --set values, so omitting it here renders the +# env var as false and the migrator never runs. +kubectl get deploy -n toolhive-system toolhive-operator \ + -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR")].value}' +# expected: true +``` + +Wait for convergence: + +```bash +expected="${#GRADUATED_CRDS[@]}" # 12 today; the loop tracks the array, so a future PR that adds a CRD won't drift. +echo "=== waiting up to 60s for ${expected} graduated CRDs to converge ===" +for i in $(seq 1 60); do + count=0 + for c in "${GRADUATED_CRDS[@]}"; do + if [ "$(kubectl get crd "$c" -o jsonpath='{.status.storedVersions}')" = '["v1beta1"]' ]; then + count=$((count + 1)) + fi + done + if [ "$count" = "$expected" ]; then + echo "All ${expected} graduated CRDs converged after ${i}s" + break + fi + sleep 1 +done + +# If the loop exited without converging, surface what's outstanding so the +# reader has something actionable to investigate (operator logs, *Failed +# events, admission-policy rejections). +if [ "$count" != "$expected" ]; then + echo "TIMEOUT: only ${count}/${expected} converged. Check operator logs and" + echo " StorageVersionMigrationFailed events on the CRDs." +fi +``` + +In practice this completes in ~1-2 seconds once the new pod is ready. + +Verify: + +```bash +echo "=== storedVersions on the ${expected} graduated CRDs (migrator ON, post-converge) ===" +for crd in "${GRADUATED_CRDS[@]}"; do + stored=$(kubectl get crd "$crd" -o jsonpath='{.status.storedVersions}') + printf " %-55s %s\n" "$crd" "$stored" +done +# expected: every line ends with ["v1beta1"] + +echo "=== StorageVersionMigrationSucceeded events ===" +kubectl get events -A --field-selector reason=StorageVersionMigrationSucceeded --no-headers | wc -l +# expected: ${expected} — one event per graduated CRD. The 13th (mcpwebhookconfigs) +# is a no-op for the migrator (storedVersions already == [storageVersion]) so it +# emits no event. + +echo "=== StorageVersionMigrationFailed events (should be 0) ===" +kubectl get events -A --field-selector reason=StorageVersionMigrationFailed --no-headers | wc -l +# expected: 0 +``` + +Confirm CRs still healthy (admission webhooks on MCPServer / MCPGroup / VirtualMCPServer all accepted the per-CR Updates): + +```bash +kubectl get \ + mcpservers,mcpremoteproxies,mcptoolconfigs,mcpgroups,embeddingservers,mcpregistries,virtualmcpservers,virtualmcpcompositetooldefinitions,mcpoidcconfigs,mcptelemetryconfigs,mcpexternalauthconfigs,mcpserverentries \ + -n upgrade-test --no-headers | wc -l +# expected: 12 +``` + +## 10 · (Optional) Inspect migrator logs + +Pod-selection is needed here (logs aren't readable off a Deployment), but filter to Running pods so a still-Terminating old pod from the helm upgrade can't be picked up: + +```bash +NEW_POD=$(kubectl get pods -n toolhive-system \ + --field-selector=status.phase=Running --no-headers \ + | grep "toolhive-operator-" | awk '{print $1}' | head -1) +kubectl logs "$NEW_POD" -n toolhive-system | grep "storage version migration complete" | wc -l +# expected: ${expected} lines — one per graduated CRD +``` + +**If this prints 0**, the migration may have happened in a previous container instance (the operator pod can restart for unrelated reasons in a kind cluster — leases, OOM, etc.). Try the previous container's logs: + +```bash +kubectl logs "$NEW_POD" -n toolhive-system --previous | grep "storage version migration complete" | wc -l +``` + +The migration is complete in either case — the events on the CRDs in step 9 are the authoritative signal. + +## 11 · (Optional) Simulate the next release that drops v1alpha1 + +Once `storedVersions` is `[v1beta1]` everywhere, the apiserver will accept removal of `v1alpha1` from `spec.versions` — the safety interlock the migrator exists to satisfy. To demonstrate: + +```bash +# Direct apiserver patch — the same end state a future "drop v1alpha1" chart +# would produce. Scoped to the 12 graduated CRDs: mcpwebhookconfigs is +# single-version v1alpha1, so removing v1alpha1 from its spec.versions would +# leave it with zero versions and the apiserver would reject the patch. +for crd in "${GRADUATED_CRDS[@]}"; do + newversions=$(kubectl get crd "$crd" -o json | jq '.spec.versions | map(select(.name != "v1alpha1"))') + patch=$(jq -n --argjson v "$newversions" '{spec:{versions:$v}}') + printf " %-55s " "$crd" + kubectl patch crd "$crd" --type=merge -p "$patch" 2>&1 | tail -1 +done +``` + +Every line should say `... patched`. Verify v1alpha1 access now fails: + +```bash +kubectl get mcpgroups.v1alpha1.toolhive.stacklok.dev test-group -n upgrade-test +# expected: error — the server doesn't have a resource type "mcpgroups" + +kubectl get mcpgroups.v1beta1.toolhive.stacklok.dev test-group -n upgrade-test +# expected: the resource +``` + +**Negative test**: if you skip step 9 (the migrator pass) and jump straight to step 11, every `kubectl patch` will fail with: + +``` +Invalid value: "v1alpha1": must appear in spec.versions +``` + +That's the apiserver wall the migrator exists to clear. + +## 12 · Cleanup + +```bash +kind delete cluster --name toolhive +rm -f kconfig.yaml /tmp/before-upgrade.txt /tmp/after-upgrade.txt \ + /tmp/img-op /tmp/img-pr /tmp/img-vm +``` + +--- + +## Summary of what's verified + +| Check | Validates | Step | +|---|---|---| +| Operator startup logs show migrator skipped when disabled | The `setupStorageVersionMigrator` branch in `app/app.go` honours `TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR=false` | 5 | +| Zero-downtime upgrade across operator chart upgrade | The PR's operator changes don't recreate any managed workload | 6 | +| `storedVersions` is `[v1alpha1, v1beta1]` after re-apply with migrator OFF | Migrator did not run; baseline state is correctly preserved | 8 | +| Helm-upgrade flip enables the migrator | Feature flag round-trips correctly | 9 | +| `storedVersions` converges to `[v1beta1]` on all 12 graduated CRDs | The actual migration logic works against real ToolHive CRDs | 9 | +| 12 `StorageVersionMigrationSucceeded` events on the graduated CRDs | The Recorder is correctly wired and per-CRD migrations are observable | 9 | +| 0 `StorageVersionMigrationFailed` events | No CRD's per-CR Update loop returned a non-retriable error | 9 | +| All 12 CRs still healthy after migration | Validating admission webhooks (MCPServer / MCPGroup / VirtualMCPServer) tolerate the per-CR Updates | 9 | +| RBAC permits storedVersions trim | The ClusterRole has the correct verbs for `customresourcedefinitions/status` and `*.toolhive.stacklok.dev/*` | implicit — trim succeeds in step 9 | +| Apiserver permits `v1alpha1` removal from `spec.versions` once `storedVersions` is clean | The deprecation chain end-to-end works | 11 | + +## What this does NOT cover + +- **Mid-migration crash recovery**: no test forces the operator to crash mid-loop. Envtest covers the conflict-handling and re-encode-failure paths separately. +- **Pagination at real-cluster scale**: 12 CRs is well below the 500-default page size. The envtest suite explicitly tests the continue-token loop with 7 CRs at `PageSize=3`. +- **Operator restart resilience under load**: kind clusters are resource-limited and the operator pod sometimes restarts during tests for unrelated reasons (the `--previous` log fallback in step 10 covers this). + +These are covered by the envtest suite at `cmd/thv-operator/test-integration/storageversionmigrator/`. This walkthrough covers what envtest can't: helm chart wiring, real ToolHive CRD schemas with their actual webhooks, and the full upgrade sequence a real user would run. diff --git a/docs/operator/upgrade-guide/crs-v1alpha1.yaml b/docs/operator/upgrade-guide/crs-v1alpha1.yaml new file mode 100644 index 0000000000..ec1a2d0aa5 --- /dev/null +++ b/docs/operator/upgrade-guide/crs-v1alpha1.yaml @@ -0,0 +1,149 @@ +# All 12 ToolHive CR kinds at v1alpha1. +# Apply with: +# kubectl create namespace upgrade-test +# kubectl apply -f crs-v1alpha1.yaml +# +# Note: MCPServerEntry references MCPGroup (groupRef.name: test-group), +# so MCPGroup must be applied first. kubectl handles ordering within a +# single file by best-effort, and since all refs here are by name (not +# UID), no strict ordering is required — missing refs just stay pending. +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPGroup +metadata: + name: test-group + namespace: upgrade-test +spec: + description: Test group for upgrade validation +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPServer +metadata: + name: test-server + namespace: upgrade-test +spec: + image: ghcr.io/github/github-mcp-server + transport: stdio +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPRemoteProxy +metadata: + name: test-remote-proxy + namespace: upgrade-test +spec: + remoteUrl: https://example.com/mcp + transport: streamable-http +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPToolConfig +metadata: + name: test-tool-config + namespace: upgrade-test +spec: + toolsFilter: + - allowed_tool +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: EmbeddingServer +metadata: + name: test-embedding + namespace: upgrade-test +spec: + image: ghcr.io/stacklok/test-embedding:latest + model: test-model +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPRegistry +metadata: + name: test-registry + namespace: upgrade-test +spec: + displayName: Test Registry + configYAML: | + sources: + - name: k8s + format: upstream + kubernetes: {} + registries: + - name: default + sources: ["k8s"] + database: + host: postgres + port: 5432 + user: db_app + database: registry + auth: + mode: anonymous +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPOIDCConfig +metadata: + name: test-oidc + namespace: upgrade-test +spec: + type: inline + inline: + issuer: https://auth.example.com + jwksUrl: https://auth.example.com/.well-known/jwks.json +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPTelemetryConfig +metadata: + name: test-telemetry + namespace: upgrade-test +spec: + openTelemetry: + endpoint: http://otel-collector:4317 + metrics: + enabled: true + tracing: + enabled: true +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPExternalAuthConfig +metadata: + name: test-external-auth + namespace: upgrade-test +spec: + type: tokenExchange + tokenExchange: + tokenUrl: https://auth.example.com/token + audience: https://api.example.com + clientId: test-client + clientSecretRef: + name: test-secret + key: client-secret +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: VirtualMCPServer +metadata: + name: test-virtual-server + namespace: upgrade-test +spec: + groupRef: + name: test-group + incomingAuth: + type: anonymous +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: VirtualMCPCompositeToolDefinition +metadata: + name: test-composite-tool + namespace: upgrade-test +spec: + name: test_workflow + description: Test composite workflow + steps: + - id: step-1 + tool: placeholder.tool +--- +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: MCPServerEntry +metadata: + name: test-server-entry + namespace: upgrade-test +spec: + groupRef: + name: test-group + transport: streamable-http + remoteUrl: https://example.com/mcp diff --git a/docs/operator/upgrade-guide/crs-v1beta1.yaml b/docs/operator/upgrade-guide/crs-v1beta1.yaml new file mode 100644 index 0000000000..b8570aa141 --- /dev/null +++ b/docs/operator/upgrade-guide/crs-v1beta1.yaml @@ -0,0 +1,149 @@ +# All 12 ToolHive CR kinds at v1beta1. +# Apply with: +# kubectl create namespace upgrade-test +# kubectl apply -f crs-v1beta1.yaml +# +# Note: MCPServerEntry references MCPGroup (groupRef.name: test-group), +# so MCPGroup must be applied first. kubectl handles ordering within a +# single file by best-effort, and since all refs here are by name (not +# UID), no strict ordering is required — missing refs just stay pending. +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPGroup +metadata: + name: test-group + namespace: upgrade-test +spec: + description: Test group for upgrade validation +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPServer +metadata: + name: test-server + namespace: upgrade-test +spec: + image: ghcr.io/github/github-mcp-server + transport: stdio +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPRemoteProxy +metadata: + name: test-remote-proxy + namespace: upgrade-test +spec: + remoteUrl: https://example.com/mcp + transport: streamable-http +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPToolConfig +metadata: + name: test-tool-config + namespace: upgrade-test +spec: + toolsFilter: + - allowed_tool +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: EmbeddingServer +metadata: + name: test-embedding + namespace: upgrade-test +spec: + image: ghcr.io/stacklok/test-embedding:latest + model: test-model +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPRegistry +metadata: + name: test-registry + namespace: upgrade-test +spec: + displayName: Test Registry + configYAML: | + sources: + - name: k8s + format: upstream + kubernetes: {} + registries: + - name: default + sources: ["k8s"] + database: + host: postgres + port: 5432 + user: db_app + database: registry + auth: + mode: anonymous +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPOIDCConfig +metadata: + name: test-oidc + namespace: upgrade-test +spec: + type: inline + inline: + issuer: https://auth.example.com + jwksUrl: https://auth.example.com/.well-known/jwks.json +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPTelemetryConfig +metadata: + name: test-telemetry + namespace: upgrade-test +spec: + openTelemetry: + endpoint: http://otel-collector:4317 + metrics: + enabled: true + tracing: + enabled: true +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPExternalAuthConfig +metadata: + name: test-external-auth + namespace: upgrade-test +spec: + type: tokenExchange + tokenExchange: + tokenUrl: https://auth.example.com/token + audience: https://api.example.com + clientId: test-client + clientSecretRef: + name: test-secret + key: client-secret +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: VirtualMCPServer +metadata: + name: test-virtual-server + namespace: upgrade-test +spec: + groupRef: + name: test-group + incomingAuth: + type: anonymous +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: VirtualMCPCompositeToolDefinition +metadata: + name: test-composite-tool + namespace: upgrade-test +spec: + name: test_workflow + description: Test composite workflow + steps: + - id: step-1 + tool: placeholder.tool +--- +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPServerEntry +metadata: + name: test-server-entry + namespace: upgrade-test +spec: + groupRef: + name: test-group + transport: streamable-http + remoteUrl: https://example.com/mcp