|
| 1 | +# Storage Version Migration |
| 2 | + |
| 3 | +The ToolHive operator ships a `StorageVersionMigrator` controller that keeps every ToolHive CRD's `status.storedVersions` list clean, so a future operator release can drop deprecated API versions (e.g. `v1alpha1`) without orphaning objects in etcd. |
| 4 | + |
| 5 | +## Why this exists |
| 6 | + |
| 7 | +When a CRD graduates from, say, `v1alpha1` to `v1beta1` with both versions served and `v1beta1` as the storage version, existing objects continue to work — they are transparently converted on read/write. But the API server records every version that has ever been used for storage in `CustomResourceDefinition.status.storedVersions`. Until that list is trimmed, the Kubernetes API server refuses to let you remove a version from `spec.versions`, because doing so would orphan any etcd-stored objects encoded at that version. |
| 8 | + |
| 9 | +The cleanup is not automatic. Someone has to re-store every existing object at the current storage version, then explicitly patch `status.storedVersions` to drop the old entry. The `StorageVersionMigrator` controller does this for you, on every opted-in ToolHive CRD, continuously. See [upstream Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/) for the mechanism. |
| 10 | + |
| 11 | +## What the controller does |
| 12 | + |
| 13 | +For each opted-in CRD: |
| 14 | + |
| 15 | +1. Reads `spec.versions` to find the entry with `storage: true`. |
| 16 | +2. If `status.storedVersions` already equals `[<currentStorageVersion>]` and only one version is served, nothing to do. |
| 17 | +3. Otherwise, lists every Custom Resource of that kind and issues a metadata-only Server-Side Apply against the `/status` subresource with field manager `thv-storage-version-migrator`. This forces the API server to re-encode each object at the current storage version without triggering admission webhooks (SSA on `/status` typically bypasses webhooks registered on the main resource, and the empty apply owns no fields so it doesn't fight other controllers). |
| 18 | +4. Once every object has been re-stored, patches `CRD.status.storedVersions` to `[<currentStorageVersion>]` using an optimistic-lock merge — so concurrent API-server writes cause a clean retry rather than a silent overwrite. |
| 19 | + |
| 20 | +CRDs without a `/status` subresource fall back to main-resource SSA. |
| 21 | + |
| 22 | +## The opt-in label |
| 23 | + |
| 24 | +A CRD participates in migration only if it carries: |
| 25 | + |
| 26 | +```yaml |
| 27 | +metadata: |
| 28 | + labels: |
| 29 | + toolhive.stacklok.dev/auto-migrate-storage-version: "true" |
| 30 | +``` |
| 31 | +
|
| 32 | +The label is set at CRD-generation time via a kubebuilder marker on each Go type in `cmd/thv-operator/api/v1beta1/`: |
| 33 | + |
| 34 | +```go |
| 35 | +// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true |
| 36 | +type MCPServer struct { ... } |
| 37 | +``` |
| 38 | + |
| 39 | +`task operator-manifests` bakes the label into the generated CRD YAML. All current ToolHive root types ship with the marker. A CI test (`TestV1beta1TypesMarkerCoverage`) fails the build if a root type is added without either this marker or an explicit `// +thv:storage-version-migrator:exclude` sibling marker — so the migrator cannot silently forget a new CRD. |
| 40 | + |
| 41 | +Adding a new CRD that should be migrated: |
| 42 | + |
| 43 | +```go |
| 44 | +// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true |
| 45 | +type NewShinyThing struct { ... } |
| 46 | +``` |
| 47 | + |
| 48 | +Adding a new CRD that deliberately should NOT be migrated (e.g. an experimental kind that is still stabilising its schema): |
| 49 | + |
| 50 | +```go |
| 51 | +// +thv:storage-version-migrator:exclude |
| 52 | +type ExperimentalThing struct { ... } |
| 53 | +``` |
| 54 | + |
| 55 | +## Disabling the controller |
| 56 | + |
| 57 | +Set the Helm feature flag: |
| 58 | + |
| 59 | +```yaml |
| 60 | +operator: |
| 61 | + features: |
| 62 | + storageVersionMigrator: false # default: true |
| 63 | +``` |
| 64 | + |
| 65 | +This sets `ENABLE_STORAGE_VERSION_MIGRATOR=false` on the operator Deployment, and the reconciler is not registered with the manager. |
| 66 | + |
| 67 | +Disable only if you are running an external migrator such as [kube-storage-version-migrator](https://github.com/kubernetes-sigs/kube-storage-version-migrator). Disabling without a replacement is a footgun: the next ToolHive release that removes a deprecated API version will refuse to apply its CRD update until `storedVersions` is cleaned, and you will have to clean it yourself. |
| 68 | + |
| 69 | +## Per-CRD emergency escape hatch |
| 70 | + |
| 71 | +Removing the label on a live cluster excludes that single CRD from migration immediately: |
| 72 | + |
| 73 | +```bash |
| 74 | +kubectl label crd/mcpservers.toolhive.stacklok.dev \ |
| 75 | + toolhive.stacklok.dev/auto-migrate-storage-version- |
| 76 | +``` |
| 77 | + |
| 78 | +Intended for incident response only. If you deploy the operator via GitOps (Argo CD, Flux) or `helm upgrade`, the chart will re-apply the chart-set label within seconds. Use the `storageVersionMigrator` feature flag for long-term opt-out. |
| 79 | + |
| 80 | +## Interaction with version removal releases |
| 81 | + |
| 82 | +The `StorageVersionMigrator` must have had time to run against your cluster *before* an operator release that drops a deprecated CRD version ships. The typical sequence is: |
| 83 | + |
| 84 | +1. **Release N**: both versions served, newer version is storage, `StorageVersionMigrator` enabled. The controller quietly re-stores all objects and trims `storedVersions` on every cluster during this deprecation window. |
| 85 | +2. **Release N+1+**: the deprecated version is removed from `spec.versions`. Because every cluster's `storedVersions` was already cleaned in the previous release, the CRD update applies cleanly. |
| 86 | + |
| 87 | +If your cluster upgraded directly from a pre-migrator release to the version-removal release without ever running release N, you must clean `storedVersions` manually (or deploy `kube-storage-version-migrator` once) before the upgrade can succeed. |
| 88 | + |
| 89 | +## Verification |
| 90 | + |
| 91 | +For any ToolHive CRD in a cluster where the controller has run: |
| 92 | + |
| 93 | +```bash |
| 94 | +kubectl get crd mcpservers.toolhive.stacklok.dev \ |
| 95 | + -o jsonpath='{.status.storedVersions}' |
| 96 | +# ["v1beta1"] |
| 97 | +``` |
| 98 | + |
| 99 | +If the list contains more than one entry, the controller has not yet finished migrating — check operator logs for reconcile errors and the `StorageVersionMigrationFailed` event on the CRD. |
| 100 | + |
| 101 | +## RBAC |
| 102 | + |
| 103 | +The controller requires (generated from kubebuilder markers, applied by the operator Helm chart): |
| 104 | + |
| 105 | +- `customresourcedefinitions.apiextensions.k8s.io`: `get`, `list`, `watch` |
| 106 | +- `customresourcedefinitions/status.apiextensions.k8s.io`: `update`, `patch` |
| 107 | +- `*.toolhive.stacklok.dev`: `get`, `list`, `patch` |
| 108 | +- `*/status.toolhive.stacklok.dev`: `patch` |
| 109 | + |
| 110 | +## Related |
| 111 | + |
| 112 | +- Issue: [stacklok/toolhive#4969](https://github.com/stacklok/toolhive/issues/4969) |
| 113 | +- Kubernetes CRD versioning: [official docs](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/) |
| 114 | +- Reference implementation: [kubernetes-sigs/cluster-api `crdmigrator`](https://github.com/kubernetes-sigs/cluster-api/tree/main/controllers/crdmigrator) |
0 commit comments