Skip to content

Commit 4c36f87

Browse files
ChrisJBurnsclaude
andcommitted
Document storage version migration for operator users
Covers what the controller does, the opt-in label contract, how to disable the controller operator-wide, the per-CRD escape hatch and its interaction with GitOps, how the migrator interacts with future version-removal releases, and the required RBAC. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent dec9630 commit 4c36f87

1 file changed

Lines changed: 114 additions & 0 deletions

File tree

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,114 @@
1+
# Storage Version Migration
2+
3+
The ToolHive operator ships a `StorageVersionMigrator` controller that keeps every ToolHive CRD's `status.storedVersions` list clean, so a future operator release can drop deprecated API versions (e.g. `v1alpha1`) without orphaning objects in etcd.
4+
5+
## Why this exists
6+
7+
When a CRD graduates from, say, `v1alpha1` to `v1beta1` with both versions served and `v1beta1` as the storage version, existing objects continue to work — they are transparently converted on read/write. But the API server records every version that has ever been used for storage in `CustomResourceDefinition.status.storedVersions`. Until that list is trimmed, the Kubernetes API server refuses to let you remove a version from `spec.versions`, because doing so would orphan any etcd-stored objects encoded at that version.
8+
9+
The cleanup is not automatic. Someone has to re-store every existing object at the current storage version, then explicitly patch `status.storedVersions` to drop the old entry. The `StorageVersionMigrator` controller does this for you, on every opted-in ToolHive CRD, continuously. See [upstream Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/) for the mechanism.
10+
11+
## What the controller does
12+
13+
For each opted-in CRD:
14+
15+
1. Reads `spec.versions` to find the entry with `storage: true`.
16+
2. If `status.storedVersions` already equals `[<currentStorageVersion>]` and only one version is served, nothing to do.
17+
3. Otherwise, lists every Custom Resource of that kind and issues a metadata-only Server-Side Apply against the `/status` subresource with field manager `thv-storage-version-migrator`. This forces the API server to re-encode each object at the current storage version without triggering admission webhooks (SSA on `/status` typically bypasses webhooks registered on the main resource, and the empty apply owns no fields so it doesn't fight other controllers).
18+
4. Once every object has been re-stored, patches `CRD.status.storedVersions` to `[<currentStorageVersion>]` using an optimistic-lock merge — so concurrent API-server writes cause a clean retry rather than a silent overwrite.
19+
20+
CRDs without a `/status` subresource fall back to main-resource SSA.
21+
22+
## The opt-in label
23+
24+
A CRD participates in migration only if it carries:
25+
26+
```yaml
27+
metadata:
28+
labels:
29+
toolhive.stacklok.dev/auto-migrate-storage-version: "true"
30+
```
31+
32+
The label is set at CRD-generation time via a kubebuilder marker on each Go type in `cmd/thv-operator/api/v1beta1/`:
33+
34+
```go
35+
// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
36+
type MCPServer struct { ... }
37+
```
38+
39+
`task operator-manifests` bakes the label into the generated CRD YAML. All current ToolHive root types ship with the marker. A CI test (`TestV1beta1TypesMarkerCoverage`) fails the build if a root type is added without either this marker or an explicit `// +thv:storage-version-migrator:exclude` sibling marker — so the migrator cannot silently forget a new CRD.
40+
41+
Adding a new CRD that should be migrated:
42+
43+
```go
44+
// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
45+
type NewShinyThing struct { ... }
46+
```
47+
48+
Adding a new CRD that deliberately should NOT be migrated (e.g. an experimental kind that is still stabilising its schema):
49+
50+
```go
51+
// +thv:storage-version-migrator:exclude
52+
type ExperimentalThing struct { ... }
53+
```
54+
55+
## Disabling the controller
56+
57+
Set the Helm feature flag:
58+
59+
```yaml
60+
operator:
61+
features:
62+
storageVersionMigrator: false # default: true
63+
```
64+
65+
This sets `ENABLE_STORAGE_VERSION_MIGRATOR=false` on the operator Deployment, and the reconciler is not registered with the manager.
66+
67+
Disable only if you are running an external migrator such as [kube-storage-version-migrator](https://github.com/kubernetes-sigs/kube-storage-version-migrator). Disabling without a replacement is a footgun: the next ToolHive release that removes a deprecated API version will refuse to apply its CRD update until `storedVersions` is cleaned, and you will have to clean it yourself.
68+
69+
## Per-CRD emergency escape hatch
70+
71+
Removing the label on a live cluster excludes that single CRD from migration immediately:
72+
73+
```bash
74+
kubectl label crd/mcpservers.toolhive.stacklok.dev \
75+
toolhive.stacklok.dev/auto-migrate-storage-version-
76+
```
77+
78+
Intended for incident response only. If you deploy the operator via GitOps (Argo CD, Flux) or `helm upgrade`, the chart will re-apply the chart-set label within seconds. Use the `storageVersionMigrator` feature flag for long-term opt-out.
79+
80+
## Interaction with version removal releases
81+
82+
The `StorageVersionMigrator` must have had time to run against your cluster *before* an operator release that drops a deprecated CRD version ships. The typical sequence is:
83+
84+
1. **Release N**: both versions served, newer version is storage, `StorageVersionMigrator` enabled. The controller quietly re-stores all objects and trims `storedVersions` on every cluster during this deprecation window.
85+
2. **Release N+1+**: the deprecated version is removed from `spec.versions`. Because every cluster's `storedVersions` was already cleaned in the previous release, the CRD update applies cleanly.
86+
87+
If your cluster upgraded directly from a pre-migrator release to the version-removal release without ever running release N, you must clean `storedVersions` manually (or deploy `kube-storage-version-migrator` once) before the upgrade can succeed.
88+
89+
## Verification
90+
91+
For any ToolHive CRD in a cluster where the controller has run:
92+
93+
```bash
94+
kubectl get crd mcpservers.toolhive.stacklok.dev \
95+
-o jsonpath='{.status.storedVersions}'
96+
# ["v1beta1"]
97+
```
98+
99+
If the list contains more than one entry, the controller has not yet finished migrating — check operator logs for reconcile errors and the `StorageVersionMigrationFailed` event on the CRD.
100+
101+
## RBAC
102+
103+
The controller requires (generated from kubebuilder markers, applied by the operator Helm chart):
104+
105+
- `customresourcedefinitions.apiextensions.k8s.io`: `get`, `list`, `watch`
106+
- `customresourcedefinitions/status.apiextensions.k8s.io`: `update`, `patch`
107+
- `*.toolhive.stacklok.dev`: `get`, `list`, `patch`
108+
- `*/status.toolhive.stacklok.dev`: `patch`
109+
110+
## Related
111+
112+
- Issue: [stacklok/toolhive#4969](https://github.com/stacklok/toolhive/issues/4969)
113+
- Kubernetes CRD versioning: [official docs](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/)
114+
- Reference implementation: [kubernetes-sigs/cluster-api `crdmigrator`](https://github.com/kubernetes-sigs/cluster-api/tree/main/controllers/crdmigrator)

0 commit comments

Comments
 (0)