Skip to content

feat(main): add migration tool#327

Merged
Andrey Kolkov (androndo) merged 1 commit into
v1from
feat/migration-tool-to-v1
Jun 8, 2026
Merged

feat(main): add migration tool#327
Andrey Kolkov (androndo) merged 1 commit into
v1from
feat/migration-tool-to-v1

Conversation

@androndo

Copy link
Copy Markdown
Collaborator

No description provided.

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 82a292c6-085d-46cd-a65f-8b5a9df37b0e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/migration-tool-to-v1

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added api-change controllers documentation Improvements or additions to documentation labels Jun 5, 2026

@lllamnyp Timofei Larkin (lllamnyp) left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The migration tooling is solid and the API-shape instinct — making adopted pods look native via durable identity rather than special-casing the controller — is right. But I'd like to change where that identity lives, and there's one hard blocker. Requesting changes.

Blocking 1 — a 77 MB binary is committed

The compiled etcd-migrate (77,422,194 bytes) is committed at the repo root. .gitignore already warns against exactly this for kubectl-etcd ("never commit a root-level build artifact") but /etcd-migrate was never added. git rm it, add /etcd-migrate to .gitignore, and amend the commit (it's a single commit) rather than deleting in a follow-up — a deleted-but-in-history 77 MB blob bloats every clone forever.

Blocking 2 — move the two migration knobs from spec fields to EtcdMember annotations

spec.headlessServiceName and EtcdMemberSpec.DataDirSubPath are permanent, user-facing API for a one-time migration need. I'd rather not grow the API for that. Replace both with operator-interpreted reserved annotations on EtcdMember (the internal, operator-managed resource), e.g.:

  • etcd-operator.cozystack.io/headless-service-name
  • etcd-operator.cozystack.io/data-dir-subpath

Why this is better: the knobs become invisible to normal users, and they self-wipe. The migration tool stamps them on the EtcdMembers it creates for the adopted pods; the operator never stamps them on members it creates, so every rolled/replaced member comes up native. Once the cluster has fully rolled, no member carries them and the cluster is indistinguishable from a natively-created one — no permanent knob, no deprecation later.

Concretely:

  • Member controller (buildPod): read both annotations. headless-service-name drives the Pod's spec.subdomain and the constructed peer/client advertise URLs; data-dir-subpath drives --data-dir (/var/lib/etcd/<subpath>). Absent → today's native behavior.
  • Validate data-dir-subpath in code. As a spec field it had an apiserver-enforced ^[a-zA-Z0-9][a-zA-Z0-9._-]*$ pattern; an annotation has no schema validation, so the controller must reject/ignore a value containing / or .. before substituting into --data-dir (mount-escape). Fail closed.
  • Cluster controller: where it builds per-member dial endpoints (memberEndpoints and the discovery/scale-up paths), read each member's headless-service-name annotation (default <cluster>). It must never set either annotation on members it creates in bootstrap/scaleUp.
  • additionalMetadata must exclude the reserved prefix. This is load-bearing, not hygiene: additionalMetadata merges onto member CRs, so if a user could set these keys through it, (a) every new member would inherit headless-service-name and the self-wipe breaks, and (b) data-dir-subpath becomes a user-controllable path into --data-dir. Strip the etcd-operator.cozystack.io/ reserved prefix from any additionalMetadata merge.
  • Remove headlessServiceName/dataDirSubPath from both specs, the CRDs, the CEL rules, and regenerate deepcopy/manifests.

Blocking 3 — let the legacy headless Service GC itself via owner references

The operator keeps managing only its native services (<cluster> headless + <cluster>-client); it should not learn about the legacy Service at all. The legacy headless Service is a migration artifact — but instead of leaving cleanup to a runbook, tie its lifetime to the adopted members with owner references so Kubernetes GC removes it exactly when the last legacy member rolls away.

Kubernetes deletes a dependent once it has owner references and all of its owners are gone. So if the legacy Service carries one owner reference per migration-created EtcdMember, it survives while any adopted member remains and is auto-GC'd when the last one is replaced. New members aren't owners, so they don't keep it alive.

The migration tool should, in this order (the ordering prevents a premature-GC race):

  1. Create the per-pod EtcdMembers (with the annotations above) and capture their UIDs.
  2. Patch the legacy headless Service's ownerReferences, replacing the existing ones, with one entry per new EtcdMember:
    ownerReferences:
    - apiVersion: etcd-operator.cozystack.io/v1alpha2
      kind: EtcdMember
      name: <member-name>
      uid: <member-uid>
      controller: false
      blockOwnerDeletion: false
    
    controller: false on all of them (at most one controller owner is allowed, and these aren't controllers); blockOwnerDeletion: false so deleting a member is never gated on the Service. Same namespace, which is required for namespaced owners.
  3. Only then delete the legacy etcd.aenix.io/v1alpha1 resources, and do so with orphan propagation (--cascade=orphan / propagationPolicy: Orphan) so the Service/Pods/PVCs survive the legacy CR deletion.

The reason for step 2 before step 3: the legacy Service currently has a controller owner (the legacy CR or its StatefulSet). If you delete that owner first, the Service is briefly sole-owned by a now-missing object and GC deletes it immediately. Replacing the owner refs before removing the legacy owners closes that window. Strip the stale refs entirely in the patch — don't leave a dangling controller ref.

Net effect: as the operator rolls members (each replacement deletes the old EtcdMember CR, dropping one owner ref), the Service stays until the final original member is gone, then disappears on its own. No operator code, no manual step, no runbook cleanup.

Non-blocking — cert SANs are a Cozystack coordination point

Server/peer certs here are external (Cozystack cert-manager). The operator's SAN contract is a wildcard pinned to the service name (*.<svc>.<ns>.svc). During the mixed window, adopted members resolve under the legacy name and rolled members under <cluster>, so the externally-issued cert the pods mount must carry both *.<legacy>.<ns>.svc and *.<cluster>.<ns>.svc, then drop the legacy SAN once rollover completes. The operator can't synthesize this, so docs/migration.md should state it explicitly as a prerequisite.

Tests to add

  • buildPod renders subdomain/URLs and --data-dir from the annotations; absent → native.
  • data-dir-subpath with / or .. is rejected/ignored (fail-closed).
  • Cluster controller builds per-member endpoints from the annotation and never stamps the reserved keys on created members.
  • additionalMetadata carrying a reserved key does not reach the member (stripped).
  • Migration tool: legacy Service ends up with one owner ref per EtcdMember and zero stale refs; legacy CRs deleted with orphan propagation.

Signed-off-by: Andrey Kolkov <androndo@gmail.com>

@lllamnyp Timofei Larkin (lllamnyp) left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in-place migration tooling and the supporting operator changes are in good shape. Approving.

Adds the etcd-migrate in-place adoption CLI (cmd/etcd-migrate/*, internal/migrate/*, the shared internal/portforward/*), the migration-only reserved annotations (headless-service-name, data-dir-subpath) with per-member Service-name threading through both controllers, the typed spec.options, and the migration docs.

Verification:

  • go build ./..., go vet ./..., go test ./... all clean (migrate, cmd/etcd-migrate, controllers, api/v1alpha2, portforward, kubectl-etcd).
  • make manifests generate produces no drift — the committed CRDs/deepcopy match the new types.
  • The squash purged the previously-committed build artifact: no etcd-migrate blob remains anywhere in the branch lineage, and /etcd-migrate is gitignored so it can't recur.
  • The legacy-constant mirroring was checked against the actual legacy operator (not just the in-tree comments): the default.etcd data-dir, the <name>-<namespace> cluster token, the headless/client/configmap name derivations, and the v1alpha1 spec json tags all match.

Load-bearing logic I checked and found sound:

  • Apply ordering (apply_adopt.go): re-own the legacy headless Service → orphan-delete the legacy CR → orphan-delete + await the StatefulSet → re-own pods/PVCs → client-Service cutover. The owner-ref rewrite before the legacy-CR delete is what avoids a premature-GC window, and orphan propagation keeps the data plane; the GC of the legacy headless Service then happens exactly when the last adopted member rolls.
  • Auth-disable-before-backup (runMutationPhases): correct for the anonymous snapshot dial, and a failed auth-disable/backup flips the plan to ActionError which applyPlans skips — so an unprotected, still-auth'd cluster is never adopted. Tested.
  • Fail-closed data-dir rejects / and .., and the reserved-prefix strip in applyAdditionalMetadata blocks forging the migration annotations through additionalMetadata. Tested.
  • Per-member Service-name threading is consistent across both controllers, and completePendingMember echoes etcd's verbatim peer URLs — correct for the mixed adopted/native window.
  • spec.options is a closed typed set, latched through status.observed, validated against a real apiserver.

Docs (migration.md, concepts.md) were checked against the code and are accurate — the ClusterIP→headless cutover, the SAN/CA caveats, the self-wiping annotations, and the options map→typed mapping all match.

Non-blocking follow-up (not a merge blocker): inspect.go's legacyOperatorTLSConfig pins ServerName to <name>.<namespace>.svc and snapshotjob.go's LegacyClientEndpoint dials the same — both assume the default legacy client-Service name. A legacy cluster that set spec.serviceTemplate.name would pin/dial the wrong DNS and fail inspection or backup (loudly, no silent data loss — and the cutover code already honors the override via LegacyClientServiceName). This matches the documented Cozystack/Kamaji target layout. If broader support is ever wanted, thread LegacyClientServiceName(name, spec) into both the ServerName pin and the snapshot endpoint, with a test asserting the resolved host equals the ServiceTemplate.Name override.

LGTM.

@androndo Andrey Kolkov (androndo) merged commit 3cb469f into v1 Jun 8, 2026
2 checks passed
@androndo Andrey Kolkov (androndo) deleted the feat/migration-tool-to-v1 branch June 8, 2026 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change controllers documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants