feat(api)!: configure peer and client TLS independently#376
Conversation
etcd has three TLS surfaces -- peer<->peer, client->server, and the operator's own client identity -- but the alpha CRD conflated all three behind a single `spec.tls` toggle and a single shared issuer, so a user could neither serve peer TLS without client TLS (or vice versa) nor relax mutual client-cert auth per surface. The single predicate also forced `--client-cert-auth`/`--peer-client-cert-auth` on whenever TLS was set. Reshape `spec.tls` into two optional, independent surfaces -- `peer` and `client` -- each carrying its own provider, issuer, and `clientCertAuth` policy (default true). A nil surface serves/dials that surface in cleartext; both nil is fully cleartext and byte-identical to the pre-TLS path. The operator's own client identity follows the `client` surface. This is a clean break on the alpha API (no conversion webhook, which would re-introduce the conflation). The dead `caBundleSecret` knob is removed (type + cert interface; it was read nowhere). `IssuerGroup` is added to the cert-manager provider config (empty => cert-manager.io) so non-default issuer groups can be targeted. Threading (single source of truth, per surface): - peerScheme/clientScheme drive peer vs client URLs and flags independently - defaultArgs emits the server flag group iff client set, the peer flag group iff peer set, and each --*client-cert-auth gated on its surface - cert mounts: server secret iff client set, peer secret iff peer set (append-not-assign so they coexist with the storage data-dir mount) - cert provisioning: server+operator-client certs from the client surface, peer cert from the peer surface - etcdutils MemberList/ClusterHealth/AddMember/PromoteLearner/RemoveMember take an optional *tls.Config (nil => cleartext); the operator builds it from the client surface only, with a missing ca.crt now an explicit error Anti-misconfiguration guardrails: - apply-time CEL XValidation on each surface: provider/providerCfg coherence, mTLS-requires-a-CA, issuerKind/provider enums - a reconcile-time validateTLS backstop re-checks the spec-only rules and requeues on an invalid spec (for apiservers that don't enforce CEL) Also: the cert-site log.Printf calls move to the reconcile-scoped logger, and the "running without TLS" error-level log is demoted to Info (cleartext is a supported mode). The sample is rewritten to the two-tier shared-CA issuer pattern a multi-member peer-mTLS cluster actually needs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
GetCertificateConfig now echoes IssuerRef.Group into ExtraConfig under the issuerGroup key. The TestCertManagerProvider "Get certificate config" assertion built its expected config without that key, so reflect.DeepEqual failed. Add the empty-string issuerGroup to the expected config (no group set => cert-manager leaves Group == ""). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
A peer-TLS cluster could not form past the seed member. With
--peer-client-cert-auth, etcd verifies a joining peer's source IP against
its cert SANs via isHostInDNS (client/pkg transport), which forward-resolves
the cluster's headless-service DNS name. A joining member is not Ready until
it has actually joined the cluster, so without publishNotReadyAddresses its
pod IP is absent from the service's A record, the peer cert SAN check fails
("tls: <ip> does not match any of DNSNames"), the peer handshake is rejected
with EOF, and the joining member crashloops -- deadlocking the cluster at one
member.
Set PublishNotReadyAddresses=true on the headless service so not-yet-Ready
members are resolvable and peer mTLS can complete. Cleartext peer traffic does
not hit this verification path, so non-TLS clusters are unaffected.
Discovered by the T6 multi-member TLS e2e (first live exercise of multi-member
peer mTLS); verified on kind: 3-member both-surface TLS cluster reaches 3
voting members with this change and crashloops at 1 without it.
Belongs to: pr/tls-independence
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…aviour
Fill the two genuine documentation gaps the reshape left:
- spec.tls is effectively create-time; toggling it on a running cluster
drops quorum. Document the create-new-not-flip caveat on the
EtcdClusterSpec.TLS field where a spec author hits it (regenerated CRD).
- buildClientTLSConfig sets no ServerName, so verification relies on the
dialed pod FQDN matching the server cert SANs; custom AltNames must keep
covering *.{name}.{ns}.svc.cluster.local.
Comments only, no behaviour change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: xrl The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @xrl. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
…egration Brings the etcd-io#376 TLS reshape (clientEndpointForOrdinalIndex, etcdutils.ClusterHealth, per-surface TLS args) plus T5 and T6 e2e coverage. Cert-block conflict (T0 vs T5) resolved to log + Recorder.Eventf(reasonClientCertificateError) + requeue. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
…to downstream/integration Conflict in internal/controller/utils.go resolved (rerere) to keep both the reconcile-pool/QoS changes and the etcd-io#376 TLS reshape. Re-aligned the reconcile-pool test TestFetchAndValidateStateClientCertificateError to the reshaped EtcdClusterTLS{Client: &TLSSurface{...CertManagerCfg...}} type and added a FakeRecorder so the client-cert failure path's Eventf does not panic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
…stream/integration Conflict in internal/controller/etcdcluster_controller.go resolved (rerere) to keep both quorum-loss recovery and the etcd-io#376 TLS reshape. Re-aligned quorum_recovery.go to the reshaped signatures (this branch was cut off main, blind to etcd-io#376): clientEndpointForOrdinalIndex now takes a per-surface scheme (clientScheme(ec)) and clusterHealth now takes (ctx, ec, eps). Behavior preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
…ream/integration Brings the mutating+validating admission webhooks for EtcdCluster and the config/webhook + config/certmanager wiring. Conflict in cmd/main.go resolved (rerere) to keep all three controller registrations plus the webhook setup. Re-aligned the webhook (and its unit/envtest/e2e tests) from the pre-etcd-io#376 single-surface TLSCertificate type to the reshaped two-surface EtcdClusterTLS{Peer,Client}: validateTLS now fans out to a per-surface validateTLSSurface and the defaulter defaults each configured surface, so field-path messages gain a surface segment (spec.tls.client.*). Behavior preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
What
etcd has three TLS surfaces — peer↔peer, client→server, and the operator's own client identity — but the alpha CRD conflated all three behind a single
spec.tlstoggle and a single shared issuer. A user could neither serve peer TLS without client TLS (or vice versa) nor relax mutual client-cert auth per surface, and the single predicate forced--client-cert-auth/--peer-client-cert-authon whenever TLS was set.This reshapes
spec.tlsinto two optional, independent surfaces —spec.tls.peerandspec.tls.client— each carrying its own provider, issuer, andclientCertAuthpolicy (defaulttrue). A nil surface serves/dials that surface in cleartext; both nil is fully cleartext and byte-identical to the pre-TLS path. The operator's own client identity follows theclientsurface.Breaking alpha API change — and why
This is a clean break on the alpha API (
feat(api)!). There is no conversion webhook on purpose: a webhook mapping the old singlespec.tlsto the new surfaces would have to pick one surface or fan out to both, which re-introduces exactly the conflation this change removes. The deadcaBundleSecretknob (type + cert interface) is dropped — it was read nowhere.IssuerGroupis added to the cert-manager provider config (empty ⇒cert-manager.io) so non-default issuer groups can be targeted.Migration:
spec.tlsis effectively create-time — toggling it on a running cluster drops quorum, so a cleartext cluster is migrated by standing up a fresh TLS cluster, not flipped in place. Documented on the regenerated CRD field.Threading (single source of truth, per surface)
peerScheme/clientSchemedrive peer vs client URLs and flags independently.defaultArgsemits the server flag group iffclientset, the peer flag group iffpeerset, each--*client-cert-authgated on its surface.clientset, peer secret iffpeerset (append-not-assign, so they coexist with the storage data-dir mount).clientsurface, peer cert from thepeersurface.etcdutils(MemberList/ClusterHealth/AddMember/PromoteLearner/RemoveMember) take an optional*tls.Config(nil ⇒ cleartext); the operator builds it from theclientsurface only, a missingca.crtnow an explicit error.peer-TLS cluster setsPublishNotReadyAddresses=trueon the headless service so a not-yet-Ready joining member is DNS-resolvable and peer mTLS SAN verification (isHostInDNS) can complete — without it a peer-mTLS cluster deadlocks at one member.Guardrails
XValidationper surface: provider/providerCfg coherence, mTLS-requires-a-CA, issuerKind/provider enums.validateTLSbackstop re-checks the spec-only rules and requeues on an invalid spec (for apiservers that don't enforce CEL).Tests
tls_independence_test.goandtls_cel_test.gocover per-surface scheme/flag/mount/provisioning matrices and the CEL + reconcile validation rules;etcdutils_test.gocovers the optional*tls.Configthreading.go build ./...andgo vet ./...clean.Relationship to the prior TLS plan
This supersedes the earlier conflated, step-wise plan (T2 mounts / T3 flags+scheme / T4 client
*tls.Config): the reshape does all three, per surface, in one cohesive breaking change rather than layering them on the single-toggle API. It implements the intent of #371 / #372 / #373.Refs #371, #372, #373
Related work — etcd TLS & operability
Independent peer/client TLS reshape and surrounding operability work, in dependency / stacking order (
→marks this PR):spec.tls.{peer,client}surfaces; breaking alpha API change (no conversion webhook, by design)TLSReadycondition + TLS lifecycle EventsPeerCANotSharedThe TLS reshape (#376) supersedes the earlier conflated T2/T3/T4 plan (per-surface mounts, flags+scheme, and client
*tls.Confignow all live in #376). T5←#376 and T6←#377 are stacked: review/merge in order. T0, the reconcile/QoS knobs, and the stress harness are independent.