Skip to content

feat(api)!: configure peer and client TLS independently#376

Open
xrl wants to merge 4 commits into
etcd-io:mainfrom
xrl:pr/tls-independence
Open

feat(api)!: configure peer and client TLS independently#376
xrl wants to merge 4 commits into
etcd-io:mainfrom
xrl:pr/tls-independence

Conversation

@xrl

@xrl xrl commented Jun 18, 2026

Copy link
Copy Markdown

What

etcd has three TLS surfaces — peer↔peer, client→server, and the operator's own client identity — but the alpha CRD conflated all three behind a single spec.tls toggle and a single shared issuer. A user could neither serve peer TLS without client TLS (or vice versa) nor relax mutual client-cert auth per surface, and the single predicate forced --client-cert-auth/--peer-client-cert-auth on whenever TLS was set.

This reshapes spec.tls into two optional, independent surfacesspec.tls.peer and spec.tls.client — each carrying its own provider, issuer, and clientCertAuth policy (default true). A nil surface serves/dials that surface in cleartext; both nil is fully cleartext and byte-identical to the pre-TLS path. The operator's own client identity follows the client surface.

Breaking alpha API change — and why

This is a clean break on the alpha API (feat(api)!). There is no conversion webhook on purpose: a webhook mapping the old single spec.tls to the new surfaces would have to pick one surface or fan out to both, which re-introduces exactly the conflation this change removes. The dead caBundleSecret knob (type + cert interface) is dropped — it was read nowhere. IssuerGroup is added to the cert-manager provider config (empty ⇒ cert-manager.io) so non-default issuer groups can be targeted.

Migration: spec.tls is effectively create-time — toggling it on a running cluster drops quorum, so a cleartext cluster is migrated by standing up a fresh TLS cluster, not flipped in place. Documented on the regenerated CRD field.

Threading (single source of truth, per surface)

  • peerScheme/clientScheme drive peer vs client URLs and flags independently.
  • defaultArgs emits the server flag group iff client set, the peer flag group iff peer set, each --*client-cert-auth gated on its surface.
  • Cert mounts: server secret iff client set, peer secret iff peer set (append-not-assign, so they coexist with the storage data-dir mount).
  • Cert provisioning: server + operator-client certs from the client surface, peer cert from the peer surface.
  • etcdutils (MemberList/ClusterHealth/AddMember/PromoteLearner/RemoveMember) take an optional *tls.Config (nil ⇒ cleartext); the operator builds it from the client surface only, a missing ca.crt now an explicit error.
  • A peer-TLS cluster sets PublishNotReadyAddresses=true on the headless service so a not-yet-Ready joining member is DNS-resolvable and peer mTLS SAN verification (isHostInDNS) can complete — without it a peer-mTLS cluster deadlocks at one member.

Guardrails

  • Apply-time CEL XValidation per surface: provider/providerCfg coherence, mTLS-requires-a-CA, issuerKind/provider enums.
  • A reconcile-time validateTLS backstop re-checks the spec-only rules and requeues on an invalid spec (for apiservers that don't enforce CEL).

Tests

tls_independence_test.go and tls_cel_test.go cover per-surface scheme/flag/mount/provisioning matrices and the CEL + reconcile validation rules; etcdutils_test.go covers the optional *tls.Config threading. go build ./... and go vet ./... clean.

Relationship to the prior TLS plan

This supersedes the earlier conflated, step-wise plan (T2 mounts / T3 flags+scheme / T4 client *tls.Config): the reshape does all three, per surface, in one cohesive breaking change rather than layering them on the single-toggle API. It implements the intent of #371 / #372 / #373.

Refs #371, #372, #373


Related work — etcd TLS & operability

Independent peer/client TLS reshape and surrounding operability work, in dependency / stacking order ( marks this PR):

Change Issue PR Depends on
TLS independence — independent spec.tls.{peer,client} surfaces; breaking alpha API change (no conversion webhook, by design) #371 #372 #373
TLSReady condition + TLS lifecycle Events #376
multi-member TLS quorum e2e + PeerCANotShared #377
stop swallowing the client-certificate error (requeue) #370
configurable reconcile worker pool + Burstable etcd QoS
kind stress/scale e2e harness (1/3/7, churn, quorum watcher)

The TLS reshape (#376) supersedes the earlier conflated T2/T3/T4 plan (per-surface mounts, flags+scheme, and client *tls.Config now all live in #376). T5←#376 and T6←#377 are stacked: review/merge in order. T0, the reconcile/QoS knobs, and the stress harness are independent.

xrl and others added 4 commits June 17, 2026 22:20
etcd has three TLS surfaces -- peer<->peer, client->server, and the
operator's own client identity -- but the alpha CRD conflated all three
behind a single `spec.tls` toggle and a single shared issuer, so a user
could neither serve peer TLS without client TLS (or vice versa) nor relax
mutual client-cert auth per surface. The single predicate also forced
`--client-cert-auth`/`--peer-client-cert-auth` on whenever TLS was set.

Reshape `spec.tls` into two optional, independent surfaces -- `peer` and
`client` -- each carrying its own provider, issuer, and `clientCertAuth`
policy (default true). A nil surface serves/dials that surface in
cleartext; both nil is fully cleartext and byte-identical to the pre-TLS
path. The operator's own client identity follows the `client` surface.

This is a clean break on the alpha API (no conversion webhook, which would
re-introduce the conflation). The dead `caBundleSecret` knob is removed
(type + cert interface; it was read nowhere). `IssuerGroup` is added to the
cert-manager provider config (empty => cert-manager.io) so non-default
issuer groups can be targeted.

Threading (single source of truth, per surface):
- peerScheme/clientScheme drive peer vs client URLs and flags independently
- defaultArgs emits the server flag group iff client set, the peer flag
  group iff peer set, and each --*client-cert-auth gated on its surface
- cert mounts: server secret iff client set, peer secret iff peer set
  (append-not-assign so they coexist with the storage data-dir mount)
- cert provisioning: server+operator-client certs from the client surface,
  peer cert from the peer surface
- etcdutils MemberList/ClusterHealth/AddMember/PromoteLearner/RemoveMember
  take an optional *tls.Config (nil => cleartext); the operator builds it
  from the client surface only, with a missing ca.crt now an explicit error

Anti-misconfiguration guardrails:
- apply-time CEL XValidation on each surface: provider/providerCfg
  coherence, mTLS-requires-a-CA, issuerKind/provider enums
- a reconcile-time validateTLS backstop re-checks the spec-only rules and
  requeues on an invalid spec (for apiservers that don't enforce CEL)

Also: the cert-site log.Printf calls move to the reconcile-scoped logger,
and the "running without TLS" error-level log is demoted to Info (cleartext
is a supported mode). The sample is rewritten to the two-tier shared-CA
issuer pattern a multi-member peer-mTLS cluster actually needs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
GetCertificateConfig now echoes IssuerRef.Group into ExtraConfig under
the issuerGroup key. The TestCertManagerProvider "Get certificate config"
assertion built its expected config without that key, so reflect.DeepEqual
failed. Add the empty-string issuerGroup to the expected config (no group
set => cert-manager leaves Group == "").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
A peer-TLS cluster could not form past the seed member. With
--peer-client-cert-auth, etcd verifies a joining peer's source IP against
its cert SANs via isHostInDNS (client/pkg transport), which forward-resolves
the cluster's headless-service DNS name. A joining member is not Ready until
it has actually joined the cluster, so without publishNotReadyAddresses its
pod IP is absent from the service's A record, the peer cert SAN check fails
("tls: <ip> does not match any of DNSNames"), the peer handshake is rejected
with EOF, and the joining member crashloops -- deadlocking the cluster at one
member.

Set PublishNotReadyAddresses=true on the headless service so not-yet-Ready
members are resolvable and peer mTLS can complete. Cleartext peer traffic does
not hit this verification path, so non-TLS clusters are unaffected.

Discovered by the T6 multi-member TLS e2e (first live exercise of multi-member
peer mTLS); verified on kind: 3-member both-surface TLS cluster reaches 3
voting members with this change and crashloops at 1 without it.

Belongs to: pr/tls-independence

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…aviour

Fill the two genuine documentation gaps the reshape left:

- spec.tls is effectively create-time; toggling it on a running cluster
  drops quorum. Document the create-new-not-flip caveat on the
  EtcdClusterSpec.TLS field where a spec author hits it (regenerated CRD).
- buildClientTLSConfig sets no ServerName, so verification relies on the
  dialed pod FQDN matching the server cert SANs; custom AltNames must keep
  covering *.{name}.{ns}.svc.cluster.local.

Comments only, no behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Copy link
Copy Markdown

Hi @xrl. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

xrl added a commit to xrl/etcd-operator that referenced this pull request Jun 19, 2026
…egration

Brings the etcd-io#376 TLS reshape (clientEndpointForOrdinalIndex, etcdutils.ClusterHealth,
per-surface TLS args) plus T5 and T6 e2e coverage. Cert-block conflict (T0 vs T5)
resolved to log + Recorder.Eventf(reasonClientCertificateError) + requeue.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
xrl added a commit to xrl/etcd-operator that referenced this pull request Jun 19, 2026
…to downstream/integration

Conflict in internal/controller/utils.go resolved (rerere) to keep both the
reconcile-pool/QoS changes and the etcd-io#376 TLS reshape. Re-aligned the
reconcile-pool test TestFetchAndValidateStateClientCertificateError to the
reshaped EtcdClusterTLS{Client: &TLSSurface{...CertManagerCfg...}} type and
added a FakeRecorder so the client-cert failure path's Eventf does not panic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
xrl added a commit to xrl/etcd-operator that referenced this pull request Jun 19, 2026
…stream/integration

Conflict in internal/controller/etcdcluster_controller.go resolved (rerere) to
keep both quorum-loss recovery and the etcd-io#376 TLS reshape. Re-aligned
quorum_recovery.go to the reshaped signatures (this branch was cut off main,
blind to etcd-io#376): clientEndpointForOrdinalIndex now takes a per-surface scheme
(clientScheme(ec)) and clusterHealth now takes (ctx, ec, eps). Behavior
preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
xrl added a commit to xrl/etcd-operator that referenced this pull request Jun 19, 2026
…ream/integration

Brings the mutating+validating admission webhooks for EtcdCluster and the
config/webhook + config/certmanager wiring. Conflict in cmd/main.go resolved
(rerere) to keep all three controller registrations plus the webhook setup.
Re-aligned the webhook (and its unit/envtest/e2e tests) from the pre-etcd-io#376
single-surface TLSCertificate type to the reshaped two-surface
EtcdClusterTLS{Peer,Client}: validateTLS now fans out to a per-surface
validateTLSSurface and the defaulter defaults each configured surface, so
field-path messages gain a surface segment (spec.tls.client.*). Behavior
preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants