From b5a8a629b9e732d5414f704ae8808e7fe35c5b45 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Thu, 16 Apr 2026 18:12:43 +0800 Subject: [PATCH 1/4] docs(kep): draft fail-fast restart budget and init-phase dns for lws Signed-off-by: Peter Pan --- .../NNN-distributed-preflight-check/README.md | 233 ++++++++++++++++++ 1 file changed, 233 insertions(+) create mode 100644 keps/NNN-distributed-preflight-check/README.md diff --git a/keps/NNN-distributed-preflight-check/README.md b/keps/NNN-distributed-preflight-check/README.md new file mode 100644 index 000000000..2b6334d12 --- /dev/null +++ b/keps/NNN-distributed-preflight-check/README.md @@ -0,0 +1,233 @@ + +# KEP-NNN: Fail-Fast Restart Budget and Init-Phase DNS for LeaderWorkerSet + + +- [Summary](#summary) +- [Goals](#goals) +- [Non-Goals](#non-goals) +- [Proposal](#proposal) +- [Why No Failed Before](#why-no-failed-before) +- [Design Details](#design-details) + - [API](#api) + - [Controller Behavior](#controller-behavior) + - [Status Semantics](#status-semantics) + - [Operational Notes](#operational-notes) +- [Risks and Mitigations](#risks-and-mitigations) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Test Plan](#test-plan) +- [Implementation History](#implementation-history) + + +## Summary + +This KEP adds two opt-in capabilities to LeaderWorkerSet (LWS): + +1. `leaderWorkerTemplate.maxGroupRestarts` plus a terminal `Failed` condition for bounded retries. +2. `networkConfig.publishNotReadyAddresses` to make peer FQDN resolvable in init phase (secondary to fail-fast). + +This proposal addresses two generic LWS gaps: +- group recreation can loop forever without a terminal failure boundary. +- peer DNS FQDN may be unavailable during init-container phase unless explicitly published. +It applies to any workload with init-phase peer communication and to any repeated +group recreation caused by failures in init-containers or main containers. + + +### Story + +When users run distributed pre-checks (for example NCCL tests) in LWS init-containers, +a failure can push the group into an infinite recreate loop in the `RecreateGroupOnPodRestart` path. +This KEP adds a fail-fast boundary after N group recreation attempts, and also ensures +leader FQDN can be resolved during init phase when explicitly enabled. + + +## Goals + +1. Provide a native fail-fast mechanism after N group recreation attempts, + regardless of whether the trigger is init-container failure or main-container failure. +2. Enable init-containers to resolve `LWS_LEADER_ADDRESS` during init phase. +3. Keep both features backward compatible and opt-in (`false`/`nil` by default). +4. Reuse existing env vars (`LWS_LEADER_ADDRESS`, `LWS_GROUP_SIZE`, `LWS_WORKER_INDEX`). + +## Non-Goals + +1. Add new env vars for this feature. +2. Ship built-in preflight images/scripts. +3. Change container-level restart semantics. +4. Add automatic remediation after failure. + +## Proposal + +Decision matrix: + +| Problem | Solution | Origin | +|---|---|---| +| Infinite recreate loop, no terminal state (including main-container repeated failure) | `maxGroupRestarts` + `Failed` condition | LWS requirement | +| Init-container cannot resolve leader FQDN | `networkConfig.publishNotReadyAddresses: true` | Kubeflow Trainer PR #3417 discussion | +| Opt-in vs opt-out concern | Both fields are opt-in (`nil/false` default) | Maintainer feedback pattern in Trainer | +| "Why not startup/readiness probe?" | Keep as rejected alternative | Trainer discussion on probe semantics | +| Need more env vars? | No, reuse existing LWS envs | Existing LWS behavior | + +User-visible behavior change: + +1. Default behavior is unchanged. + - If `maxGroupRestarts` is unset, group recreation remains unbounded. + - If `publishNotReadyAddresses` is unset/false, init-phase DNS behavior stays as today. +2. If users set `leaderWorkerTemplate.maxGroupRestarts: N`, + LWS allows at most `N` group recreations in `RecreateGroupOnPodRestart` path. +3. After the limit is exceeded, LWS sets `Failed=True` and stops further group recreation. +4. `Failed` is a new LWS condition type added by this KEP. + - Current built-in conditions are `Available`, `Progressing`, `UpdateInProgress`. + - Controllers/clients that parse `status.conditions` should tolerate and handle the new type. +5. If users set `networkConfig.publishNotReadyAddresses: true`, + peer FQDN can be resolved during init phase. + +Example: + +```yaml +apiVersion: leaderworkerset.x-k8s.io/v1 +kind: LeaderWorkerSet +metadata: + name: serving +spec: + leaderWorkerTemplate: + restartPolicy: RecreateGroupOnPodRestart + maxGroupRestarts: 3 + networkConfig: + subdomainPolicy: UniquePerReplica + publishNotReadyAddresses: true +``` + +## Why No Failed Before + +Historically, LWS was designed with a "keep reconciling" model: + +1. Built-in conditions (`Available`, `Progressing`, `UpdateInProgress`) describe + availability/progress/rollout, not terminal failure. +2. For restart paths, previous behavior assumed continuous self-healing was preferred + over introducing a terminal state at LWS level. +3. There was no user-configured retry budget in API, so introducing `Failed` would + have been ambiguous ("failed after how many attempts?"). + +This KEP changes that precondition by adding an explicit budget +(`maxGroupRestarts`). With a clear threshold, `Failed` now has deterministic and +user-controlled semantics, and avoids unbounded recreate loops. + +## Design Details + +### API + +API changes in this KEP: + +1. **Spec field additions (new user knobs)**: + - `spec.leaderWorkerTemplate.maxGroupRestarts` (*int32, optional, minimum 0) + - `spec.networkConfig.publishNotReadyAddresses` (bool, default false) +2. **Status semantic extension**: + - no new status field is added; + - `status.conditions[]` introduces a new condition type value: `Failed`. +3. **Compatibility**: + - CRD schema shape for `status.conditions` stays the same (`[]metav1.Condition`); + - clients/controllers that switch on known condition types must handle unknown/new values safely. + +```go +type NetworkConfig struct { + // +kubebuilder:validation:Enum={Shared,UniquePerReplica} + SubdomainPolicy *SubdomainPolicy `json:"subdomainPolicy"` + // +optional + // +kubebuilder:default=false + PublishNotReadyAddresses bool `json:"publishNotReadyAddresses,omitempty"` +} + +type LeaderWorkerTemplate struct { + // +optional + // +kubebuilder:validation:Minimum=0 + MaxGroupRestarts *int32 `json:"maxGroupRestarts,omitempty"` +} + +const ( + LeaderWorkerSetFailed LeaderWorkerSetConditionType = "Failed" +) + +const ( + GroupRestartCountAnnotationKey = "leaderworkerset.sigs.k8s.io/group-restart-count" +) +``` + +### Controller Behavior + +1. Restart reconcile path: + for `RecreateGroupOnPodRestart`, controller checks group restart budget before leader deletion. +2. In `RecreateGroupOnPodRestart` path, before deleting leader pod: + - read `group-restart-count` annotation; + - if `count >= maxGroupRestarts`, stop recreation and mark replica failed; + - otherwise increment annotation and continue current recreation flow. +3. The counter is group-level, not init-only: any failure path that enters + `RecreateGroupOnPodRestart` contributes to the same retry budget. +4. Service reconcile sets: + `svc.Spec.PublishNotReadyAddresses = spec.networkConfig.publishNotReadyAddresses`. +5. Counter persistence is annotation-based so controller restarts do not reset state. + +### Status Semantics + +1. `maxGroupRestarts` unset: current behavior (unbounded recreation). +2. `maxGroupRestarts: 0`: first group-level failure becomes terminal. +3. `maxGroupRestarts` is effective only for `restartPolicy: RecreateGroupOnPodRestart`. +4. Any replica exceeding limit sets LWS condition `Failed=True` + with reason `MaxGroupRestartsExceeded`. +5. Failed replica is excluded from available-ready accounting. + +### Operational Notes + +1. Restart counting: + one "group restart" means one leader deletion in `RecreateGroupOnPodRestart` path. +2. Counter persistence: + `group-restart-count` annotation is used so controller restart does not reset budget. +3. After terminal failure: + no further group recreation is attempted for the failed replica; pod is retained for debugging. +4. DNS behavior: + with `publishNotReadyAddresses: false`, peer FQDN may not be resolvable in init phase. +5. `startupPolicy: LeaderReady` interaction: + worker groups are still created only after leader is ready; this proposal does not alter that gate. + +## Risks and Mitigations + +1. Strict `maxGroupRestarts` may fail transient issues. + Mitigation: opt-in, user-controlled threshold. +2. Annotation increment and delete are not fully atomic. + Mitigation: acceptable best-effort bound for fail-fast policy. +3. `publishNotReadyAddresses=true` exposes DNS records before readiness. + Mitigation: opt-in, LWS-owned scoped headless Service. + +## Drawbacks + +1. A strict restart budget can convert transient failures into terminal failure. +2. `publishNotReadyAddresses=true` publishes not-ready pod DNS records by design. +3. Keeping failed pods for debugging may hold cluster resources until user cleanup. + +## Alternatives + +1. Startup/readiness probes: rejected. + Probes run with main process and do not provide strict pre-main gating; they also + do not solve init-phase DNS publish timing without `publishNotReadyAddresses`. +2. Entrypoint wrapper script: rejected due to image/command coupling. +3. Sidecar-based checks: rejected due to lifecycle mismatch for one-shot gate. +4. User manually patches Service: rejected because controller reconciliation overwrites it. + +## Test Plan + +1. Unit: + - restart counter read/increment logic; + - threshold transition to `Failed`. + - service construction propagates `publishNotReadyAddresses`. +2. Integration: + - bounded recreate behavior with `maxGroupRestarts`; + - no regression when `maxGroupRestarts` is unset; + - DNS publish behavior with `publishNotReadyAddresses=true`. +3. e2e: + - init-container peer check succeeds with publish enabled; + - repeated init failure reaches `Failed` after configured limit. + +## Implementation History + +- 2026-04-16: Initial draft. +- 2026-04-16: Simplified structure and clarified universal scope beyond preflight-only use. From 26e4169ca0cf2cef364bc00f3418661495ca85ce Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Thu, 28 May 2026 17:35:57 +0800 Subject: [PATCH 2/4] address comments Signed-off-by: Peter Pan --- .../README.md | 21 +++++++++++-------- 1 file changed, 12 insertions(+), 9 deletions(-) rename keps/{NNN-distributed-preflight-check => 820-distributed-preflight-check}/README.md (90%) diff --git a/keps/NNN-distributed-preflight-check/README.md b/keps/820-distributed-preflight-check/README.md similarity index 90% rename from keps/NNN-distributed-preflight-check/README.md rename to keps/820-distributed-preflight-check/README.md index 2b6334d12..11a1d3182 100644 --- a/keps/NNN-distributed-preflight-check/README.md +++ b/keps/820-distributed-preflight-check/README.md @@ -1,5 +1,5 @@ -# KEP-NNN: Fail-Fast Restart Budget and Init-Phase DNS for LeaderWorkerSet +# KEP-820: Fail-Fast Restart Budget and Init-Phase DNS for LeaderWorkerSet - [Summary](#summary) @@ -122,10 +122,11 @@ API changes in this KEP: 1. **Spec field additions (new user knobs)**: - `spec.leaderWorkerTemplate.maxGroupRestarts` (*int32, optional, minimum 0) - `spec.networkConfig.publishNotReadyAddresses` (bool, default false) -2. **Status semantic extension**: +2. **Validation constraint**: `maxGroupRestarts` is only valid when `restartPolicy` is `RecreateGroupOnPodRestart`. The validating webhook rejects any LWS with `maxGroupRestarts` set but a different restart policy. +3. **Status semantic extension**: - no new status field is added; - `status.conditions[]` introduces a new condition type value: `Failed`. -3. **Compatibility**: +4. **Compatibility**: - CRD schema shape for `status.conditions` stays the same (`[]metav1.Condition`); - clients/controllers that switch on known condition types must handle unknown/new values safely. @@ -155,23 +156,25 @@ const ( ### Controller Behavior -1. Restart reconcile path: +1. **Webhook validation**: + before creating or updating an LWS, the validating webhook checks that `maxGroupRestarts` is only set when `restartPolicy: RecreateGroupOnPodRestart`. If `maxGroupRestarts` is set with a different restart policy, the webhook denies the request with a clear error message. +2. Restart reconcile path: for `RecreateGroupOnPodRestart`, controller checks group restart budget before leader deletion. -2. In `RecreateGroupOnPodRestart` path, before deleting leader pod: +3. In `RecreateGroupOnPodRestart` path, before deleting leader pod: - read `group-restart-count` annotation; - if `count >= maxGroupRestarts`, stop recreation and mark replica failed; - otherwise increment annotation and continue current recreation flow. -3. The counter is group-level, not init-only: any failure path that enters +4. The counter is group-level, not init-only: any failure path that enters `RecreateGroupOnPodRestart` contributes to the same retry budget. -4. Service reconcile sets: +5. Service reconcile sets: `svc.Spec.PublishNotReadyAddresses = spec.networkConfig.publishNotReadyAddresses`. -5. Counter persistence is annotation-based so controller restarts do not reset state. +6. Counter persistence is annotation-based so controller restarts do not reset state. ### Status Semantics 1. `maxGroupRestarts` unset: current behavior (unbounded recreation). 2. `maxGroupRestarts: 0`: first group-level failure becomes terminal. -3. `maxGroupRestarts` is effective only for `restartPolicy: RecreateGroupOnPodRestart`. +3. `maxGroupRestarts` requires `restartPolicy: RecreateGroupOnPodRestart` (enforced by webhook). 4. Any replica exceeding limit sets LWS condition `Failed=True` with reason `MaxGroupRestartsExceeded`. 5. Failed replica is excluded from available-ready accounting. From 4b844eeef598f3255b52490333d4dfd81bf24213 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Tue, 2 Jun 2026 14:07:15 +0800 Subject: [PATCH 3/4] docs(kep): add metadata for distributed preflight check Signed-off-by: Peter Pan --- keps/820-distributed-preflight-check/kep.yaml | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) create mode 100644 keps/820-distributed-preflight-check/kep.yaml diff --git a/keps/820-distributed-preflight-check/kep.yaml b/keps/820-distributed-preflight-check/kep.yaml new file mode 100644 index 000000000..03b4d46bc --- /dev/null +++ b/keps/820-distributed-preflight-check/kep.yaml @@ -0,0 +1,18 @@ +title: Fail-Fast Restart Budget and Init-Phase DNS for LeaderWorkerSet +kep-number: 820 +authors: + - "@panpan0000" +status: provisional +creation-date: 2026-06-02 +reviewers: + - "@Edwinhr716" + - "@yankay" +approvers: + - "@Edwinhr716" + +stage: alpha + +latest-milestone: "v0.9.0" + +milestone: + alpha: "v0.9.0" \ No newline at end of file From 67a46ffcb1165fdafdef92aede07ecb46bc66332 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Tue, 2 Jun 2026 14:52:49 +0800 Subject: [PATCH 4/4] docs(kep): update distributed preflight check toc Signed-off-by: Peter Pan --- keps/820-distributed-preflight-check/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/keps/820-distributed-preflight-check/README.md b/keps/820-distributed-preflight-check/README.md index 11a1d3182..6b2ba98ad 100644 --- a/keps/820-distributed-preflight-check/README.md +++ b/keps/820-distributed-preflight-check/README.md @@ -3,6 +3,7 @@ - [Summary](#summary) + - [Story](#story) - [Goals](#goals) - [Non-Goals](#non-goals) - [Proposal](#proposal)