-
Notifications
You must be signed in to change notification settings - Fork 150
docs(kep): draft fail-fast restart budget and init-phase DNS for LWS #813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
panpan0000
wants to merge
4
commits into
kubernetes-sigs:main
Choose a base branch
from
panpan0000:kep/failed-limit-init-dns
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+255
−0
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
0590df4
docs(kep): draft fail-fast restart budget and init-phase dns for lws
panpan0000 f88b79a
address comments
panpan0000 0d142bd
docs(kep): add metadata for distributed preflight check
panpan0000 758d435
docs(kep): update distributed preflight check toc
panpan0000 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,237 @@ | ||
|
|
||
| # KEP-820: Fail-Fast Restart Budget and Init-Phase DNS for LeaderWorkerSet | ||
|
|
||
| <!-- toc --> | ||
| - [Summary](#summary) | ||
| - [Story](#story) | ||
| - [Goals](#goals) | ||
| - [Non-Goals](#non-goals) | ||
| - [Proposal](#proposal) | ||
| - [Why No Failed Before](#why-no-failed-before) | ||
| - [Design Details](#design-details) | ||
| - [API](#api) | ||
| - [Controller Behavior](#controller-behavior) | ||
| - [Status Semantics](#status-semantics) | ||
| - [Operational Notes](#operational-notes) | ||
| - [Risks and Mitigations](#risks-and-mitigations) | ||
| - [Drawbacks](#drawbacks) | ||
| - [Alternatives](#alternatives) | ||
| - [Test Plan](#test-plan) | ||
| - [Implementation History](#implementation-history) | ||
| <!-- /toc --> | ||
|
|
||
| ## Summary | ||
|
|
||
| This KEP adds two opt-in capabilities to LeaderWorkerSet (LWS): | ||
|
|
||
| 1. `leaderWorkerTemplate.maxGroupRestarts` plus a terminal `Failed` condition for bounded retries. | ||
| 2. `networkConfig.publishNotReadyAddresses` to make peer FQDN resolvable in init phase (secondary to fail-fast). | ||
|
|
||
| This proposal addresses two generic LWS gaps: | ||
| - group recreation can loop forever without a terminal failure boundary. | ||
| - peer DNS FQDN may be unavailable during init-container phase unless explicitly published. | ||
| It applies to any workload with init-phase peer communication and to any repeated | ||
| group recreation caused by failures in init-containers or main containers. | ||
|
|
||
|
|
||
| ### Story | ||
|
|
||
| When users run distributed pre-checks (for example NCCL tests) in LWS init-containers, | ||
| a failure can push the group into an infinite recreate loop in the `RecreateGroupOnPodRestart` path. | ||
| This KEP adds a fail-fast boundary after N group recreation attempts, and also ensures | ||
| leader FQDN can be resolved during init phase when explicitly enabled. | ||
|
|
||
|
|
||
| ## Goals | ||
|
|
||
| 1. Provide a native fail-fast mechanism after N group recreation attempts, | ||
| regardless of whether the trigger is init-container failure or main-container failure. | ||
| 2. Enable init-containers to resolve `LWS_LEADER_ADDRESS` during init phase. | ||
| 3. Keep both features backward compatible and opt-in (`false`/`nil` by default). | ||
| 4. Reuse existing env vars (`LWS_LEADER_ADDRESS`, `LWS_GROUP_SIZE`, `LWS_WORKER_INDEX`). | ||
|
|
||
| ## Non-Goals | ||
|
|
||
| 1. Add new env vars for this feature. | ||
| 2. Ship built-in preflight images/scripts. | ||
| 3. Change container-level restart semantics. | ||
| 4. Add automatic remediation after failure. | ||
|
|
||
| ## Proposal | ||
|
|
||
| Decision matrix: | ||
|
|
||
| | Problem | Solution | Origin | | ||
| |---|---|---| | ||
| | Infinite recreate loop, no terminal state (including main-container repeated failure) | `maxGroupRestarts` + `Failed` condition | LWS requirement | | ||
| | Init-container cannot resolve leader FQDN | `networkConfig.publishNotReadyAddresses: true` | Kubeflow Trainer PR #3417 discussion | | ||
| | Opt-in vs opt-out concern | Both fields are opt-in (`nil/false` default) | Maintainer feedback pattern in Trainer | | ||
| | "Why not startup/readiness probe?" | Keep as rejected alternative | Trainer discussion on probe semantics | | ||
| | Need more env vars? | No, reuse existing LWS envs | Existing LWS behavior | | ||
|
|
||
| User-visible behavior change: | ||
|
|
||
| 1. Default behavior is unchanged. | ||
| - If `maxGroupRestarts` is unset, group recreation remains unbounded. | ||
| - If `publishNotReadyAddresses` is unset/false, init-phase DNS behavior stays as today. | ||
| 2. If users set `leaderWorkerTemplate.maxGroupRestarts: N`, | ||
| LWS allows at most `N` group recreations in `RecreateGroupOnPodRestart` path. | ||
| 3. After the limit is exceeded, LWS sets `Failed=True` and stops further group recreation. | ||
| 4. `Failed` is a new LWS condition type added by this KEP. | ||
| - Current built-in conditions are `Available`, `Progressing`, `UpdateInProgress`. | ||
| - Controllers/clients that parse `status.conditions` should tolerate and handle the new type. | ||
| 5. If users set `networkConfig.publishNotReadyAddresses: true`, | ||
| peer FQDN can be resolved during init phase. | ||
|
|
||
| Example: | ||
|
|
||
| ```yaml | ||
| apiVersion: leaderworkerset.x-k8s.io/v1 | ||
| kind: LeaderWorkerSet | ||
| metadata: | ||
| name: serving | ||
| spec: | ||
| leaderWorkerTemplate: | ||
| restartPolicy: RecreateGroupOnPodRestart | ||
| maxGroupRestarts: 3 | ||
| networkConfig: | ||
| subdomainPolicy: UniquePerReplica | ||
| publishNotReadyAddresses: true | ||
| ``` | ||
|
|
||
| ## Why No Failed Before | ||
|
|
||
| Historically, LWS was designed with a "keep reconciling" model: | ||
|
|
||
| 1. Built-in conditions (`Available`, `Progressing`, `UpdateInProgress`) describe | ||
| availability/progress/rollout, not terminal failure. | ||
| 2. For restart paths, previous behavior assumed continuous self-healing was preferred | ||
| over introducing a terminal state at LWS level. | ||
| 3. There was no user-configured retry budget in API, so introducing `Failed` would | ||
| have been ambiguous ("failed after how many attempts?"). | ||
|
|
||
| This KEP changes that precondition by adding an explicit budget | ||
| (`maxGroupRestarts`). With a clear threshold, `Failed` now has deterministic and | ||
| user-controlled semantics, and avoids unbounded recreate loops. | ||
|
|
||
| ## Design Details | ||
|
|
||
| ### API | ||
|
|
||
| API changes in this KEP: | ||
|
|
||
| 1. **Spec field additions (new user knobs)**: | ||
| - `spec.leaderWorkerTemplate.maxGroupRestarts` (*int32, optional, minimum 0) | ||
| - `spec.networkConfig.publishNotReadyAddresses` (bool, default false) | ||
| 2. **Validation constraint**: `maxGroupRestarts` is only valid when `restartPolicy` is `RecreateGroupOnPodRestart`. The validating webhook rejects any LWS with `maxGroupRestarts` set but a different restart policy. | ||
| 3. **Status semantic extension**: | ||
| - no new status field is added; | ||
| - `status.conditions[]` introduces a new condition type value: `Failed`. | ||
| 4. **Compatibility**: | ||
| - CRD schema shape for `status.conditions` stays the same (`[]metav1.Condition`); | ||
| - clients/controllers that switch on known condition types must handle unknown/new values safely. | ||
|
|
||
| ```go | ||
| type NetworkConfig struct { | ||
| // +kubebuilder:validation:Enum={Shared,UniquePerReplica} | ||
| SubdomainPolicy *SubdomainPolicy `json:"subdomainPolicy"` | ||
| // +optional | ||
| // +kubebuilder:default=false | ||
| PublishNotReadyAddresses bool `json:"publishNotReadyAddresses,omitempty"` | ||
| } | ||
|
|
||
| type LeaderWorkerTemplate struct { | ||
| // +optional | ||
| // +kubebuilder:validation:Minimum=0 | ||
| MaxGroupRestarts *int32 `json:"maxGroupRestarts,omitempty"` | ||
| } | ||
|
|
||
| const ( | ||
| LeaderWorkerSetFailed LeaderWorkerSetConditionType = "Failed" | ||
| ) | ||
|
|
||
| const ( | ||
| GroupRestartCountAnnotationKey = "leaderworkerset.sigs.k8s.io/group-restart-count" | ||
| ) | ||
| ``` | ||
|
|
||
| ### Controller Behavior | ||
|
|
||
| 1. **Webhook validation**: | ||
| before creating or updating an LWS, the validating webhook checks that `maxGroupRestarts` is only set when `restartPolicy: RecreateGroupOnPodRestart`. If `maxGroupRestarts` is set with a different restart policy, the webhook denies the request with a clear error message. | ||
| 2. Restart reconcile path: | ||
| for `RecreateGroupOnPodRestart`, controller checks group restart budget before leader deletion. | ||
| 3. In `RecreateGroupOnPodRestart` path, before deleting leader pod: | ||
| - read `group-restart-count` annotation; | ||
| - if `count >= maxGroupRestarts`, stop recreation and mark replica failed; | ||
| - otherwise increment annotation and continue current recreation flow. | ||
| 4. The counter is group-level, not init-only: any failure path that enters | ||
| `RecreateGroupOnPodRestart` contributes to the same retry budget. | ||
| 5. Service reconcile sets: | ||
| `svc.Spec.PublishNotReadyAddresses = spec.networkConfig.publishNotReadyAddresses`. | ||
| 6. Counter persistence is annotation-based so controller restarts do not reset state. | ||
|
|
||
| ### Status Semantics | ||
|
|
||
| 1. `maxGroupRestarts` unset: current behavior (unbounded recreation). | ||
| 2. `maxGroupRestarts: 0`: first group-level failure becomes terminal. | ||
| 3. `maxGroupRestarts` requires `restartPolicy: RecreateGroupOnPodRestart` (enforced by webhook). | ||
| 4. Any replica exceeding limit sets LWS condition `Failed=True` | ||
| with reason `MaxGroupRestartsExceeded`. | ||
| 5. Failed replica is excluded from available-ready accounting. | ||
|
|
||
| ### Operational Notes | ||
|
|
||
| 1. Restart counting: | ||
| one "group restart" means one leader deletion in `RecreateGroupOnPodRestart` path. | ||
| 2. Counter persistence: | ||
| `group-restart-count` annotation is used so controller restart does not reset budget. | ||
| 3. After terminal failure: | ||
| no further group recreation is attempted for the failed replica; pod is retained for debugging. | ||
| 4. DNS behavior: | ||
| with `publishNotReadyAddresses: false`, peer FQDN may not be resolvable in init phase. | ||
| 5. `startupPolicy: LeaderReady` interaction: | ||
| worker groups are still created only after leader is ready; this proposal does not alter that gate. | ||
|
|
||
| ## Risks and Mitigations | ||
|
|
||
| 1. Strict `maxGroupRestarts` may fail transient issues. | ||
| Mitigation: opt-in, user-controlled threshold. | ||
| 2. Annotation increment and delete are not fully atomic. | ||
| Mitigation: acceptable best-effort bound for fail-fast policy. | ||
| 3. `publishNotReadyAddresses=true` exposes DNS records before readiness. | ||
| Mitigation: opt-in, LWS-owned scoped headless Service. | ||
|
|
||
| ## Drawbacks | ||
|
|
||
| 1. A strict restart budget can convert transient failures into terminal failure. | ||
| 2. `publishNotReadyAddresses=true` publishes not-ready pod DNS records by design. | ||
| 3. Keeping failed pods for debugging may hold cluster resources until user cleanup. | ||
|
|
||
| ## Alternatives | ||
|
|
||
| 1. Startup/readiness probes: rejected. | ||
| Probes run with main process and do not provide strict pre-main gating; they also | ||
| do not solve init-phase DNS publish timing without `publishNotReadyAddresses`. | ||
| 2. Entrypoint wrapper script: rejected due to image/command coupling. | ||
| 3. Sidecar-based checks: rejected due to lifecycle mismatch for one-shot gate. | ||
| 4. User manually patches Service: rejected because controller reconciliation overwrites it. | ||
|
|
||
| ## Test Plan | ||
|
|
||
| 1. Unit: | ||
| - restart counter read/increment logic; | ||
| - threshold transition to `Failed`. | ||
| - service construction propagates `publishNotReadyAddresses`. | ||
| 2. Integration: | ||
| - bounded recreate behavior with `maxGroupRestarts`; | ||
| - no regression when `maxGroupRestarts` is unset; | ||
| - DNS publish behavior with `publishNotReadyAddresses=true`. | ||
| 3. e2e: | ||
| - init-container peer check succeeds with publish enabled; | ||
| - repeated init failure reaches `Failed` after configured limit. | ||
|
|
||
| ## Implementation History | ||
|
|
||
| - 2026-04-16: Initial draft. | ||
| - 2026-04-16: Simplified structure and clarified universal scope beyond preflight-only use. | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| title: Fail-Fast Restart Budget and Init-Phase DNS for LeaderWorkerSet | ||
| kep-number: 820 | ||
| authors: | ||
| - "@panpan0000" | ||
| status: provisional | ||
| creation-date: 2026-06-02 | ||
| reviewers: | ||
| - "@Edwinhr716" | ||
| - "@yankay" | ||
| approvers: | ||
| - "@Edwinhr716" | ||
|
|
||
| stage: alpha | ||
|
|
||
| latest-milestone: "v0.9.0" | ||
|
|
||
| milestone: | ||
| alpha: "v0.9.0" |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.