docs(kep): draft fail-fast restart budget and init-phase DNS for LWS #813

Open
panpan0000 wants to merge 4 commits into kubernetes-sigs:main from panpan0000:kep/failed-limit-init-dns

Conversation

@panpan0000
Member

@panpan0000 panpan0000 commented Apr 16, 2026

Summary

  • Add KEP draft for LWS fail-fast restart budget with terminal condition.
  • Add optional support for init-phase peer DNS.
  • Prioritize fail-fast as primary motivation; DNS support is secondary.

Key Points

  • Add (opt-in, nil default).
  • Add new LWS condition type .
  • Keep current behavior unchanged when unset.
  • Reuse existing LWS env vars; no new env introduced.
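
As an illustration of the opt-in semantics listed above, here is a minimal sketch of how a restart budget with a nil default could be evaluated. It is not taken from the KEP: the names `RestartBudget`, `limit`, and `evaluate`, and the `"Restart"` / `"Failed"` outcomes, are hypothetical placeholders, since the actual field and condition type names are not shown in this description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartBudget:
    """Hypothetical opt-in budget; the real field name is defined in the KEP."""

    # None means the field is unset, so current behavior is kept unchanged.
    limit: Optional[int] = None


def evaluate(budget: RestartBudget, restarts: int) -> str:
    """Return "Restart" to keep recreating the group, or "Failed" once the budget is spent."""
    if budget.limit is None:
        # Unset budget: restart as before, with no terminal state.
        return "Restart"
    if restarts >= budget.limit:
        # Budget exhausted: stop restarting and report a terminal condition.
        return "Failed"
    return "Restart"
```

With the budget unset, `evaluate` never returns the terminal outcome, which mirrors the "keep current behavior unchanged when unset" point. Whether the comparison should be `>=` or `>` is a design choice for the KEP itself.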

Notes

  • This commit is docs-only and scoped to .

relevant to kubeflow/trainer#3417

Co-Author: GPT-5.4

@netlify

netlify Bot commented Apr 16, 2026

Deploy Preview for kubernetes-sigs-lws canceled.

Name Link
🔨 Latest commit 758d435
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-lws/deploys/6a1e80401f8b530008f3bb3c

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: panpan0000
Once this PR has been reviewed and has the lgtm label, please assign kerthcet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@linux-foundation-easycla

linux-foundation-easycla Bot commented Apr 16, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 16, 2026
@panpan0000 panpan0000 force-pushed the kep/failed-limit-init-dns branch from c21c1db to de083b3 Compare April 16, 2026 10:17
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 16, 2026
@panpan0000
Member Author

The 2nd commit (updating .github/workflows/) is a workaround for the CI failure. I'm not sure it works; it was suggested by AI.

@Edwinhr716
Contributor

@panpan0000 can you create an issue which describes what this KEP is trying to address?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 20, 2026
@Edwinhr716
Contributor

Edwinhr716 commented Apr 21, 2026

Also to fix the failing tests, we need to wait for #814 to be completed

@panpan0000
Member Author

panpan0000 commented Apr 21, 2026

  1. Linked the issue here: How to Support distributed preflight checks in init phase #820
  2. Removed the CI run workaround and rebased onto origin/main

@panpan0000 panpan0000 force-pushed the kep/failed-limit-init-dns branch from 820d7ff to de083b3 Compare April 21, 2026 09:52
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 21, 2026
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
@panpan0000 panpan0000 force-pushed the kep/failed-limit-init-dns branch from de083b3 to 0590df4 Compare April 29, 2026 07:01
@panpan0000
Member Author

Rebased to trigger CI again now that #814 has merged, as @Edwinhr716 mentioned.

@panpan0000
Member Author

It seems CI passed. Please take a look when you have time, thank you! @Edwinhr716

@panpan0000
Member Author

ping

@panpan0000
Member Author

panpan0000 commented May 4, 2026

I saw another problem: @Edwinhr716

When a vLLM pod crashes and is restarted by a probe, nothing visibly changes except that the AGE resets, so the failure is silent:

NAME                                                            READY   STATUS                     RESTARTS       AGE
deepseek-v4-pro-llm-d-modelservice-decode-0                     0/1     Running                    0              5m20s
deepseek-v4-pro-llm-d-modelservice-decode-0-1                   1/1     Running                    0              5m20s
 kubectl -n public get lws -o wide
deepseek-v4-pro-llm-d-modelservice-decode   10d

Comment thread keps/NNN-distributed-preflight-check/README.md Outdated
Comment thread keps/NNN-distributed-preflight-check/README.md Outdated
Comment thread keps/820-distributed-preflight-check/README.md
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
@panpan0000 panpan0000 force-pushed the kep/failed-limit-init-dns branch from ff477d0 to 0d142bd Compare June 2, 2026 06:10
@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 2, 2026
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
@panpan0000 panpan0000 force-pushed the kep/failed-limit-init-dns branch from 594394e to 758d435 Compare June 2, 2026 07:03
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 2, 2026

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

3 participants