[slice] New workload condition with reason for eviction#1179

Open
pajakd wants to merge 3 commits into AI-Hypercomputer:slice-main from pajakd:new_condition

pajakd (Collaborator) commented Apr 27, 2026

Description

When a workload is evicted (by setting the admission check to Retry) because of a slice issue, we would like to record the reason for the eviction in a new, dedicated workload condition. I identified 4 possible issues:

  1. Slice runtime failure (the slice is in state FAILED).
  2. Slice formation timeout (the slice does not become ACTIVE within the specified time, even with the retry mechanism enabled).
  3. Slice deletion (e.g. with kubectl delete ...).
  4. Slice configuration issue (e.g. the number of cubes does not match the topology).
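
The four cases above could be mapped to condition reasons roughly as follows. This is only a sketch: the reason strings, the `sliceObservation` type, and the `evictionReason` helper are hypothetical names chosen for illustration and are not taken from this PR; only the ACTIVE and FAILED states come from the description.

```go
package main

import "fmt"

// Hypothetical reason strings, one per issue listed in the description.
// The names actually used by the PR are not shown on this page.
const (
	ReasonSliceFailed           = "SliceFailed"
	ReasonSliceFormationTimeout = "SliceFormationTimeout"
	ReasonSliceDeleted          = "SliceDeleted"
	ReasonSliceMisconfigured    = "SliceMisconfigured"
)

// sliceObservation is an illustrative stand-in for what a reconciler
// might know about a slice when deciding whether to evict the workload.
type sliceObservation struct {
	Exists            bool   // false if the slice object was deleted
	State             string // "ACTIVE" and "FAILED" are the states named in the description
	ConfigInvalid     bool   // e.g. number of cubes does not match the topology
	FormationTimedOut bool   // slice did not become ACTIVE in time
}

// evictionReason returns the reason to report, or "" when no eviction
// reason applies.
func evictionReason(o sliceObservation) string {
	switch {
	case !o.Exists:
		return ReasonSliceDeleted
	case o.ConfigInvalid:
		return ReasonSliceMisconfigured
	case o.State == "FAILED":
		return ReasonSliceFailed
	case o.State != "ACTIVE" && o.FormationTimedOut:
		return ReasonSliceFormationTimeout
	default:
		return ""
	}
}

func main() {
	fmt.Println(evictionReason(sliceObservation{Exists: true, State: "FAILED"}))
	fmt.Println(evictionReason(sliceObservation{Exists: false}))
}
```

The order of the cases is a design choice of this sketch: a deleted slice is checked first because its other fields carry no meaning once the object is gone.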

Issue

Testing


LWSLeaderPodSetName = "leader"

WorkloadSliceFailureConditionType = "SliceFailure"

Maybe TPUSliceFailure, so it is clear what type of slice we are talking about.

}

- func (r *WorkloadReconciler) updateWorkloadAdmissionCheckStatus(ctx context.Context, wl *kueue.Workload, ac *kueue.AdmissionCheckState) error {
+ func (r *WorkloadReconciler) updateWorkloadAdmissionCheckStatus(ctx context.Context, wl *kueue.Workload, ac *kueue.AdmissionCheckState, evictedReason string) error {

The function updateWorkloadAdmissionCheckStatus now also manages the SliceFailure condition on the Workload. Consider renaming it to something like updateWorkloadStatus to better reflect its expanded responsibility.

ac.State = kueue.CheckStatePending
ac.Message = api.TruncateConditionMessage(msg)
- patchErr := r.updateWorkloadAdmissionCheckStatus(ctx, wl, ac)
+ patchErr := r.updateWorkloadAdmissionCheckStatus(ctx, wl, ac, "")

When slice creation fails, ac.State is set to Pending and updateWorkloadAdmissionCheckStatus is called with an empty evictedReason, which clears any existing SliceFailure condition. Consider whether this creation failure should also be reported via the SliceFailure condition.
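
To make the concern concrete, here is a minimal, self-contained sketch of the set-or-clear behaviour described above. It assumes that "clearing" means removing the condition from the list; the PR may instead set its status to False. The `condition` type and the `applySliceFailure` and `hasSliceFailure` helpers are hypothetical stand-ins, not the code under review.

```go
package main

import "fmt"

// Condition type introduced by the PR.
const WorkloadSliceFailureConditionType = "SliceFailure"

// condition is a simplified, hypothetical stand-in for a workload
// status condition.
type condition struct {
	Type, Status, Reason string
}

// applySliceFailure sets the SliceFailure condition when evictedReason is
// non-empty and drops any existing one when it is empty. Removal (rather
// than Status "False") is an assumption of this sketch.
func applySliceFailure(conds []condition, evictedReason string) []condition {
	out := make([]condition, 0, len(conds)+1)
	for _, c := range conds {
		if c.Type != WorkloadSliceFailureConditionType {
			out = append(out, c)
		}
	}
	if evictedReason != "" {
		out = append(out, condition{
			Type:   WorkloadSliceFailureConditionType,
			Status: "True",
			Reason: evictedReason,
		})
	}
	return out
}

// hasSliceFailure reports whether the SliceFailure condition is present.
func hasSliceFailure(conds []condition) bool {
	for _, c := range conds {
		if c.Type == WorkloadSliceFailureConditionType {
			return true
		}
	}
	return false
}

func main() {
	conds := []condition{{Type: "Admitted", Status: "True"}}

	// An eviction records a reason ("SliceFailed" is an illustrative value).
	conds = applySliceFailure(conds, "SliceFailed")
	fmt.Println(hasSliceFailure(conds))

	// A later creation failure passes "", which drops the earlier reason.
	conds = applySliceFailure(conds, "")
	fmt.Println(hasSliceFailure(conds))
}
```

Under these assumptions, passing a dedicated reason for the creation-failure path instead of "" would keep the condition present and make the failure visible on the Workload.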
