feat: add annotation to exclude components from triggering restart po…#332
feat: add annotation to exclude components from triggering restart po…#332Syspretor wants to merge 9 commits into
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
Pull request overview
This PR introduces a component-level annotation intended to prevent certain Pods (e.g., auxiliary/monitoring components in customComponentsPattern) from triggering restart-policy actions when they restart or are deleted.
Changes:
- Add a new annotation key/value (
rbg.workloads.x-k8s.io/restart-trigger-policy: "Ignore") and constants for it. - Gate Pod-event → RBG reconciliation enqueueing in the pod controller when the annotation is set to
"Ignore". - Update failure-handling documentation and a
customComponentsPatternexample to describe/illustrate the feature.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
internal/controller/workloads/pod_controller.go |
Skips enqueueing RBG reconcile on pod restart/delete when the new annotation is set to Ignore. |
api/workloads/constants/annotation.go |
Defines the new annotation key and allowed values (Inherit / Ignore). |
doc/features/failure-handling.md |
Documents excluding components from restart-policy triggers via the new annotation. |
examples/basic/rbg/patterns/custom-components-pattern.yaml |
Adds explanatory comments and an example component annotated to not trigger restart policy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cheyang
left a comment
There was a problem hiding this comment.
The annotation check only covers the RecreateRBGOnPodRestart path. The RecreateRoleInstanceOnPodRestart path (which is the default and more common policy) is not covered — see inline comment for details.
4d30611 to
f50648a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
pkg/reconciler/roleinstance/sync/instance_scale.go:287
containerRestartedonly checkspod.Status.ContainerStatusesand ignoresInitContainerStatuses(and potentiallyEphemeralContainerStatuses). If an init container restarts, the instance recreation policy won’t trigger even though the doc comment says “any container” and LWS semantics include init-container restarts. Consider checking init-container restart counts as well (and decide explicitly whether to include ephemeral containers).
// containerRestarted checks if any container in the pod has been restarted.
func containerRestarted(pod *v1.Pod) bool {
for i := range pod.Status.ContainerStatuses {
if pod.Status.ContainerStatuses[i].RestartCount > 0 {
return true
}
}
return false
| // Check if any container has restarted | ||
| if containerRestarted(p) { |
| RBG supports multiple failure handling policies: `None` and `RecreateRoleInstanceOnPodRestart`. | ||
|
|
||
|  | ||
|
|
||
| ## Restart Policy Types | ||
|
|
||
| | Policy | Description | | ||
| |--------|-------------| | ||
| | `None` | No automatic restart action; rely on default pod restart behavior. | | ||
| | `RecreateRBGOnPodRestart` | Recreate the entire RoleBasedGroup when any pod in this role restarts. Useful for critical roles that require all pods to be healthy. | | ||
| | `RecreateRoleInstanceOnPodRestart` | Recreate only the affected role instance when a pod restarts. More granular control for less critical roles. **This is the default policy for all patterns** (standalonePattern, leaderWorkerPattern, customComponentsPattern). | | ||
| | `None` | No automatic restart action; rely on default pod restart behavior. Failed pods are replaced through normal reconciliation. | | ||
| | `RecreateRoleInstanceOnPodRestart` | Recreate only the affected role instance when a pod fails or a container restarts. **This is the default policy for all patterns** (standalonePattern, leaderWorkerPattern, customComponentsPattern). | |
| // RestartTriggerPolicyAnnotationKey specifies whether a component's | ||
| // pod restart/failure events should trigger the role's restart policy | ||
| // (RecreateRoleInstanceOnPodRestart). | ||
| // Valid values are: | ||
| // - "Inherit" (or empty): Pod events from this component will follow the role's restart policy. | ||
| // - "Ignore": Pod events from this component will NOT trigger restart policy. | ||
| // This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose | ||
| // failures should not affect the main workload. | ||
| // Example: rbg.workloads.x-k8s.io/restart-trigger-policy: "Ignore" | ||
| RestartTriggerPolicyAnnotationKey = RBGPrefix + "restart-trigger-policy" | ||
| ) |
…licy
Add ComponentNoTriggerRestartPolicyAnnotationKey to allow specific components
in customComponentsPattern to be excluded from triggering the role's restart
policy. When this annotation is set to "true" on a component's pod template,
pod restart/delete events from this component will NOT trigger
RecreateRoleInstanceOnPodRestart or RecreateRBGOnPodRestart actions.
This is useful for auxiliary components (e.g., monitoring, logging sidecars)
whose failures should not affect the main workload.
Usage:
template:
metadata:
annotations:
rbg.workloads.x-k8s.io/component-no-trigger-restart: "true"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change the annotation from boolean format to policy enum format: - Annotation key: rbg.workloads.x-k8s.io/restart-trigger-policy - Valid values: "Inherit" (default) and "Ignore" "Inherit" means the component follows the role's restart policy. "Ignore" means the component's pod events will not trigger restart policy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cheyang
left a comment
There was a problem hiding this comment.
Two issues to fix before merge:
-
containerRestarted()needs DeletionTimestamp + phase guards to match the adjacent PodFailed check — without them, Succeeded pods with RestartCount > 0 and terminating pods incorrectly trigger instance recreation. -
Docs overstate pattern scope — the annotation only takes effect for customComponentsPattern and standalonePattern. LeaderWorkerPattern delegates restart handling to the LWS controller (
lws_reconciler.go:286) which doesn't check this annotation.
The annotation design is solid, test coverage is comprehensive, and the previous review feedback about RecreateRoleInstanceOnPodRestart coverage has been properly addressed.
| return true | ||
| } | ||
| // Check if any container has restarted | ||
| if containerRestarted(p) { |
There was a problem hiding this comment.
containerRestarted(p) runs on all pods including Succeeded ones (via the inactivePods list) and pods being deleted. The sibling PodFailed check above has a DeletionTimestamp == nil guard but this doesn't.
A Succeeded pod that had a container restart during its lifetime (e.g. init crash-loop before eventual completion) would incorrectly trigger instance recreation. Same for a pod mid-termination.
Suggest:
if p.DeletionTimestamp == nil && p.Status.Phase != v1.PodSucceeded && containerRestarted(p) {
return true
}|
|
||
| ## Excluding Components from Restart Policy Trigger | ||
|
|
||
| You can prevent specific pods from triggering the role's restart policy by setting an annotation on the pod template. This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose failures should not affect the main workload. The annotation works with any deployment pattern (standalonePattern, leaderWorkerPattern, or customComponentsPattern). |
There was a problem hiding this comment.
This claims the annotation works with "any deployment pattern" including LeaderWorkerPattern, but LeaderWorkerPattern maps RecreateRoleInstanceOnPodRestart to LWS RecreateGroupOnPodRestart and delegates entirely to the LWS controller (lws_reconciler.go:286). The LWS controller doesn't check this annotation.
Should be scoped to customComponentsPattern and standalonePattern.
Re-implement the component-level restart-trigger-policy annotation after rebase onto upstream/main which removed pod_controller.go. The logic now lives in shouldRecreateInstance (instance_scale.go): - Skip pods with restart-trigger-policy=Ignore when checking for Failed phase or container restart (RestartCount > 0) - Add containerRestarted check for RoleInstanceSet-managed instances - Add unit tests for hasTriggerPolicyIgnore, containerRestarted, and shouldRecreateInstance with annotation scenarios - Add e2e tests for customComponentsPattern verifying ignored vs non-ignored component pod failure behavior Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eferences Address PR review comments: - Remove all references to RecreateRBGOnPodRestart (removed in upstream sgl-project#340) - Update annotation godoc to only mention RecreateRoleInstanceOnPodRestart - Fix failure-handling.md to reflect current restart policy types - Update example YAML annotation comment - Fix "Use Cases" section to not list ignoreRestartPolicy as a policy enum Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
726fdf2 to
150e59b
Compare
…creations When restartPolicy=RecreateRoleInstanceOnPodRestart triggers, newly recreated pods may have transient RestartCount > 0, which would re-trigger recreation in an infinite loop. Add a Restarting=True condition that is set when recreation triggers and cleared when the instance becomes Ready again, breaking the cascade. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
Replace the inline Restarting condition check (which relied on the informer cache) with a defer-based guard in shouldRecreateInstanceGuarded. The guard first checks a sync.Map cache (zero latency, immune to informer lag), then falls back to a direct API server read via apiReader. This eliminates the race window where informer cache lag could allow a cascading restart to slip through. Remove allPodsTerminating and isInstanceRestartingFromCache helpers that are superseded by this approach. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Restarting condition was being unconditionally propagated from updateInstance to newStatus on every reconcile. Since updateInstance carries the persisted status (including Restarting=True from a previous reconcile), this prevented setInstanceConditions from clearing it when the instance became Ready. Move the propagation inside the `if scaling` block so it is only carried forward while pod deletion/creation is in progress. Once scaling completes, the condition is no longer injected into newStatus, allowing setInstanceConditions to clear it via its instanceReady guard. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…licy
Add ComponentNoTriggerRestartPolicyAnnotationKey to allow specific components in customComponentsPattern to be excluded from triggering the role's restart policy. When this annotation is set to "true" on a component's pod template, pod restart/delete events from this component will NOT trigger RecreateRoleInstanceOnPodRestart or RecreateRBGOnPodRestart actions.
This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose failures should not affect the main workload.
Usage:
template:
metadata:
annotations:
rbg.workloads.x-k8s.io/component-no-trigger-restart: "true"
Ⅰ. Motivation
Ⅱ. Modifications
Ⅲ. Does this pull request fix one issue?
fixes #XXXX
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
VI. Special notes for reviews
Checklist
make fmt.