feat: add annotation to exclude components from triggering restart po… by Syspretor · Pull Request #332 · sgl-project/rbg

Syspretor · 2026-05-09T02:25:53Z

…licy

Add ComponentNoTriggerRestartPolicyAnnotationKey to allow specific components in customComponentsPattern to be excluded from triggering the role's restart policy. When this annotation is set to "true" on a component's pod template, pod restart/delete events from this component will NOT trigger RecreateRoleInstanceOnPodRestart or RecreateRBGOnPodRestart actions.

This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose failures should not affect the main workload.

Usage:
template:
metadata:
annotations:
rbg.workloads.x-k8s.io/component-no-trigger-restart: "true"

Ⅰ. Motivation

Ⅱ. Modifications

Ⅲ. Does this pull request fix one issue?

fixes #XXXX

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

Format your code make fmt.
Add unit tests or integration tests.
Update the documentation related to the change.

gemini-code-assist · 2026-05-09T02:25:57Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copilot

Pull request overview

This PR introduces a component-level annotation intended to prevent certain Pods (e.g., auxiliary/monitoring components in customComponentsPattern) from triggering restart-policy actions when they restart or are deleted.

Changes:

Add a new annotation key/value (rbg.workloads.x-k8s.io/restart-trigger-policy: "Ignore") and constants for it.
Gate Pod-event → RBG reconciliation enqueueing in the pod controller when the annotation is set to "Ignore".
Update failure-handling documentation and a customComponentsPattern example to describe/illustrate the feature.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File	Description
`internal/controller/workloads/pod_controller.go`	Skips enqueueing RBG reconcile on pod restart/delete when the new annotation is set to `Ignore`.
`api/workloads/constants/annotation.go`	Defines the new annotation key and allowed values (`Inherit` / `Ignore`).
`doc/features/failure-handling.md`	Documents excluding components from restart-policy triggers via the new annotation.
`examples/basic/rbg/patterns/custom-components-pattern.yaml`	Adds explanatory comments and an example component annotated to not trigger restart policy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

cheyang

The annotation check only covers the RecreateRBGOnPodRestart path. The RecreateRoleInstanceOnPodRestart path (which is the default and more common policy) is not covered — see inline comment for details.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

pkg/reconciler/roleinstance/sync/instance_scale.go:287

containerRestarted only checks pod.Status.ContainerStatuses and ignores InitContainerStatuses (and potentially EphemeralContainerStatuses). If an init container restarts, the instance recreation policy won’t trigger even though the doc comment says “any container” and LWS semantics include init-container restarts. Consider checking init-container restart counts as well (and decide explicitly whether to include ephemeral containers).

// containerRestarted checks if any container in the pod has been restarted.
func containerRestarted(pod *v1.Pod) bool {
	for i := range pod.Status.ContainerStatuses {
		if pod.Status.ContainerStatuses[i].RestartCount > 0 {
			return true
		}
	}
	return false

+		// Check if any container has restarted
+		if containerRestarted(p) {


+RBG supports multiple failure handling policies: `None` and `RecreateRoleInstanceOnPodRestart`.

 ![failure-handling](../img/failure-handling.png)

 ## Restart Policy Types

 | Policy | Description |
 |--------|-------------|
-| `None` | No automatic restart action; rely on default pod restart behavior. |
-| `RecreateRBGOnPodRestart` | Recreate the entire RoleBasedGroup when any pod in this role restarts. Useful for critical roles that require all pods to be healthy. |
-| `RecreateRoleInstanceOnPodRestart` | Recreate only the affected role instance when a pod restarts. More granular control for less critical roles. **This is the default policy for all patterns** (standalonePattern, leaderWorkerPattern, customComponentsPattern). |
+| `None` | No automatic restart action; rely on default pod restart behavior. Failed pods are replaced through normal reconciliation. |
+| `RecreateRoleInstanceOnPodRestart` | Recreate only the affected role instance when a pod fails or a container restarts. **This is the default policy for all patterns** (standalonePattern, leaderWorkerPattern, customComponentsPattern). |


+	// RestartTriggerPolicyAnnotationKey specifies whether a component's
+	// pod restart/failure events should trigger the role's restart policy
+	// (RecreateRoleInstanceOnPodRestart).
+	// Valid values are:
+	//   - "Inherit" (or empty): Pod events from this component will follow the role's restart policy.
+	//   - "Ignore": Pod events from this component will NOT trigger restart policy.
+	// This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose
+	// failures should not affect the main workload.
+	// Example: rbg.workloads.x-k8s.io/restart-trigger-policy: "Ignore"
+	RestartTriggerPolicyAnnotationKey = RBGPrefix + "restart-trigger-policy"
+)


…licy Add ComponentNoTriggerRestartPolicyAnnotationKey to allow specific components in customComponentsPattern to be excluded from triggering the role's restart policy. When this annotation is set to "true" on a component's pod template, pod restart/delete events from this component will NOT trigger RecreateRoleInstanceOnPodRestart or RecreateRBGOnPodRestart actions. This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose failures should not affect the main workload. Usage: template: metadata: annotations: rbg.workloads.x-k8s.io/component-no-trigger-restart: "true" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Change the annotation from boolean format to policy enum format: - Annotation key: rbg.workloads.x-k8s.io/restart-trigger-policy - Valid values: "Inherit" (default) and "Ignore" "Inherit" means the component follows the role's restart policy. "Ignore" means the component's pod events will not trigger restart policy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cheyang

Two issues to fix before merge:

containerRestarted() needs DeletionTimestamp + phase guards to match the adjacent PodFailed check — without them, Succeeded pods with RestartCount > 0 and terminating pods incorrectly trigger instance recreation.
Docs overstate pattern scope — the annotation only takes effect for customComponentsPattern and standalonePattern. LeaderWorkerPattern delegates restart handling to the LWS controller (lws_reconciler.go:286) which doesn't check this annotation.

The annotation design is solid, test coverage is comprehensive, and the previous review feedback about RecreateRoleInstanceOnPodRestart coverage has been properly addressed.

cheyang · 2026-05-20T09:24:07Z

 			return true
 		}
+		// Check if any container has restarted
+		if containerRestarted(p) {


containerRestarted(p) runs on all pods including Succeeded ones (via the inactivePods list) and pods being deleted. The sibling PodFailed check above has a DeletionTimestamp == nil guard but this doesn't.

A Succeeded pod that had a container restart during its lifetime (e.g. init crash-loop before eventual completion) would incorrectly trigger instance recreation. Same for a pod mid-termination.

Suggest:

if p.DeletionTimestamp == nil && p.Status.Phase != v1.PodSucceeded && containerRestarted(p) { return true }

cheyang · 2026-05-20T09:24:07Z


+## Excluding Components from Restart Policy Trigger
+
+You can prevent specific pods from triggering the role's restart policy by setting an annotation on the pod template. This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose failures should not affect the main workload. The annotation works with any deployment pattern (standalonePattern, leaderWorkerPattern, or customComponentsPattern).


This claims the annotation works with "any deployment pattern" including LeaderWorkerPattern, but LeaderWorkerPattern maps RecreateRoleInstanceOnPodRestart to LWS RecreateGroupOnPodRestart and delegates entirely to the LWS controller (lws_reconciler.go:286). The LWS controller doesn't check this annotation.

Should be scoped to customComponentsPattern and standalonePattern.

Re-implement the component-level restart-trigger-policy annotation after rebase onto upstream/main which removed pod_controller.go. The logic now lives in shouldRecreateInstance (instance_scale.go): - Skip pods with restart-trigger-policy=Ignore when checking for Failed phase or container restart (RestartCount > 0) - Add containerRestarted check for RoleInstanceSet-managed instances - Add unit tests for hasTriggerPolicyIgnore, containerRestarted, and shouldRecreateInstance with annotation scenarios - Add e2e tests for customComponentsPattern verifying ignored vs non-ignored component pod failure behavior Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…eferences Address PR review comments: - Remove all references to RecreateRBGOnPodRestart (removed in upstream sgl-project#340) - Update annotation godoc to only mention RecreateRoleInstanceOnPodRestart - Fix failure-handling.md to reflect current restart policy types - Update example YAML annotation comment - Fix "Use Cases" section to not list ignoreRestartPolicy as a policy enum Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…creations When restartPolicy=RecreateRoleInstanceOnPodRestart triggers, newly recreated pods may have transient RestartCount > 0, which would re-trigger recreation in an infinite loop. Add a Restarting=True condition that is set when recreation triggers and cleared when the instance becomes Ready again, breaking the cascade. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>

Replace the inline Restarting condition check (which relied on the informer cache) with a defer-based guard in shouldRecreateInstanceGuarded. The guard first checks a sync.Map cache (zero latency, immune to informer lag), then falls back to a direct API server read via apiReader. This eliminates the race window where informer cache lag could allow a cascading restart to slip through. Remove allPodsTerminating and isInstanceRestartingFromCache helpers that are superseded by this approach. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The Restarting condition was being unconditionally propagated from updateInstance to newStatus on every reconcile. Since updateInstance carries the persisted status (including Restarting=True from a previous reconcile), this prevented setInstanceConditions from clearing it when the instance became Ready. Move the propagation inside the `if scaling` block so it is only carried forward while pod deletion/creation is in progress. Once scaling completes, the condition is no longer injected into newStatus, allowing setInstanceConditions to clear it via its instanceReady guard. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cheyang requested a review from Copilot May 9, 2026 09:06

Copilot started reviewing on behalf of cheyang May 9, 2026 09:06 View session

Copilot AI reviewed May 9, 2026

View reviewed changes

cheyang requested changes May 9, 2026

View reviewed changes

Comment thread internal/controller/workloads/pod_controller.go Outdated

cheyang requested a review from Copilot May 15, 2026 02:16

Copilot started reviewing on behalf of cheyang May 15, 2026 02:16 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread internal/controller/workloads/pod_controller.go Outdated

Comment thread doc/features/failure-handling.md Outdated

Comment thread examples/basic/rbg/patterns/custom-components-pattern.yaml Outdated

Syspretor force-pushed the feat/allow-ignore-some-components-restart branch 3 times, most recently from 4d30611 to f50648a Compare May 20, 2026 06:11

cheyang requested a review from Copilot May 20, 2026 08:08

Copilot started reviewing on behalf of cheyang May 20, 2026 08:08 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

cheyang self-requested a review May 20, 2026 09:06

玖宇 and others added 3 commits May 20, 2026 17:20

fix comments

9591252

cheyang requested changes May 20, 2026

View reviewed changes

玖宇 and others added 3 commits May 20, 2026 17:25

fix: add trailing newline to custom-components-pattern.yaml

150e59b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Syspretor force-pushed the feat/allow-ignore-some-components-restart branch from 726fdf2 to 150e59b Compare May 20, 2026 09:28

玖宇 and others added 3 commits May 21, 2026 10:25

cheyang requested review from cheyang and Copilot May 21, 2026 05:59

Copilot started reviewing on behalf of cheyang May 21, 2026 06:03 View session

		// Check if any container has restarted
		if containerRestarted(p) {


		## Excluding Components from Restart Policy Trigger

		You can prevent specific pods from triggering the role's restart policy by setting an annotation on the pod template. This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose failures should not affect the main workload. The annotation works with any deployment pattern (standalonePattern, leaderWorkerPattern, or customComponentsPattern).

Conversation

Syspretor commented May 9, 2026

Ⅰ. Motivation

Ⅱ. Modifications

Ⅲ. Does this pull request fix one issue?

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

Uh oh!

gemini-code-assist Bot commented May 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cheyang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

cheyang left a comment

Choose a reason for hiding this comment

Uh oh!

cheyang May 20, 2026

Choose a reason for hiding this comment

Uh oh!

cheyang May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants