Skip to content

feat: add annotation to exclude components from triggering restart po…#332

Open
Syspretor wants to merge 9 commits into
sgl-project:mainfrom
Syspretor:feat/allow-ignore-some-components-restart
Open

feat: add annotation to exclude components from triggering restart po…#332
Syspretor wants to merge 9 commits into
sgl-project:mainfrom
Syspretor:feat/allow-ignore-some-components-restart

Conversation

@Syspretor
Copy link
Copy Markdown
Collaborator

…licy

Add ComponentNoTriggerRestartPolicyAnnotationKey to allow specific components in customComponentsPattern to be excluded from triggering the role's restart policy. When this annotation is set to "true" on a component's pod template, pod restart/delete events from this component will NOT trigger RecreateRoleInstanceOnPodRestart or RecreateRBGOnPodRestart actions.

This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose failures should not affect the main workload.

Usage:
template:
metadata:
annotations:
rbg.workloads.x-k8s.io/component-no-trigger-restart: "true"

Ⅰ. Motivation

Ⅱ. Modifications

Ⅲ. Does this pull request fix one issue?

fixes #XXXX

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

@gemini-code-assist
Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a component-level annotation intended to prevent certain Pods (e.g., auxiliary/monitoring components in customComponentsPattern) from triggering restart-policy actions when they restart or are deleted.

Changes:

  • Add a new annotation key/value (rbg.workloads.x-k8s.io/restart-trigger-policy: "Ignore") and constants for it.
  • Gate Pod-event → RBG reconciliation enqueueing in the pod controller when the annotation is set to "Ignore".
  • Update failure-handling documentation and a customComponentsPattern example to describe/illustrate the feature.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File Description
internal/controller/workloads/pod_controller.go Skips enqueueing RBG reconcile on pod restart/delete when the new annotation is set to Ignore.
api/workloads/constants/annotation.go Defines the new annotation key and allowed values (Inherit / Ignore).
doc/features/failure-handling.md Documents excluding components from restart-policy triggers via the new annotation.
examples/basic/rbg/patterns/custom-components-pattern.yaml Adds explanatory comments and an example component annotated to not trigger restart policy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/controller/workloads/pod_controller.go Outdated
Comment thread internal/controller/workloads/pod_controller.go Outdated
Comment thread api/workloads/constants/annotation.go
Comment thread api/workloads/constants/annotation.go
Comment thread doc/features/failure-handling.md Outdated
Comment thread doc/features/failure-handling.md
Comment thread examples/basic/rbg/patterns/custom-components-pattern.yaml Outdated
Comment thread examples/basic/rbg/patterns/custom-components-pattern.yaml
Copy link
Copy Markdown
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annotation check only covers the RecreateRBGOnPodRestart path. The RecreateRoleInstanceOnPodRestart path (which is the default and more common policy) is not covered — see inline comment for details.

Comment thread internal/controller/workloads/pod_controller.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread internal/controller/workloads/pod_controller.go Outdated
Comment thread doc/features/failure-handling.md Outdated
Comment thread examples/basic/rbg/patterns/custom-components-pattern.yaml Outdated
@Syspretor Syspretor force-pushed the feat/allow-ignore-some-components-restart branch 3 times, most recently from 4d30611 to f50648a Compare May 20, 2026 06:11
@cheyang cheyang requested a review from Copilot May 20, 2026 08:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

pkg/reconciler/roleinstance/sync/instance_scale.go:287

  • containerRestarted only checks pod.Status.ContainerStatuses and ignores InitContainerStatuses (and potentially EphemeralContainerStatuses). If an init container restarts, the instance recreation policy won’t trigger even though the doc comment says “any container” and LWS semantics include init-container restarts. Consider checking init-container restart counts as well (and decide explicitly whether to include ephemeral containers).
// containerRestarted checks if any container in the pod has been restarted.
func containerRestarted(pod *v1.Pod) bool {
	for i := range pod.Status.ContainerStatuses {
		if pod.Status.ContainerStatuses[i].RestartCount > 0 {
			return true
		}
	}
	return false

Comment on lines +271 to +272
// Check if any container has restarted
if containerRestarted(p) {
Comment on lines +3 to +12
RBG supports multiple failure handling policies: `None` and `RecreateRoleInstanceOnPodRestart`.

![failure-handling](../img/failure-handling.png)

## Restart Policy Types

| Policy | Description |
|--------|-------------|
| `None` | No automatic restart action; rely on default pod restart behavior. |
| `RecreateRBGOnPodRestart` | Recreate the entire RoleBasedGroup when any pod in this role restarts. Useful for critical roles that require all pods to be healthy. |
| `RecreateRoleInstanceOnPodRestart` | Recreate only the affected role instance when a pod restarts. More granular control for less critical roles. **This is the default policy for all patterns** (standalonePattern, leaderWorkerPattern, customComponentsPattern). |
| `None` | No automatic restart action; rely on default pod restart behavior. Failed pods are replaced through normal reconciliation. |
| `RecreateRoleInstanceOnPodRestart` | Recreate only the affected role instance when a pod fails or a container restarts. **This is the default policy for all patterns** (standalonePattern, leaderWorkerPattern, customComponentsPattern). |
Comment on lines +142 to +152
// RestartTriggerPolicyAnnotationKey specifies whether a component's
// pod restart/failure events should trigger the role's restart policy
// (RecreateRoleInstanceOnPodRestart).
// Valid values are:
// - "Inherit" (or empty): Pod events from this component will follow the role's restart policy.
// - "Ignore": Pod events from this component will NOT trigger restart policy.
// This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose
// failures should not affect the main workload.
// Example: rbg.workloads.x-k8s.io/restart-trigger-policy: "Ignore"
RestartTriggerPolicyAnnotationKey = RBGPrefix + "restart-trigger-policy"
)
@cheyang cheyang self-requested a review May 20, 2026 09:06
玖宇 and others added 3 commits May 20, 2026 17:20
…licy

Add ComponentNoTriggerRestartPolicyAnnotationKey to allow specific components
in customComponentsPattern to be excluded from triggering the role's restart
policy. When this annotation is set to "true" on a component's pod template,
pod restart/delete events from this component will NOT trigger
RecreateRoleInstanceOnPodRestart or RecreateRBGOnPodRestart actions.

This is useful for auxiliary components (e.g., monitoring, logging sidecars)
whose failures should not affect the main workload.

Usage:
  template:
    metadata:
      annotations:
        rbg.workloads.x-k8s.io/component-no-trigger-restart: "true"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change the annotation from boolean format to policy enum format:
- Annotation key: rbg.workloads.x-k8s.io/restart-trigger-policy
- Valid values: "Inherit" (default) and "Ignore"

"Inherit" means the component follows the role's restart policy.
"Ignore" means the component's pod events will not trigger restart policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues to fix before merge:

  1. containerRestarted() needs DeletionTimestamp + phase guards to match the adjacent PodFailed check — without them, Succeeded pods with RestartCount > 0 and terminating pods incorrectly trigger instance recreation.

  2. Docs overstate pattern scope — the annotation only takes effect for customComponentsPattern and standalonePattern. LeaderWorkerPattern delegates restart handling to the LWS controller (lws_reconciler.go:286) which doesn't check this annotation.

The annotation design is solid, test coverage is comprehensive, and the previous review feedback about RecreateRoleInstanceOnPodRestart coverage has been properly addressed.

return true
}
// Check if any container has restarted
if containerRestarted(p) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

containerRestarted(p) runs on all pods including Succeeded ones (via the inactivePods list) and pods being deleted. The sibling PodFailed check above has a DeletionTimestamp == nil guard but this doesn't.

A Succeeded pod that had a container restart during its lifetime (e.g. init crash-loop before eventual completion) would incorrectly trigger instance recreation. Same for a pod mid-termination.

Suggest:

if p.DeletionTimestamp == nil && p.Status.Phase != v1.PodSucceeded && containerRestarted(p) {
    return true
}


## Excluding Components from Restart Policy Trigger

You can prevent specific pods from triggering the role's restart policy by setting an annotation on the pod template. This is useful for auxiliary components (e.g., monitoring, logging sidecars) whose failures should not affect the main workload. The annotation works with any deployment pattern (standalonePattern, leaderWorkerPattern, or customComponentsPattern).
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This claims the annotation works with "any deployment pattern" including LeaderWorkerPattern, but LeaderWorkerPattern maps RecreateRoleInstanceOnPodRestart to LWS RecreateGroupOnPodRestart and delegates entirely to the LWS controller (lws_reconciler.go:286). The LWS controller doesn't check this annotation.

Should be scoped to customComponentsPattern and standalonePattern.

玖宇 and others added 3 commits May 20, 2026 17:25
Re-implement the component-level restart-trigger-policy annotation after
rebase onto upstream/main which removed pod_controller.go. The logic now
lives in shouldRecreateInstance (instance_scale.go):

- Skip pods with restart-trigger-policy=Ignore when checking for Failed
  phase or container restart (RestartCount > 0)
- Add containerRestarted check for RoleInstanceSet-managed instances
- Add unit tests for hasTriggerPolicyIgnore, containerRestarted, and
  shouldRecreateInstance with annotation scenarios
- Add e2e tests for customComponentsPattern verifying ignored vs
  non-ignored component pod failure behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eferences

Address PR review comments:
- Remove all references to RecreateRBGOnPodRestart (removed in upstream sgl-project#340)
- Update annotation godoc to only mention RecreateRoleInstanceOnPodRestart
- Fix failure-handling.md to reflect current restart policy types
- Update example YAML annotation comment
- Fix "Use Cases" section to not list ignoreRestartPolicy as a policy enum

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Syspretor Syspretor force-pushed the feat/allow-ignore-some-components-restart branch from 726fdf2 to 150e59b Compare May 20, 2026 09:28
玖宇 and others added 3 commits May 21, 2026 10:25
…creations

When restartPolicy=RecreateRoleInstanceOnPodRestart triggers, newly
recreated pods may have transient RestartCount > 0, which would
re-trigger recreation in an infinite loop. Add a Restarting=True
condition that is set when recreation triggers and cleared when the
instance becomes Ready again, breaking the cascade.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
Replace the inline Restarting condition check (which relied on the
informer cache) with a defer-based guard in shouldRecreateInstanceGuarded.
The guard first checks a sync.Map cache (zero latency, immune to informer
lag), then falls back to a direct API server read via apiReader. This
eliminates the race window where informer cache lag could allow a
cascading restart to slip through.

Remove allPodsTerminating and isInstanceRestartingFromCache helpers that
are superseded by this approach.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Restarting condition was being unconditionally propagated from
updateInstance to newStatus on every reconcile. Since updateInstance
carries the persisted status (including Restarting=True from a previous
reconcile), this prevented setInstanceConditions from clearing it when
the instance became Ready.

Move the propagation inside the `if scaling` block so it is only carried
forward while pod deletion/creation is in progress. Once scaling
completes, the condition is no longer injected into newStatus, allowing
setInstanceConditions to clear it via its instanceReady guard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants