From ea3fdc6e6d3c8fc33da000bdcbd19a0bc67d5689 Mon Sep 17 00:00:00 2001 From: Wantong Jiang Date: Wed, 12 Mar 2025 20:46:33 +0000 Subject: [PATCH 1/5] fix rollout doc --- docs/howtos/crp.md | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/docs/howtos/crp.md b/docs/howtos/crp.md index 60d2f72e8..2cb7537e6 100644 --- a/docs/howtos/crp.md +++ b/docs/howtos/crp.md @@ -372,10 +372,11 @@ unavailable as Fleet dispatches updated resources. Clusters that are no longer s resulting in lost traffic. If too many new clusters are selected and Fleet places resources on them simultaneously, your backend may become overloaded. The exact interruption pattern may vary depending on the resources you place using Fleet. -To minimize interruption, Fleet allows users to configure the rollout strategy, similar to native Kubernetes deployment, -to transition between changes as smoothly as possible. Currently, Fleet supports only one rollout strategy: rolling update. +### Default rollout strategy: Rolling Update +To minimize interruption, Fleet allows users to configure the rollout strategy. +The default strategy is rolling update, and it applies to all changes you initiate. This strategy ensures changes, including the addition or removal of selected clusters and resource refreshes, -are applied incrementally in a phased manner at a pace suitable for you. This is the default option and applies to all changes you initiate. +are applied incrementally in a phased manner at a pace suitable for you, similar to native Kubernetes deployments. This rollout strategy can be configured with the following parameters: @@ -471,6 +472,23 @@ longer to complete the rollout, in accordance with the rolling update strategy y > to some clusters. You can identify this behavior if CRP status; for more information, see > [Understanding the Status of a `ClusterResourcePlacement`](crp-status.md) How-To Guide. 
+### External rollout strategy and staged update run
+
+Fleet supports flexible rollout patterns through an `External` rollout strategy, which allows you to implement custom rollout controllers.
+When this strategy is configured, Fleet delegates responsibility for resource placement to your external controller instead of using its built-in rolling update mechanism.
+
+One implementation of an external rollout strategy is the **Staged Update Run**.
+This approach enables a controlled, stage-by-stage placement of workload resources defined in a `ClusterResourcePlacement`.
+
+To use this strategy:
+1. Set `spec.strategy.type` to `External` in the `ClusterResourcePlacement` object.
+2. Define your rollout process using two custom resources:
+   - `ClusterStagedUpdateStrategy`: A reusable template that defines the rollout pattern.
+   - `ClusterStagedUpdateRun`: The resource that triggers and manages the actual rollout process.
+
+For comprehensive guidance on implementing staged updates, refer to the [Staged Update Run Concepts](../concepts/StagedUpdateRun/README.md)
+and the [Staged Update Run How-To Guide](updaterun.md).
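+
+As a minimal sketch of step 1, a `ClusterResourcePlacement` that opts out of the built-in rolling update might look like the following. The object name, the selected namespace, the `PickAll` policy, and the `v1beta1` API version are illustrative assumptions; only `spec.strategy.type: External` comes from the steps above.
+
+```yaml
+# Hypothetical example: the names, selector, and policy are placeholders.
+apiVersion: placement.kubernetes-fleet.io/v1beta1  # assumed API version
+kind: ClusterResourcePlacement
+metadata:
+  name: example-placement
+spec:
+  resourceSelectors:
+    - group: ""
+      kind: Namespace
+      version: v1
+      name: test-namespace
+  policy:
+    placementType: PickAll
+  strategy:
+    # Hand the rollout over to an external controller,
+    # for example a staged update run.
+    type: External
+```
+
+With this setting, clusters are still scheduled, but the rollout does not start until an external controller such as a staged update run drives it.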
+ ## Snapshots and revisions Internally, Fleet keeps a history of all the scheduling policies you have used with a From 27a1da5995e6307ce29578a1dc338458c319d021 Mon Sep 17 00:00:00 2001 From: Wantong Jiang Date: Thu, 13 Mar 2025 08:06:37 +0000 Subject: [PATCH 2/5] add updateRun tsg and fix rollout doc --- docs/concepts/StagedUpdateRun/README.md | 4 + docs/troubleshooting/README.md | 1 + .../clusterResourcePlacementRolloutStarted.md | 4 + docs/troubleshooting/updaterun.md | 454 ++++++++++++++++++ 4 files changed, 463 insertions(+) create mode 100644 docs/troubleshooting/updaterun.md diff --git a/docs/concepts/StagedUpdateRun/README.md b/docs/concepts/StagedUpdateRun/README.md index 32ed93b5c..4b3d5e9b8 100644 --- a/docs/concepts/StagedUpdateRun/README.md +++ b/docs/concepts/StagedUpdateRun/README.md @@ -119,3 +119,7 @@ This cluster-scoped resource requires three key parameters: the `placementName` An updateRun executes in two phases. During the initialization phase, the controller performs a one-time setup where it captures a snapshot of the updateRun strategy, collects scheduled and to-be-deleted `ClusterResourceBindings`, generates the cluster update sequence, and records all this information in the updateRun status. In the execution phase, the controller processes each stage sequentially, updates clusters within each stage one at a time, and enforces completion of after-stage tasks. It then executes a final delete stage to clean up resources from unscheduled clusters. The updateRun succeeds when all stages complete successfully. However, it will fail if any execution-affecting events occur, for example, the target ClusterResourcePlacement being deleted, and member cluster changes triggering new scheduling. In such cases, error details are recorded in the updateRun status. Remember that once initialized, an updateRun operates on its strategy snapshot, making it immune to subsequent strategy modifications. 
+ +## Next Steps +* Learn how to [rollout CRP resources with Staged Update Run](../../howtos/updaterun.md) +* Learn how to [troubleshoot a Staged Update Run](../../troubleshooting/updaterun.md) \ No newline at end of file diff --git a/docs/troubleshooting/README.md b/docs/troubleshooting/README.md index 62880efdc..0618fba29 100644 --- a/docs/troubleshooting/README.md +++ b/docs/troubleshooting/README.md @@ -26,6 +26,7 @@ The complete progression of `ClusterResourcePlacement` is as follows: - If this condition is false, refer to [How can I debug when my CRP status is ClusterResourcePlacementScheduled condition status is set to false?](./clusterResourcePlacementScheduled.md). 2. `ClusterResourcePlacementRolloutStarted`: Indicates the rollout process has begun. - If this condition is false refer to [How can I debug when my CRP status is ClusterResourcePlacementRolloutStarted condition status is set to false?](./clusterResourcePlacementRolloutStarted.md) + - If you are triggering a rollout with a staged update run, refer to [Staged Update Run Troubleshooting Guide](./updaterun.md). 3. `ClusterResourcePlacementOverridden`: Indicates the resource has been overridden. - If this condition is false, refer to [How can I debug when my CRP status is ClusterResourcePlacementOverridden condition status is set to false?](./clusterResourcePlacementOverridden.md) 4. `ClusterResourcePlacementWorkSynchronized`: Indicates the work objects have been synchronized. diff --git a/docs/troubleshooting/clusterResourcePlacementRolloutStarted.md b/docs/troubleshooting/clusterResourcePlacementRolloutStarted.md index 0592f0498..229507dd0 100644 --- a/docs/troubleshooting/clusterResourcePlacementRolloutStarted.md +++ b/docs/troubleshooting/clusterResourcePlacementRolloutStarted.md @@ -1,6 +1,10 @@ # How can I debug when my CRP status is ClusterResourcePlacementRolloutStarted condition status is set to false? 
When using the `ClusterResourcePlacement` API object in Azure Kubernetes Fleet Manager to propagate resources, the selected resources aren't rolled out in all scheduled clusters and the `ClusterResourcePlacementRolloutStarted` condition status shows as `False`. + +*This troubleshooting guide (TSG) applies only to the `RollingUpdate` rollout strategy, which is the default when no strategy is specified in the `ClusterResourcePlacement`.* +*To troubleshoot a staged update run, which is used when the strategy is set to `External` in the `ClusterResourcePlacement`, refer to the [Staged Update Run Troubleshooting Guide](updaterun.md).* + > Note: To get more information about why the rollout doesn't start, you can check the [rollout controller](https://github.com/Azure/fleet/blob/main/pkg/controllers/rollout/controller.go) to get more information on why the rollout did not start. ## Common scenarios: diff --git a/docs/troubleshooting/updaterun.md b/docs/troubleshooting/updaterun.md new file mode 100644 index 000000000..7081a95c6 --- /dev/null +++ b/docs/troubleshooting/updaterun.md @@ -0,0 +1,454 @@ +# Staged Update Run Troubleshooting Guide + +This guide provides troubleshooting steps for common issues related to Staged Update Run. + +## CRP status without Staged Update Run + +When a `ClusterResourcePlacement` is created with `spec.strategy.type` set to `External`, the rollout does not start immediately. + +A sample status of such a `ClusterResourcePlacement` is as follows: + +```bash +$ kubectl describe crp example-placement +... 
+Status: + Conditions: + Last Transition Time: 2025-03-12T23:01:32Z + Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s) + Observed Generation: 1 + Reason: SchedulingPolicyFulfilled + Status: True + Type: ClusterResourcePlacementScheduled + Last Transition Time: 2025-03-12T23:01:32Z + Message: There are still 2 cluster(s) in the process of deciding whether to roll out the latest resources or not + Observed Generation: 1 + Reason: RolloutStartedUnknown + Status: Unknown + Type: ClusterResourcePlacementRolloutStarted + Observed Resource Index: 0 + Placement Statuses: + Cluster Name: member1 + Conditions: + Last Transition Time: 2025-03-12T23:01:32Z + Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy + Observed Generation: 1 + Reason: Scheduled + Status: True + Type: Scheduled + Last Transition Time: 2025-03-12T23:01:32Z + Message: In the process of deciding whether to roll out the latest resources or not + Observed Generation: 1 + Reason: RolloutStartedUnknown + Status: Unknown + Type: RolloutStarted + Cluster Name: member2 + Conditions: + Last Transition Time: 2025-03-12T23:01:32Z + Message: Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy + Observed Generation: 1 + Reason: Scheduled + Status: True + Type: Scheduled + Last Transition Time: 2025-03-12T23:01:32Z + Message: In the process of deciding whether to roll out the latest resources or not + Observed Generation: 1 + Reason: RolloutStartedUnknown + Status: Unknown + Type: RolloutStarted + Selected Resources: + ... +Events: +``` + +`SchedulingPolicyFulfilled` condition indicates the CRP has been fully scheduled, while `RolloutStartedUnknown` condition shows that the rollout has not started. + +In the `Placement Statuses` section, it displays the detailed status of each cluster. 
Both selected clusters are in the `Scheduled` state, but the `RolloutStarted` condition is still `Unknown` because the rollout has not started yet. + +## Understand ClusterStagedUpdateRun status + +Let's take a closer look at the status of a completed `ClusterStagedUpdateRun`. It displays details about the rollout status of every cluster and stage. + +```bash +$ kubectl describe crsur example-run +... +Status: + Conditions: + Last Transition Time: 2025-03-12T23:21:39Z + Message: ClusterStagedUpdateRun initialized successfully + Observed Generation: 1 + Reason: UpdateRunInitializedSuccessfully + Status: True + Type: Initialized + Last Transition Time: 2025-03-12T23:21:39Z + Message: + Observed Generation: 1 + Reason: UpdateRunStarted + Status: True + Type: Progressing + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: UpdateRunSucceeded + Status: True + Type: Succeeded + Deletion Stage Status: + Clusters: + Conditions: + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingStarted + Status: True + Type: Progressing + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingSucceeded + Status: True + Type: Succeeded + End Time: 2025-03-12T23:26:15Z + Stage Name: kubernetes-fleet.io/deleteStage + Start Time: 2025-03-12T23:26:15Z + Policy Observed Cluster Count: 2 + Policy Snapshot Index Used: 0 + Staged Update Strategy Snapshot: + Stages: + After Stage Tasks: + Type: Approval + Wait Time: 0s + Type: TimedWait + Wait Time: 1m0s + Label Selector: + Match Labels: + Environment: staging + Name: staging + After Stage Tasks: + Type: Approval + Wait Time: 0s + Label Selector: + Match Labels: + Environment: canary + Name: canary + Sorting Label Key: name + After Stage Tasks: + Type: TimedWait + Wait Time: 1m0s + Type: Approval + Wait Time: 0s + Label Selector: + Match Labels: + Environment: production + Name: production + Sorting 
Label Key: order + Stages Status: + After Stage Task Status: + Approval Request Name: example-run-staging + Conditions: + Last Transition Time: 2025-03-12T23:21:54Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestCreated + Status: True + Type: ApprovalRequestCreated + Last Transition Time: 2025-03-12T23:22:55Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestApproved + Status: True + Type: ApprovalRequestApproved + Type: Approval + Conditions: + Last Transition Time: 2025-03-12T23:22:54Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskWaitTimeElapsed + Status: True + Type: WaitTimeElapsed + Type: TimedWait + Clusters: + Cluster Name: member1 + Conditions: + Last Transition Time: 2025-03-12T23:21:39Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingStarted + Status: True + Type: Started + Last Transition Time: 2025-03-12T23:21:54Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingSucceeded + Status: True + Type: Succeeded + Conditions: + Last Transition Time: 2025-03-12T23:21:54Z + Message: + Observed Generation: 1 + Reason: StageUpdatingWaiting + Status: False + Type: Progressing + Last Transition Time: 2025-03-12T23:22:55Z + Message: + Observed Generation: 1 + Reason: StageUpdatingSucceeded + Status: True + Type: Succeeded + End Time: 2025-03-12T23:22:55Z + Stage Name: staging + Start Time: 2025-03-12T23:21:39Z + After Stage Task Status: + Approval Request Name: example-run-canary + Conditions: + Last Transition Time: 2025-03-12T23:23:10Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestCreated + Status: True + Type: ApprovalRequestCreated + Last Transition Time: 2025-03-12T23:25:15Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestApproved + Status: True + Type: ApprovalRequestApproved + Type: Approval + Clusters: + Cluster Name: member2 + Conditions: + Last Transition Time: 2025-03-12T23:22:55Z + Message: + Observed 
Generation: 1 + Reason: ClusterUpdatingStarted + Status: True + Type: Started + Last Transition Time: 2025-03-12T23:23:10Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingSucceeded + Status: True + Type: Succeeded + Conditions: + Last Transition Time: 2025-03-12T23:23:10Z + Message: + Observed Generation: 1 + Reason: StageUpdatingWaiting + Status: False + Type: Progressing + Last Transition Time: 2025-03-12T23:25:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingSucceeded + Status: True + Type: Succeeded + End Time: 2025-03-12T23:25:15Z + Stage Name: canary + Start Time: 2025-03-12T23:22:55Z + After Stage Task Status: + Conditions: + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskWaitTimeElapsed + Status: True + Type: WaitTimeElapsed + Type: TimedWait + Approval Request Name: example-run-production + Conditions: + Last Transition Time: 2025-03-12T23:25:15Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestCreated + Status: True + Type: ApprovalRequestCreated + Last Transition Time: 2025-03-12T23:25:25Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestApproved + Status: True + Type: ApprovalRequestApproved + Type: Approval + Clusters: + Conditions: + Last Transition Time: 2025-03-12T23:25:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingWaiting + Status: False + Type: Progressing + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingSucceeded + Status: True + Type: Succeeded + End Time: 2025-03-12T23:26:15Z + Stage Name: production +Events: +``` + +### UpdateRun overall status + +At the very top, `Status.Conditions` gives the overall status of the updateRun. The execution of an updateRun consists of two phases: initialization and execution. 
+During initialization, the controller performs a one-time setup where it captures a snapshot of the updateRun strategy, collects scheduled and to-be-deleted `ClusterResourceBindings`,
+generates the cluster update sequence, and records all this information in the updateRun status.
+The `Initialized` condition with reason `UpdateRunInitializedSuccessfully` indicates that initialization succeeded.
+
+After initialization, the controller starts executing the updateRun. The `Progressing` condition with reason `UpdateRunStarted` indicates that execution has started.
+
+After all clusters are updated and all after-stage tasks are completed, all stages are finished and the `Succeeded` condition is set to `True` with reason `UpdateRunSucceeded`, indicating that the updateRun has succeeded.
+
+### Fields recorded in the updateRun status during initialization
+
+During initialization, the controller records the following fields in the updateRun status:
+- `PolicySnapshotIndexUsed`: the index of the policy snapshot used for the updateRun. It should be the latest one.
+- `PolicyObservedClusterCount`: the number of clusters selected by the scheduling policy.
+- `StagedUpdateStrategySnapshot`: the snapshot of the updateRun strategy, which ensures that later strategy changes do not affect updateRuns that are already executing.
+
+### Stages and clusters status
+
+The `Stages Status` section displays the status of each stage and cluster. As shown in the strategy snapshot, the updateRun has three stages: `staging`, `canary`, and `production`. During initialization, the controller generates the rollout plan, classifies the scheduled clusters
+into these three stages, and records the plan in the updateRun status. As the execution progresses, the controller updates the status of each stage and cluster. Take the `staging` stage as an example: `member1` is included in this stage. The `Started` condition with reason `ClusterUpdatingStarted` indicates that the cluster update has started, and the `Succeeded` condition with reason `ClusterUpdatingSucceeded` shows that the cluster was updated successfully.
+
+After all clusters in a stage are updated, the controller executes the specified after-stage tasks. Stage `staging` has two after-stage tasks: `Approval` and `TimedWait`. The `Approval` task requires the admin to manually approve a `ClusterApprovalRequest` generated by the controller. The name of the `ClusterApprovalRequest`, here `example-run-staging`, is also included in the status. The `ApprovalRequestCreated` condition (reason `AfterStageTaskApprovalRequestCreated`) indicates that the approval request has been created, and the `ApprovalRequestApproved` condition (reason `AfterStageTaskApprovalRequestApproved`) indicates that it has been approved. The `TimedWait` task suspends the rollout until the specified wait time has elapsed; in this case, the wait time is 1 minute. The `WaitTimeElapsed` condition (reason `AfterStageTaskWaitTimeElapsed`) indicates that the wait time has elapsed and the rollout can proceed to the next stage.
+
+Each stage also has its own conditions. When a stage starts, the `Progressing` condition is set to `True`. When all the cluster updates complete, the `Progressing` condition is set to `False` with reason `StageUpdatingWaiting`, as shown above. This means the stage is waiting for
+after-stage tasks to pass.
+The `lastTransitionTime` of the `Progressing` condition therefore also serves as the start time of the wait when there is a `TimedWait` task. When all after-stage tasks pass, the `Succeeded` condition is set to `True`. Each stage status also has `Start Time` and `End Time` fields, which make the timeline easier to read.
+
+There is also a `Deletion Stage Status` section, which displays the status of the deletion stage. The deletion stage is the last stage of the updateRun. It deletes resources from the unscheduled clusters. Its status is much the same as that of a normal update stage, except that there are no after-stage tasks.
+
+Note that every condition has its `lastTransitionTime` set to the time when the controller updated the status, which helps when debugging and checking
+the progress of the updateRun.
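+
+For reference, a `ClusterStagedUpdateStrategy` that would yield the strategy snapshot shown in the status above could look roughly like the following. This is a sketch reconstructed from that snapshot, not the exact manifest used for the example: the strategy name `example-strategy` and the `v1beta1` API version are assumptions, and the `Wait Time: 0s` printed for `Approval` tasks is simply left unset here.
+
+```yaml
+# Hypothetical manifest reconstructed from the status snapshot above.
+apiVersion: placement.kubernetes-fleet.io/v1beta1  # assumed API version
+kind: ClusterStagedUpdateStrategy
+metadata:
+  name: example-strategy  # assumed name
+spec:
+  stages:
+    - name: staging
+      labelSelector:
+        matchLabels:
+          environment: staging
+      afterStageTasks:
+        - type: Approval
+        - type: TimedWait
+          waitTime: 1m
+    - name: canary
+      labelSelector:
+        matchLabels:
+          environment: canary
+      sortingLabelKey: name
+      afterStageTasks:
+        - type: Approval
+    - name: production
+      labelSelector:
+        matchLabels:
+          environment: production
+      sortingLabelKey: order
+      afterStageTasks:
+        - type: TimedWait
+          waitTime: 1m
+        - type: Approval
+```
+
+Because the updateRun works from its own snapshot, editing a strategy like this one after initialization does not change a run that is already executing.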
+
+## Investigate ClusterStagedUpdateRun initialization failure
+
+An updateRun initialization failure can be detected by getting the resource:
+```bash
+$ kubectl get crsur example-run
+NAME          PLACEMENT           RESOURCE-SNAPSHOT-INDEX   POLICY-SNAPSHOT-INDEX   INITIALIZED   SUCCEEDED   AGE
+example-run   example-placement   1                         0                       False                     2s
+```
+The `INITIALIZED` field is `False`, indicating that initialization failed.
+
+Describe the updateRun to get more details:
+```bash
+$ kubectl describe crsur example-run
+...
+Status:
+  Conditions:
+    Last Transition Time:  2025-03-13T07:28:29Z
+    Message:               cannot continue the ClusterStagedUpdateRun: failed to initialize the clusterStagedUpdateRun: failed to process the request due to a client error: no clusterResourceSnapshots with index `1` found for clusterResourcePlacement `example-placement`
+    Observed Generation:   1
+    Reason:                UpdateRunInitializedFailed
+    Status:                False
+    Type:                  Initialized
+  Deletion Stage Status:
+    Clusters:
+    Stage Name:  kubernetes-fleet.io/deleteStage
+  Policy Observed Cluster Count:  2
+  Policy Snapshot Index Used:     0
+...
+```
+The `Initialized` condition is `False`, and its message gives more details about the failure.
+In this case, the updateRun references a nonexistent resource snapshot index, `1`.
+
+## Investigate ClusterStagedUpdateRun rollout stuck
+
+A `ClusterStagedUpdateRun` can get stuck when resource placement fails on some clusters. Describing the updateRun shows that a cluster has the `ClusterUpdatingStarted` reason but never reaches `ClusterUpdatingSucceeded`:
+```bash
+$ date; kubectl describe crsur example-run
+Thu Mar 13 07:44:28 UTC 2025
+...
+Stages Status: + After Stage Task Status: + Approval Request Name: example-run-staging + Type: Approval + Type: TimedWait + Clusters: + Cluster Name: member1 + Conditions: + Last Transition Time: 2025-03-13T07:37:36Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingStarted + Status: True + Type: Started + Conditions: + Last Transition Time: 2025-03-13T07:37:36Z + Message: + Observed Generation: 1 + Reason: StageUpdatingStarted + Status: True + Type: Progressing + Stage Name: staging + Start Time: 2025-03-13T07:37:36Z +``` + +As you can see, the cluster updating has started about 7 minutes ago, but the `ClusterUpdatingSucceeded` condition is still not set yet. +This usually indicates something wrong happened on the cluster. To further investigate, you can check the `ClusterResourcePlacement` status: +```bash +$ kubectl describe crp example-placement +... +Placement Statuses: + Cluster Name: member1 + Conditions: + Last Transition Time: 2025-03-12T23:01:32Z + Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy + Observed Generation: 1 + Reason: Scheduled + Status: True + Type: Scheduled + Last Transition Time: 2025-03-13T07:37:36Z + Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 3, clusterStagedUpdateRun: example-run + Observed Generation: 1 + Reason: RolloutStarted + Status: True + Type: RolloutStarted + Last Transition Time: 2025-03-13T07:37:36Z + Message: No override rules are configured for the selected resources + Observed Generation: 1 + Reason: NoOverrideSpecified + Status: True + Type: Overridden + Last Transition Time: 2025-03-13T07:37:36Z + Message: All of the works are synchronized to the latest + Observed Generation: 1 + Reason: AllWorkSynced + Status: True + Type: WorkSynchronized + Last Transition Time: 2025-03-13T07:37:39Z + Message: Work object example-placement-work has failed to apply + 
Observed Generation: 1 + Reason: NotAllWorkHaveBeenApplied + Status: False + Type: Applied + Failed Placements: + Condition: + Last Transition Time: 2025-03-13T07:37:36Z + Message: Manifest is trackable but not available yet + Observed Generation: 1 + Reason: ManifestNotAvailableYet + Status: False + Type: Available + Group: apps + Kind: Deployment + Name: nginx + Namespace: test-namespace + Version: v1 + Condition: + Last Transition Time: 2025-03-13T07:37:39Z + Message: Failed to apply manifest: failed to process the request due to a client error: resource exists and is not managed by the fleet controller and co-ownernship is disallowed + Reason: ManifestsAlreadyOwnedByOthers + Status: False + Type: Applied + Group: apps + Kind: ReplicaSet + Name: nginx-7b6df7c758 + Namespace: test-namespace + Version: v1 +... +``` + +The `Applied` condition is `False`, which indicates that not all work objects have been applied. The `Failed Placements` section shows the detailed failures. For more debugging instructions, refer to the [CRP troubleshooting guide](./README.md). + +After resolving the issue, you can always create a new updateRun to restart the rollout. Stuck updateRuns can be deleted. \ No newline at end of file From e290d893a33c4f1d4cfff3b8e5a282f344025a8c Mon Sep 17 00:00:00 2001 From: Wantong Jiang Date: Thu, 13 Mar 2025 23:50:29 +0000 Subject: [PATCH 3/5] fix easy comments --- docs/howtos/crp.md | 3 ++- docs/howtos/updaterun.md | 2 -- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/howtos/crp.md b/docs/howtos/crp.md index 2cb7537e6..ea720cee8 100644 --- a/docs/howtos/crp.md +++ b/docs/howtos/crp.md @@ -371,9 +371,10 @@ Most outcomes can lead to service interruptions. Apps running on member clusters unavailable as Fleet dispatches updated resources. Clusters that are no longer selected will lose all placed resources, resulting in lost traffic. 
If too many new clusters are selected and Fleet places resources on them simultaneously, your backend may become overloaded. The exact interruption pattern may vary depending on the resources you place using Fleet. +To minimize interruption, Fleet allows users to configure the rollout strategy. There are two types of rollout strategies we currently support. ### Default rollout strategy: Rolling Update -To minimize interruption, Fleet allows users to configure the rollout strategy. + The default strategy is rolling update, and it applies to all changes you initiate. This strategy ensures changes, including the addition or removal of selected clusters and resource refreshes, are applied incrementally in a phased manner at a pace suitable for you, similar to native Kubernetes deployments. diff --git a/docs/howtos/updaterun.md b/docs/howtos/updaterun.md index f66e9465c..5b5ee6bf8 100644 --- a/docs/howtos/updaterun.md +++ b/docs/howtos/updaterun.md @@ -208,7 +208,6 @@ status: name: staging - afterStageTasks: - type: Approval - waitTime: 1h0m0s labelSelector: matchLabels: environment: canary @@ -407,7 +406,6 @@ status: name: staging - afterStageTasks: - type: Approval - waitTime: 1h0m0s labelSelector: matchLabels: environment: canary From abcb653e1795dceef7034e1fe402268aa6578b85 Mon Sep 17 00:00:00 2001 From: Wantong Jiang Date: Mon, 17 Mar 2025 21:27:56 +0000 Subject: [PATCH 4/5] fix comments --- docs/concepts/StagedUpdateRun/README.md | 437 +++++++++++++++++++++++- docs/troubleshooting/updaterun.md | 256 -------------- 2 files changed, 435 insertions(+), 258 deletions(-) diff --git a/docs/concepts/StagedUpdateRun/README.md b/docs/concepts/StagedUpdateRun/README.md index 4b3d5e9b8..391918a22 100644 --- a/docs/concepts/StagedUpdateRun/README.md +++ b/docs/concepts/StagedUpdateRun/README.md @@ -120,6 +120,439 @@ An updateRun executes in two phases. 
During the initialization phase, the contro In the execution phase, the controller processes each stage sequentially, updates clusters within each stage one at a time, and enforces completion of after-stage tasks. It then executes a final delete stage to clean up resources from unscheduled clusters. The updateRun succeeds when all stages complete successfully. However, it will fail if any execution-affecting events occur, for example, the target ClusterResourcePlacement being deleted, and member cluster changes triggering new scheduling. In such cases, error details are recorded in the updateRun status. Remember that once initialized, an updateRun operates on its strategy snapshot, making it immune to subsequent strategy modifications. +## Understand ClusterStagedUpdateRun status + +Let's take a closer look at the status of a completed `ClusterStagedUpdateRun`. It displays details about the rollout status of every cluster and stage. + +```bash +$ kubectl describe crsur example-run +... 
+Status: + Conditions: + Last Transition Time: 2025-03-12T23:21:39Z + Message: ClusterStagedUpdateRun initialized successfully + Observed Generation: 1 + Reason: UpdateRunInitializedSuccessfully + Status: True + Type: Initialized + Last Transition Time: 2025-03-12T23:21:39Z + Message: + Observed Generation: 1 + Reason: UpdateRunStarted + Status: True + Type: Progressing + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: UpdateRunSucceeded + Status: True + Type: Succeeded + Deletion Stage Status: + Clusters: + Conditions: + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingStarted + Status: True + Type: Progressing + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingSucceeded + Status: True + Type: Succeeded + End Time: 2025-03-12T23:26:15Z + Stage Name: kubernetes-fleet.io/deleteStage + Start Time: 2025-03-12T23:26:15Z + Policy Observed Cluster Count: 2 + Policy Snapshot Index Used: 0 + Staged Update Strategy Snapshot: + Stages: + After Stage Tasks: + Type: Approval + Wait Time: 0s + Type: TimedWait + Wait Time: 1m0s + Label Selector: + Match Labels: + Environment: staging + Name: staging + After Stage Tasks: + Type: Approval + Wait Time: 0s + Label Selector: + Match Labels: + Environment: canary + Name: canary + Sorting Label Key: name + After Stage Tasks: + Type: TimedWait + Wait Time: 1m0s + Type: Approval + Wait Time: 0s + Label Selector: + Match Labels: + Environment: production + Name: production + Sorting Label Key: order + Stages Status: + After Stage Task Status: + Approval Request Name: example-run-staging + Conditions: + Last Transition Time: 2025-03-12T23:21:54Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestCreated + Status: True + Type: ApprovalRequestCreated + Last Transition Time: 2025-03-12T23:22:55Z + Message: + Observed Generation: 1 + Reason: 
AfterStageTaskApprovalRequestApproved + Status: True + Type: ApprovalRequestApproved + Type: Approval + Conditions: + Last Transition Time: 2025-03-12T23:22:54Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskWaitTimeElapsed + Status: True + Type: WaitTimeElapsed + Type: TimedWait + Clusters: + Cluster Name: member1 + Conditions: + Last Transition Time: 2025-03-12T23:21:39Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingStarted + Status: True + Type: Started + Last Transition Time: 2025-03-12T23:21:54Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingSucceeded + Status: True + Type: Succeeded + Conditions: + Last Transition Time: 2025-03-12T23:21:54Z + Message: + Observed Generation: 1 + Reason: StageUpdatingWaiting + Status: False + Type: Progressing + Last Transition Time: 2025-03-12T23:22:55Z + Message: + Observed Generation: 1 + Reason: StageUpdatingSucceeded + Status: True + Type: Succeeded + End Time: 2025-03-12T23:22:55Z + Stage Name: staging + Start Time: 2025-03-12T23:21:39Z + After Stage Task Status: + Approval Request Name: example-run-canary + Conditions: + Last Transition Time: 2025-03-12T23:23:10Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestCreated + Status: True + Type: ApprovalRequestCreated + Last Transition Time: 2025-03-12T23:25:15Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestApproved + Status: True + Type: ApprovalRequestApproved + Type: Approval + Clusters: + Cluster Name: member2 + Conditions: + Last Transition Time: 2025-03-12T23:22:55Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingStarted + Status: True + Type: Started + Last Transition Time: 2025-03-12T23:23:10Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingSucceeded + Status: True + Type: Succeeded + Conditions: + Last Transition Time: 2025-03-12T23:23:10Z + Message: + Observed Generation: 1 + Reason: StageUpdatingWaiting + Status: False + Type: Progressing + 
Last Transition Time: 2025-03-12T23:25:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingSucceeded + Status: True + Type: Succeeded + End Time: 2025-03-12T23:25:15Z + Stage Name: canary + Start Time: 2025-03-12T23:22:55Z + After Stage Task Status: + Conditions: + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskWaitTimeElapsed + Status: True + Type: WaitTimeElapsed + Type: TimedWait + Approval Request Name: example-run-production + Conditions: + Last Transition Time: 2025-03-12T23:25:15Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestCreated + Status: True + Type: ApprovalRequestCreated + Last Transition Time: 2025-03-12T23:25:25Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestApproved + Status: True + Type: ApprovalRequestApproved + Type: Approval + Clusters: + Conditions: + Last Transition Time: 2025-03-12T23:25:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingWaiting + Status: False + Type: Progressing + Last Transition Time: 2025-03-12T23:26:15Z + Message: + Observed Generation: 1 + Reason: StageUpdatingSucceeded + Status: True + Type: Succeeded + End Time: 2025-03-12T23:26:15Z + Stage Name: production +Events: +``` + +### UpdateRun overall status + +At the very top, `Status.Conditions` gives the overall status of the updateRun. The execution of an update run consists of two phases: initialization and execution. +During initialization, the controller performs a one-time setup where it captures a snapshot of the updateRun strategy, collects scheduled and to-be-deleted `ClusterResourceBindings`, +generates the cluster update sequence, and records all this information in the updateRun status. +The `Initialized` condition with reason `UpdateRunInitializedSuccessfully` indicates the initialization succeeded. + +After initialization, the controller starts executing the updateRun. The `Progressing` condition with reason `UpdateRunStarted` indicates the execution has started. 
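The top-level conditions can also be read from a script. Below is a minimal sketch, not part of Fleet: `condition_status` and `updaterun_phase` are hypothetical helper names, and the phase labels are chosen here purely for illustration. It assumes the `crsur` short name and the `Initialized`, `Progressing`, and `Succeeded` condition types that appear in the sample output above.

```bash
# Hypothetical helper: print the status ("True", "False", "Unknown", or empty)
# of one top-level condition of a ClusterStagedUpdateRun.
# Usage: condition_status <updateRun name> <condition type>
condition_status() {
  kubectl get crsur "$1" \
    -o jsonpath="{.status.conditions[?(@.type==\"$2\")].status}"
}

# Hypothetical helper: map the three condition statuses to a coarse phase label.
# Usage: updaterun_phase <Initialized status> <Progressing status> <Succeeded status>
updaterun_phase() {
  if [ "$3" = "True" ]; then
    echo "Succeeded"        # the Succeeded condition is True
  elif [ "$1" != "True" ]; then
    echo "NotInitialized"   # initialization has not succeeded (yet)
  elif [ "$2" = "True" ]; then
    echo "Executing"        # initialized and currently progressing
  else
    echo "Unknown"          # inspect the full status with kubectl describe
  fi
}
```

For the completed run shown above, `updaterun_phase "$(condition_status example-run Initialized)" "$(condition_status example-run Progressing)" "$(condition_status example-run Succeeded)"` should report `Succeeded`, since all three conditions are `True`. The labels are deliberately coarse; the per-stage and per-cluster conditions still need to be read from the full status.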
+ +Once all clusters are updated and all after-stage tasks are completed, all stages are finished and the `Succeeded` condition is set to `True` with reason `UpdateRunSucceeded`, indicating the updateRun has succeeded. + +### Fields recorded in the updateRun status during initialization + +During initialization, the controller records the following fields in the updateRun status: +- `PolicySnapshotIndexUsed`: the index of the policy snapshot used for the updateRun; it should be the latest one. +- `PolicyObservedClusterCount`: the number of clusters selected by the scheduling policy. +- `StagedUpdateStrategySnapshot`: the snapshot of the updateRun strategy, which ensures that later strategy changes do not affect updateRuns that are already executing. + +### Stages and clusters status + +The `Stages Status` section displays the status of each stage and cluster. As shown in the strategy snapshot, the updateRun has three stages: `staging`, `canary`, and `production`. During initialization, the controller generates the rollout plan, classifies the scheduled clusters +into these three stages, and records the plan in the updateRun status. As the execution progresses, the controller updates the status of each stage and cluster. Take the `staging` stage as an example: `member1` is included in this stage. The `Started` condition with reason `ClusterUpdatingStarted` indicates the cluster is being updated, and the `Succeeded` condition with reason `ClusterUpdatingSucceeded` shows the cluster has been updated successfully. + +After all clusters are updated in a stage, the controller executes the specified after-stage tasks. Stage `staging` has two after-stage tasks: `Approval` and `TimedWait`. The `Approval` task requires the admin to manually approve a `ClusterApprovalRequest` generated by the controller. The name of the `ClusterApprovalRequest`, `example-run-staging`, is also included in the status. 
The `AfterStageTaskApprovalRequestCreated` reason indicates the approval request has been created, and the `AfterStageTaskApprovalRequestApproved` reason indicates it has been approved. The `TimedWait` task suspends the rollout until the specified wait time has elapsed; in this case, the wait time is 1 minute. The `AfterStageTaskWaitTimeElapsed` reason indicates the wait time has elapsed and the rollout can proceed to the next stage. + +Each stage also has its own conditions. When a stage starts, the `Progressing` condition is set to `True`. When all the cluster updates complete, the `Progressing` condition is set to `False` with reason `StageUpdatingWaiting` as shown above. This means the stage is waiting for +after-stage tasks to pass. +The `lastTransitionTime` of the `Progressing` condition therefore also serves as the start time of the wait when the stage has a `TimedWait` task. When all after-stage tasks pass, the `Succeeded` condition is set to `True`. Each stage status also has `Start Time` and `End Time` fields, which make the timeline easier to read. + +There's also a `Deletion Stage Status` section, which displays the status of the deletion stage. The deletion stage is the last stage of the updateRun. It deletes resources from the unscheduled clusters. Its status is largely the same as that of a normal update stage, except that there are no after-stage tasks. + +Note that all these conditions have `lastTransitionTime` set to the time when the controller updates the status. This helps when debugging and checking +the progress of the updateRun. + +## Relationship between ClusterStagedUpdateRun and ClusterResourcePlacement + +A `ClusterStagedUpdateRun` serves as the trigger mechanism for rolling out a `ClusterResourcePlacement`. The key points of this relationship are: +* The `ClusterResourcePlacement` remains in a scheduled state without being deployed until a corresponding `ClusterStagedUpdateRun` is created. 
+* During rollout, the `ClusterResourcePlacement` status is continuously updated with detailed information from each target cluster. +* While a `ClusterStagedUpdateRun` only indicates whether updates have started and completed for each member cluster (as described in the [previous section](#understand-clusterstagedupdaterun-status)), the `ClusterResourcePlacement` provides comprehensive details including: + * Success/failure of resource creation + * Application of overrides + * Specific error messages + +For example, below is the status of an in-progress `ClusterStagedUpdateRun`: +```bash +kubectl describe crsur example-run +Name: example-run +... +Status: + Conditions: + Last Transition Time: 2025-03-17T21:37:14Z + Message: ClusterStagedUpdateRun initialized successfully + Observed Generation: 1 + Reason: UpdateRunInitializedSuccessfully + Status: True + Type: Initialized + Last Transition Time: 2025-03-17T21:37:14Z + Message: + Observed Generation: 1 + Reason: UpdateRunStarted # updateRun started + Status: True + Type: Progressing +... 
+ Stages Status: + After Stage Task Status: + Approval Request Name: example-run-staging + Conditions: + Last Transition Time: 2025-03-17T21:37:29Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskApprovalRequestCreated + Status: True + Type: ApprovalRequestCreated + Type: Approval + Conditions: + Last Transition Time: 2025-03-17T21:38:29Z + Message: + Observed Generation: 1 + Reason: AfterStageTaskWaitTimeElapsed + Status: True + Type: WaitTimeElapsed + Type: TimedWait + Clusters: + Cluster Name: member1 + Conditions: + Last Transition Time: 2025-03-17T21:37:14Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingStarted + Status: True + Type: Started + Last Transition Time: 2025-03-17T21:37:29Z + Message: + Observed Generation: 1 + Reason: ClusterUpdatingSucceeded # member1 has updated successfully + Status: True + Type: Succeeded + Conditions: + Last Transition Time: 2025-03-17T21:37:29Z + Message: + Observed Generation: 1 + Reason: StageUpdatingWaiting # waiting for approval + Status: False + Type: Progressing + Stage Name: staging + Start Time: 2025-03-17T21:37:14Z + After Stage Task Status: + Approval Request Name: example-run-canary + Type: Approval + Clusters: + Cluster Name: member2 + Stage Name: canary + After Stage Task Status: + Type: TimedWait + Approval Request Name: example-run-production + Type: Approval + Clusters: + Stage Name: production +... +``` +In the status above, `member1` from stage `staging` has been updated successfully, and the stage is waiting for approval before the rollout proceeds to the next stage. `member2` from stage `canary` has not been updated yet. + +Let's take a look at the status of the `ClusterResourcePlacement` `example-placement`: +```bash +kubectl describe crp example-placement +Name: example-placement +... 
+Status: + Conditions: + Last Transition Time: 2025-03-12T23:01:32Z + Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s) + Observed Generation: 1 + Reason: SchedulingPolicyFulfilled + Status: True + Type: ClusterResourcePlacementScheduled + Last Transition Time: 2025-03-13T07:35:25Z + Message: There are still 1 cluster(s) in the process of deciding whether to roll out the latest resources or not + Observed Generation: 1 + Reason: RolloutStartedUnknown + Status: Unknown + Type: ClusterResourcePlacementRolloutStarted + Observed Resource Index: 5 + Placement Statuses: + Cluster Name: member1 + Conditions: + Last Transition Time: 2025-03-12T23:01:32Z + Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy + Observed Generation: 1 + Reason: Scheduled + Status: True + Type: Scheduled + Last Transition Time: 2025-03-17T21:37:14Z + Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 5, clusterStagedUpdateRun: example-run + Observed Generation: 1 + Reason: RolloutStarted + Status: True + Type: RolloutStarted + Last Transition Time: 2025-03-17T21:37:14Z + Message: No override rules are configured for the selected resources + Observed Generation: 1 + Reason: NoOverrideSpecified + Status: True + Type: Overridden + Last Transition Time: 2025-03-17T21:37:14Z + Message: All of the works are synchronized to the latest + Observed Generation: 1 + Reason: AllWorkSynced + Status: True + Type: WorkSynchronized + Last Transition Time: 2025-03-17T21:37:14Z + Message: All corresponding work objects are applied + Observed Generation: 1 + Reason: AllWorkHaveBeenApplied + Status: True + Type: Applied + Last Transition Time: 2025-03-17T21:37:14Z + Message: All corresponding work objects are available + Observed Generation: 1 + Reason: AllWorkAreAvailable # member1 is all good + Status: True + Type: Available + 
Cluster Name: member2 + Conditions: + Last Transition Time: 2025-03-12T23:01:32Z + Message: Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy + Observed Generation: 1 + Reason: Scheduled + Status: True + Type: Scheduled + Last Transition Time: 2025-03-13T07:35:25Z + Message: In the process of deciding whether to roll out the latest resources or not + Observed Generation: 1 + Reason: RolloutStartedUnknown # member2 is not updated yet + Status: Unknown + Type: RolloutStarted +... +``` +The `Placement Statuses` section shows the status of each member cluster. For `member1`, the `RolloutStarted` condition is set to `True`, indicating the rollout has started. +The condition message includes the `ClusterStagedUpdateRun` name, `example-run`, which indicates the most recent cluster update was triggered by `example-run`. +The status also shows the detailed update progress: the works are synced, applied, and detected as available. By comparison, `member2` is still only in the `Scheduled` state. + +When troubleshooting a stalled updateRun, examining the `ClusterResourcePlacement` status offers valuable diagnostic information that can help identify the root cause. +For comprehensive troubleshooting steps, refer to the [troubleshooting guide](../../troubleshooting/updaterun.md). + +## Concurrent updateRuns + +Multiple concurrent `ClusterStagedUpdateRun`s can be created for the same `ClusterResourcePlacement`, allowing fleet administrators to pipeline the rollout of different resource versions. However, to maintain consistency across the fleet and prevent member clusters from running different resource versions simultaneously, we enforce a key constraint: all concurrent `ClusterStagedUpdateRun`s must use identical `ClusterStagedUpdateStrategy` settings. + +This strategy consistency requirement is validated during the initialization phase of each updateRun. 
This validation ensures predictable rollout behavior and prevents configuration drift across your cluster fleet, even when multiple updates are in progress. + ## Next Steps -* Learn how to [rollout CRP resources with Staged Update Run](../../howtos/updaterun.md) -* Learn how to [troubleshoot a Staged Update Run](../../troubleshooting/updaterun.md) \ No newline at end of file +* Learn how to [roll out and roll back CRP resources with Staged Update Run](../../howtos/updaterun.md) +* Learn how to [troubleshoot a Staged Update Run](../../troubleshooting/updaterun.md) diff --git a/docs/troubleshooting/updaterun.md b/docs/troubleshooting/updaterun.md index 7081a95c6..43535f120 100644 --- a/docs/troubleshooting/updaterun.md +++ b/docs/troubleshooting/updaterun.md @@ -64,262 +64,6 @@ Events: In the `Placement Statuses` section, it displays the detailed status of each cluster. Both selected clusters are in the `Scheduled` state, but the `RolloutStarted` condition is still `Unknown` because the rollout has not kicked off yet. -## Understand ClusterStagedUpdateRun status - -Let's take a deep look into the status of a completed `ClusterStagedUpdateRun`. It displays details about the rollout status for every clusters and stages. - -```bash -$ kubectl describe crsur run example-run -... 
-Status: - Conditions: - Last Transition Time: 2025-03-12T23:21:39Z - Message: ClusterStagedUpdateRun initialized successfully - Observed Generation: 1 - Reason: UpdateRunInitializedSuccessfully - Status: True - Type: Initialized - Last Transition Time: 2025-03-12T23:21:39Z - Message: - Observed Generation: 1 - Reason: UpdateRunStarted - Status: True - Type: Progressing - Last Transition Time: 2025-03-12T23:26:15Z - Message: - Observed Generation: 1 - Reason: UpdateRunSucceeded - Status: True - Type: Succeeded - Deletion Stage Status: - Clusters: - Conditions: - Last Transition Time: 2025-03-12T23:26:15Z - Message: - Observed Generation: 1 - Reason: StageUpdatingStarted - Status: True - Type: Progressing - Last Transition Time: 2025-03-12T23:26:15Z - Message: - Observed Generation: 1 - Reason: StageUpdatingSucceeded - Status: True - Type: Succeeded - End Time: 2025-03-12T23:26:15Z - Stage Name: kubernetes-fleet.io/deleteStage - Start Time: 2025-03-12T23:26:15Z - Policy Observed Cluster Count: 2 - Policy Snapshot Index Used: 0 - Staged Update Strategy Snapshot: - Stages: - After Stage Tasks: - Type: Approval - Wait Time: 0s - Type: TimedWait - Wait Time: 1m0s - Label Selector: - Match Labels: - Environment: staging - Name: staging - After Stage Tasks: - Type: Approval - Wait Time: 0s - Label Selector: - Match Labels: - Environment: canary - Name: canary - Sorting Label Key: name - After Stage Tasks: - Type: TimedWait - Wait Time: 1m0s - Type: Approval - Wait Time: 0s - Label Selector: - Match Labels: - Environment: production - Name: production - Sorting Label Key: order - Stages Status: - After Stage Task Status: - Approval Request Name: example-run-staging - Conditions: - Last Transition Time: 2025-03-12T23:21:54Z - Message: - Observed Generation: 1 - Reason: AfterStageTaskApprovalRequestCreated - Status: True - Type: ApprovalRequestCreated - Last Transition Time: 2025-03-12T23:22:55Z - Message: - Observed Generation: 1 - Reason: 
AfterStageTaskApprovalRequestApproved - Status: True - Type: ApprovalRequestApproved - Type: Approval - Conditions: - Last Transition Time: 2025-03-12T23:22:54Z - Message: - Observed Generation: 1 - Reason: AfterStageTaskWaitTimeElapsed - Status: True - Type: WaitTimeElapsed - Type: TimedWait - Clusters: - Cluster Name: member1 - Conditions: - Last Transition Time: 2025-03-12T23:21:39Z - Message: - Observed Generation: 1 - Reason: ClusterUpdatingStarted - Status: True - Type: Started - Last Transition Time: 2025-03-12T23:21:54Z - Message: - Observed Generation: 1 - Reason: ClusterUpdatingSucceeded - Status: True - Type: Succeeded - Conditions: - Last Transition Time: 2025-03-12T23:21:54Z - Message: - Observed Generation: 1 - Reason: StageUpdatingWaiting - Status: False - Type: Progressing - Last Transition Time: 2025-03-12T23:22:55Z - Message: - Observed Generation: 1 - Reason: StageUpdatingSucceeded - Status: True - Type: Succeeded - End Time: 2025-03-12T23:22:55Z - Stage Name: staging - Start Time: 2025-03-12T23:21:39Z - After Stage Task Status: - Approval Request Name: example-run-canary - Conditions: - Last Transition Time: 2025-03-12T23:23:10Z - Message: - Observed Generation: 1 - Reason: AfterStageTaskApprovalRequestCreated - Status: True - Type: ApprovalRequestCreated - Last Transition Time: 2025-03-12T23:25:15Z - Message: - Observed Generation: 1 - Reason: AfterStageTaskApprovalRequestApproved - Status: True - Type: ApprovalRequestApproved - Type: Approval - Clusters: - Cluster Name: member2 - Conditions: - Last Transition Time: 2025-03-12T23:22:55Z - Message: - Observed Generation: 1 - Reason: ClusterUpdatingStarted - Status: True - Type: Started - Last Transition Time: 2025-03-12T23:23:10Z - Message: - Observed Generation: 1 - Reason: ClusterUpdatingSucceeded - Status: True - Type: Succeeded - Conditions: - Last Transition Time: 2025-03-12T23:23:10Z - Message: - Observed Generation: 1 - Reason: StageUpdatingWaiting - Status: False - Type: Progressing - 
Last Transition Time: 2025-03-12T23:25:15Z - Message: - Observed Generation: 1 - Reason: StageUpdatingSucceeded - Status: True - Type: Succeeded - End Time: 2025-03-12T23:25:15Z - Stage Name: canary - Start Time: 2025-03-12T23:22:55Z - After Stage Task Status: - Conditions: - Last Transition Time: 2025-03-12T23:26:15Z - Message: - Observed Generation: 1 - Reason: AfterStageTaskWaitTimeElapsed - Status: True - Type: WaitTimeElapsed - Type: TimedWait - Approval Request Name: example-run-production - Conditions: - Last Transition Time: 2025-03-12T23:25:15Z - Message: - Observed Generation: 1 - Reason: AfterStageTaskApprovalRequestCreated - Status: True - Type: ApprovalRequestCreated - Last Transition Time: 2025-03-12T23:25:25Z - Message: - Observed Generation: 1 - Reason: AfterStageTaskApprovalRequestApproved - Status: True - Type: ApprovalRequestApproved - Type: Approval - Clusters: - Conditions: - Last Transition Time: 2025-03-12T23:25:15Z - Message: - Observed Generation: 1 - Reason: StageUpdatingWaiting - Status: False - Type: Progressing - Last Transition Time: 2025-03-12T23:26:15Z - Message: - Observed Generation: 1 - Reason: StageUpdatingSucceeded - Status: True - Type: Succeeded - End Time: 2025-03-12T23:26:15Z - Stage Name: production -Events: -``` - -### UpdateRun overall status - -At the very top, `Status.Conditions` gives the overall status of the updateRun. The execution an update run consists of two phases: initialization and execution. -During initialization, the controller performs a one-time setup where it captures a snapshot of the updateRun strategy, collects scheduled and to-be-deleted `ClusterResourceBindings`, -generates the cluster update sequence, and records all this information in the updateRun status. -The `UpdateRunInitializedSuccessfully` condition indicates the initialization is successful. - -After initialization, the controller starts executing the updateRun. The `UpdateRunStarted` condition indicates the execution has started. 
- -After all clusters are updated, all after-stage tasks are completed, and thus all stages are finished, the `UpdateRunSucceeded` condition is set to `True`, indicating the updateRun has succeeded. - -### Fields recorded in the updateRun status during initialization - -During initialization, the controller records the following fields in the updateRun status: -- `PolicySnapshotIndexUsed`: the index of the policy snapshot used for the updateRun, it should be the latest one. -- `PolicyObservedClusterCount`: the number of clusters selected by the scheduling policy. -- `StagedUpdateStrategySnapshot`: the snapshot of the updateRun strategy, which ensures any strategy changes will not affect executing updateRuns. - -### Stages and clusters status - -The `Stages Status` section displays the status of each stage and cluster. As shown in the strategy snapshot, the updateRun has three stages: `staging`, `canary`, and `production`. During initialization, the controller generates the rollout plan, classifies the scheduled clusters -into these three stages and dumps the plan into the updateRun status. As the execution progresses, the controller updates the status of each stage and cluster. Take the `staging` stage as an example, `member1` is included in this stage. `ClusterUpdatingStarted` condition indicates the cluster is being updated and `ClusterUpdatingSucceeded` condition shows the cluster is updated successfully. - -After all clusters are updated in a stage, the controller executes the specified after-stage tasks. Stage `staging` has two after-stage tasks: `Approval` and `TimedWait`. The `Approval` task requires the admin to manually approve a `ClusterApprovalRequest` generated by the controller. The name of the `ClusterApprovalRequest` is also included in the status, which is `example-run-staging`. 
`AfterStageTaskApprovalRequestCreated` condition indicates the approval request is created and `AfterStageTaskApprovalRequestApproved` condition indicates the approval request has been approved. The `TimedWait` task enforces a suspension of the rollout until the specified wait time has elapsed and in this case, the wait time is 1 minute. `AfterStageTaskWaitTimeElapsed` condition indicates the wait time has elapsed and the rollout can proceed to the next stage. - -Each stage also has its own conditions. When a stage starts, the `Progressing` condition is set to `True`. When all the cluster updates complete, the `Progressing` condition is set to `False` with reason `StageUpdatingWaiting` as shown above. It means the stage is waiting for -after-stage tasks to pass. -And thus the `lastTransitionTime` of the `Progressing` condition also serves as the start time of the wait in case there's a `TimedWait` task. When all after-stage tasks pass, the `Succeeded` condition is set to `True`. Each stage status also has `Start Time` and `End Time` fields, making it easier to read. - -There's also a `Deletion Stage Status` section, which displays the status of the deletion stage. The deletion stage is the last stage of the updateRun. It deletes resources from the unscheduled clusters. The status is pretty much the same as a normal update stage, except that there are no after-stage tasks. - -Note that all these conditions have `lastTransitionTime` set to the time when the controller updates the status. It can help debug and check -the progress of the updateRun. 
- ## Investigate ClusterStagedUpdateRun initialization failure An updateRun initialization failure can be easily detected by getting the resource: From 7bee7c87e981ab7a6c83b1b5ed4f80ad606731ef Mon Sep 17 00:00:00 2001 From: Wantong Jiang Date: Tue, 18 Mar 2025 18:08:41 +0000 Subject: [PATCH 5/5] fix mac cmd issue --- docs/concepts/StagedUpdateRun/README.md | 2 +- docs/howtos/updaterun.md | 23 ++++++++++++++++++++++- 2 files changed, 23 insertions(+), 2 deletions(-) diff --git a/docs/concepts/StagedUpdateRun/README.md b/docs/concepts/StagedUpdateRun/README.md index 391918a22..bebe17353 100644 --- a/docs/concepts/StagedUpdateRun/README.md +++ b/docs/concepts/StagedUpdateRun/README.md @@ -97,7 +97,7 @@ spec: The user then need to manually approve the task by patching its status: ```bash -kubectl patch clusterapprovalrequests example-run-canary --type='merge' -p '{"status":{"conditions":[{"type":"Approved","status":"True","reason":"lgtm","message":"lgtm","lastTransitionTime":"'$(date --utc +%Y-%m-%dT%H:%M:%SZ)'","observedGeneration":1}]}}' --subresource=status +kubectl patch clusterapprovalrequests example-run-canary --type='merge' -p '{"status":{"conditions":[{"type":"Approved","status":"True","reason":"lgtm","message":"lgtm","lastTransitionTime":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","observedGeneration":1}]}}' --subresource=status ``` The updateRun will only continue to next stage after the `ClusterApprovalRequest` is approved. 
diff --git a/docs/howtos/updaterun.md b/docs/howtos/updaterun.md index 5b5ee6bf8..6407d8ef5 100644 --- a/docs/howtos/updaterun.md +++ b/docs/howtos/updaterun.md @@ -299,9 +299,30 @@ example-run-canary example-run canary 2m2s ``` We can approve the `ClusterApprovalRequest` by patching its status: ```bash -kubectl patch clusterapprovalrequests example-run-canary --type=merge -p {"status":{"conditions":[{"type":"Approved","status":"True","reason":"lgtm","message":"lgtm","lastTransitionTime":"'$(date --utc +%Y-%m-%dT%H:%M:%SZ)'","observedGeneration":1}]}} --subresource=status +kubectl patch clusterapprovalrequests example-run-canary --type=merge -p '{"status":{"conditions":[{"type":"Approved","status":"True","reason":"lgtm","message":"lgtm","lastTransitionTime":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","observedGeneration":1}]}}' --subresource=status clusterapprovalrequest.placement.kubernetes-fleet.io/example-run-canary patched ``` +This can be done equivalently by creating a JSON patch file and applying it: +```bash +cat << EOF > approval.json +{ + "status": { + "conditions": [ + { + "lastTransitionTime": "$(date -u +%Y-%m-%dT%H:%M:%SZ)", + "message": "lgtm", + "observedGeneration": 1, + "reason": "lgtm", + "status": "True", + "type": "Approved" + } + ] + } +} +EOF +kubectl patch clusterapprovalrequests example-run-canary --type='merge' --subresource=status --patch-file approval.json +``` + Then verify it's approved: ```bash kubectl get clusterapprovalrequest