Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
439 changes: 438 additions & 1 deletion docs/concepts/StagedUpdateRun/README.md

Large diffs are not rendered by default.

25 changes: 22 additions & 3 deletions docs/howtos/crp.md
Original file line number Diff line number Diff line change
Expand Up @@ -371,11 +371,13 @@ Most outcomes can lead to service interruptions. Apps running on member clusters
unavailable as Fleet dispatches updated resources. Clusters that are no longer selected will lose all placed resources,
resulting in lost traffic. If too many new clusters are selected and Fleet places resources on them simultaneously,
your backend may become overloaded. The exact interruption pattern may vary depending on the resources you place using Fleet.
To minimize interruption, Fleet allows users to configure the rollout strategy. There are two types of rollout strategies we currently support.
Comment thread
jwtty marked this conversation as resolved.

To minimize interruption, Fleet allows users to configure the rollout strategy, similar to native Kubernetes deployment,
to transition between changes as smoothly as possible. Currently, Fleet supports only one rollout strategy: rolling update.
### Default rollout strategy: Rolling Update

The default strategy is rolling update, and it applies to all changes you initiate.
This strategy ensures changes, including the addition or removal of selected clusters and resource refreshes,
are applied incrementally in a phased manner at a pace suitable for you. This is the default option and applies to all changes you initiate.
are applied incrementally in a phased manner at a pace suitable for you, similar to native Kubernetes deployments.

This rollout strategy can be configured with the following parameters:

Expand Down Expand Up @@ -471,6 +473,23 @@ longer to complete the rollout, in accordance with the rolling update strategy y
> to some clusters. You can identify this behavior if CRP status; for more information, see
> [Understanding the Status of a `ClusterResourcePlacement`](crp-status.md) How-To Guide.

### External rollout strategy and staged update run

Fleet supports flexible rollout patterns through an `External` rollout strategy, which allows you to implement custom rollout controllers.
When configured, Fleet delegates the responsibility of resource placement to your external controller instead of using Fleet's built-in rolling update mechanism.

One implementation of an external rollout strategy is the **Staged Update Run**.
This approach enables a controlled, stage-by-stage placement of workload resources defined in a `ClusterResourcePlacement`.

To utilize this strategy:
1. Set `spec.strategy.type` as `External` in the `ClusterResourcePlacement` object.
2. Define your rollout process using two custom resources:
- `ClusterStagedUpdateStrategy`: A reusable template defining the rollout pattern
- `ClusterStagedUpdateRun`: The resource that triggers and manages the actual rollout process.

For comprehensive guidance on implementing staged updates, please refer to the [Staged Update Run Concepts](../concepts/StagedUpdateRun/README.md)
and [Staged Update Run How-To Guide](updaterun.md).

## Snapshots and revisions

Internally, Fleet keeps a history of all the scheduling policies you have used with a
Expand Down
24 changes: 21 additions & 3 deletions docs/howtos/updaterun.md
Original file line number Diff line number Diff line change
Expand Up @@ -208,7 +208,6 @@ status:
name: staging
- afterStageTasks:
- type: Approval
waitTime: 1h0m0s
labelSelector:
matchLabels:
environment: canary
Expand Down Expand Up @@ -300,9 +299,29 @@ example-run-canary example-run canary 2m2s
```
We can approve the `ClusterApprovalRequest` by patching its status:
```bash
kubectl patch clusterapprovalrequests example-run-canary --type=merge -p {"status":{"conditions":[{"type":"Approved","status":"True","reason":"lgtm","message":"lgtm","lastTransitionTime":"'$(date --utc +%Y-%m-%dT%H:%M:%SZ)'","observedGeneration":1}]}} --subresource=status
kubectl patch clusterapprovalrequests example-run-canary --type=merge -p {"status":{"conditions":[{"type":"Approved","status":"True","reason":"lgtm","message":"lgtm","lastTransitionTime":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","observedGeneration":1}]}} --subresource=status
clusterapprovalrequest.placement.kubernetes-fleet.io/example-run-canary patched
```
This can be done equivalently by creating a json patch file and applying it:
```bash
cat << EOF > approval.json
{
"status": {
"conditions": [
{
"lastTransitionTime": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"message": "lgtm",
"observedGeneration": 1,
"reason": "lgtm",
"status": "True",
"type": "Approved"
}
]
}
EOF
kubectl patch clusterapprovalrequests example-run-canary --type='merge' --subresource=status --patch-file approval.json
```

Then verify it's approved:
```bash
kubectl get clusterapprovalrequest
Expand Down Expand Up @@ -407,7 +426,6 @@ status:
name: staging
- afterStageTasks:
- type: Approval
waitTime: 1h0m0s
labelSelector:
matchLabels:
environment: canary
Expand Down
1 change: 1 addition & 0 deletions docs/troubleshooting/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ The complete progression of `ClusterResourcePlacement` is as follows:
- If this condition is false, refer to [How can I debug when my CRP status is ClusterResourcePlacementScheduled condition status is set to false?](./clusterResourcePlacementScheduled.md).
2. `ClusterResourcePlacementRolloutStarted`: Indicates the rollout process has begun.
- If this condition is false refer to [How can I debug when my CRP status is ClusterResourcePlacementRolloutStarted condition status is set to false?](./clusterResourcePlacementRolloutStarted.md)
- If you are triggering a rollout with a staged update run, refer to [Staged Update Run Troubleshooting Guide](./updaterun.md).
3. `ClusterResourcePlacementOverridden`: Indicates the resource has been overridden.
- If this condition is false, refer to [How can I debug when my CRP status is ClusterResourcePlacementOverridden condition status is set to false?](./clusterResourcePlacementOverridden.md)
4. `ClusterResourcePlacementWorkSynchronized`: Indicates the work objects have been synchronized.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,10 @@
# How can I debug when my CRP status is ClusterResourcePlacementRolloutStarted condition status is set to false?

When using the `ClusterResourcePlacement` API object in Azure Kubernetes Fleet Manager to propagate resources, the selected resources aren't rolled out in all scheduled clusters and the `ClusterResourcePlacementRolloutStarted` condition status shows as `False`.

*This TSG only applies to the `RollingUpdate` rollout strategy, which is the default strategy if you don't specify in the `ClusterResourcePlacement`.*
*To troubleshoot the update run strategy as you specify `External` in the `ClusterResourcePlacement`, please refer to the [Staged Update Run Troubleshooting Guide](updaterun.md).*

> Note: To get more information about why the rollout doesn't start, you can check the [rollout controller](https://github.com/Azure/fleet/blob/main/pkg/controllers/rollout/controller.go) to get more information on why the rollout did not start.

## Common scenarios:
Expand Down
198 changes: 198 additions & 0 deletions docs/troubleshooting/updaterun.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
# Staged Update Run Troubleshooting Guide

This guide provides troubleshooting steps for common issues related to Staged Update Run.

## CRP status without Staged Update Run

When a `ClusterResourcePlacement` is created with `spec.strategy.type` set to `External`, the rollout does not start immediately.

A sample status of such `ClusterResourcePlacement` is as follows:

```bash
$ kubectl describe crp example-placement
...
Status:
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
Observed Generation: 1
Reason: SchedulingPolicyFulfilled
Status: True
Type: ClusterResourcePlacementScheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: There are still 2 cluster(s) in the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: ClusterResourcePlacementRolloutStarted
Observed Resource Index: 0
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Cluster Name: member2
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Selected Resources:
...
Events: <none>
```

`SchedulingPolicyFulfilled` condition indicates the CRP has been fully scheduled, while `RolloutStartedUnknown` condition shows that the rollout has not started.

In the `Placement Statuses` section, it displays the detailed status of each cluster. Both selected clusters are in the `Scheduled` state, but the `RolloutStarted` condition is still `Unknown` because the rollout has not kicked off yet.

## Investigate ClusterStagedUpdateRun initialization failure

An updateRun initialization failure can be easily detected by getting the resource:
```bash
$ kubectl get crsur example-run
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
example-run example-placement 1 0 False 2s
```
The `INITIALIZED` field is `False`, indicating the initialization failed.

Describe the updateRun to get more details:
```bash
$ kubectl describe crsur example-run
...
Status:
Conditions:
Last Transition Time: 2025-03-13T07:28:29Z
Message: cannot continue the ClusterStagedUpdateRun: failed to initialize the clusterStagedUpdateRun: failed to process the request due to a client error: no clusterResourceSnapshots with index `1` found for clusterResourcePlacement `example-placement`
Observed Generation: 1
Reason: UpdateRunInitializedFailed
Status: False
Type: Initialized
Deletion Stage Status:
Clusters:
Stage Name: kubernetes-fleet.io/deleteStage
Policy Observed Cluster Count: 2
Policy Snapshot Index Used: 0
...
```
The condition clearly indicates the initialization failed. And the condition message gives more details about the failure.
In this case, I used a not-existing resource snapshot index `1` for the updateRun.

## Investigate ClusterStagedUpdateRun rollout stuck

A `ClusterStagedUpdateRun` can get stuck when resource placement fails on some clusters. Describing the updateRun will show some cluster is stuck in `ClusterUpdatingStarted` condition:
```bash
$ date; kubectl describe crsur example-run
Thu Mar 13 07:44:28 UTC 2025
...
Stages Status:
After Stage Task Status:
Approval Request Name: example-run-staging
Type: Approval
Type: TimedWait
Clusters:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-13T07:37:36Z
Message:
Observed Generation: 1
Reason: ClusterUpdatingStarted
Status: True
Type: Started
Conditions:
Last Transition Time: 2025-03-13T07:37:36Z
Message:
Observed Generation: 1
Reason: StageUpdatingStarted
Status: True
Type: Progressing
Stage Name: staging
Start Time: 2025-03-13T07:37:36Z
```

As you can see, the cluster updating has started about 7 minutes ago, but the `ClusterUpdatingSucceeded` condition is still not set yet.
This usually indicates something wrong happened on the cluster. To further investigate, you can check the `ClusterResourcePlacement` status:
```bash
$ kubectl describe crp example-placement
...
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-13T07:37:36Z
Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 3, clusterStagedUpdateRun: example-run
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-03-13T07:37:36Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2025-03-13T07:37:36Z
Message: All of the works are synchronized to the latest
Observed Generation: 1
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2025-03-13T07:37:39Z
Message: Work object example-placement-work has failed to apply
Observed Generation: 1
Reason: NotAllWorkHaveBeenApplied
Status: False
Type: Applied
Failed Placements:
Condition:
Last Transition Time: 2025-03-13T07:37:36Z
Message: Manifest is trackable but not available yet
Observed Generation: 1
Reason: ManifestNotAvailableYet
Status: False
Type: Available
Group: apps
Kind: Deployment
Name: nginx
Namespace: test-namespace
Version: v1
Condition:
Last Transition Time: 2025-03-13T07:37:39Z
Message: Failed to apply manifest: failed to process the request due to a client error: resource exists and is not managed by the fleet controller and co-ownernship is disallowed
Reason: ManifestsAlreadyOwnedByOthers
Status: False
Type: Applied
Group: apps
Kind: ReplicaSet
Name: nginx-7b6df7c758
Namespace: test-namespace
Version: v1
...
```

The `Applied` condition is `False` and says not all work have been applied. And in the "failed placements" section, it shows the detailed failure. For more debugging instructions, you can refer to [CRP troubleshooting guide](./README.md).

After resolving the issue, you can create always create a new updateRun to restart the rollout. Stuck updateRuns can be deleted.