SEP75 Automatic Workload Migration#850
Conversation
6f842cc to
1f95a6b
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #850 +/- ##
==========================================
+ Coverage 76.70% 77.48% +0.78%
==========================================
Files 44 44
Lines 2640 2834 +194
==========================================
+ Hits 2025 2196 +171
Misses 529 529
- Partials 86 109 +23 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
1f95a6b to
7a66e00
Compare
| // The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion | ||
| // will not trigger workload migration. | ||
| // If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream. | ||
| MaximumVersion *string `json:maximumVersion,omitempty"` |
There was a problem hiding this comment.
Q: Does it means that the maximum version set could be a version that is number of versions ahead of the current version running and not just the next one?
But from other side is to set a specific version I would not want to get over?
There was a problem hiding this comment.
What would be an usage of this? To avoid automatic workload migration when upgrading to new minor version?
There was a problem hiding this comment.
Yes- basically by default it will only update patch versions, unless you set a higher MaximumVersion (now that I'm reading it again, maybe MaxVersion is better)
There was a problem hiding this comment.
I added a bunch of explanations for this field (and also renamed it maxVersion)
| // The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion | ||
| // will not trigger workload migration. | ||
| // If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream. | ||
| MaximumVersion *string `json:maximumVersion,omitempty"` |
There was a problem hiding this comment.
What would be an usage of this? To avoid automatic workload migration when upgrading to new minor version?
|
Update the doc to now use an enum ( |
7a66e00 to
1e6c512
Compare
| // The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion | ||
| // will not trigger workload migration. | ||
| // If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream. | ||
| MaximumVersion *string `json:"maximumVersion,omitempty"` |
There was a problem hiding this comment.
If you want the operator to upgrade your workloads when going from say 1.25 --> 1.26, you'd need to set spec.version = 1.26 and spec.updateStrategy.maximumVersion = 1.26? For each minor upgrade you'd need to keep setting spec.updateStrategy.maximumVersion?
There was a problem hiding this comment.
Yes, effectively. Which is a good thing I believe, as minor version upgrades could break your config. Or you set it to something like 1.999.0 to always upgrade
There was a problem hiding this comment.
I actually did change this as I think it was confusing. Now, if you don't set maxVersion, there simply is no maximum, which I think is more what you would expect as a user.
1e6c512 to
266f40e
Compare
e1e8399 to
e029cc3
Compare
skriss
left a comment
There was a problem hiding this comment.
Had a few thoughts mainly around the idea of supporting more of a canary upgrade/migration process. You could make the argument that that could be a separate strategy (canary instead of batched) with its own configuration. From my perspective it would probably be pretty appealing to users. In general though this seems like a pretty compelling feature and great differentiator for the operator!
|
|
||
| ### Failure Policy | ||
|
|
||
| The initial implementation will continue processing all batches even if individual workloads fail to migrate, reporting all failures at the end. Some users may want different behavior when migrations fail. We could implement a `failurePolicy` field that allows users to specify what the operator should do in case a workload does not become ready. |
There was a problem hiding this comment.
The initial implementation is somewhat at odds with a "canary upgrade" where you'd want to ensure a small subset of upgraded workloads are successful before proceeding, so IMO adding something like a failurePolicy will likely be pretty important for users, to be able to have a controlled upgrade / limit blast radius of any issues.
| Potential `failurePolicy` values: | ||
| * `ContinueOnError`: Continue migrating all workloads, report failures at end | ||
| * `FailFast`: Stop immediately on first failure |
There was a problem hiding this comment.
A configurable % of workloads that are allowed to fail before stopping the migration might be nice as it would give finer-grained control over success/failure conditions (i.e. "continue migration as long as no more than 5% of workloads failed upgrade").
| **Potential annotations:** | ||
| - `sailoperator.io/readiness-timeout: 10s`: Override the readiness timeout for specific workloads | ||
| - `sailoperator.io/skip-migration: true`: Exclude specific workloads from automatic migration | ||
| - `sailoperator.io/migration-batch: 3`: Assign workload to a specific migration priority (integer, lower values migrate first) |
There was a problem hiding this comment.
Something like this would be very helpful for supporting canary-style upgrades as it would allow the user to identify the initial set of workloads (i.e. less critical) to try to migrate. Otherwise, it's possible that critical workloads could be ~randomly selected for migration first which would not be ideal.
This adds an enhancement proposal on Automatic Workload Migration, which helps users by automatically restarting their workloads when updating their mesh. This feature will only be available in sidecar mode. Signed-off-by: Daniel Grimm <dgrimm@redhat.com>
e029cc3 to
df43c23
Compare
|
|
||
| type WorkloadMigrationStatus struct { | ||
| // State represents the current state of the migration process. | ||
| // +kubebuilder:validation:Enum=Idle;InProgress;Completed;Failed |
There was a problem hiding this comment.
What would be a difference between Idle and Completed?
There was a problem hiding this comment.
maybe NotStarted is a better name
| // Name of the failed workload. | ||
| Name string `json:"name"` | ||
|
|
||
| // Kind of the failed workload (e.g., "Deployment"). |
There was a problem hiding this comment.
Is this needed if we only migrate Deployments and nothing else?
There was a problem hiding this comment.
good call. we can leave it out for now and only add it once we support other kinds
nrfox
left a comment
There was a problem hiding this comment.
I had a few more questions/comments but overall this looks good
| // - MaxVersion: unset -> migrate to any version (default behavior) | ||
| // - MaxVersion: "1.24.999" -> migrate within 1.24.x only | ||
| // - MaxVersion: "1.26.0" -> migrate up to and including 1.26.0 | ||
| // - MaxVersion: "1.999.0" -> migrate to any 1.x version, stop at 2.0 |
There was a problem hiding this comment.
It's a little awkward having to specify 1.24.999 where 999 is just an arbitrarily large number.
The behavior could be that leaving off the patch version means "migrate within 1.24.x only"? e.g.
MaxVersion: "1.24.5" -> migrate up to and including 1.24.5
MaxVersion: "1.24" -> migrate within 1.24.x only
MaxVersion: "1" -> migrate to any 1.x version, stop at 2.0
| MaxVersion *string `json:"maxVersion,omitempty"` | ||
|
|
||
| // Maximum time to wait for a deployment to become ready after restart. | ||
| // Defaults to 5m. | ||
| // +kubebuilder:default="5m" | ||
| ReadinessTimeout *metav1.Duration `json:"readinessTimeout,omitempty"` |
There was a problem hiding this comment.
Will these fields be relevant to other workload strategies as well? If we keep them here then we'll need to duplicate them on the other strategies. Should they be moved up into WorkloadMigrationConfig?
|
|
||
| type WorkloadMigrationStatus struct { | ||
| // State represents the current state of the migration process. | ||
| // +kubebuilder:validation:Enum=Idle;InProgress;Completed;Failed |
There was a problem hiding this comment.
Does Failed mean "there was an error. won't retry" or "there was an error. will retry"?
| ### Performance Impact | ||
|
|
||
| * **Discovery Overhead**: The operator lists all namespaces and deployments once per migration | ||
| * **Batch Processing**: Migration impact is controlled through configurable batch sizes and delays | ||
| * **Memory Usage**: Minimal additional memory for tracking migration state | ||
| * **Network Traffic**: Standard Kubernetes API calls for object updates |
There was a problem hiding this comment.
If the whole migration is performed in a single reconcile loop, I'm more worried about the operator being tied up with the migration for a particular Istio resource and unable to respond to updates to that Istio resource until the migration is finished which could take awhile given a large amount of workloads and needing to wait until each pod has restarted.
|
as per our discussion regarding ZTunnel upgrades, this should be extended to be flexible enough to enable restarting ambient workloads when the ZTunnel instance is updated |
|
@dgn would having an implementation of this be helpful in moving this along? |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a first-draft SEP describing “Automatic Workload Migration” for Sail Operator’s revision-based Istio upgrades, including API shape, status reporting, migration flow, and a test plan.
Changes:
- Specifies a new
workloadMigrationconfiguration underIstioUpdateStrategywith a “Batched” strategy and tunables (batching, delays, readiness timeout, max version boundary). - Proposes status reporting (
WorkloadMigrationStatus) and describes discovery/migration phases. - Adds an initial implementation plan and detailed test scenarios.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| | Status | Authors | Created | | ||
| |---|---|---| | ||
| | WIP | @dgn | 2025-05-22 | |
| #### Deprecation / Removal of `updateStrategy.updateWorkloads` field | ||
|
|
||
| We accidentally included this field in our v1 CRD even though it never had any functionality. It doesn't really follow our existing API standards so we won't use it for this functionality going forward, and will remove it instead. As it was never functional to begin with, removing it should cause no harm. | ||
|
|
| The operator already has the required permission to update `Deployment` resources. | ||
|
|
| 2. **Discovery**: The operator discovers workloads using old revisions by: | ||
| - Listing all namespaces and checking their `istio.io/rev` or `istio-injection` labels | ||
| - Listing all deployments and checking their pod template labels | ||
| - Listing all pods and checking their pod annotations to detect injected revision |
| - Listing all pods and checking their pod annotations to detect injected revision | ||
| - Comparing current annotations against the active revision name |
| * **Discovery Overhead**: The operator lists all namespaces and deployments once per migration | ||
| * **Batch Processing**: Migration impact is controlled through configurable batch sizes and delays | ||
| * **Memory Usage**: Minimal additional memory for tracking migration state | ||
| * **Network Traffic**: Standard Kubernetes API calls for object updates |
| - Requires updating `maxVersion` when ready for 1.27 | ||
| - Best for staged rollouts with validation between minor versions | ||
|
|
||
| **Implementation Note:** Version comparison uses semantic versioning rules. When `maxVersion` is set, the operator compares `spec.version` against `maxVersion` using standard semver comparison (`<=`). When unset, all versions are accepted. |
| metadata: | ||
| name: default | ||
| spec: | ||
| version: v1.26.0 |
| workloadMigration: | ||
| strategy: Batched | ||
| batched: | ||
| maxVersion: "1.26.999" # Only migrate within 1.26.x, stop at 1.27 |
First draft