Skip to content

SEP75 Automatic Workload Migration#850

Open
dgn wants to merge 1 commit into
istio-ecosystem:mainfrom
dgn:sep-workload-migration
Open

SEP75 Automatic Workload Migration#850
dgn wants to merge 1 commit into
istio-ecosystem:mainfrom
dgn:sep-workload-migration

Conversation

@dgn
Copy link
Copy Markdown
Collaborator

@dgn dgn commented May 26, 2025

First draft

@dgn dgn requested a review from a team as a code owner May 26, 2025 13:11
@dgn dgn force-pushed the sep-workload-migration branch from 6f842cc to 1f95a6b Compare May 26, 2025 13:13
@dgn dgn changed the title SEP74 Automatic Workload Migration SEP75 Automatic Workload Migration May 26, 2025
@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.48%. Comparing base (eb6389f) to head (df43c23).
⚠️ Report is 196 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #850      +/-   ##
==========================================
+ Coverage   76.70%   77.48%   +0.78%     
==========================================
  Files          44       44              
  Lines        2640     2834     +194     
==========================================
+ Hits         2025     2196     +171     
  Misses        529      529              
- Partials       86      109      +23     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dgn dgn force-pushed the sep-workload-migration branch from 1f95a6b to 7a66e00 Compare May 26, 2025 13:25
Comment thread enhancements/SEP75-automatic-workload-migration.md
// The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion
// will not trigger workload migration.
// If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream.
MaximumVersion *string `json:maximumVersion,omitempty"`
Copy link
Copy Markdown
Collaborator

@MaxBab MaxBab May 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Does it means that the maximum version set could be a version that is number of versions ahead of the current version running and not just the next one?
But from other side is to set a specific version I would not want to get over?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be an usage of this? To avoid automatic workload migration when upgrading to new minor version?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes- basically by default it will only update patch versions, unless you set a higher MaximumVersion (now that I'm reading it again, maybe MaxVersion is better)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a bunch of explanations for this field (and also renamed it maxVersion)

Comment thread enhancements/SEP75-automatic-workload-migration.md Outdated
Comment thread enhancements/SEP75-automatic-workload-migration.md
// The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion
// will not trigger workload migration.
// If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream.
MaximumVersion *string `json:maximumVersion,omitempty"`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be an usage of this? To avoid automatic workload migration when upgrading to new minor version?

Comment thread enhancements/SEP75-automatic-workload-migration.md
Comment thread enhancements/SEP75-automatic-workload-migration.md
@dgn
Copy link
Copy Markdown
Collaborator Author

dgn commented May 27, 2025

Update the doc to now use an enum (strategy) instead of a bool (enabled) for enablement. Not sure I like the stuttering (updateStrategy.workloadMigration.strategy is not great) but I couldn't come up with a better term right now

@dgn dgn force-pushed the sep-workload-migration branch from 7a66e00 to 1e6c512 Compare May 28, 2025 08:41
Comment thread enhancements/SEP75-automatic-workload-migration.md
Comment thread enhancements/SEP75-automatic-workload-migration.md
Comment thread enhancements/SEP75-automatic-workload-migration.md
Comment on lines +66 to +69
// The highest version that the Sail Operator will update to. Updates to versions later than MaximumVersion
// will not trigger workload migration.
// If unset, the operator will only trigger workload migration for new patch versions on the current minor version stream.
MaximumVersion *string `json:"maximumVersion,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want the operator to upgrade your workloads when going from say 1.25 --> 1.26, you'd need to set spec.version = 1.26 and spec.updateStrategy.maximumVersion = 1.26? For each minor upgrade you'd need to keep setting spec.updateStrategy.maximumVersion?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, effectively. Which is a good thing I believe, as minor version upgrades could break your config. Or you set it to something like 1.999.0 to always upgrade

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually did change this as I think it was confusing. Now, if you don't set maxVersion, there simply is no maximum, which I think is more what you would expect as a user.

Comment thread enhancements/SEP75-automatic-workload-migration.md
Comment thread enhancements/SEP75-automatic-workload-migration.md
Comment thread enhancements/SEP75-automatic-workload-migration.md
Comment thread enhancements/SEP75-automatic-workload-migration.md
Comment thread enhancements/SEP75-automatic-workload-migration.md
@dgn dgn force-pushed the sep-workload-migration branch from 1e6c512 to 266f40e Compare October 22, 2025 11:58
@dgn dgn force-pushed the sep-workload-migration branch 4 times, most recently from e1e8399 to e029cc3 Compare October 22, 2025 12:12
Copy link
Copy Markdown
Collaborator

@skriss skriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a few thoughts mainly around the idea of supporting more of a canary upgrade/migration process. You could make the argument that that could be a separate strategy (canary instead of batched) with its own configuration. From my perspective it would probably be pretty appealing to users. In general though this seems like a pretty compelling feature and great differentiator for the operator!


### Failure Policy

The initial implementation will continue processing all batches even if individual workloads fail to migrate, reporting all failures at the end. Some users may want different behavior when migrations fail. We could implement a `failurePolicy` field that allows users to specify what the operator should do in case a workload does not become ready.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial implementation is somewhat at odds with a "canary upgrade" where you'd want to ensure a small subset of upgraded workloads are successful before proceeding, so IMO adding something like a failurePolicy will likely be pretty important for users, to be able to have a controlled upgrade / limit blast radius of any issues.

Comment on lines +591 to +593
Potential `failurePolicy` values:
* `ContinueOnError`: Continue migrating all workloads, report failures at end
* `FailFast`: Stop immediately on first failure
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A configurable % of workloads that are allowed to fail before stopping the migration might be nice as it would give finer-grained control over success/failure conditions (i.e. "continue migration as long as no more than 5% of workloads failed upgrade").

**Potential annotations:**
- `sailoperator.io/readiness-timeout: 10s`: Override the readiness timeout for specific workloads
- `sailoperator.io/skip-migration: true`: Exclude specific workloads from automatic migration
- `sailoperator.io/migration-batch: 3`: Assign workload to a specific migration priority (integer, lower values migrate first)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this would be very helpful for supporting canary-style upgrades as it would allow the user to identify the initial set of workloads (i.e. less critical) to try to migrate. Otherwise, it's possible that critical workloads could be ~randomly selected for migration first which would not be ideal.

This adds an enhancement proposal on Automatic Workload Migration, which
helps users by automatically restarting their workloads when updating
their mesh. This feature will only be available in sidecar mode.

Signed-off-by: Daniel Grimm <dgrimm@redhat.com>
@dgn dgn force-pushed the sep-workload-migration branch from e029cc3 to df43c23 Compare October 23, 2025 09:26

type WorkloadMigrationStatus struct {
// State represents the current state of the migration process.
// +kubebuilder:validation:Enum=Idle;InProgress;Completed;Failed
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be a difference between Idle and Completed?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe NotStarted is a better name

// Name of the failed workload.
Name string `json:"name"`

// Kind of the failed workload (e.g., "Deployment").
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed if we only migrate Deployments and nothing else?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call. we can leave it out for now and only add it once we support other kinds

Copy link
Copy Markdown
Contributor

@nrfox nrfox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a few more questions/comments but overall this looks good

Comment on lines +88 to +91
// - MaxVersion: unset -> migrate to any version (default behavior)
// - MaxVersion: "1.24.999" -> migrate within 1.24.x only
// - MaxVersion: "1.26.0" -> migrate up to and including 1.26.0
// - MaxVersion: "1.999.0" -> migrate to any 1.x version, stop at 2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little awkward having to specify 1.24.999 where 999 is just an arbitrarily large number.

The behavior could be that leaving off the patch version means "migrate within 1.24.x only"? e.g.

MaxVersion: "1.24.5" -> migrate up to and including 1.24.5
MaxVersion: "1.24" -> migrate within 1.24.x only
MaxVersion: "1" -> migrate to any 1.x version, stop at 2.0

Comment on lines +94 to +99
MaxVersion *string `json:"maxVersion,omitempty"`

// Maximum time to wait for a deployment to become ready after restart.
// Defaults to 5m.
// +kubebuilder:default="5m"
ReadinessTimeout *metav1.Duration `json:"readinessTimeout,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will these fields be relevant to other workload strategies as well? If we keep them here then we'll need to duplicate them on the other strategies. Should they be moved up into WorkloadMigrationConfig?


type WorkloadMigrationStatus struct {
// State represents the current state of the migration process.
// +kubebuilder:validation:Enum=Idle;InProgress;Completed;Failed
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Failed mean "there was an error. won't retry" or "there was an error. will retry"?

Comment on lines +367 to +372
### Performance Impact

* **Discovery Overhead**: The operator lists all namespaces and deployments once per migration
* **Batch Processing**: Migration impact is controlled through configurable batch sizes and delays
* **Memory Usage**: Minimal additional memory for tracking migration state
* **Network Traffic**: Standard Kubernetes API calls for object updates
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the whole migration is performed in a single reconcile loop, I'm more worried about the operator being tied up with the migration for a particular Istio resource and unable to respond to updates to that Istio resource until the migration is finished which could take awhile given a large amount of workloads and needing to wait until each pod has restarted.

@dgn
Copy link
Copy Markdown
Collaborator Author

dgn commented Dec 4, 2025

as per our discussion regarding ZTunnel upgrades, this should be extended to be flexible enough to enable restarting ambient workloads when the ZTunnel instance is updated

@nrfox
Copy link
Copy Markdown
Contributor

nrfox commented Mar 19, 2026

@dgn would having an implementation of this be helpful in moving this along?

@dgn dgn requested a review from Copilot April 16, 2026 14:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a first-draft SEP describing “Automatic Workload Migration” for Sail Operator’s revision-based Istio upgrades, including API shape, status reporting, migration flow, and a test plan.

Changes:

  • Specifies a new workloadMigration configuration under IstioUpdateStrategy with a “Batched” strategy and tunables (batching, delays, readiness timeout, max version boundary).
  • Proposes status reporting (WorkloadMigrationStatus) and describes discovery/migration phases.
  • Adds an initial implementation plan and detailed test scenarios.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1 to +3
| Status | Authors | Created |
|---|---|---|
| WIP | @dgn | 2025-05-22 |
Comment on lines +45 to +48
#### Deprecation / Removal of `updateStrategy.updateWorkloads` field

We accidentally included this field in our v1 CRD even though it never had any functionality. It doesn't really follow our existing API standards so we won't use it for this functionality going forward, and will remove it instead. As it was never functional to begin with, removing it should cause no harm.

Comment on lines +112 to +113
The operator already has the required permission to update `Deployment` resources.

Comment on lines +123 to +126
2. **Discovery**: The operator discovers workloads using old revisions by:
- Listing all namespaces and checking their `istio.io/rev` or `istio-injection` labels
- Listing all deployments and checking their pod template labels
- Listing all pods and checking their pod annotations to detect injected revision
Comment on lines +126 to +127
- Listing all pods and checking their pod annotations to detect injected revision
- Comparing current annotations against the active revision name
Comment on lines +369 to +372
* **Discovery Overhead**: The operator lists all namespaces and deployments once per migration
* **Batch Processing**: Migration impact is controlled through configurable batch sizes and delays
* **Memory Usage**: Minimal additional memory for tracking migration state
* **Network Traffic**: Standard Kubernetes API calls for object updates
- Requires updating `maxVersion` when ready for 1.27
- Best for staged rollouts with validation between minor versions

**Implementation Note:** Version comparison uses semantic versioning rules. When `maxVersion` is set, the operator compares `spec.version` against `maxVersion` using standard semver comparison (`<=`). When unset, all versions are accepted.
metadata:
name: default
spec:
version: v1.26.0
workloadMigration:
strategy: Batched
batched:
maxVersion: "1.26.999" # Only migrate within 1.26.x, stop at 1.27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants