Skip to content

feature: zalando.org/forward-backend annotation support to enable migration to eks#716

Merged
mikkeloscar merged 25 commits into
masterfrom
feature/annotation-migration-support
Dec 15, 2025
Merged

feature: zalando.org/forward-backend annotation support to enable migration to eks#716
mikkeloscar merged 25 commits into
masterfrom
feature/annotation-migration-support

Conversation

@szuecs

@szuecs szuecs commented Oct 15, 2025

Copy link
Copy Markdown
Member

feature: zalando.org/forward-backend annotation support to enable migration to eks

Usage:

apiVersion: zalando.org/v1
kind: StackSet
metadata:
  annotations:
    zalando.org/forward-backend: eks migration # the value does not matter
..

This will execute the migration preparation, such that the next traffic switch will send the traffic to the forward backend.

@szuecs

szuecs commented Nov 6, 2025

Copy link
Copy Markdown
Member Author

changes should be moved to the GenerateRouteGroup/GenerateDeployment etc. functions to fit into the current model

…ration to eks

Usage:

```
apiVersion: zalando.org/v1
kind: StackSet
metadata:
  annotations:
    zalando.org/forward-backend: eks migration
..
```

This will execute the migration preparation, such that the next traffic switch will send the traffic to the forward backend.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
…gration, too

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
feature: reduce resource waste by setting replicas to 1 for the new unused stack, that will not receive any traffic

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
… the future why we scale down to 1

fix: ExternalIngress has also to get the anntoation to be set, such that it can execute a migration as expected

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@szuecs szuecs force-pushed the feature/annotation-migration-support branch from a11ccdc to ae8b458 Compare November 6, 2025 18:55
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Comment thread pkg/core/stack_resources.go Outdated
// GenerateHPA generates a hpa as configured in the
// stack. On cluster migrations set by stackset annotation
// "zalando.org/forward-backend", the hpa will be set to
// minReplicas = maxReplicass = 1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just skip generating an HPA in this case, this is similar to if sc.ScaleDown() case below.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My fear about removing objects we don't need is that we create alerts for the users and they overreact.
It's also not checking for traffic as we do in ScaleDown()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We anyway have to document how this feature works and what to expect.

They could also "overreact" because of minReplicas/maxReplicas = 1 :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think now?
I dropped HPA and deployment to not waste money.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@szuecs szuecs added the major label Nov 7, 2025
@szuecs

szuecs commented Nov 7, 2025

Copy link
Copy Markdown
Member Author

not sure if this is major or minor, I tend to minor but the number of files of the change feels a bit more than minor.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@szuecs

szuecs commented Nov 17, 2025

Copy link
Copy Markdown
Member Author

The current PR crashes with recovered panics at runtime.
I think stackset-controller wants to have a deployment.

time="2025-11-17T20:27:34Z" level=info msg="Event(v1.ObjectReference{Kind:\"StackSet\", Namespace:\"default\", Name:\"test-migration-app\", UID:\"11a0c501-1333-4c7a-bb8a-3b58eb68156e\", APIVersion:\"zalando.org/v1\", ResourceVersion:\"2692037662\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedManageStackSet' panic: runtime error: invalid memory address or nil pointer dereference"
time="2025-11-17T20:27:43Z" level=error msg="Encountered a panic while processing a stackset: runtime error: invalid memory address or nil pointer dereference\ngoroutine 153 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:26 +0x5e\ngithub.com/zalando-incubator/stackset-controller/controller.(*StackSetController).ReconcileStackSet.func1()\n\t/workspace/controller/stackset.go:1089 +0x1e5\npanic({0x1c556c0?, 0x32f91b0?})\n\t/usr/local/go/src/runtime/panic.go:783 +0x132\ngithub.com/zalando-incubator/stackset-controller/controller.(*StackSetController).ReconcileStackDeployment(0xc000000540, {0x2237130, 0xc000109b30}, 0xc000245808, 0x0, 0x0?)\n\t/workspace/controller/stack_resources.go:49 +0x3c1\ngithub.com/zalando-incubator/stackset-controller/controller.(*StackSetController).ReconcileStackResources(0xc000000540, {0x2237130, 0xc000109b30}, 0xc0004aee80?, 0xc0000c7ba0)\n\t/workspace/controller/stackset.go:1065 +0x265\ngithub.com/zalando-incubator/stackset-controller/controller.(*StackSetController).ReconcileStackSet(0xc000000540, {0x2237130, 0xc000109b30}, 0xc0004aee80)\n\t/workspace/controller/stackset.go:1155 +0xb29\ngithub.com/zalando-incubator/stackset-controller/controller.(*StackSetController).Run.func2()\n\t/workspace/controller/stackset.go:188 +0xa5\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\t/go/pkg/mod/golang.org/x/sync@v0.17.0/errgroup/errgroup.go:93 +0x50\ncreated by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1\n\t/go/pkg/mod/golang.org/x/sync@v0.17.0/errgroup/errgroup.go:78 +0x95\n" controller=stackset namespace=default stackset=test-migration-app
time="2025-11-17T20:27:43Z" level=error msg="unable to reconcile a stackset: panic: runtime error: invalid memory address or nil pointer dereference" controller=stackset namespace=default stackset=test-migration-app
time="2025-11-17T20:27:43Z" level=error msg="Failed waiting for reconcilers: panic: runtime error: invalid memory address or nil pointer dereference" controller=stackset
time="2025-11-17T20:27:43Z" level=info msg="Event(v1.ObjectReference{Kind:\"StackSet\", Namespace:\"default\", Name:\"test-migration-app\", UID:\"11a0c501-1333-4c7a-bb8a-3b58eb68156e\", APIVersion:\"zalando.org/v1\", ResourceVersion:\"2692037662\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedManageStackSet' panic: runtime error: invalid memory address or nil pointer dereference"

This was fixed by the commit after the comment

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@zalando-robot

Copy link
Copy Markdown

Deployment Checklist

This change falls under the deployment policy.

💁 Since Nov 10th, we are in the RED deployment zone. This means all changes released to production must adhere to the following requirements:

  • Detailed release notes are provided in this PR’s description.
  • Thorough load-testing has been performed, and is documented in the description/comment.
  • You can enable/disable the change via feature toggles, and have confirmed these toggles work as expected.
  • Technical review: A Principal Engineer, Engineering Manager or Head of Engineering have green-lit your changes, and the reviewer is named in the description/comments.
  • Application Owner (Director+) approval is given about the PR, and the approver is named in the description/comments.

👉 Regardless of which boxes you click in this comment, merge/deployment will not be blocked.
Reports about deployment policy adherence will be circulated daily.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
…cluster migration patched routegroups

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@szuecs szuecs requested review from linki and mikkeloscar November 20, 2025 17:17
@szuecs

szuecs commented Nov 20, 2025

Copy link
Copy Markdown
Member Author

I am done with this PR form my side. It was tested in pet cluster with zkubectl traffic.

@linki

linki commented Dec 9, 2025

Copy link
Copy Markdown
Member

👍

Comment thread pkg/core/stackset.go
@tcondeixa

Copy link
Copy Markdown

👍

Comment thread pkg/core/stackset.go Outdated
Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
Comment thread pkg/apis/zalando.org/v1/types.go Outdated
// +k8s:deepcopy-gen=true
type StackSetExternalIngressSpec struct {
BackendPort intstr.IntOrString `json:"backendPort"`
EmbeddedObjectMetaWithAnnotations `json:"metadata,omitempty"`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was not used in the end because the annotation is set at stackset level and doesn't need to be at externalIngress level.

So we can drop this and re-generate to not change the CRD.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I will drop it

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: fabricgateway will get its own annotation similar to the stackset annotation

@mikkeloscar

Copy link
Copy Markdown
Contributor

Can we add an e2e test which shows that when the annotation is set it doesn't create stack resources (like deployment etc.) on the stacks which are created after the annotation is set?

@szuecs

szuecs commented Dec 15, 2025

Copy link
Copy Markdown
Member Author

Can we add an e2e test which shows that when the annotation is set it doesn't create stack resources (like deployment etc.) on the stacks which are created after the annotation is set?

we create a deployment, because everything relies on having a deployment.
I think e2e test should cover a proper migration call path, but that's not really easily possible with our current infrastructure for e2e tests, so I would say e2e is out of scope, because e2e case should cover the 80% value case and not all edge cases.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@szuecs

szuecs commented Dec 15, 2025

Copy link
Copy Markdown
Member Author

👍

@mikkeloscar

Copy link
Copy Markdown
Contributor

Can we add an e2e test which shows that when the annotation is set it doesn't create stack resources (like deployment etc.) on the stacks which are created after the annotation is set?

we create a deployment, because everything relies on having a deployment. I think e2e test should cover a proper migration call path, but that's not really easily possible with our current infrastructure for e2e tests, so I would say e2e is out of scope, because e2e case should cover the 80% value case and not all edge cases.

I think it's possible without too much effort, the existing e2e framework provides most of it.

But let's iterate on this in a separate PR, we likely need to iterate a bit on this to make it all work with CDP etc.

@mikkeloscar

Copy link
Copy Markdown
Contributor

👍

@mikkeloscar mikkeloscar merged commit 55467ba into master Dec 15, 2025
8 checks passed
@mikkeloscar mikkeloscar deleted the feature/annotation-migration-support branch December 15, 2025 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants