fix: have all flavors delete themselves even in failure by tommartensen · Pull Request #1508 · stackrox/infra

tommartensen · 2025-03-05T12:04:11Z

Some already did that. Some already clean up if they fail during create (like OCP). Still better to be consistent. This is a patch on an issue that should be solved properly.

rhacs-bot · 2025-03-05T12:12:44Z

A single node development cluster (infra-pr-1508) was allocated in production infra for this PR.

CI will attempt to deploy quay.io/rhacs-eng/infra-server:0.10.86-1-gdaee46d474 to it.

🔌 You can connect to this cluster with:

gcloud container clusters get-credentials infra-pr-1508 --zone us-central1-a --project acs-team-temp-dev

🛠️ And pull infractl from the deployed dev infra-server with:

nohup kubectl -n infra port-forward svc/infra-server-service 8443:8443 &
make pull-infractl-from-dev-server

🚲 You can then use the dev infra instance e.g.:

bin/infractl -k -e localhost:8443 whoami

⚠️ Any clusters that you start using your dev infra instance should have a lifespan shorter then the development cluster instance. Otherwise they will not be destroyed when the dev infra instance ceases to exist when the development cluster is deleted. ⚠️

Further Development

☕ If you make changes, you can commit and push and CI will take care of updating the development cluster.

🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with:

make helm-deploy

Logs

Logs for the development infra depending on your @redhat.com authuser:

Or:

kubectl -n infra logs -l app=infra-server --tail=1 -f

davdhacs

tldr: I don't know enough about argo cd to know if we should add onExit now. On some create failures, do the workflows exit and never run the destroy step and delete the workflow before the cleanup runs?

detail:
There may be reasons not to add this?
onExit's were removed here from infra workflows, #320
And the argocd code marks onExit as deprecated (https://github.com/argoproj/argo-workflows/blame/68fde4ffcbe9b84dfad969a45133cd4cc659d346/pkg/apis/workflow/v1alpha1/workflow_types.go#L1573).

The infra periodic cleanup code *should find clusters that failed, argocd workflow paused, and resume those workflows to run their destroy steps (added in https://github.com/stackrox/infra/pull/47/files#diff-c18c4c95a88e0fc7894470522a609f99R639 and scheduled at https://github.com/stackrox/infra/blame/master/service/cluster/cluster.go#L131).
Maybe argocd has changed in some way that this is not valid or not always working?

tommartensen · 2025-03-26T17:08:46Z

We could prevent the issues from #320 by

linting the workflows
running our smoke tests against the PR cluster, which would create a cluster with each workflow and also run the onExit handler.

The deprecation you found only refers to a specific onExit handler in https://github.com/argoproj/argo-workflows/blob/main/pkg/apis/workflow/v1alpha1/workflow_types.go#L1563-L1567, example:

spec:
  templates:
    - name: start
      steps:
        - - name: create
            template: create
            onExit: another-template

I don't think onExit was generally deprecated, because it is still quite prominent on https://argo-workflows.readthedocs.io/en/latest/walk-through/exit-handlers/ and in code https://github.com/argoproj/argo-workflows/blob/main/pkg/apis/workflow/v1alpha1/workflow_types.go#L349-L352 to be used like

spec:
  entrypoint: start
  onExit: destroy

I will check the second part of your reply (infra periodic cleanup code) later.

tommartensen · 2025-03-27T13:21:34Z

The second part of your question:

The infra periodic cleanup code *should find clusters that failed, argocd workflow paused, and resume those workflows to run their destroy steps (added in https://github.com/stackrox/infra/pull/47/files#diff-c18c4c95a88e0fc7894470522a609f99R639 and scheduled at https://github.com/stackrox/infra/blame/master/service/cluster/cluster.go#L131).
Maybe argocd has changed in some way that this is not valid or not always working?

#47 is a good find, but here is the actual behavior, as I found out with the debugger in a test cluster:

Normal lifecycle:
a. workflow skips the loop, because not READY.
a. workflow is READY, goes to expiration check, is not expired, skips the loop
a. workflow is READY and expired, loop resumes it, destroy step runs and workflow is eventually FINISHED. This is the same as when we delete a cluster, because in that case we only set the remaining lifetime to 0 and resume it.
Failed during create
a. workflows skips the loop, because not READY.
a. workflow fails eventually.
a. workflow still skips the loop, also the expiration is not checked, because it's already skipping before. but argo workflow controller knows it's a failed workflow and deletes it after .spec.ttlStrategy.secondsAfterFailure, which is 30 days. The destroy step never runs.

I tried to "Resume" a failed workflow, it doesn't do anything, ie doesn't start the destroy step.

How I observe it happening with onExit:

Example workflow

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: simulate-
spec:
  entrypoint: start
  onExit: stop
  arguments:
    parameters:
      - name: name
      - name: create-delay-seconds
        value: ""
      - name: create-outcome
        value: ""
      - name: destroy-delay-seconds
        value: ""
      - name: destroy-outcome
        value: ""

  templates:
    - name: start
      steps:
        - - name: create
            template: simulate
            arguments:
              parameters:
              - name: delay-seconds
                value: '{{ "{{" }}workflow.parameters.create-delay-seconds{{ "}}" }}'
              - name: outcome
                value: '{{ "{{" }}workflow.parameters.create-outcome{{ "}}" }}'
        - - name: wait
            template: wait

    - name: stop
      steps:
        - - name: destroy
            template: simulate
            arguments:
              parameters:
              - name: delay-seconds
                value: '{{ "{{" }}workflow.parameters.destroy-delay-seconds{{ "}}" }}'
              - name: outcome
                value: '{{ "{{" }}workflow.parameters.destroy-outcome{{ "}}" }}'

    - name: simulate
      inputs:
        parameters:
        - name: delay-seconds
        - name: outcome
      script:
        image: debian:9.4
        command: [bash]
        source: |
          set -x
          start=0
          while sleep 1; do
            if [[ $((start++)) -ge {{ "{{" }}inputs.parameters.delay-seconds{{ "}}" }} ]]; then
              break
            fi
          done
          [[ "{{ "{{" }}inputs.parameters.outcome{{ "}}" }}" == "success" ]] || exit 1

    - name: wait
      suspend: {}

Normal lifecycle:
a. workflow starts and server skips the loop, because not READY.
a. workflow is READY and in wait stage where it is suspended, goes to expiration check, is not expired, skips the loop
a. workflow is READY and expired, loop resumes it, runs the onExit handler: the destroy step runs and workflow is eventually FINISHED. This is the same as when we delete a cluster, because in that case we only set the remaining lifetime to 0 and resume it.
Failed during create
a. workflows skips the loop, because not READY.
a. workflow fails eventually. The onExit handler runs the destroy step and the workflow is still marked as FAILED and logs available.
a. workflow still skips the loop, also the expiration is not checked, because it's already skipping before. Argo workflow controller knows it's a failed workflow and deletes it after .spec.ttlStrategy.secondsAfterFailure, which is 30 days. The destroy step has already run as soon as the workflow failed.

tommartensen · 2025-03-27T13:24:59Z

IMO we should take another attempt on this by

updating test-simulate workflow
enhancing the e2e test to
a. ensure that destroy step runs
i. when a creation fails
i. part of the usual lifecycle.
a. cleanupExpiredCluster actually does what it's supposed to (resume workflows that are expired).
Later, update all other workflows.

I think this is the cleanest solution to prevent leakages and we should pursue it, because the DELETE runs in the same context (environment and time) as the failed CREATE and has all data from previous steps (e.g. the dotenv files still available.

tommartensen · 2025-03-27T15:08:11Z

One more thought: why don't we see more leaked clusters?
A: We see them frequently! Apart from clusters created in infra PR clusters, which delete before all children are dead, in my experience the most common reason is that the creation failed and there is no cleanup running in the creation step itself, like OCP 4 does. There, we even have a toggle to disable the cleanup (keep failed cluster) for investigation.

tommartensen · 2025-04-24T09:38:09Z

Superseded by #1543 .

fix: have all flavors delete themselves even in failure

daee46d

tommartensen self-assigned this Mar 5, 2025

tommartensen marked this pull request as ready for review March 5, 2025 12:04

tommartensen requested a review from a team as a code owner March 5, 2025 12:04

davdhacs requested changes Mar 6, 2025

View reviewed changes

tommartensen marked this pull request as draft March 27, 2025 15:11

tommartensen closed this Apr 24, 2025

tommartensen deleted the tm/hotfix-rosa-hcp-deletion branch April 24, 2025 09:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: have all flavors delete themselves even in failure#1508

fix: have all flavors delete themselves even in failure#1508
tommartensen wants to merge 1 commit into
masterfrom
tm/hotfix-rosa-hcp-deletion

tommartensen commented Mar 5, 2025

Uh oh!

rhacs-bot commented Mar 5, 2025

Uh oh!

davdhacs left a comment •

edited

Loading

Uh oh!

tommartensen commented Mar 26, 2025 •

edited

Loading

Uh oh!

tommartensen commented Mar 27, 2025

Uh oh!

tommartensen commented Mar 27, 2025 •

edited

Loading

Uh oh!

tommartensen commented Mar 27, 2025

Uh oh!

tommartensen commented Apr 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tommartensen commented Mar 5, 2025

Uh oh!

rhacs-bot commented Mar 5, 2025

Further Development

Logs

Uh oh!

davdhacs left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tommartensen commented Mar 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tommartensen commented Mar 27, 2025

How I observe it happening with onExit:

Example workflow

Uh oh!

tommartensen commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tommartensen commented Mar 27, 2025

Uh oh!

tommartensen commented Apr 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

davdhacs left a comment •

edited

Loading

tommartensen commented Mar 26, 2025 •

edited

Loading

tommartensen commented Mar 27, 2025 •

edited

Loading