Skip to content

fix: have all flavors delete themselves even in failure#1508

Closed
tommartensen wants to merge 1 commit into
masterfrom
tm/hotfix-rosa-hcp-deletion
Closed

fix: have all flavors delete themselves even in failure#1508
tommartensen wants to merge 1 commit into
masterfrom
tm/hotfix-rosa-hcp-deletion

Conversation

@tommartensen
Copy link
Copy Markdown
Contributor

Some already did that. Some already clean up if they fail during create (like OCP). Still better to be consistent. This is a patch on an issue that should be solved properly.

@tommartensen tommartensen self-assigned this Mar 5, 2025
@tommartensen tommartensen marked this pull request as ready for review March 5, 2025 12:04
@tommartensen tommartensen requested a review from a team as a code owner March 5, 2025 12:04
@rhacs-bot
Copy link
Copy Markdown
Contributor

A single node development cluster (infra-pr-1508) was allocated in production infra for this PR.

CI will attempt to deploy quay.io/rhacs-eng/infra-server:0.10.86-1-gdaee46d474 to it.

🔌 You can connect to this cluster with:

gcloud container clusters get-credentials infra-pr-1508 --zone us-central1-a --project acs-team-temp-dev

🛠️ And pull infractl from the deployed dev infra-server with:

nohup kubectl -n infra port-forward svc/infra-server-service 8443:8443 &
make pull-infractl-from-dev-server

🚲 You can then use the dev infra instance e.g.:

bin/infractl -k -e localhost:8443 whoami

⚠️ Any clusters that you start using your dev infra instance should have a lifespan shorter then the development cluster instance. Otherwise they will not be destroyed when the dev infra instance ceases to exist when the development cluster is deleted. ⚠️

Further Development

☕ If you make changes, you can commit and push and CI will take care of updating the development cluster.

🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with:

make helm-deploy

Logs

Logs for the development infra depending on your @redhat.com authuser:

Or:

kubectl -n infra logs -l app=infra-server --tail=1 -f

Copy link
Copy Markdown
Contributor

@davdhacs davdhacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tldr: I don't know enough about argo cd to know if we should add onExit now. On some create failures, do the workflows exit and never run the destroy step and delete the workflow before the cleanup runs?

detail:
There may be reasons not to add this?
onExit's were removed here from infra workflows, #320
And the argocd code marks onExit as deprecated (https://github.com/argoproj/argo-workflows/blame/68fde4ffcbe9b84dfad969a45133cd4cc659d346/pkg/apis/workflow/v1alpha1/workflow_types.go#L1573).

The infra periodic cleanup code *should find clusters that failed, argocd workflow paused, and resume those workflows to run their destroy steps (added in https://github.com/stackrox/infra/pull/47/files#diff-c18c4c95a88e0fc7894470522a609f99R639 and scheduled at https://github.com/stackrox/infra/blame/master/service/cluster/cluster.go#L131).
Maybe argocd has changed in some way that this is not valid or not always working?

@tommartensen
Copy link
Copy Markdown
Contributor Author

tommartensen commented Mar 26, 2025

We could prevent the issues from #320 by

  1. linting the workflows
  2. running our smoke tests against the PR cluster, which would create a cluster with each workflow and also run the onExit handler.

The deprecation you found only refers to a specific onExit handler in https://github.com/argoproj/argo-workflows/blob/main/pkg/apis/workflow/v1alpha1/workflow_types.go#L1563-L1567, example:

spec:
  templates:
    - name: start
      steps:
        - - name: create
            template: create
            onExit: another-template

I don't think onExit was generally deprecated, because it is still quite prominent on https://argo-workflows.readthedocs.io/en/latest/walk-through/exit-handlers/ and in code https://github.com/argoproj/argo-workflows/blob/main/pkg/apis/workflow/v1alpha1/workflow_types.go#L349-L352 to be used like

spec:
  entrypoint: start
  onExit: destroy

I will check the second part of your reply (infra periodic cleanup code) later.

@tommartensen
Copy link
Copy Markdown
Contributor Author

The second part of your question:

The infra periodic cleanup code *should find clusters that failed, argocd workflow paused, and resume those workflows to run their destroy steps (added in https://github.com/stackrox/infra/pull/47/files#diff-c18c4c95a88e0fc7894470522a609f99R639 and scheduled at https://github.com/stackrox/infra/blame/master/service/cluster/cluster.go#L131).
Maybe argocd has changed in some way that this is not valid or not always working?

#47 is a good find, but here is the actual behavior, as I found out with the debugger in a test cluster:

  1. Normal lifecycle:
    a. workflow skips the loop, because not READY.
    a. workflow is READY, goes to expiration check, is not expired, skips the loop
    a. workflow is READY and expired, loop resumes it, destroy step runs and workflow is eventually FINISHED. This is the same as when we delete a cluster, because in that case we only set the remaining lifetime to 0 and resume it.

  2. Failed during create
    a. workflows skips the loop, because not READY.
    a. workflow fails eventually.
    a. workflow still skips the loop, also the expiration is not checked, because it's already skipping before. but argo workflow controller knows it's a failed workflow and deletes it after .spec.ttlStrategy.secondsAfterFailure, which is 30 days. The destroy step never runs.

I tried to "Resume" a failed workflow, it doesn't do anything, ie doesn't start the destroy step.

How I observe it happening with onExit:

Example workflow

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: simulate-
spec:
  entrypoint: start
  onExit: stop
  arguments:
    parameters:
      - name: name
      - name: create-delay-seconds
        value: ""
      - name: create-outcome
        value: ""
      - name: destroy-delay-seconds
        value: ""
      - name: destroy-outcome
        value: ""

  templates:
    - name: start
      steps:
        - - name: create
            template: simulate
            arguments:
              parameters:
              - name: delay-seconds
                value: '{{ "{{" }}workflow.parameters.create-delay-seconds{{ "}}" }}'
              - name: outcome
                value: '{{ "{{" }}workflow.parameters.create-outcome{{ "}}" }}'
        - - name: wait
            template: wait

    - name: stop
      steps:
        - - name: destroy
            template: simulate
            arguments:
              parameters:
              - name: delay-seconds
                value: '{{ "{{" }}workflow.parameters.destroy-delay-seconds{{ "}}" }}'
              - name: outcome
                value: '{{ "{{" }}workflow.parameters.destroy-outcome{{ "}}" }}'

    - name: simulate
      inputs:
        parameters:
        - name: delay-seconds
        - name: outcome
      script:
        image: debian:9.4
        command: [bash]
        source: |
          set -x
          start=0
          while sleep 1; do
            if [[ $((start++)) -ge {{ "{{" }}inputs.parameters.delay-seconds{{ "}}" }} ]]; then
              break
            fi
          done
          [[ "{{ "{{" }}inputs.parameters.outcome{{ "}}" }}" == "success" ]] || exit 1

    - name: wait
      suspend: {}
  1. Normal lifecycle:
    a. workflow starts and server skips the loop, because not READY.
    a. workflow is READY and in wait stage where it is suspended, goes to expiration check, is not expired, skips the loop
    a. workflow is READY and expired, loop resumes it, runs the onExit handler: the destroy step runs and workflow is eventually FINISHED. This is the same as when we delete a cluster, because in that case we only set the remaining lifetime to 0 and resume it.

  2. Failed during create
    a. workflows skips the loop, because not READY.
    a. workflow fails eventually. The onExit handler runs the destroy step and the workflow is still marked as FAILED and logs available.
    a. workflow still skips the loop, also the expiration is not checked, because it's already skipping before. Argo workflow controller knows it's a failed workflow and deletes it after .spec.ttlStrategy.secondsAfterFailure, which is 30 days. The destroy step has already run as soon as the workflow failed.

@tommartensen
Copy link
Copy Markdown
Contributor Author

tommartensen commented Mar 27, 2025

IMO we should take another attempt on this by

  1. updating test-simulate workflow
  2. enhancing the e2e test to
    a. ensure that destroy step runs
    i. when a creation fails
    i. part of the usual lifecycle.
    a. cleanupExpiredCluster actually does what it's supposed to (resume workflows that are expired).
  3. Later, update all other workflows.

I think this is the cleanest solution to prevent leakages and we should pursue it, because the DELETE runs in the same context (environment and time) as the failed CREATE and has all data from previous steps (e.g. the dotenv files still available.

@tommartensen
Copy link
Copy Markdown
Contributor Author

One more thought: why don't we see more leaked clusters?
A: We see them frequently! Apart from clusters created in infra PR clusters, which delete before all children are dead, in my experience the most common reason is that the creation failed and there is no cleanup running in the creation step itself, like OCP 4 does. There, we even have a toggle to disable the cleanup (keep failed cluster) for investigation.

@tommartensen tommartensen marked this pull request as draft March 27, 2025 15:11
@tommartensen
Copy link
Copy Markdown
Contributor Author

Superseded by #1543 .

@tommartensen tommartensen deleted the tm/hotfix-rosa-hcp-deletion branch April 24, 2025 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants