fix: have all flavors delete themselves even in failure#1508
fix: have all flavors delete themselves even in failure#1508tommartensen wants to merge 1 commit into
Conversation
|
A single node development cluster (infra-pr-1508) was allocated in production infra for this PR. CI will attempt to deploy 🔌 You can connect to this cluster with: 🛠️ And pull infractl from the deployed dev infra-server with: 🚲 You can then use the dev infra instance e.g.: Further Development☕ If you make changes, you can commit and push and CI will take care of updating the development cluster. 🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with: LogsLogs for the development infra depending on your @redhat.com authuser: Or: |
There was a problem hiding this comment.
tldr: I don't know enough about argo cd to know if we should add onExit now. On some create failures, do the workflows exit and never run the destroy step and delete the workflow before the cleanup runs?
detail:
There may be reasons not to add this?
onExit's were removed here from infra workflows, #320
And the argocd code marks onExit as deprecated (https://github.com/argoproj/argo-workflows/blame/68fde4ffcbe9b84dfad969a45133cd4cc659d346/pkg/apis/workflow/v1alpha1/workflow_types.go#L1573).
The infra periodic cleanup code *should find clusters that failed, argocd workflow paused, and resume those workflows to run their destroy steps (added in https://github.com/stackrox/infra/pull/47/files#diff-c18c4c95a88e0fc7894470522a609f99R639 and scheduled at https://github.com/stackrox/infra/blame/master/service/cluster/cluster.go#L131).
Maybe argocd has changed in some way that this is not valid or not always working?
|
We could prevent the issues from #320 by
The deprecation you found only refers to a specific onExit handler in https://github.com/argoproj/argo-workflows/blob/main/pkg/apis/workflow/v1alpha1/workflow_types.go#L1563-L1567, example: spec:
templates:
- name: start
steps:
- - name: create
template: create
onExit: another-templateI don't think spec:
entrypoint: start
onExit: destroyI will check the second part of your reply (infra periodic cleanup code) later. |
|
The second part of your question:
#47 is a good find, but here is the actual behavior, as I found out with the debugger in a test cluster:
I tried to "Resume" a failed workflow, it doesn't do anything, ie doesn't start the destroy step. How I observe it happening with onExit:Example workflowapiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: simulate-
spec:
entrypoint: start
onExit: stop
arguments:
parameters:
- name: name
- name: create-delay-seconds
value: ""
- name: create-outcome
value: ""
- name: destroy-delay-seconds
value: ""
- name: destroy-outcome
value: ""
templates:
- name: start
steps:
- - name: create
template: simulate
arguments:
parameters:
- name: delay-seconds
value: '{{ "{{" }}workflow.parameters.create-delay-seconds{{ "}}" }}'
- name: outcome
value: '{{ "{{" }}workflow.parameters.create-outcome{{ "}}" }}'
- - name: wait
template: wait
- name: stop
steps:
- - name: destroy
template: simulate
arguments:
parameters:
- name: delay-seconds
value: '{{ "{{" }}workflow.parameters.destroy-delay-seconds{{ "}}" }}'
- name: outcome
value: '{{ "{{" }}workflow.parameters.destroy-outcome{{ "}}" }}'
- name: simulate
inputs:
parameters:
- name: delay-seconds
- name: outcome
script:
image: debian:9.4
command: [bash]
source: |
set -x
start=0
while sleep 1; do
if [[ $((start++)) -ge {{ "{{" }}inputs.parameters.delay-seconds{{ "}}" }} ]]; then
break
fi
done
[[ "{{ "{{" }}inputs.parameters.outcome{{ "}}" }}" == "success" ]] || exit 1
- name: wait
suspend: {}
|
|
IMO we should take another attempt on this by
I think this is the cleanest solution to prevent leakages and we should pursue it, because the DELETE runs in the same context (environment and time) as the failed CREATE and has all data from previous steps (e.g. the |
|
One more thought: why don't we see more leaked clusters? |
|
Superseded by #1543 . |
Some already did that. Some already clean up if they fail during create (like OCP). Still better to be consistent. This is a patch on an issue that should be solved properly.