DISCLAIMER: Wasn't sure where to file this; it somewhat seems like a bug, but it may be more of a feature request.
Current behavior
There are a couple of things at play here that all revolve around the status of a deployed resource and the behavior on failure. I refer to `ConsoleDefender` in most of this issue since that is what I am working with, but I assume these issues exist (and the improvements could be made) for the other objects as well.
When I deploy a `ConsoleDefender` and it fails on one of the tasks (for example, "Create Defender YAML file"), the operator outputs the failure log, then starts running the tasks again from the start. This can make it difficult to parse/follow the log output.
Additionally, there is no way (that I can see) to grab a status from the `ConsoleDefender` object; the only indication of failure for most tasks is the logs, which, as mentioned, are tricky to follow.
Finally, there appears to be poor handling of required values. I dug myself into a lot of holes while trying to get an initial deployment going because I was missing one of the required pieces of the spec, but no validation at `kubectl apply` time blocked me from applying it.
Steps to reproduce
A couple of scenarios to try:
The "poor validation" issue:
1. Deploy the operator
2. Deploy a `ConsoleDefender` that is missing the `orchestrator` value
3. Notice that, on the surface, everything in your cluster looks fine: the `ConsoleDefender` got created despite missing pieces of the spec (see the sketch below)
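To make this concrete, here is a minimal sketch of the kind of manifest I mean. The group/version and every spec field other than `orchestrator` are assumptions for illustration, not copied from a real deployment:

```yaml
# Hypothetical ConsoleDefender manifest; only the missing `orchestrator`
# field is the point of this example.
apiVersion: pcc.paloaltonetworks.com/v1alpha1
kind: ConsoleDefender
metadata:
  name: consoledefender-sample
  namespace: twistlock
spec:
  version: "22.01"
  # orchestrator: kubernetes   # required by the operator, but omitted here;
  # `kubectl apply` accepts the object anyway, because the CRD schema does
  # not declare the field as required.
```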
The "lack of status" issue:
1. Deploy the operator
2. Deploy a `ConsoleDefender` that is missing the `accessToken` value
3. Notice that again, on the surface, everything looks fine. This time the Console pod goes to Running as well, which might mislead someone into thinking everything is working despite the admin account/license not being set up.
4. Validate that there is no way to check status aside from the logs (I did a YAML dump of the `ConsoleDefender` and there is no status field; see the sketch below)
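For reference, this is roughly what that dump looks like (trimmed, and the group/version is an assumption, as above):

```yaml
# Output of: kubectl get consoledefender consoledefender-sample -o yaml
apiVersion: pcc.paloaltonetworks.com/v1alpha1
kind: ConsoleDefender
metadata:
  name: consoledefender-sample
  namespace: twistlock
spec:
  orchestrator: kubernetes
  # accessToken omitted, so the admin account/license setup fails
# ...and the object ends here: there is no `status:` stanza, so neither
# `kubectl get` nor `kubectl describe` gives any indication of the failure.
```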
The "messy logs" issue:
1. Follow steps 1-4 of the "lack of status" scenario above
2. Follow the logs of the operator pod
3. You should see that it continually loops, repeating all of the tasks despite the failure
Possible solutions
A couple of suggestions:
- Expose a `status` field on the `ConsoleDefender` that reports the current task and, when necessary, the most recent failed task plus the error "log" (see the first sketch after this list)
- Extend the current validation on the CRDs to mark as `required` any fields that the operator requires and whose absence will cause creation to fail (see the second sketch after this list)
- Implement better failure handling in the operator: the operator should have some knowledge of which failures can be fixed with a retry and which should result in a pause until the `ConsoleDefender` is updated (the `retryable` marker in the first sketch below gestures at this). Example: if the operator fails to deploy the license because it is invalid, it should not retry. For other failures where a retry might work, the operator should be smarter about how it retries: rather than repeating all tasks, it should (in general) retry just the failed task, unless the `ConsoleDefender` has changed.
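Roughly what I have in mind for the first and third suggestions, sketched below; every field name here is hypothetical, not a naming proposal:

```yaml
status:
  phase: Failed                            # e.g. Pending / Running / Failed
  currentTask: "Create Defender YAML file"
  lastFailedTask: "Create Defender YAML file"
  lastError: "<error output from the failed task>"
  retryable: false                         # terminal failure (e.g. invalid
                                           # license): pause reconciliation
                                           # until the spec changes
```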
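For the second suggestion, CRD structural schemas already support this via `required`; a sketch follows (the schema shape is illustrative, and only `orchestrator` and `accessToken` are taken from this issue):

```yaml
# Fragment of the CRD spec: with `required` set, `kubectl apply` rejects a
# ConsoleDefender that omits these fields instead of silently accepting it.
versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            required:
              - orchestrator
              - accessToken
            properties:
              orchestrator:
                type: string
              accessToken:
                type: string
```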