|
| 1 | +--- |
| 2 | +title: "20260319 - Differentiated exit codes for the Kosli CLI" |
| 3 | +description: "Differentiate the existing uniform exit-1-on-all-errors into structured exit codes that distinguish compliance failures, server errors, config errors, and usage errors" |
| 4 | +status: "Proposed" |
| 5 | +date: "2026-03-19" |
| 6 | +--- |
| 7 | + |
| 8 | +# 20260319 - Differentiated exit codes for the Kosli CLI |
| 9 | + |
| 10 | +## Overview |
| 11 | + |
| 12 | +Introduce structured, semantic exit codes to the Kosli CLI so that callers (CI/CD pipelines, shell scripts) can distinguish between a compliance/policy violation and other failure modes without parsing stderr output. |
| 13 | + |
| 14 | +## Context |
| 15 | + |
| 16 | +Prior to this change, the Kosli CLI exits with code `1` for **all** errors. This is because `logger.Error()` in `internal/logger/logger.go` calls `log.Fatalf()`, which unconditionally calls `os.Exit(1)`: |
| 17 | + |
| 18 | +```go |
| 19 | +// internal/logger/logger.go — current behaviour |
| 20 | +func (l *Logger) Error(format string, v ...interface{}) { |
| 21 | + format = fmt.Sprintf("Error: %s\n", format) |
| 22 | + l.errLog.Fatalf(format, v...) // always exits with code 1 |
| 23 | +} |
| 24 | +``` |
| 25 | + |
| 26 | +In `cmd/kosli/main.go`, every error flows through this path: |
| 27 | + |
| 28 | +```go |
| 29 | +if err != nil { |
| 30 | + logger.Error(err.Error()) // exits 1 via Fatalf — always |
| 31 | +} |
| 32 | +``` |
| 33 | + |
| 34 | +This means: |
| 35 | +- A script running `kosli assert artifact` gets exit 1 whether the artifact is non-compliant, the server is down, or the API token is wrong. |
| 36 | +- A CI pipeline cannot distinguish a network outage (retry-able) from a policy denial (actionable) from a bad token (fix config). |
| 37 | +- The only way to differentiate failures is to parse stderr output. |
| 38 | + |
| 39 | +## Decision Drivers |
| 40 | + |
| 41 | +- CI/CD pipelines need to fail fast on compliance violations without parsing stderr |
| 42 | +- Operators need to distinguish transient server failures (retry) from auth failures (fix token) from compliance violations (fix the artifact/policy) |
| 43 | +- Shell scripts using `$?` should be able to act on the specific class of failure |
| 44 | +- The change should preserve the existing exit-1 behaviour for compliance failures (the most common case) to minimise disruption |
| 45 | + |
| 46 | +## Options Considered |
| 47 | + |
| 48 | +### Option A: Keep uniform exit 1 for all errors (status quo) |
| 49 | +- **Pro**: No change; no migration needed |
| 50 | +- **Con**: Cannot distinguish failure types without parsing stderr; limits CI/CD automation |
| 51 | + |
| 52 | +### Option B: Structured exit codes — 0/1/2/3/4 with semantic meaning (chosen) |
| 53 | +- **Pro**: Rich signalling; enables CI/CD to route failures to the right handler; aligns with how tools like `git`, `curl`, and policy engines behave |
| 54 | +- **Con**: Scripts checking `[ $? -eq 1 ]` as a proxy for "any error" will miss exit codes 2, 3, 4 |
| 55 | +- **Mitigation**: Exit 1 remains the most common failure code (compliance violations AND the default for unclassified errors), so most existing `[ $? -eq 1 ]` checks will continue to work for the dominant use case |
| 56 | + |
| 57 | +## Decision |
| 58 | + |
| 59 | +**Option B — structured exit codes.** |
| 60 | + |
| 61 | +| Code | Meaning | Error type | |
| 62 | +|------|---------|------------| |
| 63 | +| 0 | Success | — | |
| 64 | +| 1 | Compliance/policy violation (also: unclassified errors) | `ErrCompliance` | |
| 65 | +| 2 | Server unreachable or 5xx | `ErrServer` | |
| 66 | +| 3 | Auth/config error (401/403) | `ErrConfig` | |
| 67 | +| 4 | CLI usage error (unknown flag, missing required flag, wrong arg count) | `ErrUsage` | |
| 68 | + |
| 69 | +Unknown/unclassified errors fall back to exit 1 (same as before). |
| 70 | + |
| 71 | +Commands affected: |
| 72 | +- **Assert commands** (8): `assert artifact`, `assert approval`, `assert snapshot`, `assert pullrequest *` — exit 1 on compliance failure (unchanged) |
| 73 | +- **Attest commands with `--assert`** (5): `attest pullrequest *`, `attest jira` — exit 1 when assertion fails (unchanged) |
| 74 | +- **Evaluate commands** (2): `evaluate trail`, `evaluate trails` — exit 1 when policy denies (unchanged) |
| 75 | +- **All commands**: exit 2 on server/network failure, exit 3 on 401/403, exit 4 on usage errors (NEW — previously these all exited 1) |
| 76 | +- Special case: `assert status` exits 0 (responsive) or 2 (unreachable) — never 1 |
| 77 | + |
| 78 | +## Consequences |
| 79 | + |
| 80 | +### Positive |
| 81 | +- Compliance assertions remain usable as hard gates (exit 1, same as before) |
| 82 | +- Operators can write scripts that retry on exit 2, fix config on exit 3, and triage policy violations on exit 1 |
| 83 | +- Exit code meaning is documented per-command in generated CLI docs (via `commandExitCodes` in `cmd/kosli/docs.go`) |
| 84 | +- Dry-run mode is preserved: errors in `--dry-run` still exit 0 |
| 85 | + |
| 86 | +### Negative |
| 87 | +- Scripts checking `[ $? -eq 1 ]` as a proxy for "any error" will stop catching server errors (now exit 2), auth errors (now exit 3), and usage errors (now exit 4). Scripts using `[ $? -ne 0 ]` are unaffected. |
| 88 | +- Whether this warrants a major version bump is debatable — exit code 1 was never documented as meaning "any error", but it was the de facto implicit contract via `Fatalf`. |
| 89 | + |
| 90 | +### Neutral |
| 91 | +- Cobra built-in usage errors (unknown flags, missing required flags, wrong argument count) are auto-detected via string pattern matching and classified as `ErrUsage` without requiring explicit wrapping in each command. |
| 92 | +- HTTP 4xx errors other than 401/403 (e.g. 400, 404) return plain `fmt.Errorf` and fall through to exit 1 — consistent with treating them as data/compliance failures. |
| 93 | + |
| 94 | +## Implementation |
| 95 | + |
| 96 | +Error types are thin wrappers in `internal/errors/errors.go`: |
| 97 | + |
| 98 | +```go |
| 99 | +type ErrCompliance struct{ msg string } |
| 100 | +type ErrServer struct{ msg string } |
| 101 | +type ErrConfig struct{ msg string } |
| 102 | +type ErrUsage struct{ msg string } |
| 103 | + |
| 104 | +func ExitCodeFor(err error) int { |
| 105 | + if err == nil { return 0 } |
| 106 | + var e1 ErrCompliance; if errors.As(err, &e1) { return 1 } |
| 107 | + var e2 ErrServer; if errors.As(err, &e2) { return 2 } |
| 108 | + var e3 ErrConfig; if errors.As(err, &e3) { return 3 } |
| 109 | + var e4 ErrUsage; if errors.As(err, &e4) { return 4 } |
| 110 | + if isCobraUsageError(err) { return 4 } |
| 111 | + return 1 // safe default |
| 112 | +} |
| 113 | +``` |
| 114 | + |
| 115 | +`errors.As` correctly unwraps chained errors (`fmt.Errorf("...: %w", ErrCompliance{...})`). |
| 116 | + |
| 117 | +The exit code dispatch in `cmd/kosli/main.go` bypasses `logger.Error()` (which calls `log.Fatalf` → `os.Exit(1)`) and writes to stderr directly: |
| 118 | + |
| 119 | +```go |
| 120 | +if err != nil { |
| 121 | + fmt.Fprintf(os.Stderr, "Error: %s\n", err.Error()) |
| 122 | + os.Exit(kosliErrors.ExitCodeFor(err)) |
| 123 | +} |
| 124 | +``` |
| 125 | + |
| 126 | +This is necessary because ~40 command factory functions call `logger.Error()` for fatal setup errors and rely on `Fatalf` to halt — changing `logger.Error` itself would require auditing all those call sites. |
| 127 | + |
| 128 | +## Review Process |
| 129 | + |
| 130 | +This branch was reviewed by a five-agent code review swarm (Opus 4.6) before the fixes were applied. |
| 131 | + |
| 132 | +### Bugs found and fixed |
| 133 | + |
| 134 | +1. **Exit code dispatch was dead code** — `logger.Error()` in `main.go` called `log.Fatalf` → `os.Exit(1)` before `ExitCodeFor` could run. All errors exited 1 regardless of type. Tests didn't catch this because they call `ExitCodeFor(err)` on the returned error directly, never going through `main()`. Fixed by replacing `logger.Error` with `fmt.Fprintf` in `main.go` only. |
| 135 | + |
| 136 | +2. **Same issue in `innerMain()`** — `logger.Error` at the unknown-subcommand path exited before `return ErrUsage` could execute. Fixed the same way. |
| 137 | + |
| 138 | +3. **HTTP 5xx not classified as `ErrServer`** — In `internal/requests/requests.go`, 5xx responses (when retries are exhausted) were returned as plain `fmt.Errorf`, falling through to exit 1 instead of exit 2. Fixed by adding a `resp.StatusCode >= 500` check. |
| 139 | + |
| 140 | +### Verified after fixes |
| 141 | + |
| 142 | +``` |
| 143 | +$ ./kosli version # exit 0 |
| 144 | +$ KOSLI_API_TOKEN=bad ./kosli list environments --org x # exit 3 |
| 145 | +$ ./kosli list environments --host http://localhost:1 ... # exit 2 |
| 146 | +$ ./kosli list environments --bogus-flag # exit 4 |
| 147 | +``` |
| 148 | + |
| 149 | +### Remaining work — test coverage |
| 150 | + |
| 151 | +Only 11 `wantExitCode` annotations exist across ~134 error test cases. The test infrastructure (`wantExitCode` field in `cmdTestCase`, assertion in `runTestCmd`) is in place and working, but most existing command tests haven't been updated yet. Missing from: `assertArtifact_test.go`, `assertApproval_test.go`, all `assertPR*_test.go`, all `attestPR*_test.go`, `attestJira_test.go`, `evaluateTrails_test.go`. |
| 152 | + |
| 153 | +No end-to-end test verifies the actual compiled binary's process exit code. The current tests validate error classification logic in isolation. |
| 154 | + |
| 155 | +## Related Decisions |
| 156 | + |
| 157 | +- [20260302 - Client-side policy evaluation](./20260302-client-side-policy-evaluation.md) — the evaluate command whose exit behaviour this ADR formalises |
0 commit comments