Skip to content

Commit 976a39e

Browse files
committed
docs: add ADR for differentiated exit codes
Documents the decision to split the uniform exit-1-on-all-errors into structured codes (1=compliance, 2=server, 3=config, 4=usage), the review process that found the Fatalf dead-code bug, and remaining test coverage work.
1 parent de73844 commit 976a39e

1 file changed

Lines changed: 157 additions & 0 deletions

File tree

Lines changed: 157 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,157 @@
1+
---
2+
title: "20260319 - Differentiated exit codes for the Kosli CLI"
3+
description: "Differentiate the existing uniform exit-1-on-all-errors into structured exit codes that distinguish compliance failures, server errors, config errors, and usage errors"
4+
status: "Proposed"
5+
date: "2026-03-19"
6+
---
7+
8+
# 20260319 - Differentiated exit codes for the Kosli CLI
9+
10+
## Overview
11+
12+
Introduce structured, semantic exit codes to the Kosli CLI so that callers (CI/CD pipelines, shell scripts) can distinguish between a compliance/policy violation and other failure modes without parsing stderr output.
13+
14+
## Context
15+
16+
Prior to this change, the Kosli CLI exits with code `1` for **all** errors. This is because `logger.Error()` in `internal/logger/logger.go` calls `log.Fatalf()`, which unconditionally calls `os.Exit(1)`:
17+
18+
```go
19+
// internal/logger/logger.go — current behaviour
20+
func (l *Logger) Error(format string, v ...interface{}) {
21+
format = fmt.Sprintf("Error: %s\n", format)
22+
l.errLog.Fatalf(format, v...) // always exits with code 1
23+
}
24+
```
25+
26+
In `cmd/kosli/main.go`, every error flows through this path:
27+
28+
```go
29+
if err != nil {
30+
logger.Error(err.Error()) // exits 1 via Fatalf — always
31+
}
32+
```
33+
34+
This means:
35+
- A script running `kosli assert artifact` gets exit 1 whether the artifact is non-compliant, the server is down, or the API token is wrong.
36+
- A CI pipeline cannot distinguish a network outage (retry-able) from a policy denial (actionable) from a bad token (fix config).
37+
- The only way to differentiate failures is to parse stderr output.
38+
39+
## Decision Drivers
40+
41+
- CI/CD pipelines need to fail fast on compliance violations without parsing stderr
42+
- Operators need to distinguish transient server failures (retry) from auth failures (fix token) from compliance violations (fix the artifact/policy)
43+
- Shell scripts using `$?` should be able to act on the specific class of failure
44+
- The change should preserve the existing exit-1 behaviour for compliance failures (the most common case) to minimise disruption
45+
46+
## Options Considered
47+
48+
### Option A: Keep uniform exit 1 for all errors (status quo)
49+
- **Pro**: No change; no migration needed
50+
- **Con**: Cannot distinguish failure types without parsing stderr; limits CI/CD automation
51+
52+
### Option B: Structured exit codes — 0/1/2/3/4 with semantic meaning (chosen)
53+
- **Pro**: Rich signalling; enables CI/CD to route failures to the right handler; aligns with how tools like `git`, `curl`, and policy engines behave
54+
- **Con**: Scripts checking `[ $? -eq 1 ]` as a proxy for "any error" will miss exit codes 2, 3, 4
55+
- **Mitigation**: Exit 1 remains the most common failure code (compliance violations AND the default for unclassified errors), so most existing `[ $? -eq 1 ]` checks will continue to work for the dominant use case
56+
57+
## Decision
58+
59+
**Option B — structured exit codes.**
60+
61+
| Code | Meaning | Error type |
62+
|------|---------|------------|
63+
| 0 | Success ||
64+
| 1 | Compliance/policy violation (also: unclassified errors) | `ErrCompliance` |
65+
| 2 | Server unreachable or 5xx | `ErrServer` |
66+
| 3 | Auth/config error (401/403) | `ErrConfig` |
67+
| 4 | CLI usage error (unknown flag, missing required flag, wrong arg count) | `ErrUsage` |
68+
69+
Unknown/unclassified errors fall back to exit 1 (same as before).
70+
71+
Commands affected:
72+
- **Assert commands** (8): `assert artifact`, `assert approval`, `assert snapshot`, `assert pullrequest *` — exit 1 on compliance failure (unchanged)
73+
- **Attest commands with `--assert`** (5): `attest pullrequest *`, `attest jira` — exit 1 when assertion fails (unchanged)
74+
- **Evaluate commands** (2): `evaluate trail`, `evaluate trails` — exit 1 when policy denies (unchanged)
75+
- **All commands**: exit 2 on server/network failure, exit 3 on 401/403, exit 4 on usage errors (NEW — previously these all exited 1)
76+
- Special case: `assert status` exits 0 (responsive) or 2 (unreachable) — never 1
77+
78+
## Consequences
79+
80+
### Positive
81+
- Compliance assertions remain usable as hard gates (exit 1, same as before)
82+
- Operators can write scripts that retry on exit 2, fix config on exit 3, and triage policy violations on exit 1
83+
- Exit code meaning is documented per-command in generated CLI docs (via `commandExitCodes` in `cmd/kosli/docs.go`)
84+
- Dry-run mode is preserved: errors in `--dry-run` still exit 0
85+
86+
### Negative
87+
- Scripts checking `[ $? -eq 1 ]` as a proxy for "any error" will stop catching server errors (now exit 2), auth errors (now exit 3), and usage errors (now exit 4). Scripts using `[ $? -ne 0 ]` are unaffected.
88+
- Whether this warrants a major version bump is debatable — exit code 1 was never documented as meaning "any error", but it was the de facto implicit contract via `Fatalf`.
89+
90+
### Neutral
91+
- Cobra built-in usage errors (unknown flags, missing required flags, wrong argument count) are auto-detected via string pattern matching and classified as `ErrUsage` without requiring explicit wrapping in each command.
92+
- HTTP 4xx errors other than 401/403 (e.g. 400, 404) return plain `fmt.Errorf` and fall through to exit 1 — consistent with treating them as data/compliance failures.
93+
94+
## Implementation
95+
96+
Error types are thin wrappers in `internal/errors/errors.go`:
97+
98+
```go
99+
type ErrCompliance struct{ msg string }
100+
type ErrServer struct{ msg string }
101+
type ErrConfig struct{ msg string }
102+
type ErrUsage struct{ msg string }
103+
104+
func ExitCodeFor(err error) int {
105+
if err == nil { return 0 }
106+
var e1 ErrCompliance; if errors.As(err, &e1) { return 1 }
107+
var e2 ErrServer; if errors.As(err, &e2) { return 2 }
108+
var e3 ErrConfig; if errors.As(err, &e3) { return 3 }
109+
var e4 ErrUsage; if errors.As(err, &e4) { return 4 }
110+
if isCobraUsageError(err) { return 4 }
111+
return 1 // safe default
112+
}
113+
```
114+
115+
`errors.As` correctly unwraps chained errors (`fmt.Errorf("...: %w", ErrCompliance{...})`).
116+
117+
The exit code dispatch in `cmd/kosli/main.go` bypasses `logger.Error()` (which calls `log.Fatalf``os.Exit(1)`) and writes to stderr directly:
118+
119+
```go
120+
if err != nil {
121+
fmt.Fprintf(os.Stderr, "Error: %s\n", err.Error())
122+
os.Exit(kosliErrors.ExitCodeFor(err))
123+
}
124+
```
125+
126+
This is necessary because ~40 command factory functions call `logger.Error()` for fatal setup errors and rely on `Fatalf` to halt — changing `logger.Error` itself would require auditing all those call sites.
127+
128+
## Review Process
129+
130+
This branch was reviewed by a five-agent code review swarm (Opus 4.6) before the fixes were applied.
131+
132+
### Bugs found and fixed
133+
134+
1. **Exit code dispatch was dead code**`logger.Error()` in `main.go` called `log.Fatalf``os.Exit(1)` before `ExitCodeFor` could run. All errors exited 1 regardless of type. Tests didn't catch this because they call `ExitCodeFor(err)` on the returned error directly, never going through `main()`. Fixed by replacing `logger.Error` with `fmt.Fprintf` in `main.go` only.
135+
136+
2. **Same issue in `innerMain()`**`logger.Error` at the unknown-subcommand path exited before `return ErrUsage` could execute. Fixed the same way.
137+
138+
3. **HTTP 5xx not classified as `ErrServer`** — In `internal/requests/requests.go`, 5xx responses (when retries are exhausted) were returned as plain `fmt.Errorf`, falling through to exit 1 instead of exit 2. Fixed by adding a `resp.StatusCode >= 500` check.
139+
140+
### Verified after fixes
141+
142+
```
143+
$ ./kosli version # exit 0
144+
$ KOSLI_API_TOKEN=bad ./kosli list environments --org x # exit 3
145+
$ ./kosli list environments --host http://localhost:1 ... # exit 2
146+
$ ./kosli list environments --bogus-flag # exit 4
147+
```
148+
149+
### Remaining work — test coverage
150+
151+
Only 11 `wantExitCode` annotations exist across ~134 error test cases. The test infrastructure (`wantExitCode` field in `cmdTestCase`, assertion in `runTestCmd`) is in place and working, but most existing command tests haven't been updated yet. Missing from: `assertArtifact_test.go`, `assertApproval_test.go`, all `assertPR*_test.go`, all `attestPR*_test.go`, `attestJira_test.go`, `evaluateTrails_test.go`.
152+
153+
No end-to-end test verifies the actual compiled binary's process exit code. The current tests validate error classification logic in isolation.
154+
155+
## Related Decisions
156+
157+
- [20260302 - Client-side policy evaluation](./20260302-client-side-policy-evaluation.md) — the evaluate command whose exit behaviour this ADR formalises

0 commit comments

Comments
 (0)