|
| 1 | +# Error Model |
| 2 | + |
| 3 | +> Status: design rationale for cachekit's panic-vs-`Result` discipline, |
| 4 | +> the four error types in the public API, and the debug-only invariant |
| 5 | +> checks. Companion to [`design.md`](design.md) and [`src/error.rs`](../../src/error.rs). |
| 6 | +
|
| 7 | +cachekit treats error handling as a design question, not an ergonomics |
| 8 | +question. The rule is: |
| 9 | + |
| 10 | +> **Panic on programming errors. Return `Result` for user-supplied |
| 11 | +> input. Reserve invariant checks for `debug_assertions`.** |
| 12 | +
|
| 13 | +This document explains where each side of that rule applies, why the |
| 14 | +four shipped error types each exist as separate types, and what |
| 15 | +discipline a new error type needs to follow. |
| 16 | + |
| 17 | +## The three tiers |
| 18 | + |
| 19 | +cachekit divides every failure mode into one of three tiers, each with |
| 20 | +its own response: |
| 21 | + |
| 22 | +| Tier | Cause | Response | Example | |
| 23 | +|---|---|---|---| |
| 24 | +| 1. Programming error | Bug in the caller's code, statically detectable in principle | Panic | `LruK::with_k(10, 0)` (k = 0) | |
| 25 | +| 2. User-supplied input | Configuration arriving from outside the program | `Result<_, ErrorType>` | `S3FifoCache::try_with_ratios(_, 2.0, _)` | |
| 26 | +| 3. Invariant violation | Internal data-structure corruption (cannot reach in normal use) | `debug_assert` + `InvariantError` (test/debug only) | `pop_front` while queue length is zero | |
| 27 | + |
| 28 | +The tiers are not opinions — they map to specific Rust constructs and |
| 29 | +runtime behaviours. Mixing them (panicking on tier 2, returning |
| 30 | +`Result` from tier 3) produces APIs that are either ergonomically |
| 31 | +heavy or operationally unsafe. |
| 32 | + |
| 33 | +## Tier 1: panic on programming errors |
| 34 | + |
| 35 | +A "programming error" is a precondition violation the caller could |
| 36 | +have prevented with a `if` or a type. cachekit panics in this case |
| 37 | +rather than returning `Result`, because: |
| 38 | + |
| 39 | +- The bug is in **the caller's code**, not in untrusted input the |
| 40 | + caller is forwarding. |
| 41 | +- The right fix is for the caller to fix their code, not to handle |
| 42 | + an error path at the call site. |
| 43 | +- Forcing every call site to handle `Result<_, "you passed 0 for capacity">` |
| 44 | + for a bug they could have prevented adds friction without |
| 45 | + catching anything new. |
| 46 | + |
| 47 | +The shipped examples: |
| 48 | + |
| 49 | +- `CacheBuilder::build` panics on `capacity == 0`, `k == 0` for LRU-K, |
| 50 | + and `probation_frac > 1.0` for 2Q. The validation is centralised in |
| 51 | + `validate_policy` ([`src/builder.rs`](../../src/builder.rs)). |
| 52 | +- Direct constructors (`LruCore::new`, `S3FifoCache::new`) panic on |
| 53 | + invalid arguments. The fallible counterparts (`try_with_ratios`, |
| 54 | + `try_with_capacity`) exist for tier 2. |
| 55 | +- `assert!(*k > 0, "LruK: k must be greater than 0")` in |
| 56 | + `CacheBuilder::validate_policy` is the canonical shape: a clear |
| 57 | + message that identifies the parameter and the constraint. |
| 58 | + |
| 59 | +The cost is that a panicking call site terminates under the crate's |
| 60 | +default `panic = "abort"` release profile. This is intentional — |
| 61 | +cachekit's `panic = "abort"` is documented in the |
| 62 | +[`Cargo.toml`](../../Cargo.toml) release profile, and the rationale |
| 63 | +is that a panic in cache code under load is a bug worth surfacing |
| 64 | +through the supervisor / restart strategy, not unwinding. |
| 65 | + |
| 66 | +## Tier 2: `Result` for user-supplied input |
| 67 | + |
| 68 | +When the failure mode is "user passes us configuration we don't |
| 69 | +recognise as valid," return `Result`. The shipped error types each |
| 70 | +cover a specific surface: |
| 71 | + |
| 72 | +### `ConfigError` — invalid configuration parameters |
| 73 | + |
| 74 | +```rust,ignore |
| 75 | +pub struct ConfigError(String); |
| 76 | +``` |
| 77 | + |
| 78 | +Defined in [`src/error.rs`](../../src/error.rs). Returned by fallible |
| 79 | +constructors that accept user-tunable knobs: |
| 80 | + |
| 81 | +- `S3FifoCache::try_with_ratios(capacity, small_ratio, ghost_ratio)` |
| 82 | +- Future `try_build` variants on `CacheBuilder` |
| 83 | + |
| 84 | +The contained `String` carries a human-readable description of which |
| 85 | +parameter failed validation. By convention messages are lowercase, |
| 86 | +unpunctuated, and identify the parameter: `"capacity must be greater |
| 87 | +than zero"`, `"small_ratio must be in 0.0..=1.0"`. |
| 88 | + |
| 89 | +`ConfigError`'s presence on a constructor signals that the parameter |
| 90 | +set can legitimately come from outside the program — a config file, |
| 91 | +a CLI flag, an HTTP request — and the caller should handle invalid |
| 92 | +input gracefully rather than crashing the process. |
| 93 | + |
| 94 | +### `StoreFull` — capacity-bound failure |
| 95 | + |
| 96 | +```rust,ignore |
| 97 | +pub struct StoreFull; |
| 98 | +``` |
| 99 | + |
| 100 | +Zero-sized type defined in |
| 101 | +[`src/store/traits.rs`](../../src/store/traits.rs). Returned by |
| 102 | +`StoreMut::try_insert` and `ConcurrentStore::try_insert` when the |
| 103 | +store is at capacity and the insert would exceed it. The contract: |
| 104 | + |
| 105 | +- **`StoreFull` is not a panic.** A full store under capacity |
| 106 | + pressure is the **expected** outcome of `try_insert`. The caller — |
| 107 | + typically a policy layered on top — must respond by evicting and |
| 108 | + retrying. |
| 109 | +- **The store does not evict on its own.** `StoreFull` is the |
| 110 | + signal that says "you, policy, decide who to evict." This is the |
| 111 | + core of the policy/storage separation rule from |
| 112 | + [`design.md`](design.md) §7. |
| 113 | +- **The error carries no data.** The caller knows what they tried |
| 114 | + to insert; `StoreFull` adds nothing useful by retaining it. |
| 115 | + |
| 116 | +`StoreFull` is **not** in `src/error.rs` despite being an error |
| 117 | +type. It lives alongside the trait that returns it because the |
| 118 | +two are co-evolving and the surface is small enough that the |
| 119 | +co-location aids readability. |
| 120 | + |
| 121 | +### `LazyMinHeapError` — `ds`-layer fallible construction |
| 122 | + |
| 123 | +```rust,ignore |
| 124 | +pub enum LazyMinHeapError { |
| 125 | + CapacityTooLarge { requested: usize, max: usize }, |
| 126 | + Allocation(std::collections::TryReserveError), |
| 127 | +} |
| 128 | +``` |
| 129 | + |
| 130 | +Defined in [`src/ds/lazy_heap.rs`](../../src/ds/lazy_heap.rs). |
| 131 | +Returned by `LazyMinHeap::try_with_capacity` when: |
| 132 | + |
| 133 | +- The requested capacity exceeds the internal `MAX_CAPACITY` bound, |
| 134 | + or |
| 135 | +- The allocator cannot satisfy the reservation. |
| 136 | + |
| 137 | +The enum exposes both failure modes distinctly because a caller may |
| 138 | +want to retry on `Allocation` (transient memory pressure) but not on |
| 139 | +`CapacityTooLarge` (logic bug or genuinely-too-big request that |
| 140 | +won't recover). |
| 141 | + |
| 142 | +The pattern generalises: a future "fallible-construction" error type |
| 143 | +on any `ds` primitive that pre-allocates should distinguish "you |
| 144 | +asked for too much" from "we couldn't get what you asked for." |
| 145 | + |
| 146 | +### `std::collections::TryReserveError` — passthrough |
| 147 | + |
| 148 | +Some `try_new` constructors (`HashMapStore::try_new`, |
| 149 | +`ConcurrentHashMapStore::try_new`) return the standard |
| 150 | +`TryReserveError` directly rather than wrapping it. The reason: the |
| 151 | +only failure mode is allocator pressure, and `TryReserveError` |
| 152 | +already says exactly that. Wrapping it would add a layer for no |
| 153 | +information. |
| 154 | + |
| 155 | +The shape is: if cachekit has a distinct failure mode of its own |
| 156 | +(`CapacityTooLarge`, `StoreFull`), wrap or define a new type; if the |
| 157 | +only failure mode is "the allocator said no," return the standard |
| 158 | +type and let the caller's error-handling stack absorb it. |
| 159 | + |
| 160 | +## Tier 3: invariant checks (debug-only) |
| 161 | + |
| 162 | +```rust,ignore |
| 163 | +pub struct InvariantError(String); |
| 164 | +``` |
| 165 | + |
| 166 | +Defined in [`src/error.rs`](../../src/error.rs). Returned by |
| 167 | +`check_invariants` methods on internal data structures: |
| 168 | + |
| 169 | +```rust,ignore |
| 170 | +impl<K, V> S3FifoCache<K, V> { |
| 171 | + #[cfg(any(debug_assertions, test))] |
| 172 | + pub fn check_invariants(&self) -> Result<(), InvariantError> { |
| 173 | + if self.small.len() + self.main.len() != self.map.len() { |
| 174 | + return Err(InvariantError::new("queue length mismatch")); |
| 175 | + } |
| 176 | + // … |
| 177 | + Ok(()) |
| 178 | + } |
| 179 | +} |
| 180 | +``` |
| 181 | + |
| 182 | +Three properties define the tier: |
| 183 | + |
| 184 | +- **Off the hot path.** `check_invariants` is called from tests, |
| 185 | + fuzz harnesses, and `debug_assertions` paths. It is never called |
| 186 | + from normal `insert` / `get` / `evict`. |
| 187 | +- **Internal-only.** The invariants are about data-structure |
| 188 | + integrity: "the queue length matches the map length", "the heap |
| 189 | + is in heap order", "the ghost list hasn't grown past its bound." |
| 190 | + No caller program would meaningfully react to one of these |
| 191 | + failing — the cache is corrupted, the right response is to |
| 192 | + capture state and bail. |
| 193 | +- **Returns `Result`, not panics.** Counter-intuitive given the |
| 194 | + tier-1 rule. The reason: `check_invariants` is called by |
| 195 | + diagnostic code that wants to **report** the violation (in a test |
| 196 | + failure message, a fuzz reproducer, a debug-mode assertion's |
| 197 | + output) rather than crash. Returning `Result` lets the caller |
| 198 | + format the failure; if they want to panic, they `unwrap()`. |
| 199 | + |
| 200 | +`InvariantError` carries the same `String`-message shape as |
| 201 | +`ConfigError`, by the same convention: lowercase, unpunctuated, |
| 202 | +identifying the specific invariant. |
| 203 | + |
| 204 | +## Why four error types, not one |
| 205 | + |
| 206 | +A single `CachekitError` enum could in principle subsume all four. |
| 207 | +cachekit doesn't ship one, deliberately. Three reasons: |
| 208 | + |
| 209 | +- **Each surface has different recovery semantics.** `StoreFull` |
| 210 | + means "evict and retry"; `ConfigError` means "fix your config"; |
| 211 | + `LazyMinHeapError::Allocation` means "back off and retry"; |
| 212 | + `InvariantError` means "we have a bug, capture state." A unified |
| 213 | + enum forces every caller to either match exhaustively (most of |
| 214 | + which can't happen at their call site) or use a catch-all that |
| 215 | + loses information. |
| 216 | +- **Each lives near the trait that uses it.** `StoreFull` lives in |
| 217 | + `src/store/traits.rs`; `LazyMinHeapError` lives in |
| 218 | + `src/ds/lazy_heap.rs`; `ConfigError` and `InvariantError` live |
| 219 | + in `src/error.rs`. Co-location helps maintenance — adding a new |
| 220 | + failure mode to one surface doesn't ripple through the others. |
| 221 | +- **Sum types compose poorly across abstractions.** A unified |
| 222 | + enum would propagate every variant up through every layer that |
| 223 | + touched it. The current shape lets a layer convert (or |
| 224 | + re-wrap) only the errors it cares about. |
| 225 | + |
| 226 | +The cost is that downstream code wanting to catch "any cachekit |
| 227 | +error" has to enumerate all four. The mitigation is that no |
| 228 | +realistic downstream code wants that — each call site touches one |
| 229 | +surface at a time and handles that surface's error. |
| 230 | + |
| 231 | +## Operational contract: panic profile |
| 232 | + |
| 233 | +The crate's release profile sets `panic = "abort"`: |
| 234 | + |
| 235 | +```toml |
| 236 | +[profile.release] |
| 237 | +panic = "abort" |
| 238 | +``` |
| 239 | + |
| 240 | +Two implications worth naming: |
| 241 | + |
| 242 | +- **A panic terminates the process.** No unwind, no destructors, |
| 243 | + no observer recovery. A panicking weight function in |
| 244 | + `ConcurrentWeightStore` (see |
| 245 | + [`weighted-eviction.md`](weighted-eviction.md)) kills the |
| 246 | + process; a `parking_lot` lock-poisoning concern is moot under |
| 247 | + `panic = "abort"` because the process is gone before any |
| 248 | + observer can read poisoned state. |
| 249 | +- **Callers who override the profile take on more contract.** |
| 250 | + Callers building with `panic = "unwind"` get unwind safety up |
| 251 | + to the documented invariants. The |
| 252 | + [`weighted-eviction.md`](weighted-eviction.md) clear-ordering |
| 253 | + rule and the |
| 254 | + [`concurrency.md`](concurrency.md#failure-modes) panic-safety |
| 255 | + notes apply only to this mode. |
| 256 | + |
| 257 | +The interplay matters for error model design: under `abort`, tier 1 |
| 258 | +panics are terminal and need to be debugged at development time; |
| 259 | +under `unwind`, they are catchable but should still be treated as |
| 260 | +bugs because the cache may be in an unspecified-but-not-corrupt |
| 261 | +state. |
| 262 | + |
| 263 | +## What `Result` does **not** cover |
| 264 | + |
| 265 | +Three failure modes are deliberately not represented as `Result`: |
| 266 | + |
| 267 | +- **OOM in non-`try_*` constructors.** `LruCore::new(huge)` aborts |
| 268 | + on allocator failure. Use `try_with_capacity` to get a `Result` |
| 269 | + surface (where available). |
| 270 | +- **Logic errors in policy code.** Eviction picking the wrong |
| 271 | + victim is a bug, not a return value. Detected (when detected) by |
| 272 | + `check_invariants` or by the policy's tests. |
| 273 | +- **Concurrent contention.** `parking_lot::RwLock` doesn't poison, |
| 274 | + doesn't time out by default, and doesn't return `Result`. A |
| 275 | + contended cache blocks until it can proceed. Callers who need |
| 276 | + timeouts wrap the cache themselves with a wider locking |
| 277 | + discipline. |
| 278 | + |
| 279 | +## Adding a new error |
| 280 | + |
| 281 | +Checklist for a new failure mode: |
| 282 | + |
| 283 | +1. **Decide the tier.** Programming error, user-supplied input, or |
| 284 | + internal invariant? |
| 285 | +2. **Pick or define the type.** |
| 286 | + - Tier 1: use `assert!` / `debug_assert!` / `panic!`. No new |
| 287 | + type needed. |
| 288 | + - Tier 2: define a new type if the failure has data the caller |
| 289 | + needs and no existing type fits. Otherwise reuse `ConfigError` |
| 290 | + (with a clear message) or pass through `TryReserveError`. |
| 291 | + - Tier 3: add a `check_invariants` method on the affected type |
| 292 | + that returns `Result<(), InvariantError>`. |
| 293 | +3. **Co-locate.** Types specific to a trait live with the trait |
| 294 | + (`StoreFull` in `src/store/traits.rs`). Types specific to a |
| 295 | + primitive live with the primitive (`LazyMinHeapError`). |
| 296 | + Cross-cutting types (`ConfigError`, `InvariantError`) live in |
| 297 | + `src/error.rs`. |
| 298 | +4. **Implement `Display` and `Error`.** Both are required for |
| 299 | + `?` interop with `Box<dyn Error>`. The convention is: |
| 300 | + ```rust,ignore |
| 301 | + impl fmt::Display for MyError { … } |
| 302 | + impl std::error::Error for MyError {} |
| 303 | + ``` |
| 304 | + `Display` writes the message; `Error` is empty unless the type |
| 305 | + wraps another error (then `source` returns the inner error). |
| 306 | +5. **`Send + Sync + Clone`.** All existing error types satisfy this. |
| 307 | + The convention is `#[derive(Debug, Clone, PartialEq, Eq, Hash)]` |
| 308 | + for value types and matching impls for enums. Errors that flow |
| 309 | + between threads must be `Send + Sync`; errors that get cloned |
| 310 | + into snapshots / test fixtures must be `Clone`. |
| 311 | + |
| 312 | +## Compatibility with `?` and `anyhow`/`thiserror` |
| 313 | + |
| 314 | +The cachekit error types are intentionally **plain types, not |
| 315 | +`thiserror`-derived**, to avoid forcing a `thiserror` dependency on |
| 316 | +downstream users. They implement `std::error::Error` directly, so |
| 317 | +they work with `?`, `Box<dyn Error>`, and any error-aggregation |
| 318 | +crate (including `anyhow` and `thiserror::Error` in user code). |
| 319 | + |
| 320 | +A downstream `thiserror`-derived enum that includes a `#[from] |
| 321 | +cachekit::ConfigError` works. A downstream `anyhow::Result<_>` that |
| 322 | +absorbs cachekit errors via `?` works. The choice not to bundle |
| 323 | +either crate keeps the error layer dependency-free and gives |
| 324 | +downstream the standard `From` and `Display` shape they expect. |
| 325 | + |
| 326 | +## See also |
| 327 | + |
| 328 | +- [Design overview](design.md) — §12 frames failure modes at the |
| 329 | + principles level |
| 330 | +- [Concurrency](concurrency.md) — `parking_lot` non-poisoning, |
| 331 | + atomic check-and-act, lock-acquisition failure modes |
| 332 | +- [Builder and runtime dispatch](builder-and-dyn-dispatch.md) — |
| 333 | + panic-in-`build` validation, `try_build`-deliberately-absent |
| 334 | + rationale |
| 335 | +- [Weighted eviction](weighted-eviction.md) — `StoreFull`'s role |
| 336 | + and unwind-safety in `clear` |
| 337 | +- [`src/error.rs`](../../src/error.rs) — `ConfigError`, |
| 338 | + `InvariantError` |
| 339 | +- [`src/store/traits.rs`](../../src/store/traits.rs) — `StoreFull` |
| 340 | +- [`src/ds/lazy_heap.rs`](../../src/ds/lazy_heap.rs) — |
| 341 | + `LazyMinHeapError` |
0 commit comments