Skip to content

Commit afe12b4

Browse files
committed
docs: expand design documentation with new sections on concurrency, metrics, error model, and weighted eviction
- Added detailed documentation on concurrency strategies, outlining the design rationale for concurrent cache types and their usage. - Introduced a metrics section to explain the metrics infrastructure, including recording, snapshotting, and exporting metrics for observability. - Documented the error model, clarifying the panic vs. `Result` discipline and the handling of different error types. - Included a comprehensive overview of weighted eviction strategies, detailing the implementation of `WeightStore` and `ConcurrentWeightStore`. These additions enhance the overall documentation, providing clearer guidance on design principles and usage patterns for developers.
1 parent d54410a commit afe12b4

8 files changed

Lines changed: 2623 additions & 52 deletions

File tree

docs/design/builder-and-dyn-dispatch.md

Lines changed: 436 additions & 0 deletions
Large diffs are not rendered by default.

docs/design/concurrency.md

Lines changed: 344 additions & 0 deletions
Large diffs are not rendered by default.

docs/design/design.md

Lines changed: 185 additions & 52 deletions
Large diffs are not rendered by default.

docs/design/error-model.md

Lines changed: 341 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,341 @@
1+
# Error Model
2+
3+
> Status: design rationale for cachekit's panic-vs-`Result` discipline,
4+
> the four error types in the public API, and the debug-only invariant
5+
> checks. Companion to [`design.md`](design.md) and [`src/error.rs`](../../src/error.rs).
6+
7+
cachekit treats error handling as a design question, not an ergonomics
8+
question. The rule is:
9+
10+
> **Panic on programming errors. Return `Result` for user-supplied
11+
> input. Reserve invariant checks for `debug_assertions`.**
12+
13+
This document explains where each side of that rule applies, why the
14+
four shipped error types each exist as separate types, and what
15+
discipline a new error type needs to follow.
16+
17+
## The three tiers
18+
19+
cachekit divides every failure mode into one of three tiers, each with
20+
its own response:
21+
22+
| Tier | Cause | Response | Example |
23+
|---|---|---|---|
24+
| 1. Programming error | Bug in the caller's code, statically detectable in principle | Panic | `LruK::with_k(10, 0)` (k = 0) |
25+
| 2. User-supplied input | Configuration arriving from outside the program | `Result<_, ErrorType>` | `S3FifoCache::try_with_ratios(_, 2.0, _)` |
26+
| 3. Invariant violation | Internal data-structure corruption (cannot reach in normal use) | `debug_assert` + `InvariantError` (test/debug only) | `pop_front` while queue length is zero |
27+
28+
The tiers are not opinions — they map to specific Rust constructs and
29+
runtime behaviours. Mixing them (panicking on tier 2, returning
30+
`Result` from tier 3) produces APIs that are either ergonomically
31+
heavy or operationally unsafe.
32+
33+
## Tier 1: panic on programming errors
34+
35+
A "programming error" is a precondition violation the caller could
36+
have prevented with a `if` or a type. cachekit panics in this case
37+
rather than returning `Result`, because:
38+
39+
- The bug is in **the caller's code**, not in untrusted input the
40+
caller is forwarding.
41+
- The right fix is for the caller to fix their code, not to handle
42+
an error path at the call site.
43+
- Forcing every call site to handle `Result<_, "you passed 0 for capacity">`
44+
for a bug they could have prevented adds friction without
45+
catching anything new.
46+
47+
The shipped examples:
48+
49+
- `CacheBuilder::build` panics on `capacity == 0`, `k == 0` for LRU-K,
50+
and `probation_frac > 1.0` for 2Q. The validation is centralised in
51+
`validate_policy` ([`src/builder.rs`](../../src/builder.rs)).
52+
- Direct constructors (`LruCore::new`, `S3FifoCache::new`) panic on
53+
invalid arguments. The fallible counterparts (`try_with_ratios`,
54+
`try_with_capacity`) exist for tier 2.
55+
- `assert!(*k > 0, "LruK: k must be greater than 0")` in
56+
`CacheBuilder::validate_policy` is the canonical shape: a clear
57+
message that identifies the parameter and the constraint.
58+
59+
The cost is that a panicking call site terminates under the crate's
60+
default `panic = "abort"` release profile. This is intentional —
61+
cachekit's `panic = "abort"` is documented in the
62+
[`Cargo.toml`](../../Cargo.toml) release profile, and the rationale
63+
is that a panic in cache code under load is a bug worth surfacing
64+
through the supervisor / restart strategy, not unwinding.
65+
66+
## Tier 2: `Result` for user-supplied input
67+
68+
When the failure mode is "user passes us configuration we don't
69+
recognise as valid," return `Result`. The shipped error types each
70+
cover a specific surface:
71+
72+
### `ConfigError` — invalid configuration parameters
73+
74+
```rust,ignore
75+
pub struct ConfigError(String);
76+
```
77+
78+
Defined in [`src/error.rs`](../../src/error.rs). Returned by fallible
79+
constructors that accept user-tunable knobs:
80+
81+
- `S3FifoCache::try_with_ratios(capacity, small_ratio, ghost_ratio)`
82+
- Future `try_build` variants on `CacheBuilder`
83+
84+
The contained `String` carries a human-readable description of which
85+
parameter failed validation. By convention messages are lowercase,
86+
unpunctuated, and identify the parameter: `"capacity must be greater
87+
than zero"`, `"small_ratio must be in 0.0..=1.0"`.
88+
89+
`ConfigError`'s presence on a constructor signals that the parameter
90+
set can legitimately come from outside the program — a config file,
91+
a CLI flag, an HTTP request — and the caller should handle invalid
92+
input gracefully rather than crashing the process.
93+
94+
### `StoreFull` — capacity-bound failure
95+
96+
```rust,ignore
97+
pub struct StoreFull;
98+
```
99+
100+
Zero-sized type defined in
101+
[`src/store/traits.rs`](../../src/store/traits.rs). Returned by
102+
`StoreMut::try_insert` and `ConcurrentStore::try_insert` when the
103+
store is at capacity and the insert would exceed it. The contract:
104+
105+
- **`StoreFull` is not a panic.** A full store under capacity
106+
pressure is the **expected** outcome of `try_insert`. The caller —
107+
typically a policy layered on top — must respond by evicting and
108+
retrying.
109+
- **The store does not evict on its own.** `StoreFull` is the
110+
signal that says "you, policy, decide who to evict." This is the
111+
core of the policy/storage separation rule from
112+
[`design.md`](design.md) §7.
113+
- **The error carries no data.** The caller knows what they tried
114+
to insert; `StoreFull` adds nothing useful by retaining it.
115+
116+
`StoreFull` is **not** in `src/error.rs` despite being an error
117+
type. It lives alongside the trait that returns it because the
118+
two are co-evolving and the surface is small enough that the
119+
co-location aids readability.
120+
121+
### `LazyMinHeapError``ds`-layer fallible construction
122+
123+
```rust,ignore
124+
pub enum LazyMinHeapError {
125+
CapacityTooLarge { requested: usize, max: usize },
126+
Allocation(std::collections::TryReserveError),
127+
}
128+
```
129+
130+
Defined in [`src/ds/lazy_heap.rs`](../../src/ds/lazy_heap.rs).
131+
Returned by `LazyMinHeap::try_with_capacity` when:
132+
133+
- The requested capacity exceeds the internal `MAX_CAPACITY` bound,
134+
or
135+
- The allocator cannot satisfy the reservation.
136+
137+
The enum exposes both failure modes distinctly because a caller may
138+
want to retry on `Allocation` (transient memory pressure) but not on
139+
`CapacityTooLarge` (logic bug or genuinely-too-big request that
140+
won't recover).
141+
142+
The pattern generalises: a future "fallible-construction" error type
143+
on any `ds` primitive that pre-allocates should distinguish "you
144+
asked for too much" from "we couldn't get what you asked for."
145+
146+
### `std::collections::TryReserveError` — passthrough
147+
148+
Some `try_new` constructors (`HashMapStore::try_new`,
149+
`ConcurrentHashMapStore::try_new`) return the standard
150+
`TryReserveError` directly rather than wrapping it. The reason: the
151+
only failure mode is allocator pressure, and `TryReserveError`
152+
already says exactly that. Wrapping it would add a layer for no
153+
information.
154+
155+
The shape is: if cachekit has a distinct failure mode of its own
156+
(`CapacityTooLarge`, `StoreFull`), wrap or define a new type; if the
157+
only failure mode is "the allocator said no," return the standard
158+
type and let the caller's error-handling stack absorb it.
159+
160+
## Tier 3: invariant checks (debug-only)
161+
162+
```rust,ignore
163+
pub struct InvariantError(String);
164+
```
165+
166+
Defined in [`src/error.rs`](../../src/error.rs). Returned by
167+
`check_invariants` methods on internal data structures:
168+
169+
```rust,ignore
170+
impl<K, V> S3FifoCache<K, V> {
171+
#[cfg(any(debug_assertions, test))]
172+
pub fn check_invariants(&self) -> Result<(), InvariantError> {
173+
if self.small.len() + self.main.len() != self.map.len() {
174+
return Err(InvariantError::new("queue length mismatch"));
175+
}
176+
// …
177+
Ok(())
178+
}
179+
}
180+
```
181+
182+
Three properties define the tier:
183+
184+
- **Off the hot path.** `check_invariants` is called from tests,
185+
fuzz harnesses, and `debug_assertions` paths. It is never called
186+
from normal `insert` / `get` / `evict`.
187+
- **Internal-only.** The invariants are about data-structure
188+
integrity: "the queue length matches the map length", "the heap
189+
is in heap order", "the ghost list hasn't grown past its bound."
190+
No caller program would meaningfully react to one of these
191+
failing — the cache is corrupted, the right response is to
192+
capture state and bail.
193+
- **Returns `Result`, not panics.** Counter-intuitive given the
194+
tier-1 rule. The reason: `check_invariants` is called by
195+
diagnostic code that wants to **report** the violation (in a test
196+
failure message, a fuzz reproducer, a debug-mode assertion's
197+
output) rather than crash. Returning `Result` lets the caller
198+
format the failure; if they want to panic, they `unwrap()`.
199+
200+
`InvariantError` carries the same `String`-message shape as
201+
`ConfigError`, by the same convention: lowercase, unpunctuated,
202+
identifying the specific invariant.
203+
204+
## Why four error types, not one
205+
206+
A single `CachekitError` enum could in principle subsume all four.
207+
cachekit doesn't ship one, deliberately. Three reasons:
208+
209+
- **Each surface has different recovery semantics.** `StoreFull`
210+
means "evict and retry"; `ConfigError` means "fix your config";
211+
`LazyMinHeapError::Allocation` means "back off and retry";
212+
`InvariantError` means "we have a bug, capture state." A unified
213+
enum forces every caller to either match exhaustively (most of
214+
which can't happen at their call site) or use a catch-all that
215+
loses information.
216+
- **Each lives near the trait that uses it.** `StoreFull` lives in
217+
`src/store/traits.rs`; `LazyMinHeapError` lives in
218+
`src/ds/lazy_heap.rs`; `ConfigError` and `InvariantError` live
219+
in `src/error.rs`. Co-location helps maintenance — adding a new
220+
failure mode to one surface doesn't ripple through the others.
221+
- **Sum types compose poorly across abstractions.** A unified
222+
enum would propagate every variant up through every layer that
223+
touched it. The current shape lets a layer convert (or
224+
re-wrap) only the errors it cares about.
225+
226+
The cost is that downstream code wanting to catch "any cachekit
227+
error" has to enumerate all four. The mitigation is that no
228+
realistic downstream code wants that — each call site touches one
229+
surface at a time and handles that surface's error.
230+
231+
## Operational contract: panic profile
232+
233+
The crate's release profile sets `panic = "abort"`:
234+
235+
```toml
236+
[profile.release]
237+
panic = "abort"
238+
```
239+
240+
Two implications worth naming:
241+
242+
- **A panic terminates the process.** No unwind, no destructors,
243+
no observer recovery. A panicking weight function in
244+
`ConcurrentWeightStore` (see
245+
[`weighted-eviction.md`](weighted-eviction.md)) kills the
246+
process; a `parking_lot` lock-poisoning concern is moot under
247+
`panic = "abort"` because the process is gone before any
248+
observer can read poisoned state.
249+
- **Callers who override the profile take on more contract.**
250+
Callers building with `panic = "unwind"` get unwind safety up
251+
to the documented invariants. The
252+
[`weighted-eviction.md`](weighted-eviction.md) clear-ordering
253+
rule and the
254+
[`concurrency.md`](concurrency.md#failure-modes) panic-safety
255+
notes apply only to this mode.
256+
257+
The interplay matters for error model design: under `abort`, tier 1
258+
panics are terminal and need to be debugged at development time;
259+
under `unwind`, they are catchable but should still be treated as
260+
bugs because the cache may be in an unspecified-but-not-corrupt
261+
state.
262+
263+
## What `Result` does **not** cover
264+
265+
Three failure modes are deliberately not represented as `Result`:
266+
267+
- **OOM in non-`try_*` constructors.** `LruCore::new(huge)` aborts
268+
on allocator failure. Use `try_with_capacity` to get a `Result`
269+
surface (where available).
270+
- **Logic errors in policy code.** Eviction picking the wrong
271+
victim is a bug, not a return value. Detected (when detected) by
272+
`check_invariants` or by the policy's tests.
273+
- **Concurrent contention.** `parking_lot::RwLock` doesn't poison,
274+
doesn't time out by default, and doesn't return `Result`. A
275+
contended cache blocks until it can proceed. Callers who need
276+
timeouts wrap the cache themselves with a wider locking
277+
discipline.
278+
279+
## Adding a new error
280+
281+
Checklist for a new failure mode:
282+
283+
1. **Decide the tier.** Programming error, user-supplied input, or
284+
internal invariant?
285+
2. **Pick or define the type.**
286+
- Tier 1: use `assert!` / `debug_assert!` / `panic!`. No new
287+
type needed.
288+
- Tier 2: define a new type if the failure has data the caller
289+
needs and no existing type fits. Otherwise reuse `ConfigError`
290+
(with a clear message) or pass through `TryReserveError`.
291+
- Tier 3: add a `check_invariants` method on the affected type
292+
that returns `Result<(), InvariantError>`.
293+
3. **Co-locate.** Types specific to a trait live with the trait
294+
(`StoreFull` in `src/store/traits.rs`). Types specific to a
295+
primitive live with the primitive (`LazyMinHeapError`).
296+
Cross-cutting types (`ConfigError`, `InvariantError`) live in
297+
`src/error.rs`.
298+
4. **Implement `Display` and `Error`.** Both are required for
299+
`?` interop with `Box<dyn Error>`. The convention is:
300+
```rust,ignore
301+
impl fmt::Display for MyError { … }
302+
impl std::error::Error for MyError {}
303+
```
304+
`Display` writes the message; `Error` is empty unless the type
305+
wraps another error (then `source` returns the inner error).
306+
5. **`Send + Sync + Clone`.** All existing error types satisfy this.
307+
The convention is `#[derive(Debug, Clone, PartialEq, Eq, Hash)]`
308+
for value types and matching impls for enums. Errors that flow
309+
between threads must be `Send + Sync`; errors that get cloned
310+
into snapshots / test fixtures must be `Clone`.
311+
312+
## Compatibility with `?` and `anyhow`/`thiserror`
313+
314+
The cachekit error types are intentionally **plain types, not
315+
`thiserror`-derived**, to avoid forcing a `thiserror` dependency on
316+
downstream users. They implement `std::error::Error` directly, so
317+
they work with `?`, `Box<dyn Error>`, and any error-aggregation
318+
crate (including `anyhow` and `thiserror::Error` in user code).
319+
320+
A downstream `thiserror`-derived enum that includes a `#[from]
321+
cachekit::ConfigError` works. A downstream `anyhow::Result<_>` that
322+
absorbs cachekit errors via `?` works. The choice not to bundle
323+
either crate keeps the error layer dependency-free and gives
324+
downstream the standard `From` and `Display` shape they expect.
325+
326+
## See also
327+
328+
- [Design overview](design.md) — §12 frames failure modes at the
329+
principles level
330+
- [Concurrency](concurrency.md)`parking_lot` non-poisoning,
331+
atomic check-and-act, lock-acquisition failure modes
332+
- [Builder and runtime dispatch](builder-and-dyn-dispatch.md)
333+
panic-in-`build` validation, `try_build`-deliberately-absent
334+
rationale
335+
- [Weighted eviction](weighted-eviction.md)`StoreFull`'s role
336+
and unwind-safety in `clear`
337+
- [`src/error.rs`](../../src/error.rs)`ConfigError`,
338+
`InvariantError`
339+
- [`src/store/traits.rs`](../../src/store/traits.rs)`StoreFull`
340+
- [`src/ds/lazy_heap.rs`](../../src/ds/lazy_heap.rs)
341+
`LazyMinHeapError`

0 commit comments

Comments
 (0)