docs(board): E-RELIABILITY-IS-CHECKLIST-COVERAGE — cheap RISC alternative: reliability = required-rung/checklist coverage over a knowns/unknowns SoA bitmask (AND-test, one cycle), no float/corpus

claude · claude · commit b5073aaf8ee0 · 2026-05-30T23:12:16.000Z
User's cheap alternative to psychometric calibration: reliability becomes structural+prior = coverage of a normalized eval set. 10-rung ladder (rung:u8 ALREADY on ThoughtCtx/SPO/CausalEdge64, E-LADDER doctrine) = depth axis; per-domain class_id-keyed checklist (HHTL-inherited) = eval axis; knowns/unknowns = presence bitmask (cognitive-risc-classes N3). Commit=required&known==required (SIMD AND+compare, one cycle; popcount sizes the Plan gap); Plan=named known-unknowns remain; Prune=unsatisfiable. DISSOLVES the 0.2/0.8/0.15/0.35 threshold problem (nothing to calibrate, no Jirak bound) -> retires the iron-rule-savant's VIOLATES-I-NOISE-FLOOR-JIRAK. Unknown-unknowns stay with the cold Stockfish validity gate (checklist-completeness audit). Reuses rung+bitmask+class_id(#439); no float, no thinking-engine dep -> lighter than the psychometric slice. Complementary: cheap hot gate + psychometric offline audit. https://claude.ai/code/session_01R9AWgFa65uPnLyS2my2d2R
diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md
@@ -1,3 +1,24 @@
+## 2026-05-30 — E-RELIABILITY-IS-CHECKLIST-COVERAGE — the cheap RISC alternative to psychometric calibration: reliability = (required rungs/checklist-items COVERED) over a knowns/unknowns SoA bitmask, AND-tested in one cycle. No float, no corpus.
+
+**Status:** FINDING + BUILD-DIRECTION (user 2026-05-30: "cheap and efficient alternative — 10-layer rungs ladder + a checklist per domain of what needs evaluation; reasoning has a normalized set of info with validation across knowns/unknowns as SoA"). The cheap structural alternative to E-CALIBRATE-RELIABILITY-PSYCHOMETRICALLY — complementary, not competing.
+
+**The reframe:** reliability becomes STRUCTURAL + PRIOR, not a post-hoc statistic. Instead of measuring Cronbach α over a corpus, make it COVERAGE of a normalized evaluation set:
+- **10-rung ladder = the normalized DEPTH axis.** `pearl_rung: u8` (values 0..=9: rungs 1..=9 plus 0, ten in all) ALREADY exists on `ThoughtCtx` (recipe_kernels.rs:36-37), proprioception, world_map, cognitive_shader (0..9), SPO triplet, CausalEdge64; `recipes::Recipe.tier` = Sun et al. reasoning-ladder difficulty. Doctrine: E-LADDER-SERVES-MAILBOX. The ladder is real and already threaded through these types; this REUSES it.
+- **Per-domain checklist = the EVALUATION axis (x).** Each domain (class_id) declares WHICH rungs/items must be evaluated. The checklist is `class_id`-keyed and inherited along the HHTL path (like labels/columns/templates — the cognitive-risc-classes triangle), so domains don't hand-roll it.
+- **knowns/unknowns as SoA = a presence bitmask** (= cognitive-risc-classes N3 "stable per-class bitmask, append-only, bit=field-N-populated"). Reliability-to-Commit = `required & known == required`: a SIMD batch AND + compare over the SoA column, about ONE cycle per SIMD vector of masks (the 0xFFF/facet-AND efficiency); `popcount(required & !known)` sizes the remaining gap.
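A minimal Rust sketch of the coverage column and its Commit test, assuming one `u64` presence mask per SoA row with bit N = checklist item N populated (the N3 layout). `CoverageColumn`, `commit_ready`, and `gap_count` are hypothetical names, not the crate's API; the loops use no explicit SIMD intrinsics, so whether they vectorize is left to the compiler.

```rust
// Sketch only: CoverageColumn, commit_ready and gap_count are hypothetical
// names. Assumes one u64 presence mask per row, bit N = checklist item N known.

/// Struct-of-arrays column: one "known" bitmask per reasoning row.
pub struct CoverageColumn {
    pub known: Vec<u64>,
}

/// Commit test for one row: every required bit must also be known.
#[inline]
pub fn commit_ready(required: u64, known: u64) -> bool {
    (required & known) == required
}

/// How many required items are still missing (the Plan/gap signal).
#[inline]
pub fn gap_count(required: u64, known: u64) -> u32 {
    (required & !known).count_ones()
}

impl CoverageColumn {
    /// Batch AND-test over the whole column, one bool per row.
    pub fn commit_mask(&self, required: u64) -> Vec<bool> {
        self.known.iter().map(|&k| commit_ready(required, k)).collect()
    }

    /// Total missing required items across all rows.
    pub fn total_gap(&self, required: u64) -> u32 {
        self.known.iter().map(|&k| gap_count(required, k)).sum()
    }
}
```

Rust's `&` binds tighter than `==`, so the unparenthesized `required & known == required` already means `(required & known) == required`; the parentheses are only for readability. In C the precedence is reversed and they would be required.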
+
+**Why it's a better fit than calibration here:** (1) NO float, NO offline corpus, NO calibration pass: just bitmask coverage. (2) It DISSOLVES the threshold problem the iron-rule-savant flagged: there is no 0.2/0.8/0.15/0.35 to calibrate OR Jirak-bound, because the Rubicon 3-way maps onto COVERAGE STATE, not a magnitude. Commit = required checklist covered; Plan = named known-UNKNOWNS remain (re-deliberate to fill them); Prune = checklist cannot be satisfied. (3) It's the `class_id`→checklist projection: one more payoff from the discriminator the SoA already needs (N1).
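The 3-way mapping onto coverage state can be sketched as a pure bitmask function. `Rubicon`, `rubicon_gate`, and the `blocked` mask (required items that no further deliberation can fill) are hypothetical; how the real system decides that an item is unsatisfiable is not specified here.

```rust
// Hypothetical sketch: Rubicon, rubicon_gate and the `blocked` mask are
// illustrative names. `blocked` is assumed to mark required items that
// no amount of re-deliberation can fill.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rubicon {
    /// Every required checklist item is covered.
    Commit,
    /// Some required items are missing but still fillable (named known-unknowns).
    Plan { missing: u32 },
    /// At least one missing required item can never be filled.
    Prune,
}

/// Maps coverage state onto the 3-way gate using bitmask tests only.
pub fn rubicon_gate(required: u64, known: u64, blocked: u64) -> Rubicon {
    let missing = required & !known;
    if missing == 0 {
        Rubicon::Commit
    } else if (missing & blocked) != 0 {
        Rubicon::Prune
    } else {
        Rubicon::Plan { missing: missing.count_ones() }
    }
}
```

No magnitude is compared anywhere: the only tunable input is the checklist itself.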
+
+**knowns vs unknowns = the enumerable-gap axis (MUL/Dunning-Kruger):** a *known-unknown* = an unchecked box you can NAME (→ Plan); the checklist makes unknowns ENUMERABLE — you can't be confidently wrong about a box you know is empty. This is reliability-as-coverage doing the epistemic-humility work the φ⁻¹ ceiling did, but structurally + cheaply.
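Enumerating the known-unknowns is a walk over the set bits of `required & !known`. A hypothetical sketch, assuming the domain checklist supplies one name per bit index; unnamed bits fall back to a positional label so that no gap is silently dropped.

```rust
// Hypothetical sketch: known_unknowns is an illustrative name. Assumes
// names[N] labels checklist bit N for the current domain.

/// Returns the names of required items that are not yet known,
/// lowest bit first.
pub fn known_unknowns(required: u64, known: u64, names: &[&str]) -> Vec<String> {
    let mut missing = required & !known;
    let mut out = Vec::new();
    while missing != 0 {
        let bit = missing.trailing_zeros() as usize; // lowest missing item
        out.push(match names.get(bit) {
            Some(name) => (*name).to_string(),
            None => format!("item#{bit}"), // unnamed bit: still reported
        });
        missing &= missing - 1; // clear the lowest set bit
    }
    out
}
```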
+
+**Honest tension (not glossed):** a checklist only covers known categories — an UNKNOWN-UNKNOWN (a required-but-UNLISTED item) is invisible to coverage. That is EXACTLY the cold-path VALIDITY gate's job (E-RELIABILITY-NOT-VALIDITY): the bring-up test (chess/Stockfish, domain ≥2) FALSIFIES the checklist — finds the box nobody listed. So coverage = cheap HOT reliability gate; Stockfish/oracle = cold validity gate that audits checklist COMPLETENESS. Complementary to the psychometric path: use checklist-coverage for the hot per-cycle gate; reserve Cronbach/ICC for offline auditing whether the checklist items themselves cohere.
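One way the cold gate could audit checklist completeness, sketched under the assumption that each validity-oracle run (e.g. the Stockfish bring-up) can report an `oracle_needed` mask over a wider item space than the checklist covers. `Episode` and `audit_completeness` are hypothetical names. The count worth watching is hot-path Commits that still lacked an unlisted item: the confident-but-wrong case that coverage alone cannot see.

```rust
// Hypothetical sketch: Episode and audit_completeness are illustrative names.
// Assumes the validity oracle reports which items a correct answer needed,
// as a mask over a wider item space than the checklist covers.

/// One cold-path run: what the hot gate saw vs. what the oracle needed.
pub struct Episode {
    pub known: u64,
    pub oracle_needed: u64,
}

/// Returns (union of needed-but-unlisted items, number of hot-path Commits
/// that still lacked one of those unlisted items).
pub fn audit_completeness(required: u64, episodes: &[Episode]) -> (u64, usize) {
    let mut unlisted = 0u64;
    let mut false_commits = 0usize;
    for ep in episodes {
        let gap = ep.oracle_needed & !required; // boxes nobody listed
        unlisted |= gap;
        let hot_commit = (required & ep.known) == required;
        if hot_commit && (gap & !ep.known) != 0 {
            false_commits += 1;
        }
    }
    (unlisted, false_commits)
}
```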
+
+**Cheap build (vs the calibration build):** add a `class_id`-keyed per-domain checklist (which rungs/items are required) + a `coverage` bitmask column on the SoA (knowns); the Rubicon gate = `required & known == required` AND-test + popcount for the Plan/gap signal. Reuses rung (exists), bitmask (N3 exists), class_id (N1, #439 hook exists). NO new float, NO thinking-engine dep: lighter than the psychometric slice. Sequence: this is the DEFAULT cheap gate; psychometric calibration is the heavier offline audit when a domain's checklist itself is in question.
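A hypothetical sketch of the `class_id`-keyed registry with HHTL inheritance, assuming inheritance is additive (a class requires its own items plus everything its ancestors require) and that `class_id` fits in a `u16`. Both the type and the parent-link representation are assumptions, not the actual shape of the #439 hook.

```rust
use std::collections::HashMap;

// Hypothetical sketch: ChecklistRegistry, declare, required and commit_ready
// are illustrative names; class_id is assumed to be a u16. Inheritance is
// assumed additive: a class requires its own items OR'd with its ancestors'.

#[derive(Default)]
pub struct ChecklistRegistry {
    own: HashMap<u16, u64>,    // items each class declares itself
    parent: HashMap<u16, u16>, // HHTL parent link; roots have no entry
}

impl ChecklistRegistry {
    /// Declare (or redeclare) a class's own required items and its parent.
    pub fn declare(&mut self, class_id: u16, parent: Option<u16>, own_items: u64) {
        self.own.insert(class_id, own_items);
        match parent {
            Some(p) => {
                self.parent.insert(class_id, p);
            }
            None => {
                self.parent.remove(&class_id);
            }
        }
    }

    /// Effective required mask: OR of own items along the path to the root.
    pub fn required(&self, class_id: u16) -> u64 {
        let mut mask = 0u64;
        let mut cur = Some(class_id);
        // An acyclic path visits at most parent.len() + 1 classes; the bound
        // also stops a malformed (cyclic) parent map from looping forever.
        for _ in 0..=self.parent.len() {
            let Some(c) = cur else { break };
            mask |= self.own.get(&c).copied().unwrap_or(0);
            cur = self.parent.get(&c).copied();
        }
        mask
    }

    /// The Rubicon Commit test against the inherited checklist.
    pub fn commit_ready(&self, class_id: u16, known: u64) -> bool {
        let required = self.required(class_id);
        (required & known) == required
    }
}
```

Resolving the inherited mask on every call keeps the sketch short; a hot path would more plausibly cache `required(class_id)` per class, since the checklist changes far less often than the coverage column.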
+
+**Cross-ref:** E-RELIABILITY-NOT-VALIDITY (reliability vs validity split — this is the cheap reliability gate, Stockfish stays the validity gate); E-CALIBRATE-RELIABILITY-PSYCHOMETRICALLY (the heavier alternative this undercuts for the hot path); E-LADDER-SERVES-MAILBOX (the rung doctrine); cognitive-risc-classes N1 (class_id) + N3 (stable per-class presence bitmask) + the HHTL-inherited checklist; `recipe_kernels.rs:36 ThoughtCtx.rung`; `recipes::Recipe.tier`; MUL/Dunning-Kruger (known-unknown enumeration); iron-rule-savant VIOLATES-I-NOISE-FLOOR-JIRAK (dissolved, not calibrated).
+
+---
+
 ## 2026-05-30 — E-CALIBRATE-RELIABILITY-PSYCHOMETRICALLY — replace the hand-tuned Rubicon (f,c)/SD thresholds with MEASURED psychometric reliability (Cronbach α / ICC / Spearman / Pearson) — the existing crates, applied brutally to the gate
 
 **Status:** FINDING + BUILD-DIRECTION (user 2026-05-30: "be brutal and use psychometry calibration"). Follows directly from E-RELIABILITY-NOT-VALIDITY: if (f,c) is a RELIABILITY coefficient, calibrate it with real reliability statistics, don't hand-tune it. Resolves the iron-rule-savant's VIOLATES-I-NOISE-FLOOR-JIRAK (uncited 0.2/0.8/0.15/0.35 thresholds).