|
| 1 | +## 2026-05-30 — E-RELIABILITY-IS-CHECKLIST-COVERAGE — the cheap RISC alternative to psychometric calibration: reliability = (required rungs/checklist-items COVERED) over a knowns/unknowns SoA bitmask, AND-tested in one cycle. No float, no corpus. |
| 2 | + |
| 3 | +**Status:** FINDING + BUILD-DIRECTION (user 2026-05-30: "cheap and efficient alternative — 10-layer rungs ladder + a checklist per domain of what needs evaluation; reasoning has a normalized set of info with validation across knowns/unknowns as SoA"). The cheap structural alternative to E-CALIBRATE-RELIABILITY-PSYCHOMETRICALLY — complementary, not competing. |
| 4 | + |
| 5 | +**The reframe:** reliability becomes STRUCTURAL + PRIOR, not a post-hoc statistic. Instead of measuring Cronbach α over a corpus, make it COVERAGE of a normalized evaluation set: |
| 6 | +- **10-rung ladder = the normalized DEPTH axis.** `pearl_rung: u8` (1..=9, +0) ALREADY exists on `ThoughtCtx` (recipe_kernels.rs:36-37), proprioception, world_map, cognitive_shader (0..9), SPO triplet, CausalEdge64; `recipes::Recipe.tier` = Sun et al. reasoning-ladder difficulty. Doctrine: E-LADDER-SERVES-MAILBOX. The ladder is real + threaded; this REUSES it. |
| 7 | +- **Per-domain checklist = the EVALUATION axis (x).** Each domain (class_id) declares WHICH rungs/items must be evaluated. The checklist is `class_id`-keyed and inherited along the HHTL path (like labels/columns/templates — the cognitive-risc-classes triangle), so domains don't hand-roll it. |
| 8 | +- **knowns/unknowns as SoA = a presence bitmask** (= cognitive-risc-classes N3 "stable per-class bitmask, append-only, bit=field-N-populated"). Reliability-to-Commit = `required & present == required` — a SIMD batch-AND popcount over the SoA column, ONE cycle (the 0xFFF/facet-AND efficiency). |
| 9 | + |
| 10 | +**Why it's better-fit than calibration here:** (1) NO float, NO offline corpus, NO calibration pass — just bitmask coverage. (2) DISSOLVES the threshold problem the iron-rule-savant flagged: no 0.2/0.8/0.15/0.35 to calibrate OR Jirak-bound — the Rubicon 3-way maps onto COVERAGE STATE not a magnitude: Commit = required checklist covered; Plan = named known-UNKNOWNS remain (re-deliberate to fill them); Prune = checklist cannot be satisfied. (3) It's the `class_id`→checklist projection — one more payoff off the discriminator the SoA already needs (N1). |
| 11 | + |
| 12 | +**knowns vs unknowns = the enumerable-gap axis (MUL/Dunning-Kruger):** a *known-unknown* = an unchecked box you can NAME (→ Plan); the checklist makes unknowns ENUMERABLE — you can't be confidently-wrong about a box you know is empty. This is reliability-as-coverage doing the epistemic-humility work the φ⁻¹ ceiling did, but structurally + cheaply. |
| 13 | + |
| 14 | +**Honest tension (not glossed):** a checklist only covers known categories — an UNKNOWN-UNKNOWN (a required-but-UNLISTED item) is invisible to coverage. That is EXACTLY the cold-path VALIDITY gate's job (E-RELIABILITY-NOT-VALIDITY): the bring-up test (chess/Stockfish, domain ≥2) FALSIFIES the checklist — finds the box nobody listed. So coverage = cheap HOT reliability gate; Stockfish/oracle = cold validity gate that audits checklist COMPLETENESS. Complementary to the psychometric path: use checklist-coverage for the hot per-cycle gate; reserve Cronbach/ICC for offline auditing whether the checklist items themselves cohere. |
| 15 | + |
| 16 | +**Cheap build (vs the calibration build):** add a `class_id`-keyed per-domain checklist (which rungs/items required) + a `coverage` bitmask column on the SoA (knowns); the Rubicon gate = `required & known == required` AND-test + popcount for the Plan/gap signal. Reuses rung (exists), bitmask (N3 exists), class_id (N1, #439 hook exists). NO new float, NO thinking-engine dep — lighter than the psychometric slice. Sequence: this is the DEFAULT cheap gate; psychometric calibration is the heavier offline audit when a domain's checklist itself is in question. |
| 17 | + |
| 18 | +**Cross-ref:** E-RELIABILITY-NOT-VALIDITY (reliability vs validity split — this is the cheap reliability gate, Stockfish stays the validity gate); E-CALIBRATE-RELIABILITY-PSYCHOMETRICALLY (the heavier alternative this undercuts for the hot path); E-LADDER-SERVES-MAILBOX (the rung doctrine); cognitive-risc-classes N1 (class_id) + N3 (stable per-class presence bitmask) + the HHTL-inherited checklist; `recipe_kernels.rs:36 ThoughtCtx.rung`; `recipes::Recipe.tier`; MUL/Dunning-Kruger (known-unknown enumeration); iron-rule-savant VIOLATES-I-NOISE-FLOOR-JIRAK (dissolved, not calibrated). |
| 19 | + |
| 20 | +--- |
| 21 | + |
1 | 22 | ## 2026-05-30 — E-CALIBRATE-RELIABILITY-PSYCHOMETRICALLY — replace the hand-tuned Rubicon (f,c)/SD thresholds with MEASURED psychometric reliability (Cronbach α / ICC / Spearman / Pearson) — the existing crates, applied brutally to the gate |
2 | 23 |
|
3 | 24 | **Status:** FINDING + BUILD-DIRECTION (user 2026-05-30: "be brutal and use psychometry calibration"). Follows directly from E-RELIABILITY-NOT-VALIDITY: if (f,c) is a RELIABILITY coefficient, calibrate it with real reliability statistics, don't hand-tune it. Resolves the iron-rule-savant's VIOLATES-I-NOISE-FLOOR-JIRAK (uncited 0.2/0.8/0.15/0.35 thresholds). |
|
0 commit comments