@@ -1624,22 +1624,49 @@ K2.A is purely an interface-and-codec swap, not a re-architecting.
16241624
16251625#### 11.11.5 K2.A acceptance gates
16261626
1627- K2.A merge requires evidence of all three:
1628-
1629- 1 . ** Round-trip identity** : per-tensor numerical: ` ‖decompress(compress(K, V)) - (K, V)‖ / ‖(K, V)‖ ` within KakeyaLattice's published fidelity envelope (per layer per head). Linux unit-test gate; deterministic on synthetic K/V.
1630- 2 . ** No quality regression** : K1.E NIAH harness, same Gemma 3-1B
1631- identity-projection setup, KL on vs KL off:
1627+ K2.A is staged across two PRs (see §11.11.12 below for the
1628+ staging rationale). K2.A.1 ships the stateless KL plumbing,
1629+ which makes gates (a) and (b) testable. K2.A.2 ships the stateful
1630+ caching, which makes gate (c) testable. The two PRs together
1631+ constitute full K2.A acceptance.
1632+
1633+ 1. **Round-trip identity (gate a), testable in K2.A.1.**
1634+ Per-tensor relative error `‖decompress(compress(K, V)) - (K, V)‖ /
1635+ ‖(K, V)‖` must fall within KakeyaLattice's published fidelity envelope
1636+ (per layer per head). Linux unit-test gate; deterministic on
1637+ synthetic K/V. The Mac M4 platform-specific calibrated bound is
1638+ 1.5e-3 per §11.11.9; the CUDA reference is 3e-5 (the published
1639+ KL CUDA envelope).
1640+ 2. **No quality regression (gate b), testable in K2.A.1.**
1641+ K1.E NIAH harness, same Gemma 3-1B identity-projection setup,
1642+ KL on vs KL off:
16321643 * recall(KL on) ≥ recall(KL off) − 1pp at every context rung
16331644 in §11.12 ladder (1.4k, 5.6k, 22k, 64k, 100k).
16341645 * `effective_attention_fraction` from K1.H schema: identical
16351646 between KL on and KL off (KL is structurally invisible to
16361647 the attention-mask path).
1637- 3 . ** Throughput improvement** : K1.I throughput metric (schema v4):
1648+ * Mac M4 escape hatch: if recall regresses on Mac specifically,
1649+ tighten Q (e.g. Q=76 instead of Q=38: +1 bit/coord, which halves
1650+ the lattice-quantisation error per §11.11.9). Do NOT fail
1651+ K2.A on Mac platform-specific fidelity issues; such a regression
1652+ is a Q parameter sweep problem, not an architectural failure.
1653+ 3. **Throughput improvement (gate c), testable in K2.A.2 only.**
1654+ K1.I throughput metric (schema v4):
16381655 * `mean_throughput_tokens_per_sec(KL on) / mean_throughput_tokens_per_sec(KL off) ≥ 1.3` at the 22k+ rungs of the §11.12 ladder.
16391656 * The 1.3× floor is conservative; theoretical upper bound is
16401657 the inverse of the KL-on eviction rate, which approaches the
16411658 full-attention oracle's throughput as the local cache grows
16421659 to cover most of T.
1660+ * **K2.A.1 NOTE**: stateless KL plumbing (compress + decompress
1661+ per forward step, no cross-step caching) does not target gate
1662+ (c). Throughput on K2.A.1 with KL on is expected to be the same
1663+ as or slower than KL off: the round-trip overhead is paid each
1664+ step with no caching savings to amortise it. Gate (c) is
1665+ architecturally bound to the K2.A.2 stateful caching design
1666+ (DLMRestoredVerifier maintains compressed K/V across decode
1667+ steps so the verifier's per-step forward becomes O(window)
1668+ instead of O(T)). The K2.A.1 measurement at gate (c) is the
1669+ **baseline** K2.A.2 will be measured against.
16431670
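The three gates above reduce to simple numeric checks. The following is a minimal sketch under stated assumptions: the function names, the integer token-count rung keys, and recall expressed in percent are all illustrative choices, not the K1.E / K1.I harness API. Gate (a) is shown on one flattened tensor; per the text it would be applied per layer per head.

```python
import math
from typing import Dict, Sequence


def relative_round_trip_error(original: Sequence[float],
                              restored: Sequence[float]) -> float:
    """Return ||restored - original|| / ||original|| (Euclidean norms)."""
    num = math.sqrt(sum((r - o) ** 2 for o, r in zip(original, restored)))
    den = math.sqrt(sum(o ** 2 for o in original))
    return num / den


def gate_a(original: Sequence[float], restored: Sequence[float],
           bound: float) -> bool:
    """Round-trip identity: relative error within the platform bound
    (1.5e-3 on Mac M4, 3e-5 on CUDA, per the text above)."""
    return relative_round_trip_error(original, restored) <= bound


def gate_b(recall_on: Dict[int, float], recall_off: Dict[int, float],
           tolerance_pp: float = 1.0) -> bool:
    """No quality regression: recall(KL on) >= recall(KL off) - 1pp
    at every rung. Recall values are in percent (an assumption)."""
    return all(recall_on[rung] >= recall_off[rung] - tolerance_pp
               for rung in recall_off)


def gate_c(tps_on: Dict[int, float], tps_off: Dict[int, float],
           floor: float = 1.3, min_rung_tokens: int = 22_000) -> bool:
    """Throughput improvement: on/off ratio >= 1.3 at the 22k+ rungs.
    Rung keys are nominal token counts (an assumption)."""
    rungs = [rung for rung in tps_off if rung >= min_rung_tokens]
    return all(tps_on[rung] / tps_off[rung] >= floor for rung in rungs)
```

Keeping the three checks as separate predicates mirrors the staging: a K2.A.1 run would evaluate `gate_a` and `gate_b` only, and record the `gate_c` inputs as the baseline.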
16441671#### 11.11.6 K2.B (was K2): cross-model `f_θ` trained against KL-on cache
16451672
@@ -2102,6 +2129,104 @@ samples reproduces a 64k dip on the same model family, that
21022129becomes a base-model finding to escalate to the model provider,
21032130not a v0.4 finding.
21042131
2132+ #### 11.11.12 K2.A staging: K2.A.1 stateless plumbing vs K2.A.2 stateful caching
2133+
2134+ Added 2026-06-09 alongside the K2.A.1 implementation PR. K2.A
2135+ acceptance gates (§11.11.5 above) are now staged across two PRs
2136+ because they require structurally different engineering work:
2137+
2138+ **K2.A.1 (stateless KL plumbing; code change scope: ~150 LOC
2139+ in `inference_engine/v04/dlm_restored_verifier.py` + reviewer
2140+ scripts + tests).** What it delivers:
2141+
2142+ * `DLMRestoredVerifier.__init__` accepts a `kv_compressor_factory`
2143+ parameter. The default `None` preserves K1 behaviour bit-for-bit
2144+ via `IdentityCompressor`. When provided, the factory is invoked
2145+ once per attention module **per forward call** (i.e. on every decode
2146+ step) to construct a fresh per-layer compressor instance. State
2147+ is therefore reset between decode steps, so there is no
2148+ cross-step amortisation.
2149+ * `_restored_attention_forward` calls a new
2150+ `_round_trip_resident_through_compressor` helper after the K/V
2151+ merge step. The helper compresses K/V at resident-window
2152+ positions through the per-layer compressor, then immediately
2153+ decompresses. K/V at evicted positions (reconstructed from the
2154+ proposer per §11.11.2) are NOT routed through the codec.
2155+ * The `_LayerRestorationContext` dataclass gains
2156+ `resident_positions: List[int]` and `compressor: KVCompressor`
2157+ fields, threaded through `_restoration_active`.
2158+ * The K1.E NIAH runner (`scripts/research/k1e_niah_validation.py`)
2159+ gains `--kl-on / --kl-lattice / --kl-q-range` flags. The JSON
2160+ schema bumps 4 → 5 to record the KL config block.
2161+ * Reviewer scripts:
2162+ - `scripts/review_pr_k2a1_integration_on_vast.sh`: vast.ai
2163+ CUDA A/B at the §11.12 ladder.
2164+ - `scripts/review_pr_k2a1_integration_on_mac.sh`: Mac M4
2165+ (PyTorch MPS) A/B at the small-end §11.12 rungs (1.4k +
2166+ 5.6k by default).
2167+
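The stateless round trip described in the list above can be sketched as follows. This is a simplified illustration in plain Python, not the code in `dlm_restored_verifier.py`: the `compress` / `decompress` signatures, the `RoundingCompressor` toy codec (a stand-in for KL), and the `forward_step` driver are assumptions made for the example. Only the names `kv_compressor_factory`, `IdentityCompressor`, `KVCompressor`, `resident_positions`, and `compressor` come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Protocol

Vector = List[float]


class KVCompressor(Protocol):
    # Assumed interface; the real KVCompressor signature may differ.
    def compress(self, kv: List[Vector]) -> object: ...
    def decompress(self, blob: object) -> List[Vector]: ...


class IdentityCompressor:
    """Lossless pass-through, standing in for the K1 default."""
    def compress(self, kv: List[Vector]) -> object:
        return [list(v) for v in kv]

    def decompress(self, blob: object) -> List[Vector]:
        return [list(v) for v in blob]


class RoundingCompressor:
    """Hypothetical lossy codec: uniform scalar quantiser with Q levels
    per unit. A toy stand-in for KL, not the lattice codec itself."""
    def __init__(self, q: int) -> None:
        self.q = q

    def compress(self, kv: List[Vector]) -> object:
        return [[round(x * self.q) for x in v] for v in kv]

    def decompress(self, blob: object) -> List[Vector]:
        return [[c / self.q for c in v] for v in blob]


@dataclass
class LayerRestorationContext:
    resident_positions: List[int]
    compressor: KVCompressor


def round_trip_resident(kv: List[Vector],
                        ctx: LayerRestorationContext) -> List[Vector]:
    """Compress then immediately decompress K/V at resident positions.
    Evicted positions bypass the codec and are copied unchanged."""
    resident = [kv[p] for p in ctx.resident_positions]
    restored = ctx.compressor.decompress(ctx.compressor.compress(resident))
    out = [list(v) for v in kv]
    for position, vector in zip(ctx.resident_positions, restored):
        out[position] = vector
    return out


def forward_step(kv_per_layer: List[List[Vector]],
                 resident_positions: List[int],
                 kv_compressor_factory: Optional[Callable[[], KVCompressor]] = None,
                 ) -> List[List[Vector]]:
    """One decode step: a fresh compressor per layer per call, so no
    state survives to the next step (the stateless property)."""
    factory = kv_compressor_factory or IdentityCompressor
    outputs = []
    for kv in kv_per_layer:
        ctx = LayerRestorationContext(resident_positions, factory())
        outputs.append(round_trip_resident(kv, ctx))
    return outputs
```

Because the factory is called inside the per-layer loop of every `forward_step`, nothing is cached between steps, which is why this stage pays the codec cost on each decode step without any offsetting saving.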
2168+ K2.A.1 acceptance gates (per §11.11.5 above): **gate (a)
2169+ round-trip identity** is closed by the K2.A.0 Mac smoke
2170+ (`3536e57`) plus a CUDA-equivalent reference check that lands
2171+ with K2.A.1's first vast run. **Gate (b) recall delta ≤ 1pp**
2172+ is the K2.A.1 binding signal: A/B at every §11.12 rung must
2173+ show recall(KL on) no more than 1pp below recall(KL off). **Gate (c)
2174+ throughput improvement** is OUT OF K2.A.1's scope; the
2175+ stateless plumbing's per-step compress+decompress cost has no
2176+ caching offset to amortise it, so gate (c) is expected to fail
2177+ at K2.A.1.
2178+
2179+ **K2.A.2 (stateful caching; future PR; code change scope:
2180+ ~500–1000 LOC, refactor of DLMRestoredVerifier across forwards).**
2181+ What it must deliver:
2182+
2183+ * `DLMRestoredVerifier` becomes session-stateful: compressors
2184+ (one per layer) are created at session start and persist
2185+ across `forward()` calls. Resident K/V at sink+window slots
2186+ are compressed once when produced and reused on subsequent
2187+ decode steps via `decompress`. Each new decode step adds 1 token
2188+ to the cache; positions leaving the window are evicted via
2189+ `compressor.evict`.
2190+ * Verifier forward over `[1, T]` becomes verifier forward over
2191+ `[1, 1]` (the new query position only) plus a K/V assembly
2192+ step that decompresses the resident cache and merges it with
2193+ proposer-restored evicted K/V. The proposer still runs O(T)
2194+ per step (no proposer cache by §11.3 design); the verifier
2195+ drops to O(1) per step. Net per-step cost goes from
2196+ O(T)_proposer + O(T)_verifier (K1.D + K2.A.1) to
2197+ O(T)_proposer + O(1)_verifier (K2.A.2).
2198+ * This is what closes gate (c). At the 100k rung, K1.F evidence
2199+ (`aab8686`) shows v0.4 running at 0.53× the oracle's speed (1.9×
2200+ slower than oracle); gate (c) requires ≥0.6×, i.e. K2.A.2
2201+ must yield ≥1.13× (0.6 / 0.53) over K2.A.1's stateless baseline. The
2202+ theoretical upper bound is approximately the proposer/verifier
2203+ cost ratio at long context, typically 1.5–2×, so the ≥1.13× target
2204+ is within reach of the K2.A.2 stateful design.
2205+ * K2.A.2 also closes the §11.11.9 sustained-memory empirical gap
2206+ by introducing a **persistent** compressed cache whose size IS
2207+ the architecturally-meaningful "v0.4 sustained working set",
2208+ visible to CUDA `peak_allocated_bytes` measurement in K1.G's
2209+ schema.
2210+
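The session-stateful cache described in the first bullet above (compress once, reuse via `decompress`, evict on leaving the window) can be sketched as follows. This is a toy model under stated assumptions: positions arrive sequentially from 0, the codec is a hypothetical scalar quantiser rather than KL, and the `ResidentCache` class and the per-position `compress` / `decompress` / `evict` signatures are invented for the example. Only the method name `evict` and the sink+window layout come from the text.

```python
from typing import Dict, List


class ToyStatefulCompressor:
    """Hypothetical persistent codec: one compressed entry per position.
    Counts compress calls so the 'compressed once' property is visible."""
    def __init__(self, q: int = 64) -> None:
        self.q = q
        self.store: Dict[int, List[int]] = {}
        self.compress_calls = 0

    def compress(self, position: int, vector: List[float]) -> None:
        self.compress_calls += 1
        self.store[position] = [round(x * self.q) for x in vector]

    def decompress(self, position: int) -> List[float]:
        return [c / self.q for c in self.store[position]]

    def evict(self, position: int) -> None:
        self.store.pop(position, None)


class ResidentCache:
    """Keeps the first n_sink positions plus a sliding window of the
    most recent `window` positions, for one layer."""
    def __init__(self, compressor: ToyStatefulCompressor,
                 n_sink: int, window: int) -> None:
        self.compressor = compressor
        self.n_sink = n_sink
        self.window = window

    def append(self, position: int, vector: List[float]) -> None:
        """Compress the new token once, then evict the position that
        just left the window (sink positions are never evicted)."""
        self.compressor.compress(position, vector)
        leaving = position - self.window
        if leaving >= self.n_sink:
            self.compressor.evict(leaving)

    def resident_positions(self) -> List[int]:
        return sorted(self.compressor.store)

    def assemble(self) -> Dict[int, List[float]]:
        """Decompress the resident K/V for the current decode step."""
        return {p: self.compressor.decompress(p)
                for p in self.resident_positions()}
```

For example, with `n_sink=2` and `window=3`, appending positions 0 through 7 leaves positions 0, 1, 5, 6, 7 resident after exactly eight `compress` calls; each later decode step only decompresses. Evicted positions would be supplied by the proposer path, which this sketch leaves out.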
2211+ **Why split.** Stateless plumbing first lets us validate the
2212+ correctness contract (gate b) before committing to the stateful
2213+ caching design. K2.A.1 makes the integration risk concrete:
2214+ "does KL on every forward break recall?" If the answer is no,
2215+ K2.A.2 can pursue throughput aggressively without simultaneously
2216+ defending correctness. If the answer is yes (recall regresses
2217+ even with stateless KL), the failure mode is isolated to the
2218+ codec composition: the Q-sweep escape hatch (§11.11.9) applies and
2219+ K2.A.2 is unblocked once K2.A.1 finds a working Q. Either way,
2220+ the staging makes each phase's risk diagnosable in isolation.
2221+
2222+ **What K2.A.2 specifically must NOT do.** K2.A.2 is a stateful
2223+ caching refactor, not a new architectural variant. The §11.11.3
2224+ two-path K/V sourcing model (resident → KL-decompress, evicted
2225+ → dLM → f_θ) remains. The only structural change is moving the
2226+ resident cache from "computed per forward" (K1.D, K2.A.1) to
2227+ "persisted across forwards" (K2.A.2). f_θ remains identity in
2228+ the K2.A.{1,2} same-model setup; cross-model f_θ is K2.B.
2229+
21052230### 11.12 Canonical empirical ladder (recall × rung × platform)
21062231
21072232Reference matrix for the K1 multi-source baseline.