Commit af0fe20

Merge pull request #83 from FluffyAIcode/AgentMemory/v04-pr-k2a1-kl-integration-8e7f
PR-K2.A.1: stateless KakeyaLattice integration into DLMRestoredVerifier (Mac + CUDA)
2 parents e2db26c + 55dbed4 commit af0fe20

7 files changed

Lines changed: 1086 additions & 18 deletions

docs/adr/0008-session-bound-runtime-and-grpc-protocol.md

Lines changed: 131 additions & 6 deletions
@@ -1624,22 +1624,49 @@ K2.A is purely an interface-and-codec swap, not a re-architecting.

 #### 11.11.5 K2.A acceptance gates

-K2.A merge requires evidence of all three:
-
-1. **Round-trip identity**: per-tensor numerical: `‖decompress(compress(K, V)) - (K, V)‖ / ‖(K, V)‖` within KakeyaLattice's published fidelity envelope (per layer per head). Linux unit-test gate; deterministic on synthetic K/V.
-2. **No quality regression**: K1.E NIAH harness, same Gemma 3-1B
-identity-projection setup, KL on vs KL off:
+K2.A is staged across two PRs (see §11.11.12 below for the
+staging rationale). K2.A.1 ships the stateless KL plumbing —
+gates (a) and (b) testable. K2.A.2 ships the stateful caching —
+gate (c) testable. Both PRs together comprise full K2.A
+acceptance.
+
+1. **Round-trip identity (gate a) — testable in K2.A.1**.
+Per-tensor numerical: `‖decompress(compress(K, V)) - (K, V)‖ /
+‖(K, V)‖` within KakeyaLattice's published fidelity envelope
+(per layer per head). Linux unit-test gate; deterministic on
+synthetic K/V. Mac M4 platform-specific calibrated bound is
+1.5e-3 per §11.11.9; CUDA reference is 3e-5 (the published
+KL CUDA envelope).
+2. **No quality regression (gate b) — testable in K2.A.1**.
+K1.E NIAH harness, same Gemma 3-1B identity-projection setup,
+KL on vs KL off:
 * recall(KL on) ≥ recall(KL off) − 1pp at every context rung
 in §11.12 ladder (1.4k, 5.6k, 22k, 64k, 100k).
 * `effective_attention_fraction` from K1.H schema: identical
 between KL on and KL off (KL is structurally invisible to
 the attention-mask path).
-3. **Throughput improvement**: K1.I throughput metric (schema v4):
+* Mac M4 escape hatch: if recall regresses on Mac specifically,
+tighten Q (e.g. Q=76 instead of Q=38, +1 bit/coord, halves
+the lattice-quantisation error per §11.11.9). Do NOT fail
+K2.A on Mac platform-specific fidelity issues — it's a Q
+parameter sweep, not an architectural failure.
+3. **Throughput improvement (gate c) — testable in K2.A.2 only**.
+K1.I throughput metric (schema v4):
 * `mean_throughput_tokens_per_sec(KL on) / mean_throughput_tokens_per_sec(KL off) ≥ 1.3` at the 22k+ rungs of the §11.12 ladder.
 * The 1.3× floor is conservative; theoretical upper bound is
 the inverse of the KL-on eviction rate, which approaches the
 full-attention oracle's throughput as the local cache grows
 to cover most of T.
+* **K2.A.1 NOTE**: stateless KL plumbing (compress + decompress
+per forward step, no cross-step caching) does not target gate
+(c). Throughput on K2.A.1 with KL on is expected to be SAME
+OR SLOWER than KL off — the round-trip overhead is paid each
+step with no caching savings to amortise it. Gate (c) is
+architecturally bound to the K2.A.2 stateful caching design
+(DLMRestoredVerifier maintains compressed K/V across decode
+steps so the verifier's per-step forward becomes O(window)
+instead of O(T)). K2.A.1 evidence at gate (c) is the
+**baseline** K2.A.2 will be measured against.
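The three gates in §11.11.5 reduce to simple numeric checks. The following is a minimal sketch of those checks, assuming flat lists of floats for the tensors and plain dicts keyed by rung for the harness results. The function names, argument shapes, and the `LADDER` constant are illustrative assumptions, not the repository's test harness.

```python
import math

# Rungs of the §11.12 ladder in tokens (1.4k, 5.6k, 22k, 64k, 100k).
LADDER = [1_400, 5_600, 22_000, 64_000, 100_000]


def relative_error(original, restored):
    """||restored - original|| / ||original|| over flat lists of floats."""
    diff_sq = sum((r - o) ** 2 for o, r in zip(original, restored))
    norm_sq = sum(o ** 2 for o in original)
    return math.sqrt(diff_sq) / math.sqrt(norm_sq)


def gate_a(original, restored, bound):
    """Gate (a): round-trip error within the platform bound.

    The text above gives 1.5e-3 for Mac M4 and 3e-5 for CUDA.
    """
    return relative_error(original, restored) <= bound


def gate_b(recall_on, recall_off, tolerance_pp=1.0):
    """Gate (b): recall(KL on) >= recall(KL off) - 1pp at every rung.

    Recall values are percentages, keyed by rung (hypothetical layout).
    """
    return all(recall_on[r] >= recall_off[r] - tolerance_pp for r in LADDER)


def gate_c(tps_on, tps_off, floor=1.3, min_rung=22_000):
    """Gate (c): throughput ratio KL on / KL off >= 1.3 at the 22k+ rungs."""
    return all(tps_on[r] / tps_off[r] >= floor for r in LADDER if r >= min_rung)
```

With equal throughput for KL on and KL off, which is the best case the K2.A.1 note expects from stateless plumbing, `gate_c` returns `False`. That is consistent with gate (c) being deferred to K2.A.2.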

 #### 11.11.6 K2.B (was K2): cross-model `f_θ` trained against KL-on cache

@@ -2102,6 +2129,104 @@ samples reproduces a 64k dip on the same model family, that
 becomes a base-model finding to escalate to the model provider,
 not a v0.4 finding.

+#### 11.11.12 K2.A staging: K2.A.1 stateless plumbing vs K2.A.2 stateful caching
+
+Added 2026-06-09 alongside the K2.A.1 implementation PR. K2.A
+acceptance gates (§11.11.5 above) are now staged across two PRs
+because they require structurally different engineering work:
+
+**K2.A.1 (stateless KL plumbing — code change scope: ~150 LOC
+in `inference_engine/v04/dlm_restored_verifier.py` + reviewer
+scripts + tests).** What it delivers:
+
+* `DLMRestoredVerifier.__init__` accepts a `kv_compressor_factory`
+parameter. Default `None` preserves K1 behaviour bit-for-bit
+via `IdentityCompressor`. When provided, the factory is invoked
+once per attention module **per forward call** (= every decode
+step) to construct a fresh per-layer compressor instance. State
+is therefore reset between decode steps — there is no
+cross-step amortisation.
+* `_restored_attention_forward` calls a new
+`_round_trip_resident_through_compressor` helper after the K/V
+merge step. The helper compresses K/V at resident-window
+positions through the per-layer compressor, then immediately
+decompresses. K/V at evicted positions (reconstructed from the
+proposer per §11.11.2) are NOT routed through the codec.
+* The `_LayerRestorationContext` dataclass gains
+`resident_positions: List[int]` and `compressor: KVCompressor`
+fields, threaded through `_restoration_active`.
+* The K1.E NIAH runner (`scripts/research/k1e_niah_validation.py`)
+gains `--kl-on / --kl-lattice / --kl-q-range` flags. JSON
+schema bumps 4 → 5 to record the KL config block.
+* Reviewer scripts:
+- `scripts/review_pr_k2a1_integration_on_vast.sh` — vast.ai
+CUDA A/B at the §11.12 ladder.
+- `scripts/review_pr_k2a1_integration_on_mac.sh` — Mac M4
+(PyTorch MPS) A/B at the small-end §11.12 rungs (1.4k +
+5.6k by default).
+
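The stateless round trip described in the K2.A.1 bullets can be sketched as follows. This is a toy model under stated assumptions, not the code in `dlm_restored_verifier.py`: the `compress`/`decompress` method signatures, the `ToyQuantizer` stand-in codec, and the `round_trip_resident` helper are hypothetical, and K/V is modelled as a plain list of per-position rows rather than a tensor.

```python
class IdentityCompressor:
    """Hypothetical no-op codec: the round trip returns the input unchanged."""

    def compress(self, rows):
        return [list(row) for row in rows]

    def decompress(self, packed):
        return [list(row) for row in packed]


class ToyQuantizer:
    """Stand-in lossy codec (uniform rounding). NOT the KakeyaLattice codec."""

    def __init__(self, step=0.25):
        self.step = step

    def compress(self, rows):
        return [[round(x / self.step) for x in row] for row in rows]

    def decompress(self, packed):
        return [[q * self.step for q in row] for row in packed]


def round_trip_resident(kv, resident_positions, factory=None):
    """Compress then immediately decompress the resident rows of `kv`.

    A fresh compressor is built on every call, mirroring the per-forward
    (stateless) construction described above. Rows at evicted positions
    are copied through untouched.
    """
    compressor = factory() if factory is not None else IdentityCompressor()
    resident_rows = [kv[pos] for pos in resident_positions]
    restored = compressor.decompress(compressor.compress(resident_rows))
    out = [list(row) for row in kv]
    for pos, row in zip(resident_positions, restored):
        out[pos] = row
    return out
```

Calling `round_trip_resident(kv, [0, 2])` with no factory leaves `kv` unchanged, matching the "default `None` preserves K1 behaviour" contract. Passing `factory=ToyQuantizer` perturbs only rows 0 and 2.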
+K2.A.1 acceptance gates (per §11.11.5 above): **gate (a)
+round-trip identity** is closed by the K2.A.0 Mac smoke
+(`3536e57`) plus a CUDA-equivalent reference check that lands
+with K2.A.1's first vast run. **Gate (b) recall delta ≤ 1pp**
+is the K2.A.1 binding signal: A/B at every §11.12 rung must
+show recall(KL on) within 1 pp of recall(KL off). **Gate (c)
+throughput improvement** is OUT OF K2.A.1's scope; the
+stateless plumbing's per-step compress+decompress cost has no
+caching offset to amortise it, so gate (c) is expected to fail
+at K2.A.1.
+
+**K2.A.2 (stateful caching — future PR; code change scope:
+~500–1000 LOC, refactor of DLMRestoredVerifier across forwards).**
+What it must deliver:
+
+* `DLMRestoredVerifier` becomes session-stateful: compressors
+(one per layer) are created at session start and persist
+across `forward()` calls. Resident K/V at sink+window slots
+are compressed once when produced and reused on subsequent
+decode steps via `decompress`. New decode steps add 1 token
+to the cache; positions leaving the window are evicted via
+`compressor.evict`.
+* Verifier forward over `[1, T]` becomes verifier forward over
+`[1, 1]` (the new query position only) plus a K/V assembly
+step that decompresses the resident cache and merges with
+proposer-restored evicted K/V. The proposer still runs O(T)
+per step (no proposer cache by §11.3 design); the verifier
+drops to O(1) per step. Net per-step cost goes from
+O(T)_proposer + O(T)_verifier (K1.D + K2.A.1) to
+O(T)_proposer + O(1)_verifier (K2.A.2).
+* This is what closes gate (c). At the 100k rung, K1.F evidence
+(`aab8686`) shows v0.4 / oracle latency ratio = 0.53× (1.9×
+slower than oracle); gate (c) requires ≥0.6× — i.e. K2.A.2
+must yield ≥1.13× over K2.A.1's stateless baseline. The
+theoretical upper bound is approximately the proposer/verifier
+cost ratio at long context, typically 1.5–2× — within reach of
+the K2.A.2 stateful design.
+* K2.A.2 also closes the §11.11.9 sustained-memory empirical gap
+by introducing a **persistent** compressed cache whose size IS
+the architecturally-meaningful "v0.4 sustained working set",
+visible to CUDA `peak_allocated_bytes` measurement in K1.G's
+schema.
+
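The amortisation argument for K2.A.2 can be made concrete by counting compress calls in a toy model. The sketch below assumes a sink+window residency policy and a pass-through codec that only counts calls. `CountingCodec`, `StatefulCache`, and the dict-based eviction are hypothetical stand-ins, and the dict `pop` merely takes the place of the `compressor.evict` call named above, whose signature is not shown in this diff.

```python
class CountingCodec:
    """Hypothetical pass-through codec that only counts compress calls."""

    def __init__(self):
        self.compress_calls = 0

    def compress(self, row):
        self.compress_calls += 1
        return row

    def decompress(self, packed):
        return packed


def resident_set(t, sink, window):
    """Resident positions at sequence length t: first `sink` plus last `window`."""
    return sorted(set(range(min(sink, t))) | set(range(max(0, t - window), t)))


def stateless_compress_calls(steps, sink, window):
    """K2.A.1 style: a fresh codec per step re-compresses every resident position."""
    total = 0
    for t in range(1, steps + 1):
        codec = CountingCodec()
        for pos in resident_set(t, sink, window):
            codec.compress(pos)
        total += codec.compress_calls
    return total


class StatefulCache:
    """K2.A.2 style: one persistent codec, each position compressed once."""

    def __init__(self, sink, window):
        self.sink = sink
        self.window = window
        self.codec = CountingCodec()
        self.store = {}

    def append(self, pos, row):
        self.store[pos] = self.codec.compress(row)
        # Drop the position that just slid out of the window,
        # unless it is one of the permanent sink slots.
        leaving = pos - self.window
        if leaving >= self.sink:
            self.store.pop(leaving, None)

    def resident(self):
        return {p: self.codec.decompress(v) for p, v in self.store.items()}
```

For 10 decode steps with `sink=2` and `window=3`, the stateless path performs 40 compress calls while the stateful cache performs 10, one per new token, and both end with the same resident positions `[0, 1, 7, 8, 9]`.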
+**Why split.** Stateless plumbing first lets us validate the
+correctness contract (gate b) before committing to the stateful
+caching design. K2.A.1 makes the integration risk concrete:
+"does KL on every forward break recall?" If the answer is no,
+K2.A.2 can pursue throughput aggressively without simultaneously
+defending correctness. If the answer is yes (recall regresses
+even with stateless KL), the failure mode is isolated to the
+codec composition — Q-sweep escape hatch (§11.11.9) applies and
+K2.A.2 is unblocked once K2.A.1 finds a working Q. Either way,
+the staging makes each phase's risk diagnosable in isolation.
+
+**What K2.A.2 specifically must NOT do.** K2.A.2 is a stateful
+caching refactor, not a new architectural variant. The §11.11.3
+two-path K/V sourcing model (resident → KL-decompress, evicted
+→ dLM → f_θ) remains. The only structural change is moving the
+resident cache from "computed per forward" (K1.D, K2.A.1) to
+"persisted across forwards" (K2.A.2). f_θ remains identity in
+K2.A.{1,2} same-model setup; cross-model f_θ is K2.B.
+
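The two-path sourcing model referenced above (resident → KL-decompress, evicted → dLM → f_θ) amounts to a per-position source selection. A minimal sketch, assuming the resident rows are already decompressed into a dict and the proposer is exposed as a callable; `assemble_kv`, `identity_f_theta`, and the tagged-tuple output are illustrative assumptions rather than the verifier's actual interface.

```python
def identity_f_theta(row):
    """f_θ is identity in the same-model K2.A setup, per the text above."""
    return row


def assemble_kv(seq_len, resident_cache, proposer_kv, f_theta=identity_f_theta):
    """Build per-position K/V rows from the two sources.

    resident_cache: dict mapping position -> decompressed row (resident path).
    proposer_kv: callable mapping position -> proposer-reconstructed row
                 (evicted path), which is then passed through f_theta.
    Each output entry is tagged with the path that produced it.
    """
    rows = []
    for pos in range(seq_len):
        if pos in resident_cache:
            rows.append(("resident", resident_cache[pos]))
        else:
            rows.append(("evicted", f_theta(proposer_kv(pos))))
    return rows
```

Under this model, K2.A.1 and K2.A.2 differ only in how `resident_cache` is produced (recomputed per forward versus persisted across forwards). The selection logic itself is unchanged, which is the constraint the paragraph above places on K2.A.2.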
 ### 11.12 Canonical empirical ladder (recall × rung × platform)

 Reference matrix for the K1 multi-source baseline.
