Commit 8791722
PR-K1.D: DLMRestoredVerifier + Gemma3Attention monkey-patch + Mac M4 smoke
Final foundational PR of the K1 series. Ties together K1.A (capture),
K1.B (merge), and K1.C (per-layer K/V prep) into a single end-to-end
DLMRestoredVerifier wrapper that runs an HF Gemma3-class model under
the v0.4 K/V Restoration architecture (ADR 0008 §11).
Files:
inference_engine/v04/dlm_restored_verifier.py (387 lines)
* _LayerRestorationContext (dataclass) — per-layer captured K, V
plus shared evicted_positions list. Attached to each attention
module as the private attribute _v04_layer_context during a
forward call; removed afterwards.
* _restored_attention_forward(attn_module, hidden_states, ...) —
a faithful copy of Gemma3Attention.forward (HF transformers
>= 4.57) with one insertion: between apply_rotary_pos_emb (which
rotates Q and K) and attention_interface, post-RoPE K/V are
passed through prepare_restored_attention_kv (K1.C) along with
the layer's slice of the captured K/V from K1.A. The result
replaces local K/V at evicted positions; non-evicted positions
pass through unchanged.
The HF entry points (apply_rotary_pos_emb, eager_attention_forward,
ALL_ATTENTION_FUNCTIONS) are passed in as parameters rather than
imported at module load time, so this module remains importable
without HF transformers. Linux CI does have transformers installed,
but the synthetic tests can pass stubs in place of the real imports.
* DLMRestoredVerifier(model, sink_size=4, window_size=64) — the
public wrapper. Discovers the model's decoder layers
(model.model.layers, Gemma3 / Llama / Qwen / Mistral shape).
The .forward(input_ids, apply_rotary_pos_emb=..., ...) call:
1. capture_proposer_kv(self.model, input_ids) → KVCapture
2. compute_evicted_positions → contiguous range
3. select_positions(evicted) on the capture
4. Install monkey-patch on every layer's self_attn.forward
via a context manager (exception-safe; finally restores)
5. Run the model's standard forward (which now goes through
the patched attention forwards)
6. Discard captures, remove patches, return logits.
Single-batch only in K1.D (input_ids.size(0) == 1) — multi-batch
routing is K-series Phase 2 work.
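The install/restore lifecycle in steps 4 to 6 can be sketched on toy objects. This is a minimal illustration, not the code in dlm_restored_verifier.py: LayerContext, ToyAttention, patched_forward and restoration_active are hypothetical names, the toy forward only echoes lists, and the real patched forward is a copy of the upstream attention rather than a list substitution. It assumes the capture has already been reduced to the evicted positions, so entry i of the captured K/V belongs to evicted_positions[i].

```python
import types
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class LayerContext:
    """Hypothetical stand-in for _LayerRestorationContext."""
    captured_k: list          # captured K, one entry per evicted position
    captured_v: list          # captured V, one entry per evicted position
    evicted_positions: list   # shared list of evicted sequence positions


class ToyAttention:
    """Toy attention module whose forward just echoes K and V."""

    def forward(self, k, v):
        return list(k), list(v)


def patched_forward(self, k, v):
    """Replace K/V at evicted positions; pass everything else through."""
    ctx = getattr(self, "_v04_layer_context", None)
    k, v = list(k), list(v)
    if ctx is None or not ctx.evicted_positions:
        return k, v  # no context, or nothing evicted: skip the merge
    for i, pos in enumerate(ctx.evicted_positions):
        k[pos] = ctx.captured_k[i]
        v[pos] = ctx.captured_v[i]
    return k, v


@contextmanager
def restoration_active(attn_modules, contexts):
    """Patch every module's forward; always unpatch on exit."""
    if len(attn_modules) != len(contexts):
        raise ValueError("layer count mismatch")
    try:
        for attn, ctx in zip(attn_modules, contexts):
            attn._v04_layer_context = ctx
            attn.forward = types.MethodType(patched_forward, attn)
        yield
    finally:
        for attn in attn_modules:
            # Dropping the instance attributes re-exposes the class-level forward.
            vars(attn).pop("forward", None)
            vars(attn).pop("_v04_layer_context", None)
```

Because cleanup sits in the finally clause, an exception raised inside the with-block still leaves every module unpatched.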
inference_engine/v04/__init__.py
Public API now exports DLMRestoredVerifier alongside K1.A / K1.B
/ K1.C primitives. After this PR the v0.4 module's public API
is functionally complete for the K1 phase: a callable end-to-end
wrapper plus the three primitive layers (capture, merge, attention
K/V prep) for downstream composition.
tests/inference_engine/v04/test_dlm_restored_verifier.py (560 lines,
21 cases, all <0.20 s on Linux CI)
Test classes:
* TestRestoredAttentionForward — the patched function in isolation
on a synthetic Gemma3-shape attention module: with no context it
falls through to upstream-equivalent behaviour; an empty evicted
list short-circuits the merge; a non-empty evicted list runs the
merge code path and returns the correct shape.
* TestDLMRestoredVerifierConstruction — default sink/window;
negative values raise.
* TestDLMRestoredVerifierShapeDiscovery — decoder layers found on
Gemma3-shape; layers without self_attn raise; unrecognised model
shape raises (no silent fallback per ADR 0008 §6.2).
* TestDLMRestoredVerifierBatchValidation — rank-1 input raises;
batch>1 raises with 'single-batch only' message.
* TestRestorationActiveLifecycle — patches install on every layer;
_v04_layer_context attached during context, removed after; layer
count mismatch raises; exception during context still unpatches
(finally clause); empty evicted list still installs but the
patched forward short-circuits the merge.
* TestDLMRestoredVerifierForward — end-to-end with stub model:
forward calls the model's forward; short input (no eviction)
produces correct shape; @torch.no_grad guarantees no-grad
output; patches cleared after call (bound-method __func__
identity check, since Python recreates bound methods on each
attribute access).
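The reason for comparing __func__ rather than the bound methods themselves can be shown in a few lines of plain Python. The Attn class below is a hypothetical toy, not the test's _FakeAttention.

```python
class Attn:
    """Toy class standing in for an attention module."""

    def forward(self):
        return "upstream"


attn = Attn()
first = attn.forward    # each attribute access builds a new bound-method object
second = attn.forward

same_object = first is second                       # False: two distinct objects
same_function = first.__func__ is second.__func__   # True: one underlying function
unpatched = attn.forward.__func__ is Attn.forward   # True while nothing is patched
```

An `is` check on the bound methods would therefore fail even when nothing was patched, while the `__func__` comparison identifies the function actually installed.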
Tests use a synthetic Gemma3-shape surrogate (_FakeAttention,
_FakeDecoderLayer, _FakeInner, _FakeModel) that mirrors the HF
hook surface. _FakeAttention.forward actually invokes
k_proj/v_proj/q_proj, so the capture hooks fire as they would on a
real model. The fake model's .forward derives a deterministic
synthetic hidden state from input_ids.sum(), so two consecutive
forwards on the same input see the same internal state, which makes
the same-model identity case mathematically meaningful.
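The hook surface the surrogate mirrors is model.model.layers[i].self_attn. A minimal sketch of strict discovery over that shape follows; discover_attention_modules is a hypothetical name, and the exception type and messages are assumptions rather than what the wrapper actually raises.

```python
from types import SimpleNamespace


def discover_attention_modules(model):
    """Return each decoder layer's self_attn, or raise (no silent fallback)."""
    inner = getattr(model, "model", None)
    layers = getattr(inner, "layers", None)
    if layers is None:
        raise ValueError("unrecognised model shape: expected model.model.layers")
    attns = []
    for index, layer in enumerate(layers):
        attn = getattr(layer, "self_attn", None)
        if attn is None:
            raise ValueError(f"decoder layer {index} has no self_attn")
        attns.append(attn)
    return attns


# A two-layer surrogate with the same attribute layout as the fakes above.
fake_model = SimpleNamespace(
    model=SimpleNamespace(
        layers=[SimpleNamespace(self_attn=f"attn{i}") for i in range(2)]
    )
)
found = discover_attention_modules(fake_model)
```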
Combined v04 test status after this PR:
tests/inference_engine/v04/ has 124 cases (32 K1.A + 39 K1.B
+ 32 K1.C + 21 K1.D), all <0.20 s on Linux CI, no HF model
download.
scripts/review_pr_k1d_on_mac.sh (216 lines)
Mac M4 reviewer aid — smoke test against real google/gemma-3-1b-it.
This is NOT the empirical NIAH validation gate (that is K1.E).
It answers: 'does the v0.4 K/V Restoration wrapper actually run
end-to-end on real Gemma 3-1B-it without crashing or producing
NaN/Inf?'.
Three runs on a 256-token synthetic input:
(a) standard model.forward (full attention oracle)
(b) DLMRestoredVerifier sink=4 window=64 (real eviction)
(c) DLMRestoredVerifier sink=10000 window=10000 (no eviction;
should match (a) to within numerical precision,
KL < 1e-3 nats)
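Why run (b) evicts and run (c) does not can be illustrated with a small position calculation. This is a sketch under an assumption: that the sink keeps the first sink_size tokens, the window keeps the last window_size tokens, and everything in between is evicted as one contiguous range. evicted_range is a hypothetical helper, not the compute_evicted_positions signature.

```python
def evicted_range(seq_len, sink_size=4, window_size=64):
    """Positions outside both the sink prefix and the trailing window."""
    if sink_size < 0 or window_size < 0:
        raise ValueError("sink_size and window_size must be non-negative")
    start = min(sink_size, seq_len)               # first position after the sink
    stop = max(start, seq_len - window_size)      # first position inside the window
    return range(start, stop)


run_b = evicted_range(256, sink_size=4, window_size=64)        # positions 4..191
run_c = evicted_range(256, sink_size=10000, window_size=10000) # empty
short = evicted_range(50)                                      # empty: fits in sink+window
```

Under that assumption the 256-token input evicts 188 positions in run (b) and none in run (c), so (c) exercises the patched forward with the merge short-circuited.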
Smoke gate:
* no NaN/Inf in any of (a), (b), (c)
* (c) KL vs (a) < 1e-3 nats (the patched forward is
mathematically identical to upstream when no merge runs;
any drift indicates a bug in the patched forward's copy of
the upstream logic)
The argmax agreement and KL of (b) against the oracle are recorded
but NOT smoke-gated, because divergence there is expected: v0.4 aims
to preserve intelligence under a bounded sink+window via
reconstruction, not to be bit-exact with full attention. The gate
that does matter for v0.4 GA, '>=95% mid-context recall at 100k
context', is K1.E.
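The two smoke-gate conditions can be sketched in plain Python on a single logit vector. This is an illustration only: the function names are hypothetical, the KL direction (oracle relative to the test run) is an assumption, and the real script presumably works on tensors over every position rather than one list of floats.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def kl_nats(ref_logits, test_logits):
    """KL(ref || test) in nats between the two softmax distributions."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(log_p, log_q))


def all_finite(logits):
    """True when no value is NaN or +/-Inf."""
    return all(math.isfinite(x) for x in logits)


def smoke_gate(oracle, evicting, no_evict, kl_limit=1e-3):
    """Pass when all runs are finite and the no-eviction run tracks the oracle."""
    finite = all(all_finite(run) for run in (oracle, evicting, no_evict))
    return finite and kl_nats(oracle, no_evict) < kl_limit


oracle = [2.0, 1.0, 0.1]
passes = smoke_gate(oracle, [0.1, 1.0, 2.0], oracle)
fails_on_nan = smoke_gate(oracle, [float("nan"), 1.0, 2.0], oracle)
```

The evicting run only contributes to the finiteness check here, matching the note that its divergence from the oracle is recorded but not gated.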
Time budget on Mac M4 24 GB with Gemma 3-1B-it: ~3-5 minutes.
Output: results/research/k1d_smoke_<stamp>.json + log.
What's next:
K1.E — DLMRestoredVerifier NIAH validation harness on Mac M4.
Synthetic needle-in-haystack at 100k+ context, comparison
across:
* full-attention oracle (target: ~100% recall, the
upper bound)
* v0.3 sink=4 + window=64 (target: ~17% recall per the
2026-06-06 A/B benchmark — confirms the regression
this whole architecture exists to fix)
* v0.4 DLMRestoredVerifier with sink=4 + window=64 (gate:
ADR 0008 §11.8 (a) >=95% mid-context recall)
Successful K1.E is the v0.4 GA primary architectural
validation.
Stacking notes:
Logical base is #72 K1.C (which itself is logically based on #71
K1.B). After #71 and #72 are merged into main, this PR's diff
shrinks to just the K1.D additions. base_branch is set to main
for tooling reasons; merge order is #71 → #72 → this PR.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent 651305b
4 files changed
Lines changed: 1380 additions & 0 deletions
File tree
- inference_engine/v04
- scripts
- tests/inference_engine/v04