Commit 8791722

PR-K1.D: DLMRestoredVerifier + Gemma3Attention monkey-patch + Mac M4 smoke
Final foundational PR of the K1 series. Ties together K1.A (capture), K1.B (merge), and K1.C (per-layer K/V prep) into a single end-to-end DLMRestoredVerifier wrapper that runs an HF Gemma3-class model under the v0.4 K/V Restoration architecture (ADR 0008 §11).

Files:

inference_engine/v04/dlm_restored_verifier.py (387 lines)

* _LayerRestorationContext (dataclass): per-layer captured K, V plus the shared evicted_positions list. Attached to each attention module as the private attribute _v04_layer_context during a forward call; removed afterwards.

* _restored_attention_forward(attn_module, hidden_states, ...): a faithful copy of Gemma3Attention.forward (HF transformers >= 4.57) with one insertion: between apply_rotary_pos_emb (which rotates Q and K) and attention_interface, post-RoPE K/V are passed through prepare_restored_attention_kv (K1.C) along with the layer's slice of the captured K/V from K1.A. The result replaces local K/V at evicted positions; non-evicted positions pass through unchanged. HF function pointers (apply_rotary_pos_emb, eager_attention_forward, ALL_ATTENTION_FUNCTIONS) are passed in as parameters rather than imported at module load time, so this module remains importable without HF transformers (Linux CI does have it; the imports in synthetic tests can be replaced with stubs).

* DLMRestoredVerifier(model, sink_size=4, window_size=64): the public wrapper. Discovers the model's decoder layers (model.model.layers, the Gemma3 / Llama / Qwen / Mistral shape). The .forward(input_ids, apply_rotary_pos_emb=..., ...) call:

  1. capture_proposer_kv(self.model, input_ids) → KVCapture
  2. compute_evicted_positions → contiguous range
  3. select_positions(evicted) on the capture
  4. Install monkey-patch on every layer's self_attn.forward via a context manager (exception-safe; finally restores)
  5. Run the model's standard forward (which now goes through the patched attention forwards)
  6. Discard captures, remove patches, return logits.
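Steps 2, 4, and 6 above can be illustrated with a dependency-free sketch. This is not the code in dlm_restored_verifier.py: FakeAttention, patched_forward, and restoration_active are hypothetical names, the context object is a plain dict rather than the dataclass, and the eviction rule (everything outside the first sink_size and last window_size tokens) is an assumption based on the sink + window description.

```python
from contextlib import contextmanager


def compute_evicted_positions(seq_len, sink_size, window_size):
    """Return the contiguous middle range that falls outside sink + window.

    ASSUMPTION: positions [sink_size, seq_len - window_size) are evicted.
    The real helper's signature and rule may differ.
    """
    start = min(sink_size, seq_len)
    stop = max(start, seq_len - window_size)
    return list(range(start, stop))


class FakeAttention:
    """Hypothetical stand-in for an HF attention module."""

    def forward(self, x):
        return ("upstream", x)


def patched_forward(self, x):
    """Illustrative patched forward: only the control flow is modelled."""
    ctx = getattr(self, "_v04_layer_context", None)
    if ctx is None or not ctx["evicted_positions"]:
        # No context, or nothing evicted: behave like the upstream forward.
        return ("upstream", x)
    # The real code would merge captured K/V at the evicted positions here.
    return ("restored", x)


@contextmanager
def restoration_active(attn_modules, evicted_positions):
    """Install the patch on every module; always undo it on exit."""
    try:
        for attn in attn_modules:
            attn._v04_layer_context = {"evicted_positions": evicted_positions}
            # An instance attribute shadows the class-level method.
            attn.forward = patched_forward.__get__(attn)
        yield
    finally:
        for attn in attn_modules:
            # pop() with a default is safe even if installation failed midway.
            attn.__dict__.pop("forward", None)
            attn.__dict__.pop("_v04_layer_context", None)
```

Patching by shadowing the method on the instance means that removing the instance attribute restores the original class method without having to store it. Because Python creates a new bound method on each attribute access, a check that the patch is gone has to compare `attn.forward.__func__` against the class function rather than compare bound methods with `is`.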
Single-batch only in K1.D (input_ids.size(0) == 1); multi-batch routing is K-series Phase 2 work.

inference_engine/v04/__init__.py

Public API now exports DLMRestoredVerifier alongside the K1.A / K1.B / K1.C primitives. After this PR the v0.4 module's public API is functionally complete for the K1 phase: a callable end-to-end wrapper plus the three primitive layers (capture, merge, attention K/V prep) for downstream composition.

tests/inference_engine/v04/test_dlm_restored_verifier.py (560 lines, 21 cases, all <0.20 s on Linux CI)

Test classes:

* TestRestoredAttentionForward: the patched function in isolation on a synthetic Gemma3-shape attention module: no-context falls through to upstream-equivalent behaviour; empty evicted short-circuits the merge; non-empty evicted runs the merge code path and returns correct shape.

* TestDLMRestoredVerifierConstruction: default sink/window; negative values raise.

* TestDLMRestoredVerifierShapeDiscovery: decoder layers found on Gemma3-shape; layers without self_attn raise; unrecognised model shape raises (no silent fallback per ADR 0008 §6.2).

* TestDLMRestoredVerifierBatchValidation: rank-1 input raises; batch>1 raises with 'single-batch only' message.

* TestRestorationActiveLifecycle: patches install on every layer; _v04_layer_context attached during context, removed after; layer count mismatch raises; exception during context still unpatches (finally clause); empty evicted list still installs but the patched forward short-circuits the merge.

* TestDLMRestoredVerifierForward: end-to-end with stub model: forward calls the model's forward; short input (no eviction) produces correct shape; @torch.no_grad guarantees no-grad output; patches cleared after call (bound-method __func__ identity check, since Python recreates bound methods on each attribute access).

Tests use a synthetic Gemma3-shape surrogate (_FakeAttention, _FakeDecoderLayer, _FakeInner, _FakeModel) that mirrors the HF hook surface.
The fake _FakeAttention.forward actually invokes k_proj/v_proj/q_proj so capture hooks fire correctly. The fake model's .forward seeds a deterministic synthetic hidden state from input_ids.sum(), so two consecutive forwards on the same input see the same internal state, which makes the same-model identity case mathematically meaningful.

Combined v04 test status after this PR: tests/inference_engine/v04/ has 124 cases (32 K1.A + 39 K1.B + 32 K1.C + 21 K1.D), all <0.20 s on Linux CI, no HF model download.

scripts/review_pr_k1d_on_mac.sh (216 lines)

Mac M4 reviewer aid: a smoke test against real google/gemma-3-1b-it. This is NOT the empirical NIAH validation gate (that is K1.E). It answers: 'does the v0.4 K/V Restoration wrapper actually run end-to-end on real Gemma 3-1B-it without crashing or producing NaN/Inf?'.

Three runs on a 256-token synthetic input:

  (a) standard model.forward (full attention oracle)
  (b) DLMRestoredVerifier sink=4 window=64 (real eviction)
  (c) DLMRestoredVerifier sink=10000 window=10000 (no eviction; should match (a) to within numerical precision, KL < 1e-3 nats)

Smoke gate:

* no NaN/Inf in any of (a), (b), (c)
* (c) KL vs (a) < 1e-3 nats (the patched forward is mathematically identical to upstream when no merge runs; any drift indicates a bug in the patched forward's copy of the upstream logic)

(b)'s argmax / KL vs oracle is recorded but NOT smoke-gated because we expect divergence: that's the whole point of v0.4 (preserve intelligence under bounded sink+window via reconstruction, not bit-exact equivalence to full attention). The gate that does matter for v0.4 GA, '>=95% mid-context recall at 100k context', is K1.E.

Time budget on Mac M4 24 GB with Gemma 3-1B-it: ~3-5 minutes.

Output: results/research/k1d_smoke_<stamp>.json + log.

What's next:

K1.E: DLMRestoredVerifier NIAH validation harness on Mac M4.
Synthetic needle-in-haystack at 100k+ context, comparison across:

* full-attention oracle (target: ~100% recall, the upper bound)
* v0.3 sink=4 + window=64 (target: ~17% recall per the 2026-06-06 A/B benchmark; confirms the regression this whole architecture exists to fix)
* v0.4 DLMRestoredVerifier with sink=4 + window=64 (gate: ADR 0008 §11.8 (a) >=95% mid-context recall)

A successful K1.E run is the primary architectural validation for v0.4 GA.

Stacking notes:

Logical base is #72 K1.C (which itself is logically based on #71 K1.B). After #71 and #72 are merged into main, this PR's diff shrinks to just the K1.D additions. base_branch is set to main for tooling reasons; merge order is #71 → #72 → this PR.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
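The K1.D smoke gate described in the commit message (no NaN/Inf in any run, and KL of the no-eviction run against the oracle below 1e-3 nats) can be sketched without any dependencies. This is an illustration, not the logic in scripts/review_pr_k1d_on_mac.sh: the function names are hypothetical, the KL direction KL(oracle || run) is an assumption, and the sketch scores a single logits vector where the real script presumably aggregates over positions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a flat list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_nats(ref_logits, test_logits):
    """KL(ref || test) in nats between the two softmax distributions.

    ASSUMPTION: the oracle is the reference distribution.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def all_finite(logits):
    """True when the run produced no NaN or Inf."""
    return all(math.isfinite(x) for x in logits)


def smoke_gate(oracle, evicting, no_evict, kl_threshold=1e-3):
    """Apply the two gated checks; the evicting run's KL is only recorded."""
    finite = all(all_finite(run) for run in (oracle, evicting, no_evict))
    nan = float("nan")
    kl_no_evict = kl_nats(oracle, no_evict) if finite else nan
    kl_evicting = kl_nats(oracle, evicting) if finite else nan
    return {
        "finite": finite,
        "kl_no_evict_vs_oracle": kl_no_evict,
        "kl_evicting_vs_oracle": kl_evicting,  # recorded, not gated
        "passed": finite and kl_no_evict < kl_threshold,
    }
```

Keeping the evicting run's KL in the result but out of the pass condition mirrors the commit's stated intent: divergence under real eviction is expected, while drift in the no-eviction run points at a bug in the copied forward.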
1 parent 651305b commit 8791722

4 files changed

Lines changed: 1380 additions & 0 deletions

inference_engine/v04/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -42,6 +42,7 @@
     prepare_restored_attention_kv,
     slice_position_embeddings,
 )
+from inference_engine.v04.dlm_restored_verifier import DLMRestoredVerifier
 
 __all__ = [
     # K1.A — capture
@@ -55,4 +56,6 @@
     "apply_rope_to_k_at_positions",
     "prepare_restored_attention_kv",
     "slice_position_embeddings",
+    # K1.D — end-to-end wrapper
+    "DLMRestoredVerifier",
 ]
