@@ -2807,3 +2807,219 @@ candidates for Mac M4 24 GB single-device deployment.
28072807 But model names are the most vulnerable class because LLM
28082808 agents have strong priors about "what would naturally exist"
28092809 and tend to confabulate.
2810+
2811+ ### 11.15 K3 implementation roadmap (added 2026-06-09)
2812+
2813+ Per user directive 2026-06-09: *"Prepare both the vast GPU version and
2814+ the Mac mini version of K3 at production scale right away"* **and**
2815+ "once K3 is done, adapt the K2 Qwen model" (K3 first; K2.B Qwen
2816+ adaptation as a backport after K3).
2817+
2818+ This subsection sequences the K3 work into discrete deliverable
2819+ blocks with explicit prerequisites. The companion design documents
2820+ in `docs/design/` flesh out per-block contracts.
2821+
2822+ #### 11.15.1 Block sequence
2823+
2824+ ```
2825+ A. Hardware feasibility (this PR — DONE in scaffold form)
2826+ ├── A.1 vast.ai bf16 path
2827+ └── A.2 Mac M4 4-bit path (one-time MLX quantize)
2828+ ↓
2829+ B. Cross-model wrapper (K2.B/K3 implementation PR — NOT YET)
2830+ ↓
2831+ C. f_θ training (Stage 1) (K3 training PR — NOT YET)
2832+ ↓
2833+ D. f_θ Stage 2 fine-tune (K3 training PR cont. — NOT YET)
2834+ ↓
2835+ E. K3 NIAH ladder evidence (K3 evidence PR — NOT YET)
2836+ ↓
2837+ F. K2.B Qwen backport (K2.B research-scale validation — NOT YET)
2838+ ↓
2839+ G. K3 production deployment (release engineering — NOT YET)
2840+ ```
2841+
2842+ #### 11.15.2 Block A — Hardware feasibility (this PR)
2843+
2844+ **Prerequisites**: HF token (Gemma 4 is gated) on each host.
2845+
2846+ **Deliverables** (shipped in this PR):
2847+
2848+ * `scripts/research/k3_quantize_for_mac.py`: one-time
2849+ `mlx_lm.convert` 4-bit quantization of `google/gemma-4-26B-A4B-it`
2850+ into a ~13 GB local MLX directory on Mac M4.
2851+ * `scripts/research/k3_feasibility_smoke.py`: cross-platform
2852+ smoke test that loads the (verifier, drafter) pair, runs a forward
2853+ pass, and reports memory and latency as JSON evidence.
2854+ * `scripts/review_pr_k3_feasibility_on_vast.sh`: bf16 path
2855+ on vast.ai H100 / H200 80 GB (no quantization needed).
2856+ * `scripts/review_pr_k3_feasibility_on_mac.sh`: 4-bit path on
2857+ Mac M4 24 GB (with quantize prerequisite check).
2858+ * This roadmap (§11.15).
2859+ * Cross-model `DLMRestoredVerifier` interface contract (no code,
2860+ just contract): `docs/design/k3-cross-model-dlmrestored-verifier-contract.md`
2861+ * `f_θ` training pipeline skeleton (no code, just skeleton):
2862+ `docs/design/k3-f-theta-training-pipeline.md`
2863+
2864+ **Acceptance gate**: smoke runs exit with code 0 and the JSON evidence
2865+ shows that verifier and drafter both load and run a forward pass on
2866+ the target hardware. **What this gate does NOT verify**: cross-model
2867+ correctness (that's Block B), trained-f_θ behaviour (Block C/D),
2868+ NIAH recall (Block E).
2869+
2870+ **Cost**: zero compute beyond a one-time Mac quantize
2871+ (~30-90 min, free) and a single vast.ai GPU-hour smoke (~$1-3).
2872+
2873+ #### 11.15.3 Block B — Cross-model ` DLMRestoredVerifier ` implementation
2874+
2875+ **Prerequisites**: Block A's feasibility smoke confirms both
2876+ models load on target hardware; the smoke's JSON report contains
2877+ the actual drafter `(num_layers, head_dim, num_kv_heads)` shape
2878+ needed to parameterise `LinearLayerProjection`.
2879+
2880+ **Deliverables**:
2881+
2882+ * `inference_engine/v04/dlm_restored_verifier.py` extended:
2883+ cross-model constructor signature per
2884+ `docs/design/k3-cross-model-dlmrestored-verifier-contract.md` §1.
2885+ * New `inference_engine/v04/layer_projection.py`:
2886+ `LayerProjection` Protocol + `IdentityLayerProjection` +
2887+ `LinearLayerProjection`.
2888+ * K1 / K2.A backward-compat regression test passes unchanged
2889+ (all 31 existing tests in
2890+ `test_dlm_restored_verifier.py` continue to pass).
2891+ * New cross-model tests covering `IdentityLayerProjection ==
2892+ K1.D bit-for-bit`, `LinearLayerProjection` shape correctness,
2893+ `layer_alignment` strategies, error cases.
2894+
2895+ **Acceptance gate**: tests pass; running cross-model
2896+ `DLMRestoredVerifier` with `IdentityLayerProjection` on
2897+ `(google/gemma-3-1b-it, google/gemma-3-1b-it)` produces
2898+ bit-equal output to existing K1.D `DLMRestoredVerifier(model)`.
2899+
2900+ **Cost**: pure engineering, ~500-1000 LOC. No GPU compute.
2901+
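The `LayerProjection` Protocol and its two implementations listed above could look roughly like the sketch below. This is a minimal illustration in plain Python, not the contract from `docs/design/k3-cross-model-dlmrestored-verifier-contract.md`: the `project` method name, the per-vector signature, and the list-based weight matrix are all assumptions made for the example.

```python
from typing import Protocol, Sequence, runtime_checkable

Vector = Sequence[float]


@runtime_checkable
class LayerProjection(Protocol):
    """Hypothetical interface: map one drafter K/V vector to verifier width."""

    def project(self, vec: Vector) -> list[float]: ...


class IdentityLayerProjection:
    """Same-checkpoint case: values pass through unchanged."""

    def project(self, vec: Vector) -> list[float]:
        return list(vec)


class LinearLayerProjection:
    """Cross-model case: y = W x, with W of shape (out_dim, in_dim)."""

    def __init__(self, weight: Sequence[Sequence[float]]) -> None:
        if not weight or any(len(row) != len(weight[0]) for row in weight):
            raise ValueError("weight must be a non-empty rectangular matrix")
        self.weight = [list(row) for row in weight]
        self.in_dim = len(self.weight[0])
        self.out_dim = len(self.weight)

    def project(self, vec: Vector) -> list[float]:
        # Shape errors are raised early instead of silently truncating in zip().
        if len(vec) != self.in_dim:
            raise ValueError(f"expected {self.in_dim} values, got {len(vec)}")
        return [sum(w * x for w, x in zip(row, vec)) for row in self.weight]
```

Because the identity variant returns its input values untouched, a same-model pair wired through it has no arithmetic step that could perturb the K/V values, which is the property the bit-for-bit acceptance gate relies on.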
2902+ #### 11.15.4 Block C — ` f_θ ` Stage 1 training (L_recon)
2903+
2904+ **Prerequisites**: Block B implementation merged. Long-context
2905+ corpus (RULER / NarrativeQA) accessible.
2906+
2907+ **Deliverables**:
2908+
2909+ * `scripts/training/train_f_theta_stage1.py`: training driver
2910+ per `docs/design/k3-f-theta-training-pipeline.md`.
2911+ * Trained `f_θ` checkpoint at `checkpoints/k3_f_theta_stage1/`
2912+ (committed to LFS or shared blob storage, NOT into the main
2913+ repo, because it is too large).
2914+ * Training metadata JSON.
2915+
2916+ **Acceptance gate**: validation L_recon converges to a bounded
2917+ plateau; validation NIAH recall at the 5.6k canary rung exceeds an
2918+ empirical threshold (TBD after the first iteration).
2919+
2920+ **Cost**: per ADR §11.7 K3 row, ~$200-500 of GPU compute on
2921+ vast for Stage 1 alone (1B token training).
2922+
2923+ #### 11.15.5 Block D — ` f_θ ` Stage 2 fine-tune (L_logit)
2924+
2925+ **Prerequisites**: Block C checkpoint, with `f_θ` already in the
2926+ "correct ballpark"; Stage 2 tightens the fit.
2927+
2928+ **Deliverables**:
2929+
2930+ * `scripts/training/train_f_theta_stage2.py`: driver.
2931+ * Updated `f_θ` checkpoint at `checkpoints/k3_f_theta_stage2/`.
2932+
2933+ **Acceptance gate**: validation NIAH recall meets ADR §11.8 1a
2934+ (Δ ≤ 5pp of oracle at every §11.12 ladder rung).
2935+
2936+ **Cost**: ~$50-200 (100M token fine-tune at 5-10× per-step cost
2937+ of Stage 1).
2938+
2939+ #### 11.15.6 Block E — K3 NIAH ladder evidence
2940+
2941+ **Prerequisites**: Block D checkpoint passing Stage 2 validation.
2942+
2943+ **Deliverables**:
2944+
2945+ * Re-run K1.E NIAH harness with cross-model setup at every
2946+ §11.12 ladder rung (1.4k / 5.6k / 21k / 64k / 100k) on vast.
2947+ * JSON evidence committed to `results/research/`.
2948+ * ADR §11.11.10 postscript update with K3 baseline.
2949+
2950+ **Acceptance gate**: ADR §11.8 1a gate met across the full ladder;
2951+ the v0.4 architectural validation is **finally** demonstrated
2952+ on a real dLM proposer (not just K1's same-checkpoint AR
2953+ toy).
2954+
2955+ **Cost**: ~$10-30 of vast time for one full ladder run.
2956+
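The ADR §11.8 1a gate used by Blocks D and E (Δ ≤ 5pp of oracle at every ladder rung) can be expressed as a small check over per-rung recall numbers. The sketch below is an assumption-laden illustration: it reads Δ as oracle recall minus candidate recall in percentage points, takes the rung labels from the ladder listed above, and uses a hypothetical function name and a plain dict layout that are not taken from the K1.E harness.

```python
# Rung labels as listed for the §11.12 ladder in this section.
LADDER_RUNGS = ("1.4k", "5.6k", "21k", "64k", "100k")


def gate_1a_failures(oracle, candidate, max_delta_pp=5.0):
    """Return the rungs that fail the gate.

    oracle / candidate: dicts mapping rung label -> NIAH recall in percent.
    A rung fails when the candidate trails the oracle by more than
    max_delta_pp, or when either side has no evidence for that rung
    (recorded as None). An empty dict means the gate passes.
    """
    failures = {}
    for rung in LADDER_RUNGS:
        if rung not in oracle or rung not in candidate:
            failures[rung] = None  # missing evidence is treated as a failure
            continue
        delta = oracle[rung] - candidate[rung]
        if delta > max_delta_pp:
            failures[rung] = delta
    return failures
```

Treating a missing rung as a failure keeps a partial ladder run from passing the gate by omission.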
2957+ #### 11.15.7 Block F — K2.B Qwen backport (research-scale validation)
2958+
2959+ **Prerequisites**: Block E has shown that K3 works; K2.B is a backport.
2960+
2961+ **Per the user directive**, K2.B is intentionally deferred to
2962+ **after** K3 is established. Rationale: validating at production
2963+ scale first ensures the architecture works at the deployment
2964+ target; the smaller K2.B research scale then becomes a faster
2965+ iteration vehicle for hyperparameter tuning + design exploration,
2966+ not the primary validation gate.
2967+
2968+ **Deliverables**:
2969+
2970+ * Same training + evidence as Blocks C/D/E but with the
2971+ Qwen3.5-4B + DFlash 0.4B pair (scale ratio 10:1 vs K3's 65:1).
2972+ * Evidence that K2.B reproduces K3's qualitative behaviour at
2973+ smaller scale (cheap for future research iterations).
2974+
2975+ **Cost**: ~$20-50 (training is cheaper at smaller scale).
2976+
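For budgeting, the per-block GPU cost ranges quoted in the Cost fields above can be summed into a single range. The figures in the sketch below are copied from those fields; Block B is omitted because it needs no GPU compute, and Block G because no cost is stated for it. The dictionary and function names are illustrative only.

```python
# (low, high) GPU spend in USD per block, as quoted in the Cost fields.
GPU_COST_USD = {
    "A": (1, 3),      # single vast.ai GPU-hour smoke
    "C": (200, 500),  # Stage 1 training
    "D": (50, 200),   # Stage 2 fine-tune
    "E": (10, 30),    # one full ladder run
    "F": (20, 50),    # K2.B backport
}


def total_range(costs):
    """Sum the low ends and the high ends independently."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


print(total_range(GPU_COST_USD))  # (281, 783)
```

The sum covers one pass through each block; repeated training runs or additional ladder runs would add to it.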
2977+ #### 11.15.8 Block G — K3 production deployment
2978+
2979+ **Prerequisites**: Block E + Block F passed.
2980+
2981+ **Deliverables**: release engineering (Docker image, deployment
2982+ docs, gRPC service config, multi-tenant scheduler tuning). Out of
2983+ scope for v0.4 GA; lands in the v0.5 release.
2984+
2985+ #### 11.15.9 Critical dependencies
2986+
2987+ The blocks must run in sequence:
2988+
2989+ ```
2990+ A → B → C → D → E (the K3 main path)
2991+ ↓
2992+ F (K2.B backport, strictly after E)
2993+ ↓
2994+ G (production deployment)
2995+ ```
2996+
2997+ The user's directive ordering ("K3 first, then K2.B") is preserved
2998+ explicitly — F comes after E, not in parallel.
2999+
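The ordering above can also be encoded as data, which lets a reviewer or CI script check mechanically whether a PR's block has its prerequisites merged. The sketch below uses the standard-library `graphlib` module; the `PREREQS` mapping is transcribed from the Prerequisites fields of each block, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# block -> set of blocks that must be complete first
PREREQS = {
    "A": set(),
    "B": {"A"},
    "C": {"B"},
    "D": {"C"},
    "E": {"D"},
    "F": {"E"},       # K2.B backport waits for K3 evidence
    "G": {"E", "F"},  # production deployment needs both
}


def block_order(prereqs=PREREQS):
    """Return one valid execution order; raises CycleError on a cycle."""
    return list(TopologicalSorter(prereqs).static_order())


def ready_blocks(done, prereqs=PREREQS):
    """Blocks not yet done whose prerequisites are all in `done`."""
    done = set(done)
    return sorted(b for b, deps in prereqs.items()
                  if b not in done and deps <= done)
```

With this graph the order is fully determined (A through G), so `ready_blocks` returns at most one block at any point, matching the "not in parallel" reading of the directive.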
3000+ #### 11.15.10 Risk register
3001+
3002+ | risk | block triggered | mitigation |
3003+ | ---| ---| ---|
3004+ | Drafter K/V hooks don't fire (DFlash custom modeling) | B | adapt K1.A hook pattern; may require DFlash-specific code path |
3005+ | ` f_θ ` capacity insufficient at 65:1 ratio | C/D | escalate per training pipeline §8: MLP, low-rank, learned alignment |
3006+ | Mac M4 4-bit smoke OOMs at 100k context | A.2 | accept Mac M4 as research-only at smaller context; K3 production validation on vast only |
3007+ | Gemma 4 26B-A4B verifier weights not accessible (gating delays) | A | use the alternative Gemma 4-31B-it pair (also HF-verified §11.14.3) |
3008+ | f_θ training cost overruns budget | C/D | smaller Stage 1 token budget; accept partial convergence + larger Δ vs oracle |
3009+ | Staleness (per §11.13.6) prevents Δ ≤ 5pp at production scale | E | escalate to §11.13.6.4 stateful-caching freshness designs (refresh-on-eviction or periodic refresh) |
3010+ | Multi-tenant scheduling conflict with v0.4 architecture | G | deferred — out of scope for K3 per §11.15.8; addressed in v0.5 release engineering |
3011+
3012+ #### 11.15.11 Why this roadmap matters
3013+
3014+ Without this sequencing, K3 work would either:
3015+
3016+ * (a) be over-claimed at Block A — "we have K3 prepared!" when in
3017+ reality only feasibility scripts exist; or
3018+ * (b) be under-scoped at Block C/D — agents would underestimate
3019+ the training cost and skip Stage 2 fine-tuning.
3020+
3021+ The roadmap makes both failure modes harder by giving each block
3022+ a fixed scope, a deliverable list, and an acceptance gate. PR
3023+ reviewers can map any K3-related PR to a block; PRs that try to
3024+ collapse multiple blocks (e.g., "B+C+D+E in one PR") are scope
3025+ violations.