
Commit 578b048

Merge pull request #86 from FluffyAIcode/AgentMemory/v04-pr-k3-prep-and-roadmap-8e7f
PR-K3-prep: feasibility scripts (Mac + vast) + cross-model contract + f_θ training skeleton + §11.15 K3 roadmap
2 parents 64eea20 + 3f0557a commit 578b048

9 files changed

Lines changed: 2041 additions & 0 deletions

docs/adr/0008-session-bound-runtime-and-grpc-protocol.md

Lines changed: 216 additions & 0 deletions
@@ -2807,3 +2807,219 @@ candidates for Mac M4 24 GB single-device deployment.
But model names are the most vulnerable class because LLM
agents have strong priors about "what would naturally exist"
and tend to confabulate.

### 11.15 K3 implementation roadmap (added 2026-06-09)

Per user directive 2026-06-09: *"prepare both the vast GPU version and
the Mac mini version of production-scale K3 right away"* **and**
*"once K3 is complete, adapt the K2 Qwen model"* (K3 first; K2.B Qwen
adaptation as a backport after K3).

This subsection sequences the K3 work into discrete deliverable
blocks with explicit prerequisites. The companion design documents
in `docs/design/` flesh out per-block contracts.

#### 11.15.1 Block sequence

```
A. Hardware feasibility            (this PR, DONE in scaffold form)
   ├── A.1 vast.ai bf16 path
   └── A.2 Mac M4 4-bit path (one-time MLX quantize)

B. Cross-model wrapper             (K2.B/K3 implementation PR, NOT YET)

C. f_θ training (Stage 1)          (K3 training PR, NOT YET)

D. f_θ Stage 2 fine-tune           (K3 training PR cont., NOT YET)

E. K3 NIAH ladder evidence         (K3 evidence PR, NOT YET)

F. K2.B Qwen backport              (K2.B research-scale validation, NOT YET)

G. K3 production deployment        (release engineering, NOT YET)
```

#### 11.15.2 Block A: Hardware feasibility (this PR)

**Prerequisites**: HF token (Gemma 4 is gated) on each host.

**Deliverables** (shipped in this PR):

* `scripts/research/k3_quantize_for_mac.py`: one-time
  `mlx_lm.convert` 4-bit quantize of `google/gemma-4-26B-A4B-it`
  to a `~13 GB` local MLX directory on Mac M4.
* `scripts/research/k3_feasibility_smoke.py`: cross-platform
  smoke test that loads (verifier, drafter), runs a forward pass,
  and reports memory + latency as JSON evidence.
* `scripts/review_pr_k3_feasibility_on_vast.sh`: bf16 path
  on a vast.ai H100 / H200 with at least 80 GB (no quantization needed).
* `scripts/review_pr_k3_feasibility_on_mac.sh`: 4-bit path on
  Mac M4 24 GB (with quantize prerequisite check).
* This roadmap (§11.15).
* Cross-model `DLMRestoredVerifier` interface contract (no code,
  just contract): `docs/design/k3-cross-model-dlmrestored-verifier-contract.md`
* `f_θ` training pipeline skeleton (no code, just skeleton):
  `docs/design/k3-f-theta-training-pipeline.md`

**Acceptance gate**: smoke runs return exit 0 and the JSON evidence
shows verifier + drafter both load and run a forward pass on the
target hardware. **What this gate does NOT verify**: cross-model
correctness (that's Block B), trained-f_θ behaviour (Block C/D),
NIAH recall (Block E).

**Cost**: zero compute beyond a one-time Mac quantize
(~30-90 min, free) and a single vast.ai GPU-hour smoke (~$1-3).

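A back-of-envelope check of the two memory paths in Block A, assuming weight memory is simply parameter count times bits per weight. The 26B total is read off the model name, and the function name is hypothetical. Real footprints are larger (quantization scales, activations, KV cache), and this sketch models none of that.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

VERIFIER_PARAMS = 26e9  # assumption: total parameters implied by "26B"

bf16_gb = weight_footprint_gb(VERIFIER_PARAMS, 16)  # vast.ai bf16 path
int4_gb = weight_footprint_gb(VERIFIER_PARAMS, 4)   # Mac M4 4-bit path

print(f"bf16: {bf16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

Under these assumptions the bf16 weights (about 52 GB) fit on an 80 GB GPU and the 4-bit weights (about 13 GB) fit in 24 GB of unified memory, consistent with the `~13 GB` figure for the MLX directory. How much headroom remains for context is a question for the smoke's memory report, not for this estimate.
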
#### 11.15.3 Block B: Cross-model `DLMRestoredVerifier` implementation

**Prerequisites**: Block A's feasibility smoke confirms both
models load on target hardware; the smoke's JSON report contains
the actual drafter `(num_layers, head_dim, num_kv_heads)` shape
needed to parameterise `LinearLayerProjection`.

**Deliverables**:

* `inference_engine/v04/dlm_restored_verifier.py` extended:
  cross-model constructor signature per
  `docs/design/k3-cross-model-dlmrestored-verifier-contract.md` §1.
* New `inference_engine/v04/layer_projection.py`:
  `LayerProjection` Protocol + `IdentityLayerProjection` +
  `LinearLayerProjection`.
* K1 / K2.A backward-compat regression suite passes unchanged
  (all 31 existing tests in
  `test_dlm_restored_verifier.py` continue to pass).
* New cross-model tests covering `IdentityLayerProjection`
  matching K1.D bit-for-bit, `LinearLayerProjection` shape
  correctness, `layer_alignment` strategies, and error cases.

**Acceptance gate**: tests pass; running cross-model
`DLMRestoredVerifier` with `IdentityLayerProjection` on
`(google/gemma-3-1b-it, google/gemma-3-1b-it)` produces
bit-equal output to existing K1.D `DLMRestoredVerifier(model)`.

**Cost**: pure engineering. ~500-1000 LOC. No GPU compute.

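To make the Block B deliverable concrete, here is a minimal sketch of how the three names in `layer_projection.py` could relate. Only the class names come from this roadmap. The `project` method, the flat-list vectors, and the constructor are illustrative assumptions, not the contract in `docs/design/`; a real implementation would presumably work on tensors shaped by the drafter's `(num_layers, head_dim, num_kv_heads)`.

```python
from typing import List, Protocol

Vector = List[float]

class LayerProjection(Protocol):
    """Maps one drafter-side K/V vector into the verifier's space."""
    def project(self, kv: Vector) -> Vector: ...

class IdentityLayerProjection:
    """Same-model case: returns the input values unchanged."""
    def project(self, kv: Vector) -> Vector:
        return list(kv)

class LinearLayerProjection:
    """Cross-model case: a plain matrix multiply; weight is out_dim x in_dim."""
    def __init__(self, weight: List[Vector]) -> None:
        if not weight or any(len(row) != len(weight[0]) for row in weight):
            raise ValueError("weight must be a non-empty rectangular matrix")
        self.weight = weight
        self.in_dim = len(weight[0])
        self.out_dim = len(weight)

    def project(self, kv: Vector) -> Vector:
        if len(kv) != self.in_dim:
            raise ValueError(f"expected {self.in_dim} values, got {len(kv)}")
        return [sum(w * x for w, x in zip(row, kv)) for row in self.weight]

# Toy dimensions, NOT the real drafter/verifier shapes.
kv = [0.5, -1.0, 2.0]
identity_out = IdentityLayerProjection().project(kv)
linear_out = LinearLayerProjection([[1.0, 0.0, 0.0],
                                    [0.0, 1.0, 1.0]]).project(kv)
```

The identity case is what makes the bit-equality gate checkable: with nothing applied to the values, any output difference must come from the wrapper itself.
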
#### 11.15.4 Block C: `f_θ` Stage 1 training (L_recon)

**Prerequisites**: Block B implementation merged. Long-context
corpus (RULER / NarrativeQA) accessible.

**Deliverables**:

* `scripts/training/train_f_theta_stage1.py`: training driver
  per `docs/design/k3-f-theta-training-pipeline.md`.
* Trained `f_θ` checkpoint at `checkpoints/k3_f_theta_stage1/`
  (stored in LFS or shared blob storage, NOT committed to the main
  repo, where it would be too large).
* Training metadata JSON.

**Acceptance gate**: validation L_recon converges to a bounded
plateau; validation NIAH recall at the 5.6k canary rung exceeds an
empirical threshold (TBD after the first iteration).

**Cost**: per ADR §11.7 K3 row, ~$200-500 of GPU compute on
vast for Stage 1 alone (1B-token training).

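The roadmap does not define "bounded plateau". One possible way to make it operational, purely as an assumption, is to compare the mean validation loss over the last few evaluations with the mean over the evaluations just before them. The function name, window, and tolerance below are hypothetical and would need tuning against real Stage 1 curves.

```python
from typing import Sequence

def has_plateaued(losses: Sequence[float], window: int = 5,
                  rel_tol: float = 0.01) -> bool:
    """True if the mean of the last `window` losses differs from the mean
    of the `window` losses before it by less than `rel_tol` (relative)."""
    if window <= 0 or len(losses) < 2 * window:
        return False  # not enough history to judge
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    if previous <= 0:
        return False  # relative change undefined for non-positive losses
    return abs(previous - recent) / previous < rel_tol

# Made-up loss curves, for illustration only.
still_falling = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
flat = [1.0, 0.5, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30]

print(has_plateaued(still_falling, window=3))
print(has_plateaued(flat, window=3))
```

A check like this covers only the first half of the gate; the NIAH recall threshold at the 5.6k rung still has to be set empirically.
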
#### 11.15.5 Block D: `f_θ` Stage 2 fine-tune (L_logit)

**Prerequisites**: Block C checkpoint, i.e. `f_θ` is already in the
"correct ballpark" and Stage 2 only tightens it.

**Deliverables**:

* `scripts/training/train_f_theta_stage2.py`: driver.
* Updated `f_θ` checkpoint at `checkpoints/k3_f_theta_stage2/`.

**Acceptance gate**: validation NIAH recall meets ADR §11.8 1a
(Δ ≤ 5pp vs. oracle at every §11.12 ladder rung).

**Cost**: ~$50-200 (100M-token fine-tune at 5-10× the per-step cost
of Stage 1).

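The Δ ≤ 5pp gate can be sketched as a per-rung comparison. This assumes Δ means oracle recall minus candidate recall, in percentage points, which ADR §11.8 may define differently. The function name and the recall numbers are made up for illustration.

```python
from typing import Dict, List

MAX_DELTA_PP = 5.0  # from the gate text: Δ ≤ 5pp

def failing_rungs(oracle: Dict[str, float], candidate: Dict[str, float],
                  max_delta_pp: float = MAX_DELTA_PP) -> List[str]:
    """Return the rungs where candidate recall trails oracle by more than
    `max_delta_pp` percentage points. Recalls are percentages in [0, 100]."""
    missing = set(oracle) - set(candidate)
    if missing:
        raise ValueError(f"no candidate result for rungs: {sorted(missing)}")
    return [rung for rung, ref in oracle.items()
            if ref - candidate[rung] > max_delta_pp]

# Made-up recall numbers, NOT measured results.
oracle = {"1.4k": 100.0, "5.6k": 98.0, "21k": 95.0}
candidate = {"1.4k": 99.0, "5.6k": 94.0, "21k": 88.0}

print(failing_rungs(oracle, candidate))
```

Returning the failing rungs rather than a single boolean keeps the evidence usable: a run that misses only at long context reads differently from one that misses everywhere.
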
#### 11.15.6 Block E: K3 NIAH ladder evidence

**Prerequisites**: Block D checkpoint passing Stage 2 validation.

**Deliverables**:

* Re-run K1.E NIAH harness with cross-model setup at every
  §11.12 ladder rung (1.4k / 5.6k / 21k / 64k / 100k) on vast.
* JSON evidence committed to `results/research/`.
* ADR §11.11.10 postscript update with K3 baseline.

**Acceptance gate**: ADR §11.8 1a gate met across full ladder;
the v0.4 architectural validation is **finally** demonstrated
on a real dLM proposer (not just K1's same-checkpoint AR
toy).

**Cost**: ~$10-30 of vast time for one full ladder run.

#### 11.15.7 Block F: K2.B Qwen backport (research-scale validation)

**Prerequisites**: Block E has shown that K3 works; K2.B is a backport.

**Per the user directive**, K2.B is intentionally deferred to
**after** K3 is established. Rationale: validating at production
scale first ensures the architecture works at the deployment
target; the smaller K2.B research scale then becomes a faster
iteration vehicle for hyperparameter tuning + design exploration,
not the primary validation gate.

**Deliverables**:

* Same training + evidence as Blocks C/D/E but with the
  Qwen3.5-4B + DFlash 0.4B pair (scale ratio 10:1 vs K3's 65:1).
* Evidence that K2.B reproduces K3's qualitative behaviour at
  smaller scale (cheap for future research iterations).

**Cost**: ~$20-50 (training is cheaper at smaller scale).

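The two scale ratios can be reproduced with simple division. The K2.B sizes are the ones named in the deliverable. For K3, this sketch assumes a 26B verifier paired with a 0.4B-class drafter, which is what 65:1 implies but is not stated outright in this section.

```python
def scale_ratio(verifier_params_m: int, drafter_params_m: int) -> float:
    """Verifier-to-drafter parameter ratio; sizes in millions of parameters."""
    return verifier_params_m / drafter_params_m

k2b = scale_ratio(4_000, 400)   # Qwen3.5-4B verifier, DFlash 0.4B drafter
k3 = scale_ratio(26_000, 400)   # assumption: 26B verifier, 0.4B-class drafter

print(f"K2.B {k2b:.0f}:1, K3 {k3:.0f}:1")
```
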
#### 11.15.8 Block G: K3 production deployment

**Prerequisites**: Block E + Block F passed.

**Deliverables**: release engineering (Docker image, deployment
docs, gRPC service config, multi-tenant scheduler tuning). Out of
scope for v0.4 GA; lands in the v0.5 release.

#### 11.15.9 Critical dependencies

The blocks must run in sequence:

```
A → B → C → D → E   (the K3 main path)

                F   (K2.B backport, after E)

                G   (production deployment, after E and F)
```

The user's directive ordering ("K3 first, then K2.B") is preserved
explicitly: F comes after E, not in parallel.

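The dependency chain can be encoded as data and checked mechanically. This sketch uses Python's standard-library `graphlib`; the `PREREQUISITES` mapping is an illustrative encoding of the prerequisites stated in each block, not an artifact shipped in this PR.

```python
from graphlib import TopologicalSorter

# Each block maps to the set of blocks it depends on.
PREREQUISITES = {
    "A": set(),
    "B": {"A"},
    "C": {"B"},
    "D": {"C"},
    "E": {"D"},
    "F": {"E"},       # K2.B backport starts only after E
    "G": {"E", "F"},  # deployment needs both E and F
}

order = list(TopologicalSorter(PREREQUISITES).static_order())
print(" → ".join(order))
```

Because every block has exactly one unblocked successor, the sorter has only one valid order to return, which is the sense in which F is sequenced after E, not run alongside it.
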
#### 11.15.10 Risk register

| Risk | Block affected | Mitigation |
|---|---|---|
| Drafter K/V hooks don't fire (DFlash custom modeling) | B | adapt K1.A hook pattern; may require DFlash-specific code path |
| `f_θ` capacity insufficient at 65:1 ratio | C/D | escalate per training pipeline §8: MLP, low-rank, learned alignment |
| Mac M4 4-bit smoke OOMs at 100k context | A.2 | accept Mac M4 as research-only at smaller context; K3 production validation on vast only |
| Gemma 4 26B-A4B verifier weights not accessible (gating delays) | A | use the alternative Gemma 4-31B-it pair (also HF-verified, §11.14.3) |
| `f_θ` training cost overruns budget | C/D | smaller Stage 1 token budget; accept partial convergence + larger Δ vs oracle |
| Staleness (per §11.13.6) prevents Δ ≤ 5pp at production scale | E | escalate to §11.13.6.4 stateful-caching freshness designs (refresh-on-eviction or periodic refresh) |
| Multi-tenant scheduling conflict with v0.4 architecture | G | deferred: out of scope for K3 per §11.15.8; addressed in v0.5 release engineering |

#### 11.15.11 Why this roadmap matters

Without this sequencing, K3 work would either:

* (a) be over-claimed at Block A ("we have K3 prepared!" when in
  reality only feasibility scripts exist); or
* (b) be under-scoped at Block C/D (agents would underestimate
  the training cost and skip Stage 2 fine-tuning).

The roadmap makes both failure modes harder by giving each block
a fixed scope, a deliverable list, and an acceptance gate. PR
reviewers can map any K3-related PR to a block; PRs that try to
collapse multiple blocks (e.g., "B+C+D+E in one PR") are scope
violations.
