|
| 1 | +# Skill: Build a distributed speculative-decode inference engine (remote DFlash + f_θ proposer) |
| 2 | + |
| 3 | +**Reusable across agents (Claude / Codex / Cursor).** This is the SOP for taking a |
| 4 | +single-host fused spec-decode engine (an AR verifier + an EAGLE-style drafter + |
| 5 | +f_θ KV restoration) and splitting it across hosts — **verifier on host A, drafter |
| 6 | ++ f_θ proposer on host B** — over a real gRPC data plane (ADR 0009 §4 "F3"). The |
| 7 | +concrete example is Kakeya's gemma-4 verifier (MLX, Mac) ↔ DFlash+f_θ (torch, GPU), |
| 8 | +but the pattern is general. |
| 9 | + |
| 10 | +The non-negotiable invariant that makes this safe: **correctness containment** — |
| 11 | +the verifier's local greedy verify decides every token, so the output is |
| 12 | +**byte-identical to local greedy regardless of what the remote proposer drafts**. |
| 13 | +A wrong/stale/garbage draft can only lower the acceptance rate, never change a token. |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +## 1. When to use this skill |
| 18 | + |
| 19 | +- You have a working **single-host** fused spec-decode loop and want to offload the |
| 20 | + drafter (+ f_θ) to another machine (GPU fleet utilization, memory split, etc.). |
| 21 | +- The drafter is **EAGLE-style** (needs the verifier's aux-layer hidden states + |
| 22 | + the verifier's tied embedding), so it is NOT a token-ids-only proposer. |
| 23 | +- You need a real cross-host **RTT / throughput / bounded-memory** measurement of |
| 24 | + the production config, not a toy proposer. |
| 25 | + |
| 26 | +If your proposer is **model-free / token-ids-only** (e.g. an n-gram prompt-lookup), |
| 27 | +you do NOT need this — use the simpler `ProposerService` / `RemoteProposer` |
| 28 | +(ADR 0009 control plane). This skill is specifically for the **bulk-tensor data |
| 29 | +plane** (aux hidden + f_θ-projected K/V crossing the wire). |
| 30 | + |
| 31 | +--- |
| 32 | + |
| 33 | +## 2. Architecture: two layers |
| 34 | + |
| 35 | +Keep the **transport/protocol** strictly separate from the **model math** so the |
| 36 | +former is unit-testable without GPUs/models and the latter is swappable per |
| 37 | +framework. |
| 38 | + |
| 39 | +### Layer 1 — framework-agnostic machinery (pure-python, 100%-unit-tested) |
| 40 | +- `tensor_codec` — a self-describing `WireTensor` ↔ proto `Tensor` (dtype string + |
| 41 | + int64 shape + raw little-endian bytes). bf16 has no numpy scalar → carry it as |
| 42 | + `uint16` bits under the logical name `"bfloat16"`; rebuild via thin torch/mlx |
| 43 | + bridges. **No torch/mlx import in the codec** (mlx bridges are `# pragma: no cover`). |
| 44 | +- `dflash_service` — a `RestorationDraftEngine` Protocol (WireTensor in/out), an |
| 45 | + async gRPC servicer, and a sync `RemoteDFlashProposer` client. Engine `KeyError` |
| 46 | + → `NOT_FOUND`, `ValueError` → `INVALID_ARGUMENT`. |
| 47 | +- `fused_decode` — `DistributedFusedDecoder` (mirrors the in-process fused loop) |
| 48 | + driving a `RestoringVerifier` Protocol. Aux/K-V cross the verifier↔decoder |
| 49 | + boundary as `WireTensor`, so the loop is framework-agnostic and fully fakeable. |
| 50 | + |
| 51 | +### Layer 2 — real-model engines (mlx/torch, validated on-device, NOT coverage-gated) |
| 52 | +- **Host A (verifier):** a `RestoringVerifier` adapter wrapping your restored |
| 53 | + incremental verifier (Kakeya: `MLXRestoringVerifierAdapter` over |
| 54 | + `MLXRestoredIncrementalVerifier`). |
| 55 | +- **Host B (proposer):** a `RestorationDraftEngine` impl holding the drafter + f_θ |
| 56 | + + the verifier's tied embedding (Kakeya: `MLXRestorationDraftEngine` for an |
| 57 | + all-Mac loopback, `TorchRestorationDraftEngine` for a CUDA host). |
| 58 | + |
| 59 | +### Wire protocol (stateful session) |
| 60 | +Per turn: **Restore** (prompt → host B captures drafter K/V → f_θ → verifier K/V |
| 61 | +banks; host A prefills) → **SeedContext** (host A's verifier aux hidden over the |
| 62 | +prompt → host B's drafter context K/V). Per block: **DraftBlock** (bonus + |
| 63 | +context_len → exactly `block_size` drafts) → host A verifies/commits → |
| 64 | +**ExtendContext** (committed tokens' aux, O(block) → grow host B's context). |
| 65 | +**CloseSession** frees host-B state. |
| 66 | + |
| 67 | +| Message | Dir | Size class | |
| 68 | +|---|---|---| |
| 69 | +| Restore | A→B ids / B→A K/V banks | O(T) one-time (empty under S5 free-lunch) | |
| 70 | +| SeedContext | A→B aux | O(T) one-time | |
| 71 | +| DraftBlock | A↔B | O(1) / O(block) | |
| 72 | +| ExtendContext | A→B committed aux | O(block) (the per-block bandwidth term) | |
| 73 | + |
| 74 | +--- |
| 75 | + |
| 76 | +## 3. SOP — build order |
| 77 | + |
| 78 | +1. **Ground the dataflow first.** Read the EXACT single-host fused loop and write |
| 79 | + down, per block, every tensor that crosses the drafter↔verifier boundary |
| 80 | + (shapes, dtype, which model produces it). Decide what stays local (drafter |
| 81 | + context K/V, verifier KV cache, full logits — send only the bonus int) vs what |
| 82 | + crosses (aux hidden O(block), draft ids, restored K/V once). |
| 83 | +2. **Build Layer 1 + unit tests FIRST.** Codec roundtrip + dtype/byte-count |
| 84 | + validation; servicer over a real `grpc.aio` server with a fake engine (status |
| 85 | + mapping, dead-address wrap, draft-count refusal); decoder with a fake verifier |
| 86 | + that models a fixed greedy continuation + fake remotes returning **perfect AND |
| 87 | + wrong** drafts — assert **byte-identical to greedy in both cases**. This proves |
| 88 | + containment before any model is involved. |
| 89 | +3. **Build the real engines (Layer 2)** by REUSING the in-process fused helpers |
| 90 | + (capture-drafter-KV, f_θ projection, `make_context_kv`/`draft_block_cached`/ |
| 91 | + `extend_context_kv`, the restored verifier). Don't reimplement the math. |
| 92 | +4. **Climb the validation ladder** (each rung adds one risk, all assert |
| 93 | + byte-identical): |
| 94 | + - **in-process** (single model load, no gRPC) — validates engine+adapter+loop; |
| 95 | + - **loopback gRPC** (real wire + codec, same host) — validates serialization; |
| 96 | + - **cross-host** (real network) — validates deployment + measures RTT. |
| 97 | + Use **block_size=1 as the greedy baseline** (the same decoder at block=1 is pure |
| 98 | + greedy) so baseline and distributed share one code path. |
| 99 | +5. **Deploy** with the scripts in §5 and **measure** throughput / bounded-memory / RTT. |
| 100 | + |
| 101 | +--- |
| 102 | + |
| 103 | +## 4. Gotchas / lessons (the expensive ones) |
| 104 | + |
| 105 | +- **MLX is Apple-only.** A CUDA host B cannot run the MLX verifier's embedding; |
| 106 | + give host B a **torch** embedding (load the base verifier, or ship just the |
| 107 | + ~1.5 GB tied-embed weight). Output stays byte-identical (greedy verify is |
| 108 | + authoritative); only the drafter numerics / acceptance shift. |
| 109 | +- **transformers version.** gemma-4 (torch) needs `transformers>=5.0`; older |
| 110 | + custom modeling that depends on `decoder_layer.attention_type` breaks under 5.x |
| 111 | + (see `requirements.txt`). Also: 5.x `apply_chat_template` returns a dict — pass |
| 112 | + `tokenize=True, return_dict=False`. |
| 113 | +- **Cross-layer KV sharing.** gemma-4 shares K/V across layers. Ship every |
| 114 | + non-exact f_θ layer from host B, but on host A **filter restored layers to the |
| 115 | + verifier's `kv_source_layer_map` source layers** — the verifier only injects |
| 116 | + those. Keep that filter on the host-A (MLX) side where the layout lives. |
| 117 | +- **f_θ is prefill-only** under S5; on gemma-4 the projected sliding-layer K/V are |
| 118 | + recall-irrelevant ("free lunch") so `Restore` can be empty — force f_θ (ship the |
| 119 | + banks) only when you want it load-bearing / to exercise the path. |
| 120 | +- **gRPC max message size.** Restored K/V (~11 MB) and per-block aux exceed gRPC's |
| 121 | + 4 MiB default — set `grpc.max_{send,receive}_message_length` high on both ends. |
| 122 | +- **Don't sync-RPC on the server's event loop in tests.** A sync client `close()` |
| 123 | + that issues an RPC will deadlock an in-process `grpc.aio` server sharing the |
| 124 | + thread; drive it via `asyncio.to_thread`. (In production the server is remote — |
| 125 | + no constraint.) |
| 126 | +- **vast / cloud port mapping.** Portal ports (Caddy) return HTTP 401 to gRPC, and |
| 127 | + some mapped ports silently drop. Use a **plain high port** (e.g. 50070) reached |
| 128 | + over an **SSH `-L` tunnel** — do not rely on the externally-mapped portal ports. |
| 129 | +- **Big model cache.** The base verifier may exceed the root disk; cache it in a |
| 130 | + RAM-disk (`/dev/shm`). |
| 131 | +- **Verify, don't trust comments.** Every "should be byte-identical" claim must be |
| 132 | + asserted by an actual run on each rung of the ladder. |
| 133 | + |
| 134 | +--- |
| 135 | + |
| 136 | +## 5. Deployment + startup scripts |
| 137 | + |
| 138 | +| Host | Script | What it does | |
| 139 | +|---|---|---| |
| 140 | +| B (GPU) | `scripts/deploy/dflash_proposer_server_gpu.sh` | ensure transformers 5.x, fetch gemma-4 (embed) + DFlash into `/dev/shm` HF cache, serve `DFlashProposerService` on a non-portal port | |
| 141 | +| A (verifier) | `scripts/deploy/dflash_verifier_client.sh` | (optionally) open the SSH `-L` tunnel, probe it, run the byte-identical + RTT E2E against `localhost:<port>` | |
| 142 | +| both | `scripts/research/k3_dflash_proposer_server.py` / `k3_distributed_dflash_e2e_mac.py` | the underlying server + harness (in-process / `--grpc` / `--remote-addr`) | |
| 143 | + |
| 144 | +Typical run: |
| 145 | +```bash |
| 146 | +# Host B (GPU): |
| 147 | +bash scripts/deploy/dflash_proposer_server_gpu.sh --port 50070 |
| 148 | +# Host A (Mac): open the tunnel with YOUR creds, then: |
| 149 | +ssh -p <ssh_port> root@<gpu_host> -L 50070:localhost:50070 # in another shell |
| 150 | +bash scripts/deploy/dflash_verifier_client.sh \ |
| 151 | + --verifier-path /path/to/gemma-4-26B-A4B-it-mlx-4bit --port 50070 |
| 152 | +``` |
| 153 | +On a self-hosted Mac runner, the same E2E runs via the bridge preset |
| 154 | +`mlx-distributed-dflash-e2e-crosshost` (it expects the tunnel open on the runner). |
| 155 | + |
| 156 | +--- |
| 157 | + |
| 158 | +## 6. What "good" looks like (Kakeya gemma-4 ↔ H200, measured) |
| 159 | + |
| 160 | +- **Correctness:** PASS byte-identical-to-greedy on all three rungs (in-process, |
| 161 | + loopback gRPC, real Mac↔H200), DFlash acceptance ≈ **0.86–0.89** (vs n-gram 0.10). |
| 162 | +- **Bounded memory:** verifier-side invariant unchanged by the split — ~235 MB |
| 163 | + resident KV, constant over a 1241-token generation (S5: 25 sliding layers bound |
| 164 | + to sink+window, 5 exact layers full-context). |
| 165 | +- **RTT (Mac↔H200 over SSH tunnel):** Restore ~3.2 s / 11.5 MB (one-time), |
| 166 | + SeedContext ~0.4 s, DraftBlock ~268 ms, ExtendContext ~316 ms / 0.27 MB-per-block, |
| 167 | + per-block ~584 ms; throughput 3.7 tok/s (block=4) vs 1.0 (block=1). The DFlash |
| 168 | + forward is offloaded to the GPU (a VM→H200 probe shows DraftBlock 108 ms is |
| 169 | + mostly net-RTT vs the 232 ms Mac-CPU compute); cross-host cost is then network |
| 170 | + RTT + per-block aux bandwidth bound. **GA levers:** aux quantization/compression, |
| 171 | + same-rack placement. |
| 172 | + |
| 173 | +--- |
| 174 | + |
| 175 | +## 7. Reference (Kakeya impl) |
| 176 | + |
| 177 | +- Machinery: `inference_engine/distributed/{tensor_codec,dflash_service,fused_decode}.py` |
| 178 | + + tests under `tests/inference_engine/distributed/`. |
| 179 | +- Engines: `inference_engine/backends/mlx/dflash_distributed.py` (host A + Mac host B), |
| 180 | + `inference_engine/v04/dflash_distributed_engine.py` (CUDA host B). |
| 181 | +- Proto: `proto/kakeya/v1/distributed.proto` (`DFlashProposerService`). |
| 182 | +- Design + measured report: `docs/design/distributed-dflash-ftheta-data-plane.md`. |
0 commit comments