docs(skill)+deploy: distributed DFlash+f_θ inference SOP skill + host A/B deploy scripts

cursoragent · FluffyAIcode · cursoragent · commit 00530dd074db · 2026-06-19T14:43:21.000Z
- docs/skills/distributed-dflash-ftheta-inference-skill.md: reusable SOP (two-layer
  design, build order, the byte-identical validation ladder, the expensive gotchas:
  MLX-Apple-only/torch-embed, transformers 5.x, gemma-4 KV-source-layer filtering,
  vast Caddy ports + SSH -L, /dev/shm cache).
- scripts/deploy/dflash_proposer_server_gpu.sh: one-command host-B (GPU) deploy
  (transformers 5.x + fetch gemma-4/DFlash to /dev/shm + serve DFlashProposerService).
- scripts/deploy/dflash_verifier_client.sh: host-A (verifier) launcher (open SSH -L
  tunnel + probe + run the byte-identical + RTT E2E).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
diff --git a/docs/skills/distributed-dflash-ftheta-inference-skill.md b/docs/skills/distributed-dflash-ftheta-inference-skill.md
@@ -0,0 +1,182 @@
+# Skill: Build a distributed speculative-decode inference engine (remote DFlash + f_θ proposer)
+
+**Reusable across agents (Claude / Codex / Cursor).** This is the SOP for taking a
+single-host fused spec-decode engine (an AR verifier + an EAGLE-style drafter +
+f_θ KV restoration) and splitting it across hosts — **verifier on host A, drafter
++ f_θ proposer on host B** — over a real gRPC data plane (ADR 0009 §4 "F3"). The
+concrete example is Kakeya's gemma-4 verifier (MLX, Mac) ↔ DFlash+f_θ (torch, GPU),
+but the pattern is general.
+
+The non-negotiable invariant that makes this safe: **correctness containment** —
+the verifier's local greedy verify decides every token, so the output is
+**byte-identical to local greedy regardless of what the remote proposer drafts**.
+A wrong/stale/garbage draft can only lower the acceptance rate, never change a token.
+
+---
+
+## 1. When to use this skill
+
+- You have a working **single-host** fused spec-decode loop and want to offload the
+  drafter (+ f_θ) to another machine (GPU fleet utilization, memory split, etc.).
+- The drafter is **EAGLE-style** (needs the verifier's aux-layer hidden states +
+  the verifier's tied embedding), so it is NOT a token-ids-only proposer.
+- You need a real cross-host **RTT / throughput / bounded-memory** measurement of
+  the production config, not a toy proposer.
+
+If your proposer is **model-free / token-ids-only** (e.g. an n-gram prompt-lookup),
+you do NOT need this — use the simpler `ProposerService` / `RemoteProposer`
+(ADR 0009 control plane). This skill is specifically for the **bulk-tensor data
+plane** (aux hidden + f_θ-projected K/V crossing the wire).
+
+---
+
+## 2. Architecture: two layers
+
+Keep the **transport/protocol** strictly separate from the **model math** so the
+former is unit-testable without GPUs/models and the latter is swappable per
+framework.
+
+### Layer 1 — framework-agnostic machinery (pure-python, 100%-unit-tested)
+- `tensor_codec` — a self-describing `WireTensor` ↔ proto `Tensor` (dtype string +
+  int64 shape + raw little-endian bytes). bf16 has no numpy scalar → carry it as
+  `uint16` bits under the logical name `"bfloat16"`; rebuild via thin torch/mlx
+  bridges. **No torch/mlx import in the codec** (mlx bridges are `# pragma: no cover`).
+- `dflash_service` — a `RestorationDraftEngine` Protocol (WireTensor in/out), an
+  async gRPC servicer, and a sync `RemoteDFlashProposer` client. Engine `KeyError`
+  → `NOT_FOUND`, `ValueError` → `INVALID_ARGUMENT`.
+- `fused_decode` — `DistributedFusedDecoder` (mirrors the in-process fused loop)
+  driving a `RestoringVerifier` Protocol. Aux/K-V cross the verifier↔decoder
+  boundary as `WireTensor`, so the loop is framework-agnostic and fully fakeable.
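
The `tensor_codec` idea above (dtype string + shape + raw little-endian bytes, with bf16 carried as 16-bit words) can be sketched with the standard library alone. This is a minimal illustration, not the Kakeya codec: the `WireTensor` fields, `encode_bf16`, and `decode_bf16` are hypothetical names, and the conversion truncates the float32 mantissa, whereas a production codec would normally round.

```python
import struct
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class WireTensor:
    """Hypothetical self-describing tensor: logical dtype, shape, raw LE bytes."""
    dtype: str
    shape: tuple
    data: bytes


# Bytes per element for the logical dtypes this sketch knows about.
ITEMSIZE = {"float32": 4, "bfloat16": 2}


def encode_bf16(values, shape):
    """Pack floats as bfloat16 bit patterns (upper 16 bits of the float32 form)."""
    if prod(shape) != len(values):
        raise ValueError("shape does not match element count")
    words = []
    for v in values:
        (bits32,) = struct.unpack("<I", struct.pack("<f", v))
        words.append(bits32 >> 16)  # keeps sign, exponent, top 7 mantissa bits
    payload = struct.pack(f"<{len(words)}H", *words)
    return WireTensor("bfloat16", tuple(shape), payload)


def decode_bf16(tensor):
    """Validate dtype and byte count, then widen each 16-bit word back to a float."""
    if tensor.dtype != "bfloat16":
        raise ValueError(f"unexpected dtype {tensor.dtype!r}")
    n = prod(tensor.shape)
    if len(tensor.data) != n * ITEMSIZE[tensor.dtype]:
        raise ValueError("byte count does not match shape")
    words = struct.unpack(f"<{n}H", tensor.data)
    return [struct.unpack("<f", struct.pack("<I", w << 16))[0] for w in words]


wire = encode_bf16([1.0, -2.5, 0.15625, 0.0], (2, 2))
restored = decode_bf16(wire)
```

Because the payload is plain bytes plus a dtype name, the same message can be rebuilt by a torch bridge on one host and an mlx bridge on the other without the codec importing either framework.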
+
+### Layer 2 — real-model engines (mlx/torch, validated on-device, NOT coverage-gated)
+- **Host A (verifier):** a `RestoringVerifier` adapter wrapping your restored
+  incremental verifier (Kakeya: `MLXRestoringVerifierAdapter` over
+  `MLXRestoredIncrementalVerifier`).
+- **Host B (proposer):** a `RestorationDraftEngine` impl holding the drafter + f_θ
+  + the verifier's tied embedding (Kakeya: `MLXRestorationDraftEngine` for an
+  all-Mac loopback, `TorchRestorationDraftEngine` for a CUDA host).
+
+### Wire protocol (stateful session)
+Per turn: **Restore** (prompt → host B captures drafter K/V → f_θ → verifier K/V
+banks; host A prefills) → **SeedContext** (host A's verifier aux hidden over the
+prompt → host B's drafter context K/V). Per block: **DraftBlock** (bonus +
+context_len → exactly `block_size` drafts) → host A verifies/commits →
+**ExtendContext** (committed tokens' aux, O(block) → grow host B's context).
+**CloseSession** frees host-B state.
+
+| Message | Dir | Size class |
+|---|---|---|
+| Restore | A→B ids / B→A K/V banks | O(T) one-time (empty under S5 free-lunch) |
+| SeedContext | A→B aux | O(T) one-time |
+| DraftBlock | A↔B | O(1) / O(block) |
+| ExtendContext | A→B committed aux | O(block) (the per-block bandwidth term) |
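
The call order in the table can be exercised against an in-memory stand-in for host B. The sketch below is an assumption-laden toy: `FakeDraftEngine` and its method names are hypothetical, it tracks only the per-session context length, and the out-of-sync check is an illustrative guess at what a stateful engine might enforce. It does follow the error convention stated for the servicer (`KeyError` for unknown sessions, `ValueError` for bad arguments).

```python
class FakeDraftEngine:
    """Toy host-B engine: one integer of state (context length) per session."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.sessions = {}  # session_id -> context length in tokens

    def _context_len(self, session_id):
        return self.sessions[session_id]  # KeyError if the session is unknown

    def restore(self, session_id, prompt_ids):
        self.sessions[session_id] = 0  # context is seeded by a separate call
        return []  # empty K/V banks, as in the "free lunch" case

    def seed_context(self, session_id, prompt_aux):
        self._context_len(session_id)
        self.sessions[session_id] = len(prompt_aux)

    def draft_block(self, session_id, bonus, context_len):
        if context_len != self._context_len(session_id):
            raise ValueError("context_len out of sync with host-B state")
        # Placeholder drafts; a real engine would run the drafter here.
        return [bonus + i + 1 for i in range(self.block_size)]

    def extend_context(self, session_id, committed_aux):
        self.sessions[session_id] = self._context_len(session_id) + len(committed_aux)

    def close_session(self, session_id):
        self.sessions.pop(session_id, None)


engine = FakeDraftEngine(block_size=4)
engine.restore("s1", [10, 11, 12])                    # per turn, O(T)
engine.seed_context("s1", ["aux"] * 3)                # per turn, O(T)
drafts = engine.draft_block("s1", bonus=7, context_len=3)
engine.extend_context("s1", ["aux"] * 2)              # pretend 2 tokens were committed
context_after_block = engine.sessions["s1"]
engine.close_session("s1")
```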
+
+---
+
+## 3. SOP — build order
+
+1. **Ground the dataflow first.** Read the EXACT single-host fused loop and write
+   down, per block, every tensor that crosses the drafter↔verifier boundary
+   (shapes, dtype, which model produces it). Decide what stays local (drafter
+   context K/V, verifier KV cache, full logits — send only the bonus int) vs what
+   crosses (aux hidden O(block), draft ids, restored K/V once).
+2. **Build Layer 1 + unit tests FIRST.** Codec roundtrip + dtype/byte-count
+   validation; servicer over a real `grpc.aio` server with a fake engine (status
+   mapping, dead-address wrap, draft-count refusal); decoder with a fake verifier
+   that models a fixed greedy continuation + fake remotes returning **perfect AND
+   wrong** drafts — assert **byte-identical to greedy in both cases**. This proves
+   containment before any model is involved.
+3. **Build the real engines (Layer 2)** by REUSING the in-process fused helpers
+   (capture-drafter-KV, f_θ projection, `make_context_kv`/`draft_block_cached`/
+   `extend_context_kv`, the restored verifier). Don't reimplement the math.
+4. **Climb the validation ladder** (each rung adds one risk, all assert
+   byte-identical):
+   - **in-process** (single model load, no gRPC) — validates engine+adapter+loop;
+   - **loopback gRPC** (real wire + codec, same host) — validates serialization;
+   - **cross-host** (real network) — validates deployment + measures RTT.
+   Use **block_size=1 as the greedy baseline** (the same decoder at block=1 is pure
+   greedy) so baseline and distributed share one code path.
+5. **Deploy** with the scripts in §5 and **measure** throughput / bounded-memory / RTT.
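
Step 2's containment test and step 4's `block_size=1` baseline can be illustrated with a toy loop. Everything here is a hypothetical stand-in: `greedy_next` plays the verifier with a fixed deterministic rule, and `decode` is a simplified draft-then-verify loop rather than `DistributedFusedDecoder`. The point it shows is that the committed token is always the verifier's own choice, so perfect and garbage proposers give the same output.

```python
def greedy_next(tokens):
    """Toy verifier: the greedy next token is a fixed function of the prefix."""
    return (sum(tokens) * 31 + len(tokens)) % 97


def decode(prompt, max_new, block_size, propose):
    """Draft-then-verify: drafts only decide how far a block runs, never a token."""
    out = list(prompt)
    end = len(prompt) + max_new
    while len(out) < end:
        drafts = propose(out, block_size)
        if len(drafts) != block_size:
            raise ValueError("proposer must return exactly block_size drafts")
        for draft in drafts:
            target = greedy_next(out)  # the verifier's greedy choice
            out.append(target)         # commit the verifier's token
            if draft != target or len(out) == end:
                break                  # first mismatch (or length cap) ends the block
    return out[len(prompt):]


def perfect(prefix, k):
    """Proposer that always drafts the true greedy continuation."""
    toks, drafts = list(prefix), []
    for _ in range(k):
        drafts.append(greedy_next(toks))
        toks.append(drafts[-1])
    return drafts


def garbage(prefix, k):
    """Proposer whose drafts can never match (greedy_next is never negative)."""
    return [-1] * k


prompt = [3, 1, 4, 1, 5]
baseline = decode(prompt, 20, 1, garbage)       # block_size=1 is pure greedy
with_perfect = decode(prompt, 20, 4, perfect)
with_garbage = decode(prompt, 20, 4, garbage)
assert with_perfect == baseline and with_garbage == baseline
```

Using the same `decode` at `block_size=1` for the baseline means the comparison exercises one code path, which is the reason the SOP recommends it.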
+
+---
+
+## 4. Gotchas / lessons (the expensive ones)
+
+- **MLX is Apple-only.** A CUDA host B cannot run the MLX verifier's embedding;
+  give host B a **torch** embedding (load the base verifier, or ship just the
+  ~1.5 GB tied-embed weight). Output stays byte-identical (greedy verify is
+  authoritative); only the drafter numerics / acceptance shift.
+- **transformers version.** gemma-4 (torch) needs `transformers>=5.0`; older
+  custom modeling that depends on `decoder_layer.attention_type` breaks under 5.x
+  (see `requirements.txt`). Also: 5.x `apply_chat_template` returns a dict — pass
+  `tokenize=True, return_dict=False`.
+- **Cross-layer KV sharing.** gemma-4 shares K/V across layers. Ship every
+  non-exact f_θ layer from host B, but on host A **filter restored layers to the
+  verifier's `kv_source_layer_map` source layers** — the verifier only injects
+  those. Keep that filter on the host-A (MLX) side where the layout lives.
+- **f_θ is prefill-only** under S5; on gemma-4 the projected sliding-layer K/V are
+  recall-irrelevant ("free lunch") so `Restore` can be empty — force f_θ (ship the
+  banks) only when you want it load-bearing / to exercise the path.
+- **gRPC max message size.** Restored K/V (~11 MB) exceeds gRPC's 4 MiB default
+  receive limit, and aux payloads can as well (SeedContext is O(T)); set
+  `grpc.max_{send,receive}_message_length` high on both ends.
+- **Don't sync-RPC on the server's event loop in tests.** A sync client `close()`
+  that issues an RPC will deadlock an in-process `grpc.aio` server sharing the
+  thread; drive it via `asyncio.to_thread`. (In production the server is remote —
+  no constraint.)
+- **vast / cloud port mapping.** Portal ports (Caddy) return HTTP 401 to gRPC, and
+  some mapped ports silently drop. Use a **plain high port** (e.g. 50070) reached
+  over an **SSH `-L` tunnel** — do not rely on the externally-mapped portal ports.
+- **Big model cache.** The base verifier may exceed the root disk; cache it in a
+  RAM-disk (`/dev/shm`).
+- **Verify, don't trust comments.** Every "should be byte-identical" claim must be
+  asserted by an actual run on each rung of the ladder.
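
For the message-size gotcha, a small helper can build the option list once and share it between client and server. This is a sketch: the helper name and the 64 MiB ceiling are assumptions, not values from the Kakeya code. The two option keys are the standard gRPC channel arguments.

```python
def grpc_message_size_options(max_bytes=64 * 1024 * 1024):
    """Return channel/server options raising both message-size limits."""
    return [
        ("grpc.max_send_message_length", max_bytes),
        ("grpc.max_receive_message_length", max_bytes),
    ]


options = grpc_message_size_options()

# With grpcio installed, the same list is passed on both ends, for example:
#   channel = grpc.insecure_channel("localhost:50070", options=options)
#   server = grpc.aio.server(options=options)
```

Setting the limit on only one side is a common way to get a `RESOURCE_EXHAUSTED` error on the first large `Restore` reply, so the list is worth sharing rather than duplicating.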
+
+---
+
+## 5. Deployment + startup scripts
+
+| Host | Script | What it does |
+|---|---|---|
+| B (GPU) | `scripts/deploy/dflash_proposer_server_gpu.sh` | ensure transformers 5.x, fetch gemma-4 (embed) + DFlash into `/dev/shm` HF cache, serve `DFlashProposerService` on a non-portal port |
+| A (verifier) | `scripts/deploy/dflash_verifier_client.sh` | (optionally) open the SSH `-L` tunnel, probe it, run the byte-identical + RTT E2E against `localhost:<port>` |
+| both | `scripts/research/k3_dflash_proposer_server.py` / `k3_distributed_dflash_e2e_mac.py` | the underlying server + harness (in-process / `--grpc` / `--remote-addr`) |
+
+Typical run:
+```bash
+# Host B (GPU):
+bash scripts/deploy/dflash_proposer_server_gpu.sh --port 50070
+# Host A (Mac): open the tunnel with YOUR creds, then:
+ssh -p <ssh_port> root@<gpu_host> -L 50070:localhost:50070   # in another shell
+bash scripts/deploy/dflash_verifier_client.sh \
+    --verifier-path /path/to/gemma-4-26B-A4B-it-mlx-4bit --port 50070
+```
+On a self-hosted Mac runner, the same E2E runs via the bridge preset
+`mlx-distributed-dflash-e2e-crosshost` (it expects the tunnel open on the runner).
+
+---
+
+## 6. What "good" looks like (Kakeya gemma-4 ↔ H200, measured)
+
+- **Correctness:** PASS byte-identical-to-greedy on all three rungs (in-process,
+  loopback gRPC, real Mac↔H200), DFlash acceptance ≈ **0.86–0.89** (vs n-gram 0.10).
+- **Bounded memory:** verifier-side invariant unchanged by the split — ~235 MB
+  resident KV, constant over a 1241-token generation (S5: 25 sliding layers bound
+  to sink+window, 5 exact layers full-context).
+- **RTT (Mac↔H200 over SSH tunnel):** Restore ~3.2 s / 11.5 MB (one-time),
+  SeedContext ~0.4 s, DraftBlock ~268 ms, ExtendContext ~316 ms / 0.27 MB-per-block,
+  per-block ~584 ms; throughput 3.7 tok/s (block=4) vs 1.0 tok/s (block=1). The DFlash
+  forward is offloaded to the GPU: a VM→H200 probe measures DraftBlock at 108 ms,
+  mostly network RTT, vs 232 ms of compute on the Mac CPU. Cross-host cost is
+  therefore bound by network RTT plus per-block aux bandwidth. **GA levers:** aux
+  quantization/compression, same-rack placement.
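
The RTT figures above can be cross-checked with a few lines of arithmetic. The inputs are the measured numbers from this section. The derived rates are latency-inclusive (payload size divided by wall time), and the last two values assume blocks run strictly one after another with no pipelining, which is an assumption of this sketch rather than a measured fact.

```python
draft_block_ms = 268
extend_context_ms = 316
per_block_ms = draft_block_ms + extend_context_ms  # 584, the reported per-block figure

# Effective transfer rates, including round-trip latency.
restore_mb_per_s = 11.5 / 3.2
extend_mb_per_s = 0.27 / (extend_context_ms / 1000)

# Under the sequential-blocks assumption, RTT alone caps the block rate, so the
# measured 3.7 tok/s implies at least this many committed tokens per block.
max_blocks_per_s = 1000 / per_block_ms
min_tokens_per_block = 3.7 / max_blocks_per_s

print(f"per-block RTT: {per_block_ms} ms")
print(f"Restore: {restore_mb_per_s:.2f} MB/s, ExtendContext: {extend_mb_per_s:.2f} MB/s")
print(f"block rate cap: {max_blocks_per_s:.2f}/s, tokens per block >= {min_tokens_per_block:.2f}")
```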
+
+---
+
+## 7. Reference (Kakeya impl)
+
+- Machinery: `inference_engine/distributed/{tensor_codec,dflash_service,fused_decode}.py`
+  + tests under `tests/inference_engine/distributed/`.
+- Engines: `inference_engine/backends/mlx/dflash_distributed.py` (host A + Mac host B),
+  `inference_engine/v04/dflash_distributed_engine.py` (CUDA host B).
+- Proto: `proto/kakeya/v1/distributed.proto` (`DFlashProposerService`).
+- Design + measured report: `docs/design/distributed-dflash-ftheta-data-plane.md`.
diff --git a/scripts/deploy/dflash_proposer_server_gpu.sh b/scripts/deploy/dflash_proposer_server_gpu.sh
@@ -0,0 +1,83 @@
+#!/usr/bin/env bash
+# Deploy the remote DFlash+f_θ proposer (ADR 0009 §4 F3) on a CUDA host (host B).
+#
+# One command: ensure transformers 5.x, fetch the gemma-4 verifier (for its
+# embedding) + DFlash drafter to a (RAM-disk) HF cache, and serve the
+# DFlashProposerService. A gemma-4 MLX verifier on host A drives it via
+# RemoteDFlashProposer (see scripts/deploy/dflash_verifier_client.sh).
+#
+# Usage:
+#   bash scripts/deploy/dflash_proposer_server_gpu.sh \
+#       [--port 50070] [--hf-cache /dev/shm/hf] \
+#       [--verifier-id google/gemma-4-26B-A4B-it] \
+#       [--drafter-id z-lab/gemma-4-26B-A4B-it-DFlash] \
+#       [--f-theta-dir results/research/f_theta_v5_s5_sliding] \
+#       [--python /path/to/venv/python] [--foreground]
+#
+# IMPORTANT — pick a port the vast/portal Caddy does NOT own. Portal ports
+# (1111/8080/8384/6006 on vast) are Caddy-proxied (HTTP 401 to gRPC); use a
+# plain high port like 50070 and reach it from host A over an SSH -L tunnel.
+set -euo pipefail
+
+PORT=50070
+HF_CACHE="/dev/shm/hf"          # RAM-disk: the gemma-4 base is ~52GB, > many root disks
+VERIFIER_ID="google/gemma-4-26B-A4B-it"
+DRAFTER_ID="z-lab/gemma-4-26B-A4B-it-DFlash"
+FTHETA_DIR="results/research/f_theta_v5_s5_sliding"
+PYBIN="${KAKEYA_GPU_PYTHON:-python3}"
+FOREGROUND=0
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --port) shift; PORT="${1:?}" ;;
+    --hf-cache) shift; HF_CACHE="${1:?}" ;;
+    --verifier-id) shift; VERIFIER_ID="${1:?}" ;;
+    --drafter-id) shift; DRAFTER_ID="${1:?}" ;;
+    --f-theta-dir) shift; FTHETA_DIR="${1:?}" ;;
+    --python) shift; PYBIN="${1:?}" ;;
+    --foreground) FOREGROUND=1 ;;
+    *) echo "[deploy-gpu] unknown arg: $1" >&2; exit 2 ;;
+  esac
+  shift
+done
+
+repo_root="$(cd "$(dirname "$0")/../.." && pwd)"
+cd "$repo_root"
+export HF_HOME="$HF_CACHE"
+export PYTHONPATH="$repo_root:$repo_root/sdks/python"
+
+log() { echo "[deploy-gpu] $*" >&2; }
+
+log "repo=$repo_root python=$PYBIN port=$PORT hf_cache=$HF_CACHE"
+[[ -s "$FTHETA_DIR/f_theta_weights.pt" ]] || {
+  log "ERROR: $FTHETA_DIR/f_theta_weights.pt missing (git lfs pull it, or scp from host A)"; exit 1; }
+
+# gemma-4 needs transformers 5.x; the DFlash drafter + f_θ are framework-custom.
+if ! "$PYBIN" -c 'import transformers,sys; sys.exit(0 if int(transformers.__version__.split(".")[0])>=5 else 1)' 2>/dev/null; then
+  log "installing transformers>=5.0 (gemma-4 requires it)"
+  "$PYBIN" -m pip install -q "transformers>=5.0,<6.0"
+fi
+
+log "fetching weights into $HF_CACHE (gemma-4 verifier embed + DFlash drafter)"
+"$PYBIN" - "$VERIFIER_ID" "$DRAFTER_ID" <<'PY'
+import sys
+from huggingface_hub import snapshot_download
+v, d = sys.argv[1], sys.argv[2]
+snapshot_download(v, allow_patterns=["*.json","*.model","tokenizer*","*.safetensors"])
+snapshot_download(d)
+print("[deploy-gpu] weights ready", file=sys.stderr)
+PY
+
+cmd=("$PYBIN" scripts/research/k3_dflash_proposer_server.py
+     --verifier-id "$VERIFIER_ID" --drafter-id "$DRAFTER_ID"
+     --f-theta-dir "$FTHETA_DIR" --bind "0.0.0.0:$PORT")
+
+if [[ "$FOREGROUND" == "1" ]]; then
+  log "serving in foreground on 0.0.0.0:$PORT"
+  exec "${cmd[@]}"
+fi
+for p in $(pgrep -f k3_dflash_proposer_server 2>/dev/null || true); do kill "$p" 2>/dev/null || true; done
+sleep 1
+nohup "${cmd[@]}" > /tmp/dflash_proposer_server.log 2>&1 &
+log "server pid $! -> /tmp/dflash_proposer_server.log (loading gemma-4 onto the GPU…)"
+log "host A connects via: ssh -p <ssh_port> root@<gpu_host> -L $PORT:localhost:$PORT"
diff --git a/scripts/deploy/dflash_verifier_client.sh b/scripts/deploy/dflash_verifier_client.sh
@@ -0,0 +1,78 @@
+#!/usr/bin/env bash
+# Host A (verifier) side of the distributed DFlash+f_θ engine: a gemma-4 MLX
+# verifier driving the remote proposer (host B) over an SSH -L tunnel, asserting
+# byte-identical-to-greedy and reporting throughput + cross-host RTT.
+#
+# Usage:
+#   bash scripts/deploy/dflash_verifier_client.sh \
+#       --verifier-path /path/to/gemma-4-26B-A4B-it-mlx-4bit \
+#       --drafter-id z-lab/gemma-4-26B-A4B-it-DFlash \
+#       [--port 50070] [--max-new 64] [--block 4] \
+#       [--ssh "-p 43350 root@107.206.71.138" --ssh-key /path/key]   # auto-open tunnel
+#
+# If --ssh is omitted, assumes an SSH -L <port>:localhost:<port> tunnel to host B
+# is ALREADY open (the vast/portal case: open it yourself with your own creds).
+set -euo pipefail
+
+PORT=50070
+VERIFIER_PATH="${KAKEYA_MAC_VERIFIER_PATH:-}"
+DRAFTER_ID="${KAKEYA_MAC_DRAFTER_ID:-z-lab/gemma-4-26B-A4B-it-DFlash}"
+MAXNEW=64
+BLOCK=4
+SSH_TARGET=""
+SSH_KEY=""
+PYBIN="${KAKEYA_MAC_PYTHON:-python3}"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --port) shift; PORT="${1:?}" ;;
+    --verifier-path) shift; VERIFIER_PATH="${1:?}" ;;
+    --drafter-id) shift; DRAFTER_ID="${1:?}" ;;
+    --max-new) shift; MAXNEW="${1:?}" ;;
+    --block) shift; BLOCK="${1:?}" ;;
+    --ssh) shift; SSH_TARGET="${1:?}" ;;
+    --ssh-key) shift; SSH_KEY="${1:?}" ;;
+    --python) shift; PYBIN="${1:?}" ;;
+    *) echo "[verifier-client] unknown arg: $1" >&2; exit 2 ;;
+  esac
+  shift
+done
+
+repo_root="$(cd "$(dirname "$0")/../.." && pwd)"
+cd "$repo_root"
+export PYTHONPATH="$repo_root:$repo_root/sdks/python"
+log() { echo "[verifier-client] $*" >&2; }
+[[ -n "$VERIFIER_PATH" ]] || { log "ERROR: --verifier-path (or KAKEYA_MAC_VERIFIER_PATH) required"; exit 1; }
+
+tunnel_pid=""
+cleanup() { [[ -n "$tunnel_pid" ]] && kill "$tunnel_pid" 2>/dev/null || true; }
+trap cleanup EXIT
+
+if [[ -n "$SSH_TARGET" ]]; then
+  key_opt=""; [[ -n "$SSH_KEY" ]] && key_opt="-i $SSH_KEY"
+  log "opening SSH tunnel: localhost:$PORT -> host B :$PORT ($SSH_TARGET)"
+  # shellcheck disable=SC2086
+  ssh $key_opt -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes \
+      -fN -L "$PORT:localhost:$PORT" $SSH_TARGET
+  tunnel_pid=$(pgrep -f "$PORT:localhost:$PORT" | head -1 || true)
+  sleep 3
+fi
+
+# Connectivity probe (helps distinguish "tunnel down" from "Caddy 401").
+"$PYBIN" - "$PORT" <<'PY'
+import socket, sys
+p = int(sys.argv[1]); s = socket.socket(); s.settimeout(5)
+try:
+    s.connect(("127.0.0.1", p)); print(f"[verifier-client] tunnel OK -> localhost:{p}", file=sys.stderr)
+except Exception as e:
+    print(f"[verifier-client] NO tunnel on localhost:{p}: {e}\n"
+          f"  open one: ssh -p <ssh_port> root@<gpu_host> -L {p}:localhost:{p}", file=sys.stderr)
+    sys.exit(1)
+finally:
+    s.close()
+PY
+
+log "running cross-host E2E (verifier @here <-> proposer @localhost:$PORT)"
+# No exec here: the shell must outlive the E2E run so the EXIT trap can close an auto-opened tunnel.
+"$PYBIN" scripts/research/k3_distributed_dflash_e2e_mac.py \
+    --verifier-path "$VERIFIER_PATH" --drafter-id "$DRAFTER_ID" \
+    --remote-addr "localhost:$PORT" --max-new-tokens "$MAXNEW" --block-size "$BLOCK"