Skip to content

Commit 00530dd

Browse files
docs(skill)+deploy: distributed DFlash+f_θ inference SOP skill + host A/B deploy scripts
- docs/skills/distributed-dflash-ftheta-inference-skill.md: reusable SOP (two-layer design, build order, the byte-identical validation ladder, the expensive gotchas: MLX-Apple-only/torch-embed, transformers 5.x, gemma-4 KV-source-layer filtering, vast Caddy ports + SSH -L, /dev/shm cache). - scripts/deploy/dflash_proposer_server_gpu.sh: one-command host-B (GPU) deploy (transformers 5.x + fetch gemma-4/DFlash to /dev/shm + serve DFlashProposerService). - scripts/deploy/dflash_verifier_client.sh: host-A (verifier) launcher (open SSH -L tunnel + probe + run the byte-identical + RTT E2E). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent caa520e commit 00530dd

3 files changed

Lines changed: 343 additions & 0 deletions

File tree

Lines changed: 182 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,182 @@
1+
# Skill: Build a distributed speculative-decode inference engine (remote DFlash + f_θ proposer)
2+
3+
**Reusable across agents (Claude / Codex / Cursor).** This is the SOP for taking a
4+
single-host fused spec-decode engine (an AR verifier + an EAGLE-style drafter +
5+
f_θ KV restoration) and splitting it across hosts — **verifier on host A, drafter
6+
+ f_θ proposer on host B** — over a real gRPC data plane (ADR 0009 §4 "F3"). The
7+
concrete example is Kakeya's gemma-4 verifier (MLX, Mac) ↔ DFlash+f_θ (torch, GPU),
8+
but the pattern is general.
9+
10+
The non-negotiable invariant that makes this safe: **correctness containment**
11+
the verifier's local greedy verify decides every token, so the output is
12+
**byte-identical to local greedy regardless of what the remote proposer drafts**.
13+
A wrong/stale/garbage draft can only lower the acceptance rate, never change a token.
14+
15+
---
16+
17+
## 1. When to use this skill
18+
19+
- You have a working **single-host** fused spec-decode loop and want to offload the
20+
drafter (+ f_θ) to another machine (GPU fleet utilization, memory split, etc.).
21+
- The drafter is **EAGLE-style** (needs the verifier's aux-layer hidden states +
22+
the verifier's tied embedding), so it is NOT a token-ids-only proposer.
23+
- You need a real cross-host **RTT / throughput / bounded-memory** measurement of
24+
the production config, not a toy proposer.
25+
26+
If your proposer is **model-free / token-ids-only** (e.g. an n-gram prompt-lookup),
27+
you do NOT need this — use the simpler `ProposerService` / `RemoteProposer`
28+
(ADR 0009 control plane). This skill is specifically for the **bulk-tensor data
29+
plane** (aux hidden + f_θ-projected K/V crossing the wire).
30+
31+
---
32+
33+
## 2. Architecture: two layers
34+
35+
Keep the **transport/protocol** strictly separate from the **model math** so the
36+
former is unit-testable without GPUs/models and the latter is swappable per
37+
framework.
38+
39+
### Layer 1 — framework-agnostic machinery (pure-python, 100%-unit-tested)
40+
- `tensor_codec` — a self-describing `WireTensor` ↔ proto `Tensor` (dtype string +
41+
int64 shape + raw little-endian bytes). bf16 has no numpy scalar → carry it as
42+
`uint16` bits under the logical name `"bfloat16"`; rebuild via thin torch/mlx
43+
bridges. **No torch/mlx import in the codec** (mlx bridges are `# pragma: no cover`).
44+
- `dflash_service` — a `RestorationDraftEngine` Protocol (WireTensor in/out), an
45+
async gRPC servicer, and a sync `RemoteDFlashProposer` client. Engine `KeyError`
46+
`NOT_FOUND`, `ValueError``INVALID_ARGUMENT`.
47+
- `fused_decode``DistributedFusedDecoder` (mirrors the in-process fused loop)
48+
driving a `RestoringVerifier` Protocol. Aux/K-V cross the verifier↔decoder
49+
boundary as `WireTensor`, so the loop is framework-agnostic and fully fakeable.
50+
51+
### Layer 2 — real-model engines (mlx/torch, validated on-device, NOT coverage-gated)
52+
- **Host A (verifier):** a `RestoringVerifier` adapter wrapping your restored
53+
incremental verifier (Kakeya: `MLXRestoringVerifierAdapter` over
54+
`MLXRestoredIncrementalVerifier`).
55+
- **Host B (proposer):** a `RestorationDraftEngine` impl holding the drafter + f_θ
56+
+ the verifier's tied embedding (Kakeya: `MLXRestorationDraftEngine` for an
57+
all-Mac loopback, `TorchRestorationDraftEngine` for a CUDA host).
58+
59+
### Wire protocol (stateful session)
60+
Per turn: **Restore** (prompt → host B captures drafter K/V → f_θ → verifier K/V
61+
banks; host A prefills) → **SeedContext** (host A's verifier aux hidden over the
62+
prompt → host B's drafter context K/V). Per block: **DraftBlock** (bonus +
63+
context_len → exactly `block_size` drafts) → host A verifies/commits →
64+
**ExtendContext** (committed tokens' aux, O(block) → grow host B's context).
65+
**CloseSession** frees host-B state.
66+
67+
| Message | Dir | Size class |
68+
|---|---|---|
69+
| Restore | A→B ids / B→A K/V banks | O(T) one-time (empty under S5 free-lunch) |
70+
| SeedContext | A→B aux | O(T) one-time |
71+
| DraftBlock | A↔B | O(1) / O(block) |
72+
| ExtendContext | A→B committed aux | O(block) (the per-block bandwidth term) |
73+
74+
---
75+
76+
## 3. SOP — build order
77+
78+
1. **Ground the dataflow first.** Read the EXACT single-host fused loop and write
79+
down, per block, every tensor that crosses the drafter↔verifier boundary
80+
(shapes, dtype, which model produces it). Decide what stays local (drafter
81+
context K/V, verifier KV cache, full logits — send only the bonus int) vs what
82+
crosses (aux hidden O(block), draft ids, restored K/V once).
83+
2. **Build Layer 1 + unit tests FIRST.** Codec roundtrip + dtype/byte-count
84+
validation; servicer over a real `grpc.aio` server with a fake engine (status
85+
mapping, dead-address wrap, draft-count refusal); decoder with a fake verifier
86+
that models a fixed greedy continuation + fake remotes returning **perfect AND
87+
wrong** drafts — assert **byte-identical to greedy in both cases**. This proves
88+
containment before any model is involved.
89+
3. **Build the real engines (Layer 2)** by REUSING the in-process fused helpers
90+
(capture-drafter-KV, f_θ projection, `make_context_kv`/`draft_block_cached`/
91+
`extend_context_kv`, the restored verifier). Don't reimplement the math.
92+
4. **Climb the validation ladder** (each rung adds one risk, all assert
93+
byte-identical):
94+
- **in-process** (single model load, no gRPC) — validates engine+adapter+loop;
95+
- **loopback gRPC** (real wire + codec, same host) — validates serialization;
96+
- **cross-host** (real network) — validates deployment + measures RTT.
97+
Use **block_size=1 as the greedy baseline** (the same decoder at block=1 is pure
98+
greedy) so baseline and distributed share one code path.
99+
5. **Deploy** with the scripts in §5 and **measure** throughput / bounded-memory / RTT.
100+
101+
---
102+
103+
## 4. Gotchas / lessons (the expensive ones)
104+
105+
- **MLX is Apple-only.** A CUDA host B cannot run the MLX verifier's embedding;
106+
give host B a **torch** embedding (load the base verifier, or ship just the
107+
~1.5 GB tied-embed weight). Output stays byte-identical (greedy verify is
108+
authoritative); only the drafter numerics / acceptance shift.
109+
- **transformers version.** gemma-4 (torch) needs `transformers>=5.0`; older
110+
custom modeling that depends on `decoder_layer.attention_type` breaks under 5.x
111+
(see `requirements.txt`). Also: 5.x `apply_chat_template` returns a dict — pass
112+
`tokenize=True, return_dict=False`.
113+
- **Cross-layer KV sharing.** gemma-4 shares K/V across layers. Ship every
114+
non-exact f_θ layer from host B, but on host A **filter restored layers to the
115+
verifier's `kv_source_layer_map` source layers** — the verifier only injects
116+
those. Keep that filter on the host-A (MLX) side where the layout lives.
117+
- **f_θ is prefill-only** under S5; on gemma-4 the projected sliding-layer K/V are
118+
recall-irrelevant ("free lunch") so `Restore` can be empty — force f_θ (ship the
119+
banks) only when you want it load-bearing / to exercise the path.
120+
- **gRPC max message size.** Restored K/V (~11 MB) and per-block aux exceed gRPC's
121+
4 MiB default — set `grpc.max_{send,receive}_message_length` high on both ends.
122+
- **Don't sync-RPC on the server's event loop in tests.** A sync client `close()`
123+
that issues an RPC will deadlock an in-process `grpc.aio` server sharing the
124+
thread; drive it via `asyncio.to_thread`. (In production the server is remote —
125+
no constraint.)
126+
- **vast / cloud port mapping.** Portal ports (Caddy) return HTTP 401 to gRPC, and
127+
some mapped ports silently drop. Use a **plain high port** (e.g. 50070) reached
128+
over an **SSH `-L` tunnel** — do not rely on the externally-mapped portal ports.
129+
- **Big model cache.** The base verifier may exceed the root disk; cache it in a
130+
RAM-disk (`/dev/shm`).
131+
- **Verify, don't trust comments.** Every "should be byte-identical" claim must be
132+
asserted by an actual run on each rung of the ladder.
133+
134+
---
135+
136+
## 5. Deployment + startup scripts
137+
138+
| Host | Script | What it does |
139+
|---|---|---|
140+
| B (GPU) | `scripts/deploy/dflash_proposer_server_gpu.sh` | ensure transformers 5.x, fetch gemma-4 (embed) + DFlash into `/dev/shm` HF cache, serve `DFlashProposerService` on a non-portal port |
141+
| A (verifier) | `scripts/deploy/dflash_verifier_client.sh` | (optionally) open the SSH `-L` tunnel, probe it, run the byte-identical + RTT E2E against `localhost:<port>` |
142+
| both | `scripts/research/k3_dflash_proposer_server.py` / `k3_distributed_dflash_e2e_mac.py` | the underlying server + harness (in-process / `--grpc` / `--remote-addr`) |
143+
144+
Typical run:
145+
```bash
146+
# Host B (GPU):
147+
bash scripts/deploy/dflash_proposer_server_gpu.sh --port 50070
148+
# Host A (Mac): open the tunnel with YOUR creds, then:
149+
ssh -p <ssh_port> root@<gpu_host> -L 50070:localhost:50070 # in another shell
150+
bash scripts/deploy/dflash_verifier_client.sh \
151+
--verifier-path /path/to/gemma-4-26B-A4B-it-mlx-4bit --port 50070
152+
```
153+
On a self-hosted Mac runner, the same E2E runs via the bridge preset
154+
`mlx-distributed-dflash-e2e-crosshost` (it expects the tunnel open on the runner).
155+
156+
---
157+
158+
## 6. What "good" looks like (Kakeya gemma-4 ↔ H200, measured)
159+
160+
- **Correctness:** PASS byte-identical-to-greedy on all three rungs (in-process,
161+
loopback gRPC, real Mac↔H200), DFlash acceptance ≈ **0.86–0.89** (vs n-gram 0.10).
162+
- **Bounded memory:** verifier-side invariant unchanged by the split — ~235 MB
163+
resident KV, constant over a 1241-token generation (S5: 25 sliding layers bound
164+
to sink+window, 5 exact layers full-context).
165+
- **RTT (Mac↔H200 over SSH tunnel):** Restore ~3.2 s / 11.5 MB (one-time),
166+
SeedContext ~0.4 s, DraftBlock ~268 ms, ExtendContext ~316 ms / 0.27 MB-per-block,
167+
per-block ~584 ms; throughput 3.7 tok/s (block=4) vs 1.0 (block=1). The DFlash
168+
forward is offloaded to the GPU (a VM→H200 probe shows DraftBlock 108 ms is
169+
mostly net-RTT vs the 232 ms Mac-CPU compute); cross-host cost is then network
170+
RTT + per-block aux bandwidth bound. **GA levers:** aux quantization/compression,
171+
same-rack placement.
172+
173+
---
174+
175+
## 7. Reference (Kakeya impl)
176+
177+
- Machinery: `inference_engine/distributed/{tensor_codec,dflash_service,fused_decode}.py`
178+
+ tests under `tests/inference_engine/distributed/`.
179+
- Engines: `inference_engine/backends/mlx/dflash_distributed.py` (host A + Mac host B),
180+
`inference_engine/v04/dflash_distributed_engine.py` (CUDA host B).
181+
- Proto: `proto/kakeya/v1/distributed.proto` (`DFlashProposerService`).
182+
- Design + measured report: `docs/design/distributed-dflash-ftheta-data-plane.md`.
Lines changed: 83 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,83 @@
1+
#!/usr/bin/env bash
2+
# Deploy the remote DFlash+f_θ proposer (ADR 0009 §4 F3) on a CUDA host (host B).
3+
#
4+
# One command: ensure transformers 5.x, fetch the gemma-4 verifier (for its
5+
# embedding) + DFlash drafter to a (RAM-disk) HF cache, and serve the
6+
# DFlashProposerService. A gemma-4 MLX verifier on host A drives it via
7+
# RemoteDFlashProposer (see scripts/deploy/dflash_verifier_client.sh).
8+
#
9+
# Usage:
10+
# bash scripts/deploy/dflash_proposer_server_gpu.sh \
11+
# [--port 50070] [--hf-cache /dev/shm/hf] \
12+
# [--verifier-id google/gemma-4-26B-A4B-it] \
13+
# [--drafter-id z-lab/gemma-4-26B-A4B-it-DFlash] \
14+
# [--f-theta-dir results/research/f_theta_v5_s5_sliding] \
15+
# [--python /path/to/venv/python] [--foreground]
16+
#
17+
# IMPORTANT — pick a port the vast/portal Caddy does NOT own. Portal ports
18+
# (1111/8080/8384/6006 on vast) are Caddy-proxied (HTTP 401 to gRPC); use a
19+
# plain high port like 50070 and reach it from host A over an SSH -L tunnel.
20+
set -euo pipefail
21+
22+
PORT=50070
23+
HF_CACHE="/dev/shm/hf" # RAM-disk: the gemma-4 base is ~52GB, > many root disks
24+
VERIFIER_ID="google/gemma-4-26B-A4B-it"
25+
DRAFTER_ID="z-lab/gemma-4-26B-A4B-it-DFlash"
26+
FTHETA_DIR="results/research/f_theta_v5_s5_sliding"
27+
PYBIN="${KAKEYA_GPU_PYTHON:-python3}"
28+
FOREGROUND=0
29+
30+
while [[ $# -gt 0 ]]; do
31+
case "$1" in
32+
--port) shift; PORT="${1:?}" ;;
33+
--hf-cache) shift; HF_CACHE="${1:?}" ;;
34+
--verifier-id) shift; VERIFIER_ID="${1:?}" ;;
35+
--drafter-id) shift; DRAFTER_ID="${1:?}" ;;
36+
--f-theta-dir) shift; FTHETA_DIR="${1:?}" ;;
37+
--python) shift; PYBIN="${1:?}" ;;
38+
--foreground) FOREGROUND=1 ;;
39+
*) echo "[deploy-gpu] unknown arg: $1" >&2; exit 2 ;;
40+
esac
41+
shift
42+
done
43+
44+
repo_root="$(cd "$(dirname "$0")/../.." && pwd)"
45+
cd "$repo_root"
46+
export HF_HOME="$HF_CACHE"
47+
export PYTHONPATH="$repo_root:$repo_root/sdks/python"
48+
49+
log() { echo "[deploy-gpu] $*" >&2; }
50+
51+
log "repo=$repo_root python=$PYBIN port=$PORT hf_cache=$HF_CACHE"
52+
[[ -s "$FTHETA_DIR/f_theta_weights.pt" ]] || {
53+
log "ERROR: $FTHETA_DIR/f_theta_weights.pt missing (git lfs pull it, or scp from host A)"; exit 1; }
54+
55+
# gemma-4 needs transformers 5.x; the DFlash drafter + f_θ are framework-custom.
56+
if ! "$PYBIN" -c 'import transformers,sys; sys.exit(0 if transformers.__version__>="5" else 1)' 2>/dev/null; then
57+
log "installing transformers>=5.0 (gemma-4 requires it)"
58+
"$PYBIN" -m pip install -q "transformers>=5.0,<6.0"
59+
fi
60+
61+
log "fetching weights into $HF_CACHE (gemma-4 verifier embed + DFlash drafter)"
62+
"$PYBIN" - "$VERIFIER_ID" "$DRAFTER_ID" <<'PY'
63+
import sys
64+
from huggingface_hub import snapshot_download
65+
v, d = sys.argv[1], sys.argv[2]
66+
snapshot_download(v, allow_patterns=["*.json","*.model","tokenizer*","*.safetensors"])
67+
snapshot_download(d)
68+
print("[deploy-gpu] weights ready", file=sys.stderr)
69+
PY
70+
71+
cmd=("$PYBIN" scripts/research/k3_dflash_proposer_server.py
72+
--verifier-id "$VERIFIER_ID" --drafter-id "$DRAFTER_ID"
73+
--f-theta-dir "$FTHETA_DIR" --bind "0.0.0.0:$PORT")
74+
75+
if [[ "$FOREGROUND" == "1" ]]; then
76+
log "serving in foreground on 0.0.0.0:$PORT"
77+
exec "${cmd[@]}"
78+
fi
79+
for p in $(pgrep -f k3_dflash_proposer_server 2>/dev/null || true); do kill "$p" 2>/dev/null || true; done
80+
sleep 1
81+
nohup "${cmd[@]}" > /tmp/dflash_proposer_server.log 2>&1 &
82+
log "server pid $! -> /tmp/dflash_proposer_server.log (loading gemma-4 onto the GPU…)"
83+
log "host A connects via: ssh -p <ssh_port> root@<gpu_host> -L $PORT:localhost:$PORT"
Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,78 @@
1+
#!/usr/bin/env bash
2+
# Host A (verifier) side of the distributed DFlash+f_θ engine: a gemma-4 MLX
3+
# verifier driving the remote proposer (host B) over an SSH -L tunnel, asserting
4+
# byte-identical-to-greedy and reporting throughput + cross-host RTT.
5+
#
6+
# Usage:
7+
# bash scripts/deploy/dflash_verifier_client.sh \
8+
# --verifier-path /path/to/gemma-4-26B-A4B-it-mlx-4bit \
9+
# --drafter-id z-lab/gemma-4-26B-A4B-it-DFlash \
10+
# [--port 50070] [--max-new 64] [--block 4] \
11+
# [--ssh "-p 43350 root@107.206.71.138" --ssh-key /path/key] # auto-open tunnel
12+
#
13+
# If --ssh is omitted, assumes an SSH -L <port>:localhost:<port> tunnel to host B
14+
# is ALREADY open (the vast/portal case: open it yourself with your own creds).
15+
set -euo pipefail
16+
17+
PORT=50070
18+
VERIFIER_PATH="${KAKEYA_MAC_VERIFIER_PATH:-}"
19+
DRAFTER_ID="${KAKEYA_MAC_DRAFTER_ID:-z-lab/gemma-4-26B-A4B-it-DFlash}"
20+
MAXNEW=64
21+
BLOCK=4
22+
SSH_TARGET=""
23+
SSH_KEY=""
24+
PYBIN="${KAKEYA_MAC_PYTHON:-python3}"
25+
26+
while [[ $# -gt 0 ]]; do
27+
case "$1" in
28+
--port) shift; PORT="${1:?}" ;;
29+
--verifier-path) shift; VERIFIER_PATH="${1:?}" ;;
30+
--drafter-id) shift; DRAFTER_ID="${1:?}" ;;
31+
--max-new) shift; MAXNEW="${1:?}" ;;
32+
--block) shift; BLOCK="${1:?}" ;;
33+
--ssh) shift; SSH_TARGET="${1:?}" ;;
34+
--ssh-key) shift; SSH_KEY="${1:?}" ;;
35+
--python) shift; PYBIN="${1:?}" ;;
36+
*) echo "[verifier-client] unknown arg: $1" >&2; exit 2 ;;
37+
esac
38+
shift
39+
done
40+
41+
repo_root="$(cd "$(dirname "$0")/../.." && pwd)"
42+
cd "$repo_root"
43+
export PYTHONPATH="$repo_root:$repo_root/sdks/python"
44+
log() { echo "[verifier-client] $*" >&2; }
45+
[[ -n "$VERIFIER_PATH" ]] || { log "ERROR: --verifier-path (or KAKEYA_MAC_VERIFIER_PATH) required"; exit 1; }
46+
47+
tunnel_pid=""
48+
cleanup() { [[ -n "$tunnel_pid" ]] && kill "$tunnel_pid" 2>/dev/null || true; }
49+
trap cleanup EXIT
50+
51+
if [[ -n "$SSH_TARGET" ]]; then
52+
key_opt=""; [[ -n "$SSH_KEY" ]] && key_opt="-i $SSH_KEY"
53+
log "opening SSH tunnel: localhost:$PORT -> host B :$PORT ($SSH_TARGET)"
54+
# shellcheck disable=SC2086
55+
ssh $key_opt -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes \
56+
-fN -L "$PORT:localhost:$PORT" $SSH_TARGET
57+
tunnel_pid=$(pgrep -f "$PORT:localhost:$PORT" | head -1 || true)
58+
sleep 3
59+
fi
60+
61+
# Connectivity probe (helps distinguish "tunnel down" from "Caddy 401").
62+
"$PYBIN" - "$PORT" <<'PY'
63+
import socket, sys
64+
p = int(sys.argv[1]); s = socket.socket(); s.settimeout(5)
65+
try:
66+
s.connect(("127.0.0.1", p)); print(f"[verifier-client] tunnel OK -> localhost:{p}", file=sys.stderr)
67+
except Exception as e:
68+
print(f"[verifier-client] NO tunnel on localhost:{p}: {e}\n"
69+
f" open one: ssh -p <ssh_port> root@<gpu_host> -L {p}:localhost:{p}", file=sys.stderr)
70+
sys.exit(1)
71+
finally:
72+
s.close()
73+
PY
74+
75+
log "running cross-host E2E (verifier @here <-> proposer @localhost:$PORT)"
76+
exec "$PYBIN" scripts/research/k3_distributed_dflash_e2e_mac.py \
77+
--verifier-path "$VERIFIER_PATH" --drafter-id "$DRAFTER_ID" \
78+
--remote-addr "localhost:$PORT" --max-new-tokens "$MAXNEW" --block-size "$BLOCK"

0 commit comments

Comments
 (0)