Commit 4a1d60c

docs+bridge: record live cross-host RTT (VM->H200) + sync crosshost preset to :50070
Deployed TorchRestorationDraftEngine on an H200 and measured the real DFlash+f_θ data plane cross-host over an SSH tunnel: DraftBlock p50 108 ms (vs 232 ms on the Mac-CPU loopback; the GPU offload cuts draft compute), ExtendContext 140 ms / 0.27 MB, per-block ~248 ms. Caddy occupies the portal ports, so the link uses :50070.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent bdacbd8 commit 4a1d60c

2 files changed

Lines changed: 27 additions & 4 deletions

docs/design/distributed-dflash-ftheta-data-plane.md

Lines changed: 24 additions & 1 deletion
@@ -127,7 +127,30 @@ measured VM↔H200) + ExtendContext aux (~0.25 MB)** — i.e. moving the propose
 the GPU is projected to cut `DraftBlock` from ~232 ms to well under network RTT.
 The one-time `Restore` (11.5 MB) + `SeedContext` (1.9 MB) amortize over the turn.
 
-### Remaining for the LIVE Mac↔GPU number
+## Live cross-host RTT (landed)
+
+Deployed the torch engine on an H200: `inference_engine/v04/dflash_distributed_engine
+.TorchRestorationDraftEngine` (torch gemma-4-26B-A4B-it for the embed + DFlash +
+f_θ) served by `scripts/research/k3_dflash_proposer_server.py`; a verifier host
+connects with `RemoteDFlashProposer`. The MLX verifier adapter filters restored
+layers to the verifier's KV-source layers (gemma-4 cross-layer sharing).
+
+Measured VM→H200 over an SSH `-L` tunnel (real GPU compute; true data-plane payloads):
+
+| RPC | p50 | payload | note |
+|---|---|---|---|
+| Restore | 2310 ms | 11.47 MB | one-time; f_θ-projected sliding-layer K/V (25 layers) |
+| SeedContext | 947 ms | 1.89 MB | one-time; prompt aux |
+| **DraftBlock** | **108 ms** | O(1) | H200 DFlash forward + net RTT — **vs 232 ms on the Mac CPU (loopback)**: the GPU offload cuts draft compute |
+| ExtendContext | 140 ms | 0.27 MB/block | committed aux — bandwidth-dominated cross-host |
+
+Per-block (draft+extend) p50 ≈ **248 ms** over the SSH tunnel. Caveats: the SSH
+single-stream inflates transfer-bound RPCs vs a direct gRPC link; VM↔H200 base RTT
+≈ 52 ms; byte-identical correctness is proven on the Mac loopback (same engine code).
+The Mac↔H200 byte-identical run uses the same path via `mlx-distributed-dflash-e2e-
+crosshost` with `ssh -p 43350 root@107.206.71.138 -L 50070:localhost:50070` active.
+
+### (historical) Remaining for the LIVE Mac↔GPU number
 The GPU (CUDA) cannot run MLX, so the GPU-side engine needs a **torch embedding**
 source for `embed_fn`/`lm_head_fn` (gemma-4 tied embed). Two options:
 1. one-time ship of the verifier embedding weights Mac→GPU at session setup
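
The per-block figure in this hunk is the sum of the DraftBlock and ExtendContext p50s, and the one-time Restore and SeedContext costs are said to amortize over the turn. A minimal sketch of that arithmetic follows. The p50 values are copied from the table; the function names and the `num_blocks` parameter are hypothetical, and summing per-RPC medians only approximates the median of the combined latency.

```python
# p50 latencies in milliseconds, copied from the table in the diff above.
P50_MS = {
    "Restore": 2310.0,
    "SeedContext": 947.0,
    "DraftBlock": 108.0,
    "ExtendContext": 140.0,
}
ONE_TIME_RPCS = ("Restore", "SeedContext")        # paid once per turn
PER_BLOCK_RPCS = ("DraftBlock", "ExtendContext")  # paid once per block


def per_block_ms(p50=P50_MS):
    """Approximate per-block latency as the sum of the per-block p50s."""
    return sum(p50[rpc] for rpc in PER_BLOCK_RPCS)


def amortized_ms_per_block(num_blocks, p50=P50_MS):
    """Per-block latency with the one-time setup spread over num_blocks.

    num_blocks is a hypothetical parameter: the number of draft blocks
    generated in one turn.
    """
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    one_time = sum(p50[rpc] for rpc in ONE_TIME_RPCS)
    return per_block_ms(p50) + one_time / num_blocks


if __name__ == "__main__":
    print(f"per-block: {per_block_ms():.0f} ms")
    for n in (1, 10, 100):
        print(f"{n:>3} blocks: {amortized_ms_per_block(n):.1f} ms/block")
```

With these inputs the setup cost dominates short turns and becomes a small fraction of the per-block cost once a turn spans on the order of a hundred blocks.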

inference_engine/bridge/manifest.py

Lines changed: 3 additions & 3 deletions
@@ -158,16 +158,16 @@ def _harness_preset(
 name="mlx-distributed-dflash-e2e-crosshost",
 description="TRUE cross-host: gemma-4 mlx-4bit verifier on THIS Mac ↔ a "
 "remote torch DFlash+f_θ DFlashProposerService on the H200, "
-"reached at localhost:6006 via an SSH -L tunnel "
-"(ssh -p 43350 root@107.206.71.138 -L 6006:localhost:6006). "
+"reached at localhost:50070 via an SSH -L tunnel "
+"(ssh -p 43350 root@107.206.71.138 -L 50070:localhost:50070). "
 "Runs greedy (block=1) + distributed (block=N) over the wire "
 "and asserts byte-identical, reporting real cross-host RTT.",
 command_templates=(
 (
 "python3", "scripts/research/k3_distributed_dflash_e2e_mac.py",
 "--verifier-path", "${ENV:KAKEYA_MAC_VERIFIER_PATH}",
 "--drafter-id", "${ENV:KAKEYA_MAC_DRAFTER_ID}",
-"--remote-addr", "localhost:6006",
+"--remote-addr", "localhost:50070",
 "--max-new-tokens", "{max_new_tokens}",
 "--block-size", "{block_size}",
 ),
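
The preset only reaches the proposer when the local port of the SSH forward equals the port passed to `--remote-addr`. A minimal sketch of a pre-flight check for that follows, assuming the forward spec uses OpenSSH's `[bind_address:]port:host:hostport` form without bracketed IPv6 addresses. The helper names are hypothetical and are not part of `manifest.py`.

```python
def local_port_of_forward(spec: str) -> int:
    """Return the local listening port of an `ssh -L` forward spec.

    Accepts `port:host:hostport` or `bind_address:port:host:hostport`.
    Bracketed IPv6 addresses are not handled in this sketch.
    """
    parts = spec.split(":")
    if len(parts) == 3:
        return int(parts[0])
    if len(parts) == 4:
        return int(parts[1])
    raise ValueError(f"unrecognized -L spec: {spec!r}")


def port_of_addr(addr: str) -> int:
    """Return the port of a `host:port` address such as `localhost:50070`."""
    host, _, port = addr.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {addr!r}")
    return int(port)


def tunnel_matches(spec: str, remote_addr: str) -> bool:
    """True when the forward's local port is the one the client dials."""
    return local_port_of_forward(spec) == port_of_addr(remote_addr)


if __name__ == "__main__":
    print(tunnel_matches("50070:localhost:50070", "localhost:50070"))
    print(tunnel_matches("6006:localhost:50070", "localhost:50070"))
```

A forward that listens on a different local port, such as `6006:localhost:50070`, fails the check, because the client would dial a port on which nothing is listening.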
