docs+bridge: record live cross-host RTT (VM->H200) + sync crosshost preset to :50070

cursoragent · FluffyAIcode · cursoragent · commit 4a1d60c4aa54 · 2026-06-19T14:08:06.000Z
Deployed TorchRestorationDraftEngine on an H200 and measured the real DFlash+f_θ
data plane cross-host: DraftBlock p50 108 ms (vs 232 ms on Mac-CPU loopback; GPU
offload cuts draft compute), ExtendContext 140 ms / 0.27 MB per block, per-block
~248 ms over an SSH tunnel. Caddy occupies the portal ports, so the link uses :50070.
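
To put the one-time and per-block costs side by side, here is a minimal sketch of the latency budget using the p50 values recorded in this commit. It assumes the per-RPC p50s can simply be added and that a turn is a fixed number of blocks; both are simplifications. The function and constant names are hypothetical and do not exist in the repository.

```python
# Measured p50 latencies (ms), VM->H200 over the SSH tunnel, from this commit.
RESTORE_MS = 2310.0         # one-time per turn
SEED_CONTEXT_MS = 947.0     # one-time per turn
DRAFT_BLOCK_MS = 108.0      # per block
EXTEND_CONTEXT_MS = 140.0   # per block


def per_block_ms() -> float:
    """Steady-state cost of one block: draft plus extend (assumes p50s add)."""
    return DRAFT_BLOCK_MS + EXTEND_CONTEXT_MS


def amortized_ms_per_block(num_blocks: int) -> float:
    """Per-block cost with the one-time RPCs spread over num_blocks blocks."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    one_time_ms = RESTORE_MS + SEED_CONTEXT_MS
    return per_block_ms() + one_time_ms / num_blocks


if __name__ == "__main__":
    # Hypothetical turn lengths, chosen only to illustrate the amortization.
    for n in (1, 10, 100):
        print(f"{n:>4} blocks: {amortized_ms_per_block(n):.1f} ms/block")
```

Under these assumptions the one-time Restore and SeedContext cost shrinks quickly as the turn gets longer, which is the sense in which the design doc says they amortize over the turn.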

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
diff --git a/docs/design/distributed-dflash-ftheta-data-plane.md b/docs/design/distributed-dflash-ftheta-data-plane.md
@@ -127,7 +127,30 @@ measured VM↔H200) + ExtendContext aux (~0.25 MB)** — i.e. moving the propose
 the GPU is projected to cut `DraftBlock` from ~232 ms to well under network RTT.
 The one-time `Restore` (11.5 MB) + `SeedContext` (1.9 MB) amortize over the turn.
 
-### Remaining for the LIVE Mac↔GPU number
+## Live cross-host RTT (landed)
+
+Deployed the torch engine on an H200:
+`inference_engine/v04/dflash_distributed_engine.TorchRestorationDraftEngine`
+(torch gemma-4-26B-A4B-it for the embed + DFlash + f_θ), served by
+`scripts/research/k3_dflash_proposer_server.py`; a verifier host connects with `RemoteDFlashProposer`.
+The MLX verifier adapter filters restored layers to the verifier's KV-source layers (gemma-4 cross-layer sharing).
+
+Measured VM→H200 over an SSH `-L` tunnel (real GPU compute; true data-plane payloads):
+
+| RPC | p50 | payload | note |
+|---|---|---|---|
+| Restore | 2310 ms | 11.47 MB | one-time; f_θ-projected sliding-layer K/V (25 layers) |
+| SeedContext | 947 ms | 1.89 MB | one-time; prompt aux |
+| **DraftBlock** | **108 ms** | O(1) | H200 DFlash forward + net RTT — **vs 232 ms on the Mac CPU (loopback)**: the GPU offload cuts draft compute |
+| ExtendContext | 140 ms | 0.27 MB/block | committed aux — bandwidth-dominated cross-host |
+
+Per-block (draft + extend) p50 ≈ **248 ms** over the SSH tunnel (DraftBlock 108 ms +
+ExtendContext 140 ms). Caveats: the single SSH stream inflates transfer-bound RPCs
+relative to a direct gRPC link; VM↔H200 base RTT ≈ 52 ms; byte-identical correctness
+is proven on the Mac loopback (same engine code). The Mac↔H200 byte-identical run uses the same path via
+`mlx-distributed-dflash-e2e-crosshost` with `ssh -p 43350 root@107.206.71.138 -L 50070:localhost:50070` active.
+
+### (historical) Remaining for the LIVE Mac↔GPU number
 The GPU (CUDA) cannot run MLX, so the GPU-side engine needs a **torch embedding**
 source for `embed_fn`/`lm_head_fn` (gemma-4 tied embed). Two options:
 1. one-time ship of the verifier embedding weights Mac→GPU at session setup
diff --git a/inference_engine/bridge/manifest.py b/inference_engine/bridge/manifest.py
@@ -158,16 +158,16 @@ def _harness_preset(
             name="mlx-distributed-dflash-e2e-crosshost",
             description="TRUE cross-host: gemma-4 mlx-4bit verifier on THIS Mac ↔ a "
                         "remote torch DFlash+f_θ DFlashProposerService on the H200, "
-                        "reached at localhost:6006 via an SSH -L tunnel "
-                        "(ssh -p 43350 root@107.206.71.138 -L 6006:localhost:6006). "
+                        "reached at localhost:50070 via an SSH -L tunnel "
+                        "(ssh -p 43350 root@107.206.71.138 -L 50070:localhost:50070). "
                         "Runs greedy (block=1) + distributed (block=N) over the wire "
                         "and asserts byte-identical, reporting real cross-host RTT.",
             command_templates=(
                 (
                     "python3", "scripts/research/k3_distributed_dflash_e2e_mac.py",
                     "--verifier-path", "${ENV:KAKEYA_MAC_VERIFIER_PATH}",
                     "--drafter-id", "${ENV:KAKEYA_MAC_DRAFTER_ID}",
-                    "--remote-addr", "localhost:6006",
+                    "--remote-addr", "localhost:50070",
                     "--max-new-tokens", "{max_new_tokens}",
                     "--block-size", "{block_size}",
                 ),