|
| 1 | +# ADR 0009 — Multi-host milestone: AR-verifier / dLM-proposer on `mlx.distributed`, and the agent capability exchange plane |
| 2 | + |
| 3 | +- **Status**: Accepted |
| 4 | +- **Date**: 2026-06-10 |
| 5 | +- **Relates to**: ADR 0001 (proposer sizing), ADR 0006 (local agent |
| 6 | + infrastructure positioning), ADR 0008 (session-bound runtime + gRPC), |
| 7 | + ADR 0008 §11 (K-series dLM K/V restoration) |
| 8 | +- **Companion design doc**: |
| 9 | + [`docs/design/agent-capability-exchange-platform.md`](../design/agent-capability-exchange-platform.md) |
| 10 | + |
| 11 | +## 1. Context |
| 12 | + |
| 13 | +Kakeya is positioned (ADR 0006) as **local agent infrastructure for |
| 14 | +Mac**. Through v0.3 every deployment is a single host: one runtime |
| 15 | +process, one verifier, sessions bound to one machine. Two pressures |
| 16 | +push past one host: |
| 17 | + |
| 18 | +1. **The AR-verifier / dLM-proposer split is naturally asymmetric.** |
| 19 | + The proposer band is fixed at 0.25–1 B params (ADR 0001) while the |
| 20 | + verifier scales with quality targets (Qwen3-1.7B today, Gemma-4-26B |
| 21 | + in the K3 track). On a 16–24 GB Mac mini the verifier wants the |
| 22 | + whole unified-memory budget; evicting the proposer to a second |
| 23 | + Mac mini frees roughly `proposer_weight_bytes + proposer activation |
| 24 | + peak` on the verifier host and lets the proposer run its K diffusion |
| 25 | + steps concurrently with other work. |
| 26 | +2. **Agent fleets are appearing on real desks.** Multiple Mac minis on |
| 27 | + one Thunderbolt/10GbE segment, each running Kakeya with different |
| 28 | + models warmed, different quantizations, and different roles. Today |
| 29 | + they cannot discover one another or trade work. |
| 30 | + |
| 31 | +Meanwhile Apple's MLX has grown a real multi-host story, |
| 32 | +**`mlx.distributed`**, which we evaluated for this milestone: |
| 33 | + |
| 34 | +- **Backends**: `ring` (TCP/IP, the default; works over Ethernet or |
| 35 | + Thunderbolt bridge at ~10–40 Gb/s) and `jaccl` (RDMA over |
| 36 | + Thunderbolt 5, ~80 Gb/s, macOS 26.2+, TB5-only, full-mesh). MPI is |
| 37 | + also supported where installed. |
| 38 | +- **Programming model**: SPMD. `mlx.launch --hostfile …` starts the |
| 39 | + *same* program on every host; ranks coordinate through collectives |
| 40 | + (`all_sum`, `all_gather`, `send`/`recv`) on a static `Group` that is |
| 41 | + fixed at process start. |
| 42 | +- **What mlx-lm builds on it**: tensor parallelism via |
| 43 | + `model.shard(group)` and pipeline parallelism for architectures with |
| 44 | + `PipelineMixin` — both for a *single* model too big for one host. |
| 45 | +- **Known sharp edges** (June 2026): Metal's ~5 s command-buffer |
| 46 | + timeout fires in distributed settings unless communication is issued |
| 47 | + on `stream=mx.cpu`; long prefills need chunking that `mlx.distributed` |
| 48 | + does not yet do; `jaccl` requires disabling Thunderbolt Bridge and a |
| 49 | + recovery-OS `rdma_ctl enable`; node membership is static — a dead |
| 50 | + rank kills the job. |
| 51 | + |
| 52 | +## 2. Question 1 — should the AR-verifier / dLM-proposer pair run *on* |
| 53 | +`mlx.distributed`? |
| 54 | + |
| 55 | +We decompose spec decode traffic into its three flows and evaluate each |
| 56 | +against `mlx.distributed`'s strengths: |
| 57 | + |
| 58 | +| Flow | Payload per block (L=16) | Frequency | Latency sensitivity | |
| 59 | +| --- | --- | --- | --- | |
| 60 | +| F1: committed prefix → proposer | ≤ a few hundred `uint32` ids (~1 KB) | once per block | low — hidden by proposer compute (tens of ms for K diffusion steps) | |
| 61 | +| F2: draft block → verifier | L `uint32` ids (64 B) | once per block | low | |
| 62 | +| F3: aux hidden states → drafter (K3 DFlash only) | `L_ctx × hidden × n_aux_layers` bf16 — **MBs per block** | once per block | **high** — on the critical path before drafting | |
| 63 | + |
| 64 | +### 2.1 Strengths of `mlx.distributed` for this pair |
| 65 | + |
| 66 | +- **Bandwidth where it matters (F3).** DFlash-style drafters (ADR 0008 |
| 67 | + §11, K3) condition on verifier hidden states. At Gemma-4-26B scale |
| 68 | + that is megabytes per block; ring-over-Thunderbolt or `jaccl` RDMA |
| 69 | + moves that 10–50× faster than a gRPC/protobuf hop, and MLX arrays |
| 70 | + cross without serialization into Python objects. |
| 71 | +- **Unified memory + native arrays.** No host↔device staging on either |
| 72 | + end; an `mx.array` produced by the verifier's forward is directly |
| 73 | + `send()`-able. |
| 74 | +- **Verifier sharding is free riding.** If the verifier itself outgrows |
| 75 | + one Mac mini (Qwen3-32B, Gemma-4-26B bf16), `model.shard(group)` |
| 76 | + tensor-parallelism inside a *verifier sub-group* is the only |
| 77 | + practical option — and it composes with this design (§4). |
| 78 | + |
| 79 | +### 2.2 Weaknesses for this pair |
| 80 | + |
| 81 | +- **SPMD vs. asymmetric roles.** Proposer and verifier are *different |
| 82 | + programs* with different weights, lifecycles, and failure domains. |
| 83 | + Expressing them as ranks of one SPMD job means rank-branching |
| 84 | + (`if rank == 0: verifier_loop() else: proposer_loop()`), one shared |
| 85 | + fate (any rank dying kills generation for every session on the |
| 86 | + fleet), and lock-step launch via `mlx.launch` + static hostfile. |
| 87 | +- **No dynamic membership.** Agent fleets churn: a Mac mini sleeps, |
| 88 | + reboots, gets a new model warmed. `mlx.distributed` groups are fixed |
| 89 | + at init; there is no join/leave, no health-check, no re-balance. |
| 90 | +- **F1/F2 gain nothing.** Token-id flows are < 1 KB per block. A LAN |
| 91 | + gRPC round trip is ~0.3–1 ms; one proposer block is tens of ms of |
| 92 | + compute and one verifier block forward is similar. Collectives would |
| 93 | + shave microseconds off a millisecond-scale, compute-dominated loop. |
| 94 | +- **Operational constraints.** macOS 26.2 + TB5-only for `jaccl`; |
| 95 | + Metal timeout workarounds; no auth story on ring sockets (Kakeya's |
| 96 | + gRPC plane already has an auth path from the HTTP-shim era). |
| 97 | +- **Cross-ecosystem reach.** The K3 drafter currently runs in PyTorch |
| 98 | + (MPS) while the verifier runs in MLX — `mlx.distributed` cannot carry |
| 99 | + a PyTorch process; an RPC plane can. |
| 100 | + |
| 101 | +### 2.3 Verdict |
| 102 | + |
| 103 | +`mlx.distributed` is the right **data plane for bulk tensors** |
| 104 | +(F3 hidden-state shipping, intra-verifier tensor parallelism) and the |
| 105 | +wrong **control plane** (membership, placement, session routing, |
| 106 | +failure isolation, F1/F2). Neither a pure-`mlx.distributed` design nor |
| 107 | +a pure-gRPC design wins on all flows. |
| 108 | + |
| 109 | +## 3. Question 2 — what does capability exchange between Mac minis need? |
| 110 | + |
| 111 | +Requirements distilled from ADR 0006's agent framing: |
| 112 | + |
| 113 | +- **R1 Discovery**: a node can learn which peers exist, which models |
| 114 | + (verifier/proposer roles, quantization) they have warmed, and how |
| 115 | + much unified memory each has — without a central registry. |
| 116 | +- **R2 Liveness**: stale nodes age out (TTL); re-announcing refreshes. |
| 117 | +- **R3 Placement**: given "I need verifier X + proposer Y", pick hosts |
| 118 | + deterministically from the exchanged capability set. |
| 119 | +- **R4 Work exchange**: actually call the chosen peer (first concrete |
| 120 | + capability: remote `ProposeBlock`). |
| 121 | +- **R5 Heterogeneity**: Mac M4 + Linux x86 CPU nodes coexist (our CI |
| 122 | + and dev reality); MLX-only mechanisms exclude half the fleet. |
| 123 | + |
| 124 | +`mlx.distributed` satisfies none of R1–R3 and R5 by construction (SPMD, |
| 125 | +static, Apple-only). gRPC + protobuf — already Kakeya's wire contract |
| 126 | +per ADR 0008 — satisfies all five, with typed errors, deadlines, and |
| 127 | +language-neutral stubs (the TS SDK can render fleet dashboards from the |
| 128 | +same proto). |
| 129 | + |
| 130 | +## 4. Decision |
| 131 | + |
| 132 | +**Hybrid, with gRPC as the control plane and `mlx.distributed` as an |
| 133 | +optional data plane.** Concretely, this milestone (v0.5-M1) ships: |
| 134 | + |
| 135 | +1. **`kakeya.v1.CapabilityService`** (new proto, |
| 136 | + `proto/kakeya/v1/distributed.proto`): symmetric gossip-style |
| 137 | + `ExchangeCapabilities` — caller pushes its view of the fleet, callee |
| 138 | + merges (last-writer-wins on `announced_at_unix`) and returns its |
| 139 | + merged view. TTL-based expiry. No coordinator, no consensus; the |
| 140 | + registry is a CRDT-ish converging map keyed by `node_id`. |
| 141 | +2. **`kakeya.v1.ProposerService`**: `ProposeBlock(committed, L, K) → |
| 142 | + tokens` — the dLM-proposer contract (`DLMProposer.propose_block`, |
| 143 | + ADR 0001) lifted onto the wire, so any node can serve proposals for |
| 144 | + any node's verifier. Token-ids-only on purpose (F1/F2 analysis, |
| 145 | + §2.2): payloads are tiny and the contract is runtime-agnostic |
| 146 | + (PyTorch dLM, MLX dLM, model-free n-gram all serve it). |
| 147 | +3. **`DistributedSpeculativeDecoder`**: the v0.2 greedy spec-decode |
| 148 | + loop (`kv_cache_proposer.speculative`) driven by a `RemoteProposer` |
| 149 | + gRPC client instead of an in-process dLM. Bit-equivalence to local |
| 150 | + greedy AR decoding is preserved — the accept rule never changes, |
| 151 | + only where the draft comes from. A draft that arrives late or wrong |
| 152 | + costs throughput, never correctness. |
| 153 | +4. **`mlx.distributed` ring adapter** (`inference_engine/distributed/ |
| 154 | + mlx_ring.py`): environment probe + group bootstrap mirroring the |
| 155 | + `backends/mlx/env.py` no-fallback pattern. Nodes advertise their |
| 156 | + ring endpoint in their capability card (`ring_address`); when two |
| 157 | + placed roles both have one, bulk-tensor flows (F3, future K3 |
| 158 | + integration) can be promoted from gRPC to the ring. Linux nodes |
| 159 | + simply advertise no ring endpoint. |
| 160 | + |
| 161 | +### What we explicitly rejected |
| 162 | + |
| 163 | +- **All-in on `mlx.distributed` (SPMD ranks for proposer/verifier).** |
| 164 | + Rejected for shared fate, static membership, Apple-only fleets, and |
| 165 | + zero benefit on F1/F2 (§2.2). Re-open only if proposer↔verifier |
| 166 | + traffic becomes tensor-dominated *and* fleets are TB5-homogeneous. |
| 167 | +- **Central fleet coordinator / etcd-style registry.** Rejected: |
| 168 | + a desk of 2–5 Mac minis does not need consensus infrastructure, and |
| 169 | + a coordinator is one more failure domain. Gossip pairs converge in |
| 170 | + one exchange round per link. |
| 171 | +- **mDNS/Bonjour auto-discovery in this milestone.** Deferred, not |
| 172 | + rejected: seed peers are static CLI flags today (`--peer`). The |
| 173 | + registry merge logic is discovery-mechanism-agnostic, so Bonjour can |
| 174 | + later feed the same `merge()`. |
| 175 | +- **Carrying logits/hidden states in `ProposeBlockResponse`.** |
| 176 | + Rejected for v0.5-M1: greedy acceptance needs only token ids; |
| 177 | + distribution-level (lossless sampling) acceptance would need draft |
| 178 | + probabilities and is deferred until sampling lands in the session |
| 179 | + path (ADR 0008 OQ-4). |
| 180 | + |
| 181 | +## 5. Consequences |
| 182 | + |
| 183 | +- A second `.proto` module joins the wire contract; `buf` lint/breaking |
| 184 | + gates and the stub-drift CI job extend to it. The capability schema |
| 185 | + is marked **Unstable** until v0.5 GA (same policy as runtime.proto |
| 186 | + pre-v0.3). |
| 187 | +- The Linux CI gate gains a fully verifier-independent surface: |
| 188 | + capability registry, merge/TTL semantics, placement planning, the |
| 189 | + gRPC exchange/proposer services (exercised with the model-free |
| 190 | + n-gram proposer — a *real* prompt-lookup implementation, not a test |
| 191 | + double), and the greedy acceptance rule as a pure function. |
| 192 | +- The Mac M4 integration gate gains a two-node-on-one-host spec-decode |
| 193 | + equivalence test: remote proposer over loopback gRPC, real verifier, |
| 194 | + output must be byte-identical to local greedy decode. |
| 195 | +- Spec decode remains **outside** the gRPC `Generate` session path |
| 196 | + (that wiring is the separate v0.4 proposer-back-in milestone, ADR |
| 197 | + 0008). This milestone deliberately lands the distributed machinery at |
| 198 | + the decoder layer where v0.2 spec decode lives, so the two tracks |
| 199 | + compose instead of colliding. |
| 200 | +- Security: the capability plane ships on insecure channels bound to |
| 201 | + LAN interfaces, same trust model as v0.3 single-host gRPC. mTLS for |
| 202 | + cross-host channels is queued for v0.5 GA. |
0 commit comments