Skip to content

Commit 83315bd

Browse files
Merge #157 + #158: multi-host distributed spec-decode + remote DFlash+f_θ proposer (F3 data plane)
#157 feat(distributed): multi-host capability exchange + distributed speculative decoding #158 feat(distributed): remote DFlash+f_θ proposer (F3 bulk-tensor data plane) + SOP skill + deploy scripts Validated: 111 distributed unit tests; real-model byte-identical E2E (in-process, loopback gRPC, live Mac<->H200 cross-host); RTT/throughput/bounded-memory report. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
2 parents da82b73 + 00530dd commit 83315bd

51 files changed

Lines changed: 9982 additions & 9 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.github/workflows/ci.yaml

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -96,16 +96,17 @@ jobs:
9696
tests/inference_engine/bench/ \
9797
tests/inference_engine/setup/ \
9898
tests/inference_engine/bridge/ \
99+
tests/inference_engine/distributed/ \
99100
tests/sdk/python/ \
100101
tests/training/repr_align/ \
101102
tests/backends/mlx/test_env.py \
102103
--junitxml=junit.xml \
103104
-v
104105
coverage report \
105-
--include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/bridge/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/pipeline/*,inference_engine/session/store.py,inference_engine/setup/*,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*' \
106+
--include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/bridge/*,inference_engine/distributed/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/pipeline/*,inference_engine/session/store.py,inference_engine/setup/*,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*' \
106107
--fail-under=100
107108
coverage xml -o coverage.xml \
108-
--include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/bridge/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/pipeline/*,inference_engine/session/store.py,inference_engine/setup/*,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*'
109+
--include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/bridge/*,inference_engine/distributed/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/pipeline/*,inference_engine/session/store.py,inference_engine/setup/*,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*'
109110
110111
- name: Upload coverage artifact
111112
if: always()
@@ -169,6 +170,16 @@ jobs:
169170
import kakeya.errors; \
170171
import inference_engine.bridge; \
171172
import inference_engine.bridge.manifest; \
173+
import inference_engine.distributed; \
174+
import inference_engine.distributed.capability; \
175+
import inference_engine.distributed.placement; \
176+
import inference_engine.distributed.exchange; \
177+
import inference_engine.distributed.ngram; \
178+
import inference_engine.distributed.proposer_service; \
179+
import inference_engine.distributed.spec_decode; \
180+
import inference_engine.distributed.mlx_ring; \
181+
import inference_engine.server.proto_gen.kakeya.v1.distributed_pb2; \
182+
import inference_engine.server.proto_gen.kakeya.v1.distributed_pb2_grpc; \
172183
import inference_engine.proposer; \
173184
import inference_engine.proposer.sparse_logits; \
174185
import inference_engine.backends.mlx.env; \

.github/workflows/integration.yaml

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -85,7 +85,11 @@ jobs:
8585
# The runner is expected to have a long-lived venv.
8686
# If a per-run venv is preferred, swap to ``python3 -m venv .venv``.
8787
python3 -m pip install --upgrade pip
88-
python3 -m pip install -e .
88+
# The repo runs via PYTHONPATH (see ci.yaml) — it is NOT a pip package
89+
# (no setup.py/pyproject.toml), so install runtime deps from
90+
# requirements.txt rather than an editable `-e .` (which errors with
91+
# "does not appear to be a Python project").
92+
python3 -m pip install -r requirements.txt
8993
python3 -m pip install pytest pytest-asyncio pytest-timeout coverage
9094
9195
- name: Run integration suite

README.md

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -525,6 +525,35 @@ with client.create_session() as session:
525525

526526
That's where the 4400× latency-drift improvement comes from.
527527

528+
## Multi-host: capability exchange + distributed spec decode (v0.5-M1)
529+
530+
Per [ADR 0009](docs/adr/0009-mlx-distributed-spec-decode-and-capability-exchange.md),
531+
Kakeya nodes on one LAN (Mac minis, plus Linux CPU hosts) can now
532+
**gossip capability cards** — which models each node has warmed, in
533+
which role (verifier / proposer), with how much unified memory — and
534+
**trade work**: an AR verifier on one node drives speculative decoding
535+
with draft blocks served by a proposer on another node. The greedy
536+
accept rule runs locally and is unchanged, so remote drafts can change
537+
throughput but never tokens.
538+
539+
```bash
540+
# Node B (proposer host)
541+
PYTHONPATH=. python3 scripts/demo_distributed_spec_decode.py \
542+
--role proposer-node --bind 0.0.0.0:50061 --node-id node-b
543+
544+
# Node A (verifier host) — discovers B, plans placement, decodes
545+
PYTHONPATH=. python3 scripts/demo_distributed_spec_decode.py \
546+
--role verifier-node --bind 0.0.0.0:50060 --node-id node-a \
547+
--peer <node-b-ip>:50061 --verifier-id Qwen/Qwen3-0.6B
548+
```
549+
550+
The production runtime joins a fleet with the same flags on
551+
`scripts/start_grpc_runtime_server.py` (`--node-id`, `--peer`,
552+
`--serve-ngram-proposer`). `mlx.distributed` rings are advertised on
553+
capability cards (`ring_address`) as the bulk-tensor data plane for
554+
the K3 hidden-state flows; the control plane is pure gRPC and needs no
555+
MLX. Design details: [agent capability exchange platform](docs/design/agent-capability-exchange-platform.md).
556+
528557
## Deprecated HTTP shim
529558

530559
The OpenAI-compatible HTTP API at `/v1/chat/completions` still works for
@@ -670,6 +699,8 @@ scripts/
670699
| **v0.4 for Mac (`v0.4-mac`)** | ✅ shipped | MLX restored Gemma-4 26B engine: bounded KV (~90% saved), recall 1.0, ≈AR-parity spec-decode. Multi-tenant is **serial-only** (no batched `B>1` decode — upstream MLX kernel bug, [ADR 0014](docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md)) |
671700
| **v0.4 for CUDA (`v0.4-cuda`)** | ✅ shipped | Restored Gemma-4 26B engine on NVIDIA: fused DFlash spec-decode **1.79–2.06× AR**, 44–87× KV saving, recall 1.0 |
672701
| **v0.4 multi-tenant (PR-A3c)** | ✅ shipped | Per-session binding (isolated KV, shared weights) on both platforms. **CUDA**: batched scheduler → **8.45× served throughput**, per-session recall 1.0. **MLX (Mac): serial-only** (sessions served one at a time; batched parallel decode unsupported upstream) |
702+
| **v0.5-M1 multi-host milestone** | ✅ landed | [ADR 0009](docs/adr/0009-mlx-distributed-spec-decode-and-capability-exchange.md): agent **capability exchange** between Mac mini hosts (gossip `CapabilityService`, TTL registry, deterministic placement) + **distributed speculative decoding** (remote `ProposerService` drafts, local greedy verification, byte-identical output) + optional `mlx.distributed` ring probe for bulk-tensor flows. See [design doc](docs/design/agent-capability-exchange-platform.md) and `scripts/demo_distributed_spec_decode.py` |
703+
| v0.5 GA multi-host hardening | queued | mTLS node identity, Bonjour seed discovery, K3 DFlash hidden-state flow over the mlx.distributed ring |
673704
| Async continuous batching | designing | Dynamic mid-flight arrival + ragged-length cohorts under the async gRPC `Generate` handlers (current batcher is fixed-cohort) |
674705
| Deployment polish | queued | PyPI + npm publishing, GHCR Docker image, `kakeya prewarm` CLI, `kakeya chat` REPL |
675706
| Cross-request KV reuse | designing | Sessions survive across requests on gRPC; turns intra-session drift into 0 ms inter-request drift |
Lines changed: 202 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,202 @@
1+
# ADR 0009 — Multi-host milestone: AR-verifier / dLM-proposer on `mlx.distributed`, and the agent capability exchange plane
2+
3+
- **Status**: Accepted
4+
- **Date**: 2026-06-10
5+
- **Relates to**: ADR 0001 (proposer sizing), ADR 0006 (local agent
6+
infrastructure positioning), ADR 0008 (session-bound runtime + gRPC),
7+
ADR 0008 §11 (K-series dLM K/V restoration)
8+
- **Companion design doc**:
9+
[`docs/design/agent-capability-exchange-platform.md`](../design/agent-capability-exchange-platform.md)
10+
11+
## 1. Context
12+
13+
Kakeya is positioned (ADR 0006) as **local agent infrastructure for
14+
Mac**. Through v0.3 every deployment is a single host: one runtime
15+
process, one verifier, sessions bound to one machine. Two pressures
16+
push past one host:
17+
18+
1. **The AR-verifier / dLM-proposer split is naturally asymmetric.**
19+
The proposer band is fixed at 0.25–1 B params (ADR 0001) while the
20+
verifier scales with quality targets (Qwen3-1.7B today, Gemma-4-26B
21+
in the K3 track). On a 16–24 GB Mac mini the verifier wants the
22+
whole unified-memory budget; evicting the proposer to a second
23+
Mac mini frees roughly `proposer_weight_bytes + proposer activation
24+
peak` on the verifier host and lets the proposer run its K diffusion
25+
steps concurrently with other work.
26+
2. **Agent fleets are appearing on real desks.** Multiple Mac minis on
27+
one Thunderbolt/10GbE segment, each running Kakeya with different
28+
models warmed, different quantizations, and different roles. Today
29+
they cannot discover one another or trade work.
30+
31+
Meanwhile Apple's MLX has grown a real multi-host story,
32+
**`mlx.distributed`**, which we evaluated for this milestone:
33+
34+
- **Backends**: `ring` (TCP/IP, the default; works over Ethernet or
35+
Thunderbolt bridge at ~10–40 Gb/s) and `jaccl` (RDMA over
36+
Thunderbolt 5, ~80 Gb/s, macOS 26.2+, TB5-only, full-mesh). MPI is
37+
also supported where installed.
38+
- **Programming model**: SPMD. `mlx.launch --hostfile …` starts the
39+
*same* program on every host; ranks coordinate through collectives
40+
(`all_sum`, `all_gather`, `send`/`recv`) on a static `Group` that is
41+
fixed at process start.
42+
- **What mlx-lm builds on it**: tensor parallelism via
43+
`model.shard(group)` and pipeline parallelism for architectures with
44+
`PipelineMixin` — both for a *single* model too big for one host.
45+
- **Known sharp edges** (June 2026): Metal's ~5 s command-buffer
46+
timeout fires in distributed settings unless communication is issued
47+
on `stream=mx.cpu`; long prefills need chunking that `mlx.distributed`
48+
does not yet do; `jaccl` requires disabling Thunderbolt Bridge and a
49+
recovery-OS `rdma_ctl enable`; node membership is static — a dead
50+
rank kills the job.
51+
52+
## 2. Question 1 — should the AR-verifier / dLM-proposer pair run *on*
53+
`mlx.distributed`?
54+
55+
We decompose spec decode traffic into its three flows and evaluate each
56+
against `mlx.distributed`'s strengths:
57+
58+
| Flow | Payload per block (L=16) | Frequency | Latency sensitivity |
59+
| --- | --- | --- | --- |
60+
| F1: committed prefix → proposer | ≤ a few hundred `uint32` ids (~1 KB) | once per block | low — hidden by proposer compute (tens of ms for K diffusion steps) |
61+
| F2: draft block → verifier | L `uint32` ids (64 B) | once per block | low |
62+
| F3: aux hidden states → drafter (K3 DFlash only) | `L_ctx × hidden × n_aux_layers` bf16 — **MBs per block** | once per block | **high** — on the critical path before drafting |
63+
64+
### 2.1 Strengths of `mlx.distributed` for this pair
65+
66+
- **Bandwidth where it matters (F3).** DFlash-style drafters (ADR 0008
67+
§11, K3) condition on verifier hidden states. At Gemma-4-26B scale
68+
that is megabytes per block; ring-over-Thunderbolt or `jaccl` RDMA
69+
moves that 10–50× faster than a gRPC/protobuf hop, and MLX arrays
70+
cross without serialization into Python objects.
71+
- **Unified memory + native arrays.** No host↔device staging on either
72+
end; an `mx.array` produced by the verifier's forward is directly
73+
`send()`-able.
74+
- **Verifier sharding is free riding.** If the verifier itself outgrows
75+
one Mac mini (Qwen3-32B, Gemma-4-26B bf16), `model.shard(group)`
76+
tensor-parallelism inside a *verifier sub-group* is the only
77+
practical option — and it composes with this design (§4).
78+
79+
### 2.2 Weaknesses for this pair
80+
81+
- **SPMD vs. asymmetric roles.** Proposer and verifier are *different
82+
programs* with different weights, lifecycles, and failure domains.
83+
Expressing them as ranks of one SPMD job means rank-branching
84+
(`if rank == 0: verifier_loop() else: proposer_loop()`), one shared
85+
fate (any rank dying kills generation for every session on the
86+
fleet), and lock-step launch via `mlx.launch` + static hostfile.
87+
- **No dynamic membership.** Agent fleets churn: a Mac mini sleeps,
88+
reboots, gets a new model warmed. `mlx.distributed` groups are fixed
89+
at init; there is no join/leave, no health-check, no re-balance.
90+
- **F1/F2 gain nothing.** Token-id flows are < 1 KB per block. A LAN
91+
gRPC round trip is ~0.3–1 ms; one proposer block is tens of ms of
92+
compute and one verifier block forward is similar. Collectives would
93+
shave microseconds off a millisecond-scale, compute-dominated loop.
94+
- **Operational constraints.** macOS 26.2 + TB5-only for `jaccl`;
95+
Metal timeout workarounds; no auth story on ring sockets (Kakeya's
96+
gRPC plane already has an auth path from the HTTP-shim era).
97+
- **Cross-ecosystem reach.** The K3 drafter currently runs in PyTorch
98+
(MPS) while the verifier runs in MLX — `mlx.distributed` cannot carry
99+
a PyTorch process; an RPC plane can.
100+
101+
### 2.3 Verdict
102+
103+
`mlx.distributed` is the right **data plane for bulk tensors**
104+
(F3 hidden-state shipping, intra-verifier tensor parallelism) and the
105+
wrong **control plane** (membership, placement, session routing,
106+
failure isolation, F1/F2). Neither a pure-`mlx.distributed` design nor
107+
a pure-gRPC design wins on all flows.
108+
109+
## 3. Question 2 — what does capability exchange between Mac minis need?
110+
111+
Requirements distilled from ADR 0006's agent framing:
112+
113+
- **R1 Discovery**: a node can learn which peers exist, which models
114+
(verifier/proposer roles, quantization) they have warmed, and how
115+
much unified memory each has — without a central registry.
116+
- **R2 Liveness**: stale nodes age out (TTL); re-announcing refreshes.
117+
- **R3 Placement**: given "I need verifier X + proposer Y", pick hosts
118+
deterministically from the exchanged capability set.
119+
- **R4 Work exchange**: actually call the chosen peer (first concrete
120+
capability: remote `ProposeBlock`).
121+
- **R5 Heterogeneity**: Mac M4 + Linux x86 CPU nodes coexist (our CI
122+
and dev reality); MLX-only mechanisms exclude half the fleet.
123+
124+
`mlx.distributed` satisfies none of R1–R3 and R5 by construction (SPMD,
125+
static, Apple-only). gRPC + protobuf — already Kakeya's wire contract
126+
per ADR 0008 — satisfies all five, with typed errors, deadlines, and
127+
language-neutral stubs (the TS SDK can render fleet dashboards from the
128+
same proto).
129+
130+
## 4. Decision
131+
132+
**Hybrid, with gRPC as the control plane and `mlx.distributed` as an
133+
optional data plane.** Concretely, this milestone (v0.5-M1) ships:
134+
135+
1. **`kakeya.v1.CapabilityService`** (new proto,
136+
`proto/kakeya/v1/distributed.proto`): symmetric gossip-style
137+
`ExchangeCapabilities` — caller pushes its view of the fleet, callee
138+
merges (last-writer-wins on `announced_at_unix`) and returns its
139+
merged view. TTL-based expiry. No coordinator, no consensus; the
140+
registry is a CRDT-ish converging map keyed by `node_id`.
141+
2. **`kakeya.v1.ProposerService`**: `ProposeBlock(committed, L, K) →
142+
tokens` — the dLM-proposer contract (`DLMProposer.propose_block`,
143+
ADR 0001) lifted onto the wire, so any node can serve proposals for
144+
any node's verifier. Token-ids-only on purpose (F1/F2 analysis,
145+
§2.2): payloads are tiny and the contract is runtime-agnostic
146+
(PyTorch dLM, MLX dLM, model-free n-gram all serve it).
147+
3. **`DistributedSpeculativeDecoder`**: the v0.2 greedy spec-decode
148+
loop (`kv_cache_proposer.speculative`) driven by a `RemoteProposer`
149+
gRPC client instead of an in-process dLM. Bit-equivalence to local
150+
greedy AR decoding is preserved — the accept rule never changes,
151+
only where the draft comes from. A draft that arrives late or wrong
152+
costs throughput, never correctness.
153+
4. **`mlx.distributed` ring adapter** (`inference_engine/distributed/
154+
mlx_ring.py`): environment probe + group bootstrap mirroring the
155+
`backends/mlx/env.py` no-fallback pattern. Nodes advertise their
156+
ring endpoint in their capability card (`ring_address`); when two
157+
placed roles both have one, bulk-tensor flows (F3, future K3
158+
integration) can be promoted from gRPC to the ring. Linux nodes
159+
simply advertise no ring endpoint.
160+
161+
### What we explicitly rejected
162+
163+
- **All-in on `mlx.distributed` (SPMD ranks for proposer/verifier).**
164+
Rejected for shared fate, static membership, Apple-only fleets, and
165+
zero benefit on F1/F2 (§2.2). Re-open only if proposer↔verifier
166+
traffic becomes tensor-dominated *and* fleets are TB5-homogeneous.
167+
- **Central fleet coordinator / etcd-style registry.** Rejected:
168+
a desk of 2–5 Mac minis does not need consensus infrastructure, and
169+
a coordinator is one more failure domain. Gossip pairs converge in
170+
one exchange round per link.
171+
- **mDNS/Bonjour auto-discovery in this milestone.** Deferred, not
172+
rejected: seed peers are static CLI flags today (`--peer`). The
173+
registry merge logic is discovery-mechanism-agnostic, so Bonjour can
174+
later feed the same `merge()`.
175+
- **Carrying logits/hidden states in `ProposeBlockResponse`.**
176+
Rejected for v0.5-M1: greedy acceptance needs only token ids;
177+
distribution-level (lossless sampling) acceptance would need draft
178+
probabilities and is deferred until sampling lands in the session
179+
path (ADR 0008 OQ-4).
180+
181+
## 5. Consequences
182+
183+
- A second `.proto` module joins the wire contract; `buf` lint/breaking
184+
gates and the stub-drift CI job extend to it. The capability schema
185+
is marked **Unstable** until v0.5 GA (same policy as runtime.proto
186+
pre-v0.3).
187+
- The Linux CI gate gains a fully verifier-independent surface:
188+
capability registry, merge/TTL semantics, placement planning, the
189+
gRPC exchange/proposer services (exercised with the model-free
190+
n-gram proposer — a *real* prompt-lookup implementation, not a test
191+
double), and the greedy acceptance rule as a pure function.
192+
- The Mac M4 integration gate gains a two-node-on-one-host spec-decode
193+
equivalence test: remote proposer over loopback gRPC, real verifier,
194+
output must be byte-identical to local greedy decode.
195+
- Spec decode remains **outside** the gRPC `Generate` session path
196+
(that wiring is the separate v0.4 proposer-back-in milestone, ADR
197+
0008). This milestone deliberately lands the distributed machinery at
198+
the decoder layer where v0.2 spec decode lives, so the two tracks
199+
compose instead of colliding.
200+
- Security: the capability plane ships on insecure channels bound to
201+
LAN interfaces, same trust model as v0.3 single-host gRPC. mTLS for
202+
cross-host channels is queued for v0.5 GA.

0 commit comments

Comments
 (0)