
Commit da82b73

Merge remote-tracking branch 'origin/AgentMemory/adr-0013-distributed-topology-2815' into _adr_train
# Conflicts:
#	docs/adr/README.md

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
2 parents 0b206d4 + 5b363d4 commit da82b73

3 files changed

Lines changed: 264 additions & 0 deletions


Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
# ADR 0012 — Proposer/verifier value proposition: bounded-memory + recall (all platforms), platform-forked throughput

- **Status**: Accepted (2026-06-13)
- **Date**: 2026-06-13
- **Decision drivers**:
  - A recurring question keeps being re-opened by new contributors (human
    and agent): *"is the proposer still worth it, given Step-1 reaches 1.0× AR
    on Mac without using it?"* and *"is speculative decoding dead on Mac?"*
    This ADR settles the value map so the decision tree is not re-derived
    every time the Mac throughput number looks bad in isolation.
  - 2026-06-12 Mac ctx280 validation (`results/research/k3_mlx_fused_fair_ctx280_n5_gen32_*.json`):
    Step-1 recall 5/5 vs oracle 5/5, bounded resident KV 132.9 MB vs naive
    1308.9 MB (89.8 % saving) at 4406–5810-token prompts.
  - 2026-06-11 H200 #107 evidence: fused spec-decode 1.27× AR, recall 1.0.
  - 2026-06-13 `verify(L)` calibration sweep (`results/research/verify_l_sweep.json`,
    ctx 4096): measured kernel-dedup headroom 3.92× at L=16, ≈87 % of the
    router-measured expert-union bound (4.52×).
  - Builds on / re-affirms ADR 0001 (proposer sizing + alignment),
    ADR 0004 (alignment data policy), ADR 0006 (local-agent-infra
    positioning), ADR 0008 §11 (dLM K/V-Restoration architecture),
    ADR 0009 (capability exchange), ADR 0010 (full-attention low-precision
    KV / affine4), ADR 0011 (cross-attention coupling, falsified by R1e).
## Context

ADR 0008 §11 (K-series) changed the proposer's primary role from *drafter*
to *history reconstructor*: the dLM proposer has no KV cache and can produce
transient K/V for the **entire** history, which is used to restore the
verifier's attention at structurally-evicted positions. Speculative decoding
is the **second** product line on the same architecture, not the first.

The trap is to evaluate the architecture on a single cell of its value
table ("Mac, single host, generic chat, current un-aligned DFlash"), see a
weak throughput number, and conclude the proposer (or spec-decode) has no
value. That conclusion does not generalise. This ADR records the full value
map and prices the open options explicitly.
## Decision

The proposer/verifier value proposition is realised on **two axes**, and its
status is **platform- and workload-dependent**, not a single scalar:

### 1. The core value is "bounded memory + recall", not "fast"

Since ADR 0008 §11, the proposer's first-class role is **history
reconstruction**: it keeps no KV cache and produces transient full-history
K/V that restores the verifier's attention at evicted positions. The main
line has already **won**, but the value is realised on the **memory axis**,
not the throughput axis:
Step-1 = **1.0× AR throughput + recall 5/5 + KV 132.9 MB vs naive 1308.9 MB
(89.8 % saving; ~48 MB after affine4 / ADR 0010)**. That is the
proposer/verifier deliverable.

A finer honesty note: in the Mac **S5-native** shipping configuration,
Gemma-4's *native hybrid attention* means keeping the **5 full-attention
layers exact** is already enough to carry recall, so on this specific model
the f_θ/proposer reconstruction is **replaced by the S5 shortcut**. But on a
**pure sliding-window architecture** (the K1/K2 Qwen3 case, with no
full-attention layers to preserve) and on the **CUDA full-restoration path**,
proposer reconstruction remains the **only** source of recall. The
architecture's domain of applicability is unchanged; Gemma-4 simply handed us
a free coupon.
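
As a quick check of the memory-axis figures quoted above, the saving is one minus the ratio of bounded to naive resident KV. The sketch below is illustrative only: the helper name `kv_saving` is made up here and is not part of the repository, and the affine4 input uses the approximate ~48 MB figure, so that result is approximate as well.

```python
def kv_saving(bounded_mb: float, naive_mb: float) -> float:
    """Fraction of resident KV memory saved relative to the naive cache."""
    if naive_mb <= 0:
        raise ValueError("naive_mb must be positive")
    return 1.0 - bounded_mb / naive_mb


# Figures quoted in this ADR (Mac ctx280 validation).
step1 = kv_saving(132.9, 1308.9)
# The affine4 figure is given only as "~48 MB", so treat this as a rough value.
affine4 = kv_saving(48.0, 1308.9)

print(f"Step-1 saving:  {step1:.1%}")
print(f"affine4 saving: {affine4:.1%}")
```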
### 2. Speculative-decoding value forks by platform: the Mac negative does not extrapolate

- **H200 (#107 measured)**: fused = **1.27× AR, recall 1.0**, with the *same*
  proposer/verifier code; on a platform where verify-batch is nearly free,
  spec-decode value holds.
- **Mac's 0.26×** has a **concrete, movable** bottleneck: real per-token
  acceptance is **30–40 %**. The vLLM reference reports the *same* drafter at
  **44.7 %**, and our own drafter docs say "the precise EAGLE-3 ↔ block-fusion
  alignment is a Stage-2 task". Alignment fine-tuning (the plan that has been
  queued in ADR 0001 / 0004 all along) lifting acceptance to **~70 %** makes
  the block-4 arithmetic `3.5 × 43.8 / 140 ≈ 1.1×`, so Mac clears the bar too.
  So the Mac status is **"waiting for the alignment asset"**, not
  **"architecture pronounced dead"**.
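
The block arithmetic above can be reproduced with a one-line model: speedup over AR is the number of tokens committed per block times the AR per-token latency, divided by the wall time of one draft-plus-verify block. Reading `43.8` as AR milliseconds per token and `140` as block milliseconds is an assumption inferred from the expression itself; the function name `spec_speedup` is illustrative and not taken from the codebase.

```python
def spec_speedup(tokens_per_block: float, ar_ms_per_token: float, block_ms: float) -> float:
    """Speedup of block speculative decoding relative to plain AR decoding.

    AR would need tokens_per_block * ar_ms_per_token to emit the tokens that
    one draft+verify block emits in block_ms.
    """
    if block_ms <= 0:
        raise ValueError("block_ms must be positive")
    return tokens_per_block * ar_ms_per_token / block_ms


AR_MS = 43.8      # assumed meaning: AR latency per token, in ms
BLOCK_MS = 140.0  # assumed meaning: wall time of one block-4 draft+verify, in ms

for committed in (2.5, 3.5):
    speedup = spec_speedup(committed, AR_MS, BLOCK_MS)
    print(f"{committed} tokens/block -> {speedup:.3f}x AR")

# Tokens per block needed to match AR under the same assumptions.
break_even = BLOCK_MS / AR_MS
print(f"break-even: {break_even:.2f} tokens/block")
```

Under these assumed constants the break-even point is about 3.2 committed tokens per block, which is why 2.5 tokens per block lands near 0.78× and 3.5 near 1.1×.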
### 3. Option value of the verification primitive: correctness containment makes any draft source plug-and-play

The v3 loop + byte-level consistency guarantees one thing: a draft source can
only affect **throughput**, never **pollute output**. This is an open
interface; the drafter can be swapped for anything. A concrete, this-week,
testable Mac route: **NGramProposer** (the zero-weight prompt-lookup proposer
already in PR #105) + the v3 loop, with draft cost ≈ 0, no alignment dependency,
and naturally high acceptance on agentic workloads (tool-call JSON, templated
replies, highly self-repetitive sessions). Arithmetic: `draft ≈ 0 +
verify(4) = 120 ms`, committing 2.5/block → 0.78×, 3.5/block → 1.1×; on
Kakeya's target workload (ADR 0006: local agent infrastructure) this is
entirely plausible. One bridge command verifies it.
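
The containment property can be illustrated with a toy loop: whatever the drafter proposes, only tokens the verifier itself would have produced are committed, so the output matches plain AR decoding. This is a minimal sketch under stated assumptions, not the repository's `NGramProposer` or v3 loop. All names (`toy_target`, `ngram_draft`, `garbage_draft`, `speculative_generate`) are hypothetical, the "model" is a deterministic toy rule, and the check is token-level greedy matching rather than the byte-level consistency check mentioned above.

```python
from typing import Callable, List, Sequence, Tuple

Token = int


def toy_target(context: Sequence[Token]) -> Token:
    """Stand-in for the verifier's greedy next token: repeats with period 3."""
    return context[-3]


def ngram_draft(context: Sequence[Token], block: int, n: int = 2) -> List[Token]:
    """Prompt-lookup drafting: find the most recent earlier occurrence of the
    last n tokens and propose the tokens that followed it."""
    if len(context) <= n:
        return []
    tail = list(context[-n:])
    for start in range(len(context) - n - 1, -1, -1):
        if list(context[start:start + n]) == tail:
            return list(context[start + n:start + n + block])
    return []


def garbage_draft(context: Sequence[Token], block: int) -> List[Token]:
    """Adversarial drafter: proposes tokens the toy target never emits here."""
    return [99] * block


def ar_generate(prompt: Sequence[Token], target: Callable, steps: int) -> List[Token]:
    """Plain autoregressive decoding, one target call per token."""
    out = list(prompt)
    for _ in range(steps):
        out.append(target(out))
    return out


def speculative_generate(prompt: Sequence[Token], target: Callable, draft: Callable,
                         steps: int, block: int = 4) -> Tuple[List[Token], int]:
    """Draft, then commit only the prefix the target agrees with.

    A real verifier scores the whole block in one batched forward; this toy
    calls the target per position but counts one verify per block.
    """
    out = list(prompt)
    goal = len(prompt) + steps
    verify_calls = 0
    while len(out) < goal:
        proposal = draft(out, block)
        verify_calls += 1
        for tok in proposal:
            if len(out) == goal or tok != target(out):
                break
            out.append(tok)          # accepted draft token
        if len(out) < goal:
            out.append(target(out))  # verifier's own token (correction or bonus)
    return out, verify_calls


prompt = [1, 2, 3, 1, 2]
reference = ar_generate(prompt, toy_target, 24)
results = {}
for name, drafter in (("ngram", ngram_draft), ("garbage", garbage_draft)):
    results[name] = speculative_generate(prompt, toy_target, drafter, 24)
    assert results[name][0] == reference  # the draft source never changes the output
    print(name, "verify calls:", results[name][1])
```

Both drafters yield the same tokens as `ar_generate`; they differ only in how many verify rounds are needed, which is the throughput-only effect the section describes.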
### 4. Beyond single-host throughput, the split is the foundation for a multi-host architecture

ADR 0009 / PR #105's capability-exchange plane is built with the
proposer/verifier roles as primitives: the proposer is a fleet capability that
can be gossip-discovered and remote-invoked. Even if a single Mac never runs
spec-decode, the "verifier on host A, proposer capability on host B / cloud"
shape (including the dev/eval tool plane) is **already running**: the Mac
bridge used over the last two days is itself an instance of this
architecture's tool plane.

### Bottom line

If the proposition is narrowed to *"Mac single-host + generic chat + the
current un-aligned DFlash"*, then yes, the proposer has **no runtime value in
that one cell today**, and Step-1 reaches 1.0× without it. But the
architecture's value map is: **the realised bounded-memory story (all
platforms) + the realised throughput story (CUDA) + two explicitly-priced Mac
throughput options (alignment fine-tuning / n-gram drafting) + the foundation
for the multi-host capability plane.**
## Consequences

- The proposer/verifier split is **retained**; "Step-1 doesn't use the
  proposer on Mac/Gemma-4" is **not** grounds to deprecate it (it is the only
  recall source on pure-sliding-window models and on CUDA full-restoration).
- Memory-axis claims (bounded KV, S5, affine4) are the **primary**,
  all-platform deliverable and should be reported as such; throughput claims
  must be qualified by platform (CUDA: realised; Mac: option-pending).
- Two priced Mac throughput options are tracked as next steps:
  (a) **alignment fine-tuning** (ADR 0001/0004) to lift DFlash acceptance
  toward the 44.7 % reference / ~70 % target; (b) **NGramProposer × v3 loop**
  for agentic workloads (draft ≈ 0, no training). Option (b) is the cheaper
  of the two and is verifiable in a single bridge run.
- Any future "is spec-decode worth it?" discussion must specify the
  **(platform, workload, drafter)** cell; a negative in one cell does not
  generalise.
## Alternatives considered

- **"The Mac throughput number kills the architecture."** Rejected: the Mac
  cell has a concrete, movable bottleneck (acceptance 30–40 % vs 44.7 %
  reference), and the negative does not extrapolate to CUDA (1.27×, realised)
  or to the memory axis (89.8 % saving, realised, all platforms).
- **"Deprecate the proposer because Step-1 reaches 1.0× without it on
  Gemma-4."** Rejected: S5 is a Gemma-4-specific free coupon (native hybrid
  attention); on pure sliding-window models (K1/K2 Qwen3) and the CUDA
  full-restoration path the proposer is the only recall source.
- **"Only ship spec-decode if it beats AR everywhere."** Rejected: the value
  is platform-forked; CUDA already beats AR, and the verification primitive's
  correctness containment makes the Mac throughput a strictly-additive option
  (it can never regress correctness).
## Evidence pointers

- Mac bounded-memory + recall: `results/research/k3_mlx_fused_fair_ctx280_n5_gen32_*.json`
  (recall 5/5, KV 132.9 MB vs 1308.9 MB), `docs/pr109-mac-ctx280-validation.md`.
- CUDA throughput: PR #107, `docs/k3-gpu-beta.md`.
- verify(L) headroom: `results/research/verify_l_sweep.json` (3.92× measured @
  L=16 vs 4.52× expert-union bound).
- Drafter alignment status: ADR 0001/0004; `inference_engine/v04/dflash_drafter.py`
  ("Stage-2" fidelity note).
- Capability plane / multi-host: ADR 0009, PR #105; `scripts/mac_bridge/`.
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# ADR 0013 — Distributed inference topology: what AR sequentiality allows

- **Status**: Accepted (2026-06-13)
- **Date**: 2026-06-13
- **Relates to**: ADR 0009 (mlx.distributed + capability exchange; this ADR
  is its topology companion / clarification), ADR 0008 (session-bound runtime),
  ADR 0001 (proposer sizing), ADR 0012 (value proposition).
## Context

A vision of "distributed inference" keeps being raised: *decompose one
inference task into several parallel subtasks, have the proposer coordinate
across multiple verifiers, and win total token throughput*, extrapolated to
**many-to-many** proposer/verifier wiring (one verifier fed by many proposers;
one proposer drafting for many verifiers).

ADR 0009 shipped the *substrate* (capability exchange + remote `ProposeBlock` +
`DistributedSpeculativeDecoder` + an optional `mlx.distributed` data plane) but
did not pin down **which inference topologies are physically achievable**. This
ADR fixes the can/can't-parallelize conclusion so it is not re-derived each time
the idea resurfaces.
## Decision

### The governing constraint

**Single-sequence autoregressive (AR) decoding is inherently sequential**:
token `N+1` depends on the realized value of token `N`. A single sequence's
token chain therefore **cannot** be split into independent parallel subtasks
across multiple verifiers the way a batch / map-reduce job can. This is a
causal-dependency property of AR generation, not an engineering gap.

The only parallelism available to a **single** sequence is:

1. **Intra-forward (model parallelism)**: split *one* verifier's weights/compute
   across hosts via tensor/pipeline parallelism (`mlx.distributed`
   `model.shard`, ADR 0009 §2.1). This is "one verifier across N hosts," not "N
   verifiers." It enables / accelerates a verifier too big for one host;
   throughput scales **sublinearly** (collective-communication bound), and its
   real purpose is fit + latency, not linear throughput multiplication.
2. **Intra-block (speculative decoding)**: the verifier checks `L` drafted
   tokens in **one batched forward**; each verify commits roughly
   `acceptance × block` tokens, amortizing one verify over many tokens. The
   `verify(L)` cost is **sublinear** in `L`
   (`results/research/verify_l_sweep.json`: 3.92× measured headroom at L=16),
   which is the headroom that makes blocks and trees pay off.
3. **N:1 tree / multi-candidate speculation**: multiple drafts (many proposers,
   or one proposer emitting a token *tree*) are verified in **one** batched
   forward via tree attention; the longest correct path is accepted. This
   raises **single-request** throughput by exploiting the sublinear `verify(L)`
   headroom from (2).
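
The acceptance rule in (3) can be sketched as "commit the candidate with the longest prefix the verifier agrees with". The code below is a toy illustration under stated assumptions: `toy_target`, `accepted_prefix_len`, and `verify_candidates` are hypothetical names, the target is a deterministic stand-in for greedy decoding, and candidates are scored one token at a time, whereas a real implementation would score all of them in a single batched forward with tree attention.

```python
from typing import Callable, List, Sequence, Tuple

Token = int


def toy_target(context: Sequence[Token]) -> Token:
    """Stand-in for the verifier's greedy next token: repeats with period 3."""
    return context[-3]


def accepted_prefix_len(context: Sequence[Token], draft: Sequence[Token],
                        target: Callable[[Sequence[Token]], Token]) -> int:
    """Number of leading draft tokens that equal the target's own choice."""
    ctx = list(context)
    n = 0
    for tok in draft:
        if tok != target(ctx):
            break
        ctx.append(tok)
        n += 1
    return n


def verify_candidates(context: Sequence[Token],
                      candidates: Sequence[Sequence[Token]],
                      target: Callable[[Sequence[Token]], Token]) -> Tuple[int, List[Token]]:
    """Return (index of best candidate, tokens to commit).

    The commit is the longest accepted prefix plus the target's own next
    token, so every round advances by at least one token. The index is -1
    when no candidates were supplied.
    """
    best_idx, best_len = -1, 0
    for i, cand in enumerate(candidates):
        n = accepted_prefix_len(context, cand, target)
        if best_idx < 0 or n > best_len:
            best_idx, best_len = i, n
    accepted = list(candidates[best_idx][:best_len]) if best_idx >= 0 else []
    bonus = target(list(context) + accepted)
    return best_idx, accepted + [bonus]


context = [1, 2, 3]
candidates = [
    [1, 9, 9, 9],  # diverges after one token
    [1, 2, 9, 9],  # diverges after two tokens
    [1, 2, 3, 1],  # fully correct
]
idx, commit = verify_candidates(context, candidates, toy_target)
print("best candidate:", idx, "committed tokens:", commit)
```

Note that the committed path is still a single sequential chain: extra candidates raise the odds of a long accepted prefix per verify, but they do not split the sequence across verifiers.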
### The topologies, mapped to feasibility

| Topology | Realizable? | What it is | Status on the ADR-0009 substrate |
|---|---|---|---|
| **Split one sequence across N independent verifiers in parallel** | ❌ No | category error: blocked by AR sequentiality | n/a |
| **Single big verifier sharded across hosts** (1 verifier, N hosts) | ✅ Yes | tensor/pipeline parallel of one model | `mlx.distributed` ring adapter shipped (ADR 0009 §4.4); sharding is mlx-lm `model.shard` |
| **N proposers : 1 verifier** (tree / multi-candidate) | ✅ Yes: **the** path to single-request throughput | parallel candidate verification | **feasible, not built**: current `DistributedSpeculativeDecoder` is single `RemoteProposer` + linear accept; needs tree-attention verify + multi-proposer aggregation |
| **1 proposer : N verifiers** | ✅ Yes (already realized) | a shared proposer capability serves many independent verifier sessions | shipped: `ProposerService` + capability exchange (ADR 0009 §4) |
### What "total throughput advantage" means (two regimes)

- **Single-request throughput**: only (1) intra-verifier model parallelism and
  (3) N:1 tree speculation help. Multiple *independent* verifiers do **not**
  help: there is nothing to parallelize across them for one sequence.
- **Fleet / aggregate throughput** (many independent requests): the **1:N**
  proposer-sharing + role placement is the realized win. It raises utilization
  (offload the asymmetrically-cheap 0.25–1 B proposer, free
  `proposer_weight_bytes` on verifier hosts), but does not speed up any single
  request beyond ordinary spec-decode.
## Consequences

- For **single-request throughput**, the correct next investment is **N:1 tree /
  multi-candidate speculation** built on the ADR-0009 capability substrate +
  the sublinear `verify(L)` headroom, **not** "more verifiers." This is tracked
  as a v0.5+ extension to `DistributedSpeculativeDecoder` (territory of the
  ADR-0009 / capability-plane workstream).
- **Multi-host spec-decode trades latency for placement**: F3 (aux hidden
  states) moves megabytes per block on the critical path (ADR 0009 F-flow
  table); it only pays off behind a fast data plane (ring / `jaccl`).
  Distributing for its own sake can regress single-request latency.
- The **Mac bridge** (`scripts/mac_bridge/`) used for dev/eval is an instance of
  the capability plane's **tool plane**, *not* a production inference data
  plane. "The multi-host tool plane is running" must not be extrapolated to "a
  distributed inference data plane is ready."
- Any future "let's parallelize one request across machines" proposal must first
  identify which of the four topologies it is; the "N independent verifiers on
  one sequence" form is closed.
## Alternatives considered

- **"Decompose one sequence into parallel subtasks across N verifiers."**
  Rejected: AR sequentiality (token `N+1` needs token `N`) makes the subtasks
  causally dependent, not independent. No coordination protocol recovers
  independence that the math forbids.
- **"Multiple verifiers vote / ensemble on one sequence for speed."** Rejected
  for throughput: ensembling changes *quality semantics* and still runs each
  verifier over the same sequential chain; it multiplies cost, not speed.
- **"Treat distribution as the throughput lever."** Rejected as the primary
  lever: the realized throughput wins are intra-block (spec-decode, single host)
  and fleet-aggregate (1:N); cross-host distribution's single-request value is
  bounded by F3 latency and is a fit/placement tool, not a linear scaler.
## Evidence pointers

- `verify(L)` sublinearity (the headroom tree-spec would exploit):
  `results/research/verify_l_sweep.json` (3.92× measured @ L=16).
- F-flow latency analysis + `mlx.distributed` data-plane scope: ADR 0009 §2.
- Realized 1:N substrate: ADR 0009 §4 (`CapabilityService`, `ProposerService`,
  `DistributedSpeculativeDecoder`); `inference_engine/distributed/`.

docs/adr/README.md

Lines changed: 2 additions & 0 deletions
@@ -40,6 +40,8 @@ reader what was *not* chosen.
 | 0006 | [Project positioning as local agent infrastructure](0006-local-agent-infrastructure-positioning.md) | Accepted |
 | 0007 | [Cross-request KV cache reuse for long sessions](0007-cross-request-kv-reuse.md) | Superseded by 0008 |
 | 0008 | [Session-bound runtime + gRPC protocol](0008-session-bound-runtime-and-grpc-protocol.md) | Accepted |
+| 0012 | [Proposer/verifier value proposition: bounded-memory + recall, platform-forked throughput](0012-proposer-verifier-value-proposition.md) | Accepted |
+| 0013 | [Distributed inference topology: what AR sequentiality allows](0013-distributed-inference-topology.md) | Accepted |
 | 0014 | [Agent-connection capacity & cross-host proposer/verifier topology: test plan & results](0014-agent-connection-capacity-and-cross-host-topology-tests.md) | Accepted |
 | 0015 | [Kakeya Inference Engine: a product-grade vLLM replacement, Kakeya Attention native](0015-kakeya-attention-and-engine-substrate.md) | Accepted |
