Merge remote-tracking branch 'origin/AgentMemory/adr-0013-distributed-topology-2815' into _adr_train

cursoragent · FluffyAIcode · cursoragent · commit da82b73472b0 · 2026-06-19T03:00:53.000Z
# Conflicts:
#	docs/adr/README.md

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
diff --git a/docs/adr/0012-proposer-verifier-value-proposition.md b/docs/adr/0012-proposer-verifier-value-proposition.md
@@ -0,0 +1,151 @@
+# ADR 0012 — Proposer/verifier value proposition: bounded-memory + recall (all platforms), platform-forked throughput
+
+- **Status**: Accepted (2026-06-13)
+- **Date**: 2026-06-13
+- **Decision drivers**:
+  - A recurring question keeps being re-opened by new contributors (human
+    and agent): *"is the proposer still worth it, given Step-1 reaches 1.0× AR
+    on Mac without using it?"* and *"is speculative decoding dead on Mac?"*
+    This ADR settles the value map so the decision tree is not re-derived
+    every time the Mac throughput number looks bad in isolation.
+  - 2026-06-12 Mac ctx280 validation (`results/research/k3_mlx_fused_fair_ctx280_n5_gen32_*.json`):
+    Step-1 recall 5/5 vs oracle 5/5, bounded resident KV 132.9 MB vs naive
+    1308.9 MB (89.8 % saving) at 4406–5810-token prompts.
+  - 2026-06-11 H200 #107 evidence: fused spec-decode 1.27× AR, recall 1.0.
+  - 2026-06-13 `verify(L)` calibration sweep (`results/research/verify_l_sweep.json`,
+    ctx 4096): measured kernel-dedup headroom 3.92× at L=16, ≈87 % of the
+    router-measured expert-union bound (4.52×).
+  - Builds on / re-affirms ADR 0001 (proposer sizing + alignment),
+    ADR 0004 (alignment data policy), ADR 0006 (local-agent-infra
+    positioning), ADR 0008 §11 (dLM K/V-Restoration architecture),
+    ADR 0009 (capability exchange), ADR 0010 (full-attention low-precision
+    KV / affine4), ADR 0011 (cross-attention coupling, falsified by R1e).
+
+## Context
+
+ADR 0008 §11 (K-series) changed the proposer's primary role from *drafter*
+to *history reconstructor*: the dLM proposer has no KV cache and can produce
+transient K/V for the **entire** history, which is used to restore the
+verifier's attention at structurally-evicted positions. Speculative decoding
+is the **second** product line on the same architecture, not the first.
+
+The trap is to evaluate the architecture on a single cell of its value
+table — "Mac, single host, generic chat, current un-aligned DFlash" — see a
+weak throughput number, and conclude the proposer (or spec-decode) has no
+value. That conclusion does not generalise. This ADR records the full value
+map and prices the open options explicitly.
+
+## Decision
+
+The proposer/verifier value proposition is realised on **two axes**, and its
+status is **platform- and workload-dependent**, not a single scalar:
+
+### 1. The core value is "bounded memory + recall", not "fast"
+
+Since ADR 0008 §11, the proposer's first-class role is **history
+reconstruction**: no KV cache, transient full-history K/V → restore the
+verifier's evicted-position attention. The main line has already **won**, but
+the value is realised on the **memory axis**, not the throughput axis:
+Step-1 = **1.0× AR throughput + recall 5/5 + KV 132.9 MB vs naive 1308.9 MB
+(89.8 % saving; ~48 MB after affine4 / ADR 0010)**. That is the
+proposer/verifier deliverable.
+
+A finer honesty note: in the Mac **S5-native** shipping configuration,
+Gemma-4's *native hybrid attention* means keeping the **5 full-attention
+layers exact** is already enough to carry recall — so on this specific model
+the f_θ/proposer reconstruction is **replaced by the S5 shortcut**. But on a
+**pure sliding-window architecture** (the K1/K2 Qwen3 case — no
+full-attention layers to preserve) and on the **CUDA full-restoration path**,
+proposer reconstruction remains the **only** source of recall. The
+architecture's domain of applicability is unchanged; Gemma-4 simply handed us
+a free coupon.
+
+### 2. Speculative-decoding value forks by platform — the Mac negative does not extrapolate
+
+- **H200 (#107 measured)**: fused = **1.27× AR, recall 1.0** — the *same*
+  proposer/verifier code; on a platform where verify-batch is nearly free,
+  spec-decode value holds.
+- **Mac's 0.26×** has a **concrete, movable** bottleneck: real per-token
+  acceptance is **30–40 %**. The vLLM reference reports the *same* drafter at
+  **44.7 %**, and our own drafter docs say "the precise EAGLE-3 ↔ block-fusion
+  alignment is a Stage-2 task". Alignment fine-tuning (the plan that has been
+  queued in ADR 0001 / 0004 all along) lifting acceptance to **~70 %** makes
+  the block-4 arithmetic `3.5 × 43.8 / 140 ≈ 1.1×` — Mac clears the bar too.
+  So the Mac status is **"waiting for the alignment asset"**, not
+  **"architecture pronounced dead"**.
+
+### 3. Option value of the verification primitive: correctness containment makes any draft source plug-and-play
+
+The v3 loop + byte-level consistency guarantees one thing: a draft source can
+only affect **throughput**, never **pollute output**. This is an open
+interface; the drafter can be swapped for anything. A concrete, this-week,
+testable Mac route: **NGramProposer** (the zero-weight prompt-lookup proposer
+already in PR #105) + the v3 loop — draft cost ≈ 0, no alignment dependency,
+naturally high acceptance on agentic workloads (tool-call JSON, templated
+replies, highly self-repetitive sessions). Arithmetic: `draft ≈ 0 +
+verify(4) = 120 ms`, committing 2.5/block → 0.78×, 3.5/block → 1.1× — on
+Kakeya's target workload (ADR 0006: local agent infrastructure) this is
+entirely plausible. One bridge command verifies it.
+
+### 4. Beyond single-host throughput, the split is the foundation for a multi-host architecture
+
+ADR 0009 / PR #105's capability-exchange plane is built with the
+proposer/verifier roles as primitives: the proposer is a fleet capability that
+can be gossip-discovered and remote-invoked. Even if a single Mac never runs
+spec-decode, the "verifier on host A, proposer capability on host B / cloud"
+shape (including the dev/eval tool plane) is **already running** — the Mac
+bridge used over the last two days is itself an instance of this
+architecture's tool plane.
+
+### Bottom line
+
+If the proposition is narrowed to *"Mac single-host + generic chat + the
+current un-aligned DFlash"* — yes, the proposer has **no runtime value in
+that one cell today**, and Step-1 reaches 1.0× without it. But the
+architecture's value map is: **the realised bounded-memory story (all
+platforms) + the realised throughput story (CUDA) + two explicitly-priced Mac
+throughput options (alignment fine-tuning / n-gram drafting) + the foundation
+for the multi-host capability plane.**
+
+## Consequences
+
+- The proposer/verifier split is **retained**; "Step-1 doesn't use the
+  proposer on Mac/Gemma-4" is **not** grounds to deprecate it (it is the only
+  recall source on pure-sliding-window models and on CUDA full-restoration).
+- Memory-axis claims (bounded KV, S5, affine4) are the **primary**,
+  all-platform deliverable and should be reported as such; throughput claims
+  must be qualified by platform (CUDA: realised; Mac: option-pending).
+- Two priced Mac throughput options are tracked as next steps:
+  (a) **alignment fine-tuning** (ADR 0001/0004) to lift DFlash acceptance
+  toward the 44.7 % reference / ~70 % target; (b) **NGramProposer × v3 loop**
+  for agentic workloads (draft ≈ 0, no training). Option (b) is the cheapest
+  and is verifiable in a single bridge run.
+- Any future "is spec-decode worth it?" discussion must specify the
+  **(platform, workload, drafter)** cell; a negative in one cell does not
+  generalise.
+
+## Alternatives considered
+
+- **"The Mac throughput number kills the architecture."** Rejected: the Mac
+  cell has a concrete, movable bottleneck (acceptance 30–40 % vs 44.7 %
+  reference), and the negative does not extrapolate to CUDA (1.27×, realised)
+  or to the memory axis (89.8 % saving, realised, all platforms).
+- **"Deprecate the proposer because Step-1 reaches 1.0× without it on
+  Gemma-4."** Rejected: S5 is a Gemma-4-specific free coupon (native hybrid
+  attention); on pure sliding-window models (K1/K2 Qwen3) and the CUDA
+  full-restoration path the proposer is the only recall source.
+- **"Only ship spec-decode if it beats AR everywhere."** Rejected: the value
+  is platform-forked; CUDA already clears it, and the verification primitive's
+  correctness containment makes the Mac throughput a strictly-additive option
+  (it can never regress correctness).
+
+## Evidence pointers
+
+- Mac bounded-memory + recall: `results/research/k3_mlx_fused_fair_ctx280_n5_gen32_*.json`
+  (recall 5/5, KV 132.9 MB vs 1308.9 MB), `docs/pr109-mac-ctx280-validation.md`.
+- CUDA throughput: PR #107, `docs/k3-gpu-beta.md`.
+- verify(L) headroom: `results/research/verify_l_sweep.json` (3.92× measured @
+  L=16 vs 4.52× expert-union bound).
+- Drafter alignment status: ADR 0001/0004; `inference_engine/v04/dflash_drafter.py`
+  ("Stage-2" fidelity note).
+- Capability plane / multi-host: ADR 0009, PR #105; `scripts/mac_bridge/`.
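The headline arithmetic in ADR 0012 can be re-derived in a few lines of Python. This is a minimal sketch, not code from the repository: the function names are hypothetical, and reading `43.8` as AR milliseconds per token and `140` as milliseconds per draft+verify block cycle is an assumption, since the ADR quotes the expression `3.5 × 43.8 / 140` without labelling its operands.

```python
def kv_saving(bounded_mb: float, naive_mb: float) -> float:
    """Fraction of resident KV memory saved relative to the naive cache."""
    return 1.0 - bounded_mb / naive_mb


def fraction_of_bound(measured: float, bound: float) -> float:
    """How much of a theoretical headroom bound the measurement reaches."""
    return measured / bound


def block_speedup(committed_per_block: float,
                  ar_ms_per_token: float,
                  block_cycle_ms: float) -> float:
    """Speedup over plain AR decoding.

    Numerator: time AR would need for the tokens one block commits.
    Denominator: time one draft+verify cycle takes.
    ASSUMPTION: this operand reading is ours, not stated in the ADR.
    """
    return committed_per_block * ar_ms_per_token / block_cycle_ms


def break_even_committed(ar_ms_per_token: float, block_cycle_ms: float) -> float:
    """Committed tokens per block at which spec-decode matches AR (1.0x)."""
    return block_cycle_ms / ar_ms_per_token


if __name__ == "__main__":
    # Figures quoted in ADR 0012; the operand labels are assumed.
    print(f"KV saving:        {kv_saving(132.9, 1308.9):.1%}")
    print(f"headroom / bound: {fraction_of_bound(3.92, 4.52):.0%}")
    print(f"block-4 speedup:  {block_speedup(3.5, 43.8, 140.0):.2f}x")
    print(f"break-even:       {break_even_committed(43.8, 140.0):.2f} tokens/block")
```

Under these assumed operands a block has to commit roughly 3.2 tokens to break even with AR, which is why the acceptance rate is the quantity both Mac options try to move.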
diff --git a/docs/adr/0013-distributed-inference-topology.md b/docs/adr/0013-distributed-inference-topology.md
@@ -0,0 +1,111 @@
+# ADR 0013 — Distributed inference topology: what AR sequentiality allows
+
+- **Status**: Accepted (2026-06-13)
+- **Date**: 2026-06-13
+- **Relates to**: ADR 0009 (mlx.distributed + capability exchange — this ADR
+  is its topology companion / clarification), ADR 0008 (session-bound runtime),
+  ADR 0001 (proposer sizing), ADR 0012 (value proposition).
+
+## Context
+
+A vision for "distributed inference" keeps resurfacing: *decompose one
+inference task into several parallel subtasks, have the proposer coordinate
+across multiple verifiers, and win total token throughput*, extrapolated to
+**many-to-many** proposer/verifier wiring (one verifier fed by many proposers;
+one proposer drafting for many verifiers).
+
+ADR 0009 shipped the *substrate* (capability exchange + remote `ProposeBlock` +
+`DistributedSpeculativeDecoder` + an optional `mlx.distributed` data plane) but
+did not pin down **which inference topologies are physically achievable**. This
+ADR fixes the can/can't-parallelize conclusion so it is not re-derived each time
+the idea resurfaces.
+
+## Decision
+
+### The governing constraint
+
+**Single-sequence autoregressive (AR) decoding is inherently sequential**:
+token `N+1` depends on the realized value of token `N`. A single sequence's
+token chain therefore **cannot** be split into independent parallel subtasks
+across multiple verifiers the way a batch / map-reduce job can. This is a
+causal-dependency property of AR generation, not an engineering gap.
+
+The only parallelism available to a **single** sequence is:
+
+1. **Intra-forward (model parallelism)** — split *one* verifier's weights/compute
+   across hosts via tensor/pipeline parallelism (`mlx.distributed`
+   `model.shard`, ADR 0009 §2.1). This is "one verifier across N hosts," not "N
+   verifiers." It enables / accelerates a verifier too big for one host;
+   throughput scales **sublinearly** (collective-communication bound), and its
+   real purpose is fit + latency, not linear throughput multiplication.
+2. **Intra-block (speculative decoding)** — the verifier checks `L` drafted
+   tokens in **one batched forward**; the gain comes from committing
+   ≈ `acceptance × block` tokens per verify, amortizing one verify over many
+   tokens. The `verify(L)` cost is **sublinear** in `L`
+   (`results/research/verify_l_sweep.json`: ~4× measured headroom at L=16),
+   which is the headroom that makes blocks and trees pay off.
+3. **N:1 tree / multi-candidate speculation** — multiple drafts (many proposers,
+   or one proposer emitting a token *tree*) are verified in **one** batched
+   forward via tree attention; the longest correct path is accepted. This
+   raises **single-request** throughput by exploiting the sublinear `verify(L)`
+   headroom from (2).
+
+### The topologies, mapped to feasibility
+
+| Topology | Realizable? | What it is | Status on the ADR-0009 substrate |
+|---|---|---|---|
+| **Split one sequence across N independent verifiers in parallel** | ❌ No | category error — blocked by AR sequentiality | n/a |
+| **Single big verifier sharded across hosts** (1 verifier, N hosts) | ✅ Yes | tensor/pipeline parallel of one model | `mlx.distributed` ring adapter shipped (ADR 0009 §4.4); sharding is mlx-lm `model.shard` |
+| **N proposers : 1 verifier** (tree / multi-candidate) | ✅ Yes — **the** path to single-request throughput | parallel candidate verification | **feasible, not built** — current `DistributedSpeculativeDecoder` is single `RemoteProposer` + linear accept; needs tree-attention verify + multi-proposer aggregation |
+| **1 proposer : N verifiers** | ✅ Yes (already realized) | a shared proposer capability serves many independent verifier sessions | shipped: `ProposerService` + capability exchange (ADR 0009 §4) |
+
+### What "total throughput advantage" means (two regimes)
+
+- **Single-request throughput**: on top of ordinary spec-decode (2), only
+  (1) intra-verifier model parallelism and (3) N:1 tree speculation help.
+  Multiple *independent* verifiers do **not**: there is nothing to
+  parallelize across them for one sequence.
+- **Fleet / aggregate throughput** (many independent requests): the **1:N**
+  proposer-sharing + role placement is the realized win — it raises utilization
+  (offload the asymmetrically-cheap 0.25–1 B proposer, free
+  `proposer_weight_bytes` on verifier hosts), but does not speed up any single
+  request beyond ordinary spec-decode.
+
+## Consequences
+
+- For **single-request throughput**, the correct next investment is **N:1 tree /
+  multi-candidate speculation** built on the ADR-0009 capability substrate +
+  the sublinear `verify(L)` headroom — **not** "more verifiers." This is tracked
+  as a v0.5+ extension to `DistributedSpeculativeDecoder` (territory of the
+  ADR-0009 / capability-plane workstream).
+- **Multi-host spec-decode trades latency for placement**: F3 (aux hidden
+  states) moves megabytes per block over the critical path (ADR 0009 F-flow
+  table), so it only pays off behind a fast data plane (ring / `jaccl`).
+  Distributing for its own sake can regress single-request latency.
+- The **Mac bridge** (`scripts/mac_bridge/`) used for dev/eval is an instance of
+  the capability plane's **tool plane**, *not* a production inference data
+  plane. "The multi-host tool plane is running" must not be extrapolated to "a
+  distributed inference data plane is ready."
+- Any future "let's parallelize one request across machines" proposal must first
+  identify which of the four topologies it is; the "N independent verifiers on
+  one sequence" form is closed.
+
+## Alternatives considered
+
+- **"Decompose one sequence into parallel subtasks across N verifiers."**
+  Rejected: AR sequentiality (token `N+1` needs token `N`) makes the subtasks
+  causally dependent, not independent. No coordination protocol recovers
+  independence that the math forbids.
+- **"Multiple verifiers vote / ensemble on one sequence for speed."** Rejected
+  for throughput: ensembling changes *quality semantics* and still runs each
+  verifier over the same sequential chain — it multiplies cost, not speed.
+- **"Treat distribution as the throughput lever."** Rejected as the primary
+  lever: the realized throughput wins are intra-block (spec-decode, single host)
+  and fleet-aggregate (1:N); cross-host distribution's single-request value is
+  bounded by F3 latency and is a fit/placement tool, not a linear scaler.
+
+## Evidence pointers
+
+- `verify(L)` sublinearity (the headroom tree-spec would exploit):
+  `results/research/verify_l_sweep.json` (3.92× measured @ L=16).
+- F-flow latency analysis + `mlx.distributed` data-plane scope: ADR 0009 §2.
+- Realized 1:N substrate: ADR 0009 §4 (`CapabilityService`, `ProposerService`,
+  `DistributedSpeculativeDecoder`); `inference_engine/distributed/`.
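The linear accept rule and its N:1 multi-candidate extension described in ADR 0013 can be illustrated with a toy greedy verifier. This is a sketch under stated assumptions: `accept_linear`, `accept_best_of`, `ar_rollout`, and the `Verifier` callable are hypothetical names, not the API of `DistributedSpeculativeDecoder`. The sketch calls the verifier once per position for readability, whereas the design in the ADR scores a whole block (or tree) in one batched forward. The correction/bonus-token convention shown is the common greedy-verification one and may differ from the repository's v3 loop.

```python
from typing import Callable, List, Sequence

Token = int
# Hypothetical verifier interface: greedy next token for a given prefix.
Verifier = Callable[[Sequence[Token]], Token]


def accept_linear(prefix: Sequence[Token],
                  draft: Sequence[Token],
                  verifier: Verifier) -> List[Token]:
    """Commit the longest draft prefix the verifier agrees with, plus one
    verifier token (a correction at the first mismatch, or a bonus token
    when the whole draft matched)."""
    committed: List[Token] = []
    for tok in draft:
        expected = verifier(list(prefix) + committed)
        if tok != expected:
            committed.append(expected)  # verifier overrides the bad draft token
            return committed
        committed.append(tok)
    committed.append(verifier(list(prefix) + committed))  # bonus token
    return committed


def accept_best_of(prefix: Sequence[Token],
                   drafts: Sequence[Sequence[Token]],
                   verifier: Verifier) -> List[Token]:
    """N:1 form: verify several candidate drafts (must be non-empty) and
    keep the one that commits the most tokens."""
    return max((accept_linear(prefix, d, verifier) for d in drafts), key=len)


def ar_rollout(prefix: Sequence[Token], n: int, verifier: Verifier) -> List[Token]:
    """Plain autoregressive decoding: one verifier call per token."""
    out: List[Token] = []
    for _ in range(n):
        out.append(verifier(list(prefix) + out))
    return out


def toy_verifier(prefix: Sequence[Token]) -> Token:
    """Toy stand-in for a model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10 if prefix else 0


if __name__ == "__main__":
    prefix = [1, 2]
    drafts = [[8], [3, 4, 9, 9], [3, 4, 5, 6]]
    committed = accept_best_of(prefix, drafts, toy_verifier)
    # Whatever the drafts contain, the committed tokens match plain AR.
    assert committed == ar_rollout(prefix, len(committed), toy_verifier)
    print(committed)
```

Whichever draft wins, the committed tokens equal the verifier's own greedy rollout, so extra proposers change only how many tokens are committed per verify. This is the correctness-containment property ADR 0012 §3 relies on, and it is also why adding independent verifiers buys nothing for a single sequence: each committed token still depends on the one before it.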
diff --git a/docs/adr/README.md b/docs/adr/README.md
@@ -40,6 +40,8 @@ reader what was *not* chosen.
 | 0006 | [Project positioning as local agent infrastructure](0006-local-agent-infrastructure-positioning.md) | Accepted |
 | 0007 | [Cross-request KV cache reuse for long sessions](0007-cross-request-kv-reuse.md) | Superseded by 0008 |
 | 0008 | [Session-bound runtime + gRPC protocol](0008-session-bound-runtime-and-grpc-protocol.md) | Accepted |
+| 0012 | [Proposer/verifier value proposition: bounded-memory + recall, platform-forked throughput](0012-proposer-verifier-value-proposition.md) | Accepted |
+| 0013 | [Distributed inference topology: what AR sequentiality allows](0013-distributed-inference-topology.md) | Accepted |
 | 0014 | [Agent-connection capacity & cross-host proposer/verifier topology: test plan & results](0014-agent-connection-capacity-and-cross-host-topology-tests.md) | Accepted |
 | 0015 | [Kakeya Inference Engine: a product-grade vLLM replacement, Kakeya Attention native](0015-kakeya-attention-and-engine-substrate.md) | Accepted |