
Commit 27a17f2

Release v0.1.0: Kakeya inference engine for Apple Silicon
Merges PR #2 (DLM Proposer + AR Verifier — runnable KV-cache-saving framework) into main as the v0.1.0 baseline.

What this release contains:

Core algorithmic framework (kv_cache_proposer/):
- DLM proposer (masked-diffusion) + AR verifier with sink+window KV
- Greedy speculative decoding with rejection sampling and EOS handling
- Streaming on_token callback, Net Bytes per Token memory accounting
- Baseline AR decoder for equivalence self-test

Inference engine scaffold (inference_engine/):
- MLX backend: env detection, torch<->mlx bridge, custom SinkWindowKVCache
- MLXSinkWindowVerifier and MLXSparseLogitsProposer for Apple Silicon
- Sparse-logits proposer optimization (LM head only at masked positions)
- mx.compile JIT compilation on the proposer's bidirectional backbone

Tooling and ops:
- scripts/setup_mac.sh and scripts/setup_cuda.sh with HF cache pre-flight
- scripts/chat.py streaming REPL, scripts/run_platform_tests.sh
- bench suite: sparse-vs-dense, MLX param sweep, mx.compile, verifier
- 100% unit-test coverage on core algorithmic components

Architecture documentation:
- docs/local-inference-engine.md: serving stack design
- docs/adr/0001: proposer sizing, EAGLE-3 alignment recipe, verifier decoupling
- docs/adr/0002: verifier selection, quantization, open/closed-weight constraint

Measured baseline on Mac M4 24GB (commit 8b1aca0):
- 3-way bench (zh KV-cache prompt): CPU/CPU 127.31s -> MLX/MLX 12.07s (10.55x)
- Acceptance 0.06-0.12 (no alignment training yet, blocker for v1 ship)
- Verifier sink+window KV: 7.44 MB vs baseline 9.19 MB

Known gaps tracked for follow-up releases:
- Alignment training pipeline (training/repr_align/) per ADR 0001
- Qwen3-8B 4-bit verifier swap per ADR 0002
- Continuous batching scheduler, NF4 KV quant, OpenAI-compat HTTP API
- Tree speculative decoding (deferred until acceptance >= 0.3)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
2 parents dfaf010 + 4309266 commit 27a17f2

91 files changed

Lines changed: 55538 additions & 0 deletions


.coveragerc

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
[run]
# Use the modern Python 3.12+ sys.monitoring backend, which avoids the
# C-trace conflict with torch's _C extension.
core = sysmon
parallel = false
branch = false

# NOTE on the coverage scope:
#
# We do NOT set `source = ...` here. The runner script
# (scripts/run_platform_tests.sh) sets `--cov=...` flags dynamically
# based on which backend is being tested:
#
#   * `kv_cache_proposer` and the platform-neutral `inference_engine`
#     subpackages (proposer/, memory/, scheduler/, server/) are ALWAYS
#     covered — their tests run on every backend.
#   * `inference_engine.backends.<selected>` is covered only when that
#     backend is being tested. An MLX backend module sitting on a Linux
#     CUDA host is not in scope (the host can't import mlx), and vice
#     versa.
#
# This avoids the alternative — setting `source = inference_engine` —
# which would force every backend module to be 100% covered on every
# host, even when its dependencies (Metal / CUDA) aren't present.

# CLI entry-point scripts are exercised by end-to-end demo runs (under
# `results/`) and by an integration test that invokes them via
# subprocess; they are intentionally not part of the unit-test coverage
# target (their bodies are argparse plumbing + orchestration of already-
# tested library code).
omit =
    kv_cache_proposer/run_demo.py

[report]
exclude_lines =
    pragma: no cover
    raise NotImplementedError
    if __name__ == .__main__.:
    if TYPE_CHECKING:
show_missing = true
skip_empty = true
fail_under = 100

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
__pycache__/
*.pyc
*.pyo
*.swp
.venv/
.idea/
.vscode/
.DS_Store
.coverage

README.md

Lines changed: 279 additions & 0 deletions
@@ -0,0 +1,279 @@
# DLM Proposer + AR Verifier — runnable KV-cache-saving framework

Runs the speculative-decoding architecture designed in the prior product
discussion using **real, public** weights:

| Role     | Model                                                            | Params | Tokenizer    |
| -------- | ---------------------------------------------------------------- | ------ | ------------ |
| Proposer | [`dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1`][p]                   | 0.75 B | Qwen3 family |
| Verifier | [`Qwen/Qwen3-1.7B`][v] (closest public stand-in for "Qwen 3.6")  | 1.72 B | Qwen3 family |

[p]: https://huggingface.co/dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1
[v]: https://huggingface.co/Qwen/Qwen3-1.7B

> **Note on the verifier choice**: at the time of this writing, no public
> "Qwen 3.6" checkpoint exists. We use `Qwen/Qwen3-1.7B` because it is the
> closest publicly-available autoregressive Qwen-3 model that (a) shares the
> proposer's tokenizer (the prompt encodes to identical token ids — verified
> at startup) and (b) is large enough to make KV-cache savings non-trivial.
> Swapping in an actual Qwen 3.5/3.6 checkpoint requires only changing
> `--verifier-id`. Note that Qwen 3.5/3.6's hybrid attention design carries
> KV on only 16/64 layers, so its baseline KV/token would be **smaller** than
> Qwen3-1.7B's 114 KB/token (closer to ~65 KB/token); compression *ratios*
> against that smaller baseline would be correspondingly smaller, but the
> framework code is unchanged.

## Memory accounting and what we measure

The metric is **Net Bytes per Token**, defined as:

    Net Bytes per Token (KV-only) =
        verifier_KV_per_token
      + proposer_KV_per_token
      + proposer_weight_bytes / (B * S)

where `B` is the concurrent-request batch size and `S` is the per-request
sequence length (both at the production operating point).
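
The formula above can be sketched as a small function. This is a minimal illustration and not the repository's `metrics.py`: the function name, the argument names, and the round numbers in the usage lines are assumptions chosen for readability.

```python
def net_bytes_per_token(
    verifier_kv_per_token: float,
    proposer_kv_per_token: float,
    proposer_weight_bytes: float,
    batch_size: int,
    seq_len: int,
) -> float:
    """KV-only Net Bytes per Token: persistent KV plus amortized proposer weights."""
    if batch_size <= 0 or seq_len <= 0:
        raise ValueError("batch_size and seq_len must be positive")
    # Proposer weights are shared by all B requests and all S tokens per request.
    amortized_weights = proposer_weight_bytes / (batch_size * seq_len)
    return verifier_kv_per_token + proposer_kv_per_token + amortized_weights


# Illustrative round numbers (hypothetical, not measurements):
# 100 B/token of verifier KV, no proposer KV, and 1,000,000 B of proposer
# weights shared by B=10 requests of S=1,000 tokens each.
example = net_bytes_per_token(100.0, 0.0, 1_000_000.0, batch_size=10, seq_len=1_000)
print(example)  # 200.0
```

Only the weight term depends on `B * S`, which is why the metric is dominated by proposer weights at small scale and by the KV terms at large scale.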

**Activation peak is *not* in Net Bytes per Token.** A transient activation
tensor is allocated when `model(...)` starts, freed when `model(...)`
returns; it does not accumulate across forwards and does not scale
per-session. It is a GPU **capacity constraint** (the forward must fit in
HBM), not a per-token cost. We report it separately.

> ⚠️ **Earlier metric was wrong.** A previous version of `metrics.py`
> amortized `peak_activation / (B * L_block)` into Net Bytes per Token.
> This conflated a transient peak with persistent memory and inflated the
> metric by 30,000+ B/token in the long-context regime, making compression
> appear at 3.5× when it should have been ~600×. The fix is in
> `metrics.py` and the new report shape; the design-stage formula in the
> project notes had the same error and is corrected accordingly.

## Architecture

```
┌──────────────────┐    L tokens       ┌────────────────────────┐
│  DLM Proposer    │ ────────────────► │  AR Verifier           │
│  Qwen3-0.6B-MDLM │                   │  Qwen3-1.7B            │
│  K diffusion     │ ◄──────────────── │  DynamicCache trimmed  │
│  steps / block   │  accept / reject  │  to sink+window slots  │
└──────────────────┘                   └────────────────────────┘
```

* `proposer.py` — masked-diffusion block generator faithful to the model card's reference (low-confidence remasking, deterministic at temperature 0). The proposer in this build re-encodes the full prefix per block; it does **not** maintain a persistent KV cache, so its persistent memory contribution to Net Bytes per Token is zero.
* `verifier.py` — `SinkWindowVerifier` slices each `DynamicCache` layer's K/V tensors after every step; new queries always use the **global** RoPE position (so RoPE on new K/Q is correct), and evicted tokens drop out of attention's view (StreamingLLM-style). Layer-shape invariants raise on mismatch.
* `speculative.py` — greedy speculative-decoding loop with rejection sampling. When `sink + window >= full_seq_len`, output is **bit-equivalent** to greedy AR — verified at runtime; the demo exits with code 2 on mismatch.
* `baseline.py` — reference greedy AR with full `DynamicCache`.
* `metrics.py` — KV byte counting; KV-only Net-Bytes-per-Token formula; capacity-constraint report; projection table to canonical operating points.
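
The sink+window trimming described for `verifier.py` can be sketched on a plain Python list of cache slots. This is a simplified illustration, not the `SinkWindowVerifier` code: `trim_sink_window` is a hypothetical helper, and a real cache would slice K/V tensors per layer rather than a list.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def trim_sink_window(slots: Sequence[T], sink: int, window: int) -> List[T]:
    """Keep the first `sink` slots and the most recent `window` slots."""
    if sink < 0 or window < 0:
        raise ValueError("sink and window must be non-negative")
    if len(slots) <= sink + window:
        return list(slots)  # nothing to evict yet
    head = list(slots[:sink])
    tail = list(slots[len(slots) - window:])
    return head + tail


# Ten cached positions, 0..9, trimmed to 2 sink slots + 3 window slots.
cached_positions = list(range(10))
kept = trim_sink_window(cached_positions, sink=2, window=3)
print(kept)  # [0, 1, 7, 8, 9]

# The next token is still rotated at its global position (10), not at the
# trimmed cache length (5); this is the "global RoPE position" point above.
next_global_position = len(cached_positions)
```

When `sink + window` covers the whole sequence, the function returns the input unchanged, which corresponds to the equivalence regime where no eviction happens.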

## Project layout

```
kv_cache_proposer/
├── proposer.py        # DLM Proposer (masked-diffusion block generator)
├── verifier.py        # AR Verifier with sink+window DynamicCache
├── speculative.py     # Greedy speculative-decoding loop
├── baseline.py        # Reference greedy AR with full DynamicCache
├── metrics.py         # KV byte counting + Net-Bytes-per-Token + projection table
├── run_demo.py        # End-to-end demo + JSON results
└── __init__.py
scripts/
└── smoke_test.py      # Component smoke tests on real weights
results/               # Logs and JSON outputs from runs
requirements.txt
```

## How to run

> **Network requirement**: tests load real Qwen3 weights from the
> HuggingFace cache. The setup scripts (`scripts/setup_mac.sh` /
> `scripts/setup_cuda.sh`) probe `huggingface.co` and download both
> required snapshots (~5 GB total) before tests run. **If you're in
> mainland China or behind a firewall**, set the mirror endpoint
> first:
>
> ```bash
> export HF_ENDPOINT=https://hf-mirror.com
> ```
>
> The setup scripts will then route all downloads through it. If the
> initial connectivity probe fails, the script exits with a clear
> remediation message rather than producing cascading test failures.

```bash
pip install -r requirements.txt
# One-time fix: the dllm-hub modeling file references the broken `dllm`
# package inside an `if __name__ == "__main__":` block; transformers'
# static check_imports flags it. Install a no-op stub at the user's
# site-packages directory (Python-version portable):
python3 -c "import site, os; \
p = os.path.join(site.getusersitepackages(), 'dllm'); \
os.makedirs(p, exist_ok=True); \
open(os.path.join(p, '__init__.py'), 'a').close()"

# Smoke test: tokenizer agreement, model loading, cache invariants
PYTHONPATH=. python3 scripts/smoke_test.py

# Equivalence regime: window >= sequence length => bit-identical to baseline
PYTHONPATH=. python3 -m kv_cache_proposer.run_demo \
    --max-new-tokens 32 \
    --block-size 8 --num-diffusion-steps 8 \
    --sink-size 4 --window-size 64 \
    --batch-size-for-amortization 8 \
    --prompt "Reply with exactly: OK."

# Compression regime: window << sequence => real KV eviction observed
PYTHONPATH=. python3 -m kv_cache_proposer.run_demo \
    --max-new-tokens 64 \
    --block-size 16 --num-diffusion-steps 16 \
    --sink-size 4 --window-size 24 \
    --batch-size-for-amortization 64 \
    --prompt "Write a one-paragraph explanation of why prime numbers are infinite, suitable for a high school student." \
    --results-json results/run_compress.json
```

## Results from the included CPU runs

### 1. Equivalence-regime test (sink+window covers full sequence)

```
prompt                    : "Reply with exactly: OK."
config                    : sink=4, window=64, block_size=8, K=8

baseline (full KV)        : "OK.<|im_end|>"  (3 tokens, peak KV = 3,584 KB)
speculative (sink+window) : "OK.<|im_end|>"  (3 tokens, peak KV = 3,696 KB)
exact match               : True   <- "no intelligence loss" verified
acceptance rate           : 0.375
```

Self-check passes: `sink+window=68 >= full_seq_len=33`, and the output is
bit-identical to the verifier's own greedy decode. With greedy acceptance
and no eviction, speculative decoding reduces to "verifier emits its argmax
everywhere", which is exactly what the baseline computes.
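
The reduction to "verifier emits its argmax everywhere" can be checked on a toy model. The sketch below is not the repository's `speculative.py`: the toy verifier and proposer are hypothetical stand-ins, the verifier is called once per position for clarity (a real verifier scores a whole block in one forward), and no extra token is emitted after a fully accepted block.

```python
from typing import Callable, List, Sequence

Argmax = Callable[[Sequence[int]], int]
Proposer = Callable[[Sequence[int], int], List[int]]


def greedy_decode(prefix: Sequence[int], n: int, verifier_argmax: Argmax) -> List[int]:
    """Baseline: emit the verifier's argmax one token at a time."""
    context = list(prefix)
    for _ in range(n):
        context.append(verifier_argmax(context))
    return context[len(prefix):]


def verify_block(context: Sequence[int], proposed: Sequence[int],
                 verifier_argmax: Argmax) -> List[int]:
    """Accept proposals while they equal the verifier's argmax; on the first
    mismatch emit the verifier's token instead and stop."""
    ctx = list(context)
    emitted: List[int] = []
    for token in proposed:
        target = verifier_argmax(ctx)
        emitted.append(target)  # equals `token` whenever the proposal is accepted
        ctx.append(target)
        if token != target:
            break
    return emitted


def speculative_decode(prefix: Sequence[int], n: int, propose: Proposer,
                       verifier_argmax: Argmax, block_size: int) -> List[int]:
    context = list(prefix)
    while len(context) - len(prefix) < n:
        proposed = propose(context, block_size)
        if not proposed:
            raise ValueError("proposer must return at least one token")
        context.extend(verify_block(context, proposed, verifier_argmax))
    return context[len(prefix):len(prefix) + n]


def toy_verifier(context: Sequence[int]) -> int:
    return (context[-1] * 3 + 1) % 11  # hypothetical deterministic "argmax"


def toy_proposer(context: Sequence[int], block_size: int) -> List[int]:
    out, last = [], context[-1]
    for _ in range(block_size):
        last = (last + 1) % 11  # deliberately imperfect guess
        out.append(last)
    return out


baseline_out = greedy_decode([2], 20, toy_verifier)
speculative_out = speculative_decode([2], 20, toy_proposer, toy_verifier, block_size=4)
print(baseline_out == speculative_out)  # True
```

Every emitted token is the verifier's argmax for its context, so a poor proposer only lowers the number of tokens gained per block; it cannot change the output.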

### 2. Compression-regime test (window << sequence)

```
prompt : "Write a one-paragraph explanation of why prime numbers are infinite ..."
config : sink=4, window=24, block_size=16, K=16, B=64 (for amortization)
S      : 108 tokens (44 prompt + 64 generated)

Persistent (in Net Bytes per Token):
  verifier KV (full DynamicCache, baseline)    = 12.10 MB total = 114,688 B/token
  verifier KV (sink+window, speculative)       =  3.06 MB total =  29,734 B/token
                                                 ── 3.86x verifier-side
  proposer KV                                  = 0 B (recomputed per block)
  proposer weights amortized at B=64,S=108     = 172,468 B/token  (small-S dominates here)
  Net Bytes per Token (KV-only) at this scale  = 202,202 B/token  (compression 0.57x)

Capacity (separate, NOT counted in Net Bytes per Token):
  proposer peak activation (single forward)    = 31.30 MB
  verifier peak activation (single forward)    = 12.75 MB
```

Net Bytes per Token drops below the baseline only once `B*S` is large
enough that the proposer weights amortize away. The framework reports
projected Net Bytes per Token at canonical operating points using the
**empirically measured per-slot KV** and **actual measured weight bytes**
(no extrapolation beyond reusing the slot constant):

```
per-slot verifier KV measured = 114,688 B; cache_budget = 28 slots; proposer KV = 0
--------------------------------------------------------------------------
    B            S    Net Bytes per Token    compression
--------------------------------------------------------------------------
    1        8,192              145,912.0          0.79x   ← single-request, weights dominate
    8        8,192               18,582.0          6.17x
    8       32,768                4,645.5         24.69x
    8      131,072                1,161.4         98.75x
    8    1,048,576                  145.2        790.02x
   32      131,072                  308.7        371.50x
   64      131,072                  166.6        688.36x   ← B=64, S=128k production point
   64    1,048,576                   20.8       5506.92x   ← B=64, S=1M
--------------------------------------------------------------------------
```

These numbers are consistent with the design analysis: at small `B*S` the
proposer's weight bytes dominate; at large `B*S` the only persistent cost
is the bounded `sink+window` KV (28 slots × 114,688 B = 3.06 MB total,
amortized over `S` tokens → ≈25 B/token at S=128k).
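
The crossover point can be worked out from the same constants. In the sketch below the slot size and cache budget come from the table header; the proposer weight total is an assumption, back-derived from the reported 172,468 B/token at B=64, S=108, so it is only approximate. The function names are hypothetical.

```python
SLOT_BYTES = 114_688      # measured per-slot verifier KV (table header)
CACHE_BUDGET_SLOTS = 28   # sink + window
# Assumption: back-derived from the reported amortized figure, hence approximate.
PROPOSER_WEIGHT_BYTES = 172_468 * 64 * 108


def projected_net_bytes(batch_size: int, seq_len: int) -> float:
    """Bounded sink+window KV spread over S tokens, plus amortized weights."""
    bounded_kv = CACHE_BUDGET_SLOTS * SLOT_BYTES / seq_len
    amortized_weights = PROPOSER_WEIGHT_BYTES / (batch_size * seq_len)
    return bounded_kv + amortized_weights


def compression(batch_size: int, seq_len: int) -> float:
    """Baseline per-token KV divided by projected Net Bytes per Token."""
    return SLOT_BYTES / projected_net_bytes(batch_size, seq_len)


def break_even_seq_len(batch_size: int) -> float:
    """Sequence length S at which compression equals 1.0 for a given B.

    Solving (budget * slot + W / B) / S = slot for S gives
    S = budget + W / (B * slot).
    """
    return CACHE_BUDGET_SLOTS + PROPOSER_WEIGHT_BYTES / (batch_size * SLOT_BYTES)


for b in (1, 8, 64):
    print(f"B={b:>2}: break-even S is about {break_even_seq_len(b):,.0f} tokens")
```

Under these assumptions the break-even length is roughly 10,400 tokens at B=1 and roughly 190 tokens at B=64, which agrees with the 0.79x entry at B=1, S=8,192 and the 0.57x figure measured at B=64, S=108.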

## Honest caveats

1. **Verifier model**: Qwen3-1.7B (28 layers, all carrying KV) stands in
   for the still-unreleased Qwen 3.6 (16 of 64 layers carrying KV). Against
   a real Qwen 3.5/3.6 baseline of ~65 KB/token, the *absolute* compression
   ratios above would be lower by a factor of about 1.75; the framework
   code is unchanged.
2. **Acceptance rate is low (~0.12)**. The proposer was trained with masked
   diffusion on Nemotron-SFT-Code by a different research group; it is *not*
   Repr-Align-aligned to Qwen3-1.7B's representation geometry. With a
   same-family Repr-Align proposer (the design's recommended choice),
   reported acceptance rates are 0.6–0.85. **Low acceptance does not break
   correctness** — it costs throughput, not memory.
3. **Proposer activation memory** is dominated by the dense logits buffer
   (`[1, T, V_vocab]`). The included implementation does not use the standard
   "compute logits only at masked positions" optimization — its peak is
   `T * V * 2` bytes per forward. At long contexts this would not fit in
   HBM and the optimization is mandatory; **the activation peak we report
   is therefore the value of `T * V * 2` at the run's actual context
   length, not a long-context projection**. The capacity number is real for
   what we ran; engineering for S=128k requires the masked-positions
   optimization (a few-line change). The Net-Bytes-per-Token numbers are
   independent of this optimization (activation is not in the metric).
4. **CPU runs**. The repository runs end-to-end on a 4-core, 15 GB-RAM CPU
   environment in tens of seconds. GPU runs would just change wall-clock,
   not byte accounting; the Net-Bytes-per-Token numbers are deterministic
   functions of model shapes and the cache budget.
5. **No fallback**. If anything in the pipeline becomes inconsistent
   (cache layout, tokenizer drift, mask leakage from the proposer) the
   code raises immediately. There is no path that silently degrades to
   "just call the verifier".
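
The `T * V * 2` figure in caveat 3 can be reproduced with a few lines of arithmetic. In this sketch the vocabulary size of 151,936 is an assumption (it is not stated above, but it reproduces the reported 31.30 MB at T=108 when MB is read as 2^20 bytes), and the 16 masked positions in the last line are only an illustrative block size.

```python
ASSUMED_VOCAB_SIZE = 151_936  # assumption; not stated in the text above
BYTES_PER_ELEMENT = 2         # 16-bit logits, matching the `* 2` in `T * V * 2`


def logits_buffer_bytes(num_positions: int, vocab_size: int = ASSUMED_VOCAB_SIZE) -> int:
    """Size of a logits buffer holding `num_positions` rows of `vocab_size`."""
    return num_positions * vocab_size * BYTES_PER_ELEMENT


MIB = 2 ** 20
GIB = 2 ** 30

dense_at_run = logits_buffer_bytes(108)           # dense logits at T = 108
dense_at_128k = logits_buffer_bytes(131_072)      # dense logits at T = 128k
masked_only = logits_buffer_bytes(16)             # logits at 16 masked positions only

print(f"dense,  T=108 : {dense_at_run / MIB:.2f} MiB")
print(f"dense,  T=128k: {dense_at_128k / GIB:.1f} GiB")
print(f"masked, 16 pos: {masked_only / MIB:.2f} MiB")
```

The masked-positions buffer does not grow with context length, which is why the optimization becomes mandatory at long contexts while leaving the Net-Bytes-per-Token numbers untouched.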

## What is and isn't being demonstrated

- **Demonstrated**: KV-cache memory bound is enforced and measured (the
  cache really stays at sink+window=28 slots throughout 108-token
  generation); the speculative loop's output is bit-identical to the
  verifier's greedy decode (in the equivalence regime); the
  Net-Bytes-per-Token trade-off curve crosses unity at the predicted
  operating regime.
- **Not demonstrated** (out of scope for a single CPU runnable demo):
  multi-target verifier routing (Qwen / Gemma / DeepSeek), session-affinity
  scheduling, OTA, federated self-learning. Those are platform-level
  components from the design discussion that need separate plumbing.

## Where this is going — local inference engine

The next layer up is a Mac/Ubuntu local inference engine that wraps the
algorithmic core in this repo with continuous batching, async
proposer/verifier pipelining, NF4 KV quantization, and a fixed-slab
KV pool sized for sink+window. Architecture and phased build plan are
in [`docs/local-inference-engine.md`](docs/local-inference-engine.md).

Short version of why the engine **does not use PagedAttention**: the
sink+window invariant turns each session's KV cache into a constant-size
object, so all three problems PagedAttention solves (fragmentation,
prefix sharing, non-contiguous KV) cease to apply. A 30-line fixed-slab
pool replaces it and runs ~5–15% faster because attention kernels see
contiguous memory.
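
A fixed-slab pool of the kind described above might look like the following. This is a hypothetical sketch under stated assumptions, not the engine's implementation: the class name `FixedSlabPool`, its methods, and the use of a `bytearray` as backing store are all illustrative, and a real pool would hand out device tensors rather than byte views.

```python
from typing import List, Set


class FixedSlabPool:
    """Pre-allocates `num_slabs` equal-size slabs in one contiguous buffer."""

    def __init__(self, num_slabs: int, slab_bytes: int) -> None:
        if num_slabs <= 0 or slab_bytes <= 0:
            raise ValueError("num_slabs and slab_bytes must be positive")
        self._slab_bytes = slab_bytes
        self._buffer = bytearray(num_slabs * slab_bytes)
        # Free list in reverse so that pop() hands out slab 0 first.
        self._free: List[int] = list(range(num_slabs - 1, -1, -1))
        self._in_use: Set[int] = set()

    def acquire(self) -> int:
        """Reserve one slab for a session and return its id."""
        if not self._free:
            raise MemoryError("no free KV slab")
        slab_id = self._free.pop()
        self._in_use.add(slab_id)
        return slab_id

    def release(self, slab_id: int) -> None:
        """Return a slab to the pool when its session ends."""
        if slab_id not in self._in_use:
            raise KeyError(f"slab {slab_id} is not in use")
        self._in_use.remove(slab_id)
        self._free.append(slab_id)

    def view(self, slab_id: int) -> memoryview:
        """Contiguous, zero-copy view of one session's slab."""
        if slab_id not in self._in_use:
            raise KeyError(f"slab {slab_id} is not in use")
        start = slab_id * self._slab_bytes
        return memoryview(self._buffer)[start:start + self._slab_bytes]


# One slab per session, sized for 28 slots of 114,688 B each.
pool = FixedSlabPool(num_slabs=2, slab_bytes=28 * 114_688)
session_a = pool.acquire()
session_b = pool.acquire()
pool.release(session_a)
session_c = pool.acquire()  # reuses the slab that session_a gave back
```

Because every session needs exactly one slab of the same size, allocation and release are constant-time list operations and no slab is ever split, so fragmentation cannot arise.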

## Architecture Decision Records

Design decisions that the rest of the codebase depends on are recorded
in [`docs/adr/`](docs/adr/). New contributors and agents should read the
ADR index before changing proposer / verifier / training code; the ADRs
explain *why* a particular design was chosen and which alternatives were
explicitly rejected.

- [ADR 0001 — Proposer sizing, alignment strategy, and verifier
  decoupling](docs/adr/0001-proposer-sizing-and-alignment.md): the
  load-bearing decision behind why we keep the proposer in a fixed
  0.25–1 B band, treat EAGLE-3 representation alignment as the canonical
  training recipe, and design verifier swaps to be data-and-fine-tune
  operations rather than re-architecture operations.
- [ADR 0002 — Verifier selection, quantization, and the
  open-vs-closed-weight constraint](docs/adr/0002-verifier-selection-and-quantization.md):
  the v1/v2 ship sequence (Qwen3-1.7B bf16 → Qwen3-8B 4-bit), the 60 %
  memory rule for choosing bf16 vs 4-bit, and why closed-weight APIs
  (GPT/Claude/Gemini) cannot be aligned with EAGLE-3 and are out of
  scope for v1 / v2.
