ADR 0015 — Kakeya Inference Engine: a product-grade vLLM replacement, Kakeya Attention native

  • Status: Accepted (north star + algorithm definition); engine = in design
  • Date: 2026-06-15 (rev. 2026-06-15)
  • Supersedes/extends: the "Kakeya Attention vs PagedAttention/RadixAttention" framing in README; ADR 0014 §3.4.

North star (highest goal — this governs everything below)

The Kakeya Inference Engine is a product-grade inference engine whose goal is to replace vLLM. It is not a research script, not a technique bolted onto HuggingFace transformers, and not "vLLM with a different cache". Its native, first-class attention algorithm is Kakeya Attention — sink+window bound + f_θ KV-projection + dLLM-proposer restoration, as one primitive. The whole engine (prefill, KV management, admission/scheduling, kernels) is designed around bounded-KV + on-demand restoration as the default invariant, exactly the way vLLM is designed around full-KV PagedAttention.

Everything else in this ADR serves that objective. Explicitly:

  • The product target is measured against vLLM by the engine, never by the research bench. The bench (k3_cuda_multitenant_parallel_bench.py, eager HF transformers) is only a correctness/feasibility probe and is not a thing we ship or benchmark as "Kakeya".
  • "Parity → win" with vLLM is an engine deliverable, not a roadmap of vLLM features to copy.

Why "borrow vLLM's pipeline" is the wrong plan (rejected)

vLLM's architecture is full-KV-centric: paged blocks store the whole history; chunked prefill and flash masking are optimizations for processing and storing the full KV. Porting Kakeya's bounded cache onto that pipeline inherits a full-KV-shaped engine and caps the advantage at whatever the cache layout saves — i.e. it makes Kakeya a vLLM feature, not a replacement.

A product Kakeya engine instead makes bounded-KV the native invariant:

  • the full history is never resident; evicted context is reconstructed by the proposer on demand (Kakeya Attention);
  • prefill produces the bounded resident set + the restoration path in one pass — it must never materialize the O(N·T²) attention mask or full-vocab logits that sink the research bench;
  • admission/scheduling sizes sessions by their peak window, not their total token count — this is the structural source of the concurrency win;
  • graph-captured decode, fused-MoE, efficient masking are table stakes any product engine needs, implemented in service of the Kakeya-native design — not as a port of vLLM's full-KV pipeline.
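
The admission rule above can be illustrated with a minimal sketch. It assumes a single KV budget counted in tokens and ignores the exact full-attention layers and per-layer differences; `Session`, `resident_tokens`, and `admit` are hypothetical names for illustration, not engine APIs, and the 4+64 sink/window split is an assumed example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Session:
    total_tokens: int  # full history length (prompt + generated)
    sink: int          # always-resident leading tokens (assumed value)
    window: int        # sliding window of most recent tokens (assumed value)

def resident_tokens(s: Session) -> int:
    """Tokens held in KV under a sink+window bound (the session's peak window)."""
    return min(s.total_tokens, s.sink + s.window)

def admit(sessions, budget_tokens, bounded):
    """Greedy admission: accept sessions in order while the token budget holds."""
    admitted, used = [], 0
    for s in sessions:
        cost = resident_tokens(s) if bounded else s.total_tokens
        if used + cost <= budget_tokens:
            admitted.append(s)
            used += cost
    return admitted

sessions = [Session(total_tokens=62_000, sink=4, window=64) for _ in range(100)]
full_kv = admit(sessions, budget_tokens=1_000_000, bounded=False)
bounded_kv = admit(sessions, budget_tokens=1_000_000, bounded=True)
print(len(full_kv), len(bounded_kv))
```

Sizing by total token count admits 16 of these sessions into the budget; sizing by peak window admits all 100. Real numbers depend on the exact-layer KV, which this sketch leaves out.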

vLLM's three prefill techniques, through the Kakeya lens (adopt / wrap / drop)

vLLM gets its 62k concurrency from three prefill-engineering pieces. They are not copied wholesale; each is reinterpreted against the bounded-KV-native design — two are adopted (one reinterpreted, one wrapped), the third is structurally unnecessary:

| vLLM technique | What it solves for vLLM | Kakeya engine stance |
|---|---|---|
| Chunked prefill | Process a long prompt in fixed token blocks so mask/activation memory is O(N·chunk·d), not O(N·T²) | Adopt: it is our chunked restoration. Restoration is inherently incremental: consume the prompt in fixed blocks, emitting the bounded resident set + restoration path per block. Chunking is native to how restoration works, not a bolt-on. |
| FlashAttention (native causal + sliding-window kernel) | Compute attention without materializing a [.,.,T,T] mask/score tensor | Wrap and use directly. Window is a kernel parameter; we call the flash kernel over the Kakeya window. It is a table-stakes kernel, not an architecture, so there is no reason to reinvent it. |
| Paged KV | Store the whole growing KV in non-contiguous pages so a large, ever-growing cache fits and shares | Not needed, structurally. Paging is a solution to the problem of storing a growing full KV. Kakeya is on-demand KV restoration: the resident KV is bounded (sink+window + exact layers) and the full history is never stored, so there is no growing full-KV to page. The problem paging solves does not exist in a Kakeya-native engine. |

So the engineering Kakeya needs is **chunked restoration + a wrapped flash kernel + native bounded-KV management**, not PagedAttention. This is the concrete sense in which the engine replaces vLLM's design rather than extending it.
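
As a rough illustration of why chunking keeps prefill memory bounded, the sketch below consumes a prompt in fixed blocks and trims the resident set back to sink+window after each block. It tracks token positions only, as a stand-in for KV entries, and does not model the restoration path; `chunked_prefill` and its parameters are hypothetical names.

```python
def chunked_prefill(n_tokens, chunk, sink, window):
    """Return (final resident positions, peak resident count) for a chunked pass."""
    resident = []  # stand-in for resident KV entries, stored as token positions
    peak = 0
    for start in range(0, n_tokens, chunk):
        block = list(range(start, min(start + chunk, n_tokens)))
        resident.extend(block)            # this block's KV is briefly resident
        peak = max(peak, len(resident))
        if len(resident) > sink + window:
            # Keep the sink tokens plus the most recent `window` tokens.
            resident = resident[:sink] + resident[-window:]
    return resident, peak

resident, peak = chunked_prefill(n_tokens=1000, chunk=128, sink=4, window=64)
print(len(resident), peak)
```

In this sketch the peak resident count never exceeds sink + window + chunk, independent of prompt length, which is the property the chunked-restoration row relies on.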

Kakeya Attention — the native algorithm

Kakeya Attention = sink+window bound + f_θ KV-projection + dLLM-proposer restoration, taken as one primitive. Peer of, and replacement for, the attention layer in current engines:

| Algorithm | Replaces | Keeps full KV? |
|---|---|---|
| eager / FlashAttention | the compute layer | yes |
| vLLM PagedAttention / SGLang RadixAttention | the storage layer | yes |
| Kakeya Attention | compute + storage | no (bounded; evicted KV reconstructed on demand) |

FlashAttention makes attention compute cheaper; Paged/Radix make the same total KV cheaper to allocate/share. Kakeya Attention makes the total itself bounded.
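
To make the sink+window bound concrete, here is a minimal NumPy sketch of single-query decode attention over only the resident keys. It covers the bounding step alone and leaves out f_θ and proposer restoration; the function names and the 4+64 split are illustrative assumptions. The sketch indexes into a full key array for convenience, whereas a bounded engine would hold only the resident rows.

```python
import numpy as np

def resident_positions(t, sink, window):
    """Key positions visible to the query at position t under a sink+window bound."""
    recent_start = max(sink, t - window + 1)
    return list(range(min(sink, t + 1))) + list(range(recent_start, t + 1))

def bounded_decode_attention(q, keys, values, t, sink, window):
    """Softmax attention for one query over the resident keys only (no T x T tensor)."""
    idx = resident_positions(t, sink, window)
    scores = keys[idx] @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values[idx], idx

rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 16))
values = rng.standard_normal((1000, 16))
q = rng.standard_normal(16)
out, idx = bounded_decode_attention(q, keys, values, t=999, sink=4, window=64)
print(len(idx), out.shape)
```

The score vector has at most sink + window entries regardless of how long the history is.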

Where the win is real (model architecture matters)

Resident KV is dominated by the full-attention layers (they hold full context in any engine). So the engine's advantage over vLLM scales with the model's full-attention fraction:

  • gemma-4-26B-A4B is 25/30 natively sliding → vLLM already bounds those layers, and gemma-4 keeps recall 1.0 at sliding_window=68 with no restoration at all → no Kakeya moat on gemma-4 (probed; see below). It is the wrong showcase model.
  • Full-attention models (Qwen/Llama, no native sliding): shrinking the window without restoration destroys recall, so f_θ+proposer restoration is the only way to bound memory at full recall — and vLLM, having no restoration, must keep full KV and cannot match it. This is the engine's target regime.
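
A back-of-the-envelope sketch of the layer-mix argument, counting resident KV in token-layers (tokens held, summed over layers). The 25-of-30 split and sliding_window=68 come from the text above; the 30-layer dense model and the 62,000-token context are assumptions for illustration, and head counts and dtypes are omitted because they scale every case equally.

```python
def resident_token_layers(n_layers, n_full, context, window):
    """Return (sliding, full) resident KV size in token-layers for one session."""
    sliding = (n_layers - n_full) * min(context, window)
    full = n_full * context
    return sliding, full

# Hybrid model: 25 of 30 layers sliding, 5 exact full-attention layers.
hybrid = resident_token_layers(n_layers=30, n_full=5, context=62_000, window=68)
# Hypothetical 30-layer full-attention model served with full KV.
dense = resident_token_layers(n_layers=30, n_full=30, context=62_000, window=68)
# The same dense model if every layer could be bounded (this needs restoration).
dense_bounded = resident_token_layers(n_layers=30, n_full=0, context=62_000, window=68)

full_share = hybrid[1] / sum(hybrid)
print(hybrid, dense, dense_bounded, round(full_share, 4))
```

Under these assumptions the five full-attention layers hold over 99% of the hybrid model's resident KV, and the dense model's gap between full and bounded KV is roughly three orders of magnitude.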

Mac (MLX) interactive engine — full verifier/proposer/f_θ pipeline, f_θ default-ON

The Apple-Silicon interactive CLI (scripts/research/k3_integrated_niah_eval_mac.py --chat) runs the full Kakeya engine — gemma-4 verifier (MLX) + DFlash proposer (fused spec-decode) + f_θ K/V restoration + S5 bounded KV — not verifier-only. It reuses the validated fused_specdecode_generate_mlx_trim per-turn sequence; the NIAH evidence loop is untouched.

  • f_θ runs by default in chat. --force-f-theta is auto-enabled in --chat unless the fast all-MLX path (--all-mlx-drafter, f_θ bypassed) is explicitly chosen. It bypasses the S5 native-prefill short-circuit so f_θ executes each turn: it projects the proposer's hidden states → verifier K/V for the 25 sliding layers and injects them.
  • gemma-4 caveat (honest). On gemma-4 those restored sliding-layer K/V are recall-irrelevant — the 5 exact full-attention layers carry recall (the "S5 free lunch"), so f_θ's output is effectively discarded by the recall path. We still run f_θ by default so the full verifier/proposer/f_θ pipeline is exercised end-to-end; on full-attention models the same f_θ path is load-bearing (it is the only way to bound memory at full recall).
  • Forensic — when f_θ stopped running. f_θ was silently bypassed under --s5-exact-full-attn on 2026-06-12 by the "Optimize MLX adaptive S5 native smoke path" commits (b3a04d0 / 1f6e58c), which made build_restoration short-circuit to {} under S5; the same "adaptive S5 native" path also let the proposer go to blocks=0 while keeping the fused label — caught by the evidence gate (0a6fb19, "enforce PR #109 review constraints") which added --force-fused-specdecode. Both squashed into main via #117. f_θ remained S5-bypassed until --force-f-theta (this ADR's change) made it default-on in the interactive chat.
  • Measured (Mac M4, via the git-bus bridge). Both chat turns: f_theta_ran=TRUE restoring the 25 sliding layers + proposer blocks=2/4, mean_accept_len=4.0/3.5 + correct answers ("Paris"; "red, yellow, and blue") + natural <end_of_turn> stop + bounded resident KV (12–18 MB). Torch-bridge path is slow (~0.5–6 tok/s); the all-MLX path (proposer-only) is the fast option.
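
The projection step described above (proposer hidden states → verifier K/V per sliding layer) can be pictured with the sketch below. The source does not specify f_θ's functional form; this assumes one linear map per layer for K and one for V, and every name, shape, and layer index here is hypothetical.

```python
import numpy as np

def project_kv(hidden, w_k, w_v, n_kv_heads, head_dim):
    """Map hidden states [T, d_prop] to K and V of shape [n_kv_heads, T, head_dim]."""
    t = hidden.shape[0]
    k = (hidden @ w_k).reshape(t, n_kv_heads, head_dim).transpose(1, 0, 2)
    v = (hidden @ w_v).reshape(t, n_kv_heads, head_dim).transpose(1, 0, 2)
    return k, v

def restore_sliding_layers(hidden, weights, n_kv_heads, head_dim):
    """weights: {layer_index: (w_k, w_v)} for the layers being restored (assumed layout)."""
    return {
        layer: project_kv(hidden, w_k, w_v, n_kv_heads, head_dim)
        for layer, (w_k, w_v) in weights.items()
    }

rng = np.random.default_rng(0)
d_prop, n_kv_heads, head_dim, t = 32, 2, 8, 5
hidden = rng.standard_normal((t, d_prop))
weights = {
    layer: (rng.standard_normal((d_prop, n_kv_heads * head_dim)),
            rng.standard_normal((d_prop, n_kv_heads * head_dim)))
    for layer in (0, 1, 2)  # illustrative layer indices only
}
restored = restore_sliding_layers(hidden, weights, n_kv_heads, head_dim)
print(sorted(restored), restored[0][0].shape)
```

The returned dictionary is keyed by layer so that only the restored layers are injected; exact full-attention layers would be left untouched.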

Feasibility probes so far (informed the design — NOT the product)

These ran on the eager-transformers research bench; they validate correctness and locate the engine's required invariants. They are not the product engine.

  • Restoration prefill, memory-efficient (SDPA + chunked logits + bf16 K/V) — unblocked long-context execution at recall 1.0 (16k N=1-only→N=4, 32k OOM→N=2, 62k OOM→N=1). Confirms the engine must avoid O(N·T²) masks / full-vocab logits.
  • gemma-4 bounded decode — recall 1.0 at sliding_window=68 natively (no restoration); native bounded decode still fits only N=2 @62k vs vLLM's 15.5 because the bench does non-chunked prefill. Confirms (a) gemma-4 is the wrong showcase, (b) the engine — not bench retrofits — is what must beat vLLM.

(See docs/reports/kakeya-vs-vllm-longcontext-h200.md.)

Consequences

  • The repo's highest engineering goal is the product Kakeya Inference Engine replacing vLLM, with Kakeya Attention as its native algorithm; the eager research bench is a probe only and is never reported as "Kakeya performance".
  • The engine is designed bounded-KV-native (admission by peak window; restoration fused into prefill/decode), not as a port of vLLM's full-KV pipeline.
  • The vLLM-beating demonstration is to be run on a full-attention verifier, where restoration is load-bearing.

KIE-v2 direction (recommended): Kakeya Attention as a vLLM backend

Rather than rebuild vLLM's fused-MoE + graph runtime (KIE-v1.1.z2, blocked), run Kakeya Attention inside vLLM as an attention backend / KV-cache plugin. Feasibility (decode throughput ≈ vLLM): low-risk YES — the dominant fused-MoE + graph + scheduler stays vLLM's (inherited), and Kakeya owns only the attention op (minor decode fraction, over smaller/quantized KV, restoration-free at decode → graph-capturable). The work is vLLM-backend conformance + one graph-safe quantized-exact attention kernel (extends vLLM's fp8-KV attention), not a runtime rebuild. Memory differentiation is large only on full-attention models (gemma-4 is already hybrid-bounded by vLLM). Full assessment: docs/design/kakeya-vllm-backend-feasibility.md.

Milestone tracking (task encoding)

Engine work is coded KIE-v1.x (Kakeya Inference Engine), governed by this ADR and §9 of docs/design/kakeya-inference-engine-architecture.md. One milestone = one PR, so development context stays per-task:

| Code | Milestone | Status | PR |
|---|---|---|---|
| KIE-v1 | engine core: chunked restoration prefill + bounded-KV decode + peak-window admission (NativeHybridBounded) | done (core); concurrency gated on v1.1 | #135 |
| KIE-v1.1 | realize the bounded-KV bound at runtime: sliding-window-evicting cache without the CUDA-graph segfault (evicting cache, graph capture off) + push concurrency toward the ceiling | done: gemma-4 62k concurrency N=4→N=24 (recall 1.0; chunk-size tuning), 1.55× vLLM's 15.5. Decoupled prefill/decode implemented (correct) but fragmentation-limited. | #136 |
| KIE-v1.1.x | exact-layer KV quantization toward the N=34+ ceiling | partial: int8/int4 exact-layer quant de-risked recall-safe (recall 1.0 @62k); genuine int8 storage implemented + correct (halves stored bytes). BUT N=34 still OOMs: the dequant-on-read returns full bf16 per exact layer (transients coexist), so peak doesn't drop. v04.kv_compressor doesn't help (round-trips, no RAM cut); QuantizedCache not hybrid-aware. | #137 |
| KIE-v1.1.y | quantized attention (tiled online-softmax over the int8 exact-layer store, no full-bf16 transient) | done (concurrency): gemma-4 62k N=24→N=60 at recall 1.0 (111.7 GB), ≈3.9× vLLM's 15.5; numerically exact vs SDPA. Decode-speed weak: ~25.6 tok/s aggregate @N=8 (~3.2/session) vs vLLM ~98/session (eager tiled loop). | #138 |
| KIE-v1.1.z | throughput + N=75 | N=75 MET (recall 1.0, 126.7 GB, ≈4.8× vLLM; int8+quant-attn). kakeyalattice v1.6.1 gemma-4 fix verified (recall 1.0). decode ≥ vLLM NOT met: ~31 tok/s aggregate (eager 26B-MoE forward dominates; torch.compile-attn 6.6× but 0% e2e; static-cache auto-compile segfaults). | #139 |
| KIE-v1.1.z2 | fused-MoE + graph-captured full-forward decode (rebuild vLLM's runtime) | abandoned: superseded by KIE-v2 (run on vLLM instead of rebuilding it) | |
| KIE-v2 | Kakeya Attention on the vLLM runtime (bounded window + restoration + quantized-exact as a vLLM backend) | decode-parity MET + EXCEEDED: gemma-4 bounded-window (sw=68) on vLLM = 195.6/231.9/539/1079 tok/s vs vLLM-default 159.3/198.6/467.5/894.9 @N=1/4/8/70, ~1.15–1.23× faster, recall 1.0 (ctx16k). Full backend (restoration + quantized-exact for full-attn models) = next. | #140 |
| v0.5-cuda | release: package KIE-v2 as inference_engine.engine.KakeyaVLLM (Kakeya window on vLLM's Apache-2.0 fused-MoE + CUDA-graph + scheduler), consolidate reports, README | done (gemma-4 instantiation): engine evidence = committed gemma-4 KIE-v2 numbers (recall 1.0 @ sw=68 via the S5 free lunch, no trained f_θ/proposer needed on gemma-4). KakeyaVLLM wrapper plumbing smoke-tested on H200 (CUDA graphs captured, window→vLLM config, generate returns) on Qwen3-4B; smoke test only, NOT engine validation (Qwen3 has no trained f_θ/proposer → restoration never ran). Model-aware text_config nesting (gemma nested / Qwen flat); 13 config unit tests. Scorecard: docs/reports/kakeya-inference-engine-v0.5-cuda.md. | this PR |
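
The KIE-v1.1.y idea (attend over an int8 store tile by tile, so no full-precision copy of the whole layer ever exists) can be sketched in NumPy as follows. This is an illustrative single-query version under assumed per-row symmetric quantization, not the engine's kernel; all function names are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Per-row symmetric int8 quantization: x is approximated by q * scale."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def tiled_quant_attention(query, k_q, k_s, v_q, v_s, tile=64):
    """Online-softmax attention that dequantizes one tile of K/V at a time."""
    m, l = -np.inf, 0.0                        # running max and softmax denominator
    acc = np.zeros(v_q.shape[-1], dtype=np.float32)
    for start in range(0, k_q.shape[0], tile):
        sl = slice(start, start + tile)
        k = k_q[sl].astype(np.float32) * k_s[sl]   # only this tile is dequantized
        v = v_q[sl].astype(np.float32) * v_s[sl]
        s = k @ query / np.sqrt(query.shape[-1])
        m_new = max(m, float(s.max()))
        correction = np.exp(m - m_new)             # rescales earlier partial sums
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v
        m = m_new
    return acc / l

def reference_attention(query, k, v):
    """Plain softmax attention over fully dequantized K/V, for comparison."""
    s = k @ query / np.sqrt(query.shape[-1])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ v

rng = np.random.default_rng(0)
k_q, k_s = quantize_int8(rng.standard_normal((300, 32)).astype(np.float32))
v_q, v_s = quantize_int8(rng.standard_normal((300, 32)).astype(np.float32))
query = rng.standard_normal(32).astype(np.float32)
tiled = tiled_quant_attention(query, k_q, k_s, v_q, v_s)
ref = reference_attention(query, k_q.astype(np.float32) * k_s,
                          v_q.astype(np.float32) * v_s)
print(np.allclose(tiled, ref, atol=1e-4))
```

The tiled result matches softmax attention over the same dequantized values to floating-point tolerance, while the largest full-precision buffer is one tile rather than the whole layer. An eager Python loop like this one is also slow, which is consistent with the decode-speed weakness noted in the table.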

KIE-v1.1.z2 port attempt (2026-06-16) — all clean paths blocked:

  • HF kernels hub (lowest-surgery kernelize): kernels is version-incompatible with transformers 5.12 (hub_kernels.LayerRepository → "Either a revision or a version must be specified"); installing it breaks transformers' import. Needs a pinned kernels/transformers pair.
  • vLLM fused_moe: present but in a separate venv + expects vLLM's model layout (stacked expert weights, routing glue) — major cross-venv surgery.
  • From-scratch Triton fused-MoE + graph capture + custom quant-attention = multi-week expert kernel project.
  • Segfault (static+chunked+long) is structural, not patchable: gemma-4's forward has non-graph-capturable ops (chunked-prefill windowed copy_ eviction; data-dependent MoE routing). Decode-only capture of the eager MoE bakes in routing → wrong output, so no capture shortcut. The fix == the graph-safe fused MoE (same blocked port).

Verdict: decode-throughput parity with vLLM is a standalone vLLM-class runtime project, not reachable in-session. The memory/concurrency axis is done (N=75, recall 1.0, ≈4.8× vLLM); the decode-speed axis is gated on a fused-MoE+graph runtime whose clean ports are currently blocked.

kakeyalattice v1.6 note. v1.6 genuinely fixes the compressor (bit-packed KakeyaLatticePackedCache, real 2.46× HBM at D4 Q=38, contiguous SDPA-feedable decode, O(N)); kv_compressor.make_packed_kv_cache exposes it. But it (a) trips a uniform-head-dim assertion on Gemma-4's hybrid layers, (b) gains only ~2.46× vs the int8 path's 1.94× at D=256, and (c) still returns bf16 to SDPA, so the per-layer dequant transient (the N>34 blocker) is unchanged. v1.6 improves the storage floor, not the decode peak; the decisive lever is still KIE-v1.1.y quantized attention, which would consume v1.6's packed codes.

| Code | Milestone | Status | PR |
|---|---|---|---|
| KIE-v1.2 | FThetaRestored policy on a full-attention verifier (Qwen/Llama): the decisive vLLM win | planned | - |

Evidence

  • docs/reports/kakeya-vs-vllm-multitenant-h200.md (ctx-1238, same H200)
  • docs/reports/kakeya-vs-vllm-longcontext-h200.md (16k–62k probes + KV model)
  • results/research/{k3_cuda_multitenant_parallel,vllm_multitenant_parallel,gemma_bounded_decode}_h200nvl_*.{json,log}