Kakeya Inference Engine v0.5 for CUDA — release scorecard

What v0.5-cuda ships: Kakeya Attention's bounded-window (S5) KV management running on the vLLM runtime — so the three runtime components the engine needs are inherited unchanged (all Apache-2.0):

component	owner in v0.5-cuda	role
Fused MoE Triton kernel	vLLM (Apache-2.0)	grouped-GEMM expert kernel — the dominant ~90 % of the gemma-4-26B-A4B decode forward
CUDA graphs	vLLM (Apache-2.0)	fixed-shape decode capture (`enforce_eager=False`) — removes per-token launch overhead
Continuous-batching scheduler	vLLM (Apache-2.0)	request scheduler + paged KV-manager — drives multi-tenant throughput
Kakeya Attention (bounded window / KV)	Kakeya	bounds the resident sliding-layer KV to `sink + window` (S5 = 68)

Entrypoint: inference_engine.engine.KakeyaVLLM (see Usage below). This is the KIE-v2 strategy from ADR 0015 and the feasibility note: rather than rebuild vLLM's fused-MoE + graphs + scheduler (a multi-component kernel project that was shown blocked in KIE-v1.1.z), Kakeya Attention runs on vLLM and contributes the bounded-KV attention layer.

Honest scope. v0.5-cuda is the gemma-4 bounded-window instantiation: gemma-4's hybrid (5 full + 25 sliding) needs no per-token restoration (the S5 "free lunch"), so the bounded window is delivered via vLLM hf_overrides. The restoration backend (f_θ + dLLM-proposer at prefill) for full-attention models (Qwen/Llama) — the large ~6× memory differentiator — is the v0.6 roadmap item, not in this release.

1. Token throughput — decode tok/s ≥ vLLM (H200, gemma-4-26B-A4B, recall 1.0)

Kakeya bounded window (sliding_window=68) on vLLM vs vLLM default (sliding_window=1024), same H200, ctx 16k, gen 128, recall 1.0 throughout:

N (concurrent)	Kakeya-on-vLLM decode tok/s	vLLM default	ratio
1	195.6	159.3	1.23×
4	231.9	198.6	1.17×
8	539.0	467.5	1.15×
70	1079.0	894.9	1.21×

Decode throughput exceeds vLLM by ~1.15–1.23× at recall 1.0 — it inherits vLLM's fused-MoE + CUDA-graph + scheduler, and Kakeya's tighter sliding window makes the sliding-layer attention cheaper than vLLM's default. This is the axis the eager research engine (KIE-v1.1.y/z, ~25–31 tok/s aggregate) could not reach; on vLLM it is 195–1079 tok/s.

Long context (ctx 62k, N=70) — the edge shrinks on gemma-4:

metric (62k, N=70)	Kakeya-on-vLLM sw=68	vLLM default sw=1024	ratio
decode tok/s	20.38	19.03	1.07×
vLLM max concurrency (66k req)	16.15×	15.51×	1.04×

The edge falls from ~1.2× (16k) to ~1.07× (62k) because gemma-4's 5 full-attention layers hold full-ctx KV in both configs and dominate the footprint — shrinking the sliding window is only a ~4–7 % saving. This is the gemma-4 caveat (see §3); the large win needs a full-attention model.

2. Parallel inference (concurrency)

On vLLM (v0.5-cuda path): the bounded window is measured to N=70 concurrent sessions at ctx 16k, recall 1.0, while increasing decode tok/s vs vLLM default (§1) — vLLM's continuous-batching scheduler + the smaller resident window scale cleanly.
Research engine ceiling (KIE-v1.1.y/z, eager, demonstrator): the bounded-KV
- int8 + quantized-attention path reached N=75 @ 62k, recall 1.0 (≈4.8× vLLM's 15.5 concurrency ceiling) — proving the memory/concurrency axis even before moving to the vLLM runtime. Decode speed on that eager path was the weak axis (~31 tok/s aggregate), which is exactly what running on vLLM fixes.

So v0.5-cuda gets both axes: vLLM's decode speed and the bounded-window concurrency, with recall 1.0.

3. Memory-saving efficiency (honest, model-dependent)

The bounded-KV win scales with the model's full-attention fraction:

model class	full-attn layers	Kakeya resident-KV edge vs vLLM	shipped in
gemma-4-26B-A4B (hybrid 5 full / 25 sliding)	5 / 30	~7 % @ 62k (vLLM already hybrid-bounds 25/30; the 5 full layers dominate both)	v0.5-cuda
full-attention (Qwen/Llama, all layers full)	all	~6× (vLLM keeps all full; Kakeya keeps exact + window + restoration)	v0.6 (restoration backend)

The bounded-KV cost model (inference_engine.engine.admission, 9 unit tests): per-session resident 2.56 GB @ 62k vs full-KV 15.2 GB — a ~6× edge — but that edge is only realized end-to-end on a full-attention model. On gemma-4 the native hybrid means vLLM already bounds the sliding layers, so the measured long-context saving over vLLM is the honest ~7 %.
Takeaway: v0.5-cuda's memory win on gemma-4 is modest-but-real; the engine's large memory differentiation is a full-attention-model property delivered by the v0.6 restoration backend. We report this rather than overclaim gemma-4.

4. Usage

from inference_engine.engine import KakeyaVLLM
from vllm import SamplingParams

# Kakeya Attention (S5 bounded window) on vLLM's fused-MoE + CUDA-graph + scheduler.
engine = KakeyaVLLM(
    "google/gemma-4-26b-a4b-it",
    sink=4, window=64,        # Kakeya S5 window (total resident = 68)
    max_model_len=16384,
)
out = engine.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128))

KakeyaVLLM builds a vllm.LLM with hf_overrides={"sliding_window": 68, "text_config": {"sliding_window": 68}} and enforce_eager=False (CUDA graphs + fused-MoE on). The pure config layer (kakeya_hf_overrides, KakeyaVLLMConfig) is torch/vllm-free and unit-tested (tests/inference_engine/engine/test_kakeya_vllm.py).

5. Verification status — what is and is not validated

The engine claim (gemma-4) — validated, and it does NOT depend on a trained f_θ/proposer.

✅ Throughput / concurrency / recall numbers (gemma-4-26B) above were measured on H200 (Vast.ai) and committed in the KIE-v2 integration (scripts/research/vllm_multitenant_parallel_bench.py --sliding-window 68; commits 7ec3a03, 48ded1e, e2cf137). Recall 1.0 at sliding_window=68 holds on gemma-4 because its 5/30 native full-attention layers carry recall with no restoration (the "S5 free lunch") — so this instantiation genuinely needs no trained f_θ/proposer. That is the entire reason v0.5-cuda is a gemma-4 release.

The Qwen3-4B run was a wrapper PLUMBING smoke test — NOT engine/algorithm validation. Read this honestly:

✅ It proves only the KakeyaVLLM Python wrapper plumbing: it constructs a vLLM engine, the override value lands in vLLM's config (hf_config.sliding_window == 68), CUDA graphs capture, and generate() returns coherent text (Paris / 2+2=4 / story) at 777 tok/s. It also caught a real bug (text_config injection crashing text-only models).
❌ It does NOT validate Kakeya Attention on Qwen3. Qwen3-4B has no trained f_θ and no trained proposer, so restoration never ran — a bounded window without restoration is naive truncation, not Kakeya Attention. The test prompts were also < 68 tokens (inside the window), so nothing was even evicted; recall/memory were not exercised at all. On a full-attention model, window=68 without trained restoration would destroy recall — which is exactly why the full-attention path is v0.6 (after f_θ/proposer training), not v0.5.
Evidence (clearly labelled as a wrapper smoke test): kakeya_vllm_v05_h200_validation.log.

Bug fix: the wrapper auto-detects text_config nesting (multimodal gemma-4 → nested; text-only Qwen/Llama → flat). Config layer unit-tested (13 tests).

6. What's next (v0.6)

The restoration backend for full-attention models (Qwen/Llama): inject f_θ + dLLM-proposer restoration at vLLM prefill and a graph-capturable quantized-exact attention kernel, to realize the ~6× resident-KV edge end-to-end on vLLM. That is the genuine custom-backend work (ADR 0015 §KIE-v2 caveats); v0.5-cuda proves the throughput axis and ships the gemma-4 bounded-window engine.

Evidence

Throughput tables: docs/design/kakeya-vllm-backend-feasibility.md §5b
Concurrency / memory journey: docs/reports/kakeya-engine-vs-vllm-h200.md
Architecture / milestones: docs/adr/0015-kakeya-attention-and-engine-substrate.md
Entrypoint: inference_engine/engine/kakeya_vllm.py; tests: tests/inference_engine/engine/test_kakeya_vllm.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kakeya Inference Engine v0.5 for CUDA — release scorecard

1. Token throughput — decode tok/s ≥ vLLM (H200, gemma-4-26B-A4B, recall 1.0)

2. Parallel inference (concurrency)

3. Memory-saving efficiency (honest, model-dependent)

4. Usage

5. Verification status — what is and is not validated

6. What's next (v0.6)

Evidence

FilesExpand file tree

kakeya-inference-engine-v0.5-cuda.md

Latest commit

History

kakeya-inference-engine-v0.5-cuda.md

File metadata and controls

Kakeya Inference Engine v0.5 for CUDA — release scorecard

1. Token throughput — decode tok/s ≥ vLLM (H200, gemma-4-26B-A4B, recall 1.0)

2. Parallel inference (concurrency)

3. Memory-saving efficiency (honest, model-dependent)

4. Usage

5. Verification status — what is and is not validated

6. What's next (v0.6)

Evidence