Detailed record of the per-session-binding (PR-A3c) work and its end-to-end tests, from the single-tenant pressure finding through batched parallel throughput, the gRPC served path, and the batched scheduler. Summarized in ADR 0014 §3.4–3.7; this report carries the full methodology, numbers, and evidence index.
| Component | Detail |
|---|---|
| GPU | NVIDIA H200 NVL (143 GB), vast.ai; torch 2.12.0+cu130 |
| Verifier | google/gemma-4-26B-A4B-it (bf16, eager attention) |
| Drafter | z-lab/gemma-4-26B-A4B-it-DFlash |
| f_θ | results/research/f_theta_v5_s5_sliding (S5, 5 exact full-attn layers) |
| Mac | Mac mini M4 (24 GB) via the git-bus Mac bridge (single-tenant pressure, §3.4 source) |
| Recall task | NIAH (needle-in-a-haystack); per-session distinct needle |
| Bottom line | recall must stay 1.0 — recall-sacrificing configs (pure sink+window) are out of scope |
The first capacity test measured the v0.3 served path (grpc_agent_capacity_loadtest.py):
256 concurrent agent connections admitted, but v0.3 is single-tenant — one
shared verifier, RPCs serialized on one asyncio loop, no per-session KV isolation.
So "256" = connections served, not parallel inferences; concurrent sessions
would corrupt each other's KV. PR-A3c (per-session binding) is the fix.
mlx_multitenant_pressure.py (Mac M4), per-agent KV at ctx2048, 21 GB budget:
| config | per-agent KV | budget hit at | derived max agents |
|---|---|---|---|
| MLX-native (gemma hybrid) | 256.9 MB | N=15 | ~22 |
| Kakeya S5 (recall-preserving) | 61.1 MB | N=32 | ~93 |
→ ~4.2× more concurrent agents at equal context, with recall preserved. This is the gain over native, which already bounds sliding layers to 1024. Pure sink+window's 16.8× is excluded because it drops full-attn recall.
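
The ~4.2× figure can be reproduced from the per-agent KV column alone. The snippet below is a minimal arithmetic check using only the numbers in the table; the function and constant names are illustrative and do not come from mlx_multitenant_pressure.py.

```python
# Per-agent KV sizes (MB) at ctx2048, copied from the table above.
NATIVE_KV_MB = 256.9
KAKEYA_S5_KV_MB = 61.1

# Derived max agents, also copied from the table above.
DERIVED_MAX_AGENTS = {"native": 22, "kakeya_s5": 93}


def kv_ratio(baseline_mb: float, compact_mb: float) -> float:
    """How many compact-KV agents fit in the memory of one baseline agent."""
    return baseline_mb / compact_mb


ratio = kv_ratio(NATIVE_KV_MB, KAKEYA_S5_KV_MB)
agent_ratio = DERIVED_MAX_AGENTS["kakeya_s5"] / DERIVED_MAX_AGENTS["native"]

print(f"per-agent KV ratio: {ratio:.1f}x")        # per-agent KV ratio: 4.2x
print(f"derived agent ratio: {agent_ratio:.1f}x")  # derived agent ratio: 4.2x
```

Both the per-agent KV ratio and the derived-max-agents ratio round to the same 4.2×.
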
k3_cuda_multitenant_parallel_bench.py (H200), batched restored-S5 decode,
each row = a session with its own KV-cache row:
| sessions N | restored-S5 agg tok/s | parallel speedup | per-session recall |
|---|---|---|---|
| 1 | 27.4 | 1.00× | 1.0 |
| 2 | 54.6 | 1.99× | 1.0 |
| 4 | 111.6 | 4.07× | 1.0 |
| 8 | 220.4 | 8.04× | 1.0 |
→ near-linear parallel scaling; restored S5 ≈ native AR (220.4 vs 216.4 @ N=8).
One fix for a batch-1 assumption was required: the RoPE cos/sin tensors, built at batch 1, must broadcast across the batch dimension
(restored_attention.py). Evidence: k3_cuda_multitenant_parallel_gpu.json.
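
As a sanity check on the speedup column, the sketch below recomputes speedup and per-session scaling efficiency from the aggregate tok/s values in the table. It is plain arithmetic on the reported numbers; the helper name `scaling` is illustrative and is not part of the benchmark script.

```python
# Aggregate restored-S5 decode throughput (tok/s) per session count, from the table above.
AGG_TOK_S = {1: 27.4, 2: 54.6, 4: 111.6, 8: 220.4}


def scaling(agg: dict) -> dict:
    """Map N -> (speedup vs N=1, efficiency = speedup / N)."""
    base = agg[1]
    return {n: (tps / base, tps / base / n) for n, tps in agg.items()}


for n, (speedup, efficiency) in scaling(AGG_TOK_S).items():
    print(f"N={n}: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

An efficiency close to 1.0 at every N is what "near-linear" means here.
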
Implementation: CrossModelRestoredSinkWindowVerifier.spawn() (fresh per-session
adapter, shared weights) + PerSessionVerifierRegistry (session→adapter; also
the SessionStore cache-inspector + coordinator resolver) + coordinator
resolver + servicer on_session_close cleanup + start_grpc_runtime_server --multi-tenant. Back-compat: single-tenant unchanged (271 session+server unit
tests pass; test_verifier_registry.py proves interleaved-session isolation).
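
The shape of this design (one set of weights, one adapter per session, cleanup on close) can be illustrated with a toy model. The sketch below is not the project's code: `ToyVerifier`, `ToyRegistry`, `SessionAdapter`, and `resolve` are hypothetical stand-ins, and the "KV cache" is just a Python list. Only `spawn()` and `on_session_close` echo names from the text above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SharedWeights:
    """Stand-in for model weights: loaded once, shared by every session."""
    name: str


@dataclass
class SessionAdapter:
    """Per-session state: a reference to the shared weights plus a private KV list."""
    weights: SharedWeights
    kv: list = field(default_factory=list)

    def append(self, token: str) -> None:
        self.kv.append(token)


class ToyVerifier:
    """Hypothetical verifier whose spawn() returns a fresh adapter over the same weights."""

    def __init__(self, weights: SharedWeights):
        self.weights = weights

    def spawn(self) -> SessionAdapter:
        return SessionAdapter(self.weights)


class ToyRegistry:
    """Hypothetical session -> adapter map with lazy creation and close-time cleanup."""

    def __init__(self, verifier: ToyVerifier):
        self._verifier = verifier
        self._adapters = {}

    def resolve(self, session_id: str) -> SessionAdapter:
        if session_id not in self._adapters:
            self._adapters[session_id] = self._verifier.spawn()
        return self._adapters[session_id]

    def on_session_close(self, session_id: str) -> None:
        self._adapters.pop(session_id, None)

    def __len__(self) -> int:
        return len(self._adapters)


registry = ToyRegistry(ToyVerifier(SharedWeights("toy")))

# Interleave two sessions; each write must land only in its own KV.
registry.resolve("a").append("MAPLE")
registry.resolve("b").append("IOTA")
registry.resolve("a").append("7890")
```

Interleaving writes from two sessions and then inspecting each adapter's `kv` is the toy analogue of the interleaved-session isolation property that test_verifier_registry.py is described as proving.
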
E2E (k3_grpc_multitenant_e2e.py, H200): launch the multi-tenant server, 4
concurrent SDK clients, each its own session + distinct needle:
| sessions | transport | per-session recall | isolation |
|---|---|---|---|
| 4 concurrent | real gRPC RuntimeService + Python SDK | 1.0 | ✓ |
Each recalled its own needle (MAPLE-7890/IOTA-8961/THETA-6866/IOTA-3281 —
two IOTA-* sessions got their own numbers) → per-session KV isolation through
the real served path. Evidence: k3_grpc_multitenant_e2e_gpu.json.
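
One way to phrase the isolation criterion is "each reply contains its own needle and none of the others". The sketch below applies that rule to the four needles listed above. The session ids, the reply strings, and the `check_isolation` helper are hypothetical, and the actual check in k3_grpc_multitenant_e2e.py may differ.

```python
def check_isolation(needles: dict, replies: dict) -> dict:
    """Per session: True if the reply has its own needle and no other session's needle."""
    report = {}
    for sid, needle in needles.items():
        others = [n for s, n in needles.items() if s != sid]
        reply = replies[sid]
        report[sid] = needle in reply and not any(o in reply for o in others)
    return report


# Needles from the run above; session ids are made up for illustration.
needles = {
    "s0": "MAPLE-7890",
    "s1": "IOTA-8961",
    "s2": "THETA-6866",
    "s3": "IOTA-3281",
}

# Hypothetical replies: every session answers with its own needle.
clean_replies = {sid: f"The code is {needle}." for sid, needle in needles.items()}

# Hypothetical failure: s3 answers with s1's number, as cross-session KV leakage would cause.
leaky_replies = dict(clean_replies, s3="The code is IOTA-8961.")

print(check_isolation(needles, clean_replies))
print(check_isolation(needles, leaky_replies))
```

Comparing full needle strings rather than prefixes matters for the two IOTA-* sessions, which differ only in their numbers.
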
BatchedDecodeScheduler (inference_engine/session/batch_scheduler.py) stacks
the cohort's per-session restored caches along the batch dim and runs one
verifier forward per step (drops finished rows). k3_served_batched_scheduler_bench.py
(H200, 8 sessions):
| path | aggregate decode tok/s | per-session recall |
|---|---|---|
| serialized (each session alone) | 26.6 | 1.0 |
| batched scheduler | 224.9 | 1.0 |
| speedup | 8.45× | — |
→ the served path goes from correct-but-serialized to 8.45× aggregate
throughput at 8 sessions, recall preserved. Evidence:
k3_served_batched_scheduler_gpu.json.
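
The control flow described above (stack the cohort, one forward per step, drop finished rows) can be sketched with a toy model. This is an assumption-laden illustration, not the code in batch_scheduler.py: `Row`, `toy_forward`, and `run_cohort` are hypothetical names, caches are plain lists, and the "forward" just returns each row's cache length as its next token.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    session_id: str
    cache: list            # toy stand-in for one session's restored KV cache
    max_new_tokens: int
    output: list = field(default_factory=list)


def toy_forward(batch: list) -> list:
    """One batched 'verifier forward': returns one next token per row.

    Toy rule: the next token is the current length of that row's cache.
    """
    return [len(cache) for cache in batch]


def run_cohort(rows: list) -> int:
    """Decode a fixed cohort; returns the number of batched forwards executed."""
    active = [r for r in rows if r.max_new_tokens > 0]
    forwards = 0
    while active:
        batch = [r.cache for r in active]       # "stack along the batch dim"
        next_tokens = toy_forward(batch)        # one forward for the whole cohort
        forwards += 1
        for row, tok in zip(active, next_tokens):
            row.cache.append(tok)
            row.output.append(tok)
        # Drop rows that have produced all their tokens.
        active = [r for r in active if len(r.output) < r.max_new_tokens]
    return forwards


a = Row("a", cache=[0, 0], max_new_tokens=3)
b = Row("b", cache=[0], max_new_tokens=1)
forwards = run_cohort([a, b])
print(forwards, a.output, b.output)  # 3 [2, 3, 4] [1]
```

In this toy run, the cohort needs 3 forwards where decoding each session alone would need 4 (3 + 1); the real speedup comes from the GPU processing the stacked rows in one pass.
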
PR-A3c delivers, recall-preserving (recall 1.0 throughout), the three multi-tenant properties together:
- Bounded memory — ~4.2× more concurrent agents per GB (§3.1).
- Parallel throughput (CUDA) — 8.04× engine-level (§3.2) / 8.45× through the served per-session adapters via the batched scheduler (§3.4). This is CUDA-only; see the platform note below.
- Correct isolation — true multi-tenant serving end-to-end through gRPC, per-session recall 1.0 (§3.3).
Platform scope — MLX (v0.4-mac) multi-tenant is serial-only. Per-session
binding (isolated KV, shared weights, recall 1.0) holds on both platforms, but
the batched/parallel cohort path is CUDA-only. On Apple-Silicon MLX,
batched B>1 decode collapses per-session recall to 0.125 (vs serialized 1.0)
because of an upstream MLX B>1, L=1 quantized-decode kernel bug that persists
on the latest published mlx 0.31.2 / mlx-lm 0.31.3 and is not Python-patchable
(ADR 0014 §3.4). Mac therefore serves multi-tenant sessions serially.
- Async continuous batching transport: wire the batched scheduler under the async gRPC streaming `Generate` handlers via per-step futures + a background batch loop, so independent RPC coroutines feed one batch, and support dynamic mid-flight arrival + ragged-length cohorts (this report's scheduler is a fixed synchronized cohort, the dominant burst case).
- Batched fused spec-decode (DFlash is batch-1 today).
- Mac served path: multi-tenant on MLX is serial-only by decision (ADR 0014 §3.4). Batched `B>1` decode is blocked by an upstream MLX kernel bug, so Mac serves sessions one at a time (recall-preserving). CUDA is the batched parallel path today. Revisit only if a future mlx release / source build fixes the `B>1, L=1` quantized-decode kernel.
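
The "per-step futures + background batch loop" idea in the first item can be sketched with plain asyncio. This is one possible shape under stated assumptions, not a design the project has committed to: `ToyBatchLoop`, `handler`, and `demo` are hypothetical, the batched forward is replaced by `token + 1`, and no gRPC is involved.

```python
import asyncio


class ToyBatchLoop:
    """Hypothetical background loop: gathers per-step requests and resolves them as one batch."""

    def __init__(self):
        self._queue = asyncio.Queue()
        self.batch_sizes = []  # how many requests each batched step served

    async def step(self, session_id: str, token: int) -> int:
        """Called by a handler coroutine: enqueue one decode step and await its result."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((session_id, token, fut))
        return await fut

    async def run(self) -> None:
        while True:
            pending = [await self._queue.get()]   # wait for at least one request
            await asyncio.sleep(0)                # yield so other handlers can enqueue
            while not self._queue.empty():        # drain whatever has arrived
                pending.append(self._queue.get_nowait())
            self.batch_sizes.append(len(pending))
            # Toy stand-in for one batched verifier forward over all pending rows.
            for _sid, token, fut in pending:
                fut.set_result(token + 1)


async def handler(batcher: ToyBatchLoop, session_id: str, start: int, steps: int) -> list:
    """Stand-in for one streaming RPC coroutine: one future per decode step."""
    token, out = start, []
    for _ in range(steps):
        token = await batcher.step(session_id, token)
        out.append(token)
    return out


async def demo():
    batcher = ToyBatchLoop()
    loop_task = asyncio.create_task(batcher.run())
    results = await asyncio.gather(
        handler(batcher, "a", 0, 3),
        handler(batcher, "b", 10, 2),   # shorter session: a ragged-length cohort
    )
    loop_task.cancel()
    try:
        await loop_task
    except asyncio.CancelledError:
        pass
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Each handler sees only its own token stream, while the loop decides how many requests share a step. The recorded batch sizes depend on event-loop scheduling, so the sketch does not assume a particular grouping.
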
| Test | Script | Evidence JSON |
|---|---|---|
| Single-tenant capacity (Mac) | grpc_agent_capacity_loadtest.py | k3_agent_capacity_mac.json, k3_agent_capacity_stress_mac.json |
| Memory A/B (Mac) | mlx_multitenant_pressure.py | k3_multitenant_pressure_mac.json |
| Parallel throughput (H200) | k3_cuda_multitenant_parallel_bench.py | k3_cuda_multitenant_parallel_gpu.json |
| Served e2e (H200) | k3_grpc_multitenant_e2e.py | k3_grpc_multitenant_e2e_gpu.json |
| Batched scheduler (H200) | k3_served_batched_scheduler_bench.py | k3_served_batched_scheduler_gpu.json |
(All JSON under results/research/.)