ovg-project · qinganrice · May 13, 2026 · May 13, 2026
diff --git a/benchmarks/bench_layout/H100_results.md b/benchmarks/bench_layout/H100_results.md
@@ -0,0 +1,115 @@
+# H100 Verification: `KVCACHED_CONTIGUOUS_LAYOUT` Overhead
+
+**Platform:** NVIDIA H100 80GB HBM3, x86_64  
+**Model:** `Qwen/Qwen3-0.6B` (28 layers, 8 KV heads, head_dim 128, bf16)  
+**vLLM:** 0.19.0  
+**Setup:** `vllm serve --gpu-memory-utilization 0.5 --max-model-len 2048`  
+**Bench:** `vllm bench serve` random 512in/128out, 500 prompts, 3 seeds (42/99/7), means reported
+
+---
+
+## 1. E2E Sweep Results
+
+### rate=inf (throughput-bound)
+
+| Config | Env | Throughput (req/s) | TTFT (ms) | TPOT (ms) |
+|---|---|--:|--:|--:|
+| A — vanilla | — | 94.91 | 2905 | 5.45 |
+| B — kvcached default | `LAYOUT=true` | 95.24 (+0.3%) | 2829 | 6.43 |
+| C — layout=false | `LAYOUT=false` | 75.96 **(-20%)** | 2846 | 17.09 |
+| D — reserved200 | `LAYOUT=true` + reserved | 92.75 (-2%) | 2938 | 5.88 |
+| E — both knobs | `LAYOUT=false` + reserved | 95.01 (+0.1%) | 2868 | 6.16 |
+
+### rate=16 (latency-bound)
+
+| Config | Env | Throughput (req/s) | TTFT (ms) | TPOT (ms) |
+|---|---|--:|--:|--:|
+| F — vanilla | — | 15.87 | 38.9 | 2.39 |
+| G — kvcached default | `LAYOUT=true` | 15.87 (0%) | 40.5 | 2.45 |
+| H — kvcached best | `LAYOUT=false` + reserved | 15.87 (0%) | 39.6 | 2.42 |
+
+**Key observations:**
+- `LAYOUT=true` (kvcached default) matches vanilla throughput and TTFT on H100 with no tuning needed; mean TPOT is about 18% higher (6.43 ms vs 5.45 ms).
+- `LAYOUT=false` *without* reserved pages causes a **~20% throughput drop and 3× TPOT regression**.
+- `LAYOUT=false` *with* reserved pages (`MIN=50, MAX=200`) fully recovers performance.
+- At rate=16, all configs are functionally identical — the server is not the bottleneck.
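The table means and deltas can be re-derived from the per-seed JSON files. The sketch below is illustrative only: it hard-codes the `request_throughput` values from the `sweep_results/*.json` files included in this diff (config E is omitted because its files are not shown here), and the helper name `pct_delta` is hypothetical.

```python
from statistics import mean

# request_throughput (req/s) per seed, copied from the sweep_results JSON files
THROUGHPUT = {
    "A_vanilla_inf": [92.59454657762225, 96.98779146353013, 95.1620213408153],
    "B_kvcached_default_inf": [94.33240203001841, 97.02029022144336, 94.35970147795587],
    "C_layout_false_inf": [79.80002336311972, 78.30460530333114, 69.77621093537702],
    "D_reserved200_inf": [94.0344018233921, 91.50993574642692, 92.7032881812124],
}


def pct_delta(value, baseline):
    """Relative change of `value` against `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Mean over the three seeds for each config
means = {name: mean(values) for name, values in THROUGHPUT.items()}
baseline = means["A_vanilla_inf"]

for name, m in means.items():
    print(f"{name}: {m:.2f} req/s ({pct_delta(m, baseline):+.1f}% vs vanilla)")
```

The same pattern applies to `mean_ttft_ms` and `mean_tpot_ms` if those fields are read instead.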
+
+---
+
+## 2. Comparison with README (GB10/aarch64)
+
+The README was benchmarked on a GB10 Grace Blackwell (aarch64, unified CPU-GPU memory). Results are **inverted** on H100:
+
+| Layout | GB10 throughput | H100 throughput |
+|---|---|---|
+| `LAYOUT=true` (default) | 9.87 req/s **(-31%)** vs vanilla | 95.24 req/s **(+0.3%)** vs vanilla |
+| `LAYOUT=false` | 14.17 req/s **(-1%)** vs vanilla | 75.96 req/s **(-20%)** vs vanilla |
+
+The recommendation in the README to flip the default to `LAYOUT=false` is **GB10-specific and does not apply to H100**.
+
+---
+
+## 3. Root Cause: CPU-side VMM Driver Overhead
+
+### Hypothesis
+
+`LAYOUT=false` requires one `cuMemMap`/`cuMemSetAccess`/`cuMemUnmap` call **per layer per K/V buffer** per page:
+
+- `LAYOUT=true`: **1** `cuMemMap` call per page (compound page covers all layers)
+- `LAYOUT=false`: **num_layers × 2 = 28 × 2 = 56** calls per page (per-layer K+V mapping)
+
+Without reserved pages, these 56 calls happen **synchronously on the decode hot path**, stalling the scheduler between every decode step that needs a new KV block.
+
+With reserved pages, a background thread pre-maps pages during GPU idle time. The decode hot path only pops from a pre-filled pool — **zero driver calls in the critical path**.
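To make the hypothesis concrete, here is a toy model of the call counts. It is a sketch under stated assumptions, not kvcached's implementation: `ReservedPagePool`, `refill`, and `alloc` are hypothetical names, and the refill policy (top up to the maximum once the pool drops below the minimum) is a guess at what the reserved-page knobs mean. It counts only `cuMemMap`-style calls and models the background thread as an explicit `refill()` call.

```python
from collections import deque

NUM_LAYERS = 28   # Qwen3-0.6B, from the setup header
KV_BUFFERS = 2    # one K and one V buffer per layer


def map_calls_per_page(contiguous_layout: bool) -> int:
    """Mapping calls needed to back one page, per the hypothesis above."""
    return 1 if contiguous_layout else NUM_LAYERS * KV_BUFFERS


class ReservedPagePool:
    """Toy pre-mapped page pool (hypothetical, not kvcached's API)."""

    def __init__(self, min_reserved, max_reserved, contiguous_layout):
        self.min_reserved = min_reserved
        self.max_reserved = max_reserved
        self.calls_per_page = map_calls_per_page(contiguous_layout)
        self.pool = deque()
        self.hot_path_calls = 0      # calls issued while the scheduler waits
        self.background_calls = 0    # calls issued off the critical path
        self._next_page = 0

    def _map_new_page(self):
        page = self._next_page
        self._next_page += 1
        return page

    def refill(self):
        """Background work: top up to max once the pool drops below min."""
        if len(self.pool) < self.min_reserved:
            while len(self.pool) < self.max_reserved:
                self.pool.append(self._map_new_page())
                self.background_calls += self.calls_per_page

    def alloc(self):
        """Hot path: pop a pre-mapped page, or map synchronously if empty."""
        if self.pool:
            return self.pool.popleft()
        self.hot_path_calls += self.calls_per_page
        return self._map_new_page()


# LAYOUT=false with and without reserved pages, 100 page allocations each
with_pool = ReservedPagePool(50, 200, contiguous_layout=False)
with_pool.refill()
without_pool = ReservedPagePool(0, 0, contiguous_layout=False)
for _ in range(100):
    with_pool.alloc()
    without_pool.alloc()

print("hot-path calls with pool:", with_pool.hot_path_calls)
print("hot-path calls without pool:", without_pool.hot_path_calls)
```

In this model the pool moves the mapping work to the background rather than removing it, which is consistent with reserved pages recovering throughput without changing the per-page call count.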
+
+### nsys Verification
+
+`nsys profile` run: 100 prompts at rate=inf, 30-prompt warmup, `--capture-range=cudaProfilerApi`.
+
+#### GPU kernel time (total across capture window)
+
+| | layout_false | layout_true | Δ (true vs. false) |
+|---|--:|--:|--:|
+| `nvjet_tst_*` (attention) | ~80 ms | ~80 ms | ≈ 0 |
+| Total kernel time | 258 ms | 284 ms | +10% |
+
+`flash_fwd_splitkv_kernel` (the bottleneck on GB10) does not appear — this vLLM version uses FlashInfer's JIT-compiled `nvjet_tst_*` kernels on H100. Attention kernel time is **essentially identical** between layouts.
+
+#### CPU-side CUDA driver calls (CUPTI RUNTIME)
+
+| API | layout_false | layout_true | Call-count ratio |
+|---|--:|--:|--:|
+| `cuMemSetAccess` | **359 ms**, 5992 calls | 8 ms, 100 calls | 60× |
+| `cuMemUnmap` | **249 ms**, 5992 calls | 6 ms, 100 calls | 60× |
+| `cuMemCreate` | **94 ms**, 3304 calls | 3 ms, 55 calls | 60× |
+| `cuMemRelease` | **63 ms**, 2688 calls | 3 ms, 45 calls | 60× |
+| `cuMemMap` | 20 ms, 5992 calls | 0.5 ms, 100 calls | 60× |
+| **VMM total** | **~785 ms** | **~20 ms** | **39×** (time) |
+
+Call count ratio (~60×) is close to, though slightly above, the theoretical prediction of `num_layers × 2 = 56×` from `allocator.cpp:182-198`.
+
+`cuMemSetAccess` is the dominant cost (359 ms), more expensive than `cuMemMap` itself (20 ms).
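The ratios can be checked directly from the table values. The snippet below is a small arithmetic sketch using only the rounded numbers reported above; the variable names are illustrative. Note that the unrounded `layout_true` sum is 20.5 ms, so the time ratio comes out near 38× here and 39× when the total is rounded to 20 ms.

```python
# (time_ms, calls) per API, copied from the table above
LAYOUT_FALSE = {
    "cuMemSetAccess": (359, 5992),
    "cuMemUnmap": (249, 5992),
    "cuMemCreate": (94, 3304),
    "cuMemRelease": (63, 2688),
    "cuMemMap": (20, 5992),
}
LAYOUT_TRUE = {
    "cuMemSetAccess": (8, 100),
    "cuMemUnmap": (6, 100),
    "cuMemCreate": (3, 55),
    "cuMemRelease": (3, 45),
    "cuMemMap": (0.5, 100),
}

PREDICTED_CALL_RATIO = 28 * 2  # num_layers x (K + V)

# Per-API ratios of layout_false to layout_true
call_ratio = {api: LAYOUT_FALSE[api][1] / LAYOUT_TRUE[api][1] for api in LAYOUT_FALSE}
time_ratio = {api: LAYOUT_FALSE[api][0] / LAYOUT_TRUE[api][0] for api in LAYOUT_FALSE}

total_false_ms = sum(t for t, _ in LAYOUT_FALSE.values())
total_true_ms = sum(t for t, _ in LAYOUT_TRUE.values())

for api in LAYOUT_FALSE:
    print(f"{api}: calls {call_ratio[api]:.1f}x, time {time_ratio[api]:.1f}x")
print(f"VMM total: {total_false_ms} ms vs {total_true_ms} ms "
      f"({total_false_ms / total_true_ms:.1f}x)")
print(f"measured / predicted call ratio: "
      f"{call_ratio['cuMemMap'] / PREDICTED_CALL_RATIO:.2f}")
```

The per-API time ratios differ from the call-count ratios, which suggests the average cost per call is not the same in the two layouts.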
+
+### Why the TLB issue (GB10) does not dominate on H100
+
+On GB10, `LAYOUT=true` causes FlashAttention to read KV blocks with a stride of `num_layers × block_size = 1.75 MB`, which is close to the 2 MB VMM page size, so nearly every block read touches a different page and misses the TLB. On GB10's unified memory architecture, TLB misses require coordinating CPU+GPU page tables, making them very costly.
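The 1.75 MB stride can be reproduced from the model shape in the setup header. This is a sketch under two assumptions that the text does not state: a block holds 16 tokens (vLLM's usual default), and the compound page interleaves both the K and V buffers of every layer.

```python
NUM_LAYERS = 28      # from the setup header
NUM_KV_HEADS = 8     # from the setup header
HEAD_DIM = 128       # from the setup header
DTYPE_BYTES = 2      # bf16
BLOCK_TOKENS = 16    # assumption: vLLM default block size, not stated above
KV_BUFFERS = 2       # assumption: K and V are both part of the stride

# Bytes one block occupies in a single layer's K (or V) buffer
per_layer_block_bytes = BLOCK_TOKENS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES

# Distance between consecutive blocks of the same layer in the contiguous layout
stride_bytes = NUM_LAYERS * KV_BUFFERS * per_layer_block_bytes
stride_mib = stride_bytes / 2**20

PAGE_BYTES = 2 * 2**20  # 2 MB VMM page
blocks_per_page = PAGE_BYTES / stride_bytes

print(f"per-layer block: {per_layer_block_bytes // 1024} KiB")
print(f"stride: {stride_mib} MiB, blocks per 2 MiB page: {blocks_per_page:.2f}")
```

With roughly one block per page for any given layer, consecutive block reads by an attention kernel rarely share a page translation.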
+
+On H100 (discrete HBM3, 3.35 TB/s), the same stride pattern exists but:
+- The GPU's 50 MB L2 cache absorbs many misses
+- TLB miss penalty is lower with discrete GPU memory
+- The `nvjet_tst_*` kernels (FlashInfer) may handle paged access differently than `flash_fwd_splitkv_kernel`
+
+As a result, the attention kernel times are nearly identical across layouts on H100. The bottleneck shifts entirely to the CPU-side VMM calls.
+
+---
+
+## 4. Summary
+
+| Finding | GB10 (README) | H100 (this run) |
+|---|---|---|
+| Bottleneck | GPU: FlashAttention TLB miss (`flash_fwd_splitkv_kernel` +56%) | CPU: `cuMemSetAccess` / `cuMemUnmap` (60× more calls with `LAYOUT=false`) |
+| Bad layout | `LAYOUT=true` (default) | `LAYOUT=false` without reserved pages |
+| Fix | `LAYOUT=false` alone | `LAYOUT=true` (default) OR `LAYOUT=false` + reserved pages |
+| Attention kernel | `flash_fwd_splitkv_kernel` | FlashInfer `nvjet_tst_*` |
+
+**Recommendation for H100:** Keep the kvcached default (`LAYOUT=true`). It matches vanilla vLLM with no extra configuration. If `LAYOUT=false` is needed (e.g., debugging, non-hybrid models with very deep layer counts), pair it with `KVCACHED_MIN_RESERVED_PAGES=50 KVCACHED_MAX_RESERVED_PAGES=200` to eliminate the allocation hot-path stall.
diff --git a/benchmarks/bench_layout/nsys_runs/layout_false.nsys-rep b/benchmarks/bench_layout/nsys_runs/layout_false.nsys-rep
diff --git a/benchmarks/bench_layout/nsys_runs/layout_false.sqlite b/benchmarks/bench_layout/nsys_runs/layout_false.sqlite
diff --git a/benchmarks/bench_layout/nsys_runs/layout_true.nsys-rep b/benchmarks/bench_layout/nsys_runs/layout_true.nsys-rep
diff --git a/benchmarks/bench_layout/nsys_runs/layout_true.sqlite b/benchmarks/bench_layout/nsys_runs/layout_true.sqlite
diff --git a/benchmarks/bench_layout/run_kvcached_configs.sh b/benchmarks/bench_layout/run_kvcached_configs.sh
@@ -8,7 +8,7 @@ RESULTS_DIR="$SCRIPT_DIR/sweep_results"
 LOG_DIR="$SCRIPT_DIR/sweep_logs"
 mkdir -p "$RESULTS_DIR" "$LOG_DIR"
 
-VLLM="/home/xingqi/miniforge3/envs/kvcached/bin/vllm"
+VLLM="/home/qa4/kvcached/engine_integration/vllm-pip-venv/bin/vllm"
 MODEL="Qwen/Qwen3-0.6B"
 PORT=12347
 SEEDS=(42 99 7)

diff --git a/benchmarks/bench_layout/run_nsys_layout.sh b/benchmarks/bench_layout/run_nsys_layout.sh
@@ -13,7 +13,10 @@ OUT_DIR="$SCRIPT_DIR/nsys_runs"
 LOG_DIR="$SCRIPT_DIR/nsys_logs"
 mkdir -p "$OUT_DIR" "$LOG_DIR"
 
-VLLM="/home/xingqi/miniforge3/envs/kvcached/bin/vllm"
+export CC=/usr/bin/gcc
+export CUDA_VISIBLE_DEVICES=1
+
+VLLM="/home/qa4/kvcached/engine_integration/vllm-pip-venv/bin/vllm"
 MODEL="Qwen/Qwen3-0.6B"
 PORT=12348
 NUM_PROMPTS=${NUM_PROMPTS:-100}

diff --git a/benchmarks/bench_layout/run_sweep.sh b/benchmarks/bench_layout/run_sweep.sh
@@ -9,14 +9,19 @@ RESULTS_DIR="$SCRIPT_DIR/sweep_results"
 LOG_DIR="$SCRIPT_DIR/sweep_logs"
 mkdir -p "$RESULTS_DIR" "$LOG_DIR"
 
-VENV_PY="/home/xingqi/miniforge3/envs/kvcached/bin/python"
-VLLM="/home/xingqi/miniforge3/envs/kvcached/bin/vllm"
+VENV_PY="/home/qa4/kvcached/engine_integration/vllm-pip-venv/bin/python"
+VLLM="/home/qa4/kvcached/engine_integration/vllm-pip-venv/bin/vllm"
 
 MODEL="Qwen/Qwen3-0.6B"
 PORT=12347
 GPU_MEM_UTIL=0.5
 MAX_MODEL_LEN=2048
 
+# Use system GCC so Triton can find Ubuntu's multiarch Python headers
+# (conda's cross-compiler doesn't know about /usr/include/x86_64-linux-gnu/)
+export CC=/usr/bin/gcc
+export CUDA_VISIBLE_DEVICES=1
+
 WARMUP_PROMPTS=100
 NUM_PROMPTS=500
 INPUT_LEN=512

diff --git a/benchmarks/bench_layout/sweep_results/A_vanilla_inf.seed42.json b/benchmarks/bench_layout/sweep_results/A_vanilla_inf.seed42.json
@@ -0,0 +1 @@
+{"date": "20260513-213714", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.399886045997846, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 92.59454657762225, "request_goodput": null, "output_throughput": 11852.101961935648, "total_token_throughput": 59260.50980967824, "max_output_tokens_per_s": 16751.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2996.340644909913, "median_ttft_ms": 2869.831036507094, "std_ttft_ms": 1159.4458332256997, "p99_ttft_ms": 4897.417467786436, "mean_tpot_ms": 5.578512352973307, "median_tpot_ms": 5.712947661426527, "std_tpot_ms": 0.8975151705031924, "p99_tpot_ms": 6.867619391777562, "mean_itl_ms": 6.023084774721995, "median_itl_ms": 4.514993997872807, "std_itl_ms": 5.037837034084684, "p99_itl_ms": 26.145541326841336}
diff --git a/benchmarks/bench_layout/sweep_results/A_vanilla_inf.seed7.json b/benchmarks/bench_layout/sweep_results/A_vanilla_inf.seed7.json
@@ -0,0 +1 @@
+{"date": "20260513-213742", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.254196925990982, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 95.1620213408153, "request_goodput": null, "output_throughput": 12180.738731624358, "total_token_throughput": 60903.693658121796, "max_output_tokens_per_s": 15611.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2878.0705835438857, "median_ttft_ms": 2968.2554804967367, "std_ttft_ms": 1150.4043733280816, "p99_ttft_ms": 4746.974478342308, "mean_tpot_ms": 5.063209691136254, "median_tpot_ms": 5.068087996067076, "std_tpot_ms": 0.6329536898734069, "p99_tpot_ms": 6.0650228592349436, "mean_itl_ms": 5.416066304516089, "median_itl_ms": 4.354908000095747, "std_itl_ms": 4.325565254334482, "p99_itl_ms": 24.000420820084393}
diff --git a/benchmarks/bench_layout/sweep_results/A_vanilla_inf.seed99.json b/benchmarks/bench_layout/sweep_results/A_vanilla_inf.seed99.json
@@ -0,0 +1 @@
+{"date": "20260513-213728", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.155288025998743, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 96.98779146353013, "request_goodput": null, "output_throughput": 12414.437307331857, "total_token_throughput": 62072.18653665928, "max_output_tokens_per_s": 16656.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2841.2560345983366, "median_ttft_ms": 2833.5979160110583, "std_ttft_ms": 1092.568714262548, "p99_ttft_ms": 4631.834091359051, "mean_tpot_ms": 5.719756706802707, "median_tpot_ms": 5.850762444847139, "std_tpot_ms": 0.9376681327161561, "p99_tpot_ms": 7.074551629807752, "mean_itl_ms": 6.094447534263853, "median_itl_ms": 4.757997499837074, "std_itl_ms": 4.7151632867285205, "p99_itl_ms": 25.039508403278894}
diff --git a/benchmarks/bench_layout/sweep_results/B_kvcached_default_inf.seed42.json b/benchmarks/bench_layout/sweep_results/B_kvcached_default_inf.seed42.json
@@ -0,0 +1 @@
+{"date": "20260513-213837", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.30040568500408, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 94.33240203001841, "request_goodput": null, "output_throughput": 12074.547459842357, "total_token_throughput": 60372.73729921179, "max_output_tokens_per_s": 17206.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2842.396057820093, "median_ttft_ms": 2752.043180495093, "std_ttft_ms": 1144.022963796894, "p99_ttft_ms": 4737.292136487376, "mean_tpot_ms": 7.07648069343569, "median_tpot_ms": 7.293891228366951, "std_tpot_ms": 1.2871172035492182, "p99_tpot_ms": 9.034836988755883, "mean_itl_ms": 7.422107599210637, "median_itl_ms": 5.333951994543895, "std_itl_ms": 6.194047978630135, "p99_itl_ms": 33.10964161675656}
diff --git a/benchmarks/bench_layout/sweep_results/B_kvcached_default_inf.seed7.json b/benchmarks/bench_layout/sweep_results/B_kvcached_default_inf.seed7.json
@@ -0,0 +1 @@
+{"date": "20260513-213906", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.298872211002163, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 94.35970147795587, "request_goodput": null, "output_throughput": 12078.04178917835, "total_token_throughput": 60390.20894589175, "max_output_tokens_per_s": 15941.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2860.836808909982, "median_ttft_ms": 3021.1880290007684, "std_ttft_ms": 1141.4287169330617, "p99_ttft_ms": 4757.071653173334, "mean_tpot_ms": 5.922177347638755, "median_tpot_ms": 6.192431082663963, "std_tpot_ms": 0.753618832430732, "p99_tpot_ms": 6.750312735570418, "mean_itl_ms": 6.192198435033301, "median_itl_ms": 4.609468989656307, "std_itl_ms": 4.8812906238327765, "p99_itl_ms": 25.83109380502717}
diff --git a/benchmarks/bench_layout/sweep_results/B_kvcached_default_inf.seed99.json b/benchmarks/bench_layout/sweep_results/B_kvcached_default_inf.seed99.json
@@ -0,0 +1 @@
+{"date": "20260513-213851", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.153561166007421, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 97.02029022144336, "request_goodput": null, "output_throughput": 12418.59714834475, "total_token_throughput": 62092.98574172374, "max_output_tokens_per_s": 17057.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2782.472873407969, "median_ttft_ms": 2758.188852996682, "std_ttft_ms": 1082.3592332392213, "p99_ttft_ms": 4603.358503317868, "mean_tpot_ms": 6.301567228615933, "median_tpot_ms": 6.694210511799553, "std_tpot_ms": 0.983334299303326, "p99_tpot_ms": 7.3450646902181065, "mean_itl_ms": 6.468946915936141, "median_itl_ms": 4.969607005477883, "std_itl_ms": 4.854803793002526, "p99_itl_ms": 26.027333125239238}
diff --git a/benchmarks/bench_layout/sweep_results/C_layout_false_inf.seed42.json b/benchmarks/bench_layout/sweep_results/C_layout_false_inf.seed42.json
@@ -0,0 +1 @@
+{"date": "20260513-214004", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 6.265662325997255, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 79.80002336311972, "request_goodput": null, "output_throughput": 10214.402990479324, "total_token_throughput": 51072.014952396625, "max_output_tokens_per_s": 18117.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 3051.5959583601216, "median_ttft_ms": 2792.0081980046234, "std_ttft_ms": 1261.3209748958272, "p99_ttft_ms": 5187.654237216775, "mean_tpot_ms": 12.286887885320208, "median_tpot_ms": 13.33611246855429, "std_tpot_ms": 2.8784181595981138, "p99_tpot_ms": 15.39559150202769, "mean_itl_ms": 12.688114986551021, "median_itl_ms": 8.384647997445427, "std_itl_ms": 13.37050487530859, "p99_itl_ms": 76.31652099371422}
diff --git a/benchmarks/bench_layout/sweep_results/C_layout_false_inf.seed7.json b/benchmarks/bench_layout/sweep_results/C_layout_false_inf.seed7.json
@@ -0,0 +1 @@
+{"date": "20260513-214036", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 7.165766001009615, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 69.77621093537702, "request_goodput": null, "output_throughput": 8931.354999728259, "total_token_throughput": 44656.7749986413, "max_output_tokens_per_s": 17090.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2683.4072147000697, "median_ttft_ms": 2452.707313001156, "std_ttft_ms": 988.2504514458722, "p99_ttft_ms": 4345.243805501523, "mean_tpot_ms": 23.354770493476632, "median_tpot_ms": 24.9312699645919, "std_tpot_ms": 3.4908225178951673, "p99_tpot_ms": 26.9516463364705, "mean_itl_ms": 23.69773722538854, "median_itl_ms": 17.386089006322436, "std_itl_ms": 23.294918898868556, "p99_itl_ms": 119.62348720699082}
diff --git a/benchmarks/bench_layout/sweep_results/C_layout_false_inf.seed99.json b/benchmarks/bench_layout/sweep_results/C_layout_false_inf.seed99.json
@@ -0,0 +1 @@
+{"date": "20260513-214020", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 6.385320481000235, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 78.30460530333114, "request_goodput": null, "output_throughput": 10022.989478826386, "total_token_throughput": 50114.94739413193, "max_output_tokens_per_s": 17570.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2801.6103183477535, "median_ttft_ms": 2993.8538520000293, "std_ttft_ms": 1094.639611932442, "p99_ttft_ms": 4419.115242403059, "mean_tpot_ms": 15.615138726504647, "median_tpot_ms": 16.600237102394438, "std_tpot_ms": 2.6588451240438236, "p99_tpot_ms": 19.02464119452336, "mean_itl_ms": 15.890408391937969, "median_itl_ms": 10.570138001639862, "std_itl_ms": 19.207584510871666, "p99_itl_ms": 104.85940178536113}
diff --git a/benchmarks/bench_layout/sweep_results/D_reserved200_inf.seed42.json b/benchmarks/bench_layout/sweep_results/D_reserved200_inf.seed42.json
@@ -0,0 +1 @@
+{"date": "20260513-214133", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.31720296300773, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 94.0344018233921, "request_goodput": null, "output_throughput": 12036.403433394189, "total_token_throughput": 60182.017166970945, "max_output_tokens_per_s": 16370.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2832.849044246337, "median_ttft_ms": 2771.687240994652, "std_ttft_ms": 1147.4102520194278, "p99_ttft_ms": 4828.614962843858, "mean_tpot_ms": 6.607700836657745, "median_tpot_ms": 6.107175980298597, "std_tpot_ms": 1.4059170328262351, "p99_tpot_ms": 8.500484497108548, "mean_itl_ms": 7.1678081203301875, "median_itl_ms": 4.992718997527845, "std_itl_ms": 6.17914400360204, "p99_itl_ms": 34.532651886111}
diff --git a/benchmarks/bench_layout/sweep_results/D_reserved200_inf.seed7.json b/benchmarks/bench_layout/sweep_results/D_reserved200_inf.seed7.json
@@ -0,0 +1 @@
+{"date": "20260513-214202", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.393551942004706, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 92.7032881812124, "request_goodput": null, "output_throughput": 11866.020887195187, "total_token_throughput": 59330.10443597593, "max_output_tokens_per_s": 16287.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2888.5075508874143, "median_ttft_ms": 2866.0057524975855, "std_ttft_ms": 1183.2092864535348, "p99_ttft_ms": 4947.480752829142, "mean_tpot_ms": 5.349056228111437, "median_tpot_ms": 5.580502330711915, "std_tpot_ms": 0.7582432392240728, "p99_tpot_ms": 6.084736708478771, "mean_itl_ms": 5.708275722806403, "median_itl_ms": 4.48349499492906, "std_itl_ms": 4.454556475994415, "p99_itl_ms": 23.000464970828034}
diff --git a/benchmarks/bench_layout/sweep_results/D_reserved200_inf.seed99.json b/benchmarks/bench_layout/sweep_results/D_reserved200_inf.seed99.json
@@ -0,0 +1 @@
+{"date": "20260513-214147", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.463887565012556, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 91.50993574642692, "request_goodput": null, "output_throughput": 11713.271775542646, "total_token_throughput": 58566.35887771323, "max_output_tokens_per_s": 16159.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 3093.1805829557998, "median_ttft_ms": 3275.2107769993017, "std_ttft_ms": 1146.292393137571, "p99_ttft_ms": 4946.563812132808, "mean_tpot_ms": 5.669072853620613, "median_tpot_ms": 5.740103846419085, "std_tpot_ms": 0.7605854546509636, "p99_tpot_ms": 6.8972503650827175, "mean_itl_ms": 5.8603916566469785, "median_itl_ms": 4.5860050013288856, "std_itl_ms": 4.612409462428751, "p99_itl_ms": 23.951338976039494}