115 changes: 115 additions & 0 deletions benchmarks/bench_layout/H100_results.md
@@ -0,0 +1,115 @@
# H100 Verification: `KVCACHED_CONTIGUOUS_LAYOUT` Overhead

**Platform:** NVIDIA H100 80GB HBM3, x86_64
**Model:** `Qwen/Qwen3-0.6B` (28 layers, 8 KV heads, head_dim 128, bf16)
**vLLM:** 0.19.0
**Setup:** `vllm serve --gpu-memory-utilization 0.5 --max-model-len 2048`
**Bench:** `vllm bench serve`, random dataset, 512 input / 128 output tokens, 500 prompts, 3 seeds (42, 99, 7); all values are means across seeds

---

## 1. E2E Sweep Results

### rate=inf (throughput-bound)

| Config | Env | Throughput (req/s) | TTFT (ms) | TPOT (ms) |
|---|---|--:|--:|--:|
| A — vanilla | — | 94.91 | 2905 | 5.45 |
| B — kvcached default | `LAYOUT=true` | 95.24 (+0.3%) | 2829 | 6.43 |
| C — layout=false | `LAYOUT=false` | 75.96 **(-20%)** | 2846 | 17.09 |
| D — reserved200 | `LAYOUT=true` + reserved | 92.75 (-2%) | 2938 | 5.88 |
| E — both knobs | `LAYOUT=false` + reserved | 95.01 (+0.1%) | 2868 | 6.16 |
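
Each row is an arithmetic mean over the three seeds. As a minimal sketch, row A can be reproduced from the per-seed fields that `vllm bench serve` writes to its result JSON. The values below are copied (rounded) from the three result files in this PR whose averages match row A; treating them as the vanilla runs is an inference from that match, and `seed_mean` is a hypothetical helper.

```python
from statistics import mean

# Per-seed values copied (rounded) from the three result JSON files whose
# averages match row A. Assumption: these are the vanilla runs.
vanilla_runs = [
    {"request_throughput": 92.5945, "mean_ttft_ms": 2996.34, "mean_tpot_ms": 5.5785},
    {"request_throughput": 95.1620, "mean_ttft_ms": 2878.07, "mean_tpot_ms": 5.0632},
    {"request_throughput": 96.9878, "mean_ttft_ms": 2841.26, "mean_tpot_ms": 5.7198},
]


def seed_mean(runs, field):
    """Arithmetic mean of one metric across seeds."""
    return mean(run[field] for run in runs)


# One averaged value per metric, keyed by the JSON field name.
row_a = {field: seed_mean(vanilla_runs, field) for field in vanilla_runs[0]}
print({field: round(value, 2) for field, value in row_a.items()})
```

With only three seeds, a single slow run moves the mean noticeably, so the per-seed spread is worth checking alongside the mean.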

### rate=16 (latency-bound)

| Config | Env | Throughput (req/s) | TTFT (ms) | TPOT (ms) |
|---|---|--:|--:|--:|
| F — vanilla | — | 15.87 | 38.9 | 2.39 |
| G — kvcached default | `LAYOUT=true` | 15.87 (0%) | 40.5 | 2.45 |
| H — kvcached best | `LAYOUT=false` + reserved | 15.87 (0%) | 39.6 | 2.42 |

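**Key observations:**
- `LAYOUT=true` (kvcached default) matches vanilla throughput on H100 with no tuning (95.24 vs. 94.91 req/s), although mean TPOT is about 18% higher (6.43 vs. 5.45 ms).
- `LAYOUT=false` *without* reserved pages causes a **~20% throughput drop and a ~3× TPOT regression** (17.09 vs. 5.45 ms).
- `LAYOUT=false` *with* reserved pages (`MIN=50, MAX=200`) recovers vanilla throughput (95.01 req/s) and brings TPOT back to 6.16 ms.
- At rate=16, all configs are functionally identical: throughput is capped by the request rate, so the server is not the bottleneck.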
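
The percentages in the rate=inf table are relative to config A. A short sketch of that arithmetic, using only the means from the table (`relative_to_vanilla` is a hypothetical helper, not part of the benchmark scripts):

```python
# Mean throughput (req/s) and mean TPOT (ms) at rate=inf, copied from the table.
results = {
    "A": (94.91, 5.45),
    "B": (95.24, 6.43),
    "C": (75.96, 17.09),
    "D": (92.75, 5.88),
    "E": (95.01, 6.16),
}


def relative_to_vanilla(config):
    """Return (throughput delta in percent, TPOT as a multiple of vanilla)."""
    throughput, tpot = results[config]
    base_throughput, base_tpot = results["A"]
    delta_pct = (throughput / base_throughput - 1.0) * 100.0
    tpot_ratio = tpot / base_tpot
    return delta_pct, tpot_ratio


for name in "BCDE":
    delta_pct, tpot_ratio = relative_to_vanilla(name)
    print(f"{name}: throughput {delta_pct:+.1f}%, TPOT {tpot_ratio:.2f}x vanilla")
```

Reporting TPOT as a ratio next to the throughput delta makes it visible that a config can match vanilla throughput while still paying a per-token latency cost.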

---

## 2. Comparison with README (GB10/aarch64)

The README was benchmarked on a GB10 Grace Blackwell system (aarch64, unified CPU-GPU memory). On H100 the results are **inverted**:

| Layout | GB10 throughput | H100 throughput |
|---|---|---|
| `LAYOUT=true` (default) | 9.87 req/s **(-31%)** vs vanilla | 95.24 req/s **(+0.3%)** vs vanilla |
| `LAYOUT=false` | 14.17 req/s **(-1%)** vs vanilla | 75.96 req/s **(-20%)** vs vanilla |

The recommendation in the README to flip the default to `LAYOUT=false` is **GB10-specific and does not apply to H100**.

---

## 3. Root Cause: CPU-side VMM Driver Overhead

### Hypothesis

`LAYOUT=false` requires one `cuMemMap`/`cuMemSetAccess`/`cuMemUnmap` call **per layer per K/V buffer** per page:

- `LAYOUT=true`: **1** `cuMemMap` call per page (compound page covers all layers)
- `LAYOUT=false`: **num_layers × 2 = 28 × 2 = 56** calls per page (per-layer K+V mapping)

Without reserved pages, these 56 calls happen **synchronously on the decode hot path**, stalling the scheduler on every decode step that needs a new KV block.

With reserved pages, a background thread pre-maps pages during GPU idle time. The decode hot path only pops from a pre-filled pool — **zero driver calls in the critical path**.
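
The hypothesis can be illustrated with a toy model that only counts driver calls. This is a sketch under stated assumptions, not kvcached's implementation: `ReservedPagePool`, `refill`, and `allocate` are hypothetical names, the refill that a background thread would perform is called explicitly here, and no CUDA API is actually invoked.

```python
from collections import deque

NUM_LAYERS = 28   # Qwen3-0.6B, from the setup section
KV_BUFFERS = 2    # one K and one V buffer per layer


def map_calls_per_page(contiguous_layout: bool) -> int:
    """cuMemMap calls needed to back one page, per the hypothesis."""
    return 1 if contiguous_layout else NUM_LAYERS * KV_BUFFERS


class ReservedPagePool:
    """Toy call-counting model of a reserved-page pool (hypothetical)."""

    def __init__(self, contiguous_layout: bool, max_reserved: int):
        self.calls_per_page = map_calls_per_page(contiguous_layout)
        self.max_reserved = max_reserved
        self.pool = deque()
        self.hot_path_calls = 0     # map calls issued while decoding
        self.background_calls = 0   # map calls issued off the hot path
        self._next_page = 0

    def _new_page(self) -> int:
        page = self._next_page
        self._next_page += 1
        return page

    def refill(self) -> None:
        """Stand-in for the background thread: pre-map until the pool is full."""
        while len(self.pool) < self.max_reserved:
            self.pool.append(self._new_page())
            self.background_calls += self.calls_per_page

    def allocate(self) -> int:
        """Decode path: take a pre-mapped page if available, else map now."""
        if self.pool:
            return self.pool.popleft()          # no driver call needed
        self.hot_path_calls += self.calls_per_page
        return self._new_page()


no_reserve = ReservedPagePool(contiguous_layout=False, max_reserved=0)
with_reserve = ReservedPagePool(contiguous_layout=False, max_reserved=200)
with_reserve.refill()
for _ in range(100):
    no_reserve.allocate()
    with_reserve.allocate()
print(no_reserve.hot_path_calls, with_reserve.hot_path_calls)
```

In this model the total number of map calls does not shrink with reserved pages; they move off the decode path. That is consistent with config E recovering throughput even though `LAYOUT=false` still needs 56 calls per page.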

### nsys Verification

`nsys profile` run: 100 prompts at rate=inf, 30-prompt warmup, `--capture-range=cudaProfilerApi`.

#### GPU kernel time (total across capture window)

| Kernel | layout_false | layout_true | Δ (true vs. false) |
|---|--:|--:|--:|
| `nvjet_tst_*` (attention) | ~80 ms | ~80 ms | ≈ 0 |
| Total kernel time | 258 ms | 284 ms | +10% |

`flash_fwd_splitkv_kernel` (the bottleneck on GB10) does not appear — this vLLM version uses FlashInfer's JIT-compiled `nvjet_tst_*` kernels on H100. Attention kernel time is **essentially identical** between layouts.

#### CPU-side CUDA driver calls (CUPTI RUNTIME)

| API | layout_false | layout_true | Call-count ratio |
|---|--:|--:|--:|
| `cuMemSetAccess` | **359 ms**, 5992 calls | 8 ms, 100 calls | 60× |
| `cuMemUnmap` | **249 ms**, 5992 calls | 6 ms, 100 calls | 60× |
| `cuMemCreate` | **94 ms**, 3304 calls | 3 ms, 55 calls | 60× |
| `cuMemRelease` | **63 ms**, 2688 calls | 3 ms, 45 calls | 60× |
| `cuMemMap` | 20 ms, 5992 calls | 0.5 ms, 100 calls | 60× |
| **VMM total** | **~785 ms** | **~20 ms** | **39× (time)** |

The call-count ratio (5992 / 100 ≈ 60×) is close to the theoretical prediction of `num_layers × 2 = 56×` from `allocator.cpp:182-198`.

`cuMemSetAccess` is the dominant cost (359 ms), more expensive than `cuMemMap` itself (20 ms).
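
The totals and ratios can be rechecked directly from the table. The sketch below uses only the numbers reported there; `total_ms`, `call_ratio`, and `mean_call_us` are hypothetical helpers.

```python
# (time in ms, number of calls) per API, copied from the nsys table.
vmm = {
    "cuMemSetAccess": {"layout_false": (359, 5992), "layout_true": (8, 100)},
    "cuMemUnmap":     {"layout_false": (249, 5992), "layout_true": (6, 100)},
    "cuMemCreate":    {"layout_false": (94, 3304),  "layout_true": (3, 55)},
    "cuMemRelease":   {"layout_false": (63, 2688),  "layout_true": (3, 45)},
    "cuMemMap":       {"layout_false": (20, 5992),  "layout_true": (0.5, 100)},
}


def total_ms(layout):
    """Summed driver time across all VMM APIs for one layout."""
    return sum(api[layout][0] for api in vmm.values())


def call_ratio(api_name):
    """How many times more calls layout_false issues than layout_true."""
    return vmm[api_name]["layout_false"][1] / vmm[api_name]["layout_true"][1]


def mean_call_us(api_name, layout):
    """Average driver time per call, in microseconds."""
    time_ms, calls = vmm[api_name][layout]
    return time_ms * 1000.0 / calls


print(total_ms("layout_false"), total_ms("layout_true"))
for name in vmm:
    print(name, round(call_ratio(name), 1), round(mean_call_us(name, "layout_false"), 1))
```

The per-call averages show why `cuMemSetAccess` dominates: it is issued as often as `cuMemMap` but costs far more per call.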

### Why the TLB issue (GB10) does not dominate on H100

On GB10, `LAYOUT=true` causes FlashAttention to read KV blocks with a stride of `num_layers × block_size = 1.75 MB`, which is close to the 2 MB VMM page size, so nearly every block read lands on a different page and misses the TLB. On GB10's unified memory architecture, a TLB miss requires coordinating the CPU and GPU page tables, which makes it very costly.
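
One reading of the 1.75 MB figure that is consistent with the model parameters is sketched below. It assumes 16 tokens per KV block (vLLM's default block size, which this report does not state) and that `block_size` counts both the K and the V data of one layer.

```python
NUM_LAYERS = 28         # from the model line
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2      # bf16
TOKENS_PER_BLOCK = 16   # assumption: vLLM default, not stated in this report
VMM_PAGE_MIB = 2        # VMM page size quoted in the text

# K and V data for one block in a single layer.
per_layer_block_bytes = 2 * TOKENS_PER_BLOCK * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

# With the contiguous layout, consecutive blocks of one layer are separated
# by the data of every layer for that block.
stride_bytes = NUM_LAYERS * per_layer_block_bytes
stride_mib = stride_bytes / 2**20

print(f"per-layer block: {per_layer_block_bytes // 1024} KiB")
print(f"stride: {stride_mib} MiB, {stride_mib / VMM_PAGE_MIB:.3f} of a VMM page")
```

Under these assumptions the stride is 1.75 MiB, so consecutive block reads for one layer almost always fall in different 2 MB pages.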

On H100 (discrete HBM3, 3.35 TB/s), the same stride pattern exists, but several factors likely reduce its cost:
- The GPU's 50 MB L2 cache absorbs many misses
- TLB miss penalty is lower with discrete GPU memory
- The `nvjet_tst_*` kernels (FlashInfer) may handle paged access differently than `flash_fwd_splitkv_kernel`

As a result, the attention kernel times are nearly identical across layouts on H100. The bottleneck shifts entirely to the CPU-side VMM calls.

---

## 4. Summary

| Finding | GB10 (README) | H100 (this run) |
|---|---|---|
| Bottleneck | GPU: FlashAttention TLB miss (`flash_fwd_splitkv_kernel` +56%) | CPU: `cuMemSetAccess` / `cuMemUnmap` (60× more calls with `LAYOUT=false`) |
| Bad layout | `LAYOUT=true` (default) | `LAYOUT=false` without reserved pages |
| Fix | `LAYOUT=false` alone | `LAYOUT=true` (default) OR `LAYOUT=false` + reserved pages |
| Attention kernel | `flash_fwd_splitkv_kernel` | FlashInfer `nvjet_tst_*` |

**Recommendation for H100:** Keep the kvcached default (`LAYOUT=true`). It matches vanilla vLLM with no extra configuration. If `LAYOUT=false` is needed (e.g., debugging, non-hybrid models with very deep layer counts), pair it with `KVCACHED_MIN_RESERVED_PAGES=50 KVCACHED_MAX_RESERVED_PAGES=200` to eliminate the allocation hot-path stall.
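
As a sketch of how the two recommended configurations translate into a launch line, the snippet below assembles the environment overrides named in this report together with the `vllm serve` flags from the setup section. It only builds and prints strings. The `true`/`false` value format follows the report's `LAYOUT=` shorthand and is an assumption, `layout_env` and `serve_command_line` are hypothetical helpers, and any other variables needed to enable kvcached are deliberately left out.

```python
import shlex


def layout_env(contiguous_layout, min_reserved=None, max_reserved=None):
    """Environment overrides for one config (variable names from this report)."""
    env = {"KVCACHED_CONTIGUOUS_LAYOUT": "true" if contiguous_layout else "false"}
    if min_reserved is not None:
        env["KVCACHED_MIN_RESERVED_PAGES"] = str(min_reserved)
    if max_reserved is not None:
        env["KVCACHED_MAX_RESERVED_PAGES"] = str(max_reserved)
    return env


def serve_command_line(env):
    """Render the overrides plus the serve command as one shell line."""
    cmd = ["vllm", "serve", "Qwen/Qwen3-0.6B",
           "--gpu-memory-utilization", "0.5", "--max-model-len", "2048"]
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(cmd)}"


recommended = layout_env(contiguous_layout=True)
fallback = layout_env(contiguous_layout=False, min_reserved=50, max_reserved=200)
print(serve_command_line(recommended))
print(serve_command_line(fallback))
```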
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
2 changes: 1 addition & 1 deletion benchmarks/bench_layout/run_kvcached_configs.sh
@@ -8,7 +8,7 @@ RESULTS_DIR="$SCRIPT_DIR/sweep_results"
LOG_DIR="$SCRIPT_DIR/sweep_logs"
mkdir -p "$RESULTS_DIR" "$LOG_DIR"

-VLLM="/home/xingqi/miniforge3/envs/kvcached/bin/vllm"
+VLLM="/home/qa4/kvcached/engine_integration/vllm-pip-venv/bin/vllm"
MODEL="Qwen/Qwen3-0.6B"
PORT=12347
SEEDS=(42 99 7)
5 changes: 4 additions & 1 deletion benchmarks/bench_layout/run_nsys_layout.sh
@@ -13,7 +13,10 @@ OUT_DIR="$SCRIPT_DIR/nsys_runs"
LOG_DIR="$SCRIPT_DIR/nsys_logs"
mkdir -p "$OUT_DIR" "$LOG_DIR"

-VLLM="/home/xingqi/miniforge3/envs/kvcached/bin/vllm"
+export CC=/usr/bin/gcc
+export CUDA_VISIBLE_DEVICES=1
+
+VLLM="/home/qa4/kvcached/engine_integration/vllm-pip-venv/bin/vllm"
MODEL="Qwen/Qwen3-0.6B"
PORT=12348
NUM_PROMPTS=${NUM_PROMPTS:-100}
9 changes: 7 additions & 2 deletions benchmarks/bench_layout/run_sweep.sh
@@ -9,14 +9,19 @@ RESULTS_DIR="$SCRIPT_DIR/sweep_results"
LOG_DIR="$SCRIPT_DIR/sweep_logs"
mkdir -p "$RESULTS_DIR" "$LOG_DIR"

-VENV_PY="/home/xingqi/miniforge3/envs/kvcached/bin/python"
-VLLM="/home/xingqi/miniforge3/envs/kvcached/bin/vllm"
+VENV_PY="/home/qa4/kvcached/engine_integration/vllm-pip-venv/bin/python"
+VLLM="/home/qa4/kvcached/engine_integration/vllm-pip-venv/bin/vllm"

MODEL="Qwen/Qwen3-0.6B"
PORT=12347
GPU_MEM_UTIL=0.5
MAX_MODEL_LEN=2048

+# Use system GCC so Triton can find Ubuntu's multiarch Python headers
+# (conda's cross-compiler doesn't know about /usr/include/x86_64-linux-gnu/)
+export CC=/usr/bin/gcc
+export CUDA_VISIBLE_DEVICES=1
+
WARMUP_PROMPTS=100
NUM_PROMPTS=500
INPUT_LEN=512
@@ -0,0 +1 @@
{"date": "20260513-213714", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.399886045997846, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 92.59454657762225, "request_goodput": null, "output_throughput": 11852.101961935648, "total_token_throughput": 59260.50980967824, "max_output_tokens_per_s": 16751.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2996.340644909913, "median_ttft_ms": 2869.831036507094, "std_ttft_ms": 1159.4458332256997, "p99_ttft_ms": 4897.417467786436, "mean_tpot_ms": 5.578512352973307, "median_tpot_ms": 5.712947661426527, "std_tpot_ms": 0.8975151705031924, "p99_tpot_ms": 6.867619391777562, "mean_itl_ms": 6.023084774721995, "median_itl_ms": 4.514993997872807, "std_itl_ms": 5.037837034084684, "p99_itl_ms": 26.145541326841336}
@@ -0,0 +1 @@
{"date": "20260513-213742", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.254196925990982, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 95.1620213408153, "request_goodput": null, "output_throughput": 12180.738731624358, "total_token_throughput": 60903.693658121796, "max_output_tokens_per_s": 15611.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2878.0705835438857, "median_ttft_ms": 2968.2554804967367, "std_ttft_ms": 1150.4043733280816, "p99_ttft_ms": 4746.974478342308, "mean_tpot_ms": 5.063209691136254, "median_tpot_ms": 5.068087996067076, "std_tpot_ms": 0.6329536898734069, "p99_tpot_ms": 6.0650228592349436, "mean_itl_ms": 5.416066304516089, "median_itl_ms": 4.354908000095747, "std_itl_ms": 4.325565254334482, "p99_itl_ms": 24.000420820084393}
@@ -0,0 +1 @@
{"date": "20260513-213728", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.155288025998743, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 96.98779146353013, "request_goodput": null, "output_throughput": 12414.437307331857, "total_token_throughput": 62072.18653665928, "max_output_tokens_per_s": 16656.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2841.2560345983366, "median_ttft_ms": 2833.5979160110583, "std_ttft_ms": 1092.568714262548, "p99_ttft_ms": 4631.834091359051, "mean_tpot_ms": 5.719756706802707, "median_tpot_ms": 5.850762444847139, "std_tpot_ms": 0.9376681327161561, "p99_tpot_ms": 7.074551629807752, "mean_itl_ms": 6.094447534263853, "median_itl_ms": 4.757997499837074, "std_itl_ms": 4.7151632867285205, "p99_itl_ms": 25.039508403278894}
@@ -0,0 +1 @@
{"date": "20260513-213837", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.30040568500408, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 94.33240203001841, "request_goodput": null, "output_throughput": 12074.547459842357, "total_token_throughput": 60372.73729921179, "max_output_tokens_per_s": 17206.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2842.396057820093, "median_ttft_ms": 2752.043180495093, "std_ttft_ms": 1144.022963796894, "p99_ttft_ms": 4737.292136487376, "mean_tpot_ms": 7.07648069343569, "median_tpot_ms": 7.293891228366951, "std_tpot_ms": 1.2871172035492182, "p99_tpot_ms": 9.034836988755883, "mean_itl_ms": 7.422107599210637, "median_itl_ms": 5.333951994543895, "std_itl_ms": 6.194047978630135, "p99_itl_ms": 33.10964161675656}
@@ -0,0 +1 @@
{"date": "20260513-213906", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.298872211002163, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 94.35970147795587, "request_goodput": null, "output_throughput": 12078.04178917835, "total_token_throughput": 60390.20894589175, "max_output_tokens_per_s": 15941.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2860.836808909982, "median_ttft_ms": 3021.1880290007684, "std_ttft_ms": 1141.4287169330617, "p99_ttft_ms": 4757.071653173334, "mean_tpot_ms": 5.922177347638755, "median_tpot_ms": 6.192431082663963, "std_tpot_ms": 0.753618832430732, "p99_tpot_ms": 6.750312735570418, "mean_itl_ms": 6.192198435033301, "median_itl_ms": 4.609468989656307, "std_itl_ms": 4.8812906238327765, "p99_itl_ms": 25.83109380502717}
@@ -0,0 +1 @@
{"date": "20260513-213851", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.153561166007421, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 97.02029022144336, "request_goodput": null, "output_throughput": 12418.59714834475, "total_token_throughput": 62092.98574172374, "max_output_tokens_per_s": 17057.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2782.472873407969, "median_ttft_ms": 2758.188852996682, "std_ttft_ms": 1082.3592332392213, "p99_ttft_ms": 4603.358503317868, "mean_tpot_ms": 6.301567228615933, "median_tpot_ms": 6.694210511799553, "std_tpot_ms": 0.983334299303326, "p99_tpot_ms": 7.3450646902181065, "mean_itl_ms": 6.468946915936141, "median_itl_ms": 4.969607005477883, "std_itl_ms": 4.854803793002526, "p99_itl_ms": 26.027333125239238}
@@ -0,0 +1 @@
{"date": "20260513-214004", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 6.265662325997255, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 79.80002336311972, "request_goodput": null, "output_throughput": 10214.402990479324, "total_token_throughput": 51072.014952396625, "max_output_tokens_per_s": 18117.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 3051.5959583601216, "median_ttft_ms": 2792.0081980046234, "std_ttft_ms": 1261.3209748958272, "p99_ttft_ms": 5187.654237216775, "mean_tpot_ms": 12.286887885320208, "median_tpot_ms": 13.33611246855429, "std_tpot_ms": 2.8784181595981138, "p99_tpot_ms": 15.39559150202769, "mean_itl_ms": 12.688114986551021, "median_itl_ms": 8.384647997445427, "std_itl_ms": 13.37050487530859, "p99_itl_ms": 76.31652099371422}
@@ -0,0 +1 @@
{"date": "20260513-214036", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 7.165766001009615, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 69.77621093537702, "request_goodput": null, "output_throughput": 8931.354999728259, "total_token_throughput": 44656.7749986413, "max_output_tokens_per_s": 17090.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2683.4072147000697, "median_ttft_ms": 2452.707313001156, "std_ttft_ms": 988.2504514458722, "p99_ttft_ms": 4345.243805501523, "mean_tpot_ms": 23.354770493476632, "median_tpot_ms": 24.9312699645919, "std_tpot_ms": 3.4908225178951673, "p99_tpot_ms": 26.9516463364705, "mean_itl_ms": 23.69773722538854, "median_itl_ms": 17.386089006322436, "std_itl_ms": 23.294918898868556, "p99_itl_ms": 119.62348720699082}
@@ -0,0 +1 @@
{"date": "20260513-214020", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 6.385320481000235, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 78.30460530333114, "request_goodput": null, "output_throughput": 10022.989478826386, "total_token_throughput": 50114.94739413193, "max_output_tokens_per_s": 17570.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2801.6103183477535, "median_ttft_ms": 2993.8538520000293, "std_ttft_ms": 1094.639611932442, "p99_ttft_ms": 4419.115242403059, "mean_tpot_ms": 15.615138726504647, "median_tpot_ms": 16.600237102394438, "std_tpot_ms": 2.6588451240438236, "p99_tpot_ms": 19.02464119452336, "mean_itl_ms": 15.890408391937969, "median_itl_ms": 10.570138001639862, "std_itl_ms": 19.207584510871666, "p99_itl_ms": 104.85940178536113}
@@ -0,0 +1 @@
{"date": "20260513-214133", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.31720296300773, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 94.0344018233921, "request_goodput": null, "output_throughput": 12036.403433394189, "total_token_throughput": 60182.017166970945, "max_output_tokens_per_s": 16370.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2832.849044246337, "median_ttft_ms": 2771.687240994652, "std_ttft_ms": 1147.4102520194278, "p99_ttft_ms": 4828.614962843858, "mean_tpot_ms": 6.607700836657745, "median_tpot_ms": 6.107175980298597, "std_tpot_ms": 1.4059170328262351, "p99_tpot_ms": 8.500484497108548, "mean_itl_ms": 7.1678081203301875, "median_itl_ms": 4.992718997527845, "std_itl_ms": 6.17914400360204, "p99_itl_ms": 34.532651886111}
@@ -0,0 +1 @@
{"date": "20260513-214202", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.393551942004706, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 92.7032881812124, "request_goodput": null, "output_throughput": 11866.020887195187, "total_token_throughput": 59330.10443597593, "max_output_tokens_per_s": 16287.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 2888.5075508874143, "median_ttft_ms": 2866.0057524975855, "std_ttft_ms": 1183.2092864535348, "p99_ttft_ms": 4947.480752829142, "mean_tpot_ms": 5.349056228111437, "median_tpot_ms": 5.580502330711915, "std_tpot_ms": 0.7582432392240728, "p99_tpot_ms": 6.084736708478771, "mean_itl_ms": 5.708275722806403, "median_itl_ms": 4.48349499492906, "std_itl_ms": 4.454556475994415, "p99_itl_ms": 23.000464970828034}
@@ -0,0 +1 @@
{"date": "20260513-214147", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "Qwen/Qwen3-0.6B", "tokenizer_id": "Qwen/Qwen3-0.6B", "num_prompts": 500, "request_rate": "inf", "burstiness": 1.0, "max_concurrency": null, "duration": 5.463887565012556, "completed": 500, "failed": 0, "total_input_tokens": 256000, "total_output_tokens": 64000, "request_throughput": 91.50993574642692, "request_goodput": null, "output_throughput": 11713.271775542646, "total_token_throughput": 58566.35887771323, "max_output_tokens_per_s": 16159.0, "max_concurrent_requests": 500, "rtfx": 0.0, "mean_ttft_ms": 3093.1805829557998, "median_ttft_ms": 3275.2107769993017, "std_ttft_ms": 1146.292393137571, "p99_ttft_ms": 4946.563812132808, "mean_tpot_ms": 5.669072853620613, "median_tpot_ms": 5.740103846419085, "std_tpot_ms": 0.7605854546509636, "p99_tpot_ms": 6.8972503650827175, "mean_itl_ms": 5.8603916566469785, "median_itl_ms": 4.5860050013288856, "std_itl_ms": 4.612409462428751, "p99_itl_ms": 23.951338976039494}