Commit f3a01d9

cxs: Set benchmark name and correct eval func
Signed-off-by: Sohei Koyama <skoyama@ddn.com>
1 parent 7c1097e commit f3a01d9

2 files changed

Lines changed: 37 additions & 18 deletions

real-multi-round-qa/README.md

Lines changed: 30 additions & 7 deletions
@@ -1,13 +1,36 @@
-# Real Multi-Round QA Benchmark
+# CxS: Real Multi-Round QA Benchmark

 ## Overview

-This benchmark is designed to identify **the maximum number of user sessions ($C\times S$) that can be kept active while maintaining a steady-state TTFT ≤ 2 s (95-th percentile)**. By sweeping the concurrency (C) and sequential (S) independently, it isolates whether compute capacity or KV-cache pressure is the first limiting factor.
+This benchmark is designed to identify **the maximum harmonic mean of user sessions $(C,S)$ that can be kept active while maintaining a steady-state TTFT ≤ 2 s (95-th percentile)**. By sweeping the concurrency ($C$) and sequential ($S$) independently, it isolates whether compute capacity or KV-cache pressure is the first limiting factor.
+

 We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time.

 This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.

+The benchmark is called CxS (pronounced six for simplicity), referring to the product of Concurrent $\times$ Sequential users.
+
+### Definition
+
+Let us define the set of candidate pairs:
+
+$$
+\mathcal{D} = \{ (C_i, S_i) \mid \mathrm{TTFT}_{95}^{(i)} \leq 2 \}
+$$
+
+### Objective
+
+More precisely, we aim to find the pair that maximizes the harmonic mean among all candidates in $\mathcal{D}$:
+
+
+$$
+\underset{(C_i, S_i) \in \mathcal{D}}{\arg\max} \left( \frac{2 C_i S_i}{C_i + S_i} \right)
+$$
+
+We use the harmonic mean to compare scores.
+As a business metric, we report the product, CxS.
+
 ## Two simple knobs

 | Option | What it means |
@@ -88,8 +111,8 @@ $ python plot.py ./bench_dir_vllm vllm.png
 13 3 2 0.393902
 14 3 1 0.364927
 15 1 1 0.379049
-Max (C x S) where TTFT_95 <= 2s: 12
-=> C=4.0, S=3.0
+Max harmonic mean (C,S) where TTFT_95 <= 2s: 3.43
+=> C=4.0, S=3.0, CxS=12.0
 $ python plot.py ./bench_dir_lmcache lmcache.png
 num_users_concurrent num_users_sequential ttft_95
 0 1 1 0.524989
@@ -108,11 +131,11 @@ $ python plot.py ./bench_dir_lmcache lmcache.png
 13 4 2 0.586223
 14 1 2 0.477946
 15 2 2 0.457463
-Max (C x S) where TTFT_95 <= 2s: 16
-=> C=4.0, S=4.0
+Max harmonic mean (C,S) where TTFT_95 <= 2s: 4.00
+=> C=4.0, S=4.0, CxS=16.0
 ```

-LMCache allows 1.3x increase in the number of user sessions kept active at least.
+LMCache allows 1.17x increase in the number of user sessions kept active at least.

 Note: LMCache has not yet reached its limit in this case,
 so we can aim to further improve the score by changing C and S.
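
As a quick arithmetic check of the figures in the README hunks above (a restatement of the reported numbers, not additional measurements), write $H(C,S) = \frac{2CS}{C+S}$ for the harmonic mean; the symbol $H$ is introduced here only for illustration:

$$
H(4,3) = \frac{2 \cdot 4 \cdot 3}{4 + 3} = \frac{24}{7} \approx 3.43, \qquad H(4,4) = \frac{2 \cdot 4 \cdot 4}{4 + 4} = 4.00
$$

$$
\frac{H(4,4)}{H(4,3)} = \frac{4}{24/7} = \frac{7}{6} \approx 1.17, \qquad \frac{4 \cdot 4}{4 \cdot 3} = \frac{16}{12} \approx 1.33
$$

The 1.17x in the new README line therefore appears to be the ratio of harmonic means, while the ratio of the CxS products remains about 1.33x, consistent with the 1.3x in the removed line.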

real-multi-round-qa/plot.py

Lines changed: 7 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -72,20 +72,16 @@ def main():
     ax.invert_xaxis()
     plt.savefig(args.output)

-    # Max product under 2s TTFT
+    # Max harmonic mean under 2s TTFT
     summary_under_2s = summary_df[summary_df["ttft_95"] <= 2].copy()
-    summary_under_2s["product"] = (
-        summary_under_2s["num_users_concurrent"] * summary_under_2s["num_users_sequential"]
-    )
     if not summary_under_2s.empty:
-        summary_under_2s["product"] = (
-            summary_under_2s["num_users_concurrent"] * summary_under_2s["num_users_sequential"]
+        summary_under_2s["harmonic_mean"] = 2 * summary_under_2s["num_users_concurrent"] * summary_under_2s["num_users_sequential"] / (
+            summary_under_2s["num_users_concurrent"] + summary_under_2s["num_users_sequential"]
         )
-        max_product = summary_under_2s["product"].max()
-        candidates = summary_under_2s[summary_under_2s["product"] == max_product]
-        best_row = candidates.sort_values("num_users_concurrent", ascending=False).iloc[0]
-        print(f"Max (C x S) where TTFT_95 <= 2s: {max_product}")
-        print(f" => C={best_row['num_users_concurrent']}, S={best_row['num_users_sequential']}")
+        best_row = summary_under_2s.sort_values("harmonic_mean", ascending=False).iloc[0]
+        product = best_row["num_users_concurrent"] * best_row["num_users_sequential"]
+        print(f"Max harmonic mean (C,S) where TTFT_95 <= 2s: {best_row['harmonic_mean']:.2f}")
+        print(f" => C={best_row['num_users_concurrent']}, S={best_row['num_users_sequential']}, CxS={product}")
     else:
         print("No data points with TTFT_95 <= 2s.")

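
The selection logic in the plot.py hunk above can be tried in isolation. The following is a minimal sketch, not the committed code: the function name `best_pair`, the `ttft_limit` parameter, and the demo rows are hypothetical, and the secondary sort on `num_users_concurrent` is an assumption borrowed from the tie-break in the removed lines (the committed version sorts on `harmonic_mean` only, so ties such as (4,3) versus (3,4) are not explicitly ordered).

```python
import pandas as pd


def best_pair(summary_df: pd.DataFrame, ttft_limit: float = 2.0):
    """Return the row with the highest harmonic mean of (C, S) whose
    ttft_95 is within ttft_limit, or None if no row qualifies."""
    ok = summary_df[summary_df["ttft_95"] <= ttft_limit].copy()
    if ok.empty:
        return None
    c = ok["num_users_concurrent"]
    s = ok["num_users_sequential"]
    ok["harmonic_mean"] = 2 * c * s / (c + s)
    ok["product"] = c * s  # reported as CxS
    # Assumed tie-break: prefer higher concurrency when harmonic means are equal.
    ok = ok.sort_values(
        ["harmonic_mean", "num_users_concurrent"], ascending=[False, False]
    )
    return ok.iloc[0]


# Hypothetical demo rows, not benchmark results.
demo = pd.DataFrame(
    {
        "num_users_concurrent": [4, 3, 4, 1],
        "num_users_sequential": [3, 4, 4, 1],
        "ttft_95": [0.5, 0.6, 2.5, 0.4],
    }
)

best = best_pair(demo)
if best is None:
    print("No data points with TTFT_95 <= 2s.")
else:
    print(f"Max harmonic mean (C,S) where TTFT_95 <= 2s: {best['harmonic_mean']:.2f}")
    print(f" => C={best['num_users_concurrent']}, S={best['num_users_sequential']}, CxS={best['product']}")
```

In the demo, the (4,4) row is excluded because its TTFT exceeds the limit, and the remaining (4,3) and (3,4) rows tie on harmonic mean, so the assumed tie-break decides which one is reported.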
